
Over-the-Air Federated Learning Over MIMO Channels: A Sparse-Coded Multiplexing Approach

Chenxi Zhong, and Xiaojun Yuan,  C. Zhong, and X. Yuan are with the National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu, China (e-mail: [email protected]; [email protected]). The corresponding author is Xiaojun Yuan.
Abstract

The communication bottleneck of over-the-air federated learning (OA-FL) lies in uploading the gradients of local learning models. In this paper, we study how to reduce the communication overhead of gradient uploading by using the multiple-input multiple-output (MIMO) technique. We propose a novel sparse-coded multiplexing (SCoM) approach that employs sparse-coding compression and MIMO multiplexing to balance the communication overhead and the learning performance of the FL model. We derive an upper bound on the learning performance loss of the SCoM-based MIMO OA-FL scheme by quantitatively characterizing the gradient aggregation error. Based on the analytical results, we show that the optimal number of multiplexed data streams to minimize the upper bound on the FL learning performance loss is the minimum of the numbers of transmit and receive antennas. We then formulate an optimization problem for the design of the precoding and post-processing matrices to minimize the gradient aggregation error. To solve this problem, we develop a low-complexity algorithm based on alternating optimization (AO) and the alternating direction method of multipliers (ADMM), which effectively mitigates the impact of the gradient aggregation error. Numerical results demonstrate the superior performance of the proposed SCoM approach.

Index Terms:
Over-the-air federated learning, multiple-input multiple-output access channel, multiplexing, turbo compressed sensing.

I Introduction

Sixth-generation (6G) wireless communications, expected to support connection densities of up to millions of wireless devices per square kilometer, will provide a solid foundation for fulfilling the vision of ubiquitous intelligence [zhou2020service]. To develop a powerful intelligent model, it is necessary to exploit the diversity of data distributed over a large number of edge devices. A straightforward paradigm is to require edge devices to send their local data to a central parameter server (PS), which trains the model centrally. However, sending raw data to the PS incurs a huge communication overhead and may compromise user privacy. To overcome these drawbacks, federated learning (FL) is a promising alternative that allows edge devices to collaboratively train a machine learning (ML) model without sharing their local data [mcmahan2017communication]. Instead of uploading raw data, in each FL training round every edge device sends its local gradient to the PS, and the PS aggregates the local gradients, updates the global model, and sends it back to the devices.

Gradient uploading is a critical bottleneck in deploying FL over a wireless network, since it is difficult to support the communication demands of massive numbers of edge devices with limited communication resources (e.g., time, bandwidth, and space). For example, the dimension of recent ML models is extremely large: ResNet-152 has 60 million parameters [he2016deep], while GPT-3 has 175 billion parameters [brown2020language]. Yet the available channel resources are typically scarce due to bandwidth and latency limitations; e.g., one LTE frame of 5 MHz bandwidth and 10 ms duration can carry only 50,000 complex symbols. Fortunately, in FL the PS does not need the local gradient of each individual device, but only the aggregated gradient, usually the mean of all the local gradients. Based on this property, over-the-air FL (OA-FL) is proposed in [nazer2007computation, zhu2020broadband, amiri2020federated], where edge devices share the same wireless resources to upload their local gradients. Thanks to the analog superposition of electromagnetic waves, the local gradients are aggregated over the air during uploading. Compared with traditional orthogonal multiple access (OMA) approaches [yang2021energy, vu2020cell, vu2022joint], OA-FL does not require more communication resources as the number of devices increases [amiri2020federated], which greatly alleviates the communication bottleneck of gradient uploading. Pioneering studies demonstrate that OA-FL also exhibits strong noise tolerance and significant latency improvement [zhu2020broadband, liu2020reconfigurable].

Due to the appealing features of OA-FL, much research effort has been devoted to the design of efficient OA-FL systems. For example, ref. [lin2017deep] pointed out that the local gradients can be sparsified, compressed, and quantized to reduce the communication overhead without causing substantial losses in accuracy. Ref. [amiri2020federated] proposed an efficient scheme where the local gradients are sparsified and compressed by linear coding before being uploaded, and the aggregated gradient is recovered at the PS via compressed sensing methods. The authors in [ma2022over] used partial-orthogonal compressing matrices and turbo compressed sensing (Turbo-CS) [ma2014turbo], achieving a lower-complexity sparse-coding scheme. It has been shown in [ma2022over] that sparse coding enables the OA-FL system to achieve a lower communication overhead and a faster convergence rate.

The above schemes are all based on single-input single-output (SISO) systems. Multiple-input multiple-output (MIMO) with array signal processing has been widely recognized as a powerful technique to enhance the system capacity. MIMO multiplexing significantly reduces the number of channel uses by transmitting multiple spatial data streams in parallel through antenna arrays [spencer2004zero]. However, MIMO multiplexing causes inter-stream interference, which corrupts the aggregated gradient and degrades the test accuracy of OA-FL. There have been some preliminary attempts to alleviate the impact of inter-stream interference by designing the precoding matrices at the devices and the post-processing matrix at the PS. For instance, ref. [zhu2018mimo] set the precoding matrix to the pseudo-inverse of the channel matrix, and derived a closed-form equalizer as the post-processing matrix using differential geometry. The scheme proposed in [chen2018over] also uses the pseudo-inverse of the channel matrix as the precoding matrix, and computes the post-processing matrix based on receive antenna selection. However, both methods rely on channel inversion, which may significantly amplify noise and hence exacerbate the gradient aggregation error, especially when some devices suffer from deep channel fading [zhu2020broadband, liu2020reconfigurable, zhong2022over].

In this paper, we consider an over-the-air federated learning (OA-FL) network where the local gradients are uploaded over a MIMO multiple access (MAC) channel. The channel connects a central PS with multiple antennas and several edge devices, each equipped with multiple antennas. We propose a novel sparse-coded multiplexing (SCoM) approach that integrates sparse coding with MIMO multiplexing in gradient uploading. By combining these two techniques, SCoM achieves a strikingly better balance between the communication overhead and the learning performance. On one hand, sparse coding exploits the sparsity of the gradients to compress them, reducing the communication overhead. On the other hand, MIMO multiplexing reduces the number of channel uses by transmitting multiple streams in parallel, and suppresses the gradient aggregation error through the precoding and post-processing matrices. The main contributions are summarized as follows.

  • We propose a novel SCoM approach for gradients uploading in MIMO OA-FL. We derive an upper bound on the learning performance loss of the SCoM-based MIMO OA-FL scheme by quantitatively characterizing the gradient aggregation error.

  • Based on the analytical result, we formulate a joint precoding and post-processing matrices optimization problem for suppressing the gradient aggregation error. We design a low-complexity algorithm that employs alternating optimization (AO) and alternating direction method of multipliers (ADMM) to jointly optimize the precoding and post-processing matrices.

  • We derive the optimal number of multiplexed data streams for SCoM to balance the communication overhead and the gradient aggregation performance. More specifically, the optimal number of multiplexed data streams to minimize the upper bound of the learning performance loss is given by min{NT,NR}\min\{N_{\mathrm{T}},N_{\mathrm{R}}\}, where NTN_{\mathrm{T}} denotes the number of transmit antennas, and NRN_{\mathrm{R}} denotes the number of receive antennas.

Numerical results demonstrate that our proposed SCoM approach achieves the same test accuracy with much lower communication overhead than other existing approaches, which indicates the superior performance of the SCoM approach.

The rest of this paper is structured as follows. Section II introduces the FL model and the MIMO MAC channel. Section III presents the proposed SCoM approach. The analysis of the learning performance of the SCoM approach is presented in Section IV. In Section V, we present the optimization problem to minimize the gradient aggregation error, and propose a low-complexity algorithm to jointly optimize precoding and post-processing matrices. Section VI presents numerical results to evaluate the SCoM approach and Section VII concludes the paper.

Notation: \mathbbR\mathbb{R} and \mathbbC\mathbb{C} denote the sets of real and complex numbers, respectively. \tr()\tr(\cdot), \rank()\rank(\cdot), ()(\cdot)^{\dagger}, ()T(\cdot)^{\mathrm{T}}, and ()H(\cdot)^{\mathrm{H}} are used to denote the trace, the rank, the conjugate, the transpose, and the conjugate transpose of the matrix, respectively. [M][M] denotes the set {m|1mM}\{m|1\leq m\leq M\}. 𝐬(1:N)\mathbf{s}(1:N) denotes a sub-vector of 𝐬\mathbf{s} that contains entries from index 11 to index NN. The expectation operator is denoted by \mathbbE[]\mathbb{E}[\cdot]. We use 𝐈N\mathbf{I}_{N}, and 𝟎N×M\mathbf{0}_{N\times M} to denote the identity matrix of size N×NN\times N and the zero matrix of size N×MN\times M, respectively. We use 2\|\cdot\|_{2}, and F\|\cdot\|_{\mathrm{F}} to denote the l2l_{2}-norm and the Frobenius norm, respectively. 𝒞𝒩(μ,σ)\mathcal{CN}(\mu,\sigma) denotes the circularly-symmetric complex Gaussian (CSCG) distribution that has a mean of μ\mu and a covariance of σ\sigma.

II System Model

II-A Federated Learning

We start with the description of the FL task deployed on a wireless communication system, where the system consists of one central PS and MM edge devices. We assume that the training data of the FL task are all distributedly stored on the edge devices. Let 𝒜m\mathcal{A}_{m} denote the local dataset of the mm-th device, and QmQ_{m} denote the cardinality of 𝒜m\mathcal{A}_{m}. Q=m=1MQmQ=\sum_{m=1}^{M}Q_{m} is the total number of training data samples for the FL task. \boldsymbolθ\mathbbRD\boldsymbol{\theta}\in\mathbb{R}^{D} is the model parameter vector with DD being the total length of the model parameter. The target of the FL task is to minimize an empirical loss function F(\boldsymbolθ)F(\boldsymbol{\theta}) based on the local datasets {𝒜m}\{\mathcal{A}_{m}\}, given by

min\boldsymbolθF(\boldsymbolθ)=1Qm=1MQmFm(\boldsymbolθ)=1Qm=1Mn=1Qmf(\boldsymbolθ;\boldsymbolζm,n),\min_{\boldsymbol{\theta}}\quad F(\boldsymbol{\theta})=\frac{1}{Q}\sum\nolimits_{m=1}^{M}Q_{m}F_{m}(\boldsymbol{\theta})=\frac{1}{Q}\sum\nolimits_{m=1}^{M}\sum\nolimits_{n=1}^{Q_{m}}f\left(\boldsymbol{\theta};\boldsymbol{\zeta}_{m,n}\right), (1)

where Fm(\boldsymbolθ)=1Qmn=1Qmf(\boldsymbolθ;\boldsymbolζm,n)F_{m}(\boldsymbol{\theta})=\frac{1}{Q_{m}}\sum_{n=1}^{Q_{m}}f\!(\boldsymbol{\theta};\boldsymbol{\zeta}_{m,n}) is the local loss function of device mm, and f(\boldsymbolθ;\boldsymbolζm,n)f\!(\boldsymbol{\theta};\boldsymbol{\zeta}_{m,n}) is the sample-wise loss function for the nn-th training sample \boldsymbolζm,n\boldsymbol{\zeta}_{m,n} in 𝒜m\mathcal{A}_{m}.

To minimize the empirical loss function $F(\boldsymbol{\theta})$ in \eqrefeq:FLTarget, the FL training involves $T$ communication rounds between the edge devices and the PS until $F(\boldsymbol{\theta})$ converges. Specifically, each communication round $t\in[T]$ consists of four steps (a minimal code sketch of one round is given after the list):

  • Global model download: The PS sends the global model \boldsymbolθ(t)\boldsymbol{\theta}^{(t)} to each edge device.

  • Local gradients computation: The local gradient $\mathbf{g}_{m}^{(t)}\in\mathbb{R}^{D}$ is computed by device $m$ based on its local dataset and the global model, given by

    𝐠m(t)=Fm(\boldsymbolθ(t))=1Qmn=1Qmf(\boldsymbolθ(t);\boldsymbolζm,n).\mathbf{g}_{m}^{(t)}=\nabla F_{m}(\boldsymbol{\theta}^{(t)})=\frac{1}{Q_{m}}\sum\nolimits_{n=1}^{Q_{m}}\nabla f\left(\boldsymbol{\theta}^{(t)};\boldsymbol{\zeta}_{m,n}\right). (2)
  • Local gradients upload: The edge devices send the local gradients to the PS through wireless channels.

  • Global model aggregation: The local gradients are aggregated as

    𝐠(t)=1Qm=1MQm𝐠m(t)=m=1Mqm𝐠m(t),\mathbf{g}^{(t)}=\frac{1}{Q}\sum\nolimits_{m=1}^{M}Q_{m}\mathbf{g}_{m}^{(t)}=\sum\nolimits_{m=1}^{M}q_{m}\mathbf{g}_{m}^{(t)}, (3)

    where 𝐠(t)\mathbf{g}^{(t)} denotes the aggregated gradient, and qm=Qm/Q,m[M]q_{m}=Q_{m}/Q,\forall m\in[M]. The global model \boldsymbolθ(t+1)\boldsymbol{\theta}^{(t+1)} is updated by

    \boldsymbolθ(t+1)=\boldsymbolθ(t)η𝐠(t),\boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\eta\mathbf{g}^{(t)}, (4)

    where η\mathbbR\eta\in\mathbb{R} denotes the learning rate.
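As a concrete illustration of one communication round, the following minimal NumPy sketch performs the weighted aggregation in (3) and the model update in (4). The toy dimensions, learning rate, and random gradients are placeholders for illustration, not part of the proposed scheme.

```python
import numpy as np

def fl_round(theta, local_grads, Q_list, eta):
    """One FL round: weighted aggregation (3) followed by the model update (4)."""
    Q = sum(Q_list)
    g = sum((Qm / Q) * gm for gm, Qm in zip(local_grads, Q_list))  # aggregated gradient, eq. (3)
    return theta - eta * g                                          # global model update, eq. (4)

# Toy usage: M = 3 devices, model dimension D = 5 (placeholder values).
theta = np.zeros(5)
local_grads = [np.random.randn(5) for _ in range(3)]
theta_next = fl_round(theta, local_grads, Q_list=[100, 200, 300], eta=0.01)
```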

II-B MIMO Channel Model


Figure 1: An illustration of the considered MIMO OA-FL system.

We now introduce the wireless multi-user multiple-input multiple-output (MIMO) channel for the above FL system. As depicted in Fig. 1, the considered MIMO OA-FL system consists of a PS with $N_{\mathrm{R}}$ antennas and $M$ edge devices, each equipped with $N_{\mathrm{T}}$ antennas. As in previous studies on OA-FL [amiri2020federated, sery2021over, cao2022transmission], we make two assumptions: that the global model is downloaded through error-free links (in practice, channel noise causes communication errors in the model download; this issue can be addressed by the schemes proposed in [vu2020cell, vu2022joint] and is beyond the scope of this paper), and that the devices upload the local gradients to the PS synchronously (existing techniques in 4G Long Term Evolution, e.g., the timing advance (TA) mechanism, can synchronize the gradient symbols among the edge devices [3gpp.38.213]). We now focus on the process of local gradients uploading. We consider a block-fading channel, where the channel state information (CSI) remains constant during the gradients uploading. Let $\mathbf{H}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{R}}\times N_{\mathrm{T}}}$ denote the CSI matrix between the $m$-th device and the PS at the $t$-th round. We assume that the PS has perfect knowledge of the CSI of the wireless channels between the devices and the PS (approaches to CSI estimation over MIMO MAC channels can be found in [nguyen2013compressive, wen2014channel, vu2020cell, vu2022joint]). Thus, at the PS, the receive signal matrix from the above MIMO multiple access (MAC) channel is given by

𝐘(t)=m=1M𝐇m(t)𝐗m(t)+𝐍(t)\mathbbCNR×K,\mathbf{Y}^{(t)}=\sum\nolimits_{m=1}^{M}\mathbf{H}_{m}^{(t)}\mathbf{X}_{m}^{(t)}+\mathbf{N}^{(t)}\in\mathbb{C}^{N_{\mathrm{R}}\times K}, (5)

where KK denotes the number of channel uses at the tt-th round; 𝐗m(t)\mathbbCNT×K\mathbf{X}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{T}}\times K} denotes the transmit data matrix for the mm-th device; and 𝐍(t)\mathbbCNR×K\mathbf{N}^{(t)}\in\mathbb{C}^{N_{\mathrm{R}}\times K} is an additive white Gaussian noise (AWGN) matrix, with the entries independently drawn from 𝒞𝒩(0,σnoise)\mathcal{CN}(0,\sigma_{\mathrm{noise}}). Let 𝐱m,k\mathbf{x}_{m,k} denote the kk-th column of 𝐗m(t)\mathbf{X}_{m}^{(t)}. Here, we consider the following transmit power constraint:

\mathbbE[𝐱m,k(t)22]P0,k[K],\mathbb{E}[\|\mathbf{x}_{m,k}^{(t)}\|_{2}^{2}]\leq P_{0},\forall k\in[K], (6)

where P0P_{0} is the transmit power budget.
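The received-signal model in \eqrefeq:receive_data and the power constraint in \eqrefeq:trans_power_constr1 can be simulated as in the following sketch; the channel statistics, dimensions, and placeholder transmit matrices below are illustrative assumptions, not the SCoM design (which is specified in Section III).

```python
import numpy as np

M, N_R, N_T, K = 4, 8, 4, 100             # devices, receive/transmit antennas, channel uses
P0, sigma_noise = 1.0, 0.1                # power budget and noise variance (assumed values)

def crandn(*shape):
    """i.i.d. CN(0, 1) entries."""
    return (np.random.randn(*shape) + 1j * np.random.randn(*shape)) / np.sqrt(2)

H = [crandn(N_R, N_T) for _ in range(M)]                      # CSI matrices H_m
X = [np.sqrt(P0 / N_T) * crandn(N_T, K) for _ in range(M)]    # placeholder transmit matrices
N = np.sqrt(sigma_noise) * crandn(N_R, K)                     # AWGN with variance sigma_noise
Y = sum(Hm @ Xm for Hm, Xm in zip(H, X)) + N                  # received signal, eq. (5)

# Each column of X_m satisfies E||x_{m,k}||_2^2 = P0, consistent with eq. (6).
avg_col_power = np.mean([np.mean(np.sum(np.abs(Xm) ** 2, axis=0)) for Xm in X])
```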

What remains is to map the local gradients $\{\mathbf{g}_{m}^{(t)}\}$ to the transmit matrices $\{\mathbf{X}_{m}^{(t)}\}$ at the edge devices, and to recover the aggregated gradient from the receive signal matrix $\mathbf{Y}^{(t)}$. These issues are discussed in detail in the next section.

III Proposed Sparse-Coded Multiplexing Approach

With the development of deep learning, the size of ML models keeps increasing. To upload the high-dimensional FL local gradients over the aforementioned MIMO channel, the key challenge is the heavy communication burden. Although transmitting data streams in parallel with MIMO multiplexing efficiently reduces the communication overhead, MIMO multiplexing also introduces interference between the data streams, resulting in a loss of FL learning performance.

To address these challenges, we propose a novel transmission scheme, i.e., the sparse-coded multiplexing (SCoM) approach, for the above MIMO OA-FL system, as shown in Fig. 2. The SCoM approach employs two techniques: sparse coding and MIMO multiplexing. On one hand, sparse coding exploits the sparsity of the gradients to compress them, reducing the communication overhead; meanwhile, it leverages the compressing matrix to encode the data streams, thereby reducing the correlation between data streams and suppressing inter-stream interference. On the other hand, MIMO multiplexing reduces the number of channel uses by transmitting multi-stream data through antenna arrays, and suppresses inter-stream interference through the precoding and post-processing matrices. The details of the SCoM approach are given as follows.


Figure 2: A flow diagram of local gradients uploading at round tt.

III-A Processing on Devices

To support the local gradients uploading in the MIMO OA-FL system, the pre-processing operations are first conducted on the edge devices, including the gradient sparsification [amiri2020federated] and the gradient compression[ma2014turbo].

To be specific, for the mm-th edge device, the local gradient 𝐠m(t)\mathbf{g}_{m}^{(t)} at the tt-th round is first complexified to fully utilize the spectral efficiency of complex channels. The complexified gradient is denoted by

\mathbf{g}_{m}^{\mathrm{cx}(t)}=\mathbf{g}_{m}^{(t)}(1:D/2)+j\,\mathbf{g}_{m}^{(t)}(D/2+1:D)\in\mathbb{C}^{D/2}, (7)

where j=1j=\sqrt{-1}. Then, the accumulated gradient is obtained by

𝐠mac(t)=𝐠mcx(t)+\boldsymbolΔm(t)\mathbbCD/2,\mathbf{g}_{m}^{\mathrm{ac}(t)}=\mathbf{g}_{m}^{\mathrm{cx}(t)}+\boldsymbol{\Delta}_{m}^{(t)}\in\mathbb{C}^{D/2}, (8)

where $\boldsymbol{\Delta}_{m}^{(t)}$ denotes the error-accumulation vector of the $m$-th device at the $t$-th round, with $\boldsymbol{\Delta}_{m}^{(0)}$ initialized as $\mathbf{0}$. Having computed the accumulated gradient, the $m$-th device obtains the sparsified gradient via

𝐠msp(t)=sp(𝐠mac(t),λ)\mathbbCD/2,\mathbf{g}_{m}^{\mathrm{sp}(t)}=\mathrm{sp}(\mathbf{g}_{m}^{\mathrm{ac}(t)},\lambda)\in\mathbb{C}^{D/2}, (9)

where $\lambda\in[0,1]$ denotes the sparsity ratio. The operator $\mathrm{sp}(\cdot)$ retains the $\lambda D/2$ entries of $\mathbf{g}_{m}^{\mathrm{ac}(t)}$ with the largest magnitudes, and sets the remaining $(1-\lambda)D/2$ entries to $0$. The error-accumulation vector $\boldsymbol{\Delta}_{m}^{(t+1)}$ is then updated via

\boldsymbolΔm(t+1)=𝐠mac(t)𝐠msp(t).\boldsymbol{\Delta}_{m}^{(t+1)}=\mathbf{g}_{m}^{\mathrm{ac}(t)}-\mathbf{g}_{m}^{\mathrm{sp}(t)}. (10)
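A minimal sketch of the sparsification operator $\mathrm{sp}(\cdot)$ in (9) together with the error-accumulation updates in (8) and (10) is given below; the function name and the guard on the number of retained entries are illustrative assumptions.

```python
import numpy as np

def sparsify_with_accumulation(g_cx, delta, lam):
    """Return the sparsified gradient (9) and the next error-accumulation vector (10)."""
    g_ac = g_cx + delta                                  # accumulated gradient, eq. (8)
    k = max(1, int(np.ceil(lam * g_ac.size)))            # keep the λ·D/2 largest-magnitude entries
    keep = np.argsort(np.abs(g_ac))[-k:]
    g_sp = np.zeros_like(g_ac)
    g_sp[keep] = g_ac[keep]
    delta_next = g_ac - g_sp                             # dropped entries are accumulated for later rounds
    return g_sp, delta_next
```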

Then 𝐠msp(t)\mathbf{g}_{m}^{\mathrm{sp}(t)} is normalized to 𝐠mno(t)\mathbf{g}_{m}^{\mathrm{no}(t)} by

𝐠mno(t)=(𝐠msp(t)𝐬)/σm(t)\mathbbCD/2,\mathbf{g}_{m}^{\mathrm{no}(t)}=(\mathbf{g}_{m}^{\mathrm{sp}(t)}\odot\mathbf{s})/\sqrt{\sigma_{m}^{(t)}}\in\mathbb{C}^{D/2}, (11)

where 𝐬\mathbbCD/2\mathbf{s}\in\mathbb{C}^{D/2} is a random flipping vector, with each entry of 𝐬\mathbf{s} being independent and identically distributed (i.i.d.) drawn from {1,+1}\{-1,+1\}; and σm(t)=1D/2d=1D/2|gmsp(t)[d]|2\sigma_{m}^{(t)}=\frac{1}{D/2}\sum_{d=1}^{D/2}|g_{m}^{\mathrm{sp}(t)}[d]|^{2} denotes the variance of {gmsp(t)[d]}d=1D/2\{g_{m}^{\mathrm{sp}(t)}[d]\}_{d=1}^{D/2}, with gmsp(t)[d]g_{m}^{\mathrm{sp}(t)}[d] being the dd-th entry of 𝐠msp(t)\mathbf{g}_{m}^{\mathrm{sp}(t)}. From \eqrefeq:g_normalize, the entries of 𝐠mno(t)\mathbf{g}_{m}^{\mathrm{no}(t)} have zero-mean and unit variance.

Each device mm then compresses 𝐠mno(t)\mathbf{g}_{m}^{\mathrm{no}(t)} into a low-dimensional vector via a common compressing matrix. Specifically, the compressed gradient is given by

𝐠mcp(t)=𝐀𝐠mno(t)\mathbbCC,\mathbf{g}_{m}^{\mathrm{cp}(t)}=\mathbf{A}\mathbf{g}_{m}^{\mathrm{no}(t)}\in\mathbb{C}^{C}, (12)

where $\mathbf{A}\in\mathbb{C}^{C\times(D/2)}$ denotes the common compressing matrix (we assume that both the compressing matrix $\mathbf{A}$ and the flipping vector $\mathbf{s}$ remain invariant throughout the FL training process and are shared among the devices prior to training), $C$ denotes the length of the compressed gradient, and $\kappa=\frac{C}{D/2}$ denotes the compression ratio. Inspired by [ma2014turbo, ma2022over], we employ a partial DFT matrix as the compressing matrix, given by $\mathbf{A}=\mathbf{S}\boldsymbol{\Xi}$, where $\mathbf{S}\in\mathbb{R}^{C\times D/2}$ is a selection matrix consisting of $C$ randomly selected and reordered rows of the $D/2\times D/2$ identity matrix $\mathbf{I}_{D/2}$, and $\boldsymbol{\Xi}\in\mathbb{C}^{D/2\times D/2}$ is a unitary DFT matrix whose $(d,d^{\prime})$-th entry is $\frac{1}{\sqrt{D/2}}\exp\left(-2\pi j\frac{(d-1)(d^{\prime}-1)}{D/2}\right)$. We note that the partial DFT matrix has lower computational complexity and better performance than other types of compressing matrices, such as i.i.d. Gaussian matrices [ma2015performance].
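Since $\mathbf{A}=\mathbf{S}\boldsymbol{\Xi}$ is a row-selected unitary DFT, the compression in (12) can be computed with an FFT instead of an explicit $C\times(D/2)$ matrix, as sketched below. The helper names and the random seed are assumptions for illustration only.

```python
import numpy as np

def make_selection(D_half, C, seed=0):
    """Randomly select and reorder C of the D/2 DFT rows (the selection matrix S)."""
    return np.random.default_rng(seed).permutation(D_half)[:C]

def compress(g_no, rows):
    """Compute g^cp = A g^no = S Xi g^no via a unitary FFT, eq. (12)."""
    D_half = g_no.size
    dft = np.fft.fft(g_no) / np.sqrt(D_half)   # Xi g^no (np.fft uses exp(-2*pi*j...), matching Xi)
    return dft[rows]                            # keep the C selected rows
```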

To transmit the data with multiple streams, device mm then reshapes the compressed gradient 𝐠mcp(t)\mathbf{g}_{m}^{\mathrm{cp}(t)} into 𝐆m(t)\mathbbCNS×K\mathbf{G}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{S}}\times K} as

𝐆m(t)=[𝐠m,1cp(t),,𝐠m,NScp(t)]T,\mathbf{G}_{m}^{(t)}=[\mathbf{g}_{m,1}^{\mathrm{cp}(t)},\cdots,\mathbf{g}_{m,N_{\mathrm{S}}}^{\mathrm{cp}(t)}]^{T}, (13)

where NSN_{\mathrm{S}} denotes the number of data streams, 𝐠m,ncp(t)=𝐠mcp(t)((n1)K+1:nK),n[NS]\mathbf{g}_{m,n}^{\mathrm{cp}(t)}=\mathbf{g}_{m}^{\mathrm{cp}(t)}((n-1)K+1:nK),n\in[N_{\mathrm{S}}]. Naturally, the number of channel uses satisfies the following equation:

K=CNS=κD/2NS.K=\frac{C}{N_{\mathrm{S}}}=\frac{\kappa D/2}{N_{\mathrm{S}}}. (14)

We are now ready to describe the design of the transmit matrix 𝐗m(t)\mathbf{X}_{m}^{(t)}. The transmit matrix 𝐗m(t)\mathbf{X}_{m}^{(t)} is given by 𝐗m(t)=𝐏m(t)𝐆m(t)\mathbf{X}_{m}^{(t)}=\mathbf{P}_{m}^{(t)}\mathbf{G}_{m}^{(t)}, where 𝐏m(t)\mathbbCNT×NS\mathbf{P}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{T}}\times N_{\mathrm{S}}} denotes the precoding matrix for the mm-th device. Let 𝐠m,k(t)\mathbbCNS\mathbf{g}_{m,k}^{(t)}\in\mathbb{C}^{N_{\mathrm{S}}} denote the kk-th column of 𝐆m(t)\mathbf{G}_{m}^{(t)}. We have 𝐱m,k(t)=𝐏m(t)𝐠m,k(t),k[K]\mathbf{x}_{m,k}^{(t)}=\mathbf{P}_{m}^{(t)}\mathbf{g}_{m,k}^{(t)},\forall k\in[K]. From the transmit power constraint in \eqrefeq:trans_power_constr1, we have

\mathbbE[𝐏m(t)𝐠m,k(t)22]\overset(a)=𝐏m(t)F2P0,k[K],\mathbb{E}[\|\mathbf{P}_{m}^{(t)}\mathbf{g}_{m,k}^{(t)}\|_{2}^{2}]\overset{(a)}{=}\|\mathbf{P}_{m}^{(t)}\|_{\mathrm{F}}^{2}\leq P_{0},\forall k\in[K], (15)

where the step (a) is due to the normalization in \eqrefeq:g_normalize.
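The remaining device-side steps, reshaping into $N_{\mathrm{S}}$ streams as in (13)-(14) and precoding under the power constraint (15), may be sketched as follows. The precoding matrix here is an arbitrary placeholder; its actual design is the subject of Section V.

```python
import numpy as np

def reshape_streams(g_cp, N_S):
    """Split the length-C compressed gradient into N_S rows of K = C / N_S symbols, eq. (13)."""
    K = g_cp.size // N_S                        # number of channel uses, eq. (14)
    return g_cp.reshape(N_S, K)                 # row n holds g^cp((n-1)K+1 : nK)

def precode(P_m, G_m, P0):
    """Form X_m = P_m G_m; with unit-variance symbols, (15) reduces to ||P_m||_F^2 <= P0."""
    if np.linalg.norm(P_m, 'fro') ** 2 > P0:
        raise ValueError("precoder violates the transmit power budget")
    return P_m @ G_m
```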

III-B Processing on the PS

We now describe the processing operations at the PS. In the following, we first cast the recovery of the aggregated gradient as a compressed sensing problem, and then adopt the turbo compressed sensing (Turbo-CS) algorithm [ma2014turbo] to solve it.

At the PS, the receive signal matrix $\mathbf{Y}^{(t)}$ in \eqrefeq:receive_data is first processed by the post-processing matrix $\mathbf{F}^{(t)}\in\mathbb{C}^{N_{\mathrm{S}}\times N_{\mathrm{R}}}$, and the post-processed matrix $\hat{\mathbf{R}}^{(t)}$ is given by

𝐑^(t)=𝐅(t)(m=1M𝐇m(t)𝐏m(t)𝐆m(t)+𝐍(t))\mathbbCNS×K.\hat{\mathbf{R}}^{(t)}=\mathbf{F}^{(t)}\left(\sum\nolimits_{m=1}^{M}\mathbf{H}_{m}^{(t)}\mathbf{P}_{m}^{(t)}\mathbf{G}_{m}^{(t)}+\mathbf{N}^{(t)}\right)\in\mathbb{C}^{N_{\mathrm{S}}\times K}. (16)

$\hat{\mathbf{R}}^{(t)}$ is an approximation of the compressed gradient aggregation matrix $\mathbf{R}^{(t)}=\sum\nolimits_{m=1}^{M}q_{m}\mathbf{G}_{m}^{(t)}$, and the residual error is given by

𝐖(t)=𝐑^(t)𝐑(t).\mathbf{W}^{(t)}=\hat{\mathbf{R}}^{(t)}-\mathbf{R}^{(t)}. (17)

Based on \eqrefeq:error_receive_data, the PS converts the post-processed matrix 𝐑^\hat{\mathbf{R}} into an equivalent vector form:

𝐫^(t)=\vectorize(𝐑^(t))T=m=1Mqm\vectorize(𝐆m(t)T)+\vectorize(𝐖(t))T\overset(a)=𝐀𝐠no(t)+𝐰(t)\mathbbCC,\hat{\mathbf{r}}^{(t)}=\vectorize(\hat{\mathbf{R}}^{(t)}{}^{\mathrm{T}})=\sum\nolimits_{m=1}^{M}q_{m}\vectorize\left(\mathbf{G}_{m}^{(t)\mathrm{T}}\right)+\vectorize\left(\mathbf{W}^{(t)}{}^{\mathrm{T}}\right)\overset{(a)}{=}\mathbf{A}\mathbf{g}^{\mathrm{no}(t)}+\mathbf{w}^{(t)}\in\mathbb{C}^{C}, (18)

where step (a) is from \eqrefeq:g_compress and \eqrefeq:G_def, together with the definition of 𝐠no(t)=m=1Mqm𝐠mno(t)\mathbf{g}^{\mathrm{no}(t)}=\sum_{m=1}^{M}q_{m}\mathbf{g}_{m}^{\mathrm{no}(t)} and 𝐰(t)=\vectorize(𝐖(t))T\mathbf{w}^{(t)}=\vectorize\left(\mathbf{W}^{(t)}{}^{\mathrm{T}}\right). We assume that the entries of 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)} are independently drawn from a Bernoulli Gaussian distribution:

g_{d}^{\mathrm{no}(t)}\sim\begin{cases}0,&\text{with probability }1-\lambda^{\prime(t)},\\ \mathcal{CN}(0,\sigma_{g}^{(t)}),&\text{with probability }\lambda^{\prime(t)},\end{cases}\quad\forall d\in[D/2], (19)

where gdno(t)g_{d}^{\mathrm{no}(t)} is the dd-th entry of 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)}, λ(t)\lambda^{\prime(t)} is the sparsity ratio of 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)}, which can be estimated by the Expectation-Maximization algorithm [vila2013expectation], and σg(t)\sigma_{g}^{(t)} is the variance of the nonzero entries in 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)}. Moreover, we assume that the entries of 𝐰(t)\mathbf{w}^{(t)} are i.i.d. drawn from 𝒞𝒩(0,σw(t))\mathcal{CN}(0,\sigma_{w}^{(t)}), where σw(t)\sigma_{w}^{(t)} represents the mean square error (MSE) of the compressed gradient aggregation, given by

σw(t)=1NSK\mathbbE𝐑^(t)𝐑(t)F2.\sigma_{w}^{(t)}=\frac{1}{N_{\mathrm{S}}K}\mathbb{E}\|\hat{\mathbf{R}}^{(t)}-\mathbf{R}^{(t)}\|_{\mathrm{F}}^{2}. (20)

The recovery of $\mathbf{g}^{\mathrm{no}(t)}$ from $\hat{\mathbf{r}}^{(t)}$ in \eqrefeq:compressed_problem is a compressed sensing problem, where the local gradients $\{\mathbf{g}_{m}^{(t)}\}$ are compressed by the partial DFT matrix $\mathbf{A}$ at the edge devices. As shown in [ma2014turbo, ma2022over], the Turbo-CS algorithm is a state-of-the-art solver for compressed sensing problems with partial-orthogonal sensing matrices. Thus, we employ the Turbo-CS algorithm to recover $\mathbf{g}^{\mathrm{no}(t)}$:

\hat{\mathbf{g}}^{\mathrm{no}(t)}=\text{Turbo-CS}(\hat{\mathbf{r}}^{(t)})\in\mathbb{C}^{D/2}. (21)

The details of the Turbo-CS algorithm are presented in the next subsection. The performance of the Turbo-CS algorithm in recovering the aggregated gradient depends on the variance of the noise $\mathbf{w}^{(t)}$, i.e., $\sigma_{w}^{(t)}$: a smaller $\sigma_{w}^{(t)}$ leads to a better recovery performance and a smaller loss of learning accuracy, as discussed in Section IV.
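Before turning to Turbo-CS, the following short sketch illustrates how the PS forms the observation vector: the post-processing in (16) followed by the vectorization in (18), where the rows (streams) of $\hat{\mathbf{R}}^{(t)}$ are concatenated so that the result lines up with $\mathbf{A}\mathbf{g}^{\mathrm{no}(t)}$. The function name is an illustrative assumption.

```python
import numpy as np

def ps_observation(F, Y):
    """Apply the post-processing matrix and stack the N_S streams into a length-C vector."""
    R_hat = F @ Y                       # post-processed matrix, eq. (16): N_S x K
    return R_hat.reshape(-1)            # concatenate rows (stream 1, stream 2, ...), eq. (18)
```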

III-C Turbo-CS Algorithm

As shown in Fig. 3, Turbo-CS conducts the iteration between modules A and B until convergence. Module A is a linear minimum mean-squared error (LMMSE) estimator handling the linear constraint in \eqrefeq:compressed_problem, and module B is a minimum mean-squared error (MMSE) denoiser exploiting the sparsity in \eqrefeq:g_agg_sparsity. We next give the operations of Turbo-CS in detail.


Figure 3: An illustration of the Turbo-CS algorithm.

The iterative process begins with module A. The inputs of the LMMSE estimator in module A are the a priori mean $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}\in\mathbb{C}^{D/2}$, the a priori variance $v_{\mathrm{A}}^{\mathrm{pri}(t)}$, and the observed vector $\hat{\mathbf{r}}^{(t)}$. With given $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}$, $v_{\mathrm{A}}^{\mathrm{pri}(t)}$, and $\hat{\mathbf{r}}^{(t)}$, the a posteriori mean $\mathbf{g}_{\mathrm{A}}^{\mathrm{post}(t)}$ and the a posteriori variance $v_{\mathrm{A}}^{\mathrm{post}(t)}$ of the LMMSE estimator are given by [kay1993fundamentals]

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{A}}^{\mathrm{post}(t)} &= \mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)} + \frac{v_{\mathrm{A}}^{\mathrm{pri}(t)}}{v_{\mathrm{A}}^{\mathrm{pri}(t)}+\sigma_{w}^{(t)}}\,\mathbf{A}^{\mathrm{H}}\!\left(\hat{\mathbf{r}}^{(t)}-\mathbf{A}\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}\right),\\
v_{\mathrm{A}}^{\mathrm{post}(t)} &= v_{\mathrm{A}}^{\mathrm{pri}(t)} - \kappa\cdot\frac{\big(v_{\mathrm{A}}^{\mathrm{pri}(t)}\big)^{2}}{v_{\mathrm{A}}^{\mathrm{pri}(t)}+\sigma_{w}^{(t)}},
\end{align}
\end{subequations}

where $\kappa$ is the compression ratio defined below \eqrefeq:g_compress. Then the extrinsic mean and variance of the LMMSE estimator are given by

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{A}}^{\mathrm{ext}(t)} &= v_{\mathrm{A}}^{\mathrm{ext}(t)}\left(\frac{\mathbf{g}_{\mathrm{A}}^{\mathrm{post}(t)}}{v_{\mathrm{A}}^{\mathrm{post}(t)}}-\frac{\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}}{v_{\mathrm{A}}^{\mathrm{pri}(t)}}\right),\\
v_{\mathrm{A}}^{\mathrm{ext}(t)} &= \left(1/v_{\mathrm{A}}^{\mathrm{post}(t)}-1/v_{\mathrm{A}}^{\mathrm{pri}(t)}\right)^{-1}.
\end{align}
\end{subequations}

The extrinsic messages $\{\mathbf{g}_{\mathrm{A}}^{\mathrm{ext}(t)},v_{\mathrm{A}}^{\mathrm{ext}(t)}\}$ are used to update the a priori mean $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}$ and the a priori variance $v_{\mathrm{B}}^{\mathrm{pri}(t)}$ as $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}=\mathbf{g}_{\mathrm{A}}^{\mathrm{ext}(t)}$ and $v_{\mathrm{B}}^{\mathrm{pri}(t)}=v_{\mathrm{A}}^{\mathrm{ext}(t)}$, which serve as the inputs of the MMSE denoiser in module B.

In module B, we model the a priori mean $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}$ as an observation of $\mathbf{g}^{\mathrm{no}(t)}$ corrupted by additive noise: $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}=\mathbf{g}^{\mathrm{no}(t)}+\boldsymbol{\delta}^{(t)}$, where the $d$-th entry of $\boldsymbol{\delta}^{(t)}$ satisfies $\delta_{d}^{(t)}\sim\mathcal{CN}(0,v_{\mathrm{B}}^{\mathrm{pri}(t)})$. The a posteriori mean $\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)}$ and variance $v_{\mathrm{B}}^{\mathrm{post}(t)}$ of the MMSE denoiser are given by

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)} &= \mathbb{E}\big[\mathbf{g}^{\mathrm{no}(t)}\,\big|\,\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}\big],\\
v_{\mathrm{B}}^{\mathrm{post}(t)} &= \frac{1}{D/2}\sum\nolimits_{d=1}^{D/2}\operatorname{var}\big[g^{\mathrm{no}(t)}_{d}\,\big|\,g_{\mathrm{B},d}^{\mathrm{pri}(t)}\big],
\end{align}
\end{subequations}

where $\operatorname{var}[a|b]=\mathbb{E}\big[\big|a-\mathbb{E}[a|b]\big|^{2}\,\big|\,b\big]$, and $g_{\mathrm{B},d}^{\mathrm{pri}(t)}$ is the $d$-th entry of $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}$. Then the extrinsic messages of the MMSE denoiser are given by

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{B}}^{\mathrm{ext}(t)} &= v_{\mathrm{B}}^{\mathrm{ext}(t)}\left(\frac{\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)}}{v_{\mathrm{B}}^{\mathrm{post}(t)}}-\frac{\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}}{v_{\mathrm{B}}^{\mathrm{pri}(t)}}\right),\\
v_{\mathrm{B}}^{\mathrm{ext}(t)} &= \left(1/v_{\mathrm{B}}^{\mathrm{post}(t)}-1/v_{\mathrm{B}}^{\mathrm{pri}(t)}\right)^{-1}.
\end{align}
\end{subequations}

The extrinsic messages $\{\mathbf{g}_{\mathrm{B}}^{\mathrm{ext}(t)},v_{\mathrm{B}}^{\mathrm{ext}(t)}\}$ are used to update the a priori mean $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}$ and the a priori variance $v_{\mathrm{A}}^{\mathrm{pri}(t)}$ as $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}=\mathbf{g}_{\mathrm{B}}^{\mathrm{ext}(t)}$ and $v_{\mathrm{A}}^{\mathrm{pri}(t)}=v_{\mathrm{B}}^{\mathrm{ext}(t)}$, which serve as the inputs of the LMMSE estimator in module A. At the end of the iterative process, the final estimate is taken from the a posteriori output of module B, i.e., $\hat{\mathbf{g}}^{\mathrm{no}(t)}=\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)}$.
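The full iteration can be summarized by the following NumPy sketch, which implements module A for the partial-DFT sensing matrix and module B as the MMSE denoiser under the Bernoulli-Gaussian prior in (19). The iteration count, the small variance clamp, and the variable names are implementation assumptions rather than part of the original algorithm specification.

```python
import numpy as np

def turbo_cs(r_hat, rows, D_half, sigma_w, lam_p, sigma_g, n_iter=30):
    """Turbo-CS sketch: LMMSE module A plus a Bernoulli-Gaussian MMSE denoiser (module B)."""
    kappa = r_hat.size / D_half                          # compression ratio

    def A_mul(x):                                        # A x = selected rows of the unitary DFT
        return (np.fft.fft(x) / np.sqrt(D_half))[rows]

    def AH_mul(y):                                       # A^H y = unitary inverse DFT of zero-padded y
        z = np.zeros(D_half, dtype=complex)
        z[rows] = y
        return np.fft.ifft(z) * np.sqrt(D_half)

    g_pri, v_pri = np.zeros(D_half, dtype=complex), 1.0
    gB_post = g_pri
    for _ in range(n_iter):
        # Module A: LMMSE estimation and its extrinsic message
        g_post = g_pri + v_pri / (v_pri + sigma_w) * AH_mul(r_hat - A_mul(g_pri))
        v_post = v_pri - kappa * v_pri ** 2 / (v_pri + sigma_w)
        v_ext = max(1.0 / (1.0 / v_post - 1.0 / v_pri), 1e-12)   # clamp for numerical robustness
        g_ext = v_ext * (g_post / v_post - g_pri / v_pri)
        gB_pri, vB = g_ext, v_ext

        # Module B: MMSE denoiser for the Bernoulli-Gaussian prior in (19)
        ratio = (1 - lam_p) / lam_p * (sigma_g + vB) / vB * \
            np.exp(-np.abs(gB_pri) ** 2 * (1.0 / vB - 1.0 / (sigma_g + vB)))
        pi = 1.0 / (1.0 + ratio)                         # posterior probability that an entry is nonzero
        m = sigma_g / (sigma_g + vB) * gB_pri            # conditional posterior mean given nonzero
        s = sigma_g * vB / (sigma_g + vB)                # conditional posterior variance given nonzero
        gB_post = pi * m
        vB_post = float(np.mean(pi * (s + np.abs(m) ** 2) - np.abs(gB_post) ** 2))
        vB_ext = max(1.0 / (1.0 / vB_post - 1.0 / vB), 1e-12)
        gB_ext = vB_ext * (gB_post / vB_post - gB_pri / vB)
        g_pri, v_pri = gB_ext, vB_ext
    return gB_post                                       # estimate of g^no, cf. (21)
```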

Then, based on the output of Turbo-CS, the aggregated gradient 𝐠^(t)\hat{\mathbf{g}}^{(t)} is given by

𝐠^(t)=[{σ(t)(𝐠^no(t)𝐬)},{σ(t)(𝐠^no(t)𝐬)}]\mathbbRD\hat{\mathbf{g}}^{(t)}=[\Re\{\sqrt{\sigma^{(t)}}(\hat{\mathbf{g}}^{\mathrm{no}(t)}\odot\mathbf{s})\},\Im\{\sqrt{\sigma^{(t)}}(\hat{\mathbf{g}}^{\mathrm{no}(t)}\odot\mathbf{s})\}]\in\mathbb{R}^{D} (22)

where σ(t)=1Mm=1Mσm(t)\sigma^{(t)}=\frac{1}{M}\sum_{m=1}^{M}\sigma_{m}^{(t)}. The PS then updates the model \boldsymbolθ(t+1)\boldsymbol{\theta}^{(t+1)} by

\boldsymbolθ(t+1)=\boldsymbolθ(t)η𝐠^(t).\boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\eta\hat{\mathbf{g}}^{(t)}. (23)
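The final rescaling in (22) and the update in (23) amount to the following short sketch; the argument names are illustrative.

```python
import numpy as np

def rebuild_and_update(theta, g_no_hat, s, sigma_mean, eta):
    """Undo flipping/normalization (22) and apply the global model update (23)."""
    g = np.sqrt(sigma_mean) * (g_no_hat * s)    # rescale and unflip the complex gradient estimate
    g_hat = np.concatenate([g.real, g.imag])    # back to a length-D real vector
    return theta - eta * g_hat
```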

To sum up, at each round tt, the receive matrix 𝐑^(t)\hat{\mathbf{R}}^{(t)} is first reshaped into a vector form in \eqrefeq:compressed_problem. Given the observed vector 𝐫^(t)\hat{\mathbf{r}}^{(t)} and the initial values {𝐠Apri(t)=𝟎,vApri(t)=1}\{\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}=\mathbf{0},v_{\mathrm{A}}^{\mathrm{pri}(t)}=1\}, Turbo-CS iteratively calculates \eqrefeq:g_v_post_A-\eqrefeq:g_v_ext_B until a certain termination criterion is met. Finally, the output 𝐠^no(t)\hat{\mathbf{g}}^{\mathrm{no}(t)} is scaled as 𝐠^(t)\hat{\mathbf{g}}^{(t)} for the model update in \eqrefeq:SGD_error.

III-D Overall Scheme

The proposed SCoM approach to local gradient uploading is summarized in Algorithm 1 below, where lines 3-5 and 11-17 are executed at the PS, and lines 6-10 are executed at the devices.

Algorithm 1: Proposed SCoM-Based MIMO OA-FL Scheme

Require: $T$, $\{Q_{m}\}$, $N_{\mathrm{S}}$, $N_{\mathrm{R}}$, $N_{\mathrm{T}}$, $\kappa$, $\lambda$, $\mathbf{A}$, and $\mathbf{s}$.

1: Initialization: $t=0$, the global model $\boldsymbol{\theta}^{(0)}$.
2: for $t\in[T]$ do
3:   PS does:
4:   Estimate the CSI matrices $\{\mathbf{H}_{m}^{(t)}\}$, and optimize $\mathbf{F}^{(t)}$ and $\{\mathbf{P}_{m}^{(t)}\}$;
5:   Send $\boldsymbol{\theta}^{(t)}$ and $\{\mathbf{P}_{m}^{(t)}\}$ to the edge devices through error-free links;
6:   Each device $m$ does in parallel:
7:   Compute $\mathbf{g}_{m}^{(t)}$ based on $\mathcal{A}_{m}$ and $\boldsymbol{\theta}^{(t)}$ via \eqrefeq:GradDef;
8:   Compute $\mathbf{G}_{m}^{(t)}$ via \eqrefeq:Complexification-\eqrefeq:G_def;
9:   Update $\boldsymbol{\Delta}_{m}^{(t+1)}$ via \eqrefeq:Delta_update;
10:  Send $\mathbf{G}_{m}^{(t)}$ to the PS with the precoding matrix $\mathbf{P}_{m}^{(t)}$ over the MIMO channel in \eqrefeq:receive_data;
11:  PS does:
12:  Reshape $\hat{\mathbf{R}}^{(t)}$ into $\hat{\mathbf{r}}^{(t)}$ via \eqrefeq:compressed_problem;