
Over-the-Air Federated Learning Over MIMO Channels: A Sparse-Coded Multiplexing Approach

Chenxi Zhong, and Xiaojun Yuan,  C. Zhong, and X. Yuan are with the National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu, China (e-mail: [email protected]; [email protected]). The corresponding author is Xiaojun Yuan.
Abstract

The communication bottleneck of over-the-air federated learning (OA-FL) lies in uploading the gradients of local learning models. In this paper, we study how to reduce the communication overhead of gradient uploading by using the multiple-input multiple-output (MIMO) technique. We propose a novel sparse-coded multiplexing (SCoM) approach that employs sparse-coding compression and MIMO multiplexing to balance the communication overhead and the learning performance of the FL model. We derive an upper bound on the learning performance loss of the SCoM-based MIMO OA-FL scheme by quantitatively characterizing the gradient aggregation error. Based on the analytical results, we show that the optimal number of multiplexed data streams to minimize the upper bound on the FL learning performance loss is the minimum of the numbers of transmit and receive antennas. We then formulate an optimization problem for the design of the precoding and post-processing matrices to minimize the gradient aggregation error. To solve this problem, we develop a low-complexity algorithm based on alternating optimization (AO) and the alternating direction method of multipliers (ADMM), which effectively mitigates the impact of the gradient aggregation error. Numerical results demonstrate the superior performance of the proposed SCoM approach.

Index Terms:
Over-the-air federated learning, multiple-input multiple-output access channel, multiplexing, turbo compressed sensing.

I Introduction

Sixth-generation (6G) wireless communications, expected to support connection densities of up to millions of wireless devices per square kilometer, will provide a solid foundation for fulfilling the vision of ubiquitous intelligence [zhou2020service]. To develop a powerful intelligent model, it is necessary to exploit the diversity of data distributed over a large number of edge devices. A straightforward paradigm is to require edge devices to send their local data to a central parameter server (PS), which trains the model centrally. However, sending raw data to the PS incurs a huge communication overhead and may compromise user privacy. To overcome these drawbacks, federated learning (FL) is a promising alternative that allows edge devices to collaboratively train a machine learning (ML) model without sharing their local data [mcmahan2017communication]. Instead of uploading raw data, in each FL training round every edge device sends its local gradient to the PS, and the PS aggregates the local gradients, updates the global model, and sends it back to the devices.

Gradient uploading is a critical bottleneck in deploying FL over a wireless network, since it is difficult to support the communication demands of massive numbers of edge devices with limited communication resources (e.g., time, bandwidth, and space). For example, the dimension of recent ML models is extremely large: ResNet-152 has 60 million parameters [he2016deep], while GPT-3 has 175 billion parameters [brown2020language]. Yet the available channel resources are typically scarce due to bandwidth and latency limitations; e.g., one LTE frame of 5 MHz bandwidth and 10 ms duration can carry only 50,000 complex symbols. Fortunately, in FL the PS does not need the local gradient of each individual device, but only the aggregated gradient, usually the mean of all the local gradients. Based on this property, over-the-air FL (OA-FL) is proposed in [nazer2007computation, zhu2020broadband, amiri2020federated], where edge devices share the same wireless resources to upload their local gradients. Thanks to the analog superposition of electromagnetic waves, the local gradients are aggregated over the air during uploading. Compared with traditional orthogonal multiple access (OMA) approaches [yang2021energy, vu2020cell, vu2022joint], OA-FL does not require more communication resources as the number of devices increases [amiri2020federated], which greatly alleviates the communication bottleneck of gradient uploading. Pioneering studies demonstrate that OA-FL also exhibits strong noise tolerance and significant latency improvement [zhu2020broadband, liu2020reconfigurable].

Due to the appealing features of OA-FL, much research effort has been devoted to the design of efficient OA-FL systems. For example, ref. [lin2017deep] pointed out that the local gradients can be sparsified, compressed, and quantized to reduce the communication overhead without causing substantial losses in accuracy. Ref. [amiri2020federated] proposed an efficient scheme where the local gradients are sparsified and compressed by linear coding before being uploaded, and the aggregated gradient is recovered at the PS via compressed sensing methods. The authors in [ma2022over] used partial-orthogonal compressing matrices and turbo compressed sensing (Turbo-CS) [ma2014turbo], achieving a lower-complexity sparse-coding scheme. It has been shown in [ma2022over] that sparse coding enables the OA-FL system to achieve a lower communication overhead and a faster convergence rate.

The above schemes are all based on single-input single-output (SISO) systems. Multiple-input multiple-output (MIMO) with array signal processing has been widely recognized as a powerful technique to enhance the system capacity. MIMO multiplexing significantly reduces the number of channel uses by transmitting multiple spatial data streams in parallel through antenna arrays [spencer2004zero]. However, MIMO multiplexing causes inter-stream interference, which corrupts the aggregated gradient and degrades the test accuracy of OA-FL. There have been some preliminary attempts to alleviate the impact of inter-stream interference by designing the precoding matrices at the devices and the post-processing matrix at the PS. For instance, ref. [zhu2018mimo] set the precoding matrix to the pseudo-inverse of the channel matrix, and derived a closed-form equalizer as the post-processing matrix using differential geometry. The scheme proposed in [chen2018over] also uses the pseudo-inverse of the channel matrix as the precoding matrix, and computes the post-processing matrix based on receive antenna selection. However, both methods rely on channel inversion, which may significantly amplify noise and hence exacerbate the gradient aggregation error, especially when some devices suffer from deep channel fading [zhu2020broadband, liu2020reconfigurable, zhong2022over].

In this paper, we consider an over-the-air federated learning (OA-FL) network where the local gradients are uploaded over a MIMO multiple access (MAC) channel. The channel connects a central PS with multiple antennas and several edge devices, each equipped with multiple antennas. We propose a novel sparse-coded multiplexing (SCoM) approach that integrates sparse coding with MIMO multiplexing in gradient uploading. By combining these two techniques, SCoM achieves a strikingly better balance between the communication overhead and the learning performance. On one hand, sparse coding exploits the sparsity of the gradients to compress them, reducing the communication overhead. On the other hand, MIMO multiplexing reduces the number of channel uses by transmitting multiple streams in parallel, and suppresses the gradient aggregation error through the precoding and post-processing matrices. The main contributions are summarized as follows.

  • We propose a novel SCoM approach for gradients uploading in MIMO OA-FL. We derive an upper bound on the learning performance loss of the SCoM-based MIMO OA-FL scheme by quantitatively characterizing the gradient aggregation error.

  • Based on the analytical result, we formulate a joint precoding and post-processing matrices optimization problem for suppressing the gradient aggregation error. We design a low-complexity algorithm that employs alternating optimization (AO) and alternating direction method of multipliers (ADMM) to jointly optimize the precoding and post-processing matrices.

  • We derive the optimal number of multiplexed data streams for SCoM to balance the communication overhead and the gradient aggregation performance. More specifically, the optimal number of multiplexed data streams to minimize the upper bound of the learning performance loss is given by min{NT,NR}\min\{N_{\mathrm{T}},N_{\mathrm{R}}\}, where NTN_{\mathrm{T}} denotes the number of transmit antennas, and NRN_{\mathrm{R}} denotes the number of receive antennas.

Numerical results demonstrate that our proposed SCoM approach achieves the same test accuracy with much lower communication overhead than other existing approaches, which indicates the superior performance of the SCoM approach.

The rest of this paper is structured as follows. Section II introduces the FL model and the MIMO MAC channel. Section III presents the proposed SCoM approach. The analysis of the learning performance of the SCoM approach is presented in Section IV. In Section V, we present the optimization problem to minimize the gradient aggregation error, and propose a low-complexity algorithm to jointly optimize precoding and post-processing matrices. Section VI presents numerical results to evaluate the SCoM approach and Section VII concludes the paper.

Notation: \mathbbR\mathbb{R} and \mathbbC\mathbb{C} denote the sets of real and complex numbers, respectively. \tr()\tr(\cdot), \rank()\rank(\cdot), ()(\cdot)^{\dagger}, ()T(\cdot)^{\mathrm{T}}, and ()H(\cdot)^{\mathrm{H}} are used to denote the trace, the rank, the conjugate, the transpose, and the conjugate transpose of the matrix, respectively. [M][M] denotes the set {m|1mM}\{m|1\leq m\leq M\}. 𝐬(1:N)\mathbf{s}(1:N) denotes a sub-vector of 𝐬\mathbf{s} that contains entries from index 11 to index NN. The expectation operator is denoted by \mathbbE[]\mathbb{E}[\cdot]. We use 𝐈N\mathbf{I}_{N}, and 𝟎N×M\mathbf{0}_{N\times M} to denote the identity matrix of size N×NN\times N and the zero matrix of size N×MN\times M, respectively. We use 2\|\cdot\|_{2}, and F\|\cdot\|_{\mathrm{F}} to denote the l2l_{2}-norm and the Frobenius norm, respectively. 𝒞𝒩(μ,σ)\mathcal{CN}(\mu,\sigma) denotes the circularly-symmetric complex Gaussian (CSCG) distribution that has a mean of μ\mu and a covariance of σ\sigma.

II System Model

II-A Federated Learning

We start with the description of the FL task deployed on a wireless communication system, where the system consists of one central PS and MM edge devices. We assume that the training data of the FL task are all distributedly stored on the edge devices. Let 𝒜m\mathcal{A}_{m} denote the local dataset of the mm-th device, and QmQ_{m} denote the cardinality of 𝒜m\mathcal{A}_{m}. Q=m=1MQmQ=\sum_{m=1}^{M}Q_{m} is the total number of training data samples for the FL task. \boldsymbolθ\mathbbRD\boldsymbol{\theta}\in\mathbb{R}^{D} is the model parameter vector with DD being the total length of the model parameter. The target of the FL task is to minimize an empirical loss function F(\boldsymbolθ)F(\boldsymbol{\theta}) based on the local datasets {𝒜m}\{\mathcal{A}_{m}\}, given by

min\boldsymbolθF(\boldsymbolθ)=1Qm=1MQmFm(\boldsymbolθ)=1Qm=1Mn=1Qmf(\boldsymbolθ;\boldsymbolζm,n),\min_{\boldsymbol{\theta}}\quad F(\boldsymbol{\theta})=\frac{1}{Q}\sum\nolimits_{m=1}^{M}Q_{m}F_{m}(\boldsymbol{\theta})=\frac{1}{Q}\sum\nolimits_{m=1}^{M}\sum\nolimits_{n=1}^{Q_{m}}f\left(\boldsymbol{\theta};\boldsymbol{\zeta}_{m,n}\right), (1)

where Fm(\boldsymbolθ)=1Qmn=1Qmf(\boldsymbolθ;\boldsymbolζm,n)F_{m}(\boldsymbol{\theta})=\frac{1}{Q_{m}}\sum_{n=1}^{Q_{m}}f\!(\boldsymbol{\theta};\boldsymbol{\zeta}_{m,n}) is the local loss function of device mm, and f(\boldsymbolθ;\boldsymbolζm,n)f\!(\boldsymbol{\theta};\boldsymbol{\zeta}_{m,n}) is the sample-wise loss function for the nn-th training sample \boldsymbolζm,n\boldsymbol{\zeta}_{m,n} in 𝒜m\mathcal{A}_{m}.

To minimize the empirical loss function $F(\boldsymbol{\theta})$ in \eqrefeq:FLTarget, the FL training involves $T$ communication rounds between the edge devices and the PS until $F(\boldsymbol{\theta})$ converges. Specifically, each communication round $t\in[T]$ consists of four steps (a minimal code sketch of one round is given after the list):

  • Global model download: The PS sends the global model \boldsymbolθ(t)\boldsymbol{\theta}^{(t)} to each edge device.

  • Local gradients computation: The local gradient $\mathbf{g}_{m}^{(t)}\in\mathbb{R}^{D}$ is computed by device $m$ based on its local dataset and the global model, given by

    𝐠m(t)=Fm(\boldsymbolθ(t))=1Qmn=1Qmf(\boldsymbolθ(t);\boldsymbolζm,n).\mathbf{g}_{m}^{(t)}=\nabla F_{m}(\boldsymbol{\theta}^{(t)})=\frac{1}{Q_{m}}\sum\nolimits_{n=1}^{Q_{m}}\nabla f\left(\boldsymbol{\theta}^{(t)};\boldsymbol{\zeta}_{m,n}\right). (2)
  • Local gradients upload: The edge devices send the local gradients to the PS through wireless channels.

  • Global model aggregation: The local gradients are aggregated as

    𝐠(t)=1Qm=1MQm𝐠m(t)=m=1Mqm𝐠m(t),\mathbf{g}^{(t)}=\frac{1}{Q}\sum\nolimits_{m=1}^{M}Q_{m}\mathbf{g}_{m}^{(t)}=\sum\nolimits_{m=1}^{M}q_{m}\mathbf{g}_{m}^{(t)}, (3)

    where 𝐠(t)\mathbf{g}^{(t)} denotes the aggregated gradient, and qm=Qm/Q,m[M]q_{m}=Q_{m}/Q,\forall m\in[M]. The global model \boldsymbolθ(t+1)\boldsymbol{\theta}^{(t+1)} is updated by

    \boldsymbolθ(t+1)=\boldsymbolθ(t)η𝐠(t),\boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\eta\mathbf{g}^{(t)}, (4)

    where η\mathbbR\eta\in\mathbb{R} denotes the learning rate.
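As a concrete illustration of one communication round, the following minimal NumPy sketch performs the weighted aggregation in (3) and the model update in (4). The toy dimensions, learning rate, and random gradients are placeholders for illustration, not part of the proposed scheme.

```python
import numpy as np

def fl_round(theta, local_grads, Q_list, eta):
    """One FL round: weighted aggregation (3) followed by the model update (4)."""
    Q = sum(Q_list)
    g = sum((Qm / Q) * gm for gm, Qm in zip(local_grads, Q_list))  # aggregated gradient, eq. (3)
    return theta - eta * g                                          # global model update, eq. (4)

# Toy usage: M = 3 devices, model dimension D = 5 (placeholder values).
theta = np.zeros(5)
local_grads = [np.random.randn(5) for _ in range(3)]
theta_next = fl_round(theta, local_grads, Q_list=[100, 200, 300], eta=0.01)
```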

II-B MIMO Channel Model


Figure 1: An illustration of the considered MIMO OA-FL system.

We now introduce the wireless multi-user multiple-input multiple-output (MIMO) channel for the above FL system. As depicted in Fig. 1, the considered MIMO OA-FL system consists of a PS with $N_{\mathrm{R}}$ antennas and $M$ edge devices, each equipped with $N_{\mathrm{T}}$ antennas. As in previous studies on OA-FL [amiri2020federated, sery2021over, cao2022transmission], we make two assumptions: that the global model is downloaded through error-free links (in practice, channel noise causes communication errors in the model download; this issue can be addressed by the schemes proposed in [vu2020cell, vu2022joint] and is beyond the scope of this paper), and that the devices upload the local gradients to the PS synchronously (existing techniques in 4G Long Term Evolution, e.g., the timing advance (TA) mechanism, can synchronize the gradient symbols among the edge devices [3gpp.38.213]). We now focus on the process of local gradients uploading. We consider a block-fading channel, where the channel state information (CSI) remains constant during the gradients uploading. Let $\mathbf{H}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{R}}\times N_{\mathrm{T}}}$ denote the CSI matrix between the $m$-th device and the PS at the $t$-th round. We assume that the PS has perfect knowledge of the CSI of the wireless channels between the devices and the PS (approaches to CSI estimation over MIMO MAC channels can be found in [nguyen2013compressive, wen2014channel, vu2020cell, vu2022joint]). Thus, at the PS, the receive signal matrix from the above MIMO multiple access (MAC) channel is given by

𝐘(t)=m=1M𝐇m(t)𝐗m(t)+𝐍(t)\mathbbCNR×K,\mathbf{Y}^{(t)}=\sum\nolimits_{m=1}^{M}\mathbf{H}_{m}^{(t)}\mathbf{X}_{m}^{(t)}+\mathbf{N}^{(t)}\in\mathbb{C}^{N_{\mathrm{R}}\times K}, (5)

where KK denotes the number of channel uses at the tt-th round; 𝐗m(t)\mathbbCNT×K\mathbf{X}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{T}}\times K} denotes the transmit data matrix for the mm-th device; and 𝐍(t)\mathbbCNR×K\mathbf{N}^{(t)}\in\mathbb{C}^{N_{\mathrm{R}}\times K} is an additive white Gaussian noise (AWGN) matrix, with the entries independently drawn from 𝒞𝒩(0,σnoise)\mathcal{CN}(0,\sigma_{\mathrm{noise}}). Let 𝐱m,k\mathbf{x}_{m,k} denote the kk-th column of 𝐗m(t)\mathbf{X}_{m}^{(t)}. Here, we consider the following transmit power constraint:

\mathbbE[𝐱m,k(t)22]P0,k[K],\mathbb{E}[\|\mathbf{x}_{m,k}^{(t)}\|_{2}^{2}]\leq P_{0},\forall k\in[K], (6)

where P0P_{0} is the transmit power budget.
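The received-signal model in \eqrefeq:receive_data and the power constraint in \eqrefeq:trans_power_constr1 can be simulated as in the following sketch; the channel statistics, dimensions, and placeholder transmit matrices below are illustrative assumptions, not the SCoM design (which is specified in Section III).

```python
import numpy as np

M, N_R, N_T, K = 4, 8, 4, 100             # devices, receive/transmit antennas, channel uses
P0, sigma_noise = 1.0, 0.1                # power budget and noise variance (assumed values)

def crandn(*shape):
    """i.i.d. CN(0, 1) entries."""
    return (np.random.randn(*shape) + 1j * np.random.randn(*shape)) / np.sqrt(2)

H = [crandn(N_R, N_T) for _ in range(M)]                      # CSI matrices H_m
X = [np.sqrt(P0 / N_T) * crandn(N_T, K) for _ in range(M)]    # placeholder transmit matrices
N = np.sqrt(sigma_noise) * crandn(N_R, K)                     # AWGN with variance sigma_noise
Y = sum(Hm @ Xm for Hm, Xm in zip(H, X)) + N                  # received signal, eq. (5)

# Each column of X_m satisfies E||x_{m,k}||_2^2 = P0, consistent with eq. (6).
avg_col_power = np.mean([np.mean(np.sum(np.abs(Xm) ** 2, axis=0)) for Xm in X])
```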

What remains is to map the local gradients $\{\mathbf{g}_{m}^{(t)}\}$ to the transmit matrices $\{\mathbf{X}_{m}^{(t)}\}$ at the edge devices, and to recover the aggregated gradient from the receive signal matrix $\mathbf{Y}^{(t)}$. These issues are discussed in detail in the next section.

III Proposed Sparse-Coded Multiplexing Approach

With the development of deep learning, the size of ML models keeps increasing. To upload the high-dimensional FL local gradients over the aforementioned MIMO channel, the key challenge is the heavy communication burden. Although transmitting data streams in parallel with MIMO multiplexing efficiently reduces the communication overhead, MIMO multiplexing also introduces interference between the data streams, resulting in a loss of FL learning performance.

To address these challenges, we propose a novel transmission scheme, i.e., the sparse-coded multiplexing (SCoM) approach, for the above MIMO OA-FL system, as shown in Fig. 2. The SCoM approach employs two techniques: sparse coding and MIMO multiplexing. On one hand, sparse coding exploits the sparsity of the gradients to compress them, reducing the communication overhead; meanwhile, it leverages the compressing matrix to encode the data streams, thereby reducing the correlation between data streams and suppressing inter-stream interference. On the other hand, MIMO multiplexing reduces the number of channel uses by transmitting multi-stream data through antenna arrays, and suppresses inter-stream interference through the precoding and post-processing matrices. The details of the SCoM approach are given as follows.


Figure 2: A flow diagram of local gradients uploading at round tt.

III-A Processing on Devices

To support the local gradients uploading in the MIMO OA-FL system, the pre-processing operations are first conducted on the edge devices, including the gradient sparsification [amiri2020federated] and the gradient compression[ma2014turbo].

To be specific, for the mm-th edge device, the local gradient 𝐠m(t)\mathbf{g}_{m}^{(t)} at the tt-th round is first complexified to fully utilize the spectral efficiency of complex channels. The complexified gradient is denoted by

\mathbf{g}_{m}^{\mathrm{cx}(t)}=\mathbf{g}_{m}^{(t)}(1:D/2)+j\,\mathbf{g}_{m}^{(t)}(D/2+1:D)\in\mathbb{C}^{D/2}, (7)

where j=1j=\sqrt{-1}. Then, the accumulated gradient is obtained by

𝐠mac(t)=𝐠mcx(t)+\boldsymbolΔm(t)\mathbbCD/2,\mathbf{g}_{m}^{\mathrm{ac}(t)}=\mathbf{g}_{m}^{\mathrm{cx}(t)}+\boldsymbol{\Delta}_{m}^{(t)}\in\mathbb{C}^{D/2}, (8)

where $\boldsymbol{\Delta}_{m}^{(t)}$ denotes the error-accumulation vector of the $m$-th device at the $t$-th round, with $\boldsymbol{\Delta}_{m}^{(0)}$ initialized as $\mathbf{0}$. Having computed the accumulated gradient, the $m$-th device obtains the sparsified gradient via

𝐠msp(t)=sp(𝐠mac(t),λ)\mathbbCD/2,\mathbf{g}_{m}^{\mathrm{sp}(t)}=\mathrm{sp}(\mathbf{g}_{m}^{\mathrm{ac}(t)},\lambda)\in\mathbb{C}^{D/2}, (9)

where $\lambda\in[0,1]$ denotes the sparsity ratio. The operator $\mathrm{sp}(\cdot)$ retains the $\lambda D/2$ entries of $\mathbf{g}_{m}^{\mathrm{ac}(t)}$ with the largest magnitudes, and sets the remaining $(1-\lambda)D/2$ entries to $0$. The error-accumulation vector $\boldsymbol{\Delta}_{m}^{(t+1)}$ is then updated via

\boldsymbolΔm(t+1)=𝐠mac(t)𝐠msp(t).\boldsymbol{\Delta}_{m}^{(t+1)}=\mathbf{g}_{m}^{\mathrm{ac}(t)}-\mathbf{g}_{m}^{\mathrm{sp}(t)}. (10)
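A minimal sketch of the sparsification operator $\mathrm{sp}(\cdot)$ in (9) together with the error-accumulation updates in (8) and (10) is given below; the function name and the guard on the number of retained entries are illustrative assumptions.

```python
import numpy as np

def sparsify_with_accumulation(g_cx, delta, lam):
    """Return the sparsified gradient (9) and the next error-accumulation vector (10)."""
    g_ac = g_cx + delta                                  # accumulated gradient, eq. (8)
    k = max(1, int(np.ceil(lam * g_ac.size)))            # keep the λ·D/2 largest-magnitude entries
    keep = np.argsort(np.abs(g_ac))[-k:]
    g_sp = np.zeros_like(g_ac)
    g_sp[keep] = g_ac[keep]
    delta_next = g_ac - g_sp                             # dropped entries are accumulated for later rounds
    return g_sp, delta_next
```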

Then 𝐠msp(t)\mathbf{g}_{m}^{\mathrm{sp}(t)} is normalized to 𝐠mno(t)\mathbf{g}_{m}^{\mathrm{no}(t)} by

𝐠mno(t)=(𝐠msp(t)𝐬)/σm(t)\mathbbCD/2,\mathbf{g}_{m}^{\mathrm{no}(t)}=(\mathbf{g}_{m}^{\mathrm{sp}(t)}\odot\mathbf{s})/\sqrt{\sigma_{m}^{(t)}}\in\mathbb{C}^{D/2}, (11)

where 𝐬\mathbbCD/2\mathbf{s}\in\mathbb{C}^{D/2} is a random flipping vector, with each entry of 𝐬\mathbf{s} being independent and identically distributed (i.i.d.) drawn from {1,+1}\{-1,+1\}; and σm(t)=1D/2d=1D/2|gmsp(t)[d]|2\sigma_{m}^{(t)}=\frac{1}{D/2}\sum_{d=1}^{D/2}|g_{m}^{\mathrm{sp}(t)}[d]|^{2} denotes the variance of {gmsp(t)[d]}d=1D/2\{g_{m}^{\mathrm{sp}(t)}[d]\}_{d=1}^{D/2}, with gmsp(t)[d]g_{m}^{\mathrm{sp}(t)}[d] being the dd-th entry of 𝐠msp(t)\mathbf{g}_{m}^{\mathrm{sp}(t)}. From \eqrefeq:g_normalize, the entries of 𝐠mno(t)\mathbf{g}_{m}^{\mathrm{no}(t)} have zero-mean and unit variance.

Each device mm then compresses 𝐠mno(t)\mathbf{g}_{m}^{\mathrm{no}(t)} into a low-dimensional vector via a common compressing matrix. Specifically, the compressed gradient is given by

𝐠mcp(t)=𝐀𝐠mno(t)\mathbbCC,\mathbf{g}_{m}^{\mathrm{cp}(t)}=\mathbf{A}\mathbf{g}_{m}^{\mathrm{no}(t)}\in\mathbb{C}^{C}, (12)

where $\mathbf{A}\in\mathbb{C}^{C\times(D/2)}$ denotes the common compressing matrix (we assume that both the compressing matrix $\mathbf{A}$ and the flipping vector $\mathbf{s}$ remain invariant throughout the FL training process and are shared among the devices prior to training), $C$ denotes the length of the compressed gradient, and $\kappa=\frac{C}{D/2}$ denotes the compression ratio. Inspired by [ma2014turbo, ma2022over], we employ a partial DFT matrix as the compressing matrix, given by $\mathbf{A}=\mathbf{S}\boldsymbol{\Xi}$, where $\mathbf{S}\in\mathbb{R}^{C\times D/2}$ is a selection matrix consisting of $C$ randomly selected and reordered rows of the $D/2\times D/2$ identity matrix $\mathbf{I}_{D/2}$, and $\boldsymbol{\Xi}\in\mathbb{C}^{D/2\times D/2}$ is a unitary DFT matrix whose $(d,d^{\prime})$-th entry is $\frac{1}{\sqrt{D/2}}\exp\left(-2\pi j\frac{(d-1)(d^{\prime}-1)}{D/2}\right)$. We note that the partial DFT matrix has lower computational complexity and better performance than other types of compressing matrices, such as i.i.d. Gaussian matrices [ma2015performance].
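Since $\mathbf{A}=\mathbf{S}\boldsymbol{\Xi}$ is a row-selected unitary DFT, the compression in (12) can be computed with an FFT instead of an explicit $C\times(D/2)$ matrix, as sketched below. The helper names and the random seed are assumptions for illustration only.

```python
import numpy as np

def make_selection(D_half, C, seed=0):
    """Randomly select and reorder C of the D/2 DFT rows (the selection matrix S)."""
    return np.random.default_rng(seed).permutation(D_half)[:C]

def compress(g_no, rows):
    """Compute g^cp = A g^no = S Xi g^no via a unitary FFT, eq. (12)."""
    D_half = g_no.size
    dft = np.fft.fft(g_no) / np.sqrt(D_half)   # Xi g^no (np.fft uses exp(-2*pi*j...), matching Xi)
    return dft[rows]                            # keep the C selected rows
```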

To transmit the data with multiple streams, device mm then reshapes the compressed gradient 𝐠mcp(t)\mathbf{g}_{m}^{\mathrm{cp}(t)} into 𝐆m(t)\mathbbCNS×K\mathbf{G}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{S}}\times K} as

𝐆m(t)=[𝐠m,1cp(t),,𝐠m,NScp(t)]T,\mathbf{G}_{m}^{(t)}=[\mathbf{g}_{m,1}^{\mathrm{cp}(t)},\cdots,\mathbf{g}_{m,N_{\mathrm{S}}}^{\mathrm{cp}(t)}]^{T}, (13)

where NSN_{\mathrm{S}} denotes the number of data streams, 𝐠m,ncp(t)=𝐠mcp(t)((n1)K+1:nK),n[NS]\mathbf{g}_{m,n}^{\mathrm{cp}(t)}=\mathbf{g}_{m}^{\mathrm{cp}(t)}((n-1)K+1:nK),n\in[N_{\mathrm{S}}]. Naturally, the number of channel uses satisfies the following equation:

K=CNS=κD/2NS.K=\frac{C}{N_{\mathrm{S}}}=\frac{\kappa D/2}{N_{\mathrm{S}}}. (14)

We are now ready to describe the design of the transmit matrix 𝐗m(t)\mathbf{X}_{m}^{(t)}. The transmit matrix 𝐗m(t)\mathbf{X}_{m}^{(t)} is given by 𝐗m(t)=𝐏m(t)𝐆m(t)\mathbf{X}_{m}^{(t)}=\mathbf{P}_{m}^{(t)}\mathbf{G}_{m}^{(t)}, where 𝐏m(t)\mathbbCNT×NS\mathbf{P}_{m}^{(t)}\in\mathbb{C}^{N_{\mathrm{T}}\times N_{\mathrm{S}}} denotes the precoding matrix for the mm-th device. Let 𝐠m,k(t)\mathbbCNS\mathbf{g}_{m,k}^{(t)}\in\mathbb{C}^{N_{\mathrm{S}}} denote the kk-th column of 𝐆m(t)\mathbf{G}_{m}^{(t)}. We have 𝐱m,k(t)=𝐏m(t)𝐠m,k(t),k[K]\mathbf{x}_{m,k}^{(t)}=\mathbf{P}_{m}^{(t)}\mathbf{g}_{m,k}^{(t)},\forall k\in[K]. From the transmit power constraint in \eqrefeq:trans_power_constr1, we have

\mathbbE[𝐏m(t)𝐠m,k(t)22]\overset(a)=𝐏m(t)F2P0,k[K],\mathbb{E}[\|\mathbf{P}_{m}^{(t)}\mathbf{g}_{m,k}^{(t)}\|_{2}^{2}]\overset{(a)}{=}\|\mathbf{P}_{m}^{(t)}\|_{\mathrm{F}}^{2}\leq P_{0},\forall k\in[K], (15)

where the step (a) is due to the normalization in \eqrefeq:g_normalize.
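The remaining device-side steps, reshaping into $N_{\mathrm{S}}$ streams as in (13)-(14) and precoding under the power constraint (15), may be sketched as follows. The precoding matrix here is an arbitrary placeholder; its actual design is the subject of Section V.

```python
import numpy as np

def reshape_streams(g_cp, N_S):
    """Split the length-C compressed gradient into N_S rows of K = C / N_S symbols, eq. (13)."""
    K = g_cp.size // N_S                        # number of channel uses, eq. (14)
    return g_cp.reshape(N_S, K)                 # row n holds g^cp((n-1)K+1 : nK)

def precode(P_m, G_m, P0):
    """Form X_m = P_m G_m; with unit-variance symbols, (15) reduces to ||P_m||_F^2 <= P0."""
    if np.linalg.norm(P_m, 'fro') ** 2 > P0:
        raise ValueError("precoder violates the transmit power budget")
    return P_m @ G_m
```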

III-B Processing on the PS

We now describe the processing operations at the PS. In the following, we first cast the recovery of the aggregated gradient as a compressed sensing problem, and then adopt the turbo compressed sensing (Turbo-CS) algorithm [ma2014turbo] to solve it.

At the PS, the receive signal matrix $\mathbf{Y}^{(t)}$ in \eqrefeq:receive_data is first processed by the post-processing matrix $\mathbf{F}^{(t)}\in\mathbb{C}^{N_{\mathrm{S}}\times N_{\mathrm{R}}}$, and the post-processed matrix $\hat{\mathbf{R}}^{(t)}$ is given by

𝐑^(t)=𝐅(t)(m=1M𝐇m(t)𝐏m(t)𝐆m(t)+𝐍(t))\mathbbCNS×K.\hat{\mathbf{R}}^{(t)}=\mathbf{F}^{(t)}\left(\sum\nolimits_{m=1}^{M}\mathbf{H}_{m}^{(t)}\mathbf{P}_{m}^{(t)}\mathbf{G}_{m}^{(t)}+\mathbf{N}^{(t)}\right)\in\mathbb{C}^{N_{\mathrm{S}}\times K}. (16)

$\hat{\mathbf{R}}^{(t)}$ is an approximation of the compressed gradient aggregation matrix $\mathbf{R}^{(t)}=\sum\nolimits_{m=1}^{M}q_{m}\mathbf{G}_{m}^{(t)}$, and the residual error is given by

𝐖(t)=𝐑^(t)𝐑(t).\mathbf{W}^{(t)}=\hat{\mathbf{R}}^{(t)}-\mathbf{R}^{(t)}. (17)

Based on \eqrefeq:error_receive_data, the PS converts the post-processed matrix 𝐑^\hat{\mathbf{R}} into an equivalent vector form:

𝐫^(t)=\vectorize(𝐑^(t))T=m=1Mqm\vectorize(𝐆m(t)T)+\vectorize(𝐖(t))T\overset(a)=𝐀𝐠no(t)+𝐰(t)\mathbbCC,\hat{\mathbf{r}}^{(t)}=\vectorize(\hat{\mathbf{R}}^{(t)}{}^{\mathrm{T}})=\sum\nolimits_{m=1}^{M}q_{m}\vectorize\left(\mathbf{G}_{m}^{(t)\mathrm{T}}\right)+\vectorize\left(\mathbf{W}^{(t)}{}^{\mathrm{T}}\right)\overset{(a)}{=}\mathbf{A}\mathbf{g}^{\mathrm{no}(t)}+\mathbf{w}^{(t)}\in\mathbb{C}^{C}, (18)

where step (a) is from \eqrefeq:g_compress and \eqrefeq:G_def, together with the definition of 𝐠no(t)=m=1Mqm𝐠mno(t)\mathbf{g}^{\mathrm{no}(t)}=\sum_{m=1}^{M}q_{m}\mathbf{g}_{m}^{\mathrm{no}(t)} and 𝐰(t)=\vectorize(𝐖(t))T\mathbf{w}^{(t)}=\vectorize\left(\mathbf{W}^{(t)}{}^{\mathrm{T}}\right). We assume that the entries of 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)} are independently drawn from a Bernoulli Gaussian distribution:

g_{d}^{\mathrm{no}(t)}\sim\begin{cases}0,&\text{with probability }1-\lambda^{\prime(t)},\\ \mathcal{CN}(0,\sigma_{g}^{(t)}),&\text{with probability }\lambda^{\prime(t)},\end{cases}\quad\forall d\in[D/2], (19)

where gdno(t)g_{d}^{\mathrm{no}(t)} is the dd-th entry of 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)}, λ(t)\lambda^{\prime(t)} is the sparsity ratio of 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)}, which can be estimated by the Expectation-Maximization algorithm [vila2013expectation], and σg(t)\sigma_{g}^{(t)} is the variance of the nonzero entries in 𝐠no(t)\mathbf{g}^{\mathrm{no}(t)}. Moreover, we assume that the entries of 𝐰(t)\mathbf{w}^{(t)} are i.i.d. drawn from 𝒞𝒩(0,σw(t))\mathcal{CN}(0,\sigma_{w}^{(t)}), where σw(t)\sigma_{w}^{(t)} represents the mean square error (MSE) of the compressed gradient aggregation, given by

σw(t)=1NSK\mathbbE𝐑^(t)𝐑(t)F2.\sigma_{w}^{(t)}=\frac{1}{N_{\mathrm{S}}K}\mathbb{E}\|\hat{\mathbf{R}}^{(t)}-\mathbf{R}^{(t)}\|_{\mathrm{F}}^{2}. (20)

The recovery of $\mathbf{g}^{\mathrm{no}(t)}$ from $\hat{\mathbf{r}}^{(t)}$ in \eqrefeq:compressed_problem is a compressed sensing problem, where the local gradients $\{\mathbf{g}_{m}^{(t)}\}$ are compressed by the partial DFT matrix $\mathbf{A}$ at the edge devices. As shown in [ma2014turbo, ma2022over], the Turbo-CS algorithm is a state-of-the-art solver for compressed sensing problems with partial-orthogonal sensing matrices. Thus, we employ the Turbo-CS algorithm to recover $\mathbf{g}^{\mathrm{no}(t)}$:

\hat{\mathbf{g}}^{\mathrm{no}(t)}=\text{Turbo-CS}(\hat{\mathbf{r}}^{(t)})\in\mathbb{C}^{D/2}. (21)

The details of the Turbo-CS algorithm are presented in the next subsection. The performance of the Turbo-CS algorithm in recovering the aggregated gradient depends on the variance of the noise $\mathbf{w}^{(t)}$, i.e., $\sigma_{w}^{(t)}$: a smaller $\sigma_{w}^{(t)}$ leads to a better recovery performance and a smaller loss of learning accuracy, as discussed in Section IV.
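Before turning to Turbo-CS, the following short sketch illustrates how the PS forms the observation vector: the post-processing in (16) followed by the vectorization in (18), where the rows (streams) of $\hat{\mathbf{R}}^{(t)}$ are concatenated so that the result lines up with $\mathbf{A}\mathbf{g}^{\mathrm{no}(t)}$. The function name is an illustrative assumption.

```python
import numpy as np

def ps_observation(F, Y):
    """Apply the post-processing matrix and stack the N_S streams into a length-C vector."""
    R_hat = F @ Y                       # post-processed matrix, eq. (16): N_S x K
    return R_hat.reshape(-1)            # concatenate rows (stream 1, stream 2, ...), eq. (18)
```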

III-C Turbo-CS Algorithm

As shown in Fig. 3, Turbo-CS conducts the iteration between modules A and B until convergence. Module A is a linear minimum mean-squared error (LMMSE) estimator handling the linear constraint in \eqrefeq:compressed_problem, and module B is a minimum mean-squared error (MMSE) denoiser exploiting the sparsity in \eqrefeq:g_agg_sparsity. We next give the operations of Turbo-CS in detail.


Figure 3: An illustration of the Turbo-CS algorithm.

The iterative process begins with module A. The inputs of the LMMSE estimator in module A are the a priori mean $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}\in\mathbb{C}^{D/2}$, the a priori variance $v_{\mathrm{A}}^{\mathrm{pri}(t)}$, and the observed vector $\hat{\mathbf{r}}^{(t)}$. With given $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}$, $v_{\mathrm{A}}^{\mathrm{pri}(t)}$, and $\hat{\mathbf{r}}^{(t)}$, the a posteriori mean $\mathbf{g}_{\mathrm{A}}^{\mathrm{post}(t)}$ and the a posteriori variance $v_{\mathrm{A}}^{\mathrm{post}(t)}$ of the LMMSE estimator are given by [kay1993fundamentals]

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{A}}^{\mathrm{post}(t)} &= \mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)} + \frac{v_{\mathrm{A}}^{\mathrm{pri}(t)}}{v_{\mathrm{A}}^{\mathrm{pri}(t)}+\sigma_{w}^{(t)}}\,\mathbf{A}^{\mathrm{H}}\!\left(\hat{\mathbf{r}}^{(t)}-\mathbf{A}\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}\right),\\
v_{\mathrm{A}}^{\mathrm{post}(t)} &= v_{\mathrm{A}}^{\mathrm{pri}(t)} - \kappa\cdot\frac{\big(v_{\mathrm{A}}^{\mathrm{pri}(t)}\big)^{2}}{v_{\mathrm{A}}^{\mathrm{pri}(t)}+\sigma_{w}^{(t)}},
\end{align}
\end{subequations}

where $\kappa$ is the compression ratio defined below \eqrefeq:g_compress. Then the extrinsic mean and variance of the LMMSE estimator are given by

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{A}}^{\mathrm{ext}(t)} &= v_{\mathrm{A}}^{\mathrm{ext}(t)}\left(\frac{\mathbf{g}_{\mathrm{A}}^{\mathrm{post}(t)}}{v_{\mathrm{A}}^{\mathrm{post}(t)}}-\frac{\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}}{v_{\mathrm{A}}^{\mathrm{pri}(t)}}\right),\\
v_{\mathrm{A}}^{\mathrm{ext}(t)} &= \left(1/v_{\mathrm{A}}^{\mathrm{post}(t)}-1/v_{\mathrm{A}}^{\mathrm{pri}(t)}\right)^{-1}.
\end{align}
\end{subequations}

The extrinsic messages $\{\mathbf{g}_{\mathrm{A}}^{\mathrm{ext}(t)},v_{\mathrm{A}}^{\mathrm{ext}(t)}\}$ are used to update the a priori mean $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}$ and the a priori variance $v_{\mathrm{B}}^{\mathrm{pri}(t)}$ as $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}=\mathbf{g}_{\mathrm{A}}^{\mathrm{ext}(t)}$ and $v_{\mathrm{B}}^{\mathrm{pri}(t)}=v_{\mathrm{A}}^{\mathrm{ext}(t)}$, which serve as the inputs of the MMSE denoiser in module B.

In module B, we model the a priori mean $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}$ as an observation of $\mathbf{g}^{\mathrm{no}(t)}$ corrupted by additive noise: $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}=\mathbf{g}^{\mathrm{no}(t)}+\boldsymbol{\delta}^{(t)}$, where the $d$-th entry of $\boldsymbol{\delta}^{(t)}$ satisfies $\delta_{d}^{(t)}\sim\mathcal{CN}(0,v_{\mathrm{B}}^{\mathrm{pri}(t)})$. The a posteriori mean $\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)}$ and variance $v_{\mathrm{B}}^{\mathrm{post}(t)}$ of the MMSE denoiser are given by

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)} &= \mathbb{E}\big[\mathbf{g}^{\mathrm{no}(t)}\,\big|\,\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}\big],\\
v_{\mathrm{B}}^{\mathrm{post}(t)} &= \frac{1}{D/2}\sum\nolimits_{d=1}^{D/2}\operatorname{var}\big[g^{\mathrm{no}(t)}_{d}\,\big|\,g_{\mathrm{B},d}^{\mathrm{pri}(t)}\big],
\end{align}
\end{subequations}

where $\operatorname{var}[a|b]=\mathbb{E}\big[\big|a-\mathbb{E}[a|b]\big|^{2}\,\big|\,b\big]$, and $g_{\mathrm{B},d}^{\mathrm{pri}(t)}$ is the $d$-th entry of $\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}$. Then the extrinsic messages of the MMSE denoiser are given by

\begin{subequations}
\begin{align}
\mathbf{g}_{\mathrm{B}}^{\mathrm{ext}(t)} &= v_{\mathrm{B}}^{\mathrm{ext}(t)}\left(\frac{\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)}}{v_{\mathrm{B}}^{\mathrm{post}(t)}}-\frac{\mathbf{g}_{\mathrm{B}}^{\mathrm{pri}(t)}}{v_{\mathrm{B}}^{\mathrm{pri}(t)}}\right),\\
v_{\mathrm{B}}^{\mathrm{ext}(t)} &= \left(1/v_{\mathrm{B}}^{\mathrm{post}(t)}-1/v_{\mathrm{B}}^{\mathrm{pri}(t)}\right)^{-1}.
\end{align}
\end{subequations}

The extrinsic messages $\{\mathbf{g}_{\mathrm{B}}^{\mathrm{ext}(t)},v_{\mathrm{B}}^{\mathrm{ext}(t)}\}$ are used to update the a priori mean $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}$ and the a priori variance $v_{\mathrm{A}}^{\mathrm{pri}(t)}$ as $\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}=\mathbf{g}_{\mathrm{B}}^{\mathrm{ext}(t)}$ and $v_{\mathrm{A}}^{\mathrm{pri}(t)}=v_{\mathrm{B}}^{\mathrm{ext}(t)}$, which serve as the inputs of the LMMSE estimator in module A. At the end of the iterative process, the final estimate is taken from the a posteriori output of module B, i.e., $\hat{\mathbf{g}}^{\mathrm{no}(t)}=\mathbf{g}_{\mathrm{B}}^{\mathrm{post}(t)}$.
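The full iteration can be summarized by the following NumPy sketch, which implements module A for the partial-DFT sensing matrix and module B as the MMSE denoiser under the Bernoulli-Gaussian prior in (19). The iteration count, the small variance clamp, and the variable names are implementation assumptions rather than part of the original algorithm specification.

```python
import numpy as np

def turbo_cs(r_hat, rows, D_half, sigma_w, lam_p, sigma_g, n_iter=30):
    """Turbo-CS sketch: LMMSE module A plus a Bernoulli-Gaussian MMSE denoiser (module B)."""
    kappa = r_hat.size / D_half                          # compression ratio

    def A_mul(x):                                        # A x = selected rows of the unitary DFT
        return (np.fft.fft(x) / np.sqrt(D_half))[rows]

    def AH_mul(y):                                       # A^H y = unitary inverse DFT of zero-padded y
        z = np.zeros(D_half, dtype=complex)
        z[rows] = y
        return np.fft.ifft(z) * np.sqrt(D_half)

    g_pri, v_pri = np.zeros(D_half, dtype=complex), 1.0
    gB_post = g_pri
    for _ in range(n_iter):
        # Module A: LMMSE estimation and its extrinsic message
        g_post = g_pri + v_pri / (v_pri + sigma_w) * AH_mul(r_hat - A_mul(g_pri))
        v_post = v_pri - kappa * v_pri ** 2 / (v_pri + sigma_w)
        v_ext = max(1.0 / (1.0 / v_post - 1.0 / v_pri), 1e-12)   # clamp for numerical robustness
        g_ext = v_ext * (g_post / v_post - g_pri / v_pri)
        gB_pri, vB = g_ext, v_ext

        # Module B: MMSE denoiser for the Bernoulli-Gaussian prior in (19)
        ratio = (1 - lam_p) / lam_p * (sigma_g + vB) / vB * \
            np.exp(-np.abs(gB_pri) ** 2 * (1.0 / vB - 1.0 / (sigma_g + vB)))
        pi = 1.0 / (1.0 + ratio)                         # posterior probability that an entry is nonzero
        m = sigma_g / (sigma_g + vB) * gB_pri            # conditional posterior mean given nonzero
        s = sigma_g * vB / (sigma_g + vB)                # conditional posterior variance given nonzero
        gB_post = pi * m
        vB_post = float(np.mean(pi * (s + np.abs(m) ** 2) - np.abs(gB_post) ** 2))
        vB_ext = max(1.0 / (1.0 / vB_post - 1.0 / vB), 1e-12)
        gB_ext = vB_ext * (gB_post / vB_post - gB_pri / vB)
        g_pri, v_pri = gB_ext, vB_ext
    return gB_post                                       # estimate of g^no, cf. (21)
```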

Then, based on the output of Turbo-CS, the aggregated gradient 𝐠^(t)\hat{\mathbf{g}}^{(t)} is given by

𝐠^(t)=[{σ(t)(𝐠^no(t)𝐬)},{σ(t)(𝐠^no(t)𝐬)}]\mathbbRD\hat{\mathbf{g}}^{(t)}=[\Re\{\sqrt{\sigma^{(t)}}(\hat{\mathbf{g}}^{\mathrm{no}(t)}\odot\mathbf{s})\},\Im\{\sqrt{\sigma^{(t)}}(\hat{\mathbf{g}}^{\mathrm{no}(t)}\odot\mathbf{s})\}]\in\mathbb{R}^{D} (22)

where σ(t)=1Mm=1Mσm(t)\sigma^{(t)}=\frac{1}{M}\sum_{m=1}^{M}\sigma_{m}^{(t)}. The PS then updates the model \boldsymbolθ(t+1)\boldsymbol{\theta}^{(t+1)} by

\boldsymbolθ(t+1)=\boldsymbolθ(t)η𝐠^(t).\boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\eta\hat{\mathbf{g}}^{(t)}. (23)
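The final rescaling in (22) and the update in (23) amount to the following short sketch; the argument names are illustrative.

```python
import numpy as np

def rebuild_and_update(theta, g_no_hat, s, sigma_mean, eta):
    """Undo flipping/normalization (22) and apply the global model update (23)."""
    g = np.sqrt(sigma_mean) * (g_no_hat * s)    # rescale and unflip the complex gradient estimate
    g_hat = np.concatenate([g.real, g.imag])    # back to a length-D real vector
    return theta - eta * g_hat
```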

To sum up, at each round tt, the receive matrix 𝐑^(t)\hat{\mathbf{R}}^{(t)} is first reshaped into a vector form in \eqrefeq:compressed_problem. Given the observed vector 𝐫^(t)\hat{\mathbf{r}}^{(t)} and the initial values {𝐠Apri(t)=𝟎,vApri(t)=1}\{\mathbf{g}_{\mathrm{A}}^{\mathrm{pri}(t)}=\mathbf{0},v_{\mathrm{A}}^{\mathrm{pri}(t)}=1\}, Turbo-CS iteratively calculates \eqrefeq:g_v_post_A-\eqrefeq:g_v_ext_B until a certain termination criterion is met. Finally, the output 𝐠^no(t)\hat{\mathbf{g}}^{\mathrm{no}(t)} is scaled as 𝐠^(t)\hat{\mathbf{g}}^{(t)} for the model update in \eqrefeq:SGD_error.

III-D Overall Scheme

The proposed SCoM approach to local gradient uploading is summarized in Algorithm 1 below, where lines 3-5 and 11-17 are executed at the PS, and lines 6-10 are executed at the devices.

Algorithm 1: Proposed SCoM-Based MIMO OA-FL Scheme

Require: $T$, $\{Q_{m}\}$, $N_{\mathrm{S}}$, $N_{\mathrm{R}}$, $N_{\mathrm{T}}$, $\kappa$, $\lambda$, $\mathbf{A}$, and $\mathbf{s}$.

1: Initialization: $t=0$, the global model $\boldsymbol{\theta}^{(0)}$.
2: for $t\in[T]$ do
3:   PS does:
4:   Estimate the CSI matrices $\{\mathbf{H}_{m}^{(t)}\}$, and optimize $\mathbf{F}^{(t)}$ and $\{\mathbf{P}_{m}^{(t)}\}$;
5:   Send $\boldsymbol{\theta}^{(t)}$ and $\{\mathbf{P}_{m}^{(t)}\}$ to the edge devices through error-free links;
6:   Each device $m$ does in parallel:
7:   Compute $\mathbf{g}_{m}^{(t)}$ based on $\mathcal{A}_{m}$ and $\boldsymbol{\theta}^{(t)}$ via \eqrefeq:GradDef;
8:   Compute $\mathbf{G}_{m}^{(t)}$ via \eqrefeq:Complexification-\eqrefeq:G_def;
9:   Update $\boldsymbol{\Delta}_{m}^{(t+1)}$ via \eqrefeq:Delta_update;
10:  Send $\mathbf{G}_{m}^{(t)}$ to the PS with the precoding matrix $\mathbf{P}_{m}^{(t)}$ over the MIMO channel in \eqrefeq:receive_data;
11:  PS does:
12:  Reshape $\hat{\mathbf{R}}^{(t)}$ into $\hat{\mathbf{r}}^{(t)}$ via \eqrefeq:compressed_problem;