
Joint Device Selection and Power Control for Wireless Federated Learning

Wei Guo, Ran Li, Chuan Huang, Xiaoqi Qin, Kaiming Shen, and Wei Zhang. This work was supported in part by the National Key R&D Program of China with grant No. 2018YFB1800800, by the Natural Science Foundation of China with grant No. 62022070, by the Guangdong Provincial Key Laboratory of Future Networks of Intelligence, by the Basic Research Project No. HZQB-KCZYZ-2021067 of Hetao Shenzhen-HK S&T Cooperation Zone, by the Shenzhen Science & Innovation Fund under Grant JCYJ20180507182451820, and by the Australian Research Council's Project funding scheme under LP160101244. This work was submitted in part to the 2022 IEEE Global Communications Conference. (Corresponding author: Chuan Huang.) W. Guo and R. Li are with the School of Science and Engineering (SSE) and the Future Network of Intelligence Institute (FNii), The Chinese University of Hong Kong, Shenzhen, China, 518172. Emails: [email protected] and [email protected]. C. Huang and K. Shen are with the School of Science and Engineering (SSE) and the Future Network of Intelligence Institute (FNii), The Chinese University of Hong Kong, Shenzhen, China, 518172, and the Peng Cheng Laboratory, Shenzhen, China, 518066. Emails: [email protected] and [email protected]. X. Qin is with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China, 100876. Email: [email protected]. W. Zhang is with the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia, NSW 2052. Email: [email protected].
Abstract

This paper studies the joint device selection and power control scheme for wireless federated learning (FL), considering both the downlink and uplink communications between the parameter server (PS) and the terminal devices. In each round of model training, the PS first broadcasts the global model to the terminal devices in an analog fashion, and then the terminal devices perform local training and upload the updated model parameters to the PS via over-the-air computation (AirComp). First, we propose an AirComp-based adaptive reweighing scheme for the aggregation of the locally updated models, where the model aggregation weights are directly determined by the uplink transmit power values of the selected devices, which enables joint learning and communication optimization simply through device selection and power control. Furthermore, we provide a convergence analysis for the proposed wireless FL algorithm and derive an upper bound on the expected optimality gap between the expected and optimal global loss values. With instantaneous channel state information (CSI), we formulate the optimality gap minimization problems under the individual and sum uplink transmit power constraints, respectively, which are shown to be solvable by the semidefinite relaxation (SDR) technique. Numerical results reveal that our proposed wireless FL algorithm achieves performance close to that of the ideal FedAvg scheme with error-free model exchange and full device participation.

Index Terms:
Federated Learning, over-the-air computation, device selection, power control, semidefinite relaxation.

I Introduction

The future 6th generation (6G) wireless communication systems are envisioned to support various intelligent applications and services, empowered by the rapid proliferation of wireless edge devices (e.g., mobile phones and sensors) with growing computation and communication capabilities [1]. The vast amount of data generated by the edge devices can be utilized to train machine learning (ML) models to further enhance the intelligence of those applications and services [2]. Conventional ML methods, particularly those based on deep neural networks (DNNs), are centralized and require collecting all the raw data at a central server or cloud to train the artificial intelligence (AI) model [3]. However, privacy and security issues arise during such data migrations, and the limited wireless resources also pose new technical challenges for the design of wireless AI systems [4, 5].

To address the issues of centralized ML, federated learning (FL), a new distributed ML paradigm, was proposed in [6], which has achieved great success and become increasingly popular in both academia and industry. In FL, the distributed terminal devices, orchestrated by a single parameter server (PS), collaboratively train an AI model in an iterative fashion. During the whole iterative process, all participating terminal devices only exchange the model parameters with the PS and keep the raw data locally to protect privacy and security. Nonetheless, FL faces several new challenges when implemented in wireless scenarios: 1) data heterogeneity: different from centralized ML, FL suffers severely from data heterogeneity [7, 8], which arises when the distributions of the generated data vary from device to device; and 2) communication complexity: the training process of FL consumes a large amount of communication resources when serving a large number of terminal devices, since the desired ML model is in general of high dimension and the number of updating rounds in the training process is considerably large [9, 10].

To enhance the communication efficiency of FL, comprehensive studies have been conducted from the communication perspective [11, 12, 14, 13, 15, 16, 19, 20, 22, 17, 21, 23, 24]. The authors in [11] proposed a joint communication and learning framework and formulated an optimization problem over user selection and wireless resource allocation to minimize the FL training loss. In [12], the authors identified the temporal dependency and varying significance of the training rounds, and proposed a joint device selection and bandwidth allocation scheme to maximize the weighted sum of selected clients over the long term. In [13], the authors proposed an update-aware device scheduling and resource allocation policy and analyzed the convergence of the FL algorithm with device scheduling. In [14], under a latency constraint, the authors proposed a joint device scheduling and resource allocation scheme to minimize the global loss of wireless FL, which achieves a desirable trade-off between the number of required training rounds and the latency per round. In [15], the authors proposed a joint wireless resource and quantization bit allocation policy to minimize the quantization error while guaranteeing the transmission outage probability of the uplink transmissions for wireless FL. In [16], the authors proposed a control algorithm to determine the learning parameters that minimize the optimality gap between the expected and optimal loss function values. However, all the aforementioned works rely on digital communications, and consequently the communication overhead and latency still increase with the number of participating edge devices [17]. Recently, by exploiting the waveform superposition nature of the wireless multiple access (MAC) channel [18], over-the-air computation (AirComp) enabled analog uplink transmission methods for model aggregation have been proposed for FL [19, 20, 22, 17, 21, 23, 24]. Specifically, in [19], the authors proposed a joint device scheduling and beamforming design to minimize the mean square error (MSE) of the aggregated signals and thereby accelerate the convergence of AirComp FL. In [20], the authors proposed an energy-aware device scheduling algorithm under an energy constraint to optimize the AirComp FL performance. In [17], the authors proposed a broadband AirComp FL scheme and investigated the corresponding power control and device scheduling problem, and the authors in [21] further adopted a one-bit quantization scheme to modify the policy in [17]. The authors in [22] proposed a gradient-aware power control scheme to enhance the performance of AirComp FL. In [23], the authors proposed a power control scheme to minimize the optimality gap between the expected and optimal global loss values of AirComp FL, and in a parallel work [24], the authors proposed a joint device selection and power control scheme for AirComp FL with uniform power scaling and equal-weight model aggregation.

All the above studies on FL over wireless channels considered only the uplink transmission of the model parameters from the terminal devices to the PS, assuming the availability of an accurate global model at the devices through perfect downlink transmissions. However, the downlink transmissions usually suffer from quantization error, limited transmit power, limited bandwidth, and additive noise, which degrade the performance of the considered FL systems [25, 26, 27, 28, 29]. Therefore, the broadcasting of an inaccurate global model is also an important issue for the implementation of FL algorithms. Without considering the wireless channels, the authors in [25] proposed a linear projection method to broadcast a compressed global model to the devices, and the authors in [26] derived sufficient conditions on the signal-to-noise ratios (SNRs) of both the downlink and uplink transmissions to maintain the linear convergence rate of the FL algorithm. Different from [25, 26], under wireless implementations of FL, the authors in [27, 28] provided convergence analyses for both digital and analog downlink transmissions and showed the advantages of the analog downlink transmission, and the same result was presented numerically in [29]. To the best of our knowledge, few existing works address how to efficiently mitigate the impact of both the downlink and uplink wireless communications on the FL algorithm.

This paper provides a comprehensive analysis of the wireless FL algorithm, considering analog (uncoded) downlink transmissions for global model broadcasting and AirComp enabled analog uplink transmissions for model aggregation, given the above-mentioned advantages of analog downlink and uplink transmissions for wireless FL. Compared to the previous works that only account for the uplink communications of FL [19, 20, 22, 17, 21, 23, 24] or only focus on the convergence analysis of FL [26, 27, 28], the goals of this paper are not only to investigate the impact of both the downlink and uplink wireless communications on the convergence behavior of the considered FL algorithm, but also to mitigate this impact to enhance the FL performance. The main contributions of this paper are summarized as follows:

  • We propose an AirComp-based adaptive reweighing scheme for model aggregation, where the adaptive weights are directly determined by the uplink transmit power values of the participating devices in each communication round. For the proposed policy, we analyze the convergence behavior of the wireless FL algorithm and derive an upper bound on the expected optimality gap between the expected and optimal global loss values, which characterizes how biased the converged solution is, theoretically quantifies the impact of both the downlink and uplink wireless communications on the convergence of the wireless FL algorithm, and thus needs to be minimized to enhance the FL performance.

  • With only instantaneous channel state information (CSI) known per round of the learning process, we formulate the optimality gap minimization problem as optimizing over the device selection and power control round by round, under both the individual and sum uplink transmit power constraints, respectively. Though the optimality gap minimization problem is a mixed integer programming (MIP) problem, we transform it into a continuous quadratically constrained ratio of two quadratic functions (QCRQ) minimization problem, which is efficiently solved by the semidefinite relaxation (SDR) technique. Furthermore, based on the optimal SDR solution, we derive the optimal device selection and power control to minimize the optimality gap for both the individual and sum uplink power constraint cases.

The remainder of this article is organized as follows. Section II presents the system model. In Section III, we provide the convergence analysis of the wireless FL system in terms of the optimality gap. In Section IV, we formulate the optimality gap minimization problem and find the optimal device selection and power control to minimize the optimality gap. Numerical results are presented in Section V. Finally, Section VI concludes this paper.

Notation: Bold lower-case letters denote vectors; bold upper-case letters denote matrices; calligraphic upper-case letters denote sets. $\mathbb{E}(\cdot)$ denotes the expectation operator. $\nabla$ denotes the gradient operator. For a vector $\mathbf{x}$, $\|\mathbf{x}\|_{2}$ denotes the Euclidean norm of $\mathbf{x}$, $\mathbf{x}^{T}$ denotes the transpose of $\mathbf{x}$, and $\mathbf{x}^{H}$ denotes the Hermitian transpose of a complex vector $\mathbf{x}$. $\langle\mathbf{x},\mathbf{y}\rangle$ denotes the inner product between vectors $\mathbf{x}$ and $\mathbf{y}$. $\mathbf{0}$ and $\mathbf{1}$ denote the null vector and the all-one vector, respectively. For a matrix $\mathbf{M}$, $\mathbf{M}^{T}$, $\mathtt{tr}(\mathbf{M})$, and $\mathtt{rank}(\mathbf{M})$ denote the transpose, trace, and rank of $\mathbf{M}$, respectively. $[\mathbf{M}]_{i,j}$ denotes the $(i,j)$-th element of $\mathbf{M}$. $\mathtt{diag}(x_{1},\cdots,x_{n})$ denotes a diagonal matrix with $x_{1},\cdots,x_{n}$ being its diagonal elements. $\mathbf{M}\succeq 0$ means that matrix $\mathbf{M}$ is positive semidefinite. $\mathbf{I}$ denotes the identity matrix. $\mathbb{R}$ and $\mathbb{C}$ denote the sets of real and complex numbers, respectively. $\mathcal{CN}(0,\sigma^{2})$ denotes a circularly symmetric complex Gaussian (CSCG) distribution whose real and imaginary components each have variance $\sigma^{2}/2$. For a set $\mathcal{A}$, $|\mathcal{A}|$ denotes its cardinality.

II System Model

II-A Preliminaries

Figure 1: Wireless FL system with one PS and $K$ terminal devices.

We consider a wireless FL system, as shown in Fig. 1, where one PS coordinates a set of $K$ terminal devices $\mathcal{K}=\{1,\cdots,K\}$ through wireless channels to cooperatively train a shared ML model (e.g., a DNN), denoted by $\mathbf{w}\in\mathbb{R}^{d}$ of dimension $d$. Each terminal device $k\in\mathcal{K}$ collects its own labeled training data, constituting a local data set $\mathcal{D}_{k}=\{(\mathbf{x}_{k,i},y_{k,i})\}_{i=1}^{D_{k}}$ with $D_{k}=|\mathcal{D}_{k}|$ data samples, where $\mathbf{x}_{k,i}\in\mathbb{R}^{n}$ is the $i$-th input data of dimension $n$ and $y_{k,i}\in\mathbb{R}$ is the labeled output corresponding to $\mathbf{x}_{k,i}$. Then, the total training data set is given as $\mathcal{D}=\bigcup_{k\in\mathcal{K}}\mathcal{D}_{k}$ of size $D=|\mathcal{D}|=\sum_{k=1}^{K}D_{k}$.

Though the PS has no access to the data samples distributed at the terminal devices due to privacy concerns, the goal of training a global model $\mathbf{w}$ can be achieved by solving the following distributed learning problem

\min_{\mathbf{w}}\ F(\mathbf{w})=\sum_{k=1}^{K}q_{k}F_{k}(\mathbf{w}), \qquad (1)

where $q_{k}=D_{k}/D$ is the fraction of data samples at device $k$ and $F_{k}(\mathbf{w})$ is the local loss function defined as

F_{k}(\mathbf{w})=\frac{1}{D_{k}}\sum_{i=1}^{D_{k}}\mathcal{L}(\mathbf{w};(\mathbf{x}_{k,i},y_{k,i})), \qquad (2)

with $\mathcal{L}(\mathbf{w};(\mathbf{x}_{k,i},y_{k,i}))$ being the empirical sample-wise loss function defined by the learning task, which quantifies the loss of the model $\mathbf{w}$ for sample $(\mathbf{x}_{k,i},y_{k,i})$.
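As a concrete illustration of (1) and (2), the following Python sketch evaluates the global loss from per-device datasets; here `loss_fn` is a hypothetical placeholder for the task-specific sample-wise loss $\mathcal{L}$, and `datasets` is an assumed list of per-device sample lists.

```python
import numpy as np

def local_loss(w, data, loss_fn):
    # F_k(w) in (2): average sample-wise loss over device k's local dataset
    return np.mean([loss_fn(w, x, y) for (x, y) in data])

def global_loss(w, datasets, loss_fn):
    # F(w) in (1): weighted sum of local losses with q_k = D_k / D
    D = sum(len(data) for data in datasets)
    return sum((len(data) / D) * local_loss(w, data, loss_fn)
               for data in datasets)
```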

For the implementation of FL algorithms, the system solves the distributed learning problem (1) in an iterative fashion following the widely used broadcasting-computation-aggregation (BCA) principle, which involves the following three steps in each iteration: 1) the PS broadcasts a global model $\mathbf{w}$ to the terminal devices; 2) terminal device $k$ updates the global model $\mathbf{w}$ to a local model $\mathbf{w}_{k}\in\mathbb{R}^{d}$ after several local learning iterations based on its local data samples; 3) the PS aggregates the $\mathbf{w}_{k}$'s to generate a new global model $\mathbf{w}$. In the following, we discuss the considered wireless FL algorithm.

II-B Wireless FL Algorithm

This paper considers the scenario where both the downlink and uplink communications are performed in an analog manner and a quasi-static fading channel model is adopted, i.e., both the downlink and uplink channels remain unchanged during each communication round (corresponding to one FL iteration containing all three BCA steps) and are independent across different communication rounds following the Rayleigh distribution. Specifically, at the $t$-th communication round, the three steps of the wireless FL algorithm are described as follows (a minimal simulation sketch of one round is given after the list):

  • (1)

    Broadcasting: In this step, the PS broadcasts the vector $\mathbf{u}_{\rm dl}(t)=p_{s}(t)\mathbf{w}(t)$ to the terminal devices, which carries the information of the global model $\mathbf{w}(t)$, with the downlink transmit power value $p_{s}(t)$ satisfying

    \|\mathbf{u}_{\rm dl}(t)\|_{2}^{2}=\|p_{s}(t)\mathbf{w}(t)\|_{2}^{2}\leq P_{\rm max}^{\rm dl}(t), \qquad (3)

    where $P_{\rm max}^{\rm dl}(t)$ is the downlink transmit power budget at the $t$-th round. Hence, the received vector at the $k$-th terminal device is given as

    \mathbf{v}_{k}(t)=h_{k}^{\rm dl}(t)\mathbf{u}_{\rm dl}(t)+\mathbf{n}_{k}(t)=h_{k}^{\rm dl}(t)p_{s}(t)\mathbf{w}(t)+\mathbf{n}_{k}(t), \qquad (4)

    where $h_{k}^{\rm dl}(t)\in\mathbb{C}$ is the complex-valued downlink channel coefficient between the PS and terminal device $k$, and $\mathbf{n}_{k}(t)\in\mathbb{C}^{d}$ is the independent and identically distributed (i.i.d.) CSCG noise vector following the distribution $\mathcal{CN}(0,\sigma_{d}^{2}\mathbf{I})$ with $\sigma_{d}^{2}$ being the downlink noise power. With perfect knowledge of $p_{s}(t)$ and the CSI of the downlink channels, terminal device $k$ estimates the global model by descaling the received signal $\mathbf{v}_{k}(t)$, i.e.,

    \bar{\mathbf{w}}_{k}(t)=\frac{a_{k}(t)}{p_{s}(t)h_{k}^{\rm dl}(t)}\mathbf{v}_{k}(t)=\mathbf{w}(t)+\tilde{\mathbf{n}}_{k}(t), \qquad (5)

    where $\bar{\mathbf{w}}_{k}(t)$ is the estimated global model at terminal device $k$ at the $t$-th round, $a_{k}(t)$ is the selection indicator for device $k$ at the $t$-th round ($a_{k}(t)=1$ implies that device $k$ is selected to participate in the training at the $t$-th round, and $a_{k}(t)=0$ otherwise), and $\tilde{\mathbf{n}}_{k}(t)=\frac{a_{k}(t)}{p_{s}(t)h_{k}^{\rm dl}(t)}\mathbf{n}_{k}(t)$ is the equivalent noise at device $k$. Then, terminal device $k$ sets its current local model to $\mathbf{w}_{k}(t)=\bar{\mathbf{w}}_{k}(t)$.

  • (2)

    Local Model Update: In this step, terminal device $k$ updates its local model $\mathbf{w}_{k}(t)$ by minimizing its local objective (2) via a local optimization algorithm, e.g., the mini-batch stochastic gradient descent (SGD) method, based on its collected data set $\mathcal{D}_{k}$. When implementing the mini-batch SGD algorithm, the local dataset $\mathcal{D}_{k}$ at device $k$ is divided into several mini-batches of size $B$, and the device runs $E$ SGD iterations, with each SGD iteration corresponding to one mini-batch. At the $\tau$-th SGD iteration, $\tau\in\{1,\cdots,E\}$, the local model is updated as

    \mathbf{w}_{k}^{\tau}(t)=\mathbf{w}_{k}^{\tau-1}(t)-\eta_{t}\nabla F_{k}(\mathbf{w}_{k}^{\tau-1}(t);\mathcal{B}_{k}^{\tau}(t)), \qquad (6)

    where $\eta_{t}$ is the learning rate at the $t$-th round, $\mathcal{B}_{k}^{\tau}(t)$ is a mini-batch of size $|\mathcal{B}_{k}^{\tau}(t)|=B$ whose data points are independently and uniformly chosen from $\mathcal{D}_{k}$, and $\nabla F_{k}(\mathbf{w}_{k}^{\tau-1}(t);\mathcal{B}_{k}^{\tau}(t))\in\mathbb{R}^{d}$ is the stochastic gradient of the local loss function with respect to (w.r.t.) the $(\tau-1)$-th model parameter $\mathbf{w}_{k}^{\tau-1}(t)$ over the randomly sampled mini-batch $\mathcal{B}_{k}^{\tau}(t)$, i.e.,

    \nabla F_{k}(\mathbf{w}_{k}^{\tau-1}(t);\mathcal{B}_{k}^{\tau}(t))=\frac{1}{B}\sum_{\xi_{k,i}^{\tau}(t)\in\mathcal{B}_{k}^{\tau}(t)}\nabla\mathcal{L}(\mathbf{w}_{k}^{\tau-1}(t);\xi_{k,i}^{\tau}(t)), \qquad (7)

    with $\xi_{k,i}^{\tau}(t)$ being the $i$-th training data sample in mini-batch $\mathcal{B}_{k}^{\tau}(t)$. The initial local model parameter at device $k$ before training is set as $\bar{\mathbf{w}}_{k}(t)$ in (5), i.e., $\mathbf{w}_{k}^{0}(t)=\bar{\mathbf{w}}_{k}(t)$. After $E$ SGD iterations, the local model parameter at device $k$ is updated to $\mathbf{w}_{k}^{E}(t)$.

  • (3)

    Model Aggregation: In this step, the terminal devices upload their updated local model parameters $\mathbf{w}_{k}^{E}(t)$'s to the PS, and the PS aggregates the received local models to generate a new global model. For the uplink transmission of the $\mathbf{w}_{k}^{E}(t)$'s, the terminal devices transmit their local model parameters concurrently through AirComp by exploiting the waveform superposition nature of the wireless MAC channel, since the information of interest at the PS for aggregation is just the weighted summation of the local model parameters. Specifically, the updated model parameter $\mathbf{w}_{k}^{E}(t)$ at device $k$ is multiplied by a pre-processing factor $\beta_{k}(t)$, which is given as

    \beta_{k}(t)=\frac{a_{k}(t)p_{k}(t)(h_{k}^{\rm up}(t))^{H}}{|h_{k}^{\rm up}(t)|^{2}}, \qquad (8)

    where $p_{k}(t)$ is the uplink transmit power value at device $k$, and $h_{k}^{\rm up}(t)$ is the complex uplink channel coefficient from device $k$ to the PS. With perfect CSI of the uplink channels, the received vector $\mathbf{v}(t)$ at the PS is given as

    \mathbf{v}(t)=\sum_{k\in\mathcal{K}}h_{k}^{\rm up}(t)\beta_{k}(t)\mathbf{w}_{k}^{E}(t)+\mathbf{n}(t)=\sum_{k\in\mathcal{K}}a_{k}(t)p_{k}(t)\mathbf{w}_{k}^{E}(t)+\mathbf{n}(t), \qquad (9)

    where $\mathbf{n}(t)\in\mathbb{C}^{d}$ is the i.i.d. CSCG noise vector following the distribution $\mathcal{CN}(0,\sigma_{u}^{2}\mathbf{I})$. In this paper, the uplink transmit power values at the devices satisfy one of the following two types of constraints. For the individual uplink transmit power constraint at each device, the transmit power value of device $k$ is required to satisfy

    \|\beta_{k}(t)\mathbf{w}_{k}^{E}(t)\|_{2}^{2}\leq P_{\rm max}^{k}(t), \qquad (10)

    where $P_{\rm max}^{k}(t)$ is the individual power budget of device $k$ at the $t$-th round; for the sum uplink transmit power constraint over all devices, the transmit power values of the devices are required to satisfy

    \sum_{k=1}^{K}\|\beta_{k}(t)\mathbf{w}_{k}^{E}(t)\|_{2}^{2}\leq P_{\rm tot}(t), \qquad (11)

    where $P_{\rm tot}(t)$ is the sum power budget over all the participating devices at the $t$-th round.

    The PS scales the received signal $\mathbf{v}(t)$ by a factor $1/\zeta(t)$ to aggregate and update the global model parameter as

    \mathbf{w}(t+1)=\frac{\mathbf{v}(t)}{\zeta(t)}=\frac{1}{\zeta(t)}\sum_{k\in\mathcal{K}}a_{k}(t)p_{k}(t)\mathbf{w}_{k}^{E}(t)+\frac{1}{\zeta(t)}\mathbf{n}(t), \qquad (12)

    where $\zeta(t)$ is set as the summation of all the products $a_{k}(t)p_{k}(t)$, i.e., $\zeta(t)=\sum_{k=1}^{K}a_{k}(t)p_{k}(t)$, with perfect knowledge of all $a_{k}(t)$'s and $p_{k}(t)$'s at the PS. Hence, based on (9) and (12), the model aggregation is given as

    \mathbf{w}(t+1)=\sum_{k\in\mathcal{K}}\rho_{k}(t)\mathbf{w}_{k}^{E}(t)+\tilde{\mathbf{n}}(t), \qquad (13)

    where $\rho_{k}(t)=\frac{a_{k}(t)p_{k}(t)}{\sum_{j\in\mathcal{K}}a_{j}(t)p_{j}(t)}$ is the aggregation weight of $\mathbf{w}_{k}^{E}(t)$ satisfying $\sum_{k=1}^{K}\rho_{k}(t)=1$, and $\tilde{\mathbf{n}}(t)=\frac{\mathbf{n}(t)}{\sum_{j\in\mathcal{K}}a_{j}(t)p_{j}(t)}$ is the equivalent noise. Note that the proposed model aggregation scheme requires perfect CSI of both the downlink and uplink channels; nevertheless, our design and analysis can be extended to the case with imperfect CSI, and the impact of imperfect CSI is left for future work.
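For concreteness, the following Python sketch simulates one communication round of the above three steps under the stated Rayleigh fading and CSCG noise model. The hook `local_sgd` is a hypothetical placeholder for the $E$ mini-batch SGD iterations in (6), and taking the real part of the descaled signals is a simplifying assumption for real-valued model parameters; this is an illustrative sketch, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cscg(shape, var):
    # CSCG sample(s) with total variance `var` per entry
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) \
           * np.sqrt(var / 2)

def one_round(w, local_sgd, a, p, p_s, sigma_d2, sigma_u2):
    """One round: broadcast (3)-(5), local update (6), AirComp aggregation (13).

    w: global model, shape (d,); a, p: selection indicators and uplink powers,
    arrays of shape (K,); p_s: downlink transmit power value.
    """
    K, d = len(a), w.shape[0]
    h_dl = cscg(K, 1.0)                        # Rayleigh downlink channels
    updates = np.zeros((K, d))
    for k in range(K):
        if a[k] == 0:
            continue
        v_k = h_dl[k] * p_s * w + cscg(d, sigma_d2)    # received signal (4)
        w_bar_k = np.real(v_k / (p_s * h_dl[k]))       # descaled estimate (5)
        updates[k] = local_sgd(k, w_bar_k)             # E local SGD steps (6)

    # Uplink AirComp: the pre-processing in (8) inverts each uplink channel,
    # so the PS effectively receives sum_k a_k p_k w_k^E plus noise, as in (9)
    v = (a * p) @ updates + cscg(d, sigma_u2)
    zeta = np.sum(a * p)                               # scaling factor in (12)
    return np.real(v / zeta)                           # new global model (13)
```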

Remark 1.

Due to the dynamic nature of the wireless channels, a terminal device may encounter a relatively weak downlink channel, i.e., $|h_{k}^{\rm dl}(t)|\approx 0$, when downloading the global model from the PS. Then, the received model $\bar{\mathbf{w}}_{k}(t)$ at device $k$ will be severely damaged by the equivalent noise $\tilde{\mathbf{n}}_{k}(t)$, as shown in (5). Similarly, a device may also encounter a relatively weak uplink channel, i.e., $|h_{k}^{\rm up}(t)|\approx 0$, when uploading the updated model $\mathbf{w}_{k}^{E}(t)$ to the PS. Then, device $k$ may not be capable of transmitting $\mathbf{w}_{k}^{E}(t)$ due to the limited transmit power budget, according to (8), (10), and (11). Thus, device selection according to the dynamic uplink and downlink channel conditions under the limited transmit power budget is necessary for wireless FL systems.

As mentioned in Remark 1, when practically implementing the wireless FL algorithm, not all devices are able to participate in the training in every round. Hence, when considering both the downlink and uplink communications between the PS and the terminal devices, we need to develop an efficient device selection scheme while guaranteeing the convergence of the considered algorithm.

Remark 2.

In our proposed weighted model aggregation scheme (13), the adaptive weights in each round are determined directly by the uplink transmit power values of the participating terminal devices, which are controllable and can be optimized to mitigate the impact of the wireless environment on the convergence of the considered wireless FL algorithm. This motivates us to design a proper joint device selection and power control scheme.

III Convergence Analysis

In this section, we analyze the convergence behavior of the wireless FL algorithm presented in Section II for general smooth non-convex learning problems. We first present the assumptions and preliminaries, and then provide the theoretical convergence results for the considered wireless FL algorithm in terms of the optimality gap between the expected and optimal global loss values.

III-A Preliminaries

First, we make the following standard assumptions that are commonly adopted in the convergence analysis of the BCA-type FL algorithms [30, 23, 24, 15, 31].

Assumption 1 (L-smooth).

The global loss function is differentiable and its gradient is uniformly Lipschitz continuous with a positive Lipschitz constant $L$, i.e., $\forall\mathbf{v},\mathbf{w}\in\mathbb{R}^{d}$,

\|\nabla F(\mathbf{v})-\nabla F(\mathbf{w})\|_{2}\leq L\|\mathbf{v}-\mathbf{w}\|_{2}, \qquad (14)

which implies

F(\mathbf{v})\leq F(\mathbf{w})+(\mathbf{v}-\mathbf{w})^{T}\nabla F(\mathbf{w})+\frac{L}{2}\|\mathbf{v}-\mathbf{w}\|_{2}^{2}. \qquad (15)
Assumption 2 (Unbiased and bounded variance mini-batch SGD).

The mini-batch SGD is unbiased, i.e.,

\mathbb{E}[\nabla F_{k}(\mathbf{w};\mathcal{B})]=\nabla F_{k}(\mathbf{w}), \qquad (16)

and the variance of stochastic gradients in each device is bounded, i.e.,

\mathbb{E}[\|\nabla F_{k}(\mathbf{w};\mathcal{B})-\nabla F_{k}(\mathbf{w})\|_{2}^{2}]\leq\mu^{2}. \qquad (17)
Assumption 3 (Bounded Gradient Divergence).

The divergence between the local gradient at each device and the global gradient is bounded as

\mathbb{E}\left[\|\nabla F_{k}(\mathbf{w})-\nabla F(\mathbf{w})\|_{2}^{2}\right]\leq\delta, \qquad (18)

where $\delta$ measures the data heterogeneity [32].

Based on the global model broadcasting in (5), the mini-batch local SGD in (6) and (7), and the proposed adaptive reweighing model aggregation scheme in (13), the PS updates the global model at the $(t+1)$-th round as

\begin{aligned}
\mathbf{w}(t+1) &= \sum_{k\in\mathcal{K}}\rho_{k}(t)\mathbf{w}_{k}^{E}(t)+\tilde{\mathbf{n}}(t) \\
&\overset{(a)}{=} \sum_{k\in\mathcal{K}}\rho_{k}(t)\left[\mathbf{w}_{k}^{0}(t)-\eta_{t}\sum_{\tau=1}^{E}\nabla F_{k}(\mathbf{w}_{k}^{\tau-1}(t);\mathcal{B}_{k}^{\tau}(t))\right]+\tilde{\mathbf{n}}(t) \\
&\overset{(b)}{=} \sum_{k\in\mathcal{K}}\rho_{k}(t)\bar{\mathbf{w}}_{k}(t)+\tilde{\mathbf{n}}(t)-\eta_{t}\sum_{k=1}^{K}\rho_{k}(t)\sum_{\tau=1}^{E}\nabla F_{k}(\mathbf{w}_{k}^{\tau-1}(t);\mathcal{B}_{k}^{\tau}(t)) \\
&\overset{(c)}{=} \mathbf{w}(t)+\sum_{k\in\mathcal{K}}\frac{\rho_{k}(t)}{p_{s}(t)h_{k}^{\rm dl}(t)}\mathbf{n}_{k}(t)+\tilde{\mathbf{n}}(t)-\eta_{t}\sum_{k=1}^{K}\rho_{k}(t)\sum_{\tau=1}^{E}\nabla F_{k}(\mathbf{w}_{k}^{\tau-1}(t);\mathcal{B}_{k}^{\tau}(t)),
\end{aligned} \qquad (19)

where $(a)$ follows from the local mini-batch SGD iterations in (6), $(b)$ is due to $\mathbf{w}_{k}^{0}(t)=\bar{\mathbf{w}}_{k}(t)$, and $(c)$ follows from (5) and $\sum_{k\in\mathcal{K}}\rho_{k}(t)=1$.

Define the local model difference $\Delta\mathbf{w}_{k}(t)$ as

\Delta\mathbf{w}_{k}(t)=\mathbf{w}_{k}^{E}(t)-\mathbf{w}_{k}^{0}(t)=-\eta_{t}\sum_{\tau=1}^{E}\nabla F_{k}(\mathbf{w}_{k}^{\tau-1}(t);\mathcal{B}_{k}^{\tau}(t)), \qquad (20)

and define the virtual noisy global model $\tilde{\mathbf{w}}(t)$ as

\tilde{\mathbf{w}}(t)=\mathbf{w}(t)+\sum_{k\in\mathcal{K}}\frac{\rho_{k}(t)}{p_{s}(t)h_{k}^{\rm dl}(t)}\mathbf{n}_{k}(t). \qquad (21)

Thus, we can further rewrite $\mathbf{w}(t+1)$ as

\mathbf{w}(t+1)=\tilde{\mathbf{w}}(t)+\sum_{k=1}^{K}\rho_{k}(t)\Delta\mathbf{w}_{k}(t)+\tilde{\mathbf{n}}(t). \qquad (22)

III-B Theoretical Results on Convergence

We first present the following lemmas, which are used in the proof of Theorem 1.

Lemma 1.

The virtual noisy global model $\tilde{\mathbf{w}}(t)$ defined in (21) is unbiased, i.e.,

\mathbb{E}\left[\tilde{\mathbf{w}}(t)\right]=\mathbf{w}(t), \qquad (23)

and its variance is given by

\mathbb{E}\left[\|\tilde{\mathbf{w}}(t)-\mathbf{w}(t)\|_{2}^{2}\right]=\frac{d\sigma_{d}^{2}}{p_{s}^{2}(t)}\sum_{k\in\mathcal{K}}\frac{\rho_{k}^{2}(t)}{|h_{k}^{\rm dl}(t)|^{2}}. \qquad (24)
Proof:

According to (21), we have $\mathbb{E}[\tilde{\mathbf{w}}(t)-\mathbf{w}(t)]=\sum_{k\in\mathcal{K}}\frac{\rho_{k}(t)}{p_{s}(t)h_{k}^{\rm dl}(t)}\mathbb{E}[\mathbf{n}_{k}(t)]=0$, since the $\mathbf{n}_{k}(t)$'s are i.i.d. following the distribution $\mathcal{CN}(\mathbf{0},\sigma_{d}^{2}\mathbf{I})$, which proves (23). Similarly, we have

\begin{aligned}
\mathbb{E}\left[\|\tilde{\mathbf{w}}(t)-\mathbf{w}(t)\|_{2}^{2}\right] &= \mathbb{E}\left[\left\|\sum_{k\in\mathcal{K}}\frac{\rho_{k}(t)}{p_{s}(t)h_{k}^{\rm dl}(t)}\mathbf{n}_{k}(t)\right\|_{2}^{2}\right] \\
&= \sum_{k\in\mathcal{K}}\left|\frac{\rho_{k}(t)}{p_{s}(t)h_{k}^{\rm dl}(t)}\right|^{2}\mathbb{E}[\|\mathbf{n}_{k}(t)\|_{2}^{2}] \\
&= \frac{d\sigma_{d}^{2}}{p_{s}^{2}(t)}\sum_{k\in\mathcal{K}}\frac{\rho_{k}^{2}(t)}{|h_{k}^{\rm dl}(t)|^{2}},
\end{aligned} \qquad (25)

which proves (24). ∎

Lemma 2.

The expectation of the squared norm of the model difference at each round for device $k$ is bounded by

\begin{aligned}
\mathbb{E}\left[\|\Delta\mathbf{w}_{k}(t)\|_{2}^{2}\right] \leq\ & \eta_{t}^{2}E^{2}\left(\mu^{2}+4\delta\right)+2E\eta_{t}^{2}L^{2}\sum_{\tau=1}^{E}\mathbb{E}\left[\|\mathbf{w}_{k}^{\tau-1}(t)-\mathbf{w}(t)\|_{2}^{2}\right] \\
& +4\eta_{t}^{2}E^{2}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_{2}^{2}\right].
\end{aligned} \qquad (26)
Lemma 3.

The sum of the expected squared norms of the differences between the locally updated model at each SGD iteration and the previous global model is bounded by

\sum_{\tau=1}^{E}\mathbb{E}\left[\|\mathbf{w}_{k}^{\tau-1}(t)-\mathbf{w}(t)\|_{2}^{2}\right] \leq \frac{\frac{2d\sigma_{d}^{2}E}{p_{s}^{2}(t)}\sum_{k\in\mathcal{K}}\frac{\rho_{k}^{2}(t)}{|h_{k}^{\rm dl}(t)|^{2}}+2\eta_{t}^{2}E^{3}\left(\mu^{2}+4\delta\right)+8\eta_{t}^{2}E^{3}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_{2}^{2}\right]}{1-4\eta_{t}^{2}E^{2}L^{2}}. \qquad (27)
Lemma 4.

The squared norm of the gradient of the global loss function at each round is bounded by

\|\nabla F(\mathbf{w}(t))\|_{2}^{2}\leq 2L\left(F(\mathbf{w}(t))-F^{*}\right), \qquad (28)

where $F^{*}$ is the optimal global loss value.

The proofs of Lemmas 2-4 follow the same ideas as those in [15], [31], and [33], with slight modifications for the problem considered here. Now, we present the main convergence result in the following theorem.

Theorem 1.

Let Assumptions 1-3 hold, and assume the FL algorithm terminates after $T$ rounds. Given an initial global model $\mathbf{w}(1)$, the expected optimality gap between the expected and optimal global loss values, $\mathbb{E}[F(\mathbf{w}(T+1))]-F^{*}$, is bounded by

\mathbb{E}[F(\mathbf{w}(T+1))]-F^{*} \leq \prod_{t=1}^{T}A(t)\,\mathbb{E}[F(\mathbf{w}(1))-F^{*}]+\sum_{t=1}^{T-1}\left(\prod_{i=t+1}^{T}A(i)\right)G(t)+G(T), \qquad (29)

with

A(t)=1+\frac{\eta_{t}EL(20\eta_{t}^{2}E^{2}L^{2}+16\eta_{t}EL-1)}{1-4\eta_{t}^{2}E^{2}L^{2}}, \qquad (30)

and

\begin{aligned}
G(t)=\ & \underbrace{\left(\frac{2\eta_{t}^{2}E^{2}L(1+\eta_{t}EL)}{1-4\eta_{t}^{2}E^{2}L^{2}}\right)\left(\mu^{2}+4\delta\right)}_{(a)} + \underbrace{\eta_{t}\delta E\left(\sum_{k=1}^{K}\frac{1}{q_{k}}+1\right)}_{(b)} \\
& + \underbrace{\left(\frac{d\sigma_{d}^{2}L(1+2\eta_{t}E+4\eta_{t}^{2}E^{2}L^{2})}{(1-4\eta_{t}^{2}E^{2}L^{2})p_{s}^{2}(t)}\right)\sum_{k\in\mathcal{K}}\frac{\rho_{k}^{2}(t)}{|h_{k}^{\rm dl}(t)|^{2}}}_{(c)} + \underbrace{\frac{2d\sigma_{u}^{2}L}{\left(\sum_{j\in\mathcal{K}}a_{j}(t)p_{j}(t)\right)^{2}}}_{(d)}.
\end{aligned}
Proof:

Please see Appendix A. ∎

Remark 3.

The expected optimality gap between the expected and optimal global loss values given on the right hand side (RHS) of (29) reveals several important insights:

  1. When the learning rate is set small enough, i.e., $\eta_{t}\leq\frac{1}{20EL}$, we have $A(t)<1$, which implies $\lim_{T\rightarrow\infty}\prod_{t=1}^{T}A(t)=0$. In this case, the proposed wireless FL algorithm converges to a biased solution, and the expected optimality gap $\mathbb{E}[F(\mathbf{w}(T+1))]-F^{*}$ is upper bounded only by the last two terms on the RHS of (29), which form a linear combination of the per-round performance gaps $G(t)$ given in Theorem 1 (see the numerical sketch after this remark).

  2. The performance gap $G(t)$ in each round consists of four terms $(a)$-$(d)$ with clear physical meanings: term $(a)$ is caused by the gradient variance and data heterogeneity, term $(b)$ is caused solely by the data heterogeneity, and terms $(c)$ and $(d)$ are caused by the downlink and uplink wireless communications, respectively.

  3. If the FL algorithm is performed in an ideal environment, where the broadcasting and model aggregation are not degraded by the wireless channels (including the channel fading and additive noise), terms $(c)$ and $(d)$ equal zero. In this case, the proposed FL algorithm still converges to a biased solution due to the impact of the gradient variance and data heterogeneity, which coincides with the analysis of the inconsistency issue of FL in [34].

IV Optimality Gap Minimization

In this section, we present the joint device selection and power control scheme to minimize the optimality gap between the expected and optimal global loss values derived in Theorem 1 for the considered wireless FL system, under the individual and sum power constraints, respectively.

IV-A Problem Formulation

As mentioned in Remark 3, by properly choosing the learning rate $\eta_{t}\leq\frac{1}{20EL}$ to guarantee the convergence of the proposed wireless FL algorithm, the optimality gap between the expected and optimal global loss values derived in (29) becomes

\Lambda=\sum_{t=1}^{T-1}\left(\prod_{i=t+1}^{T}A(i)\right)G(t)+G(T), \qquad (31)

which determines the performance of the algorithm upon convergence and thus needs to be minimized. Specifically, with instantaneous CSI in each round, we minimize the performance gap $G(t)$ over the device selection indicators $a_{k}(t)$, the downlink transmit power value $p_{s}(t)$, and the uplink transmit power values $p_{k}(t)$, round by round. Besides, we ignore the constant terms $(a)$ and $(b)$ in $G(t)$, which are related to the gradient variance and data heterogeneity, and focus on minimizing the sum of terms $(c)$ and $(d)$ in $G(t)$, since the goal of this paper is to mitigate the impact of the downlink and uplink wireless communications on the convergence of the considered wireless FL algorithm. Hence, by dropping the index $t$ for simplicity, we formulate the performance gap minimization problem as

\begin{aligned}
\underset{\{\mathbf{a},p_{s},\mathbf{p}\}}{\text{min}}\ \ & \left(\frac{d\sigma_{d}^{2}L(1+2\eta E+4\eta^{2}E^{2}L^{2})}{(1-4\eta^{2}E^{2}L^{2})p_{s}^{2}}\right)\sum_{k\in\mathcal{K}}\frac{\rho_{k}^{2}}{|h_{k}^{\rm dl}|^{2}}+\frac{2d\sigma_{u}^{2}L}{\left(\sum_{j\in\mathcal{K}}a_{j}p_{j}\right)^{2}} & (32a) \\
\text{s.t.}\ \ & a_{k}\in\{0,1\}, & (32b) \\
& p_{s}^{2}\leq\bar{P}_{\rm dl}, & (32c) \\
& \mathbf{p}\in\mathcal{P}, & (32d)
\end{aligned}

where $\mathbf{a}=[a_{1},\cdots,a_{K}]$, $\mathbf{p}=[p_{1},\cdots,p_{K}]$, (32b) denotes the feasible set of the selection indicators, and (32c) is the downlink transmit power constraint with $\bar{P}_{\rm dl}=P_{\rm max}^{\rm dl}/\|\mathbf{w}\|_{2}^{2}$. For the individual power constraint, $\mathcal{P}$ in (32d) is given as

\mathcal{P}=\left\{\mathbf{p}:\mathbf{a}^{T}\mathbf{p}^{T}\mathbf{Q}_{k}\mathbf{p}\mathbf{a}-P_{\rm max}^{k}\leq 0,\ k=1,\cdots,K\right\}, \qquad (33)

which is the matrix-vector form of (10), with $\mathbf{Q}_{k}\in\mathbb{R}^{K\times K}$, $k=1,\cdots,K$, being a diagonal matrix with $[\mathbf{Q}_{k}]_{k,k}=\frac{\|\mathbf{w}_{k}^{E}\|_{2}^{2}}{|h_{k}^{\rm up}|^{2}}$ and all other entries being zero; for the sum power constraint, $\mathcal{P}$ in (32d) is given as

\mathcal{P}=\left\{\mathbf{p}:\mathbf{a}^{T}\mathbf{p}^{T}\mathbf{Q}_{0}\mathbf{p}\mathbf{a}-P_{\rm tot}\leq 0\right\}, \qquad (34)

which is the matrix-vector form of (11), with $\mathbf{Q}_{0}=\mathtt{diag}\left(\frac{\|\mathbf{w}_{1}^{E}\|_{2}^{2}}{|h_{1}^{\rm up}|^{2}},\cdots,\frac{\|\mathbf{w}_{K}^{E}\|_{2}^{2}}{|h_{K}^{\rm up}|^{2}}\right)\in\mathbb{R}^{K\times K}$.

Remark 4.

Intuitively, directly minimizing $\Lambda$ in (31) over multiple rounds with non-causally known CSI could result in better performance than minimizing $G(t)$ round by round. However, the model parameters $\mathbf{w}(t)$ and $\mathbf{w}_{k}^{E}(t)$, which determine the downlink and uplink transmit power constraints (32c) and (32d), are not non-causally available at the PS and device $k$, which prevents us from directly applying joint optimization over multiple rounds. Though some existing literature introduces the additional assumption that $\mathbb{E}\|\mathbf{w}(t)\|_{2}^{2}$ or $\mathbb{E}\|\mathbf{w}_{k}^{E}(t)\|_{2}^{2}$ is upper bounded by some constant [16, 35], which avoids the requirement of non-causal model information, such an assumption is not satisfied in many practical situations, especially when the model dimension is large, and is thus beyond the scope of this paper. Hence, this paper chooses to minimize $G(t)$ round by round with only causal CSI, since $\Lambda$ is a linear combination of the $G(t)$'s, as discussed above.

IV-B Optimal Solution

In this section, we propose a joint device selection and power control algorithm to solve problem (32). The formulated optimization problem (32) is a typical MIP problem, where the difficulty lies in the binary selection variables $\mathbf{a}$ and the non-convex objective function (32a).

However, we can eliminate the device selection $\mathbf{a}$ in favor of the uplink transmit power values $\mathbf{p}$, based on the observation that the selection indicator $\mathbf{a}$ is always coupled with the uplink transmit power values $\mathbf{p}$ and independent of the downlink transmit power value $p_{s}$, as shown in (32). Hence, a device is selected only when its uplink transmit power value $p_{k}$ is positive. Thus, we can drop the selection indicator $\mathbf{a}$ in (32a) and (32d) and its associated constraint (32b). Besides, to minimize the objective function (32a), the constraint (32c) should be met with equality, i.e., $p_{s}^{2}=\bar{P}_{\rm dl}$, since (32a) monotonically decreases with $p_{s}^{2}$. Hence, problem (32) can be equivalently transformed into

\begin{aligned}
\underset{\mathbf{p}}{\text{min}}\ \ & \left(\frac{d\sigma_{d}^{2}L(1+2\eta E+4\eta^{2}E^{2}L^{2})}{(1-4\eta^{2}E^{2}L^{2})\bar{P}_{\rm dl}}\right)\sum_{k\in\mathcal{K}}\frac{\rho_{k}^{2}}{|h_{k}^{\rm dl}|^{2}}+\frac{2d\sigma_{u}^{2}L}{\left(\sum_{j\in\mathcal{K}}p_{j}\right)^{2}} & (35a) \\
\text{s.t.}\ \ & \mathbf{p}\in\mathcal{P}_{0}. & (35b)
\end{aligned}

For the individual uplink power constraint case, $\mathcal{P}_{0}$ is given as

\mathcal{P}_{0}=\left\{\mathbf{p}:\mathbf{p}^{T}\mathbf{Q}_{k}\mathbf{p}-P_{\rm max}^{k}\leq 0,\ k=1,\cdots,K\right\}, \qquad (36)

and for the sum uplink power constraint case, $\mathcal{P}_{0}$ is given as

\mathcal{P}_{0}=\left\{\mathbf{p}:\mathbf{p}^{T}\mathbf{Q}_{0}\mathbf{p}-P_{\rm tot}\leq 0\right\}. \qquad (37)

By rearranging the objective function (35a) in the matrix-vector form, problem (35) is equivalent to

\begin{aligned}
\underset{\mathbf{p}}{\text{min}}\ \ & \frac{\mathbf{p}^{T}\mathbf{\Theta}\mathbf{p}+2d\sigma_{u}^{2}L}{\mathbf{p}^{T}\mathbf{1}\mathbf{1}^{T}\mathbf{p}} & (38) \\
\text{s.t.}\ \ & \text{(35b)},
\end{aligned}

where $\mathbf{\Theta}=\mathtt{diag}(\mathbf{\theta})\in\mathbb{R}^{K\times K}$ with $\mathbf{\theta}=[\theta_{1},\cdots,\theta_{K}]\in\mathbb{R}^{K}$ and $\theta_{k}=\frac{d\sigma_{d}^{2}L(1+2\eta E+4\eta^{2}E^{2}L^{2})}{(1-4\eta^{2}E^{2}L^{2})\bar{P}_{\rm dl}|h_{k}^{\rm dl}|^{2}}$. Now, problem (38) is a typical QCRQ minimization problem. It has been proved in [36] that, with the constraint (35b) for both the individual and sum uplink power constraint cases (36) and (37), the QCRQ problem (38) can be equivalently rewritten in the following homogeneous form by introducing a real auxiliary variable $s$ with $\mathbf{y}=s\mathbf{p}$, i.e.,

\begin{aligned}
\underset{\{\mathbf{y},s\}}{\text{min}}\ \ & \mathbf{y}^{T}\mathbf{\Theta}\mathbf{y}+2d\sigma_{u}^{2}Ls^{2} & (39a) \\
\text{s.t.}\ \ & \mathbf{y}^{T}\mathbf{1}\mathbf{1}^{T}\mathbf{y}=1, & (39b) \\
& \mathbf{y}\in\mathcal{Y}, & (39c)
\end{aligned}

where for the individual uplink power constraint case, $\mathcal{Y}$ in (39c) is given as

\mathcal{Y}=\left\{\mathbf{y}:\mathbf{y}^{T}\mathbf{Q}_{k}\mathbf{y}-P_{\rm max}^{k}s^{2}\leq 0,\ k=1,\cdots,K\right\}, \qquad (40)

and for the sum uplink power constraint case, $\mathcal{Y}$ in (39c) is given as

\mathcal{Y}=\left\{\mathbf{y}:\mathbf{y}^{T}\mathbf{Q}_{0}\mathbf{y}-P_{\rm tot}s^{2}\leq 0\right\}. \qquad (41)

Though problem (39) is still non-convex, it can be approximated by its SDR [37]. First, we make the change of variables $\mathbf{z}=(\mathbf{y}^{T},s)^{T}$ and rewrite (39) as

\begin{aligned}
\underset{\mathbf{z}}{\text{min}}\ \ & \mathbf{z}^{T}\tilde{\mathbf{\Theta}}\mathbf{z} & (42a) \\
\text{s.t.}\ \ & \mathbf{z}^{T}\mathbf{C}\mathbf{z}=1, & (42b) \\
& \mathbf{z}\in\mathcal{Z}, & (42c)
\end{aligned}

where $\tilde{\mathbf{\Theta}}$ and $\mathbf{C}$ are given as

\tilde{\mathbf{\Theta}}=\begin{pmatrix}\mathbf{\Theta} & \mathbf{0} \\ \mathbf{0}^{T} & 2d\sigma_{u}^{2}L\end{pmatrix},\quad \mathbf{C}=\begin{pmatrix}\mathbf{1}\mathbf{1}^{T} & \mathbf{0} \\ \mathbf{0}^{T} & 0\end{pmatrix}. \qquad (43)

For the individual uplink power constraint case, $\mathcal{Z}$ in (42c) is given as

\mathcal{Z}=\left\{\mathbf{z}:\mathbf{z}^{T}\tilde{\mathbf{Q}}_{k}\mathbf{z}\leq 0,\ k=1,\cdots,K\right\}, \qquad (44)

where $\tilde{\mathbf{Q}}_{k}$ is given as

\tilde{\mathbf{Q}}_{k}=\begin{pmatrix}\mathbf{Q}_{k} & \mathbf{0} \\ \mathbf{0}^{T} & -P_{\rm max}^{k}\end{pmatrix}, \qquad (45)

and for the sum uplink power constraint case, $\mathcal{Z}$ in (42c) is given as

\mathcal{Z}=\left\{\mathbf{z}:\mathbf{z}^{T}\tilde{\mathbf{Q}}_{0}\mathbf{z}\leq 0\right\}, \qquad (46)

where $\tilde{\mathbf{Q}}_{0}$ is given as

\tilde{\mathbf{Q}}_{0}=\begin{pmatrix}\mathbf{Q}_{0} & \mathbf{0} \\ \mathbf{0}^{T} & -P_{\rm tot}\end{pmatrix}. \qquad (47)

Next, by introducing a new variable $\mathbf{Z}=\mathbf{z}\mathbf{z}^{T}$, which is positive semidefinite (PSD) and rank-one, problem (42) is equivalent to

\begin{aligned}
\underset{\mathbf{Z}}{\text{min}}\ \ & \mathtt{tr}(\tilde{\mathbf{\Theta}}\mathbf{Z}) & (48a) \\
\text{s.t.}\ \ & \mathtt{tr}(\mathbf{C}\mathbf{Z})=1, & (48b) \\
& \mathbf{Z}\in\mathcal{Z}_{0}, & (48c) \\
& \mathtt{rank}(\mathbf{Z})=1, & (48d) \\
& \mathbf{Z}\succeq 0, & (48e)
\end{aligned}

where for the individual uplink transmit power constraint case, we obtain

\mathcal{Z}_{0}=\left\{\mathbf{Z}:\mathtt{tr}(\tilde{\mathbf{Q}}_{k}\mathbf{Z})\leq 0,\ k=1,\cdots,K\right\}; \qquad (49)

for the sum uplink transmit power constraint case, we obtain

\mathcal{Z}_{0}=\left\{\mathbf{Z}:\mathtt{tr}(\tilde{\mathbf{Q}}_{0}\mathbf{Z})\leq 0\right\}. \qquad (50)

Now, problem (48) is still non-convex due to the non-convex rank-one constraint (48d). However, we can drop the rank-one constraint (48d) and consider the relaxed version of problem (48):

\begin{aligned}
\underset{\mathbf{Z}}{\text{min}}\ \ & \mathtt{tr}(\tilde{\mathbf{\Theta}}\mathbf{Z}) & (51a) \\
\text{s.t.}\ \ & \text{(48b), (48c), (48e)}, & (51b)
\end{aligned}

which is convex and can be efficiently solved by using CVX [38]; we denote the optimal solution to problem (51) by $\mathbf{Z}^{*}$. In the following Theorem 2, we show how to directly derive the optimal solution to the original problem (35) from $\mathbf{Z}^{*}$ for both the individual and sum uplink transmit power constraint cases.

Theorem 2.

Given the optimal solution $\mathbf{Z}^{*}$ to problem (51) for both the individual and sum uplink transmit power constraint cases, we construct its $K$-th order leading principal submatrix $\mathbf{Z}_{K}^{*}$ by deleting the $(K+1)$-st row and column of $\mathbf{Z}^{*}$, which is rank-one and can be decomposed as $\mathbf{Z}_{K}^{*}=\mathbf{b}\mathbf{b}^{T}$. Then, the optimal solution to the original problem (35) is given as $\mathbf{p}^{*}=\mathbf{b}/\sqrt{[\mathbf{Z}^{*}]_{K+1,K+1}}$, where $[\mathbf{Z}^{*}]_{K+1,K+1}$ is the $(K+1)$-th diagonal element of $\mathbf{Z}^{*}$.

Proof:

The rank-one property of $\mathbf{Z}_{K}^{*}$ can be established by utilizing the strong duality between problem (51) and its dual problem, together with the special structure of $\tilde{\mathbf{\Theta}}$, $\mathbf{C}$, $\tilde{\mathbf{Q}}_{k}$, and $\tilde{\mathbf{Q}}_{0}$, similar to the proof of Theorem 1 in [39]. Then, given the rank-one decomposition of $\mathbf{Z}_{K}^{*}$ as $\mathbf{Z}_{K}^{*}=\mathbf{b}\mathbf{b}^{T}$, it is easy to verify that there always exists a rank-one decomposition of $\mathbf{Z}^{*}$ as $\mathbf{Z}^{*}=\tilde{\mathbf{b}}\tilde{\mathbf{b}}^{T}$ with $\tilde{\mathbf{b}}=\left[\mathbf{b}^{T},\sqrt{[\mathbf{Z}^{*}]_{K+1,K+1}}\right]^{T}$. Finally, according to Theorem 3.2 in [36], the optimal solution to the original problem (35) is obtained as $\mathbf{p}^{*}=\mathbf{b}/\sqrt{[\mathbf{Z}^{*}]_{K+1,K+1}}$. ∎
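The pipeline from (38) to Theorem 2 can be sketched in Python with cvxpy for the individual power constraint case; this is a minimal illustration under stated assumptions, not the authors' exact implementation. The inputs `theta` ($\theta_{k}$ in (38)), `c_ul` ($2d\sigma_{u}^{2}L$), `w_norm2` ($\|\mathbf{w}_{k}^{E}\|_{2}^{2}$), `h_up`, and `P_max` are assumed given, and an eigendecomposition realizes the rank-one decomposition $\mathbf{Z}_{K}^{*}=\mathbf{b}\mathbf{b}^{T}$ guaranteed by Theorem 2.

```python
import numpy as np
import cvxpy as cp

def solve_power_control(theta, c_ul, w_norm2, h_up, P_max):
    K = len(theta)
    # Theta~ and C in (43)
    Theta_t = np.diag(np.append(theta, c_ul))
    C = np.zeros((K + 1, K + 1))
    C[:K, :K] = 1.0

    Z = cp.Variable((K + 1, K + 1), PSD=True)      # relaxed problem (51)
    cons = [cp.trace(C @ Z) == 1]
    for k in range(K):                             # tr(Q~_k Z) <= 0, see (49)
        Qk = np.zeros((K + 1, K + 1))
        Qk[k, k] = w_norm2[k] / np.abs(h_up[k]) ** 2
        Qk[K, K] = -P_max[k]
        cons.append(cp.trace(Qk @ Z) <= 0)
    cp.Problem(cp.Minimize(cp.trace(Theta_t @ Z)), cons).solve()

    Z_star = Z.value
    lam, V = np.linalg.eigh(Z_star[:K, :K])        # Z_K* is rank-one (Theorem 2)
    b = np.abs(np.sqrt(max(lam[-1], 0.0)) * V[:, -1])   # powers are nonnegative
    p = b / np.sqrt(Z_star[K, K])                  # p* = b / sqrt([Z*]_{K+1,K+1})
    return p                                       # select device k iff p[k] > 0
```

For the sum power constraint case, the $K$ per-device trace constraints would simply be replaced by the single constraint $\mathtt{tr}(\tilde{\mathbf{Q}}_{0}\mathbf{Z})\leq 0$ built from (47).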

Algorithm 1 Proposed Wireless FL Algorithm
1: Input: initial global model parameter $\mathbf{w}(1)$; batch size $B$; number of local SGD iterations $E$; learning rate $\eta_{t}$; downlink transmit power budget $P_{\rm max}^{\rm dl}(t)$; individual uplink transmit power budget $P_{\rm max}^{k}(t)$; sum uplink transmit power budget $P_{\rm tot}(t)$; communication round budget $T$;
2: for $t=1:T$ do
3:     Obtain the downlink channel gains $\{h_{k}^{\rm dl}(t)\}_{k=1}^{K}$ from the PS to the terminal devices;
4:     The PS broadcasts $\mathbf{w}(t)$ to the terminal devices;
5:     for $k=1:K$ do
6:         Terminal device $k$ sets its current model as $\bar{\mathbf{w}}_{k}(t)$ based on (5);
7:         Terminal device $k$ updates $\bar{\mathbf{w}}_{k}(t)$ to $\mathbf{w}_{k}^{E}(t)$ via the mini-batch SGD in (6);
8:     end for
9:     Obtain the uplink channel gains $\{h_{k}^{\rm up}(t)\}_{k=1}^{K}$ from the terminal devices to the PS;
10:     Solve problem (38) via the SDR as in Section IV to obtain the optimal $\mathbf{p}^{*}(t)=[p_{1}^{*}(t),\cdots,p_{K}^{*}(t)]$;
11:     for $k=1:K$ do
12:         if $p_{k}^{*}(t)>0$ then
13:             Set $a_{k}(t)=1$ and select device $k$ to upload its updated model $\mathbf{w}_{k}^{E}(t)$ with the pre-processing factor in (8);
14:         else
15:             Set $a_{k}(t)=0$, and device $k$ keeps silent;
16:         end if
17:     end for
18:     The PS aggregates the received models and updates the global model to $\mathbf{w}(t+1)$ based on (13);
19: end for

Finally, the proposed wireless FL algorithm is summarized in Algorithm 1. During each communication round, our proposed device selection and power control scheme only requires solving one semidefinite programming (SDP) problem in (51). Since the number of constraints is less than the problem size $K+1$, the computational complexity of solving problem (51) is $\mathcal{O}((K+1)^{6})$ [37, 40], which is polynomial w.r.t. the number of devices.

V Numerical Results

This section evaluates the performance of our proposed wireless FL algorithm for image classification on the well-known MNIST [41] and CIFAR-10 [42] datasets, where the MNIST dataset consists of 10 classes of black-and-white handwritten digits ranging from "0" to "9" with 60000 training and 10000 test samples, and the CIFAR-10 dataset consists of 10 classes of colored objects with 50000 training and 10000 test samples.

Here, both the uplink and downlink channels follow i.i.d. Rayleigh fading, i.e., $h_{k}^{\rm dl}(t)$ and $h_{k}^{\rm up}(t)$ are modeled as i.i.d. CSCG random variables with zero mean and unit variance. The downlink and uplink noise variances are set as $\sigma_{d}=\sigma_{u}=0.1$. Given the model dimension $d$, the downlink transmit power budget is set as $P_{\rm max}^{\rm dl}(t)=10\times d$ and the individual uplink transmit power budget is set as $P_{\rm max}^{k}(t)=5\times d$, $\forall k,t$. We consider $K=20$ local devices, and set $E=5$, $B=100$, and $L=10$ [28]. The learning rates for both tasks are set as $\eta_{t}=\frac{1}{20EL}$ with a decaying rate of 0.995 every 30 communication rounds, which satisfies the condition in Remark 3. We consider balanced and unbalanced data size settings: for the balanced setting, we set $D_{k}=800$, $\forall k$; for the unbalanced setting, $D_{k}$ is uniformly distributed in the range $[500,1000]$.

For the proposed scheme, we only present the results with the individual uplink transmit power constraint; the results for the sum uplink transmit power constraint are omitted for conciseness, since they are the same as those of the individual constraint case. For performance comparison, we consider three baselines: a) the ideal FedAvg scheme [6], where the broadcasting and model aggregation are not degraded by the wireless channels (including the channel fading and additive noise), with full device participation and fixed weight $q_{k}$ in (1) for device $k$; b) the device selection with MSE threshold scheme, where maximum downlink power transmission and the MSE-threshold-based device selection in [19] are adopted; c) the truncated channel inversion scheme, where maximum downlink power transmission and the truncated channel inversion for uplink transmission and model aggregation in [28] and [43] are adopted.

V-A MNIST dataset

For this task, we aim to train a convolutional neural network (CNN) as the classifier model, which consists of three $3\times 3$ convolution layers (with 8, 16, and 32 channels, respectively), each followed by a $2\times 2$ max pooling layer with stride 2; a fully connected layer with 10 units; and finally a softmax output layer. All convolutional layers are also followed by batch normalization layers and ReLU activations. We consider both the i.i.d. and non-i.i.d. cases for this task. For the i.i.d. case, we randomly assign training samples to each device $k$; for the non-i.i.d. case, we split the training samples into 5 disjoint subsets with each subset containing 2 classes of images, and each subset is used to randomly assign training samples to $K/5$ devices. Hence, each device only contains two classes of images, and devices drawing from different subsets contain different classes of images (a partition sketch is given below).
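A minimal sketch of this label-based non-i.i.d. partition, under the stated setup, is given below; the function and variable names are illustrative, and `labels` is assumed to be the array of MNIST training labels.

```python
import numpy as np

def non_iid_partition(labels, K, num_groups=5, classes_per_group=2, D_k=800, seed=0):
    # Split the 10 classes into `num_groups` disjoint 2-class subsets, then
    # draw D_k samples from one subset for each of the K/num_groups devices in it
    rng = np.random.default_rng(seed)
    groups = rng.permutation(10).reshape(num_groups, classes_per_group)
    per_group = K // num_groups
    device_idx = []
    for g in range(num_groups):
        pool = np.flatnonzero(np.isin(labels, groups[g]))
        for _ in range(per_group):
            device_idx.append(rng.choice(pool, size=D_k, replace=False))
    return device_idx   # device_idx[k]: sample indices held by device k
```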

Figure 2: Test accuracy vs. the communication round with balanced data size: (a) i.i.d. case; (b) non-i.i.d. case.

Fig. 2 compares the performance of all four schemes w.r.t. the communication round in terms of the test accuracy with balanced data size on the MNIST dataset. For the i.i.d. case in Fig. 2(a), it is observed that the test accuracies of the ideal FedAvg scheme, the proposed scheme, and the device selection with MSE threshold scheme increase with the communication round and finally converge. The proposed scheme achieves test accuracy comparable to the ideal FedAvg scheme: the average accuracy over the last 20 communication rounds is $93.09\%$ for the ideal FedAvg scheme and $92.75\%$ for the proposed scheme. Besides, the proposed scheme outperforms the device selection with MSE threshold scheme, whose average accuracy over the last 20 communication rounds is $89.14\%$. All the above three schemes outperform the truncated channel inversion scheme, whose average accuracy over the last 20 communication rounds is $83.34\%$. For the truncated channel inversion scheme, the test accuracy first increases with the communication round during the first 125 rounds, then becomes unstable during the remaining communication rounds, and convergence is not guaranteed. A possible explanation for this phenomenon is that the device selection of the truncated channel inversion scheme is based only on the uplink channel conditions, and the equal-weight aggregation in this scheme gives more weight to the local models damaged by the downlink communications, which degrades the training performance.

For the non-i.i.d. case in Fig. 2(b), all the schemes suffer from performance degradation compared to the i.i.d. case in Fig. 2(a), except for the ideal FedAvg scheme, whose average accuracy over the last 20 communication rounds is 92.32%. The average accuracy over the last 20 communication rounds for the proposed scheme is 90.87%, which is slightly worse than that of the ideal FedAvg scheme, but still higher than that of the device selection with MSE threshold scheme, whose average accuracy over the last 20 communication rounds is 77.86%. The truncated channel inversion scheme still performs the worst, and its convergence is still not guaranteed. Besides, the increasing trend of the proposed scheme becomes less stable compared to the i.i.d. case in Fig. 2(a), which is caused by the non-i.i.d. data distributions.

Figure 3: Test accuracy vs. the communication round with unbalanced data size. (a) i.i.d. case; (b) non-i.i.d. case.

Fig. 3 compares the performance of the four schemes w.r.t. the communication round in terms of the test accuracy with unbalanced data size on the MNIST dataset. For the i.i.d. case in Fig. 3(a), similar to the balanced data size scenario in Fig. 2(a), the test accuracies of the ideal FedAvg scheme, the proposed scheme, and the device selection with MSE threshold scheme increase with the communication round and finally converge. The average accuracy over the last 20 communication rounds of the proposed scheme is 91.73%, which is slightly lower than that of the ideal FedAvg scheme (93.11%), but still higher than those of the device selection with MSE threshold and truncated channel inversion schemes, whose average accuracies over the last 20 communication rounds are 89.74% and 86.77%, respectively.

For the non-i.i.d. case in Fig. 3(b), all four schemes suffer from performance degradation compared to the i.i.d. case in Fig. 3(a), where the average accuracies over the last 20 communication rounds for the ideal FedAvg, proposed, device selection with MSE threshold, and truncated channel inversion schemes are 90.3%, 88.97%, 85.34%, and 83.91%, respectively. The performance of the proposed scheme is still only slightly worse than that of the ideal FedAvg scheme. Note that although the test accuracy curves of the truncated channel inversion scheme are still unstable for both the i.i.d. and non-i.i.d. cases, they now show a convergence trend, which slightly differs from the balanced data size case in Fig. 2. This is possibly because the equal-weight aggregation of the truncated channel inversion scheme partially limits the contribution of the local models damaged by the downlink communications, since the ideal aggregation weights are inherently unequal in this unbalanced data size case. Nevertheless, our proposed scheme still outperforms the truncated channel inversion scheme with unbalanced data size.

V-B CIFAR-10 dataset

For this more challenging task, we aim to train a more complex CNN model, which consists of three $5\times 5$ convolution layers (with 32, 32, and 64 channels, respectively), each followed by a $3\times 3$ max pooling layer with stride 2; two fully connected layers (the first with 64 units and a ReLU activation, and the second with 10 units); and a final softmax output layer. All convolution layers are also followed by batch normalization layers and ReLU activations. We consider i.i.d. and non-i.i.d. cases similar to those in Section V-A, where we assign training samples in the same way.
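A minimal sketch of the non-i.i.d. partition shared by the MNIST and CIFAR-10 experiments is given below; the pairing of the 10 classes into 5 two-class subsets and all function names are illustrative assumptions.

import numpy as np

def non_iid_partition(labels, K, seed=0):
    """Split sample indices into 5 disjoint two-class subsets and randomly
    assign the samples of each subset to K/5 devices, so that every device
    holds exactly two classes (Section V-A)."""
    assert K % 5 == 0, "K must be a multiple of 5"
    rng = np.random.default_rng(seed)
    device_data = {}
    for s in range(5):  # subset s holds classes 2s and 2s+1 (one possible pairing)
        idx = np.flatnonzero(np.isin(labels, [2 * s, 2 * s + 1]))
        rng.shuffle(idx)
        for j, part in enumerate(np.array_split(idx, K // 5)):
            device_data[s * (K // 5) + j] = part  # sample indices of one device
    return device_data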

Figure 4: Test accuracy vs. the communication round with balanced data size. (a) i.i.d. case; (b) non-i.i.d. case.

Fig. 4 compares the performance of all four schemes w.r.t. the communication round in terms of the test accuracy with balanced data size on the CIFAR-10 dataset. For the i.i.d. case in Fig. 4(a), it is observed that the test accuracies of the ideal FedAvg scheme, the proposed scheme, and the device selection with MSE threshold scheme increase with the communication round and finally converge, while the truncated channel inversion scheme fails to converge. The proposed scheme still achieves test accuracy comparable to the ideal FedAvg scheme: the average accuracy over the last 20 communication rounds is 72.52% for the ideal FedAvg scheme and 71.52% for the proposed scheme. It also outperforms the device selection with MSE threshold scheme, whose average accuracy over the last 20 communication rounds is 66.47%. For the non-i.i.d. case in Fig. 4(b), the convergence of the ideal FedAvg scheme, the proposed scheme, and the device selection with MSE threshold scheme is still guaranteed. However, all the above three schemes suffer from performance degradation compared to the i.i.d. case in Fig. 4(a), and their average accuracies over the last 20 communication rounds drop to 66.94%, 65.87%, and 62.97%, respectively. In this case, the truncated channel inversion scheme still fails to converge.

Figure 5: Test accuracy vs. the communication round with unbalanced data size. (a) i.i.d. case; (b) non-i.i.d. case.

Fig. 5 compares the performance of the four schemes w.r.t. the communication round in terms of the test accuracy with unbalanced data size on the CIFAR-10 dataset. For the i.i.d. case in Fig. 5(a), similar to the balanced data size scenario in Fig. 4(a), the test accuracies of the ideal FedAvg scheme, the proposed scheme, and the device selection with MSE threshold scheme still increase with the communication round and finally converge, while the truncated channel inversion scheme still fails to converge. The performance of the proposed scheme is still slightly worse than that of the ideal FedAvg scheme: the average accuracies over the last 20 communication rounds for the ideal FedAvg and proposed schemes are 72.39% and 70.14%, respectively. The proposed scheme still outperforms the device selection with MSE threshold scheme, whose average accuracy over the last 20 communication rounds is 65.05%. For the non-i.i.d. case in Fig. 5(b), the truncated channel inversion scheme still fails to converge, while the convergence of the remaining three schemes is still guaranteed. Similar to the balanced data size case in Fig. 4, these three schemes suffer from performance degradation compared to the i.i.d. case in Fig. 5(a), where the average accuracies over the last 20 communication rounds for the ideal FedAvg, proposed, and device selection with MSE threshold schemes drop to 66.84%, 65.02%, and 60.52%, respectively. The performance of the proposed scheme is thus still slightly worse than that of the ideal FedAvg scheme, and better than that of the device selection with MSE threshold scheme. Intuitively, the truncated channel inversion scheme fails to converge on the CIFAR-10 dataset in Figs. 4 and 5 because its selection criterion is based only on the uplink channel conditions and can hardly mitigate the impact of both the downlink and uplink communications on the convergence of FL for a more complex task.

VI Conclusion

In this paper, we studied the joint device selection and power control for the wireless FL system, considering both the analog downlink and uplink communications between the PS and the terminal devices. We first proposed an AirComp-based adaptive reweighing scheme for model aggregation, where the weights are determined by the selected devices and their uplink transmit power values. Then, we analyzed the convergence behavior of the proposed algorithm in terms of the expected optimality gap between the expected and optimal global loss values, and revealed how the downlink and uplink wireless communications affect the convergence of the considered FL algorithm. Based on the obtained theoretical results, we formulated an optimality gap minimization problem under both the individual and sum uplink transmit power constraints, where we minimized the optimality gap round by round with instantaneous CSI and derived the optimal joint device selection and power control by using the SDR technique for both constraint cases. Finally, numerical results showed that the proposed scheme performs very close to the ideal FedAvg scheme, and significantly outperforms the other baseline schemes, i.e., the device selection with MSE threshold and conventional truncated channel inversion AirComp schemes, for both i.i.d./non-i.i.d. data distributions and balanced/unbalanced data sizes across the devices.

Appendix A Proof of Theorem 1

First, according to (15), we have

\begin{align}
F(\mathbf{w}(t+1)) \le{} & F(\mathbf{w}(t)) + \left\langle \mathbf{w}(t+1)-\mathbf{w}(t), \nabla F(\mathbf{w}(t)) \right\rangle + \frac{L}{2}\left\|\mathbf{w}(t+1)-\mathbf{w}(t)\right\|_2^2 \nonumber \\
\overset{(a)}{=}{} & F(\mathbf{w}(t)) + \frac{L}{2}\left\|\tilde{\mathbf{w}}(t)-\mathbf{w}(t)+\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t)\right\|_2^2 \nonumber \\
& + \left\langle \tilde{\mathbf{w}}(t)-\mathbf{w}(t)+\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t), \nabla F(\mathbf{w}(t)) \right\rangle, \nonumber
\end{align}

where $(a)$ follows from (22). By taking expectation on both sides of the above inequality, we obtain

\begin{align}
\mathbb{E}\left[F(\mathbf{w}(t+1))\right] \le{} & \mathbb{E}\left[F(\mathbf{w}(t))\right] + \underbrace{\mathbb{E}\left[\left\langle \tilde{\mathbf{w}}(t)-\mathbf{w}(t)+\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t), \nabla F(\mathbf{w}(t))\right\rangle\right]}_{A_1} \nonumber \\
& + \frac{L}{2}\underbrace{\mathbb{E}\left[\left\|\tilde{\mathbf{w}}(t)-\mathbf{w}(t)+\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t)\right\|_2^2\right]}_{A_2}. \tag{53}
\end{align}

Next, we bound the terms $A_1$ and $A_2$ on the RHS of (53).

First, we study the term $A_2$:

\begin{align}
A_2 ={} & \mathbb{E}\left[\left\|\tilde{\mathbf{w}}(t)-\mathbf{w}(t)+\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t)\right\|_2^2\right] \nonumber \\
\overset{(a)}{\le}{} & 2\,\mathbb{E}\left[\left\|\tilde{\mathbf{w}}(t)-\mathbf{w}(t)\right\|_2^2\right] + 2\,\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t)\right\|_2^2\right] \nonumber \\
\overset{(b)}{=}{} & \frac{2d\sigma_d^2}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + 2\,\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t)\right\|_2^2\right] \nonumber \\
\overset{(c)}{\le}{} & \frac{2d\sigma_d^2}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + 4\,\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)\right\|_2^2\right] + \frac{4d\sigma_u^2}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} \nonumber \\
\overset{(d)}{\le}{} & 4\sum_{k=1}^{K}\rho_k(t)\,\mathbb{E}\left[\|\Delta\mathbf{w}_k(t)\|_2^2\right] + \frac{4d\sigma_u^2}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} + \frac{2d\sigma_d^2}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2}, \tag{54}
\end{align}

where $(a)$ follows from the inequality $\|\mathbf{x}_1+\mathbf{x}_2\|_2^2\le 2\|\mathbf{x}_1\|_2^2+2\|\mathbf{x}_2\|_2^2$, $(b)$ follows from (24) in Lemma 1, $(c)$ follows from the same inequality again and uses the fact that $\tilde{\mathbf{n}}(t)=\frac{\mathbf{n}(t)}{\sum_{j\in\mathcal{K}}a_j(t)p_j(t)}$ with $\mathbf{n}(t)$ being the CSCG noise vector, and $(d)$ follows from Jensen's inequality.

Combining (54) with Lemmas 2 and 3, we finally bound the term $A_2$ as

\begin{align}
A_2 \le{} & 4\sum_{k=1}^{K}\rho_k(t)\,\mathbb{E}\left[\|\Delta\mathbf{w}_k(t)\|_2^2\right] + \frac{4d\sigma_u^2}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} + \frac{2d\sigma_d^2}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} \nonumber \\
\le{} & 4\eta_t^2E^2\left(\mu^2+4\delta\right) + 16\eta_t^2E^2\,\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] + \frac{2d\sigma_d^2}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + \frac{4d\sigma_u^2}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} \nonumber \\
& + 8E\eta_t^2L^2\sum_{k=1}^{K}\rho_k(t)\sum_{\tau=1}^{E}\mathbb{E}\left[\|\mathbf{w}_k^{\tau-1}(t)-\mathbf{w}(t)\|_2^2\right] \nonumber \\
\le{} & 4\eta_t^2E^2\left(\mu^2+4\delta\right) + 16\eta_t^2E^2\,\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] + \frac{2d\sigma_d^2}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + \frac{4d\sigma_u^2}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} \nonumber \\
& + 8E\eta_t^2L^2\sum_{k=1}^{K}\rho_k(t)\left(\frac{\frac{2d\sigma_d^2E}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2}}{1-4\eta_t^2E^2L^2} + \frac{2\eta_t^2E^3\left(\mu^2+4\delta\right)+8\eta_t^2E^3\,\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right]}{1-4\eta_t^2E^2L^2}\right) \nonumber \\
={} & \frac{4\eta_t^2E^2}{1-4\eta_t^2E^2L^2}\left(\mu^2+4\delta\right) + \frac{4d\sigma_u^2}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} + \frac{16\eta_t^2E^2}{1-4\eta_t^2E^2L^2}\,\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] \nonumber \\
& + \frac{2d\sigma_d^2(1+4\eta_t^2E^2L^2)}{(1-4\eta_t^2E^2L^2)\,p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2}. \tag{55}
\end{align}

Then, we study the term $A_1$:

\begin{align}
A_1 ={} & \mathbb{E}\left[\left\langle \tilde{\mathbf{w}}(t)-\mathbf{w}(t)+\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t), \nabla F(\mathbf{w}(t))\right\rangle\right] \nonumber \\
={} & \mathbb{E}\left[\left\langle \sum_{k\in\mathcal{K}}\frac{\rho_k(t)}{p_s(t)h_k^{\rm dl}(t)}\mathbf{n}_k(t)+\sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t)+\tilde{\mathbf{n}}(t), \nabla F(\mathbf{w}(t))\right\rangle\right] \nonumber \\
\overset{(a)}{=}{} & \mathbb{E}\left[\left\langle \sum_{k=1}^{K}\rho_k(t)\Delta\mathbf{w}_k(t), \nabla F(\mathbf{w}(t))\right\rangle\right] \nonumber \\
={} & -\eta_t\sum_{\tau=1}^{E}\mathbb{E}\left[\left\langle \sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}_k^{\tau-1}(t);\mathcal{B}_k^{\tau}(t)), \nabla F(\mathbf{w}(t))\right\rangle\right] \nonumber \\
\overset{(b)}{=}{} & -\eta_t\sum_{\tau=1}^{E}\mathbb{E}\left[\left\langle \sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}_k^{\tau-1}(t)), \nabla F(\mathbf{w}(t))\right\rangle\right] \nonumber \\
\overset{(c)}{=}{} & -\frac{\eta_t}{2}\sum_{\tau=1}^{E}\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}_k^{\tau-1}(t))\right\|_2^2\right] - \frac{\eta_t}{2}\sum_{\tau=1}^{E}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] \nonumber \\
& + \frac{\eta_t}{2}\sum_{\tau=1}^{E}\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}_k^{\tau-1}(t))-\nabla F(\mathbf{w}(t))\right\|_2^2\right] \nonumber \\
\le{} & -\frac{\eta_tE}{2}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] + \frac{\eta_t}{2}\sum_{\tau=1}^{E}\mathbb{E}\left[\left\|\nabla F(\mathbf{w}(t))-\sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}_k^{\tau-1}(t))\right\|_2^2\right] \nonumber \\
\overset{(d)}{\le}{} & -\frac{\eta_tE}{2}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] + \underbrace{\eta_t\sum_{\tau=1}^{E}\mathbb{E}\left[\left\|\nabla F(\mathbf{w}(t))-\sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}(t))\right\|_2^2\right]}_{B_1} \nonumber \\
& + \underbrace{\eta_t\sum_{\tau=1}^{E}\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_k(t)\left(\nabla F_k(\mathbf{w}(t))-\nabla F_k(\mathbf{w}_k^{\tau-1}(t))\right)\right\|_2^2\right]}_{B_2}, \tag{56}
\end{align}

where $(a)$ is due to the fact that both $\sum_{k\in\mathcal{K}}\frac{\rho_k(t)}{p_s(t)h_k^{\rm dl}(t)}\mathbf{n}_k(t)$ and $\tilde{\mathbf{n}}(t)$ are zero-mean vectors independent of $\nabla F(\mathbf{w}(t))$, $(b)$ follows from Assumption 2, $(c)$ follows from the identity $\langle\mathbf{x}_1,\mathbf{x}_2\rangle=\frac{1}{2}\left(\|\mathbf{x}_1\|_2^2+\|\mathbf{x}_2\|_2^2-\|\mathbf{x}_1-\mathbf{x}_2\|_2^2\right)$, and $(d)$ follows from the inequality $\|\mathbf{x}_1+\mathbf{x}_2\|_2^2\le 2\|\mathbf{x}_1\|_2^2+2\|\mathbf{x}_2\|_2^2$.

For (56), we further need to bound the terms $B_1$ and $B_2$. First, we obtain

\begin{align}
B_1 ={} & \eta_t\sum_{\tau=1}^{E}\mathbb{E}\left[\left\|\nabla F(\mathbf{w}(t))-\sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}(t))\right\|_2^2\right] \nonumber \\
={} & \eta_tE\,\mathbb{E}\left[\left\|\sum_{k=1}^{K}q_k\nabla F_k(\mathbf{w}(t))-\sum_{k=1}^{K}\rho_k(t)\nabla F_k(\mathbf{w}(t))\right\|_2^2\right] \nonumber \\
\overset{(a)}{=}{} & \eta_tE\,\mathbb{E}\left[\left\|\sum_{k=1}^{K}(q_k-\rho_k(t))\nabla F_k(\mathbf{w}(t))-\sum_{k=1}^{K}(q_k-\rho_k(t))\nabla F(\mathbf{w}(t))\right\|_2^2\right] \nonumber \\
={} & \eta_tE\,\mathbb{E}\left[\left\|\sum_{k=1}^{K}\frac{q_k-\rho_k(t)}{\sqrt{q_k}}\sqrt{q_k}\left(\nabla F_k(\mathbf{w}(t))-\nabla F(\mathbf{w}(t))\right)\right\|_2^2\right] \nonumber \\
\overset{(b)}{\le}{} & \eta_tE\left(\sum_{k=1}^{K}\frac{(\rho_k(t)-q_k)^2}{q_k}\right)\sum_{k=1}^{K}q_k\,\mathbb{E}\left[\|\nabla F_k(\mathbf{w}(t))-\nabla F(\mathbf{w}(t))\|_2^2\right] \nonumber \\
\overset{(c)}{\le}{} & \eta_tE\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right)\sum_{k=1}^{K}q_k\delta = \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right), \tag{57}
\end{align}

where $(a)$ is due to $\sum_{k=1}^{K}(q_k-\rho_k(t))=0,\ \forall t$, $(b)$ follows from the Cauchy–Schwarz inequality, and $(c)$ follows from Assumption 3 and $\sum_{k=1}^{K}q_k=\sum_{k=1}^{K}\rho_k(t)=1$.
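Step $(c)$ can be made explicit as follows: since $\sum_{k=1}^{K}\rho_k(t)=\sum_{k=1}^{K}q_k=1$ and $0\le\rho_k(t)\le 1$,

\begin{align*}
\sum_{k=1}^{K}\frac{(\rho_k(t)-q_k)^2}{q_k} = \sum_{k=1}^{K}\frac{\rho_k^2(t)}{q_k} - 2\sum_{k=1}^{K}\rho_k(t) + \sum_{k=1}^{K}q_k = \sum_{k=1}^{K}\frac{\rho_k^2(t)}{q_k} - 1 \le \sum_{k=1}^{K}\frac{1}{q_k} + 1,
\end{align*}

while Assumption 3 bounds each $\mathbb{E}\left[\|\nabla F_k(\mathbf{w}(t))-\nabla F(\mathbf{w}(t))\|_2^2\right]$ by $\delta$, so the remaining factor is $\sum_{k=1}^{K}q_k\delta=\delta$.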

Then, we bound the term $B_2$ as

\begin{align}
B_2 ={} & \eta_t\sum_{\tau=1}^{E}\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_k(t)\left(\nabla F_k(\mathbf{w}(t))-\nabla F_k(\mathbf{w}_k^{\tau-1}(t))\right)\right\|_2^2\right] \nonumber \\
\overset{(a)}{\le}{} & \eta_t\sum_{k=1}^{K}\rho_k(t)\sum_{\tau=1}^{E}\mathbb{E}\left[\|\nabla F_k(\mathbf{w}(t))-\nabla F_k(\mathbf{w}_k^{\tau-1}(t))\|_2^2\right] \nonumber \\
\overset{(b)}{\le}{} & \eta_tL^2\sum_{k=1}^{K}\rho_k(t)\sum_{\tau=1}^{E}\mathbb{E}\left[\|\mathbf{w}_k^{\tau-1}(t)-\mathbf{w}(t)\|_2^2\right], \tag{58}
\end{align}

where $(a)$ follows from Jensen's inequality and $(b)$ follows from (14) in Assumption 1.

Combining (56)–(58), we obtain

\begin{align}
A_1 \le{} & -\frac{\eta_tE}{2}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] + \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right) + \eta_tL^2\sum_{k=1}^{K}\rho_k(t)\sum_{\tau=1}^{E}\mathbb{E}\left[\|\mathbf{w}_k^{\tau-1}(t)-\mathbf{w}(t)\|_2^2\right] \nonumber \\
\overset{(a)}{\le}{} & -\frac{\eta_tE}{2}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] + \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right) \nonumber \\
& + \eta_tL^2\sum_{k=1}^{K}\rho_k(t)\left(\frac{\frac{2d\sigma_d^2E}{p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2}}{1-4\eta_t^2E^2L^2} + \frac{2\eta_t^2E^3\left(\mu^2+4\delta\right)+8\eta_t^2E^3\,\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right]}{1-4\eta_t^2E^2L^2}\right) \nonumber \\
={} & \frac{2\eta_t^3E^3L^2}{1-4\eta_t^2E^2L^2}\left(\mu^2+4\delta\right) + \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right) + \frac{2d\sigma_d^2\eta_tEL^2}{(1-4\eta_t^2E^2L^2)\,p_s^2(t)}\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} \nonumber \\
& + \frac{20\eta_t^3E^3L^2-\eta_tE}{2-8\eta_t^2E^2L^2}\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right], \tag{59}
\end{align}

where $(a)$ follows from Lemma 3.

Finally, combining (53), (55), and (59), we obtain

\begin{align}
\mathbb{E}\left[F(\mathbf{w}(t+1))\right] \le{} & \mathbb{E}\left[F(\mathbf{w}(t))\right] + \left(\frac{\eta_tE(20\eta_t^2E^2L^2+16\eta_tEL-1)}{2-8\eta_t^2E^2L^2}\right)\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] \nonumber \\
& + \left(\frac{2\eta_t^2E^2L(1+\eta_tEL)}{1-4\eta_t^2E^2L^2}\right)\left(\mu^2+4\delta\right) + \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right) \nonumber \\
& + \left(\frac{d\sigma_d^2L(1+2\eta_tE+4\eta_t^2E^2L^2)}{(1-4\eta_t^2E^2L^2)\,p_s^2(t)}\right)\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + \frac{2d\sigma_u^2L}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2}. \tag{60}
\end{align}

By subtracting $F^*$ from both sides of (60), we have

\begin{align}
& \mathbb{E}\left[F(\mathbf{w}(t+1))-F^*\right] \nonumber \\
\le{} & \mathbb{E}\left[F(\mathbf{w}(t))-F^*\right] + \left(\frac{\eta_tE(20\eta_t^2E^2L^2+16\eta_tEL-1)}{2-8\eta_t^2E^2L^2}\right)\mathbb{E}\left[\|\nabla F(\mathbf{w}(t))\|_2^2\right] \nonumber \\
& + \left(\frac{2\eta_t^2E^2L(1+\eta_tEL)}{1-4\eta_t^2E^2L^2}\right)\left(\mu^2+4\delta\right) + \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right) \nonumber \\
& + \left(\frac{d\sigma_d^2L(1+2\eta_tE+4\eta_t^2E^2L^2)}{(1-4\eta_t^2E^2L^2)\,p_s^2(t)}\right)\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + \frac{2d\sigma_u^2L}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} \nonumber \\
\overset{(a)}{\le}{} & \left(1+\frac{\eta_tEL(20\eta_t^2E^2L^2+16\eta_tEL-1)}{1-4\eta_t^2E^2L^2}\right)\mathbb{E}\left[F(\mathbf{w}(t))-F^*\right] \nonumber \\
& + \left(\frac{2\eta_t^2E^2L(1+\eta_tEL)}{1-4\eta_t^2E^2L^2}\right)\left(\mu^2+4\delta\right) + \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right) \nonumber \\
& + \left(\frac{d\sigma_d^2L(1+2\eta_tE+4\eta_t^2E^2L^2)}{(1-4\eta_t^2E^2L^2)\,p_s^2(t)}\right)\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + \frac{2d\sigma_u^2L}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2} \nonumber \\
={} & A(t)\,\mathbb{E}\left[F(\mathbf{w}(t))-F^*\right] + G(t), \tag{61}
\end{align}

with

A(t)=1+\frac{\eta_tEL(20\eta_t^2E^2L^2+16\eta_tEL-1)}{1-4\eta_t^2E^2L^2}, \tag{62}

and

\begin{align}
G(t) ={} & \left(\frac{2\eta_t^2E^2L(1+\eta_tEL)}{1-4\eta_t^2E^2L^2}\right)\left(\mu^2+4\delta\right) + \eta_t\delta E\left(\sum_{k=1}^{K}\frac{1}{q_k}+1\right) \nonumber \\
& + \left(\frac{d\sigma_d^2L(1+2\eta_tE+4\eta_t^2E^2L^2)}{(1-4\eta_t^2E^2L^2)\,p_s^2(t)}\right)\sum_{k\in\mathcal{K}}\frac{\rho_k^2(t)}{|h_k^{\rm dl}(t)|^2} + \frac{2d\sigma_u^2L}{\left(\sum_{j\in\mathcal{K}}a_j(t)p_j(t)\right)^2}, \tag{63}
\end{align}

where $(a)$ follows from Lemma 4. Assuming that the FL algorithm terminates after $T$ rounds, given an initial global model $\mathbf{w}_1$, we carry out the recursion as

\begin{align}
& \mathbb{E}[F(\mathbf{w}(T+1))]-F^* \nonumber \\
\le{} & A(T)\,\mathbb{E}\left[F(\mathbf{w}(T))-F^*\right] + G(T) \nonumber \\
\le{} & A(T)\left(A(T-1)\,\mathbb{E}\left[F(\mathbf{w}(T-1))-F^*\right] + G(T-1)\right) + G(T) \nonumber \\
\le{} & \cdots \nonumber \\
\le{} & \prod_{t=1}^{T}A(t)\,\mathbb{E}[F(\mathbf{w}_1)-F^*] + \sum_{t=1}^{T-1}\left(\prod_{i=t+1}^{T}A(i)\right)G(t) + G(T). \tag{64}
\end{align}

This completes the proof.
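As a quick numerical illustration of the recursion in (64), the following Python sketch computes $A(t)$ from (62) and unrolls the bound for given per-round terms $G(t)$; the numerical values are arbitrary illustrative choices, not the simulation settings of Section V.

import numpy as np

def A_coeff(eta, E, L):
    """Per-round contraction coefficient A(t) in (62), with x = eta * E * L."""
    x = eta * E * L
    return 1 + x * (20 * x**2 + 16 * x - 1) / (1 - 4 * x**2)

def unroll_bound(A, G, initial_gap):
    """Unroll (64): prod_t A(t) * E[F(w_1) - F*] plus the weighted sum of G(t)."""
    bound = np.prod(A) * initial_gap
    for t in range(len(A)):
        bound += np.prod(A[t + 1:]) * G[t]  # empty product equals 1 when t = T-1
    return bound

T = 50
A = np.full(T, A_coeff(eta=1e-2, E=5, L=1.0))  # eta*E*L = 0.05 gives A(t) < 1
G = np.full(T, 1e-3)                           # per-round gap terms G(t)
print(A[0], unroll_bound(A, G, initial_gap=1.0))

In particular, whenever $20\eta_t^2E^2L^2+16\eta_tEL<1$ (and $4\eta_t^2E^2L^2<1$), we have $A(t)<1$, so the first term in (64) decays geometrically with $T$.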

References

  • [1] K. B. Letaief, W. Chen, Y. Shi, J. Zhang and Y. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84-90, Aug. 2019.
  • [2] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
  • [3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
  • [4] C.-X. Jiang, H.-J. Zhang, Y. Ren, Z. Han, K.-C. Chen, and L. Hanzo, “Machine learning paradigms for next-generation wireless networks,” IEEE Wireless Commun., vol. 24, no. 2, pp. 98-105, Apr. 2017.
  • [5] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Commun. Surveys Tuts., vol. 21, no. 3, pp. 2224-2287, 3rd Quart., 2019.
  • [6] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Int. Stat. (AISTATS), 2017, pp. 1273–1282.
  • [7] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2016. [Online]. Available: https://arxiv.org/pdf/1610.05492.pdf
  • [8] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Commun. Surveys Tuts., vol. 22, no. 3, pp. 2031–2063, 2nd Quart., 2020.
  • [9] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-IID data,” 2018. [Online]. Available: https://arxiv.org/pdf/1806.00582.pdf
  • [10] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” 2018. [Online]. Available: https://arxiv.org/pdf/1812.06127.pdf
  • [11] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, Jan. 2021.
  • [12] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, Feb. 2021.
  • [13] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of update aware device scheduling for federated learning at the wireless edge,” IEEE Trans. Wireless Commun., vol. 20, no. 6, pp. 3643–3658, Jun. 2021.
  • [14] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device scheduling and resource allocation for latency constrained wireless federated learning,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 453–467, Jan. 2021.
  • [15] Y. Wang, Y. Xu, Q. Shi, and T.-H. Chang, “Quantized federated learning under transmission delay and outage constraints,” 2021. [Online]. Available: https://arxiv.org/pdf/2106.09397.pdf
  • [16] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
  • [17] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
  • [18] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Info. Theory, vol. 53, no. 10, pp. 3498–3516, Oct. 2007.
  • [19] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [20] Y. Sun, S. Zhou, Z. Niu, and D. Gündüz, “Dynamic scheduling for over-the-air federated edge learning with energy constraints,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 227–242, Jan. 2022.
  • [21] G. Zhu, Y. Du, D. Gündüz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 2120–2135, Mar. 2021.
  • [22] N. Zhang and M. Tao, “Gradient statistics aware power control for over-the-air federated learning,” IEEE Trans. Wireless Commun., vol. 20, no. 8, pp. 5115-5128, Aug. 2021.
  • [23] X. Cao, G. Zhu, J. Xu, and S. Cui, “Optimized power control for over-the-air federated edge learning,” in IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–6.
  • [24] X. Fan, Y. Wang, Y. Huo, and Z. Tian, “Joint optimization of communications and federated learning over the air,” 2021. [Online]. Available: https://arxiv.org/pdf/2104.03490.pdf
  • [25] S. Caldas, J. Konečný, H. B. McMahan, and A. Talwalkar, “Expanding the reach of federated learning by reducing client resource requirements,” 2018. [Online]. Available: https://arxiv.org/pdf/1812.07210.pdf
  • [26] X. Wei and C. Shen, “Federated learning over noisy channels: Convergence analysis and design examples,” 2021. [Online]. Available: https://arxiv.org/pdf/2101.02198.pdf
  • [27] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Federated learning with quantized global model updates,” 2020. [Online]. Available: https://arxiv.org/pdf/2006.10672.pdf
  • [28] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of federated learning over a noisy downlink,” IEEE Trans. Wireless Commun., early access. doi: 10.1109/TWC.2021.3103874.
  • [29] J.-H. Ahn, O. Simeone, and J. Kang, “Cooperative learning via federated distillation over fading channels,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020, pp. 8856–8860.
  • [30] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in Int. Conf. Learn. Represent. (ICLR), 2020.
  • [31] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 5336–5346.
  • [32] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “FedPD: A federated learning framework with optimal rates and adaptivity to non-IID data,” 2020. [Online]. Available: https://arxiv.org/pdf/2005.11418.pdf
  • [33] Y. J. Cho, J. Wang, and G. Joshi, “Client selection in federated learning: Convergence analysis and power-of-choice selection strategies,” 2020. [Online]. Available: https://arxiv.org/pdf/2010.01243.pdf
  • [34] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” 2020. [Online]. Available: https://arxiv.org/pdf/2007.07481.pdf
  • [35] X. Cao, G. Zhu, J. Xu, and S. Cui, “Transmission power control for over-the-air federated averaging at network edge,” 2021. [Online]. Available: https://arxiv.org/pdf/2111.05719.pdf
  • [36] A. Beck and M. Teboulle, “On minimizing quadratically constrained ratio of two quadratic functions,” J. Convex Anal., vol. 17, no. 3, pp. 789–804, Mar. 2010.
  • [37] Z.-Q. Luo, W.-K. Ma, A. M.-C. So, Y. Ye, and S. Zhang, “Semidefinite relaxation of quadratic optimization problems,” IEEE Signal Process. Mag., vol. 27, no. 3, pp. 20–34, May 2010.
  • [38] M. Grant and S. Boyd. (2016). CVX: Matlab software for disciplined convex programming. [Online]. Available: http://cvxr.com/cvx
  • [39] F. Jiang, J. Chen, and A. L. Swindlehurst, “Optimal power allocation for parameter tracking in a distributed amplify-and-forward sensor network,” IEEE Trans. Signal Process., vol. 62, no. 9, pp. 2200–2211, Mar. 2014.
  • [40] C. Helmberg, F. Rendl, R. Vanderbei, and H. Wolkowicz, “An interior-point method for semidefinite programming,” SIAM J. Optim., vol. 6, no. 2, pp. 342–361, 1996.
  • [41] Y. LeCun, C. Cortes, and C. Burges, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [42] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” M.S. thesis, Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009. [Online]. Available: http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
  • [43] S. Xia, J. Zhu, Y. Yang, Y. Zhou, Y. Shi and W. Chen, “Fast convergence algorithm for analog federated learning,” in IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–6.