
Joint Optimization of Communications and Federated Learning Over the Air

Xin Fan1, Yue Wang2, Yan Huo1, and Zhi Tian2
1School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing, China
2Department of Electrical & Computer Engineering, George Mason University, Fairfax, VA, USA
E-mails: {yhuo,fanxin}@bjtu.edu.cn, {ywang56,ztian1}@gmu.edu
Abstract

Federated learning (FL) is an attractive paradigm for making use of rich distributed data while protecting data privacy. Nonetheless, nonideal communication links and limited transmission resources may hinder the implementation of fast and accurate FL. In this paper, we study joint optimization of communications and FL based on analog aggregation transmission in realistic wireless networks. We first derive closed-form expressions for the expected convergence rate of FL over the air, which theoretically quantify the impact of analog aggregation on FL. Based on the analytical results, we develop a joint optimization model for accurate FL implementation, which allows a parameter server to select a subset of workers and determine an appropriate power scaling factor. Since the practical setting of FL over the air encounters unobservable parameters, we reformulate the joint optimization of worker selection and power allocation using controlled approximation. Finally, we efficiently solve the resulting mixed-integer programming problem via a simple yet optimal finite-set search method by reducing the search space. Simulation results show that the proposed solutions developed for realistic wireless analog channels outperform a benchmark method, and achieve performance comparable to the ideal case where FL is implemented over noise-free wireless channels.

Index Terms:
Federated learning, analog aggregation, convergence analysis, joint optimization, worker scheduling, power scaling.

I Introduction

In recent years, with the development of IoT and social networks, huge amounts of data have been generated at the edge of networks [1]. To obtain useful information from big data, machine learning has been widely applied to deal with complex models and tasks in emerging data-driven applications, such as autonomous driving, virtual and augmented reality [2]. Standard machine learning is usually developed under a centralized architecture, where each node located at the edge sends its collected data to a central node for centralized data processing. However, with the exponential growth of the volume of local data and the increasing concerns on user data privacy, it is neither practical nor safe to directly transmit the data of local devices to a central node due to the limited communication and processing capability as well as the lack of privacy protection. As such, distributed machine learning is well motivated to overcome these issues.

In the regime of distributed machine learning, federated learning (FL) has been proposed as a well-noted approach for collaborative learning [3]. In FL, local workers train local models from their own data, and then transmit their local updates to a parameter server (PS). The PS aggregates these received local updates and sends the averaged update back to the local workers. These iterative updates between the PS and workers can be either model parameters or their gradients, for model averaging [4] and gradient averaging [5], respectively. In this way, FL relieves communication overhead and protects user privacy compared to traditional data-sharing based collaborative learning, especially when the local data is large in volume and privacy-sensitive. Existing research on FL mostly focuses on FL algorithms under idealized link assumptions. However, the impact of wireless environments on FL performance should be taken into account in the design of FL deployed in practical wireless systems. Otherwise, such impacts may introduce unwanted training errors that dramatically degrade the learning performance in terms of accuracy and convergence rate [6].

To solve this problem, research efforts have been spent on optimizing network resources used for transmitting model updates in FL[7, 8]. These works of FL over wireless networks adopt digital communications, using a transmission-then-aggregation policy. Unfortunately, the communication overhead and transmission latency become large as the number of active workers increases. On the other hand, it is worth noting that FL aims for global aggregation and hence only utilizes the averaged updates of distributed workers rather than the individual local updates from workers. Alternatively, the nature of waveform superposition in wireless multiple access channel (MAC) [9, 10, 11, 12] provides a direct and efficient way for transmission of the averaged updates in FL, also known as analog aggregation based FL[13, 14, 15, 16, 17, 18]. As a joint transmission-and-aggregation policy, analog aggregation transmission enables all the participating workers to simultaneously upload their local model updates to the PS over the same time-frequency resources as long as the aggregated waveform represents the averaged updates, thus substantially reducing the overhead of wireless communication for FL.

The research on analog aggregation based FL is still at an early stage, leaving some fundamental questions unexplored, such as its convergence behavior and the design of efficient algorithms. Given the limited transmit power and communication bandwidth at user devices, users may have to contend for communication resources when transmitting their local updates to the PS. This gives rise to the need for an efficient transmission paradigm, along with network resource allocation in terms of worker selection and transmit power control. All these practical issues motivate our work to study FL from the perspectives of both wireless communications and machine learning. In this paper, we quantify the impact of analog aggregation on the convergence behavior and performance of FL. Such quantitative results are essential in guiding the joint optimization of communication and computing resources. This paper aims at a comprehensive study on problem formulation, solution development, and algorithm implementation for the joint design and optimization of wireless communication and FL. Our key contributions are summarized as follows:

  • We derive closed-form expressions for the expected convergence rate of FL over the air for both the convex and non-convex cases, which not only interpret but also quantify the impact of wireless communications on the convergence and accuracy of FL over the air. Both full gradient descent (GD) and mini-batch stochastic gradient descent (SGD) methods are considered in this work. These closed-form expressions unveil a fundamental connection between analog wireless communication and FL with analog aggregation, which provides a fresh perspective to measure how the parameter design of analog wireless systems affects the performance of FL over the air.

  • Based on the closed-form theoretical results, we formulate a joint optimization problem of learning, worker selection, and power control, with a goal of minimizing the global FL loss function given limited transmit power and bandwidth. The optimization formulation turns out to be universal for the convex and non-convex cases with GD and SGD. Further, for practical implementation of the joint optimization problem in the presence of some unobservable parameters, we develop an alternative reformulation that approximates the original unattainable problem as a feasible optimization problem under the operational constraints of analog aggregation.

  • To efficiently solve the approximate problem, we identify a tight solution space by exploring the relationship between the number of workers and the power scaling. Thanks to the reduced search space, we propose a simple discrete enumeration method to efficiently find the globally optimal solution.

We evaluate the proposed joint optimization scheme for FL with analog aggregation in solving linear regression and image classification problems, respectively. Simulation results show that our proposed FL is superior to the benchmark scheme that uses random worker selection and power control, and achieves comparable performance to the ideal case where FL is implemented over noise-free wireless channels.

The remainder of this paper is organized as follows. Related work is presented in Section II. The system model for FL over the air and the associated joint communication and learning optimization formulation are presented in Section III. Section IV derives the closed-form expressions of the expected convergence rate of the FL over the air as the foundation for algorithm design and performance analysis. Section V provides a framework of joint optimization of communication and FL, and develops the corresponding algorithms. Section VI presents numerical results, followed by conclusions in Section VII.

II Related Work

This section reviews the literature and highlights the novelty of this paper with respect to related works.

To achieve communication efficiency in distributed learning, most of the existing strategies focus on digital communications, which may involve the preprocessing of transmitted updates or the management of wireless resources. For example, a popular line of work is to reduce the communication load per worker by compression of the updates under the assumptions of ideal communication links, such as exploiting coding schemes [19], utilizing the sparsity of updates [20], employing quantization of the updates [21], and avoiding less informative local updates via communication censoring schemes [22, 23, 24, 25, 26]. Another line of work is to support FL through communication resource management, such as worker scheduling schemes to maximize the number of participating workers[27], joint optimization of resource allocation and worker scheduling[7], and communication and computation resource allocation and scheduling for cell-free networks[8].

There are some pioneering works on analog aggregation based FL [13, 14, 16, 15, 17, 18], most of which focus on designing transmission schemes [13, 14, 16, 15]. They adopt preselected participating workers and fix their power allocation without further optimization along FL iterations. Optimization issues are considered in [17, 18], but they are mainly conducted on the communication side alone, without an underlying connection to FL. When communication-based metrics are used, the optimization in existing works often suggests maximizing the number of selected workers that participate in FL, but our theoretical results indicate that selecting more workers is not necessarily better over imperfect links or under limited communication resources. Thus, unlike these existing works, we seek to analyze the convergence behavior of analog aggregation based FL, which provides a fresh angle to interpret the specific relationship between communications and FL in the paradigm of analog aggregation. Such a connection leads to this work on a joint optimization framework for analog communications and FL, in which the worker selection and power allocation decisions are optimized during the iterative FL process.

III System Model

Figure 1: Federated learning via analog aggregation from wirelessly distributed data.

As shown in Fig. 1, we consider a one-hop wireless network consisting of a single parameter server (PS) at a base station and $U$ user devices as distributed local workers. Through federated learning, the PS and all workers collaborate to train a common model for supervised learning and data inference, without sharing local data.

III-A FL Model

Let $\mathcal{D}_{i}=\{\mathbf{x}_{i,k},\mathbf{y}_{i,k}\}_{k=1}^{K_{i}}$ denote the local dataset at the $i$-th worker, $i=1,\ldots,U$, where $\mathbf{x}_{i,k}$ is the input data vector, $\mathbf{y}_{i,k}$ is the labeled output vector, $k=1,2,\ldots,K_{i}$, and $K_{i}=|\mathcal{D}_{i}|$ is the number of data samples available at the $i$-th worker. With $K=\sum_{i=1}^{U}K_{i}$ samples in total, these $U$ workers seek to collectively train a learning model parameterized by a global model parameter $\mathbf{w}=[w^{1},\ldots,w^{D}]\in\mathcal{R}^{D}$ of dimension $D$, by minimizing the following loss function

$$\text{(Global loss function)}\quad F(\mathbf{w};\mathcal{D})=\frac{1}{K}\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}f(\mathbf{w};\mathbf{x}_{i,k},\mathbf{y}_{i,k}), \qquad (1)$$

where the global loss function $F(\mathbf{w};\mathcal{D})$ is a summation of $K$ data-dependent components, each component $f(\mathbf{w};\mathbf{x}_{i,k},\mathbf{y}_{i,k})$ being a sample-wise loss function that quantifies the prediction error of the model on the sample $(\mathbf{x}_{i,k},\mathbf{y}_{i,k})$ under the shared model parameter $\mathbf{w}$, and $\mathcal{D}=\bigcup_{i}\mathcal{D}_{i}$.

In distributed learning, each worker trains a local model $\mathbf{w}_{i}$ from its local data $\mathcal{D}_{i}$, which can be viewed as a local copy of the global model $\mathbf{w}$. That is, the local loss function is

$$\text{(Local loss function)}\quad F_{i}(\mathbf{w}_{i};\mathcal{D}_{i})=\frac{1}{K_{i}}\sum_{k=1}^{K_{i}}f(\mathbf{w}_{i};\mathbf{x}_{i,k},\mathbf{y}_{i,k}), \qquad (2)$$

where $\mathbf{w}_{i}=[w_{i}^{1},\ldots,w_{i}^{D}]\in\mathcal{R}^{D}$ is the local model parameter. Through collaboration, it is desired to achieve $\mathbf{w}_{i}=\mathbf{w}=\mathbf{w}^{*}$, $\forall i$, so that all workers reach the globally optimal model $\mathbf{w}^{*}$. Such distributed learning can be formulated via consensus optimization as [4, 28]

$$\textbf{P1:}\quad\min_{\mathbf{w}}\ \frac{1}{K}\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}f(\mathbf{w}_{i};\mathbf{x}_{i,k},\mathbf{y}_{i,k}). \qquad (3)$$

To solve P1, this paper adopts a model-averaging algorithm for FL [4, 28]. It is essentially an iterative process consisting of both computing and communication steps at each iteration. Specifically, in each communication round, the PS broadcasts the current $\mathbf{w}$ to all workers. Then, the $i$-th worker uses a learning algorithm to update its $\mathbf{w}_{i}$ by minimizing its local data-dependent loss function in (2). In this work, gradient descent is applied (we take basic gradient descent as an example, while the proposed methodology can be extended to mini-batch gradient descent as well), in which the local model at the $i$-th local worker is updated as

$$\text{(Local model updating)}\quad\mathbf{w}_{i}=\mathbf{w}-\alpha\nabla F_{i}(\mathbf{w}_{i};\mathcal{D}_{i})=\mathbf{w}-\frac{\alpha}{K_{i}}\sum_{k=1}^{K_{i}}\nabla f(\mathbf{w};\mathbf{x}_{i,k},\mathbf{y}_{i,k}), \qquad (4)$$

where $\alpha$ is the learning rate, and $\nabla f(\mathbf{w};\mathbf{x}_{i,k},\mathbf{y}_{i,k})$ is the gradient of $f(\mathbf{w};\mathbf{x}_{i,k},\mathbf{y}_{i,k})$ with respect to $\mathbf{w}$.

When local updating is completed, each worker transmits its updated parameter $\mathbf{w}_{i}$ to the PS via wireless uplinks to update the global $\mathbf{w}$ as

$$\text{(Global model updating)}\quad\mathbf{w}=\frac{\sum_{i=1}^{U}K_{i}\mathbf{w}_{i}}{K}. \qquad (5)$$

Then, the PS broadcasts $\mathbf{w}$ in (5) to all participating workers as their initial value in the next round. The FL implements the local model updating in (4) and the global model averaging in (5) iteratively, until convergence. It has been shown that this FL algorithm converges to the globally optimal solution of the original problem P1, under the conditions that $F$ is a convex function and the data transmission between the PS and workers is error-free [4, 28].
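As a concrete illustration of this loop, the sketch below runs the local update (4) and global averaging (5) on a toy linear-regression task; the worker sizes, learning rate, and number of rounds are illustrative choices, and links are assumed error-free here:

```python
import numpy as np

def local_update(w, X, y, alpha):
    # One gradient-descent step on the local squared-error loss, as in (4)
    grad = X.T @ (X @ w - y) / len(y)
    return w - alpha * grad

def global_average(local_ws, sizes):
    # Data-size-weighted model averaging, as in (5)
    K = sum(sizes)
    return sum(K_i * w_i for K_i, w_i in zip(sizes, local_ws)) / K

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
datasets = []
for K_i in (20, 30, 50):                       # three workers, unequal data sizes
    X = rng.normal(size=(K_i, 2))
    datasets.append((X, X @ w_true))           # noiseless labels for clarity

w = np.zeros(2)                                # initial global model w_0
for _ in range(200):                           # FL rounds
    local_ws = [local_update(w, X, y, alpha=0.1) for X, y in datasets]
    w = global_average(local_ws, [len(y) for _, y in datasets])
```

With error-free aggregation, the averaged model converges to the common minimizer, matching the convergence guarantee cited above for convex losses.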

Note that the implementation steps in (4) and (5) only concern the computational aspect of FL, by assuming perfect communications for both the global $\mathbf{w}$ and local $\mathbf{w}_{i}$ between the PS and all workers. However, the communication impacts on FL performance should not be ignored. Especially in practical wireless network environments, certain errors are inevitably introduced during transmissions of the updates due to the imperfect characteristics of wireless channels.

III-B Analog Aggregation Transmission Model

To avoid heavy communication overhead and save transmission bandwidth of FL over wireless channels, we adopt analog aggregation without coding, which allows multiple workers to simultaneously upload their local model updates to the PS over the same time-frequency resources. All workers transmit their local $\mathbf{w}_{i}$'s in an analog form with perfect time synchronization among them (the implementation of time synchronization and the impact of imperfect synchronization are beyond the scope of this work; interested readers are referred to [12, 29]). In this way, the local updates $\mathbf{w}_{i}$'s are aggregated over the air to implement the global model updating step in (5). Such an analog aggregation is conducted in an entry-wise manner. That is, the $d$-th entries $w_{i}^{d}$ from all workers, $i=1,\ldots,U$, are aggregated to compute $w^{d}$ in (5), for any $d\in[1,D]$.

Let $\mathbf{p}_{i,t}=[p^{1}_{i,t},\ldots,p^{d}_{i,t},\ldots,p^{D}_{i,t}]$ denote the power control vector of worker $i$ at the $t$-th iteration. Noticeably, the choice of $\mathbf{p}_{i,t}$ in FL over the air should be made not only to effectively implement the aggregation rule in (5), but also to properly accommodate the need for network resource allocation. Accordingly, we set the power control policy as

$$p^{d}_{i,t}=\frac{\beta^{d}_{i,t}K_{i}b^{d}_{t}}{h^{d}_{i,t}}, \qquad (6)$$

where $h^{d}_{i,t}$ is the channel gain between the $i$-th worker and the PS at the $t$-th iteration (we assume the channel state information (CSI) to be constant within each iteration, but possibly varying over iterations; we also assume that the CSI is perfectly known at the PS, and leave the imperfect CSI case for future work), $b^{d}_{t}$ is the power scaling factor, and $\beta^{d}_{i,t}$ is a transmission scheduling indicator. That is, $\beta^{d}_{i,t}=1$ means that the $d$-th entry of the local model parameter $\mathbf{w}_{i,t}$ of the $i$-th worker is scheduled to contribute to the FL algorithm at the $t$-th iteration, and $\beta^{d}_{i,t}=0$ otherwise. Through power scaling, the transmit power used for uploading the $d$-th entry from the $i$-th worker should not exceed a maximum power limit $P_{i}^{d,\max}=P_{i}^{\max}$ for any $d$, as follows:

$$|p^{d}_{i,t}w^{d}_{i,t}|^{2}=\left|\frac{\beta^{d}_{i,t}K_{i}b^{d}_{t}}{h^{d}_{i,t}}w^{d}_{i,t}\right|^{2}\leq P_{i}^{\max}. \qquad (7)$$
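A minimal sketch of the power control policy (6) and the per-entry power check (7), with hypothetical channel gains, scheduling indicators, and power budget (all values illustrative):

```python
import numpy as np

def power_control(beta, K_sizes, b, h):
    # Entry-wise power control (6): p[i, d] = beta[i, d] * K_i * b[d] / h[i, d]
    return beta * K_sizes[:, None] * b[None, :] / h

def within_power_limit(p, w_local, P_max):
    # Constraint (7): |p[i, d] * w[i, d]|^2 <= P_max for every worker and entry
    return bool((np.abs(p * w_local) ** 2 <= P_max).all())

rng = np.random.default_rng(1)
U, D = 4, 3                                   # workers, model dimension
K_sizes = np.array([10.0, 20.0, 30.0, 40.0])  # local dataset sizes K_i
h = rng.uniform(0.5, 2.0, size=(U, D))        # channel gains (CSI assumed known)
w_local = rng.normal(size=(U, D))             # local model entries w_{i,t}^d
beta = np.ones((U, D))                        # all entries scheduled
b = np.full(D, 0.01)                          # common power scaling factor b_t

p = power_control(beta, K_sizes, b, h)
ok = within_power_limit(p, w_local, P_max=100.0)
```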

At the PS side, the received signal at the $t$-th iteration can be written as

$$\mathbf{y}_{t}=\sum_{i=1}^{U}\mathbf{p}_{i,t}\odot\mathbf{w}_{i,t}\odot\mathbf{h}_{i,t}+\mathbf{z}_{t}=\sum_{i=1}^{U}K_{i}\mathbf{b}_{t}\odot\bm{\beta}_{i,t}\odot\mathbf{w}_{i,t}+\mathbf{z}_{t}, \qquad (8)$$

where $\odot$ represents the Hadamard product, $\mathbf{h}_{i,t}=[h^{1}_{i,t},h^{2}_{i,t},\ldots,h^{D}_{i,t}]$, $\bm{\beta}_{i,t}=[\beta^{1}_{i,t},\beta^{2}_{i,t},\ldots,\beta^{D}_{i,t}]$, $\mathbf{b}_{t}=[b^{1}_{t},b^{2}_{t},\ldots,b^{D}_{t}]$, and $\mathbf{z}_{t}\sim\mathcal{CN}(0,\sigma^{2}\mathbf{I})$ is additive white Gaussian noise (AWGN).

Given the received $\mathbf{y}_{t}$, the PS estimates $\mathbf{w}_{t}$ via a post-processing operation as

$$\mathbf{w}_{t}=\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\mathbf{b}_{t}\right)^{\odot-1}\odot\mathbf{y}_{t}=\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\mathbf{w}_{i,t}+\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\mathbf{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}, \qquad (9)$$

where $(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\mathbf{b}_{t})^{\odot-1}$ is a properly chosen scaling vector to produce equal weighting for the participating $\mathbf{w}_{i}$'s in (9) as desired in (5), and $(\mathbf{X})^{\odot-1}$ represents the inverse Hadamard operation on $\mathbf{X}$ that calculates its entry-wise reciprocal. Noticeably, in order to implement the averaging of (5) in FL over the air, such a post-processing operation requires $\mathbf{b}_{t}$ to be the same for all workers for given $t$ and $d$, which allows $\mathbf{b}_{t}$ to be eliminated from the first term in (9).

Comparing (9) with (5), there exist differences between $\mathbf{w}_{t}$ and $\mathbf{w}$ due to the effect of wireless communications. This work aims to mitigate such a gap by optimizing the worker selection $\bm{\beta}_{i,t}$ and power scaling $\mathbf{b}_{t}$ for FL over the air. To this end, our next step is to unveil an important but unexplored foundation, i.e., how wireless communications affect the convergence behavior of FL over the air.
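The transmit/receive chain of (6), (8) and (9) can be simulated end to end as below. For simplicity, this sketch uses real-valued AWGN in place of the complex noise model, and all dimensions, gains, and power levels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
U, D = 4, 5
K_sizes = np.array([10.0, 20.0, 30.0, 40.0])
K = K_sizes.sum()
w_local = rng.normal(size=(U, D))            # local models w_{i,t}
h = rng.uniform(0.5, 2.0, size=(U, D))       # channel gains h_{i,t}
beta = np.ones((U, D))                       # all entries scheduled
b = np.full(D, 5.0)                          # power scaling factor b_t
sigma = 0.1                                  # noise standard deviation

# Transmit-side power control (6): channel inversion scaled by K_i * b
p = beta * K_sizes[:, None] * b[None, :] / h
# Superposition over the MAC, as in (8): the channel multiplies by h, noise adds
y = (p * w_local * h).sum(axis=0) + sigma * rng.normal(size=D)
# Post-processing at the PS, as in (9): divide by sum_i K_i * beta * b
scale = (beta * K_sizes[:, None] * b[None, :]).sum(axis=0)
w_hat = y / scale

# Ideal noise-free aggregation (5) for comparison
w_ideal = (K_sizes[:, None] * w_local).sum(axis=0) / K
```

Here `w_hat` equals the ideal average plus noise attenuated by the aggregate scaling, which is exactly the residual noise term in (9).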

IV The Convergence Analysis of FL with Analog Aggregation

In this section, we study the effect of analog aggregation transmission on FL over the air, by analyzing its convergence behavior for both the convex and the non-convex cases. To average the effects of instantaneous SNRs, we derive the expected convergence rate of FL over the air, which quantifies the impact of wireless communications on FL using analog aggregation transmissions.

IV-A Convex Case

We first make the following assumptions that are commonly adopted in the optimization literature [30, 31, 32, 33, 7, 34].

Assumption 1 (Lipschitz continuity, smoothness): The gradient $\nabla F(\mathbf{w})$ of the loss function $F(\mathbf{w})$ is uniformly Lipschitz continuous with respect to $\mathbf{w}$, that is,

$$\|\nabla F(\mathbf{w}_{t+1})-\nabla F(\mathbf{w}_{t})\|\leq L\|\mathbf{w}_{t+1}-\mathbf{w}_{t}\|,\quad\forall\,\mathbf{w}_{t},\mathbf{w}_{t+1}, \qquad (10)$$

where $L$ is a positive constant, referred to as the Lipschitz constant of the function $F(\cdot)$.

Assumption 2 (strong convexity): $F(\mathbf{w})$ is strongly convex with a positive parameter $\mu$, obeying

$$F(\mathbf{w}_{t+1})\geq F(\mathbf{w}_{t})+(\mathbf{w}_{t+1}-\mathbf{w}_{t})^{T}\nabla F(\mathbf{w}_{t})+\frac{\mu}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_{t}\|^{2},\quad\forall\,\mathbf{w}_{t},\mathbf{w}_{t+1}. \qquad (11)$$

Assumption 3 (bounded local gradients): The sample-wise local gradients at local workers are bounded by their global counterpart [32, 33]:

$$\|\nabla f(\mathbf{w}_{t})\|^{2}\leq\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t})\|^{2}, \qquad (12)$$

where $\rho_{1},\rho_{2}\geq 0$.

According to [5, 35], the FL algorithm applied over ideal wireless channels is able to solve P1 and converges to an optimal $\mathbf{w}^{*}$. In the presence of wireless transmission errors, we derive the expected convergence rate of FL over the air with analog aggregation, as in Theorem 1.

Theorem 1.

Adopt Assumptions 1-3, and denote the globally optimal learning model in (3) as $\mathbf{w}^{*}$. The model updating rule for $\mathbf{w}_{t}$ of the FL-over-the-air scheme is given by (9), $\forall t$. Given the transmit power scaling factors $\mathbf{b}_{t}$, worker selection vectors $\bm{\beta}_{i,t}$, and setting the learning rate to be $\alpha=\frac{1}{L}$, the expected performance gap $\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]$ of $\mathbf{w}_{t}$ at the $t$-th iteration is given by

$$\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq B_{t}+A_{t}\,\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})], \qquad (13)$$

with

$$A_{t}=1-\frac{\mu}{L}+\rho_{2}\sum_{d=1}^{D}\left(\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\right), \qquad (14)$$

$$B_{t}=\frac{\rho_{1}}{2L}\sum_{d=1}^{D}\left(\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\right)+\left\|\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\mathbf{b}_{t}\right)^{\odot-1}\right\|^{2}\frac{L\sigma^{2}}{2}, \qquad (15)$$

where the expectation is taken over the channel AWGN of zero mean and variance $\sigma^{2}$.

Proof.

The proof of Theorem 1 is provided in Appendix A. ∎
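For given learning constants, the contraction factor $A_{t}$ in (14) and bias term $B_{t}$ in (15) can be evaluated directly. The sketch below uses illustrative values of $\mu$, $L$, $\rho_{1}$, $\rho_{2}$ and $\sigma^{2}$, and shows that dropping a worker enlarges both terms:

```python
import numpy as np

def theorem1_coeffs(beta, K_sizes, b, mu, L, rho1, rho2, sigma2):
    """Contraction factor A_t (14) and bias term B_t (15) of Theorem 1."""
    K = K_sizes.sum()
    sched = (K_sizes[:, None] * beta).sum(axis=0)  # sum_i K_i * beta_{i,t}^d per entry d
    gap = (K / sched - 1.0).sum()                  # sum_d (K / sum_i K_i beta - 1)
    A_t = 1 - mu / L + rho2 * gap
    inv_norm_sq = (1.0 / (sched * b) ** 2).sum()   # ||(sum_i K_i beta ⊙ b)^(⊙-1)||^2
    B_t = rho1 / (2 * L) * gap + inv_norm_sq * L * sigma2 / 2
    return A_t, B_t

K_sizes = np.array([10.0, 20.0, 30.0, 40.0])
D = 3
b = np.full(D, 2.0)
args = dict(K_sizes=K_sizes, b=b, mu=1.0, L=4.0, rho1=1.0, rho2=0.01, sigma2=0.1)

all_on = np.ones((4, D))                           # every worker scheduled
A_full, B_full = theorem1_coeffs(all_on, **args)   # gap term vanishes: A_t = 1 - mu/L
one_off = all_on.copy()
one_off[3, :] = 0                                  # drop the largest worker
A_sub, B_sub = theorem1_coeffs(one_off, **args)
```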

Based on Theorem 1, we further derive the cumulative performance gap of the whole FL process resulting from wireless communications and worker selection, as summarized in the following Lemma 1.

Lemma 1.

Given an initial global model $\mathbf{w}_{0}$, the cumulative performance gap $\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]$ of FL after $t$ iterations is bounded by

$$\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq\underbrace{\sum_{i=1}^{t-1}\prod_{j=1}^{i}A_{t+1-j}B_{t-i}+B_{t}}_{\Delta_{t}}+\prod_{j=1}^{t}A_{j}\,\mathbb{E}[F(\mathbf{w}_{0})-F(\mathbf{w}^{*})]. \qquad (16)$$
Proof.

Given the expected performance gap at the $t$-th iteration in (13), we carry out recursions as follows:

$$\begin{aligned}\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]&\leq B_{t}+A_{t}\,\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})]\\&\leq B_{t}+A_{t}\left(B_{t-1}+A_{t-1}\,\mathbb{E}[F(\mathbf{w}_{t-2})-F(\mathbf{w}^{*})]\right)\\&\;\,\vdots\\&\leq\sum_{i=1}^{t-1}\prod_{j=1}^{i}A_{t+1-j}B_{t-i}+B_{t}+\prod_{j=1}^{t}A_{j}\,\mathbb{E}[F(\mathbf{w}_{0})-F(\mathbf{w}^{*})]. \qquad (17)\end{aligned}$$

∎
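The recursion (17) and the closed form (16) can be cross-checked numerically; the $A_{t}$ and $B_{t}$ sequences below are arbitrary illustrative values:

```python
def unrolled_bound(A_seq, B_seq, gap0):
    # Apply the per-round bound (13) repeatedly: G_t <= B_t + A_t * G_{t-1}
    g = gap0
    for A_t, B_t in zip(A_seq, B_seq):
        g = B_t + A_t * g
    return g

def closed_form_bound(A_seq, B_seq, gap0):
    # Closed form (16): Delta_t plus prod_j A_j times the initial gap
    t = len(A_seq)
    total = B_seq[-1]
    for i in range(1, t):
        prod_A = 1.0
        for j in range(1, i + 1):
            prod_A *= A_seq[t - j]        # A_{t+1-j} in 1-based notation
        total += prod_A * B_seq[t - 1 - i]
    prod_all = 1.0
    for A_t in A_seq:
        prod_all *= A_t
    return total + prod_all * gap0

A_seq = [0.8, 0.85, 0.9, 0.75]            # illustrative contraction factors
B_seq = [0.1, 0.05, 0.2, 0.02]            # illustrative bias terms
g_rec = unrolled_bound(A_seq, B_seq, gap0=1.0)
g_cf = closed_form_bound(A_seq, B_seq, gap0=1.0)
```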

Lemma 1 reveals that the FL algorithm converges asymptotically in $t$ under mild conditions, as stated in Proposition 1.

Proposition 1.

Given the learning rate $\alpha=\frac{1}{L}$, the convergence of the FL algorithm is guaranteed with $\lim_{t\rightarrow\infty}\mathbf{w}_{t}=\mathbf{w}^{*}$, as long as $\rho_{2}$ in (12) satisfies the following condition:

$$0<\rho_{2}<\frac{\mu}{\left(\frac{K}{K_{\min}}-1\right)DL}, \qquad (18)$$

where $K_{\min}=\min\{K_{i}\}_{i=1}^{U}$.

Proof.

When $A_{t}<1,\ \forall t$, it is evident that $\lim_{t\rightarrow\infty}\prod_{j=1}^{t}A_{j}=0$. From Lemma 1, to guarantee convergence, a sufficient condition is to ensure $A_{\max}\triangleq\max\{A_{t},t=1,2,\ldots\}<1$. Given (14), it holds that

$$A_{t}=1-\frac{\mu}{L}+\rho_{2}\sum_{d=1}^{D}\left(\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\right)\leq 1-\frac{\mu}{L}+\rho_{2}\sum_{d=1}^{D}\left(\frac{K}{K_{\min}}-1\right), \qquad (19)$$

where $K_{\min}=\min\{K_{i}\}_{i=1}^{U}$. When all workers have the same amount of data, i.e., $K_{i}=\frac{K}{U},\ \forall i$, then $A_{t}\leq 1-\frac{\mu}{L}+\rho_{2}D(U-1)$.

To ensure $A_{\max}<1$, we have

$$A_{\max}\leq 1-\frac{\mu}{L}+\rho_{2}\sum_{d=1}^{D}\left(\frac{K}{K_{\min}}-1\right)<1. \qquad (20)$$

From (20), it holds that $\rho_{2}<\frac{\mu}{(\frac{K}{K_{\min}}-1)DL}$. On the other hand, $\rho_{2}>0$, according to (12) in Assumption 3. As a result, we have $0<\rho_{2}<\frac{\mu}{(\frac{K}{K_{\min}}-1)DL}$, which completes the proof. ∎
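The sufficient condition (18) and the worst-case bound (19) are easy to evaluate; the worker sizes and constants below are illustrative:

```python
def rho2_upper_bound(K_sizes, D, mu, L):
    # Right-hand side of (18): mu / ((K / K_min - 1) * D * L)
    K, K_min = sum(K_sizes), min(K_sizes)
    return mu / ((K / K_min - 1.0) * D * L)

def worst_case_A(rho2, K_sizes, D, mu, L):
    # Upper bound on A_t from (19): 1 - mu/L + rho2 * D * (K / K_min - 1)
    K, K_min = sum(K_sizes), min(K_sizes)
    return 1 - mu / L + rho2 * D * (K / K_min - 1.0)

K_sizes = [10.0, 20.0, 30.0, 40.0]
bound = rho2_upper_bound(K_sizes, D=3, mu=1.0, L=4.0)
# Any rho2 strictly inside (0, bound) keeps the worst-case A_t below 1
A_max = worst_case_A(0.9 * bound, K_sizes, D=3, mu=1.0, L=4.0)
```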

Proposition 1 indicates that the convergence behavior of the FL algorithm depends on both the learning-related parameters, i.e., $\mu$, $L$, $\rho_{1}$, $\rho_{2}$, and the communication-related parameters, including $\bm{\beta}$, $\mathbf{b}$ and $\sigma^{2}$. Interestingly, the channel noise $\sigma^{2}$ and $\mathbf{b}$ do not affect $A_{t}$; hence they do not affect the convergence of the FL algorithm, but they determine the steady state that the FL algorithm converges to.

Lemma 1 also provides the expected convergence rate of an FL algorithm when the transmission link is error-free. In this ideal case, it offers the fastest convergence rate achievable by the FL algorithm, as derived in the following Lemma 2.

Lemma 2.

Consider a resource-unconstrained and error-free mode where the effects of wireless channels, as well as that of noise, are already mitigated or fully compensated. Given the optimal global $\mathbf{w}^{*}$ and the learned $\mathbf{w}_{t}$ in (9), the upper bound of $\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]$ for the FL over the air is given by

$$\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq\left(1-\frac{\mu}{L}\right)^{t}\mathbb{E}[F(\mathbf{w}_{0})-F(\mathbf{w}^{*})]. \qquad (21)$$
Proof.

Without channel noise or worker selection (that is, all workers participate in the FL and deliver their data perfectly), we have $\sigma^{2}=0$ and $\sum_{d=1}^{D}\big(\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\big)=0$. Then, in (14) and (15), we have $B_{t}=B=0$ and $A_{t}=A=1-\frac{\mu}{L},\ \forall t$. As a result, (16) reduces to (21). ∎

It is worth noting that Lemma 2 provides the convergence rate for the ideal case, which assumes that the impacts of wireless communications, including noise, channel and constrained resources, are all mitigated to result in error-free transmission. According to (16), in the realistic case the trajectory of $\mathbb{E}[F(\mathbf{w}_{t+1})]$ exhibits jump discontinuity with a gap term $\Delta_{t}$ at each step $t$, as defined in (16):

$$\Delta_{t}=\sum_{i=1}^{t-1}\prod_{j=1}^{i}A_{t+1-j}B_{t-i}+B_{t}.$$

This gap reflects the impact of wireless communication factors on FL, by means of the worker selection, transmit power scaling and AWGN. Intuitively, this gap diminishes as the number of selected workers increases, which reduces the value of $A_{t}$. Meanwhile, as the power scaling factor $\mathbf{b}_{t}$ increases, $B_{t}$ decreases, which reduces the gap as well. Hence, it is necessary to optimize the transmit power scaling factors and worker selection in order to minimize the gap in (16) for the implementation of FL algorithms over a realistic wireless network.
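When $A_{t}=A<1$ and $B_{t}=B$ are constant across rounds, the bound (16) approaches the fixed point $B/(1-A)$, so the steady-state gap under different scheduling and power choices can be compared directly. The sketch below (illustrative parameters, with a common scheduling sum and power scaling per entry) shows the gap shrinking when all workers participate with a larger power scaling factor:

```python
def gap_terms(sched_sum, b, D, mu, L, rho1, rho2, sigma2, K):
    # A_t (14) and B_t (15) when sum_i K_i beta^d equals sched_sum for every d
    g = D * (K / sched_sum - 1.0)
    A = 1 - mu / L + rho2 * g
    B = rho1 / (2 * L) * g + D / (sched_sum * b) ** 2 * L * sigma2 / 2
    return A, B

def steady_gap(A, B):
    # Fixed point of G <= B + A * G for A < 1: the asymptotic gap B / (1 - A)
    return B / (1 - A)

params = dict(D=3, mu=1.0, L=4.0, rho1=1.0, rho2=0.001, sigma2=0.1, K=100.0)
A1, B1 = gap_terms(60.0, b=1.0, **params)    # partial participation, low power
A2, B2 = gap_terms(100.0, b=10.0, **params)  # full participation, higher power
```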

IV-B Non-convex Case

When the loss function $F(\mathbf{w})$ is non-convex, such as in the case of convolutional neural networks, we derive the convergence rate of FL over the air with analog aggregation without Assumption 2, as summarized in Theorem 2.

Theorem 2.

Under Assumptions 1 and 3 for the non-convex case, given the transmit power scaling factors $\mathbf{b}_{t}$, worker selection vectors $\bm{\beta}_{i,t}$, the optimal global FL model $\mathbf{w}^{*}$, and the learning rate $\alpha=\frac{1}{L}$, the convergence at the $T$-th iteration is given by

$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla F(\mathbf{w}_{t-1})\|^{2}\leq\frac{2L}{T\left(1-\rho_{2}D\left(\frac{K}{K_{\min}}-1\right)\right)}\,\mathbb{E}[F(\mathbf{w}_{0})-F(\mathbf{w}^{*})]+\frac{2L\sum_{t=1}^{T}B_{t}}{T\left(1-\rho_{2}D\left(\frac{K}{K_{\min}}-1\right)\right)}. \qquad (22)$$
Proof.

Please refer to Appendix B. ∎

As we can see from Theorem 2, when $T$ is large enough, we have

$$\min_{t=0,1,\ldots,T}\mathbb{E}[\|\nabla F(\mathbf{w}_{t-1})\|^{2}]\leq\frac{1}{T}\sum_{t=1}^{T}\|\nabla F(\mathbf{w}_{t-1})\|^{2}\overset{T\rightarrow\infty}{\leq}\underbrace{\frac{2L\sum_{t=1}^{T}B_{t}}{T\left(1-\rho_{2}D\left(\frac{K}{K_{\min}}-1\right)\right)}}_{\triangle_{T}^{NC}}, \qquad (23)$$

which guarantees convergence of the FL algorithm to a stationary point [36, 28]. Similarly, the performance gap at step tt for the non-convex case is given by

tNC=2Lt=1TBtT(1ρ2D(KKmin1)).\displaystyle\bigtriangleup_{t}^{NC}=\frac{2L\sum_{t=1}^{T}B_{t}}{T(1-\rho_{2}D(\frac{K}{K_{min}}-1))}. (24)

Note that the non-convex case and the convex case share the same sufficient condition for convergence, namely (18) in Proposition 1.

IV-C Stochastic Gradient Descent

Our work can be extended to stochastic versions of gradient descent (SGD) as well. Here, we provide convergence analysis for mini-batch gradient descent with a constant mini-batch size KbK_{b}, while the results directly apply to the standard SGD by setting Kb=1K_{b}=1. Theorem 3 summarizes the convergence behavior of SGD for the strongly convex case.

Theorem 3.

Under the Assumptions 1, 2 and 3 for the convex case, and given the transmit power scaling factors 𝐛t\bm{b}_{t}, worker selection vectors 𝛃i,t\bm{\beta}_{i,t}, the optimal global FL model 𝐰\mathbf{w}^{*}, the learning rate α=1L\alpha=\frac{1}{L} and the mini-batch size KbK_{b}, the convergence behavior of the SGD implementation of FL over the air is given by

𝔼[F(𝐰t)F(𝐰)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq i=1t1j=1iAt+1jSGDBtiSGD+BtSGDΔtSGD+j=1tAjSGD𝔼[F(𝐰0)F(𝐰)],\displaystyle\underbrace{\sum_{i=1}^{t-1}\prod_{j=1}^{i}A^{SGD}_{t+1-j}B^{SGD}_{t-i}+B^{SGD}_{t}}_{\Delta^{SGD}_{t}}+\prod_{j=1}^{t}A^{SGD}_{j}\mathbb{E}[F(\mathbf{w}_{0})-F(\mathbf{w}^{*})], (25)

where

AtSGD=\displaystyle A^{SGD}_{t}= 1μL+ρ2(d=1D((i=1UKb)22K(i=1UKb)K2+(i=1UKb)i=1UKbβi,td)\displaystyle 1-\frac{\mu}{L}+\rho_{2}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{(\sum_{i=1}^{U}K_{b})^{2}-2K(\sum_{i=1}^{U}K_{b})}{K^{2}}+\frac{(\sum_{i=1}^{U}K_{b})}{\sum_{i=1}^{U}K_{b}\beta^{d}_{i,t}}\Bigg{)}
+(i=1U(KiKb))2K2),\displaystyle+\frac{(\sum_{i=1}^{U}(K_{i}-K_{b}))^{2}}{K^{2}}\Bigg{)}, (26)
BtSGD=\displaystyle B^{SGD}_{t}= ρ12L(d=1D((i=1UKb)22K(i=1UKb)K2+(i=1UKb)i=1UKbβi,td)\displaystyle\frac{\rho_{1}}{2L}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{(\sum_{i=1}^{U}K_{b})^{2}-2K(\sum_{i=1}^{U}K_{b})}{K^{2}}+\frac{(\sum_{i=1}^{U}K_{b})}{\sum_{i=1}^{U}K_{b}\beta^{d}_{i,t}}\Bigg{)}
+(i=1U(KiKb))2K2)+(i=1UKi𝜷i,t𝒃t)12Lσ22.\displaystyle+\frac{(\sum_{i=1}^{U}(K_{i}-K_{b}))^{2}}{K^{2}}\Bigg{)}+\left\|\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\right\|^{2}\frac{L\sigma^{2}}{2}. (27)
Proof.

Please refer to Appendix C. ∎

From Theorem 3, the cumulative performance gap of FL after the tt-th iteration for the SGD case is bounded by

tSGD=i=1t1j=1iAt+1jSGDBtiSGD+BtSGD.\displaystyle\bigtriangleup_{t}^{SGD}=\sum_{i=1}^{t-1}\prod_{j=1}^{i}A^{SGD}_{t+1-j}B^{SGD}_{t-i}+B^{SGD}_{t}. (28)
Remark 1.

If KbK_{b} is set to be KiK_{i}, then Theorem 3 for SGD is the same as Theorem 1 for GD. In addition, since the common mini-batch size is no larger than the minimum local data size, i.e., KbKminKUK_{b}\leq K_{min}\leq\frac{K}{U}, both AtSGDA_{t}^{SGD} in (26) and BtSGDB^{SGD}_{t} in (27) decrease as KbK_{b} increases, which leads to a smaller ΔtSGD\Delta_{t}^{SGD} in (25). In other words, FL has a better convergence performance with a larger KbK_{b}. On the other hand, such an improvement on performance is achieved at the cost of high computation load at the local workers in each communication round, which reflects the tradeoff between training performance and computational complexity in SGD.

Similarly, we can also derive the mild convergence condition for SGD, given by the following Proposition 2.

Proposition 2.

Given the learning rate α=1L\alpha=\frac{1}{L}, the convergence of the FL algorithm for SGD cases is guaranteed with limt𝐰t=𝐰\lim_{t\rightarrow\infty}\mathbf{w}_{t}=\mathbf{w}^{*}, as long as ρ2\rho_{2} in (12) satisfies the following condition:

0<ρ2<μ(12UKbK+U2Kb2K2+DU2DUKbK+DU2Kb2K2)L.\displaystyle 0<\rho_{2}<\frac{\mu}{(1-\frac{2UK_{b}}{K}+\frac{U^{2}K^{2}_{b}}{K^{2}}+DU-\frac{2DUK_{b}}{K}+\frac{DU^{2}K_{b}^{2}}{K^{2}})L}. (29)
Proof.

Similar to the GD case, to guarantee the convergence, a sufficient condition is still to ensure AmaxSGDmax{AtSGD,t=1,2}<1A^{SGD}_{max}\triangleq\max\{A^{SGD}_{t},t=1,2...\}<1. It holds that,

AtSGD=\displaystyle A^{SGD}_{t}= 1μL+ρ2((KUKb)2K2+d=1D(U2Kb22KUKbK2+Ui=1Uβi,td))\displaystyle 1-\frac{\mu}{L}+\rho_{2}\Bigg{(}\frac{(K-UK_{b})^{2}}{K^{2}}+\sum_{d=1}^{D}\Bigg{(}\frac{U^{2}K_{b}^{2}-2KUK_{b}}{K^{2}}+\frac{U}{\sum_{i=1}^{U}\beta^{d}_{i,t}}\Bigg{)}\Bigg{)}
\displaystyle\leq 1μL+ρ2((KUKb)2K2+DU2Kb22DKUKbK2+DU),\displaystyle 1-\frac{\mu}{L}+\rho_{2}\Bigg{(}\frac{(K-UK_{b})^{2}}{K^{2}}+\frac{DU^{2}K_{b}^{2}-2DKUK_{b}}{K^{2}}+DU\Bigg{)}, (30)

which comes from the fact that i=1Uβi,td1\sum_{i=1}^{U}\beta^{d}_{i,t}\geq 1.

Thus, we have

AmaxSGD1μL+ρ2((KUKb)2+DU2Kb22DKUKb+DUK2K2)<1.\displaystyle A^{SGD}_{max}\leq 1-\frac{\mu}{L}+\rho_{2}\Bigg{(}\frac{(K-UK_{b})^{2}+DU^{2}K_{b}^{2}-2DKUK_{b}+DUK^{2}}{K^{2}}\Bigg{)}<1. (31)

From (31), we can get ρ2<μ(12UKbK+U2Kb2K2+DU2DUKbK+DU2Kb2K2)L\rho_{2}<\frac{\mu}{(1-\frac{2UK_{b}}{K}+\frac{U^{2}K^{2}_{b}}{K^{2}}+DU-\frac{2DUK_{b}}{K}+\frac{DU^{2}K_{b}^{2}}{K^{2}})L}. Considering ρ2>0\rho_{2}>0 in Assumption 3, we get (29) as a result. ∎
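A minimal numeric sketch of this sufficient condition on ρ2, following the expression derived in the proof above; all parameter values are illustrative assumptions.

```python
# A minimal sketch evaluating the sufficient condition on rho_2 for the
# SGD case, following the bound derived from (31). All parameter values
# below are illustrative assumptions, not the paper's settings.

def rho2_upper_bound(mu, L, U, K, K_b, D):
    """Upper bound on rho_2: mu / (X * L), where X collects the
    worker/batch/dimension terms from (31)."""
    r = U * K_b / K  # fraction of data covered by one round of mini-batches
    X = 1 - 2 * r + r ** 2 + D * U - 2 * D * r + D * r ** 2
    return mu / (X * L)

# Example: 20 workers, 1000 samples in total, batch size 10, D = 50 entries.
print(rho2_upper_bound(mu=1.0, L=10.0, U=20, K=1000, K_b=10, D=50))
```

The DU term dominates the denominator for large models, so the admissible ρ2 shrinks roughly as 1/(DUL) as the model dimension grows.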

V Performance Optimization for Federated Learning over the Air

In this section, we first formulate a joint optimization problem to reduce the gap for FL over the air, which turns out to be applicable for both the convex and non-convex cases, using either GD or SGD implementations. To make it applicable in practice in the presence of some unobservable parameters at the PS, we reformulate it to an approximate problem by imposing a conservative power constraint. To efficiently solve such an approximate problem, we first identify a tight solution space and then develop an optimal solution via discrete programming.

V-A Problem Formulation for Joint Optimization

Since we are concerned with convergence accuracy, our optimization problem boils down to minimizing the performance gap for different cases (i.e., t\bigtriangleup_{t}, tNC\bigtriangleup_{t}^{NC}, and tSGD\bigtriangleup_{t}^{SGD}) at each iteration under the corresponding convergence conditions (i.e., Proposition 1 and Proposition 2).

We recognize that solving P1 amounts to iteratively minimizing the gaps t\bigtriangleup_{t}, tNC\bigtriangleup_{t}^{NC}, and tSGD\bigtriangleup_{t}^{SGD} under the transmit power constraint in (7). At the tt-th iteration, the objective functions for these three cases are given by

t=\displaystyle\bigtriangleup_{t}= Bt+Att1,\displaystyle B_{t}+A_{t}\bigtriangleup_{t-1}, (32)
tNC=\displaystyle\bigtriangleup^{NC}_{t}= Bt,\displaystyle B_{t}, (33)
tSGD=\displaystyle\bigtriangleup^{SGD}_{t}= BtSGD+AtSGDt1SGD,\displaystyle B^{SGD}_{t}+A^{SGD}_{t}\bigtriangleup^{SGD}_{t-1}, (34)

where 0=0\bigtriangleup_{0}=0 and 0SGD=0\bigtriangleup^{SGD}_{0}=0. Note that when the optimization is executed at the tt-th iteration, t1\bigtriangleup_{t-1} and t1SGD\bigtriangleup^{SGD}_{t-1} can be treated as constants.

Considering the entry-wise transmission for analog aggregation, we remove irrelevant items and extract the component of the dd-th entry from the gaps in (32), (33) and (34) as the objectives to minimize, which are given by

Rt[d]=\displaystyle R_{t}[d]= Lσ22(i=1Uβi,tdKibtd)2+Kρ1+2KLρ2t12Li=1UKiβi,td,d,\displaystyle\frac{L\sigma^{2}}{2\left(\sum_{i=1}^{U}\beta^{d}_{i,t}K_{i}b^{d}_{t}\right)^{2}}+\frac{K\rho_{1}+2KL\rho_{2}\bigtriangleup_{t-1}}{2L\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}},\quad\forall d, (35)
RtNC[d]=\displaystyle R_{t}^{NC}[d]= Lσ22(i=1Uβi,tdKibtd)2+Kρ12Li=1UKiβi,td,d,\displaystyle\frac{L\sigma^{2}}{2\left(\sum_{i=1}^{U}\beta^{d}_{i,t}K_{i}b^{d}_{t}\right)^{2}}+\frac{K\rho_{1}}{2L\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}},\quad\forall d, (36)
RtSGD[d]=\displaystyle R_{t}^{SGD}[d]= Lσ22(i=1Uβi,tdKibtd)2+U(ρ1+2Lρ2t1)2Li=1UKiβi,td,d.\displaystyle\frac{L\sigma^{2}}{2\left(\sum_{i=1}^{U}\beta^{d}_{i,t}K_{i}b^{d}_{t}\right)^{2}}+\frac{U(\rho_{1}+2L\rho_{2}\bigtriangleup_{t-1})}{2L\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}},\quad\forall d. (37)

Since all entries indexed by dd are separable with respect to the design parameters, we perform entry-wise optimization by considering 𝐰t\mathbf{w}_{t} and 𝐰i,t\mathbf{w}_{i,t} one entry at a time, where the superscript dd and the index of different cases are omitted hereafter. To determine the worker selection vector βi,t\beta_{i,t} and the power scaling factor btb_{t} at the tt-th iteration, the PS carries out a joint optimization problem formulated as follows:

P2:min{bt,βi,t}i=1U\displaystyle\textbf{P2:}\quad\min_{\{b_{t},\beta_{i,t}\}_{i=1}^{U}} Rt\displaystyle R_{t} (38a)
s.t. |βi,tKibthi,twi,t|2Pimax,\displaystyle\ \bigg{|}\frac{\beta_{i,t}K_{i}b_{t}}{h_{i,t}}w_{i,t}\bigg{|}^{2}\leq P_{i}^{\max}, (38b)
βi,t{0,1},i{1,2,,U},\displaystyle\ {\beta}_{i,t}\in\{0,1\},i\in\{1,2,...,U\},

where KiK_{i} should be KbK_{b} in (38b) for the SGD case.

However, in (38b), the knowledge of {wi,t}i=1U\{w_{i,t}\}_{i=1}^{U} is needed but is unavailable to the PS due to analog aggregation.

To overcome this issue, we reformulate a practical optimization problem via an approximation that 𝐰t11Ui=1U𝐰i,t\mathbf{w}_{t-1}\approx\frac{1}{U}\sum_{i=1}^{U}\mathbf{w}_{i,t}, in light of the consensus constraint in P1. According to (47) in the proof of Theorem 1, each local parameter 𝐰i,t\mathbf{w}_{i,t} is updated from the broadcast 𝐰t1\mathbf{w}_{t-1} along the direction of the averaged gradient over its local data αKik=1Kif(𝐰t1;𝐱i,k,𝐲i,k)\frac{\alpha}{K_{i}}\sum_{k=1}^{K_{i}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k}). Hence, it is reasonable to make the following common assumption on bounded local gradients, considering that the local gradients can be controlled by adjusting the learning rate or through simple clipping [21, 28, 37, 38].

Assumption 4 (bounded local gradients): The gap between the global parameter wt1w_{t-1} and the local parameter update wi,t,i,tw_{i,t},\forall i,t is bounded by

|wt1wi,t|η,\displaystyle|w_{t-1}-w_{i,t}|\leq\eta, (39)

where η0\eta\geq 0 is related to the learning rate α\alpha that satisfies the following condition444This implies the value range of η\eta. In practice, η\eta can take |wt1wt2||w_{t-1}-w_{t-2}|. In addition, for the SGD case, we have ηmax{{|α𝔼𝒟i[f(w,𝐱i,k,𝐲i,k)]|}i=1U}\eta\geq\max\{\{|\alpha\mathbb{E}_{\mathcal{D}_{i}}[\nabla f(w,\mathbf{x}_{i,k},\mathbf{y}_{i,k})]|\}_{i=1}^{U}\}

ηmax{{|αKik=1Kif(w,𝐱i,k,𝐲i,k)|}i=1U}.\displaystyle\eta\geq\max\left\{\left\{\Bigg{|}\frac{\alpha}{K_{i}}\sum_{k=1}^{K_{i}}\nabla f(w,\mathbf{x}_{i,k},\mathbf{y}_{i,k})\Bigg{|}\right\}_{i=1}^{U}\right\}. (40)

Under Assumption 4, we reformulate the original optimization problem (P2) into the following problem (P3), by replacing wi,tw_{i,t} in (38b) by its approximation:

P3:min{bt,βi,t}i=1U\displaystyle\textbf{P3:}\quad\min_{\{b_{t},\beta_{i,t}\}_{i=1}^{U}} Rt\displaystyle R_{t} (41a)
s.t. |βi,tKibthi,t|2(|wt1|+η)2Pimax,\displaystyle\ \bigg{|}\frac{\beta_{i,t}K_{i}b_{t}}{h_{i,t}}\bigg{|}^{2}(|w_{t-1}|+\eta)^{2}\leq P_{i}^{\max}, (41b)
βi,t{0,1},i{1,2,,U},\displaystyle\ {\beta}_{i,t}\in\{0,1\},i\in\{1,2,...,U\}, (41c)

where the power constraint (41b) is constructed based on the fact that

|βi,tKibthi,twi,t|2=\displaystyle\bigg{|}\frac{\beta_{i,t}K_{i}b_{t}}{h_{i,t}}w_{i,t}\bigg{|}^{2}= |βi,tKibthi,t|2|wi,t|2\displaystyle\bigg{|}\frac{\beta_{i,t}K_{i}b_{t}}{h_{i,t}}\bigg{|}^{2}|w_{i,t}|^{2}
\displaystyle\leq |βi,tKibthi,t|2(|wt1|+η)2.\displaystyle\bigg{|}\frac{\beta_{i,t}K_{i}b_{t}}{h_{i,t}}\bigg{|}^{2}(|w_{t-1}|+\eta)^{2}. (42)

Since wt1w_{t-1} is always available at the PS, P3 becomes a feasible formulation for adoption in practice. Next, we develop the optimal solution to P3.

V-B Optimal Solution to P3 via Discrete Programming

At first glance, a direct solution to P3 leads to a mixed-integer programming (MIP) problem, which unfortunately incurs high complexity. To solve P3 in an efficient manner, we develop a simple solution by identifying a tight search space without loss of optimality. The tight search space, given in the following Theorem 4, is a result of the constraints in (41b) and (41c), irrespective of the objective function (41a). Hence, it holds universally for any RtR_{t}, such as those in (35)-(37).

Theorem 4.

When all the required parameters in P3, i.e., {Pimax,wt1,hi,t,Ki,η}i=1U\{P_{i}^{\max},w_{t-1},h_{i,t},K_{i},\eta\}_{i=1}^{U}, are available at the PS, the solution space of (bt,βi,t)(b_{t},\beta_{i,t}) in P3 can be reduced to the following tight search space without loss of optimality as

𝒮={{(bt(k),βi,t(k))}k=1U|bt(k)=|Pkmaxhk,tKk(|wt1|+η)|,𝜷t(k)(bt(k))=[β1,t(k),,βU,t(k)],k=1,,U},\displaystyle\mathcal{S}=\left\{\bigg{\{}\left(b_{t}^{(k)},\beta_{i,t}^{(k)}\right)\bigg{\}}_{k=1}^{U}\Bigg{|}b_{t}^{(k)}=\left|\frac{\sqrt{P_{k}^{\max}}h_{k,t}}{K_{k}(|w_{t-1}|+\eta)}\right|,\bm{\beta}_{t}^{(k)}(b_{t}^{(k)})=\left[\beta^{(k)}_{1,t},\dots,\beta^{(k)}_{U,t}\right],k=1,\dots,U\right\}, (43)

where 𝛃t(k)\bm{\beta}_{t}^{(k)} is a function of bt(k)b_{t}^{(k)}, in the form:

βi,t(k)=H(Pimax|Kibt(k)(|wt1|+η)hi,t|2)\displaystyle\beta^{(k)}_{i,t}=H\bigg{(}P_{i}^{\max}-\bigg{|}\frac{K_{i}b^{(k)}_{t}(|w_{t-1}|+\eta)}{h_{i,t}}\bigg{|}^{2}\bigg{)} (44)

and H(x)H(x) is the Heaviside step function, i.e., H(x)=1H(x)=1 for x>0x>0, and H(x)=0H(x)=0 otherwise.

Proof.

Please see Appendix D. ∎

Thanks to Theorem 4, we equivalently transform P3 from a MIP into a discrete programming (DP) problem P4 as follows

P4:min(bt,𝜷t)𝒮Rt=Rt(bt,𝜷t)\displaystyle\textbf{P4:}\qquad\min_{(b_{t},\bm{\beta}_{t})\in\mathcal{S}}R_{t}=R_{t}\left(b_{t},\bm{\beta}_{t}\right) (45)

According to P4, the objective RtR_{t} can only take on UU possible values corresponding to the UU feasible values of btb_{t}; meanwhile, given each btb_{t}, the value of 𝜷t\bm{\beta}_{t} is uniquely determined. Hence the minimum RtR_{t} can be obtained via line search over the UU feasible points (bt,𝜷t)(b_{t},\bm{\beta}_{t}) in (43). Note that the feasible points in (43) are determined by the channel gains, power limits and data sizes of the UU workers. Hence, the optimal transmission policy decided by P4 reflects the tradeoff among workers in terms of their channel quality, available power resource, and data quality.
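The finite-set search can be sketched as follows: enumerate the U candidate power scaling factors of (43), derive each worker-selection vector via the Heaviside rule of Theorem 4, and keep the pair minimizing Rt. The non-convex objective (36) is used as the Rt instance; the channel gains, power limits and data sizes below are illustrative assumptions, and a small tolerance is added to the selection rule to absorb floating-point rounding at the power-limit boundary.

```python
# A hedged sketch of the Theorem 4 finite-set search for P4. All inputs
# are illustrative assumptions, not values from the paper.
import math

def solve_P4(P_max, h, Kd, w_prev, eta, L, sigma2, rho1):
    """Line search over the U feasible (b_t, beta_t) points of (43)."""
    U, K = len(h), sum(Kd)
    best = None
    for k in range(U):
        # Candidate scaling factor letting worker k transmit at full power
        b = math.sqrt(P_max[k]) * abs(h[k]) / (Kd[k] * (abs(w_prev) + eta))
        # Heaviside selection: worker i joins if its power budget admits b;
        # the tolerance absorbs float rounding at the equality boundary
        beta = [1 if P_max[i] - (Kd[i] * b * (abs(w_prev) + eta) / abs(h[i])) ** 2
                     >= -1e-9 else 0 for i in range(U)]
        agg = sum(Kd[i] * beta[i] for i in range(U))
        if agg == 0:
            continue
        # Entry-wise objective in the style of (36): noise + selection terms
        R = L * sigma2 / (2 * (agg * b) ** 2) + K * rho1 / (2 * L * agg)
        if best is None or R < best[0]:
            best = (R, b, beta)
    return best

R, b_opt, beta_opt = solve_P4(P_max=[1.0] * 4, h=[0.2, 0.9, 0.5, 1.4],
                              Kd=[30, 25, 40, 25], w_prev=0.3, eta=0.1,
                              L=10.0, sigma2=1e-3, rho1=0.1)
print(b_opt, beta_opt)
```

In this instance the smallest candidate scaling factor, which admits all four workers, attains the minimum; noisier channels or tighter power budgets would shift the optimum toward fewer, stronger workers.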

It is worth noting that the solution btb^{*}_{t} of P4 may exceed the maximum value allowed at a worker, due to the approximation introduced in Assumption 4. To strictly comply with the power constraint, each worker needs to take the following bounding step when sending its local parameters:

1) if |Kibtwi,thi,t|2Pimax\bigg{|}\frac{K_{i}b^{*}_{t}w_{i,t}}{h_{i,t}}\bigg{|}^{2}\leq P_{i}^{\max}, then the ii-th worker sends Kibtwi,thi,t\frac{K_{i}b^{*}_{t}w_{i,t}}{h_{i,t}} ;

2) otherwise, it sends Pimaxsgn(wi,t)\sqrt{P_{i}^{\max}}\text{sgn}(w_{i,t}), where sgn()\mbox{sgn}(\cdot) is the sign function.
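A minimal sketch of this two-case bounding step, assuming real positive channel gains (e.g., after phase compensation); not the paper's implementation.

```python
# A minimal sketch of the per-worker bounding step above: transmit the
# scaled parameter if it meets the power budget, else clip its magnitude
# to sqrt(P_max) while keeping the sign of w_it.
import math

def transmit_signal(K_i, b_t, w_it, h_it, P_max):
    """Return the analog signal the i-th worker actually sends."""
    x = K_i * b_t * w_it / h_it
    if x ** 2 <= P_max:
        return x  # case 1): within the power budget
    return math.copysign(math.sqrt(P_max), w_it)  # case 2): clipped

print(transmit_signal(K_i=30, b_t=0.05, w_it=0.4, h_it=0.2, P_max=1.0))  # clipped to 1.0
```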

Putting these together, we propose a joint optimization scheme for FL over the air (INFLOTA), summarized in Algorithm 1, which is a dynamic scheduling and power scaling policy. By using different RtR_{t}, our INFLOTA can be adjusted to all the considered cases, including the convex and non-convex ones, using either GD or SGD implementations.

Algorithm 1 The implementation of INFLOTA
0:    System parameters {Pimax,Ki,η}i=1U\{P_{i}^{\max},K_{i},\eta\}_{i=1}^{U};
1:  The PS initializes {𝐰0,b1,𝜷1}\{\mathbf{w}_{0},b_{1}^{*},\bm{\beta}^{*}_{1}\} and broadcasts them to all the local workers.
2:  for t=1:Tt=1:T do
3:     ​​​​At the workers:
4:     Computation: Iteratively update the local model via (III-A) where 𝐰\mathbf{w} = 𝐰t1\mathbf{w}_{t-1} is from the PS;
5:     Communication: Upon receiving (bt,𝜷t)(b_{t},\bm{\beta}_{t}), send sgn(wi,t)min(Kibt|wi,t|hi,t,PiMax)\text{sgn}(w_{i,t})\min\left(\frac{K_{i}b_{t}|w_{i,t}|}{h_{i,t}},\sqrt{P_{i}^{\texttt{Max}}}\right) to the PS, if βi,t=1,i,d\beta_{i,t}=1,\forall i,d;
6:     ​​​​At the PS:
7:     Calculate the global model 𝐰t\mathbf{w}_{t} via (III-B) from 𝐲t\mathbf{y}_{t} that is aggregated from local workers;
8:     for d=1:Dd=1:D do
9:        Calculate 𝒮\mathcal{S} from (43), which yields UU feasible points {(bt+1(k),𝜷t+1(k))}k=1U\{(b_{t+1}^{(k)},\bm{\beta}_{t+1}^{(k)})\}_{k=1}^{U};
10:        Solve P4 in (45) by using a line search over the UU feasible points to find the optimal {bt+1,𝜷t+1}\{b_{t+1}^{*},\bm{\beta}^{*}_{t+1}\} for given dd and tt;
11:     end for
12:     Send wtw_{t} and (bt+1,𝜷t+1)(b_{t+1},\bm{\beta}_{t+1}) (including all DD optimal {bt+1,𝜷t+1}\{b_{t+1}^{*},\bm{\beta}^{*}_{t+1}\}) to all the workers;
13:  end for
Remark 2.

(Optimality) P3 is equivalently reformulated as P4, which is solved by a search method with much reduced computational complexity thanks to the reduced search space. Comparing P3 to P2, the constraint (41b) of P3 is more restrictive than the constraint (38b) of P2. Since P3 reduces the feasible domain of P2, the solution to P3 cannot be superior to that of P2. Therefore, the optimal value of P3 is an upper bound on that of P2, i.e., the RtR_{t} obtained by solving P3 is no smaller than the true minimum of P2.

Remark 3.

(Complexity) Algorithm 1 provides a holistic solution for implementation of the overall FL at both the PS and workers sides. Its computational complexity is mainly determined by that of the optimization step in P4. The complexity order of the optimization step is low at 𝒪(U)\mathcal{O}(U), since the search space is reduced to UU points only via P4.

Remark 4.

(Implementation) To implement FL over the air in Algorithm 1, the PS must know the CSI, the number of data samples, and the maximum transmit power of all local workers. This information can be obtained by the PS when the local workers initially connect to it. Before the implementation of Algorithm 1, the PS must first broadcast the global model information to the workers. Notably, when P4 is incorporated into the implementation of FL, some workers need to send sgn(wi,t)min(Kibt|wi,t|hi,t,PiMax)\text{sgn}(w_{i,t})\min\left(\frac{K_{i}b_{t}|w_{i,t}|}{h_{i,t}},\sqrt{P_{i}^{\texttt{Max}}}\right) to meet the requirement on the maximum transmit power. Such a bounding method can be viewed as a quantization measure, which does not affect the convergence [39].

VI Simulation Results And Analysis

In the simulations, we evaluate the performance of the proposed INFLOTA for both linear regression and image classification tasks, which are based on a synthetic dataset and the MNIST dataset, respectively.

The considered network has U=20U=20 workers, whose maximum power is set to Pimax=Pmax=10P^{\max}_{i}=P^{\max}=10 mW for any i[1,U]i\in[1,U]. The receiver noise power at the PS is set to σ2=104\sigma^{2}=10^{-4} mW, i.e., SNR=Pmaxσ2=50SNR=\frac{P^{\max}}{\sigma^{2}}=50 dB. The wireless channel gains between the workers and the PS are generated from a Rayleigh fading model; here, hi,th_{i,t} is generated from an exponential distribution with unit mean, independently across ii and tt.

We use two baseline methods for comparison: a) an FL algorithm that assumes idealized wireless transmissions with error-free links to achieve perfect aggregation, and b) an FL algorithm that randomly determines the power scalar and worker selection. They are named Perfect aggregation and Random policy, respectively. In Random policy, the probability of each worker being selected is 50% and the power scalar is generated from an exponential distribution with unit mean.

VI-A Linear regression experiments

In the linear regression experiments, the synthetic data used to train the FL algorithm are generated randomly from [0,1][0,1]. The input xx and the output yy follow the function y=2x+1+0.4ny=-2x+1+0.4n, where nn follows the Gaussian distribution 𝒩(0,1)\mathcal{N}(0,1). The FL algorithm is used to model the relationship between xx and yy.
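A minimal sketch of this data-generation step; the seed and per-worker sample count are illustrative assumptions.

```python
# A minimal sketch of the synthetic linear-regression data: inputs drawn
# uniformly from [0, 1], labels from y = -2x + 1 + 0.4n with n ~ N(0, 1).
import random

def make_local_dataset(K_i, seed=0):
    """Generate one worker's local dataset of K_i (x, y) samples."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(K_i)]
    ys = [-2 * x + 1 + 0.4 * rng.gauss(0, 1) for x in xs]
    return xs, ys

xs, ys = make_local_dataset(K_i=100)
print(len(xs), len(ys))
```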

Figure 2: An example of implementing FL for linear regression.

Since linear regression involves only two parameters, we train a simple two-layer neural network with one neuron in each layer and no activation functions, which corresponds to the convex case. The loss function is the MSE between the model prediction y^\hat{y} and the labeled true output yy. The learning rate is set to 0.01.

Fig. 2 shows an example of using FL for linear regression. The optimal result of the linear regression is y=2x+1y=-2x+1, because the original data generation function is y=2x+1+0.4ny=-2x+1+0.4n. In Fig. 2, we can see that the most accurate approximation is achieved by Perfect aggregation, which is the ideal case that ignores the influence of wireless communication. Random policy accounts for the influence of wireless communication but performs no optimization; thus, its performance is the worst. Our proposed INFLOTA, which jointly considers the learning and the influence of wireless communication, performs closely to the ideal case. This is because our proposed INFLOTA optimizes worker selection and power control so as to reduce the effect of wireless transmission errors on FL.

Figure 3: MSE as the number of iterations varies.

In Fig. 3, we show how wireless transmission affects the convergence behavior of the global FL model training in terms of the value of the loss function. As the number of iterations increases, the MSE values of all the considered learning algorithms decrease at different rates, and eventually flatten out to reach their steady state. All schemes converge, but to different steady-state values. This behavior corroborates the results in Lemma 1 and Proposition 1 that the channel noise does not affect whether the FL algorithm converges, but it does affect the value that the FL algorithm converges to.

Figure 4: MSE as the number of workers varies.

Fig. 4 shows how the MSE varies with the total number of workers UU. In general, the MSE of all considered FL schemes decreases as UU increases. This is due to the fact that an increase in the number of workers leads to an increased volume of data available for FL training, and hence improved accuracy of the estimated model parameters. Moreover, as the number of workers increases, the effect of wireless transmission on the global FL model accuracy starts to diminish, because the data samples may already be sufficient for accurate training once UU exceeds a certain level.

Figure 5: MSE as the number of data samples per worker varies.

In Fig. 5, we present how the MSE changes with the average number of samples per worker K¯=K/U\bar{K}=K/U. The number of data samples per worker fluctuates around this average, i.e., we set Ki=round(uniform[K¯5,K¯+5])K_{i}=\text{round}(\text{uniform}[\bar{K}-5,\bar{K}+5]). As K¯\bar{K} increases, all of the considered learning algorithms have more data samples available for training, and hence the MSE of all considered FL algorithms decreases in Fig. 5. As the average number of data samples per worker continues to increase, the MSE improvement slows down and eventually saturates. This is because, beyond a certain point, the available data samples are already sufficient for training the FL model.

Figure 6: MSE as the noise variance varies.

Fig. 6 presents how the AWGN received at the PS affects the MSE. We can see that as the noise variance increases, the MSE values of all considered FL algorithms increase, except for Perfect aggregation. When the noise variance is small (e.g., less than 10110^{-1}), it has little effect on the performance of the FL algorithms.

VI-B Evaluation on the MNIST dataset

In order to evaluate the performance of our proposed INFLOTA in realistic application scenarios with real data, we train a multilayer perceptron (MLP) on the MNIST dataset555http://yann.lecun.com/exdb/mnist/ with a 784-neuron input layer, a 64-neuron hidden layer, and a 10-neuron softmax output layer, which is a non-convex case. We adopt cross entropy as the loss function, and the rectified linear unit (ReLU) as the activation function. The total number of parameters in the MLP is 50890. The learning rate α\alpha is set to 0.1. The MNIST dataset contains 60000 training samples and 10000 test samples. We randomly take out 500–1000 training samples and distribute them to the 20 local workers as their local data. Then the three trained FL models are tested on the 10000 test samples. We provide the results of cross entropy and test accuracy versus the iteration index tt in Fig. 7 and Fig. 8, respectively. Since the MNIST dataset is designed for handwritten digit identification, the test accuracy represents the identification accuracy. As we can see, our proposed INFLOTA outperforms Random policy, and achieves comparable performance to Perfect aggregation.

Figure 7: Cross entropy as the number of iterations varies.
Figure 8: Test accuracy as the number of iterations varies.

VII Conclusion

In this paper, we have studied the joint optimization of communications and FL over the air with analog aggregation, in which both worker selection and transmit power control are considered under the constraints of limited communication resources. For the convex and non-convex cases with either GD or SGD implementations, we have derived closed-form expressions for the expected convergence rate of the FL algorithm, which quantify the impact of resource-constrained wireless communications on FL under the analog aggregation paradigm. Through analyzing the expected convergence rate, we have proposed a joint optimization scheme of worker selection and power control, which mitigates the impact of wireless communications on the convergence and performance of the FL algorithm. More significantly, our joint optimization scheme is applicable to both the convex and non-convex cases, using either GD or SGD implementations. Simulation results show that the proposed optimization scheme is effective in mitigating the impact of wireless communications on FL.

Acknowledgments

We are very grateful to all reviewers who have helped improve the quality of this paper. This work was partly supported by the National Natural Science Foundation of China (Grant Nos. 61871023 and 61931001), Beijing Natural Science Foundation (Grant No. 4202054), and the National Science Foundation of the US (Grant Nos. 1741338 and 1939553).

Appendix A Proof of Theorem 1

Theorem 1 considers the full GD method for convex problems. Following the proof for gradient methods with noise in [33], we first present the inequality implied by Assumption 1, as follows

F(𝐰t)F(𝐰t1)+(𝐰t𝐰t1)TF(𝐰t1)+L2𝐰t𝐰t12.\displaystyle F(\mathbf{w}_{t})\leq F(\mathbf{w}_{t-1})+(\mathbf{w}_{t}-\mathbf{w}_{t-1})^{T}\nabla F(\mathbf{w}_{t-1})+\frac{L}{2}\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|^{2}. (46)

Employing a standard full GD method, the ii-th worker updates its local FL model parameter 𝐰i,t\mathbf{w}_{i,t} at the tt-th iteration by

𝐰i,t=𝐰t1αKik=1Kif(𝐰t1,𝐱i,k,𝐲i,k),i=1,2,,U.\mathbf{w}_{i,t}=\mathbf{w}_{t-1}-\frac{\alpha}{K_{i}}\sum_{k=1}^{K_{i}}\nabla f(\mathbf{w}_{t-1},\mathbf{x}_{i,k},\mathbf{y}_{i,k}),\quad i=1,2,...,U. (47)
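As a minimal sketch of this local update, the snippet below runs the full-GD step for a scalar model, with a squared-error loss standing in for the paper's general loss ff (an illustrative assumption).

```python
# A minimal sketch of the local full-GD update (47) for a scalar model
# y = w * x, using f(w; x, y) = (w*x - y)^2 / 2 as an illustrative loss.

def local_update(w_prev, data, alpha):
    """w_{i,t} = w_{t-1} - (alpha / K_i) * sum_k grad f(w_{t-1}; x_k, y_k)."""
    K_i = len(data)
    grad = sum((w_prev * x - y) * x for x, y in data) / K_i  # averaged gradient
    return w_prev - alpha * grad

# Repeated local updates drive w toward the least-squares fit (here w* = 2).
w = 0.0
for _ in range(100):
    w = local_update(w, [(1.0, 2.0), (2.0, 4.0)], alpha=0.3)
print(round(w, 6))  # prints 2.0
```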

Substituting (47) into (III-B), we have

𝐰t=\displaystyle\mathbf{w}_{t}= 𝐰t1+(i=1UKi𝜷i,t𝒃t)1𝐳t\displaystyle\mathbf{w}_{t-1}+\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}
α(i=1UKi𝜷i,t)1i=1Uk=1Ki𝜷i,tf(𝐰t1;𝐱i,k,yi,k)\displaystyle-\alpha\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}\bm{\beta}_{i,t}\odot\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},y_{i,k})
=\displaystyle= 𝐰t1α(F(𝐰t1)𝐨),\displaystyle\mathbf{w}_{t-1}-\alpha(\nabla F(\mathbf{w}_{t-1})-\mathbf{o}), (48)

where

𝐨=\displaystyle\mathbf{o}= F(𝐰t1)+(αi=1UKi𝜷i,t𝒃t)1𝐳t\displaystyle\nabla F(\mathbf{w}_{t-1})+\left(\alpha\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}
(i=1UKi𝜷i,t)1i=1Uk=1Ki𝜷i,tf(𝐰t1;𝐱i,k,yi,k).\displaystyle-\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}\bm{\beta}_{i,t}\odot\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},y_{i,k}). (49)

Given the learning rate α=1L\alpha=\frac{1}{L} (a special setting chosen for notational simplicity, without loss of generality), the expected objective 𝔼[F(𝐰t)]\mathbb{E}[F(\mathbf{w}_{t})] can be bounded as

𝔼[F(𝐰t)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})]\leq 𝔼[F(𝐰t1)α(F(𝐰t1)𝐨)TF(𝐰t1)+Lα22F(𝐰t1)𝐨2]\displaystyle\mathbb{E}\bigg{[}F(\mathbf{w}_{t-1})-\alpha(\nabla F(\mathbf{w}_{t-1})-\mathbf{o})^{T}\nabla F(\mathbf{w}_{t-1})+\frac{L\alpha^{2}}{2}\|\nabla F(\mathbf{w}_{t-1})-\mathbf{o}\|^{2}\bigg{]}
=(a)\displaystyle\overset{(a)}{=} 𝔼[F(𝐰t1)]12LF(𝐰t1)2+12L𝔼[𝐨2],\displaystyle\mathbb{E}[F(\mathbf{w}_{t-1})]-\frac{1}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}+\frac{1}{2L}\mathbb{E}[\|\mathbf{o}\|^{2}], (50)

where the step (a) is derived from the fact that

Lα22F(𝐰t1)𝐨2\displaystyle\frac{L\alpha^{2}}{2}\|\nabla F(\mathbf{w}_{t-1})-\mathbf{o}\|^{2} =12LF(𝐰t1)21L𝐨TF(𝐰t1)+12L𝐨2.\displaystyle=\frac{1}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}-\frac{1}{L}\mathbf{o}^{T}\nabla F(\mathbf{w}_{t-1})+\frac{1}{2L}\|\mathbf{o}\|^{2}. (51)

𝔼[𝐨2]\mathbb{E}[\|\mathbf{o}\|^{2}] can be derived as follows

𝔼[𝐨2]=\displaystyle\mathbb{E}[\|\mathbf{o}\|^{2}]= 𝔼[F(𝐰t1)+(αi=1UKi𝜷i,t𝒃t)1𝐳t\displaystyle\mathbb{E}\Bigg{[}\bigg{\|}\nabla F(\mathbf{w}_{t-1})+\left(\alpha\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}
(i=1UKi𝜷i,t)1i=1Uk=1Ki𝜷i,tf(𝐰t1;𝐱i,k,𝐲i,k)2]\displaystyle-\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}\bm{\beta}_{i,t}\odot\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})\bigg{\|}^{2}\Bigg{]}
=\displaystyle= 𝔼[i=1Uk=1Kif(𝐰t1;𝐱i,k,𝐲i,k)K+(αi=1UKi𝜷i,t𝒃t)1𝐳t\displaystyle\mathbb{E}\Bigg{[}\bigg{\|}\frac{\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})}{K}+\left(\alpha\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}
(i=1UKi𝜷i,t)1i=1Uk=1Ki𝜷i,tf(𝐰t1;𝐱i,k,𝐲i,k)2]\displaystyle-\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}\bm{\beta}_{i,t}\odot\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})\bigg{\|}^{2}\Bigg{]}
=\displaystyle= 𝔼[i=1U(𝟏K𝜷i,t(i=1UKi𝜷i,t)1)k=1Kif(𝐰t1;𝐱i,k,𝐲i,k)\displaystyle\mathbb{E}\Bigg{[}\bigg{\|}\sum_{i=1}^{U}\left(\frac{\mathbf{1}}{K}-\bm{\beta}_{i,t}\odot\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\right)\odot\sum_{k=1}^{K_{i}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})
+(αi=1UKi𝜷i,t𝒃t)1𝐳t2],\displaystyle+\left(\alpha\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}\bigg{\|}^{2}\Bigg{]}, (52)

where 𝟏\mathbf{1} is the all-11 vector of length DD, i.e., of the same dimension as 𝜷i,t\bm{\beta}_{i,t}.

Employing the triangle inequality of norms 𝐗+𝐘𝐗+𝐘\|\mathbf{X}+\mathbf{Y}\|\leq\|\mathbf{X}\|+\|\mathbf{Y}\|, the submultiplicative property of norms 𝐗𝐘𝐗𝐘\|\mathbf{X}\mathbf{Y}\|\leq\|\mathbf{X}\|\|\mathbf{Y}\|, and Jensen's inequality, (A) can be further bounded as follows

𝔼[𝐨2]\displaystyle\mathbb{E}[\|\mathbf{o}\|^{2}]\leq 𝔼[i=1U(𝟏K𝜷i,t(i=1UKi𝜷i,t)1)k=1Kif(𝐰t1;𝐱i,k,𝐲i,k)2]\displaystyle\mathbb{E}\Bigg{[}\Bigg{\|}\sum_{i=1}^{U}\left(\frac{\mathbf{1}}{K}-\bm{\beta}_{i,t}\odot\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\right)\odot\sum_{k=1}^{K_{i}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})\Bigg{\|}^{2}\Bigg{]}
+𝔼[(αi=1UKi𝜷i,t𝒃t)1𝐳t2]\displaystyle+\mathbb{E}\Bigg{[}\Bigg{\|}\left(\alpha\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}\Bigg{\|}^{2}\Bigg{]}
\displaystyle\leq 𝔼[Ki=1U𝟏K𝜷i,t(i=1UKi𝜷i,t)12k=1Kif(𝐰t1;𝐱i,k,𝐲i,k)2]\displaystyle\mathbb{E}\Bigg{[}K\sum_{i=1}^{U}\Bigg{\|}\frac{\mathbf{1}}{K}-\bm{\beta}_{i,t}\odot\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\Bigg{\|}^{2}\sum_{k=1}^{K_{i}}\|\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})\|^{2}\Bigg{]}
+𝔼[(αi=1UKi𝜷i,t𝒃t)1𝐳t2]\displaystyle+\mathbb{E}\Bigg{[}\left\|\left(\alpha\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}\right\|^{2}\Bigg{]}
\displaystyle\leq Ki=1U𝟏K𝜷i,t(i=1UKi𝜷i,t)12k=1Kif(𝐰t1;𝐱i,k,𝐲i,k)2\displaystyle K\sum_{i=1}^{U}\Bigg{\|}\frac{\mathbf{1}}{K}-\bm{\beta}_{i,t}\odot\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\Bigg{\|}^{2}\sum_{k=1}^{K_{i}}\|\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})\|^{2}
+(i=1UKi𝜷i,t𝒃t)12σ2L2.\displaystyle+\left\|\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\right\|^{2}\sigma^{2}L^{2}. (53)

Applying (12) in Assumption 3 to (53) leads to

𝔼[𝐨2]\displaystyle\mathbb{E}[\|\mathbf{o}\|^{2}]\leq Ki=1U𝟏K𝜷i,t(i=1UKi𝜷i,t)12Ki(ρ1+ρ2F(𝐰t1)2)\displaystyle K\sum_{i=1}^{U}\Bigg{\|}\frac{\mathbf{1}}{K}-\bm{\beta}_{i,t}\odot\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\right)^{\odot-1}\Bigg{\|}^{2}K_{i}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})
+(i=1UKi𝜷i,t𝒃t)12σ2L2\displaystyle+\left\|\left(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\right\|^{2}\sigma^{2}L^{2}
=\displaystyle= Ki=1Ud=1D(1Kβi,tdi=1UKiβi,td)2Ki(ρ1+ρ2F(𝐰t1)2)\displaystyle K\sum_{i=1}^{U}\sum_{d=1}^{D}\Bigg{(}\frac{1}{K}-\frac{\beta^{d}_{i,t}}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}\Bigg{)}^{2}K_{i}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})
+(i=1UKi𝜷i,t𝒃t)12σ2L2\displaystyle+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\sigma^{2}L^{2}
=\displaystyle= Ki=1Ud=1D(1K22Kβi,tdi=1UKiβi,td+(βi,td)2(i=1UKiβi,td)2)Ki(ρ1+ρ2F(𝐰t1)2)\displaystyle K\sum_{i=1}^{U}\sum_{d=1}^{D}\Bigg{(}\frac{1}{K^{2}}-\frac{2}{K}\frac{\beta^{d}_{i,t}}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}+\frac{(\beta^{d}_{i,t})^{2}}{(\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t})^{2}}\Bigg{)}K_{i}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})
+(i=1UKi𝜷i,t𝒃t)12σ2L2\displaystyle+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\sigma^{2}L^{2}
=\displaystyle= d=1D(Ki=1UKiβi,td1)(ρ1+ρ2F(𝐰t1)2)+(i=1UKi𝜷i,t𝒃t)12σ2L2.\displaystyle\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\sigma^{2}L^{2}. (54)

Substituting (54) into the smoothness-based descent bound, we have:

𝔼[F(𝐰t)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})]\leq 12L(d=1D(Ki=1UKiβi,td1)(ρ1+ρ2F(𝐰t1)2)\displaystyle\frac{1}{2L}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})
+(i=1UKi𝜷i,t𝒃t)12σ2L2)+𝔼[F(𝐰t1)]12LF(𝐰t1)2.\displaystyle+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\sigma^{2}L^{2}\Bigg{)}+\mathbb{E}[F(\mathbf{w}_{t-1})]-\frac{1}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}. (55)

Subtracting $\mathbb{E}[F(\mathbf{w}^{*})]$ from both sides of (55), we have:

𝔼[F(𝐰t)F(𝐰)]𝔼[F(𝐰t1)F(𝐰)]12LF(𝐰t1)2\displaystyle\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})]-\frac{1}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}
+12L(d=1D(Ki=1UKiβi,td1)(ρ1+ρ2F(𝐰t1)2)+(i=1UKi𝜷i,t𝒃t)12σ2L2).\displaystyle+\frac{1}{2L}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\sigma^{2}L^{2}\Bigg{)}. (56)

Minimizing both sides of (11), we have

min𝐰tF(𝐰t)min𝐰t(\displaystyle\min_{\mathbf{w}_{t}}F(\mathbf{w}_{t})\geq\min_{\mathbf{w}_{t}}( F(𝐰t1)+(𝐰t𝐰t1)TF(𝐰t1)+μ2𝐰t𝐰t12).\displaystyle F(\mathbf{w}_{t-1})+(\mathbf{w}_{t}-\mathbf{w}_{t-1})^{T}\nabla F(\mathbf{w}_{t-1})+\frac{\mu}{2}\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|^{2}). (57)

The left-hand side is minimized at $\mathbf{w}_{t}=\mathbf{w}^{*}$, while the right-hand side is minimized at $\mathbf{w}_{t}=\mathbf{w}_{t-1}-\frac{1}{\mu}\nabla F(\mathbf{w}_{t-1})$. Thus, we have

F(𝐰)F(𝐰t1)12μF(𝐰t1)2.\displaystyle F(\mathbf{w}^{*})\geq F(\mathbf{w}_{t-1})-\frac{1}{2\mu}\|\nabla F(\mathbf{w}_{t-1})\|^{2}. (58)

Then

F(𝐰t1)22μ(F(𝐰t1)F(𝐰)).\displaystyle\|\nabla F(\mathbf{w}_{t-1})\|^{2}\geq 2\mu(F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})). (59)
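As a numerical sanity check, the strong-convexity consequence (59), $\|\nabla F(\mathbf{w})\|^{2}\geq 2\mu(F(\mathbf{w})-F(\mathbf{w}^{*}))$, can be verified on a toy quadratic. This is a minimal sketch: the quadratic objective and the values of $\mu$ and $L$ below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Toy mu-strongly convex, L-smooth objective: F(w) = 0.5 * w^T A w,
# with eigenvalues mu and L, minimized at w* = 0 with F(w*) = 0.
mu, L = 0.5, 2.0
A = np.diag([mu, L])
F = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

rng = np.random.default_rng(0)
for _ in range(1000):
    w = rng.standard_normal(2)
    # Inequality (59): ||grad F(w)||^2 >= 2 * mu * (F(w) - F(w*))
    assert np.linalg.norm(grad(w)) ** 2 >= 2 * mu * F(w) - 1e-12
```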

Substituting (59) into (56), we get

𝔼[F(𝐰t)F(𝐰)](1μL)𝔼[F(𝐰t1)F(𝐰)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq(1-\frac{\mu}{L})\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})]
+12L(d=1D(Ki=1UKiβi,td1)(ρ1+ρ2F(𝐰t1)2)+(i=1UKi𝜷i,t𝒃t)12σ2L2)\displaystyle+\frac{1}{2L}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\sigma^{2}L^{2}\Bigg{)}
=(1μL)𝔼[F(𝐰t1)F(𝐰)]+ρ22L(d=1D(Ki=1UKiβi,td1))F(𝐰t1)2\displaystyle=(1-\frac{\mu}{L})\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})]+\frac{\rho_{2}}{2L}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}\Bigg{)}\|\nabla F(\mathbf{w}_{t-1})\|^{2}
+ρ12Ld=1D(Ki=1UKiβi,td1)+(i=1UKi𝜷i,t𝒃t)12Lσ22.\displaystyle+\frac{\rho_{1}}{2L}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\frac{L\sigma^{2}}{2}. (60)

Next, in the same way that (59) is derived, minimizing both sides of the smoothness upper bound (46) yields

F(𝐰t1)22L(F(𝐰t1)F(𝐰)).\displaystyle\|\nabla F(\mathbf{w}_{t-1})\|^{2}\leq 2L(F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})). (61)

Substituting (61) into (60), we get

𝔼[F(𝐰t)F(𝐰)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq (1μL+ρ2d=1D(Ki=1UKiβi,td1))𝔼[F(𝐰t1)F(𝐰)]\displaystyle\Bigg{(}1-\frac{\mu}{L}+\rho_{2}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}\Bigg{)}\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})]
+ρ12Ld=1D(Ki=1UKiβi,td1)+(i=1UKi𝜷i,t𝒃t)12Lσ22\displaystyle+\frac{\rho_{1}}{2L}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\frac{L\sigma^{2}}{2}
=\displaystyle= Bt+At𝔼[F(𝐰t1)F(𝐰)],\displaystyle B_{t}+A_{t}\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})], (62)

where

At=1μL+ρ2d=1D(Ki=1UKiβi,td1),\displaystyle A_{t}=1-\frac{\mu}{L}+\rho_{2}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}, (63)
Bt\displaystyle B_{t} =ρ12Ld=1D(Ki=1UKiβi,td1)+(i=1UKi𝜷i,t𝒃t)12Lσ22.\displaystyle=\frac{\rho_{1}}{2L}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\frac{L\sigma^{2}}{2}. (64)

The proof is completed.
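As a side remark, the one-step bound (62) implies linear convergence to an error floor whenever $A_{t}<1$: iterating $e_{t}\leq B+Ae_{t-1}$ with constant $A$ and $B$ gives $e_{t}\leq A^{t}e_{0}+B\frac{1-A^{t}}{1-A}\rightarrow\frac{B}{1-A}$. A minimal numeric sketch, where the values of $A$, $B$, and the initial gap are illustrative assumptions:

```python
# Iterate the optimality-gap recursion e_t = B + A * e_{t-1} from (62)
A, B = 0.9, 0.05      # illustrative contraction factor A_t and floor term B_t
e = 10.0              # illustrative initial gap E[F(w_0) - F(w*)]
for t in range(200):
    e = B + A * e
floor = B / (1 - A)   # geometric-series limit of the recursion
print(e, floor)       # e has essentially reached the floor B/(1-A) = 0.5
```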

Appendix B Proof of Theorem 2

Theorem 2 considers the full GD method for non-convex problems. The proof of Theorem 2 follows that of Theorem 1 up to (55). From (55), we have

𝔼[F(𝐰t)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})]\leq 𝔼[F(𝐰t1)]12L(1ρ2d=1D(Ki=1UKiβi,td1))F(𝐰t1)2\displaystyle\mathbb{E}[F(\mathbf{w}_{t-1})]-\frac{1}{2L}\Bigg{(}1-\rho_{2}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}\Bigg{)}\|\nabla F(\mathbf{w}_{t-1})\|^{2}
+ρ12Ld=1D(Ki=1UKiβi,td1)+(i=1UKi𝜷i,t𝒃t)12Lσ22\displaystyle+\frac{\rho_{1}}{2L}\sum_{d=1}^{D}\Bigg{(}\frac{K}{\sum_{i=1}^{U}K_{i}\beta^{d}_{i,t}}-1\Bigg{)}+\left\|(\sum_{i=1}^{U}K_{i}\bm{\beta}_{i,t}\odot\bm{b}_{t})^{\odot-1}\right\|^{2}\frac{L\sigma^{2}}{2}
=𝔼[F(𝐰t1)]2AtμL2LF(𝐰t1)2+Bt.\displaystyle=\mathbb{E}[F(\mathbf{w}_{t-1})]-\frac{2-A_{t}-\frac{\mu}{L}}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}+B_{t}. (65)

Summing up the above inequality from t=1t=1 to t=Tt=T, we get

\displaystyle\mathbb{E}[F(\mathbf{w}_{T})]-\mathbb{E}[F(\mathbf{w}_{0})]\leq-\sum_{t=1}^{T}\frac{2-A_{t}-\frac{\mu}{L}}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}+\sum_{t=1}^{T}B_{t}, (66)

which leads to

\displaystyle\sum_{t=1}^{T}\frac{2-A_{t}-\frac{\mu}{L}}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}\leq\mathbb{E}[F(\mathbf{w}_{0})]-\mathbb{E}[F(\mathbf{w}_{T})]+\sum_{t=1}^{T}B_{t}
\displaystyle\leq\mathbb{E}[F(\mathbf{w}_{0})]-\mathbb{E}[F(\mathbf{w}^{*})]+\sum_{t=1}^{T}B_{t}. (67)

Recalling Proposition 1, we have

01ρ2D(KKmin1)2L2AtμL2L12L,t.\displaystyle 0\leq\frac{1-\rho_{2}D(\frac{K}{K_{min}}-1)}{2L}\leq\frac{2-A_{t}-\frac{\mu}{L}}{2L}\leq\frac{1}{2L},\quad\forall t. (68)

Substituting (68) into (67), we get

1Tt=1T1ρ2D(KKmin1)2LF(𝐰t1)2\displaystyle\frac{1}{T}\sum_{t=1}^{T}\frac{1-\rho_{2}D(\frac{K}{K_{min}}-1)}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2} 1Tt=1T2AtμL2LF(𝐰t1)2\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}\frac{2-A_{t}-\frac{\mu}{L}}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}
1T(𝔼[F(𝐰0)]𝔼[F(𝐰)])+1Tt=1TBt.\displaystyle\leq\frac{1}{T}(\mathbb{E}[F(\mathbf{w}_{0})]-\mathbb{E}[F(\mathbf{w}^{*})])+\frac{1}{T}\sum_{t=1}^{T}B_{t}. (69)

As a result, we have the conclusion in Theorem 2,

\displaystyle\frac{1}{T}\sum_{t=1}^{T}\|\nabla F(\mathbf{w}_{t-1})\|^{2}\leq\frac{2L\left(\mathbb{E}[F(\mathbf{w}_{0})]-\mathbb{E}[F(\mathbf{w}^{*})]\right)}{T(1-\rho_{2}D(\frac{K}{K_{min}}-1))}
+\frac{2L\sum_{t=1}^{T}B_{t}}{T(1-\rho_{2}D(\frac{K}{K_{min}}-1))}. (70)

Appendix C Proof of Theorem 3

Exploiting the SGD method, the local parameter of the $i$-th worker is updated at the $t$-th iteration by

𝐰i,t=𝐰t1α𝔼𝒟i[k=1Kbf(𝐰t1,𝐱i,k,𝐲i,k)Kb],i=1,2,,U,\displaystyle\mathbf{w}_{i,t}=\mathbf{w}_{t-1}-\alpha\mathbb{E}_{\mathcal{D}_{i}}\left[\frac{\sum_{k=1}^{K_{b}}\nabla f(\mathbf{w}_{t-1},\mathbf{x}_{i,k},\mathbf{y}_{i,k})}{K_{b}}\right],\quad i=1,2,...,U, (71)

where $\mathbb{E}_{\mathcal{D}_{i}}[\cdot]$ denotes the expectation over the $K_{b}$ samples that the $i$-th worker randomly draws from its local dataset $\mathcal{D}_{i}$ to compute the local gradient.
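In code, the local update (71) amounts to drawing $K_{b}$ samples without replacement from $\mathcal{D}_{i}$ and stepping along the averaged per-sample gradient. A minimal sketch for a least-squares local loss; the loss, dataset, and hyperparameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D_dim = 3                               # model dimension D
X = rng.standard_normal((20, D_dim))    # local dataset D_i of worker i (K_i = 20)
y = rng.standard_normal(20)
alpha, K_b = 0.1, 5                     # learning rate and mini-batch size

def local_sgd_step(w):
    # Draw K_b samples from D_i and average their gradients, as in (71)
    idx = rng.choice(len(X), size=K_b, replace=False)
    # Gradient of the per-sample loss 0.5 * (x_k^T w - y_k)^2
    grads = [(X[k] @ w - y[k]) * X[k] for k in idx]
    return w - alpha * np.mean(grads, axis=0)

w = local_sgd_step(np.zeros(D_dim))
```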

Substituting (71) into (III-B), we reach an averaged gradient estimate as

𝐰t=\displaystyle\mathbf{w}_{t}= 𝐰t1+(i=1UKb𝜷i,t𝒃t)1𝐳t\displaystyle\mathbf{w}_{t-1}+\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}
α(i=1UKb𝜷i,t)1i=1U(Kb𝜷i,t𝔼𝒟i[k=1Kbf(𝐰t1,𝐱i,k,𝐲i,k)Kb])\displaystyle-\alpha\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}\left(K_{b}\bm{\beta}_{i,t}\odot\mathbb{E}_{\mathcal{D}_{i}}\left[\frac{\sum_{k=1}^{K_{b}}\nabla f(\mathbf{w}_{t-1},\mathbf{x}_{i,k},\mathbf{y}_{i,k})}{K_{b}}\right]\right)
=\displaystyle= 𝐰t1α(F(𝐰t1)𝐨),\displaystyle\mathbf{w}_{t-1}-\alpha(\nabla F(\mathbf{w}_{t-1})-\mathbf{o}), (72)

where

𝐨=\displaystyle\mathbf{o}= F(𝐰t1)+(αi=1UKb𝜷i,t𝒃t)1𝐳t\displaystyle\nabla F(\mathbf{w}_{t-1})+\left(\alpha\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}
(i=1UKb𝜷i,t)1i=1U(𝜷i,t𝔼𝒟i[k=1Kbf(𝐰t1,𝐱i,k,𝐲i,k)]).\displaystyle-\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}\left(\bm{\beta}_{i,t}\odot\mathbb{E}_{\mathcal{D}_{i}}\left[\sum_{k=1}^{K_{b}}\nabla f(\mathbf{w}_{t-1},\mathbf{x}_{i,k},\mathbf{y}_{i,k})\right]\right). (73)

Let $\mathcal{N}_{i,t}$ denote the set of samples that are not chosen by the $i$-th worker at the $t$-th iteration, and let $\overline{\mathcal{N}}_{i,t}$ denote its complement (the chosen samples). Then $\mathbb{E}[\|\mathbf{o}\|^{2}]$ can be bounded as follows

𝔼[𝐨2]=\displaystyle\mathbb{E}[\|\mathbf{o}\|^{2}]= 𝔼[i=1Uk=1Kif(𝐰t1;𝐱i,k,𝐲i,k)K+(αi=1UKb𝜷i,t𝒃t)1𝐳t\displaystyle\mathbb{E}\Bigg{[}\bigg{\|}\frac{\sum_{i=1}^{U}\sum_{k=1}^{K_{i}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})}{K}+\left(\alpha\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}
(i=1UKb𝜷i,t)1i=1U(𝜷i,t𝔼𝒟i[k=1Kbf(𝐰t1,𝐱i,k,𝐲i,k)])2]\displaystyle-\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\right)^{\odot-1}\odot\sum_{i=1}^{U}\left(\bm{\beta}_{i,t}\odot\mathbb{E}_{\mathcal{D}_{i}}\left[\sum_{k=1}^{K_{b}}\nabla f(\mathbf{w}_{t-1},\mathbf{x}_{i,k},\mathbf{y}_{i,k})\right]\right)\bigg{\|}^{2}\Bigg{]}
=\displaystyle= 𝔼[i=1U(𝟏K𝜷i,t(i=1UKb𝜷i,t)1)𝔼𝒩¯i,t[k𝒩¯i,tf(𝐰t1;𝐱i,k,𝐲i,k)]\displaystyle\mathbb{E}\Bigg{[}\bigg{\|}\sum_{i=1}^{U}\left(\frac{\mathbf{1}}{K}-\bm{\beta}_{i,t}\odot\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\right)^{\odot-1}\right)\odot\mathbb{E}_{\overline{\mathcal{N}}_{i,t}}\left[\sum_{k\in\overline{\mathcal{N}}_{i,t}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})\right]
+i=1U𝔼[k𝒩i,tf(𝐰t1;𝐱i,k,𝐲i,k)]K+(αi=1UKb𝜷i,t𝒃t)1𝐳t2],\displaystyle+\frac{\sum_{i=1}^{U}\mathbb{E}[\sum_{k\in\mathcal{N}_{i,t}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})]}{K}+\left(\alpha\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\odot\mathbf{z}_{t}\bigg{\|}^{2}\Bigg{]},
\displaystyle\leq (i=1UKb)i=1U𝟏K𝜷i,t(i=1UKb𝜷i,t)12𝔼𝒩¯i,t[k𝒩¯i,tf(𝐰t1;𝐱i,k,𝐲i,k)2]\displaystyle\left(\sum_{i=1}^{U}K_{b}\right)\sum_{i=1}^{U}\bigg{\|}\frac{\mathbf{1}}{K}-\bm{\beta}_{i,t}\odot\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\right)^{\odot-1}\bigg{\|}^{2}\mathbb{E}_{\overline{\mathcal{N}}_{i,t}}\left[\sum_{k\in\overline{\mathcal{N}}_{i,t}}\|\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})\|^{2}\right]
+i=1U𝔼[k𝒩i,tf(𝐰t1;𝐱i,k,𝐲i,k)]2K2+(i=1UKb𝜷i,t𝒃t)12σ2L2.\displaystyle+\frac{\|\sum_{i=1}^{U}\mathbb{E}[\sum_{k\in\mathcal{N}_{i,t}}\nabla f(\mathbf{w}_{t-1};\mathbf{x}_{i,k},\mathbf{y}_{i,k})]\|^{2}}{K^{2}}+\left\|\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\right\|^{2}\sigma^{2}L^{2}. (74)

Applying Assumption 3, we get

𝔼[𝐨2]\displaystyle\mathbb{E}[\|\mathbf{o}\|^{2}]\leq (i=1UKb)d=1D((i=1UKb)2KK2+1i=1UKbβi,td)(ρ1+ρ2F(𝐰t1)2)+\displaystyle\left(\sum_{i=1}^{U}K_{b}\right)\sum_{d=1}^{D}\left(\frac{\left(\sum_{i=1}^{U}K_{b}\right)-2K}{K^{2}}+\frac{1}{\sum_{i=1}^{U}K_{b}\beta^{d}_{i,t}}\right)(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})+
(i=1U(KiKb))2K2(ρ1+ρ2F(𝐰t1)2)+(i=1UKb𝜷i,t𝒃t)12σ2L2.\displaystyle\frac{(\sum_{i=1}^{U}(K_{i}-K_{b}))^{2}}{K^{2}}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})+\left\|\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\right\|^{2}\sigma^{2}L^{2}. (75)

Substituting (75) into the smoothness-based descent bound used in Appendix A, we have

𝔼[F(𝐰t)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})]\leq 12L((i=1UKb𝜷i,t𝒃t)12σ2L2\displaystyle\frac{1}{2L}\Bigg{(}\left\|\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\right\|^{2}\sigma^{2}L^{2}
+(d=1D((i=1UKb)22K(i=1UKb)K2+(i=1UKb)i=1UKbβi,td)\displaystyle+\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{\left(\sum_{i=1}^{U}K_{b}\right)^{2}-2K\left(\sum_{i=1}^{U}K_{b}\right)}{K^{2}}+\frac{\left(\sum_{i=1}^{U}K_{b}\right)}{\sum_{i=1}^{U}K_{b}\beta^{d}_{i,t}}\Bigg{)}
+(i=1U(KiKb))2K2))(ρ1+ρ2F(𝐰t1)2))\displaystyle+\frac{(\sum_{i=1}^{U}(K_{i}-K_{b}))^{2}}{K^{2}}\Bigg{)}\Bigg{)}(\rho_{1}+\rho_{2}\|\nabla F(\mathbf{w}_{t-1})\|^{2})\Bigg{)}
+𝔼[F(𝐰t1)]12LF(𝐰t1)2.\displaystyle+\mathbb{E}[F(\mathbf{w}_{t-1})]-\frac{1}{2L}\|\nabla F(\mathbf{w}_{t-1})\|^{2}. (76)

Subtracting $\mathbb{E}[F(\mathbf{w}^{*})]$ from both sides of (76), and applying (59) and (61), we get

𝔼[F(𝐰t)F(𝐰)]BtSGD+AtSGD𝔼[F(𝐰t1)F(𝐰)],\displaystyle\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq B^{SGD}_{t}+A^{SGD}_{t}\mathbb{E}[F(\mathbf{w}_{t-1})-F(\mathbf{w}^{*})], (77)

where

AtSGD=\displaystyle A^{SGD}_{t}= 1μL+ρ2(d=1D((i=1UKb)22K(i=1UKb)K2+(i=1UKb)i=1UKbβi,td)\displaystyle 1-\frac{\mu}{L}+\rho_{2}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{(\sum_{i=1}^{U}K_{b})^{2}-2K(\sum_{i=1}^{U}K_{b})}{K^{2}}+\frac{(\sum_{i=1}^{U}K_{b})}{\sum_{i=1}^{U}K_{b}\beta^{d}_{i,t}}\Bigg{)}
+(i=1U(KiKb))2K2),\displaystyle+\frac{(\sum_{i=1}^{U}(K_{i}-K_{b}))^{2}}{K^{2}}\Bigg{)}, (78)
BtSGD=\displaystyle B^{SGD}_{t}= ρ12L(d=1D((i=1UKb)22K(i=1UKb)K2+(i=1UKb)i=1UKbβi,td)\displaystyle\frac{\rho_{1}}{2L}\Bigg{(}\sum_{d=1}^{D}\Bigg{(}\frac{(\sum_{i=1}^{U}K_{b})^{2}-2K(\sum_{i=1}^{U}K_{b})}{K^{2}}+\frac{(\sum_{i=1}^{U}K_{b})}{\sum_{i=1}^{U}K_{b}\beta^{d}_{i,t}}\Bigg{)}
+\frac{(\sum_{i=1}^{U}(K_{i}-K_{b}))^{2}}{K^{2}}\Bigg{)}+\left\|\left(\sum_{i=1}^{U}K_{b}\bm{\beta}_{i,t}\odot\bm{b}_{t}\right)^{\odot-1}\right\|^{2}\frac{L\sigma^{2}}{2}. (79)

Applying (77) recursively, we have

𝔼[F(𝐰t)F(𝐰)]\displaystyle\mathbb{E}[F(\mathbf{w}_{t})-F(\mathbf{w}^{*})]\leq i=1t1j=1iAt+1jSGDBtiSGD+BtSGD+j=1tAjSGD𝔼[F(𝐰0)F(𝐰)].\displaystyle\sum_{i=1}^{t-1}\prod_{j=1}^{i}A^{SGD}_{t+1-j}B^{SGD}_{t-i}+B^{SGD}_{t}+\prod_{j=1}^{t}A^{SGD}_{j}\mathbb{E}[F(\mathbf{w}_{0})-F(\mathbf{w}^{*})]. (80)

This completes the proof.
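The unrolled bound (80) is simply (77) applied $t$ times. The product-sum form can be cross-checked against direct iteration; the per-iteration constants below are illustrative assumptions:

```python
import math

A = [0.9, 0.8, 0.95, 0.85]   # illustrative A_t^{SGD}, t = 1, ..., 4
B = [0.1, 0.2, 0.05, 0.15]   # illustrative B_t^{SGD}
e0 = 5.0                     # illustrative initial gap E[F(w_0) - F(w*)]
t = len(A)

# Direct iteration of (77): e_t = B_t + A_t * e_{t-1}
e = e0
for a, b in zip(A, B):
    e = b + a * e

# Closed-form unrolling (80): sum over i of (prod_{j=1}^{i} A_{t+1-j}) * B_{t-i},
# plus B_t, plus (prod_j A_j) * e_0   (1-indexed A_t stored in A[t-1])
unrolled = (sum(math.prod(A[t - j] for j in range(1, i + 1)) * B[t - 1 - i]
                for i in range(1, t))
            + B[t - 1]
            + math.prod(A) * e0)
assert abs(e - unrolled) < 1e-12
```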

Appendix D Proof of Theorem 4

To minimize $R_{t}$, it can be seen from (35), (36) and (37) that we should maximize the number of selected workers and the transmit power scaling factor in the $t$-th iteration. Thus, the selected workers should send their parameters at their maximum power. To reach the desired parameter aggregation at the PS as in (5), each worker needs to use the same transmit power scaling factor $b_{t}$, which is the parameter to be optimized ($b_{t}$ determines the worker selection). According to (35), (36) and (37), a larger $b_{t}$ leads to a smaller $R_{t}$. On the other hand, (41b) indicates that a larger $b_{t}$ results in fewer workers being selected, which in turn increases $R_{t}$.

Rewriting (38b) and replacing $|w_{i,t}|$ with $(|w_{t-1}|+\eta)$, we obtain the maximum acceptable $b_{t}$ of the $i$-th worker as

bi,tmax=|Pimaxhi,tKi(|wt1|+η)|.\displaystyle b_{i,t}^{\max}=\left|\frac{\sqrt{P_{i}^{\max}}h_{i,t}}{K_{i}(|w_{t-1}|+\eta)}\right|. (81)

Accordingly, $b_{t}$ should be chosen from $\{b_{i,t}^{\max}\}_{i=1}^{U}$. Once $b_{t}$ is determined, $\bm{\beta}_{t}$ can be determined by verifying whether the transmit power meets the condition in (7). As a result, we obtain a reduced solution space of the optimization problem P2 as

𝒮=\displaystyle\mathcal{S}= {{(bt(k),βi,t(k))}k=1U|bt(k)=bk,tmax,𝜷t(k)(bt(k))=[β1,t(k),,βU,t(k)],k=1,,U},\displaystyle\Bigg{\{}\{(b_{t}^{(k)},\beta_{i,t}^{(k)})\}_{k=1}^{U}\bigg{|}b_{t}^{(k)}=b_{k,t}^{\max},\bm{\beta}_{t}^{(k)}(b_{t}^{(k)})=[\beta^{(k)}_{1,t},\dots,\beta^{(k)}_{U,t}],k=1,\dots,U\Bigg{\}}, (82)

with

\beta^{(k)}_{i,t}=H\bigg{(}P_{i}^{\max}-\bigg{|}\frac{K_{i}b^{(k)}_{t}(|w_{t-1}|+\eta)}{h_{i,t}}\bigg{|}\bigg{)},\quad i=1,\dots,U, (83)

where

H(x)=\left\{\begin{aligned}1&,&x>0,\\0&,&x\leq 0.\end{aligned}\right. (84)

is the Heaviside step function.
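Putting Theorem 4 together, the optimization reduces to a finite-set search: try $b_{t}=b_{k,t}^{\max}$ for each $k$, derive the induced worker selection via the Heaviside test, and keep the candidate with the smallest objective. A minimal sketch; the channel gains, power budgets, and the surrogate objective used in place of $R_{t}$ are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
U = 6
P_max = rng.uniform(0.5, 2.0, U)     # per-worker power budgets P_i^max
h = rng.uniform(0.2, 1.5, U)         # channel gains |h_{i,t}|
K_i = rng.integers(10, 50, size=U)   # local dataset sizes K_i
w_amp, eta, sigma2 = 1.0, 0.1, 0.01  # |w_{t-1}|, eta, noise power (illustrative)

# Maximum acceptable power scaling factor of each worker, as in (81)
b_max = np.sqrt(P_max) * h / (K_i * (w_amp + eta))

def selection(b):
    # Heaviside-style test as in (83): worker i is selected iff its power
    # budget covers the transmit power implied by b (tolerance absorbs rounding)
    return (P_max - (K_i * b * (w_amp + eta) / h) ** 2 >= -1e-9).astype(int)

def objective(b, beta):
    # Illustrative surrogate for R_t: penalize both a small selected data
    # fraction and a weak effective scaling (assumed form, not the paper's)
    K_sel = np.sum(K_i * beta)
    if K_sel == 0:
        return np.inf
    return (np.sum(K_i) / K_sel - 1) + sigma2 / (b * K_sel) ** 2

# Finite-set search over the U candidates in (82)
best_obj, best_b = min((objective(b, selection(b)), b) for b in b_max)
print("optimal b_t:", best_b, "objective:", best_obj)
```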
