
Efficient Federated Meta-Learning over Multi-Access Wireless Networks

Sheng Yue, Ju Ren, Jiang Xin, Deyu Zhang, Yaoxue Zhang, and Weihua Zhuang. Sheng Yue, Jiang Xin, and Deyu Zhang are with the School of Computer Science and Engineering, Central South University, Changsha, 410083 China. Emails: {sheng.yue, xinjiang, zdy876}@csu.edu.cn. Ju Ren and Yaoxue Zhang are with the Department of Computer Science and Technology, Tsinghua University, Beijing, 100084 China. Emails: {renju, zhangyx}@tsinghua.edu.cn. Weihua Zhuang is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. Email: [email protected].
Abstract

Federated meta-learning (FML) has emerged as a promising paradigm to cope with the data limitation and heterogeneity challenges in today’s edge learning arena. However, its performance is often limited by slow convergence and corresponding low communication efficiency. In addition, since the available radio spectrum and IoT devices’ energy capacity are usually insufficient, it is crucial to control the resource allocation and energy consumption when deploying FML in practical wireless networks. To overcome these challenges, in this paper, we rigorously analyze the contribution of each device to the global loss reduction in each round and develop an FML algorithm (called NUFM) with a non-uniform device selection scheme to accelerate the convergence. After that, we formulate a resource allocation problem integrating NUFM in multi-access wireless systems to jointly improve the convergence rate and minimize the wall-clock time along with energy cost. By deconstructing the original problem step by step, we devise a joint device selection and resource allocation strategy to solve the problem with theoretical guarantees. Further, we show that the computational complexity of NUFM can be reduced from $O(d^2)$ to $O(d)$ (with model dimension $d$) by combining two first-order approximation techniques. Extensive simulation results demonstrate the effectiveness and superiority of the proposed methods in comparison with existing baselines.

Index Terms:
Federated meta-learning, multi-access systems, device selection, resource allocation, efficiency.

I Introduction

The integration of Artificial Intelligence (AI) and Internet-of-Things (IoT) has led to a proliferation of studies on edge intelligence, aiming at pushing AI frontiers to the wireless network edge proximal to IoT devices and data sources [1]. It is expected that edge intelligence will reduce time-to-action latency down to milliseconds for IoT applications while minimizing network bandwidth and offering security guarantees [2]. However, a general consensus is that a single IoT device can hardly realize edge intelligence due to its limited computational and storage capabilities. Accordingly, it is natural to rely on collaboration in edge learning, whereby IoT devices work together to accomplish computation-intensive tasks [3].

Building on a synergy of federated learning [4] and meta-learning [5], federated meta-learning (FML) has been proposed under a common theme of fostering edge-edge collaboration [6, 7, 8]. In FML, IoT devices join forces to learn an initial shared model under the orchestration of a central server such that current or new devices can quickly adapt the learned model to their local datasets via one or a few gradient descent steps. Notably, FML can keep all the benefits of the federated learning paradigm (such as simplicity, data security, and flexibility), while giving a more personalized model for each device to capture the differences among tasks [9]. Therefore, FML has emerged as a promising approach to tackle the heterogeneity challenges in federated learning and to facilitate efficient edge learning [10].

Despite its promising benefits, FML comes with new challenges. On one hand, the number of participating devices can be enormous. Selecting devices uniformly at random, as is often done in existing methods, leads to slow convergence [11, 9]. Although recent studies [12, 13, 14] have characterized the convergence of federated learning and proposed non-uniform device selection mechanisms, they cannot be directly applied to FML problems due to the bias and high-order information in the stochastic gradients. On the other hand, the performance of FML in a wireless environment is highly related to its wall-clock time, including the computation time (determined by local data sizes and devices’ CPU types) and communication time (depending on channel gains, interference, and transmission power) [15, 16]. If not properly controlled, a large wall-clock time can cause unexpected training delays and communication inefficiency. In addition, DNN model training, which involves a large number of data samples and epochs, usually induces high computational cost, especially for sophisticated model structures consisting of millions of parameters [15]. Due to the limited power capacity of IoT devices, energy consumption should also be properly managed to ensure system sustainability and stability [14]. In a nutshell, to efficiently deploy FML in today’s wireless systems, the strategy of device selection and resource allocation must be carefully crafted not only to accelerate the learning process but also to control the training wall-clock time and the energy cost of edge devices. Unfortunately, despite their importance, there are limited studies on these aspects in the current literature.

In this paper, we tackle the above challenges in two steps: 1) we develop an algorithm (called NUFM) with a non-uniform device selection scheme to improve the convergence rate of the vanilla FML algorithm; 2) based on NUFM, we propose a resource allocation strategy (called URAL) that jointly optimizes the convergence speed, wall-clock time, and energy consumption in the context of multi-access wireless systems. More specifically, we first rigorously quantify the contribution of each device to the convergence of FML by deriving a tight lower bound on the one-round reduction of the global loss. Based on this quantitative result, we present a non-uniform device selection scheme that maximizes the loss reduction per round, followed by the NUFM algorithm. Then, we formulate a resource allocation problem for NUFM over wireless networks, capturing the trade-offs among convergence, wall-clock time, and energy cost. To solve this problem, we exploit its special structure and decompose it into two sub-problems. The first one minimizes the computation time by controlling devices’ CPU-cycle frequencies and is solved optimally based on an analysis of the effect of device heterogeneity on the objective. The second sub-problem optimizes the resource block allocation and transmission power management. It is a non-convex mixed-integer non-linear programming (MINLP) problem, for which deriving a closed-form solution is non-trivial. Thus, after deconstructing the problem step by step, we devise an iterative method to solve it and provide a convergence guarantee.

In summary, our main contributions are four-fold.

  • We provide a theoretical characterization of the contribution of an individual device to the convergence of FML in each round, by establishing a tight lower bound on the one-round reduction of the expected global loss. Using this quantitative result, we develop NUFM, a fast-convergent FML algorithm with non-uniform device selection;

  • To embed NUFM in the context of multi-access wireless systems, we formulate a resource allocation problem, capturing the trade-offs among the convergence, wall-clock time, and energy consumption. By decomposing the original problem into two sub-problems and deconstructing sub-problems step by step, we propose a joint device selection and resource allocation algorithm (namely URAL) to solve the problem effectively with theoretical performance guarantees;

  • To reduce the computational complexity, we further integrate our proposed algorithms with two first-order approximation techniques in [9], by which the complexity of a one-step update in NUFM can be reduced from $O(d^2)$ to $O(d)$. We also show that our theoretical results hold in these cases;

  • We provide extensive simulation results on challenging real-world benchmarks (i.e., Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet) to demonstrate the efficacy of our methods.

The remainder of this paper is organized as follows. Section II briefly reviews the related work. Section III introduces the FML problem and standard algorithm. We present the non-uniform device selection scheme in Section IV and adapt the scheme for wireless networks in Section V. Finally, Section VI presents the extension to first-order approximation techniques, followed by the simulation results in Section VII and a conclusion drawn in Section VIII.

II Related Work

Federated learning (FL) [4] has been proposed as a promising technique to facilitate edge-edge collaborative learning [17]. However, due to the heterogeneity in devices, models, and data distributions, a shared global model often fails to capture the individual information of each device, leading to performance degradation in inference or classification [18, 10, 19].

Very recently, based on the advances in meta-learning [5], federated meta-learning (FML) has garnered much attention, which aims to learn a personalized model for each device to cope with the heterogeneity challenges [6, 7, 8, 9, 11]. Chen et al.[6] first introduce an FML method called FedMeta, integrating the model-agnostic meta-learning (MAML) algorithm [5] into the federated learning framework. They show that FML can significantly improve the performance of FedAvg [4]. Jiang et al.[7] analyze the connection between FedAvg and MAML, and empirically demonstrate that FML enables better and more stable personalized performance. From a theoretical perspective, Lin et al.[8] analyze the convergence properties and computational complexity of FML with strongly convex loss functions and exact gradients. Fallah et al.[9] further provide the convergence guarantees in non-convex cases with stochastic gradients. Different from the above gradient descent–based approaches, another recent work [11] develops an ADMM-based FML method and gives its convergence guarantee under non-convex cases. However, due to selecting devices uniformly at random, the existing FML algorithms often suffer from slow convergence and low communication efficiency [9]. Further, deploying FML in practical wireless systems calls for effective resource allocation strategies [15, 20], which is beyond the scope of existing FML literature.

There exists a significant body of work on convergence improvement and resource allocation for FL [21, 22, 23, 24, 13, 25, 26, 27, 15, 28, 29, 14, 30, 31, 32, 33, 34, 35, 36]. Regarding convergence improvement, Nguyen et al. [13] propose a fast-convergent FL algorithm, called FOLB, which achieves a near-optimal lower bound on the overall loss decrease in each round. Note that, while the idea of NUFM is similar to [13], the lower bound for FML is derived along a completely different technical path due to the inherent complexity of the local update. To minimize the convergence time, a probabilistic device selection scheme for FL is designed in [25], which assigns high probabilities to the devices with large effects on the global model. Ren et al. [26] investigate a batch-size selection strategy for accelerating the FL training process. Karimireddy et al. [30] employ the variance reduction technique to develop a new FL algorithm. Based on momentum methods, Yu et al. [27] give an FL algorithm with a linear speedup property. Regarding resource allocation in FL, Dinh et al. [15] embed FL in wireless networks, considering the trade-offs between training time and energy consumption, under the assumption that all devices participate in the whole training process. From a long-term perspective, Xu et al. [28] empirically investigate a device selection scheme jointly with bandwidth allocation for FL, using the Lyapunov optimization method. Chen et al. [14] investigate a device selection problem with “hard” resource constraints to enable the implementation of FL over wireless networks. Wang et al. [29] propose a control algorithm to determine the best trade-off between local updates and global aggregation under a resource budget.

Although extensive research has been carried out on FL, FML has not been treated in much detail. In particular, the existing FL acceleration techniques cannot be directly applied to FML due to the high-order information and biased stochastic gradients in the local update phase (different from FML, FL is a first-order method with unbiased stochastic gradients; see Lemma 2 in Section IV). At the same time, device selection and resource allocation need to be crafted jointly [6], rather than simply plugging existing strategies in.

III Preliminaries and Assumptions

In this section, we introduce federated meta-learning, including the learning problem, standard algorithm, and assumptions for theoretical analysis.

TABLE I: Key Notations
$\theta$: model parameter | $F_i(\cdot)$: meta-loss function
$\alpha$, $\beta$: stepsize and meta-learning rate | $L_i$, $\rho_i$: Lipschitz continuity parameters
$i$, $j$: indexes of user devices and data samples | $\sigma_G$, $\sigma_H$: upper bounds of variances
$k$: index of training rounds | $\gamma_G$, $\gamma_H$: similarity parameters
$t$: index of local update steps | $u_i$: contribution of device $i$ to the global loss reduction
$\tau$: number of local update steps | $\nu_i$: CPU-cycle frequency of device $i$
$P_i$: underlying data distribution of device $i$ | $p_i$: transmission power of device $i$
$\mathcal{D}_i$, $D_i$: dataset of device $i$ and its size | $z_{i,m}$: binary variable indicating whether device $i$ accesses RB $m$
$\mathcal{N}$: set of devices | $\eta_1$, $\eta_2$: weight parameters
$n$: number of devices | $c_i$: CPU cycles for computing one sample on device $i$
$\mathcal{N}_k$: set of participating devices in round $k$ | $h_i$, $I_m$: channel gain of device $i$ and interference in RB $m$
$n_k$: number of participating devices in round $k$ | $B$, $N_0$: bandwidth of each RB and noise power spectral density
$\mathrm{x}$, $\mathrm{y}$: input and corresponding label | $\mathcal{M}$, $M$: set and number of RBs
$\ell_i(\theta;\mathrm{x},\mathrm{y})$: loss of model $\theta$ on sample $(\mathrm{x},\mathrm{y})$ | $E$: total energy consumption
$f_i(\cdot)$: expected loss function | $T$: total latency

III-A Federated Meta-Learning Problem

We consider a set $\mathcal{N}$ of user devices that are all connected to a server. Each device $i\in\mathcal{N}$ has a labeled dataset $\mathcal{D}_i=\{\mathrm{x}^j_i,\mathrm{y}^j_i\}^{D_i}_{j=1}$ that can be accessed only by itself. Here, the tuple $(\mathrm{x}^j_i,\mathrm{y}^j_i)\in\mathcal{X}\times\mathcal{Y}$ is a data sample with input $\mathrm{x}^j_i$ and label $\mathrm{y}^j_i$, and follows an unknown underlying distribution $P_i$. Define $\theta$ as the model parameter, such as the weights of a Deep Neural Network (DNN) model. For device $i$, the loss function of a model parameter $\theta\in\mathbb{R}^d$ is defined as $\ell_i(\theta;\mathrm{x},\mathrm{y})$, which measures the error of model $\theta$ in predicting the true label $\mathrm{y}$ given input $\mathrm{x}$.

Federated meta-learning (FML) looks for a good model initialization (also called a meta-model) such that well-performing models for different devices can be quickly obtained via one or a few gradient descent steps. More specifically, FML aims to solve the following problem

$\min_{\theta\in\mathbb{R}^d} F(\theta) \coloneqq \frac{1}{n}\sum_{i\in\mathcal{N}} f_i\big(\theta-\alpha\nabla f_i(\theta)\big)$  (1)

where $f_i$ represents the expected loss function over the data distribution of device $i$, i.e., $f_i(\theta)\coloneqq\mathbb{E}_{(\mathrm{x},\mathrm{y})\sim P_i}[\ell_i(\theta;\mathrm{x},\mathrm{y})]$, $n=|\mathcal{N}|$ is the number of devices, and $\alpha$ is the stepsize. The advantages of this formulation are two-fold: 1) it gives a personalized solution that can capture the heterogeneity among devices; 2) the meta-model can quickly adapt to new devices via slightly updating it with respect to their own data. Clearly, FML fits edge learning well, where edge devices have insufficient computing power and limited data samples.
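To make the objective (1) concrete, the following minimal Python sketch evaluates the per-device meta-loss and the global objective. The callables `f` and `grad_f` (per-device expected loss and its gradient) and the `devices` list are hypothetical stand-ins for quantities that, in practice, are estimated from local data.

```python
import numpy as np

def meta_loss(theta, f, grad_f, alpha):
    """Per-device meta-loss F_i(theta) = f_i(theta - alpha * grad f_i(theta)) in (1)."""
    theta_adapted = theta - alpha * grad_f(theta)   # one-step adaptation
    return f(theta_adapted)

def global_objective(theta, devices, alpha):
    """F(theta): average of the per-device meta-losses over all devices in N."""
    return np.mean([meta_loss(theta, d["f"], d["grad_f"], alpha) for d in devices])
```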

Next, we review the standard FML algorithm in the literature.

III-B Standard Algorithm

Similar to federated learning, the vanilla FML algorithm solves (1) by repeating two steps, local update and global aggregation [9], as detailed below.

  • Local update: At the beginning of each round $k$, the server first sends the current global model $\theta^k$ to a fraction of devices $\mathcal{N}_k$ chosen uniformly at random with pre-set size $n_k$. Then, each device $i\in\mathcal{N}_k$ updates the received model based on its meta-function $F_i(\theta)\coloneqq f_i(\theta-\alpha\nabla f_i(\theta))$ by running $\tau$ ($\geq 1$) steps of stochastic gradient descent locally (also called mini-batch gradient descent), i.e.,

    $\theta^{k,t+1}_i=\theta^{k,t}_i-\beta\tilde{\nabla}F_i(\theta^{k,t}_i),\quad \mathrm{for}~0\leq t\leq\tau-1$  (2)

    where $\theta^{k,t}_i$ denotes the local model of device $i$ at the $t$-th step of the local update in round $k$, with $\theta^{k,0}_i=\theta^k$, and $\beta>0$ is the meta-learning rate. In (2), the stochastic gradient $\tilde{\nabla}F_i(\theta)$ is given by

    $\tilde{\nabla}F_i(\theta)\coloneqq\big(I-\alpha\tilde{\nabla}^2 f_i(\theta,\mathcal{D}''_i)\big)\tilde{\nabla}f_i\big(\theta-\alpha\tilde{\nabla}f_i(\theta,\mathcal{D}_i),\mathcal{D}'_i\big)$  (3)

    where $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches (we slightly abuse the notation $\mathcal{D}_i$ for a batch of the local dataset of the $i$-th device), and for any batch $\mathcal{D}$, $\tilde{\nabla}f_i(\theta,\mathcal{D})$ and $\tilde{\nabla}^2 f_i(\theta,\mathcal{D})$ are the unbiased estimates of $\nabla f_i(\theta)$ and $\nabla^2 f_i(\theta)$, respectively, i.e.,

    $\tilde{\nabla}f_i(\theta,\mathcal{D})\coloneqq\frac{1}{|\mathcal{D}|}\sum_{(\mathrm{x},\mathrm{y})\in\mathcal{D}}\nabla\ell_i(\theta;\mathrm{x},\mathrm{y})$  (4)
    $\tilde{\nabla}^2 f_i(\theta,\mathcal{D})\coloneqq\frac{1}{|\mathcal{D}|}\sum_{(\mathrm{x},\mathrm{y})\in\mathcal{D}}\nabla^2\ell_i(\theta;\mathrm{x},\mathrm{y}).$  (5)

  • Global aggregation: After updating the local model parameter, each selected device sends its local model $\theta^k_i=\theta^{k,\tau}_i$ to the server. The server updates the global model by averaging over the received models, i.e.,

    $\theta^{k+1}=\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\theta^k_i.$  (6)

It is easy to see that the main difference between federated learning and FML lies in the local update phase: federated learning performs local updates with unbiased gradient estimates, whereas FML uses biased estimates that involve high-order information. Moreover, federated learning can be viewed as a special case of FML with $\alpha=0$.
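The following sketch illustrates one round of the vanilla algorithm, assuming each device exposes hypothetical mini-batch oracles `grad(theta, batch)` and `hess(theta, batch)` and a batch sampler `sample_batches()`; it computes the biased stochastic meta-gradient in (3), runs the local updates (2), and averages as in (6).

```python
import numpy as np

def stochastic_meta_grad(theta, grad, hess, batches, alpha):
    """Biased stochastic meta-gradient of F_i, following (3)."""
    D, D1, D2 = batches                              # independent batches
    theta_adapted = theta - alpha * grad(theta, D)   # inner adaptation step
    g = grad(theta_adapted, D1)
    H = hess(theta, D2)
    return (np.eye(theta.size) - alpha * H) @ g

def fml_round(theta, devices, alpha, beta, tau, n_k, rng):
    """One round of the vanilla FML algorithm: uniform selection, (2), then (6)."""
    selected = rng.choice(len(devices), size=n_k, replace=False)
    local_models = []
    for i in selected:
        d = devices[i]
        theta_i = theta.copy()
        for _ in range(tau):                         # tau local steps of (2)
            g = stochastic_meta_grad(theta_i, d["grad"], d["hess"],
                                     d["sample_batches"](), alpha)
            theta_i -= beta * g
        local_models.append(theta_i)
    return np.mean(local_models, axis=0)             # global aggregation (6)
```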

III-C Assumptions

In this subsection, we list the standard assumptions for the analysis of FML algorithms [9, 8, 11].

Assumption 1 (Smoothness).

The expected loss function $f_i$ corresponding to device $i\in\mathcal{N}$ is twice continuously differentiable and $L_i$-smooth, i.e.,

$\|\nabla f_i(\theta_1)-\nabla f_i(\theta_2)\|\leq L_i\|\theta_1-\theta_2\|,~\forall\theta_1,\theta_2\in\mathbb{R}^d.$  (7)

Besides, its gradient is bounded by a positive constant $\zeta_i$, i.e., $\|\nabla f_i(\theta)\|\leq\zeta_i$.

Assumption 2 (Lipschitz Hessian).

The Hessian of function $f_i$ is $\rho_i$-Lipschitz continuous for each $i\in\mathcal{N}$, i.e.,

$\|\nabla^2 f_i(\theta_1)-\nabla^2 f_i(\theta_2)\|\leq\rho_i\|\theta_1-\theta_2\|,~\forall\theta_1,\theta_2\in\mathbb{R}^d.$  (8)
Assumption 3 (Bounded Variance).

Given any $\theta\in\mathbb{R}^d$, the following facts hold for the stochastic gradient $\nabla\ell_i(\theta;\mathrm{x},\mathrm{y})$ and Hessian $\nabla^2\ell_i(\theta;\mathrm{x},\mathrm{y})$ with $(\mathrm{x},\mathrm{y})\in\mathcal{X}\times\mathcal{Y}$:

$\mathbb{E}_{(\mathrm{x},\mathrm{y})\sim P_i}\big[\|\nabla\ell_i(\theta;\mathrm{x},\mathrm{y})-\nabla f_i(\theta)\|^2\big]\leq\sigma^2_G$  (9)
$\mathbb{E}_{(\mathrm{x},\mathrm{y})\sim P_i}\big[\|\nabla^2\ell_i(\theta;\mathrm{x},\mathrm{y})-\nabla^2 f_i(\theta)\|^2\big]\leq\sigma^2_H.$  (10)
Assumption 4 (Similarity).

For any $\theta\in\mathbb{R}^d$ and $i,j\in\mathcal{N}$, there exist nonnegative constants $\gamma_G\geq 0$ and $\gamma_H\geq 0$ such that the gradients and Hessians of the expected loss functions $f_i(\theta)$ and $f_j(\theta)$ satisfy the following conditions:

$\|\nabla f_i(\theta)-\nabla f_j(\theta)\|\leq\gamma_G$  (11)
$\|\nabla^2 f_i(\theta)-\nabla^2 f_j(\theta)\|\leq\gamma_H.$  (12)

Assumption 2 implies the high-order smoothness of $f_i(\theta)$ needed to handle the second-order information in the local update step (2). Assumption 4 indicates that the variations of gradients and Hessians between different devices are bounded by constants, which captures the similarity between devices’ tasks under non-IID data. It holds for many practical loss functions [37], such as logistic regression and hyperbolic tangent functions. In particular, $\gamma_G$ and $\gamma_H$ can be roughly seen as a distance between the data distributions $P_i$ and $P_j$ [38].

IV Non-Uniform Federated Meta-Learning

Due to the uniform random selection of devices in each round, the convergence rate of the standard FML algorithm is slow. In this section, we present a non-uniform device selection scheme to tackle this challenge.

IV-A Device Contribution Quantification

We begin with quantifying the contribution of each device to the one-round reduction of the global loss, using its dataset size and gradient norms. For convenience, define $\zeta\coloneqq\max_i\zeta_i$, $L\coloneqq\max_i L_i$, and $\rho\coloneqq\max_i\rho_i$. We first provide necessary lemmas before giving the main result.

Lemma 1.

If Assumptions 1 and 2 hold, the local meta-function $F_i$ is smooth with parameter $L_F\coloneqq(1+\alpha L)^2 L+\alpha\rho\zeta$.

Sketch of Proof.

We expand the expression of $\|\nabla F_i(\theta_1)-\nabla F_i(\theta_2)\|$ into two parts by the triangle inequality, followed by bounding each part via Assumptions 1 and 2. The detailed proof is presented in Appendix A of the technical report [39]. ∎

Lemma 1 gives the smoothness of the local meta-function $F_i$ and the global loss function $F$.

Lemma 2.

Suppose that Assumptions 1–3 are satisfied, and $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches with sizes $D_i$, $D'_i$, and $D''_i$, respectively. For any $\theta\in\mathbb{R}^d$, the following holds:

$\big\|\mathbb{E}\big[\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\big]\big\|\leq\frac{\alpha\sigma_G L(1+\alpha L)}{\sqrt{D_i}}$  (13)
$\mathbb{E}\big[\|\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\|^2\big]\leq\sigma^2_{F_i}$  (14)

where $\sigma^2_{F_i}$ is defined as

$\sigma^2_{F_i}\coloneqq 6\sigma^2_G(1+\alpha L)^2\Big(\frac{1}{D'_i}+\frac{(\alpha L)^2}{D_i}\Big)+\frac{3(\alpha\zeta\sigma_H)^2}{D''_i}+\frac{6(\alpha\sigma_G\sigma_H)^2}{D''_i}\Big(\frac{1}{D'_i}+\frac{(\alpha L)^2}{D_i}\Big).$  (15)
Sketch of Proof.

We first obtain

$\big\|\mathbb{E}\big[\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\big]\big\|\leq\big\|\big(I-\alpha\nabla^2 f_i(\theta)\big)\mathbb{E}[\delta^*_2]+\mathbb{E}[\delta^*_1\delta^*_2]+\mathbb{E}[\delta^*_1]\nabla f_i\big(\theta-\alpha\nabla f_i(\theta)\big)\big\|,$
$\mathbb{E}\big[\|\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\|^2\big]\leq 3\Big(\mathbb{E}\big[\|\delta^*_1\|^2\big]\mathbb{E}\big[\|\delta^*_2\|^2\big]+\zeta^2\mathbb{E}\big[\|\delta^*_1\|^2\big]+(1+\alpha L)^2\mathbb{E}\big[\|\delta^*_2\|^2\big]\Big),$

where $\delta^*_1$ and $\delta^*_2$ are given by

$\delta^*_1=\alpha\big(\nabla^2 f_i(\theta)-\tilde{\nabla}^2 f_i(\theta,\mathcal{D}''_i)\big)$
$\delta^*_2=\tilde{\nabla}f_i\big(\theta-\alpha\tilde{\nabla}f_i(\theta,\mathcal{D}_i),\mathcal{D}'_i\big)-\nabla f_i\big(\theta-\alpha\nabla f_i(\theta)\big).$

Then, we derive the results by bounding the first and second moments of $\delta^*_1$ and $\delta^*_2$. The detailed proof is presented in Appendix B of the technical report [39]. ∎

Lemma 2 shows that the stochastic gradient $\tilde{\nabla}F_i(\theta)$ is a biased estimate of $\nabla F_i(\theta)$, revealing the challenges in analyzing FML algorithms.

Lemma 3.

If Assumptions 1, 2, and 4 are satisfied, then for any $\theta\in\mathbb{R}^d$ and $i,j\in\mathcal{N}$, we have

$\|\nabla F_i(\theta)-\nabla F_j(\theta)\|\leq(1+\alpha L)^2\gamma_G+\alpha\zeta\gamma_H.$  (16)
Sketch of Proof.

We divide the bound of $\|\nabla F_i(\theta)-\nabla F_j(\theta)\|$ into two independent terms, followed by bounding the two terms separately. The detailed proof is presented in Appendix C of the technical report [39]. ∎

Lemma 3 characterizes the similarity between the local meta-functions, which is critical for analyzing the one-round global loss reduction because it relates the local meta-functions to the global objective. Based on Lemmas 1–3, we are now ready to give our main result.

Theorem 1.

Suppose that Assumptions 1–4 are satisfied, and $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches. If the local update and global aggregation follow (2) and (6), respectively, then the following holds for $\tau=1$:

$\mathbb{E}[F(\theta^k)-F(\theta^{k+1})]\geq\beta\,\mathbb{E}\Bigg[\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\Bigg(\Big(1-\frac{L_F\beta}{2}\Big)\big\|\tilde{\nabla}F_i(\theta^k)\big\|^2-\Big(\sqrt{(1+\alpha L)^2\gamma_G+\alpha\zeta\gamma_H}+\sigma_{F_i}\Big)\sqrt{\mathbb{E}\Big[\big\|\tilde{\nabla}F_i(\theta^k)\big\|^2\,\Big|\,\mathcal{N}_k\Big]}\Bigg)\Bigg]$  (17)

where the outer expectation on the RHS is taken with respect to the selected user set $\mathcal{N}_k$ and the sampled data, and the inner expectation is only with respect to the sampled data.

Sketch of Proof.

Using the smoothness condition of $F_i$, we express the lower bound on the loss reduction as

$\mathbb{E}[F(\theta^k)-F(\theta^{k+1})]\geq\mathbb{E}[G^k],$

where $G^k$ is defined as

$G^k\coloneqq\beta\nabla F(\theta^k)^\top\Big(\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\tilde{\nabla}F_i(\theta^k)\Big)-\frac{L_F\beta^2}{2}\Big\|\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\tilde{\nabla}F_i(\theta^k)\Big\|^2.$

The key step in deriving the desired result is providing a tight lower bound for the product of $\nabla F(\theta^k)$ and $\tilde{\nabla}F(\theta^k)$. The detailed proof is presented in Appendix G of the technical report [39]. ∎

Theorem 1 provides a lower bound on the one-round reduction of the global objective $F$ as a function of the device selection. It implies that different device selections have different impacts on the objective improvement, and it quantifies the contribution of each device to this improvement, depending on the variance of the local meta-functions, task similarities, smoothness, and learning rates. It therefore provides a criterion for selecting devices to accelerate the convergence.

From Theorem 1, we have the following Corollary, which simplifies the above result and extends it to multi-step cases.

Corollary 1.

Suppose that Assumptions 1–4 are satisfied, and $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches with $D_i=D'_i=D''_i$. If the local update and global aggregation follow (2) and (6), respectively, then the following holds for $\beta\in[0,1/L_F)$:

$\mathbb{E}\big[F(\theta^k)-F(\theta^{k+1})\big]\geq\frac{\beta}{2}\mathbb{E}\Bigg[\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\sum^{\tau-1}_{t=0}\Bigg(\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|^2-2\Big(\lambda_1+\frac{\lambda_2}{\sqrt{D_i}}\Big)\sqrt{\mathbb{E}\Big[\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|^2\,\Big|\,\mathcal{N}_k\Big]}\Bigg)\Bigg]$  (18)

where the positive constants $\lambda_1$ and $\lambda_2$ satisfy

$\lambda_1\geq\sqrt{(1+\alpha L)^2\gamma_G+\alpha\zeta\gamma_H}+\beta\tau\sqrt{35(\gamma^2_G+2\sigma^2_F)}$  (19)
$\lambda^2_2\geq 6\sigma^2_G\big(1+(\alpha L)^2\big)\big((\alpha\sigma_H)^2+(1+\alpha L)^2\big)+3(\alpha\zeta\sigma_H)^2.$  (20)
Sketch of Proof.

The proof is similar to that of Theorem 1, with additional tricks in bounding the product of $\nabla F(\theta^k)$ and $\tilde{\nabla}F(\theta^k)$. The detailed proof is presented in Appendix I of the technical report [39]. ∎

Corollary 1 implies that a device with a large gradient naturally accelerates the global loss decrease, whereas a small dataset size degrades the process due to the corresponding high variance. Besides, as the device dissimilarities grow, the lower bound (18) weakens.

Motivated by Corollary 1, we study the device selection in the following subsection.

IV-B Device Selection

To improve the convergence speed, we aim to maximize the lower bound (18) on the one-round objective reduction. Based on Corollary 1, we define the contribution $u^k_i$ of device $i$ to the convergence in round $k$ as

$u^k_i\coloneqq\sum^{\tau-1}_{t=0}\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|^2-2\Big(\lambda_1+\frac{\lambda_2}{\sqrt{D_i}}\Big)\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|$  (21)

where we replace the second moment of $\tilde{\nabla}F_i(\theta^{k,t}_i)$ by its sample value $\|\tilde{\nabla}F_i(\theta^{k,t}_i)\|^2$. Then, the device selection problem in round $k$ can be formulated as

$\max_{\{z_i\}}~\sum_{i\in\mathcal{N}}z_i u^k_i$
$\mathrm{s.t.}~\sum_{i\in\mathcal{N}}z_i=n_k$
$\quad~~z_i\in\{0,1\},~\forall i\in\mathcal{N}.$  (22)

In (22), $z_i$ is a binary variable: $z_i=1$ if device $i$ is selected in this round, and $z_i=0$ otherwise. The solution of (22) can be found by the Select algorithm introduced in [40, Chapter 9] with worst-case complexity $O(n)$ (problem (22) is indeed a selection problem). Accordingly, our device selection scheme is presented as follows.

Device selection: Following the local update phase, instead of selecting devices uniformly at random, each device $i$ first computes its contribution scalar $u^k_i$ locally and sends it to the server. After receiving $\{u^k_i\}_{i\in\mathcal{N}}$ from all devices, the server runs the Select algorithm and finds the optimal device set, denoted by $\mathcal{N}^*_k$. Notably, although the constants $\lambda_1$ and $\lambda_2$ in (21) involve unknown parameters such as $L$, $\gamma_G$, and $\gamma_H$, they can be either estimated during training as in [29] or directly tuned as in our simulations.
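Since (22) is a selection problem, the server's step reduces to picking the $n_k$ devices with the largest contribution scalars. A minimal sketch is given below, using NumPy's partial sort in place of the Select algorithm of [40]; the device-id keys are hypothetical.

```python
import numpy as np

def select_devices(u, n_k):
    """Solve (22): return the n_k devices with the largest contribution scalars u_i^k.

    `u` maps device id -> u_i^k as reported to the server; np.argpartition finds the
    n_k largest values in O(n) on average, matching the selection-problem structure.
    """
    ids = np.array(list(u.keys()))
    vals = np.array([u[i] for i in ids])
    top = np.argpartition(-vals, n_k - 1)[:n_k]
    return {int(i) for i in ids[top]}
```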

Based on the device selection scheme, we propose the Non-Uniform Federated Meta-Learning (NUFM) algorithm (depicted in Algorithm 1). In particular, although NUFM requires an additional communication phase to upload $u^k_i$ to the server, the communication overhead is negligible because $u^k_i$ is just a scalar.

Input: $\alpha$, $\beta$, $\lambda_1$, $\lambda_2$
Server initializes model $\theta^0$ and sends it to all devices;
for round $k=0$ to $K-1$ do
      foreach device $i\in\mathcal{N}$ do
            // Local update
            Initialize $\theta^{k,0}_i\leftarrow\theta^k$ and $u^k_i\leftarrow 0$;
            for local step $t=0$ to $\tau-1$ do
                  Compute stochastic gradient $\tilde{\nabla}F_i(\theta^{k,t}_i)$ by (3) using batches $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$;
                  Update local model $\theta^{k,t+1}_i$ by (2);
                  Update contribution scalar: $u^k_i\leftarrow u^k_i+\|\tilde{\nabla}F_i(\theta^{k,t}_i)\|^2-2(\lambda_1+\lambda_2/\sqrt{D_i})\|\tilde{\nabla}F_i(\theta^{k,t}_i)\|$;
            end for
            Set $\theta^k_i=\theta^{k,\tau}_i$ and send $u^k_i$ to the server;
      end foreach
      // Device selection
      Once receiving $\{u^k_i\}_{i\in\mathcal{N}}$, the server computes the optimal device selection $\mathcal{N}^*_k$ by solving (22);
      // Global aggregation
      After receiving the local models $\{\theta^k_i\}_{i\in\mathcal{N}^*_k}$, the server computes the global model by (6);
end for
return $\theta^K$
Algorithm 1 Non-Uniform Federated Meta-Learning (NUFM)

V Federated Meta-Learning over Wireless Networks

In this section, we extend NUFM to the context of multi-access wireless systems, where the bandwidth for uplink transmission and the power of IoT devices are limited. First, we present the system model followed by the problem formulation. Then, we decompose the original problem into two sub-problems and devise solutions for each of them with theoretical performance guarantees.

V-A System Model

As illustrated in Fig. 1, we consider a wireless multi-user system, where a set $\mathcal{N}$ of $n$ end devices join forces to carry out federated meta-learning aided by an edge server. Each round consists of two stages: a computation phase and a communication phase. In the computation phase, each device $i\in\mathcal{N}$ downloads the current global model and computes its local model based on its local dataset; in the communication phase, the selected devices transmit their local models to the edge server via a limited number of wireless channels. After that, the edge server runs the global aggregation and starts the next round. Here we do not consider the downlink communication due to the asymmetric uplink-downlink settings in wireless networks. That is, the transmission power at the server (e.g., a base station) and the downlink communication bandwidth are generally sufficient for global meta-model transmission, so the downlink time is usually negligible compared to the uplink data transmission time [15]. Since we focus on the device selection and resource allocation problem in each round, we omit the subscript $k$ for brevity throughout this section.

Figure 1: The architecture of federated meta-learning over a wireless network with multiple user devices and an edge server. Due to limited communication resources, only part of user devices can upload their local models in each training round.

V-A1 Computation Model

We denote by $c_i$ the number of CPU cycles for device $i$ to update the model with one sample, which can be measured offline as prior knowledge [15]. Assume that the batch size of device $i$ used in the local update phase (2) is $D_i$. Then, the number of CPU cycles required for device $i$ to run a one-step local update is $c_i D_i$. We denote the CPU-cycle frequency of device $i$ by $\nu_i$. Thus, the CPU energy consumption of device $i$ during the local update phase can be expressed as

$E^{\mathit{cp}}_i(\nu_i)\coloneqq\frac{\iota_i}{2}\tau_i c_i D_i\nu^2_i$  (23)

where $\iota_i/2$ is the effective capacitance coefficient of the computing chipset of device $i$ [41]. The computation time of device $i$ in a round is

$T^{\mathit{cp}}_i(\nu_i)\coloneqq\frac{\tau_i c_i D_i}{\nu_i}.$  (24)

For simplicity, we set $\tau=1$ in the following.
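For reference, a small sketch of the computation model (23)-(24) is given below; the argument names are placeholders for the per-device constants introduced above.

```python
def computation_cost(nu, c, D, iota, tau=1):
    """Computation energy (23) and time (24) of one device running tau local steps
    at CPU-cycle frequency nu, with per-sample cycles c, batch size D, and capacitance iota."""
    energy = 0.5 * iota * tau * c * D * nu ** 2   # E_i^cp(nu_i) in (23)
    time = tau * c * D / nu                       # T_i^cp(nu_i) in (24)
    return energy, time
```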

V-A2 Communication Model

We consider a multi-access protocol for the devices, namely the orthogonal frequency division multiple access (OFDMA) technique, whereby each device can occupy one uplink resource block (RB) in a communication round to upload its local model. There are $M$ RBs in the system, denoted by $\mathcal{M}=\{1,2,\dots,M\}$. The achievable transmission rate of device $i$ is [25]

$r_i(\bm{z}_i,p_i)\coloneqq\sum_{m\in\mathcal{M}}z_{i,m}B\log_2\Big(1+\frac{h_i p_i}{I_m+BN_0}\Big)$  (25)

where $B$ is the bandwidth of each RB, $h_i$ the channel gain, $N_0$ the noise power spectral density, $p_i$ the transmission power of device $i$, and $I_m$ the interference caused by devices that are located in other service areas and use the same RB. In (25), $z_{i,m}\in\{0,1\}$ is a binary variable associated with the allocation of the $m$-th RB to device $i$: $z_{i,m}=1$ indicates that RB $m$ is allocated to device $i$, and $z_{i,m}=0$ otherwise. Each device can occupy at most one RB while each RB can be accessed by at most one device, so we have

$\sum_{m\in\mathcal{M}}z_{i,m}\leq 1,\quad\forall i\in\mathcal{N}$  (26)
$\sum_{i\in\mathcal{N}}z_{i,m}\leq 1,\quad\forall m\in\mathcal{M}.$  (27)

Due to the fixed dimension of the model parameters, we assume that the model size is constant throughout the learning process, denoted by $S$. If device $i$ is selected, the time for transmitting the model is given by

$T^{\mathit{co}}_i(\bm{z}_i,p_i)\coloneqq\frac{S}{r_i(\bm{z}_i,p_i)}$  (28)

where $\bm{z}_i\coloneqq\{z_{i,m}\mid m\in\mathcal{M}\}$. Besides, the energy consumption of the transmission is

$E^{\mathit{co}}_i(\bm{z}_i,p_i)\coloneqq\sum_{m\in\mathcal{M}}z_{i,m}T^{\mathit{co}}_i(\bm{z}_i,p_i)p_i.$  (29)

If no RB is allocated to device $i$ in the current round, its transmission power and energy consumption are zero.
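The communication model (25), (28), and (29) can be summarized by the following sketch; `z_i` is a hypothetical 0/1 list over RBs for a single device.

```python
import math

def communication_cost(z_i, p_i, h_i, I, B, N0, S):
    """Uplink rate (25), transmission time (28), and transmission energy (29) of one device.

    `z_i` is the 0/1 RB-indicator list of the device; both outputs are zero if no RB is allocated.
    """
    if sum(z_i) == 0:
        return 0.0, 0.0
    rate = sum(z * B * math.log2(1 + h_i * p_i / (I[m] + B * N0))
               for m, z in enumerate(z_i))        # r_i(z_i, p_i) in (25)
    t_co = S / rate                               # T_i^co in (28)
    return t_co, t_co * p_i                       # (T_i^co, E_i^co) with E_i^co from (29)
```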

V-B Problem Formulation

For ease of exposition, we define $\bm{\nu}\coloneqq\{\nu_i\mid i\in\mathcal{N}\}$, $\bm{p}\coloneqq\{p_i\mid i\in\mathcal{N}\}$, and $\bm{z}\coloneqq\{z_{i,m}\mid i\in\mathcal{N},m\in\mathcal{M}\}$. Recalling the procedure of NUFM, the total energy consumption $E(\bm{z},\bm{p},\bm{\nu})$ and wall-clock time $T(\bm{z},\bm{p},\bm{\nu})$ in a round can be expressed as

$E(\bm{z},\bm{p},\bm{\nu})\coloneqq\sum_{i\in\mathcal{N}}\Big(E^{\mathit{cp}}_i(\nu_i)+E^{\mathit{co}}_i(\bm{z}_i,p_i)\Big)$  (30)
$T(\bm{z},\bm{p},\bm{\nu})\coloneqq\max_{i\in\mathcal{N}}T^{\mathit{cp}}_i(\nu_i)+\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}T^{\mathit{co}}_i(\bm{z}_i,p_i)$  (31)

where we neglect the communication time for transmitting the scalar $u_i$. The total contribution to the convergence is

$U(\bm{z})=\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}u_i$  (32)

where $u_i$ is given in (21) (one can regularize $u_i$ by adding a large enough constant in (21) to keep it positive).

We consider the following non-convex mixed-integer non-linear programming (MINLP) problem

(P) $\max_{\bm{z},\bm{p},\bm{\nu}}~U(\bm{z})-\eta_1 E(\bm{z},\bm{p},\bm{\nu})-\eta_2 T(\bm{z},\bm{p},\bm{\nu})$
$\mathrm{s.t.}~0\leq p_i\leq p^{\mathit{max}}_i,~\forall i\in\mathcal{N}$  (33)
$\quad~~0\leq\nu_i\leq\nu^{\mathit{max}}_i,~\forall i\in\mathcal{N}$  (34)
$\quad~~z_{i,m}\in\{0,1\},~\forall i\in\mathcal{N},m\in\mathcal{M}$  (35)
$\quad~~\sum_{m\in\mathcal{M}}z_{i,m}\leq 1,~\forall i\in\mathcal{N}$  (26)
$\quad~~\sum_{i\in\mathcal{N}}z_{i,m}\leq 1,~\forall m\in\mathcal{M}$  (27)

where $\eta_1\geq 0$ and $\eta_2\geq 0$ are weight parameters that capture the Pareto-optimal trade-offs among convergence, latency, and energy consumption, whose values depend on the specific scenario. Constraints (33) and (34) give the feasible regions of the devices’ transmission power levels and CPU-cycle frequencies, respectively. Constraints (26) and (27) restrict each device to access at most one uplink RB and each RB to be allocated to at most one device.

In this formulation, we aim to maximize the convergence speed of FML, while minimizing the energy consumption and wall-clock time in each round. Notably, our solution can adapt to the problem with hard constraints on energy consumption and wall-clock time as in [14] via setting “virtual devices” (see Lemma 4 and Lemma 6).
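To see how the terms fit together, the sketch below evaluates the objective of (P) for a given candidate $(\bm{z},\bm{p},\bm{\nu})$, reusing the computation_cost and communication_cost helpers sketched in the system model; the dictionary keys are hypothetical, and $\tau=1$ as in Section V-A.

```python
def objective_P(z, p, nu, devices, I, B, N0, S, eta1, eta2):
    """Evaluate U(z) - eta1*E(z,p,nu) - eta2*T(z,p,nu) in (P) for a candidate allocation.

    `devices[i]` is a dict with hypothetical keys u, c, D, iota, h;
    `z[i]` is the 0/1 RB-indicator list of device i.
    """
    U, E, t_cp_max, t_co_max = 0.0, 0.0, 0.0, 0.0
    for i, d in enumerate(devices):
        e_cp, t_cp = computation_cost(nu[i], d["c"], d["D"], d["iota"])   # (23), (24)
        t_co, e_co = communication_cost(z[i], p[i], d["h"], I, B, N0, S)  # (28), (29)
        if sum(z[i]) == 1:
            U += d["u"]                                                   # (32)
        E += e_cp + e_co                                                  # (30)
        t_cp_max, t_co_max = max(t_cp_max, t_cp), max(t_co_max, t_co)
    return U - eta1 * E - eta2 * (t_cp_max + t_co_max)                    # (31)
```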

Next, we provide a joint device selection and resource allocation algorithm to solve this problem.

V-C A Joint Device Selection and Resource Allocation Algorithm

Substituting (30), (31), and (32) into problem (P), we can easily decompose the original problem into the following two sub-problems (SP1) and (SP2).

(SP1) $\min_{\bm{\nu}}~g_1(\bm{\nu})=\eta_1\sum_{i\in\mathcal{N}}\frac{\iota_i}{2}c_i D_i\nu^2_i+\eta_2\max_{i\in\mathcal{N}}\frac{c_i D_i}{\nu_i}$
$\qquad\mathrm{s.t.}~0\leq\nu_i\leq\nu^{\mathit{max}}_i,~\forall i\in\mathcal{N}.$

(SP2) $\max_{\bm{z},\bm{p}}~g_2(\bm{z},\bm{p})=\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}u_i-\eta_1\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}\frac{S p_i}{B\log_2\big(1+\frac{h_i p_i}{I_m+BN_0}\big)}-\eta_2\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}\frac{S}{B\log_2\big(1+\frac{h_i p_i}{I_m+BN_0}\big)}$
$\qquad\mathrm{s.t.}~0\leq p_i\leq p^{\mathit{max}}_i,~\forall i\in\mathcal{N}$
$\qquad\quad~\sum_{i\in\mathcal{N}}z_{i,m}\leq 1,~\forall m\in\mathcal{M}$
$\qquad\quad~\sum_{m\in\mathcal{M}}z_{i,m}\leq 1,~\forall i\in\mathcal{N}$
$\qquad\quad~z_{i,m}\in\{0,1\},~\forall i\in\mathcal{N},m\in\mathcal{M}.$

(SP1) aims at controlling the CPU-cycle frequencies for devices to minimize the energy consumption and latency in the computational phase. (SP2) controls the transmission power and RB allocation to maximize the convergence speed while minimizing the transmission cost and communication delay. We provide the solutions to these two sub-problems separately.

V-C1 Solution to (SP1)

Denote the optimal CPU-cycle frequencies of (SP1) by $\bm{\nu}^*=\{\nu^*_i\}_{i\in\mathcal{N}}$. We first give the following lemma to offer insights into this sub-problem.

Lemma 4.

If device $j$ is the straggler among all devices under the optimal CPU frequencies $\bm{\nu}^*$, i.e., $j=\arg\max_{i\in\mathcal{N}}c_i D_i/\nu^*_i$, then the following holds for any $\eta_1,\eta_2>0$:

$\nu^*_i=\begin{cases}\min\Big\{\sqrt[3]{\frac{a_2}{2a_1}},\ \min_{i\in\mathcal{N}}\frac{c_j D_j\nu^{\mathit{max}}_j}{c_i D_i}\Big\},&\mathrm{if}~i=j\\[4pt]\frac{c_i D_i\nu^*_j}{c_j D_j},&\mathrm{otherwise}.\end{cases}$  (36)

The positive constants $a_1$ and $a_2$ in (36) are defined as

$a_1\coloneqq\eta_1\sum_{i\in\mathcal{N}}\frac{\iota_i(c_i D_i)^3}{2(c_j D_j)^2}$  (37)
$a_2\coloneqq\eta_2 c_j D_j.$  (38)
Sketch of Proof.

The derivation involves two steps: expressing $\nu^*_i$ in terms of $\nu^*_j$, and deriving $\nu^*_j$ by solving the corresponding optimization problem. The detailed proof is presented in Appendix D of the technical report [39]. ∎

Lemma 4 implies that if the straggler (the device with the longest computation time) can be determined, then the optimal CPU-cycle frequencies of all devices can be derived in closed form. Intuitively, due to the tension between minimizing the energy consumption and the computation time, once the straggler is fixed, every other device can use the smallest CPU-cycle frequency for which its computation time does not exceed that of the straggler. This leads to the following theorem.

Theorem 2.

Denote by $\bm{\nu}^*_{\mathit{straggler:}j}$ the optimal solution (i.e., (36)) under the assumption that $j$ is the straggler. Then, the global optimal solution of (SP1) can be obtained by

$\bm{\nu}^*=\arg\min_{\bm{\nu}\in\mathcal{V}}g_1(\bm{\nu})$  (39)

where $\mathcal{V}\coloneqq\{\bm{\nu}^*_{\mathit{straggler:}j}\}_{j\in\mathcal{N}}$, and $g_1$ is the objective function in (SP1).

Proof.

The result follows directly from Lemma 4; we omit the details for brevity. ∎

Theorem 2 shows that the optimal solution of (SP1) is the fixed-straggler solution in Lemma 4 that attains the minimum objective $g_1$. Thus, (SP1) can be solved with computational complexity $O(n)$ by comparing the achievable objective values corresponding to different stragglers.
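A sketch of this procedure is given below: it enumerates the candidate straggler $j$, applies the closed form (36)-(38), and keeps the candidate frequencies with the smallest $g_1$ (Theorem 2). The cap on $\nu_j$ is implemented as the largest value keeping every $\nu_i$ within its own limit $\nu_i^{\mathit{max}}$, which is our reading of the min term in (36); the dictionary keys are hypothetical.

```python
def solve_sp1(devices, eta1, eta2):
    """Solve (SP1): enumerate the straggler, apply (36)-(38), keep the best g_1.

    `devices[i]` is a dict with hypothetical keys c, D, iota, nu_max.
    """
    n = len(devices)
    cd = [d["c"] * d["D"] for d in devices]
    best_nu, best_g1 = None, float("inf")
    for j in range(n):
        a1 = eta1 * sum(devices[i]["iota"] * cd[i] ** 3 / (2 * cd[j] ** 2) for i in range(n))
        a2 = eta2 * cd[j]
        # largest nu_j such that nu_i = c_i D_i nu_j / (c_j D_j) stays within nu_i^max for all i
        cap = min(cd[j] * devices[i]["nu_max"] / cd[i] for i in range(n))
        nu_j = min((a2 / (2 * a1)) ** (1.0 / 3.0), cap)
        nu = [cd[i] * nu_j / cd[j] for i in range(n)]
        g1 = eta1 * sum(0.5 * devices[i]["iota"] * cd[i] * nu[i] ** 2 for i in range(n)) \
             + eta2 * max(cd[i] / nu[i] for i in range(n))
        if g1 < best_g1:
            best_nu, best_g1 = nu, g1
    return best_nu
```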

V-C2 Solution to (SP2)

Similar to Section V-C1, we denote the optimal solutions of (SP2) by $\bm{z}^*=\{z^*_{i,m}\mid i\in\mathcal{N},m\in\mathcal{M}\}$ and $\bm{p}^*=\{p^*_i\mid i\in\mathcal{N}\}$, respectively, the optimal set of selected devices by $\mathcal{N}^*\coloneqq\{i\in\mathcal{N}\mid\sum_{m\in\mathcal{M}}z^*_{i,m}=1\}$, and, for each $i\in\mathcal{N}^*$, the RB allocated to $i$ by $m^*_i$, i.e., $z^*_{i,m^*_i}=1$.

It is challenging to derive a closed-form solution for (SP2) because it is a non-convex MINLP problem with a non-differentiable “max” operator in the objective. Thus, in the following, we develop an iterative algorithm to solve this problem and show that it converges to a local optimum.

We begin with analyzing the properties of the optimal solution in the next lemma.

Lemma 5.

Denote the transmission delay under $\bm{z}^*$ and $\bm{p}^*$ by $\delta^*$, i.e.,

$\delta^*\coloneqq\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z^*_{i,m}\frac{S}{B\log_2\big(1+\frac{h_i p^*_i}{I_m+BN_0}\big)}.$  (40)

Then the following relation holds:

$\delta^*\geq\frac{S}{B\log_2\big(1+\frac{h_i p^{\mathit{max}}_i}{I_m+BN_0}\big)}$  (41)

and $p^*_i$ can be expressed as

$p^*_i=\begin{cases}\frac{(I_{m^*_i}+BN_0)(2^{\frac{S}{B\delta^*}}-1)}{h_i},&\mathrm{if}~i\in\mathcal{N}^*\\[4pt]0,&\mathrm{otherwise}.\end{cases}$  (42)
Sketch of Proof.

We prove (41) by contradiction and (42) by solving the corresponding transformed problem. The detailed proof is presented in Appendix E of the technical report [39]. ∎

Lemma 5 indicates that the optimal transmission power can be derived in closed form via (42), given the RB allocation and the transmission delay. Lemma 5 also implies that for any RB allocation strategy $\tilde{\bm{z}}$ and transmission delay $\tilde{\delta}$ (not necessarily optimal), equation (42) provides the “optimal” transmission power under $\tilde{\bm{z}}$ and $\tilde{\delta}$ as long as (41) is satisfied. Based on that, we have the following result.

Theorem 3.

Denote $\mu_{i,m}\coloneqq(I_m+BN_0)(2^{\frac{S}{B\delta^*}}-1)/h_i$. Given the transmission delay $\delta^*$, the optimal RB allocation strategy can be obtained by

$\bm{z}^*=\arg\max_{\bm{z}}\bigg\{\sum_{i,m}z_{i,m}(u_i-e_{i,m}),~\mathrm{s.t.}~(26),(27),(35)\bigg\}$  (43)

where

$e_{i,m}\coloneqq\begin{cases}\eta_1\delta^*\mu_{i,m},&\mathrm{if}~\mu_{i,m}\leq p^{\mathit{max}}_i\\u_i+1,&\mathrm{otherwise}.\end{cases}$  (44)
Proof.

From Lemma 5, Eq. (43) holds if $\mu_{i,m}\leq p^{\mathit{max}}_i$. On the other hand, when $\mu_{i,m}>p^{\mathit{max}}_i$, if device $i$ were selected, the transmission delay would be larger than $\delta^*$ (see the proof of Lemma 5), which contradicts the given condition. Thus, when $\mu_{i,m}>p^{\mathit{max}}_i$, we set $e_{i,m}=u_i+1$, ensuring that device $i$ is not selected. ∎

Theorem 3 shows that the optimal RB allocation strategy can be obtained by solving (43), given the transmission delay $\delta^*$. Naturally, problem (43) can be equivalently transformed into a bipartite matching problem. Consider a bipartite graph $\mathcal{G}$ with source set $\mathcal{N}$ and destination set $\mathcal{M}$. For each $i\in\mathcal{N}$ and $m\in\mathcal{M}$, denote the weight of the edge from node $i$ to node $m$ by $w_{i\rightarrow m}$: if $u_i-e_{i,m}>0$, $w_{i\rightarrow m}=e_{i,m}-u_i$; otherwise, $w_{i\rightarrow m}=\infty$. Therefore, maximizing (43) is equivalent to finding a matching in $\mathcal{G}$ with the minimum sum of weights. This means that we can obtain the optimal RB allocation strategy under a fixed transmission delay via the Kuhn-Munkres algorithm with worst-case complexity $O(Mn^2)$ [42].
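As an illustration, the following sketch solves (43) for a fixed delay using SciPy's Hungarian-algorithm implementation as a stand-in for Kuhn-Munkres; pairs with non-positive benefit $u_i-e_{i,m}$ are treated as zero-weight edges and discarded afterwards, so the corresponding devices remain unselected.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_rbs(u, e):
    """RB allocation (43) for a fixed transmission delay.

    `u` has shape (n,) and `e` has shape (n, M), with e_{i,m} following (44).
    Returns the 0/1 allocation matrix z of shape (n, M).
    """
    gain = u[:, None] - e                       # benefit of assigning device i to RB m
    cost = np.where(gain > 0, -gain, 0.0)       # minimize cost = maximize positive benefit
    rows, cols = linear_sum_assignment(cost)
    z = np.zeros_like(e, dtype=int)
    for i, m in zip(rows, cols):
        if gain[i, m] > 0:                      # keep only assignments with positive benefit
            z[i, m] = 1
    return z
```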

We proceed to show how to iteratively approximate the optimal $\delta^*$, $\bm{p}^*$, and $\bm{z}^*$.

Lemma 6.

Let $j$ denote the communication straggler among all selected devices with respect to the RB allocation $\bm{z}^*$ and transmission power $\bm{p}^*$, i.e., for any $i\in\mathcal{N}^*$,

$\sum_{m\in\mathcal{M}}z^*_{i,m}\underbrace{\frac{S}{B\log_2\big(1+\frac{h_i p^*_i}{I_m+BN_0}\big)}}_{T^{\mathit{co}}_i(\bm{z}^*_i,p^*_i)}\leq\sum_{m\in\mathcal{M}}z^*_{j,m}\underbrace{\frac{S}{B\log_2\big(1+\frac{h_j p^*_j}{I_m+BN_0}\big)}}_{T^{\mathit{co}}_j(\bm{z}^*_j,p^*_j)}.$  (45)

Then, the following holds true:

  1) Define the function $f_4(p)\coloneqq b_1\big((1+p)\log_2(1+p)\ln 2-p\big)-\eta_2$. Then, $f_4(p)$ is monotonically increasing with respect to $p\geq 0$ and has a unique zero point $\tilde{p}^0_j\in(0,b_2]$, where $b_1$ and $b_2$ are defined as

     $b_1\coloneqq\eta_1\sum_{i\in\mathcal{N}^*}\frac{I_{m^*_i}+BN_0}{h_i}$  (46)
     $b_2\coloneqq 2^{\big(1+\sqrt{\max\{\frac{\eta_2}{b_1},1\}-1}\big)/\ln 2};$  (47)

  2) Denote $(\mathit{SNR})_j\coloneqq(I_{m^*_j}+BN_0)/h_j$. For $i\in\mathcal{N}^*$, we have

     $p^*_i=\begin{cases}\min\Big\{\tilde{p}^0_j,\ \min_{i\in\mathcal{N}^*}\frac{h_i p^{\mathit{max}}_i}{I_{m^*_i}+BN_0}\Big\},&\mathrm{if}~i=j\\[4pt]\frac{(\mathit{SNR})_i}{(\mathit{SNR})_j}p^*_j,&\mathrm{otherwise}.\end{cases}$  (48)
Sketch of Proof.

We obtain the first result by analyzing the properties of $f_4$ and derive (48) by solving the corresponding optimization problem. The detailed proof is presented in Appendix F of the technical report [39]. ∎

Lemma 6 indicates that, given the optimal RB allocation strategy $\bm{z}^*$ and the straggler, the optimal transmission power can be derived by (48); unlike Lemma 5, this does not require the corresponding transmission delay $\delta^*$. Notably, in (48), the zero point $\tilde{p}^0_j$ of $f_4$ can be obtained to any required tolerance $\epsilon$ by the bisection method in at most $\log_2(b_2/\epsilon)$ iterations.
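A minimal bisection sketch for locating $\tilde{p}^0_j$ is shown below, assuming $b_1$ and $b_2$ have been computed from (46)-(47).

```python
import math

def zero_of_f4(b1, b2, eta2, eps=1e-6):
    """Bisection for the unique zero of f_4(p) = b1*((1+p)*log2(1+p)*ln2 - p) - eta2
    on (0, b2]; f_4 is increasing and f_4(0) = -eta2 < 0, so the root is bracketed."""
    def f4(p):
        return b1 * ((1 + p) * math.log2(1 + p) * math.log(2) - p) - eta2

    lo, hi = 0.0, b2
    while hi - lo > eps:                 # at most log2(b2/eps) halvings
        mid = 0.5 * (lo + hi)
        if f4(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```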

Similar to Theorem 2, we can find the optimal transmission power by the following theorem, given the RB allocation.

Theorem 4.

Denote by $\bm{p}^*_{\mathit{straggler:}j}$ the optimal solution under the assumption that $j$ is the communication straggler, given a fixed RB allocation $\bm{z}^*$. The corresponding optimal transmission power is given by

$\bm{p}^*=\arg\max_{\bm{p}\in\mathcal{P}}g_2(\bm{z}^*,\bm{p})$  (49)

where $\mathcal{P}\coloneqq\{\bm{p}^*_{\mathit{straggler:}j}\}_{j\in\mathcal{N}}$ and $g_2$ is the objective function defined in (SP2).

Proof.

The result follows directly from Lemma 6; we omit the details for brevity. ∎

Define the communication time corresponding to $\bm{z}$ and $\bm{p}$ as $T^{\mathit{co}}(\bm{z},\bm{p})\coloneqq\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}T^{\mathit{co}}_i(\bm{z}_i,p_i)$. Based on Theorems 3 and 4, we have the following Iterative Solution (IVES) algorithm to solve (SP2).

IVES: We initialize the transmission delay $\delta^0$ (based on (42)) as

$\delta^0=\max_{i\in\mathcal{N}}\frac{S}{B\log_2\big(1+\frac{h_i p^{\mathit{max}}_i}{I_m+BN_0}\big)}.$  (50)

In each iteration $t$, we first compute an RB allocation strategy $\bm{z}^t$ by solving (43) with the Kuhn-Munkres algorithm. Then, based on $\bm{z}^t$, we find the corresponding transmission power $\bm{p}^t$ by (49) and update the transmission delay by $\delta^{t+1}=T^{\mathit{co}}(\bm{z}^t,\bm{p}^t)$ before the next iteration. The details of IVES are given in Algorithm 2.

Input: η1\eta_{1}, η2\eta_{2}, SS, BB, {hi}\{h_{i}\}, {Im}\{I_{m}\}
1 Initialize t=0t=0 and δ0\delta^{0} by (50);
2 while not done do
3       Compute RB allocation strategy 𝒛t\bm{z}^{t} under δt\delta^{t} using the Kuhn-Munkres algorithm based on (43);
4       Compute transmission power 𝒑t\bm{p}^{t} by (49);
5       Update δt+1=T𝑐𝑜(𝒛t,𝒑t)\delta^{t+1}=T^{\mathit{co}}(\bm{z}^{t},\bm{p}^{t});
6      
7 end while
return 𝐳\bm{z}^{*}, 𝐩\bm{p}^{*}
Algorithm 2 Iterative Solution (IVES)

Using IVES, we can solve (SP2) in an iterative manner. In the following theorem, we provide the convergence guarantee for IVES.

Theorem 5.

If we solve (SP2) by IVES, then {g2(𝐳t,𝐩t)}\{g_{2}(\bm{z}^{t},\bm{p}^{t})\} monotonically increases and converges to a unique point.

Sketch of Proof.

The result is derived via proving g(𝒛t,𝒑t)g(𝒛t+1,𝒑^t+1)g(\bm{z}^{t},\bm{p}^{t})\leq g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1}) and g(𝒛t+1,𝒑^t+1)g(𝒛t+1,𝒑t+1)g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1})\leq g(\bm{z}^{t+1},\bm{p}^{t+1}). The detailed proof is presented in Appendix H of the technical report [39]. ∎

Although IVES solves (SP2) iteratively, we observe in the simulations that it converges extremely fast (often within only two iterations), resulting in low computational complexity.
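To illustrate the structure of the iteration, the following is a minimal Python sketch of the IVES loop; rb_cost_matrix, optimal_power, and comm_time are hypothetical callables standing in for (43), (49), and T^co, delta0 is the initial delay from (50), and none of these names come from the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres (Hungarian) method

def ives(delta0, rb_cost_matrix, optimal_power, comm_time, max_iter=10, tol=1e-6):
    """Minimal sketch of IVES (Algorithm 2) under user-supplied oracles."""
    delta, z, p = delta0, None, None
    for _ in range(max_iter):
        cost = rb_cost_matrix(delta)              # matching weights of (43) under the current delay
        rows, cols = linear_sum_assignment(cost)  # RB allocation z^t
        z = np.zeros_like(cost)
        z[rows, cols] = 1.0
        p = optimal_power(z)                      # transmission power p^t via (49)
        new_delta = comm_time(z, p)               # delta^{t+1} = T^co(z^t, p^t)
        if abs(new_delta - delta) <= tol:         # in practice ~2 iterations suffice
            break
        delta = new_delta
    return z, p
```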

Combining the solutions of (SP1) and (SP2), we provide the User Selection and Resource Allocation (URAL) algorithm in Algorithm 3 to solve the original problem (P). URAL can simultaneously optimize the convergence speed, training time, and energy consumption via jointly selecting devices and allocating resources. Further, URAL can be directly integrated into the NUFM paradigm in the device selection phase to facilitate the deployment of FML in wireless networks.

Input: η1\eta_{1}, η2\eta_{2}, SS, BB, {hi}\{h_{i}\}, {Im}\{I_{m}\}, {ιi}\{\iota_{i}\}, {ci}\{c_{i}\}, {Di}\{D_{i}\}
Compute 𝝂\bm{\nu}^{*} by (39);
  // Solve (SP1)
Compute 𝒛\bm{z}^{*} and 𝒑\bm{p}^{*} by IVES;
  // Solve (SP2)
return 𝛎\bm{\nu}^{*}, 𝐳\bm{z}^{*}, 𝐩\bm{p}^{*}  
Algorithm 3 User Selection and Resource Allocation (URAL) Algorithm

VI Extension to First-Order Approximations

The computation of the Hessian in the local update (2) may incur a high computational cost for resource-limited IoT devices. In this section, we address this challenge.

There are two common methods used in the literature to reduce the complexity of computing the Hessian [9]:

  1. 1.

    replacing the stochastic gradient by

    ~Fi(θ)~fi(θα~fi(θ,𝒟i),𝒟i);\displaystyle\tilde{\nabla}F_{i}({\theta})\approx\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}; (51)
  2. 2.

    replacing the Hessian-gradient product by

    ~2fi(θ,𝒟i′′)~fi(θα~fi(θ,𝒟i),𝒟i)\displaystyle\tilde{\nabla}^{2}f_{i}({\theta},\mathcal{D}^{\prime\prime}_{i})\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}
    ~fi(θ+ϵg~i,𝒟i′′)~fi(θϵg~i,𝒟i′′)2ϵ\displaystyle\approx\frac{\tilde{\nabla}f_{i}\big{(}{\theta}+\epsilon\tilde{g}_{i},\mathcal{D}^{\prime\prime}_{i}\big{)}-\tilde{\nabla}f_{i}\big{(}{\theta}-\epsilon\tilde{g}_{i},\mathcal{D}^{\prime\prime}_{i}\big{)}}{2\epsilon} (52)

    where g~i=~fi(θα~fi(θ,𝒟i),𝒟i)\tilde{g}_{i}=\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}.

By doing so, the computational complexity of a one-step local update can be reduced from O(d2)O(d^{2}) to O(d)O(d) without sacrificing much learning performance. Next, we show that our results in Theorem 1 still hold in the above two cases.
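As an illustration of the two variants, the following is a minimal Python sketch under generic gradient oracles; the names grad_support, grad_query, and grad_hvp_batch are ours and simply denote stochastic gradients evaluated on D_i, D'_i, and D''_i, respectively.

```python
def meta_gradient_first_order(theta, grad_support, grad_query, grad_hvp_batch,
                              alpha=1e-3, eps=1e-3, use_hessian_estimate=True):
    """Sketch of the two approximations (51) and (52); theta and the oracle
    outputs are assumed to be NumPy-style arrays of the model parameters."""
    inner = theta - alpha * grad_support(theta)      # one inner gradient step on D_i
    g = grad_query(inner)                            # \tilde{g}_i in (52)
    if not use_hessian_estimate:
        return g                                     # variant (51): drop the Hessian term
    # variant (52): central finite difference for the Hessian-gradient product on D''_i
    hvp = (grad_hvp_batch(theta + eps * g) - grad_hvp_batch(theta - eps * g)) / (2.0 * eps)
    return g - alpha * hvp                           # approximates (I - alpha*Hessian) * g
```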

Corollary 2.

Suppose that Assumptions 1-4 are satisfied, and 𝒟i\mathcal{D}_{i}, 𝒟i\mathcal{D}^{\prime}_{i}, and 𝒟i′′\mathcal{D}^{\prime\prime}_{i} are independent batches. If the local update and global aggregation follow (2) and (6) respectively, we have for τ=1\tau=1

𝔼[F(θk)F(θk+1)]\displaystyle\mathbb{E}[F({\theta}^{k})-F({\theta}^{k+1})]\geq\; β𝔼[1nki𝒩k((1LFβ2)~Fi(θk)2\displaystyle\beta\mathbb{E}\Bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Bigg{(}\Big{(}1-\frac{L_{F}\beta}{2}\Big{)}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}
((1+αL)2γG+αζγH+σ~Fi)\displaystyle-\Big{(}\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\tilde{\sigma}_{F_{i}}\Big{)}
×𝔼[~Fi(θk)2|𝒩k])]\displaystyle\times\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\Bigg{)}\Bigg{]} (53)

where σ~Fi\tilde{\sigma}_{F_{i}} is defined as follows

σ~Fi2={4σG2(1Di+(αL)2Di)+2(αLζ)2,ifusing(51)6σG2(α2ϵ2Di′′+(1+2(αL)2)(1Di+(αL)2Di))+2(αρϵ)2ζ4,ifusing(2).\displaystyle\tilde{\sigma}^{2}_{F_{i}}=\begin{cases}4\sigma_{G}^{2}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)+2(\alpha L\zeta)^{2}&,~{}\mathrm{if~{}using}~{}\eqref{eq:first_approx}\\ \begin{aligned} &6\sigma_{G}^{2}\left(\frac{\alpha^{2}}{\epsilon^{2}D^{\prime\prime}_{i}}+\left(1+2(\alpha L)^{2}\right)\right.\\ &\left.\cdot\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)\right)+2(\alpha\rho\epsilon)^{2}\zeta^{4}\end{aligned}&,~{}\mathrm{if~{}using}~{}\eqref{eq:hessian_estimate}.\end{cases} (54)
Proof.

The detailed proof is presented in Appendix I of the technical report [39]. ∎

Corollary 2 indicates that NUFM can be directly combined with the first-order approximation techniques to reduce computational cost. Further, similar to Corollary 1, Corollary 2 can be extended to the multi-step cases.

VII Simulation

This section evaluates the performance of our proposed algorithms by comparing them with existing baselines on real-world datasets. We first present the experimental setup, including the datasets, models, parameters, baselines, and environment. Then we present the results from various aspects.

VII-A Experimental Setup

Figure 2: Comparison of convergence rates under different numbers of participating devices. NUFM significantly accelerates the convergence of the existing FML approach, especially with fewer participating devices. In addition, with more participating devices, the advantages of NUFM weaken, leading to smaller gaps between NUFM and the existing algorithms.

VII-A1 Datasets and Models

We evaluate our algorithms on four widely-used benchmarks, namely Fashion-MNIST [43], CIFAR-10 [44], CIFAR-100 [44], and ImageNet [45]. Specifically, the data is distributed among n=100n=100 devices as follows: a) each device has samples from two random classes; b) the number of samples per class follows a truncated Gaussian distribution 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) with μ=5\mu=5 and σ=5\sigma=5. We select 50% of the devices at random for training and use the rest for testing. For each device, we divide the local dataset into a support set and a query set. We consider 1-shot 2-class classification tasks, i.e., the support set contains only 1 labeled example for each class. We set the stepsizes as α=β=0.001\alpha=\beta=0.001. We use a convolutional neural network (CNN) with max-pooling operations and the Leaky Rectified Linear Unit (Leaky ReLU) activation function, containing three convolutional layers with sizes 32, 64, and 128, respectively, followed by a fully connected layer and a softmax layer. The strides are set to 1 for the convolution operations and 2 for the pooling operations.
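For reference, the following NumPy sketch illustrates this partition; details such as truncating each class count at one sample and drawing indices without replacement are our assumptions rather than specifications from the paper.

```python
import numpy as np

def partition_non_iid(labels, n_devices=100, classes_per_device=2, mu=5, sigma=5, seed=0):
    """Illustrative sketch: each device gets samples from two random classes,
    with a (truncated) Gaussian number of samples per class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    pool = {c: list(rng.permutation(np.where(labels == c)[0])) for c in classes}
    devices = []
    for _ in range(n_devices):
        picked = rng.choice(classes, size=classes_per_device, replace=False)
        idx = []
        for c in picked:
            k = max(1, int(round(rng.normal(mu, sigma))))   # truncated Gaussian count
            idx.extend(pool[c][:k])
            pool[c] = pool[c][k:]
        devices.append(np.array(idx, dtype=int))
    return devices
```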

VII-A2 Baselines

To compare the performance of NUFM, we first consider two existing algorithms, i.e., FedAvg [4] and Per-FedAvg [9]. Further, to validate the effectiveness of URAL in multi-access wireless networks, we use two baselines, called Greedy and Random. In each round, the Greedy strategy determines the CPU-cycle frequency νig\nu^{\mathit{g}}_{i} and transmission power pigp^{\mathit{g}}_{i} for device i𝒩i\in\mathcal{N} by greedily minimizing its individual objective, i.e.,

νig\displaystyle\nu^{\mathit{g}}_{i} =argminνi{η1ιiciDiνi22+η2ciDiνi,s.t.(34)}\displaystyle=\mathop{\arg\min}_{\nu_{i}}\left\{\eta_{1}\frac{\iota_{i}c_{i}D_{i}\nu^{2}_{i}}{2}+\eta_{2}\frac{c_{i}D_{i}}{\nu_{i}},~{}\mathrm{s.t.}~{}\eqref{eq:constraints_nui}\right\} (55)
pig\displaystyle p^{\mathit{g}}_{i} =argminpi{mzi,mg(η1pi+η2)SBlog2(1+hipiIm+BN0),s.t.(33)}\displaystyle=\mathop{\arg\min}_{p_{i}}\left\{\sum_{m\in\mathcal{M}}\frac{z^{\mathit{g}}_{i,m}(\eta_{1}p_{i}+\eta_{2})S}{B\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m}+BN_{0}}\right)},~{}\mathrm{s.t.}~{}\eqref{eq:constraints_pi}\right\} (56)

where {zi,mg}m\{z^{\mathit{g}}_{i,m}\}_{m\in\mathcal{M}} is selected at random (i.e., randomly allocating RBs to selected devices). The Random strategy decides the CPU-cycle frequencies, transmission powers, and RB allocation for the selected devices uniformly at random from the feasible regions.
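For the CPU-cycle frequency in (55), the minimizer has a simple closed form because the objective is convex in ν_i and its stationary point is (η_2/(η_1 ι_i))^{1/3} (the factor c_i D_i cancels). Below is a minimal sketch, assuming that (34) is the box constraint 0 ≤ ν_i ≤ ν_i^max.

```python
def greedy_cpu_frequency(eta1, eta2, iota_i, nu_max_i):
    """Sketch of the per-device Greedy choice in (55): minimize
    eta1*iota_i*c_i*D_i*nu**2/2 + eta2*c_i*D_i/nu over the assumed box constraint."""
    nu_star = (eta2 / (eta1 * iota_i)) ** (1.0 / 3.0)   # unconstrained stationary point
    return min(nu_star, nu_max_i)                       # clip to the frequency cap
```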

VII-A3 Implementation

We implement the code in TensorFlow Version 1.14 on a server with two Intel® Xeon® Gold 5120 CPUs and one Nvidia® Tesla V100 32 GB GPU. The parameters used in the simulation can be found in Table III.

VII-B Experimental Results

VII-B1 Convergence Speed

To demonstrate the improvement of NUFM on the convergence speed, we compare the algorithms on different benchmarks with the same initial model and learning rate. We vary the number of participating devices nkn_{k} from 20 to 40, and set the numbers of local update steps and total communication rounds as τ=1\tau=1 and K=50K=50, respectively. We let λ1=λ2=1\lambda_{1}=\lambda_{2}=1. As illustrated in Fig. 2 and Table II, NUFM significantly improves the convergence speed and corresponding test accuracy of the existing FML approach on all datasets (to make the graphs more legible, we draw symbols every two points in Fig. 2). Clearly, this validates the effectiveness of our proposed device selection scheme that maximizes the lower bound of the one-round global loss reduction. Interestingly, Fig. 2 also indicates that NUFM converges more quickly with relatively fewer participating devices. For example, in round 19, the loss achieved by NUFM with nk=20n_{k}=20 decreases by more than 9% and 20% over those with nk=30n_{k}=30 and nk=40n_{k}=40 on Fashion-MNIST, respectively. The underlying rationale is that relatively fewer “good” devices can provide a larger lower bound on the one-round global loss decrease (note that in (1) the lower bound takes the average over the selected devices), whereas more selected devices in each round generally require more communication resources. Thus, the results reveal the potential of NUFM for resource-limited wireless systems.

TABLE II: Test accuracy after fifty rounds of training.
Algorithm Fashion-MNIST CIFAR-10 CIFAR-100 ImageNet
NUFM 68.04% 58.80% 23.95% 34.04%
Per-FedAvg 62.75% 58.22% 21.49% 30.98%
FedAvg 61.04% 54.31% 10.13% 12.14%

VII-B2 Effect of Local Update Steps

To show the effect of local update steps on the convergence rate of NUFM, we present results with varying numbers of local update steps τ=1,2,,10\tau=1,2,\dots,10 in each round. For clarity of illustration, we compare the loss under different numbers of local steps in round 19 on Fashion-MNIST and CIFAR-100. Fig. 3 shows that fewer local update steps lead to a larger gap between the baselines and NUFM, which verifies the theoretical result that a small number of local steps can slow the convergence of FedAvg and Per-FedAvg [9, Theorem 4.5]. It also implies that NUFM can improve the computational efficiency of local devices.

Figure 3: Effect of local update steps on convergence rates. Fewer local steps lead to larger gaps between NUFM and the existing methods.

VII-B3 Performance of URAL in Wireless Networks

We evaluate the performance of URAL by comparing it with four baselines, namely NUFM-Greedy, NUFM-Random, RU-Greedy, and RU-Random, as detailed below.

  1. a)

    NUFM-Greedy: select devices by NUFM, decide CPU-cycle frequencies, RB allocation, and transmission power by Greedy strategy;

  2. b)

    NUFM-Random: select devices by NUFM, decide CPU-cycle frequencies, RB allocation, and transmission power by Random strategy;

  3. c)

    RU-Greedy: select devices uniformly at random, decide CPU-cycle frequencies, RB allocation, and transmission power by Greedy strategy;

  4. d)

    RU-Random: select devices uniformly at random, decide CPU-cycle frequencies, RB allocation, and transmission power by Random strategy.

TABLE III: Parameters in Simulation
Parameter Value
Step size (α\alpha) and meta–learning rate (β\beta) 0.001
# edge devices (nn) 100
# participating devices (nkn_{k}) in Experiments 1 and 2 An integer varying among {20,30,40}\{20,30,40\}
# local updates (τ\tau) An integer varying among {1,2,,10}\{1,2,\dots,10\}
Hyper-parameters (λ1\lambda_{1} and λ2\lambda_{2}) 1
Weight parameters (η1\eta_{1} and η2\eta_{2}) Real numbers varying among {0.5,1,1.5,2.0,2.5}\{0.5,1,1.5,2.0,2.5\}
# RBs An integer varying among {1,5,10,20,30,40,50}\{1,5,10,20,30,40,50\}
CPU cycles of devices (cic_{i}) Real numbers following U(0,0.25)U(0,0.25)
Maximum CPU-cycle frequencies of devices (νi𝑚𝑎𝑥\nu^{\mathit{max}}_{i}) Real numbers following U(0,2)U(0,2)
Maximum transmission powers of devices (pi𝑚𝑎𝑥p^{\mathit{max}}_{i}) Real numbers following U(0,1)U(0,1)
Interference in RBs (ImI_{m}) Real numbers following U(0,0.8)U(0,0.8)
Model size (SS) and bandwidth (BB) 1
Noise power spectral density (N0N_{0}) 1
Effective capacitance coefficients of devices (ιi\iota_{i}) Real numbers following U(0,1)U(0,1)
Channel gains of devices (hih_{i}) Real numbers following U(0.1,h𝑚𝑎𝑥)U(0.1,h^{\mathit{max}}), where h𝑚𝑎𝑥h^{\mathit{max}} varies among {0.25,0.5,0.75,,2.0}\{0.25,0.5,0.75,\dots,2.0\}

We simulate a wireless system consisting of M=20M=20 RBs and let the channel gain hih_{i} of device ii follow a uniform distribution U(h𝑚𝑖𝑛,h𝑚𝑎𝑥)U(h^{\mathit{min}},h^{\mathit{max}}) with h𝑚𝑖𝑛=0.1h^{\mathit{min}}=0.1 and h𝑚𝑎𝑥=1h^{\mathit{max}}=1. We set S=1S=1, B=1B=1, and N0=1N_{0}=1. The interference of RB mm is drawn from ImU(0,0.8)I_{m}\sim U(0,0.8). We set ιiU(0,1)\iota_{i}\sim U(0,1), ciU(0,0.25)c_{i}\sim U(0,0.25), pi𝑚𝑎𝑥U(0,1)p^{\mathit{max}}_{i}\sim U(0,1), and νi𝑚𝑎𝑥U(0,2)\nu^{\mathit{max}}_{i}\sim U(0,2) for each i𝒩i\in\mathcal{N} to simulate the device heterogeneity. In the following experiments, we run the algorithms on Fashion-MNIST with local update steps τ=1\tau=1 and η1=η2=1\eta_{1}=\eta_{2}=1.

As shown in Fig. 4, URAL can significantly reduce energy consumption and wall-clock time, as compared with the baselines. Perhaps counter-intuitively, the Greedy strategy is not always better than Random, for two reasons. On one hand, the energy cost and wall-clock time depend on the selection of the weight parameters η1\eta_{1} and η2\eta_{2}. The results in Fig. 4 imply that, when η1=η2=1\eta_{1}=\eta_{2}=1, Greedy pays more attention to the wall-clock time than to the energy consumption. Accordingly, Greedy achieves a much lower average delay than Random, but at the cost of extra energy. On the other hand, the wall-clock time and energy cost require joint control with the RB allocation. Although Greedy minimizes the individual objectives (55)-(56), improper RB allocation can cause arbitrary performance degradation. Different from the Greedy- and Random-based baselines, URAL maximizes the joint objective (P) by co-optimizing the CPU-cycle frequencies, transmission power, and RB allocation strategy, so it can alleviate the above issues and achieve better delay and energy control. At the same time, Fig. 4 indicates that URAL converges as fast as NUFM-Greedy and NUFM-Random (the corresponding lines almost overlap), which select devices greedily to accelerate convergence (as in NUFM). Thus, URAL achieves an excellent convergence rate.

Figure 4: Comparison of convergence, energy cost, and wall-clock training time. URAL achieves fast convergence and a short wall-clock time with low energy consumption.

VII-B4 Effect of Resource Blocks

In Fig. 5, we test the performance of URAL under different numbers of RBs, varying the number of RBs MM from 1 to 50. More RBs enable more devices to be selected in each round, leading to larger energy consumption. As shown in Fig. 5, URAL keeps the wall-clock time stable as the number of RBs increases. Meanwhile, URAL can control the power of the devices to avoid a serious waste of energy. It is counter-intuitive that the convergence speed does not always improve as the number of RBs increases, especially for URAL. The reason is the same as that in Section VII-B1: too few selected devices can slow the convergence due to insufficient information provided in each round, while a large number of participating devices may weaken the global loss reduction, as shown in (1). Therefore, URAL can adapt to practical systems with constrained wireless resources by achieving fast convergence with only a small set of devices.

Figure 5: Comparison of convergence, energy cost, and wall-clock time under different numbers of RBs. URAL controls the energy cost and wall-clock time well as more RBs become available. Meanwhile, it can achieve fast convergence with only a small number of RBs.

VII-B5 Effect of Channel Quality

To investigate the effect of channel conditions on performance, we set the number of RBs M=20M=20 and vary the maximum channel gain h𝑚𝑎𝑥h^{\mathit{max}} from 0.25 to 2, and show the corresponding energy consumption and wall-clock training time in Fig. 6. The results indicate that the energy consumption and latency decrease as channel quality improves, because devices can use less power to achieve a relatively large transmission rate.

Figure 6: Effect of channel gains on performance. Worse channel conditions would induce larger transmission power and longer wall-clock time.

VII-B6 Effect of Weight Parameters

We study how the weight parameters η1\eta_{1} and η2\eta_{2} affect the average energy consumption and wall-clock time of URAL in Fig. 7. We first fix η2=1\eta_{2}=1 and vary η1\eta_{1} from 0.5 to 2.5. As expected, the total energy consumption decreases with the increase of η1\eta_{1}, with the opposite trend for the wall-clock time. Then we vary η2\eta_{2} with η1=1\eta_{1}=1. Similarly, a larger η2\eta_{2} leads to lower latency and a higher energy cost. This implies that we can control the levels of wall-clock training time and energy consumption by tuning the weight parameters. In particular, even with a large η1\eta_{1} or η2\eta_{2}, the wall-clock time and energy cost can be kept at low levels. Meanwhile, the convergence rate achieved by URAL is robust to η1\eta_{1} and η2\eta_{2}. Thus, URAL can make full use of the resources (including datasets, bandwidth, and power) and achieve good trade-offs among the convergence rate, latency, and energy consumption.

Figure 7: Effect of weight parameters η1\eta_{1} and η2\eta_{2}. A large η1\eta_{1} achieves lower energy consumption while leading to longer wall-clock time (the average wall-clock time is 18.63 when η1=0.5\eta_{1}=0.5; it is 21.53 when η1=2.5\eta_{1}=2.5). It is the opposite for η2\eta_{2}.

VIII Conclusion

In this paper, we have proposed an FML algorithm, called NUFM, that maximizes the theoretical lower bound of the global loss reduction in each round to accelerate convergence. Aiming at effectively deploying NUFM in wireless networks, we have presented a device selection and resource allocation strategy (URAL), which jointly controls the CPU-cycle frequencies, transmission powers, and RB allocation to optimize the trade-off between energy consumption and wall-clock training time. Moreover, we have integrated the proposed algorithms with two first-order approximation techniques to further reduce the computational complexity on IoT devices. Extensive simulation results demonstrate that the proposed methods outperform the baseline algorithms.

Future work will investigate the trade-off between the local update and global aggregation in FML to minimize the convergence time and energy cost from a long-term perspective. In addition, how to characterize the convergence properties and communication complexity of the NUFM algorithm requires further research.

References

  • [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
  • [2] G. Plastiras, M. Terzi, C. Kyrkou, and T. Theocharides, “Edge intelligence: Challenges and opportunities of near-sensor machine learning applications,” in Proc. IEEE ASAP, 2018, pp. 1–7.
  • [3] X. Zhang, Y. Wang, S. Lu, L. Liu, W. Shi et al., “Openei: An open framework for edge intelligence,” in Proc. IEEE ICDCS, 2019, pp. 1840–1851.
  • [4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS.   PMLR, 2017, pp. 1273–1282.
  • [5] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
  • [6] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning with fast convergence and efficient communication,” arXiv preprint arXiv:1802.07876, 2019.
  • [7] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” arXiv preprint arXiv:1909.12488, 2019.
  • [8] S. Lin, G. Yang, and J. Zhang, “A collaborative learning framework via federated meta-learning,” in Proc. IEEE ICDCS, 2020, pp. 289–299.
  • [9] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proc. NIPS, 2020, pp. 1–12.
  • [10] P. Kairouz et al., “Advances and open problems in federated learning,” arXiv preprint arXiv:1912.04977, 2021.
  • [11] S. Yue, J. Ren, J. Xin, S. Lin, and J. Zhang, “Inexact-admm based federated meta-learning for fast and continual edge learning,” in Proc. ACM MobiHoc, 2021, p. 91–100.
  • [12] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in Proc. IEEE ICC, 2019, pp. 1–7.
  • [13] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, and H. V. Poor, “Fast-convergent federated learning,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 201–218, 2020.
  • [14] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, 2020.
  • [15] C. T. Dinh, N. H. Tran, M. N. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, “Federated learning over wireless networks: Convergence analysis and resource allocation,” IEEE/ACM Trans. Networking, vol. 29, no. 1, pp. 398–409, 2020.
  • [16] J. Liu, J. Ren, Y. Zhang, X. Peng, Y. Zhang, and Y. Yang, “Efficient dependent task offloading for multiple applications in mec-cloud system,” IEEE Trans. Mob. Comput., 2021.
  • [17] Z. M. Fadlullah and N. Kato, “Hcp: Heterogeneous computing platform for federated learning based collaborative content caching towards 6g networks,” IEEE Trans. Emerging Top. Comput., 2020.
  • [18] Q. Wu, K. He, and X. Chen, “Personalized federated learning for intelligent iot applications: A cloud-edge based framework,” IEEE Open J. Comput. Soc., vol. 1, pp. 35–44, 2020.
  • [19] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 1–19, 2019.
  • [20] S. Yue, J. Ren, N. Qiao, Y. Zhang, H. Jiang, Y. Zhang, and Y. Yang, “Todg: Distributed task offloading with delay guarantees for edge computing,” IEEE Trans. Parallel Distrib. Syst., 2021.
  • [21] J. Ren, Y. He, D. Wen, G. Yu, K. Huang, and D. Guo, “Scheduling for cellular federated edge learning with importance and channel awareness,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7690–7703, 2020.
  • [22] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, 2019.
  • [23] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio resource allocation for federated edge learning,” in Proc. IEEE ICC Workshops, 2020, pp. 1–6.
  • [24] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device scheduling and resource allocation for latency constrained wireless federated learning,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 453–467, 2020.
  • [25] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence time optimization for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2457–2471, 2020.
  • [26] J. Ren, G. Yu, and G. Ding, “Accelerating dnn training in wireless federated edge learning systems,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 219–232, 2020.
  • [27] H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization,” in Proc. ICML, 2019, pp. 7184–7193.
  • [28] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, 2020.
  • [29] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas in Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
  • [30] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in Proc. ICML, 2020, pp. 5132–5143.
  • [31] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, “Cost-effective federated learning design,” in Proc. IEEE INFOCOM, 2021.
  • [32] W. Luping, W. Wei, and L. Bo, “Cmfl: Mitigating communication overhead for federated learning,” in Proc. IEEE ICDCS, 2019, pp. 954–964.
  • [33] J. Li, M. Khodak, S. Caldas, and A. Talwalkar, “Differentially private meta-learning,” in Proc. ICLR, 2019.
  • [34] K. C. Sim et al., “Personalization of end-to-end speech recognition on mobile devices for named entities,” in Proc. IEEE ASRU, 2019, pp. 23–30.
  • [35] W. Shi, S. Zhou, and Z. Niu, “Device scheduling with fast convergence for wireless federated learning,” in Proc. IEEE ICC, 2020, pp. 1–6.
  • [36] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
  • [37] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “Fedpd: A federated learning framework with optimal rates and adaptivity to non-iid data,” arXiv preprint arXiv:2005.11418, 2020.
  • [38] A. Fallah, A. Mokhtari, and A. Ozdaglar, “On the convergence theory of gradient-based model-agnostic meta-learning algorithms,” in Proc. AISTATS, 2020, pp. 1082–1092.
  • [39] S. Yue, J. Ren, J. Xin, D. Zhang, Y. Zhang, and W. Zhuang, “Efficient federated meta-learning over multi-access wireless networks,” arXiv preprint arXiv:2108.06453, 2021.
  • [40] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms.   MIT press, 2009.
  • [41] T. D. Burd and R. W. Brodersen, “Processor design for portable systems,” J. VLSI Sig. Proc. Syst. Sig. Image Video Technol., vol. 13, no. 2, pp. 203–221, 1996.
  • [42] E. W. Weisstein, “Hungarian maximum matching algorithm,” https://mathworld.wolfram.com/, 2011.
  • [43] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
  • [44] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Technical Report TR-2009, University of Toronto, 2009.
  • [45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. of IEEE CVPR, 2009, pp. 248–255.

Appendix A Proof of Lemma 1

The proof is standard [11, 9], and we provide it for completeness. Recalling the definition of FiF_{i} as Fi(θ)fi(θαfi(θ))F_{i}(\theta)\coloneqq f_{i}(\theta-\alpha\nabla f_{i}(\theta)), we have

Fi(θ)=(Iα2fi(θ))fi(θαfi(θ)).\displaystyle\nabla F_{i}(\theta)=(I-\alpha\nabla^{2}f_{i}(\theta))\nabla f_{i}(\theta-\alpha\nabla f_{i}(\theta)).

Based on that, for any θ1,θ2d{\theta}_{1},{\theta}_{2}\in\mathbb{R}^{d} we have

Fi(θ1)Fi(θ2)=\displaystyle\|\nabla F_{i}({\theta}_{1})-\nabla F_{i}({\theta}_{2})\|= (Iα2fi(θ1))fi(θ1αfi(θ1))(Iα2fi(θ2))fi(θ2αfi(θ2))\displaystyle\;\|(I-\alpha\nabla^{2}f_{i}({\theta}_{1}))\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-(I-\alpha\nabla^{2}f_{i}({\theta}_{2}))\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|
=\displaystyle= (Iα2fi(θ1))(fi(θ1αfi(θ1))fi(θ2αfi(θ2)))\displaystyle\;\|(I-\alpha\nabla^{2}f_{i}({\theta}_{1}))(\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2})))
+((Iα2fi(θ1))(Iα2fi(θ2)))fi(θ2αfi(θ2))\displaystyle+((I-\alpha\nabla^{2}f_{i}({\theta}_{1}))-(I-\alpha\nabla^{2}f_{i}({\theta}_{2})))\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\| (adding and subtracting the term (Iα2fi(θ1))fi(θ2αfi(θ2)(I-\alpha\nabla^{2}f_{i}(\theta_{1}))\nabla f_{i}(\theta_{2}-\alpha\nabla f_{i}(\theta_{2}))
\displaystyle\leq Iα2fi(θ1)fi(θ1αfi(θ1))fi(θ2αfi(θ2))(a)\displaystyle\;\underbrace{\|I-\alpha\nabla^{2}f_{i}({\theta}_{1})\|\|\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|}_{(a)} (from triangle inequality)
+α2fi(θ1)2fi(θ2)fi(θ2αfi(θ2))(b).\displaystyle+\alpha\underbrace{\|\nabla^{2}f_{i}({\theta}_{1})-\nabla^{2}f_{i}({\theta}_{2})\|\|\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|}_{(b)}. (57)

Then, for (a)(a), the following holds true

Iα2fi(θ1)fi(θ1αfi(θ1))fi(θ2αfi(θ2))\displaystyle\|I-\alpha\nabla^{2}f_{i}({\theta}_{1})\|\|\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|
\displaystyle\leq\, (1+αL)Lθ1αfi(θ1))θ2+αfi(θ2)\displaystyle(1+\alpha L)L\|{\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-{\theta}_{2}+\alpha\nabla f_{i}({\theta}_{2})\| (from Assumption 1)
\displaystyle\leq\, (1+αL)L(θ1θ2+αfi(θ1)fi(θ2))\displaystyle(1+\alpha L)L(\|{\theta}_{1}-{\theta}_{2}\|+\alpha\|\nabla f_{i}({\theta}_{1})-\nabla f_{i}({\theta}_{2})\|) (from triangle inequality)
\displaystyle\leq\, (1+αL)2Lθ1θ2.\displaystyle(1+\alpha L)^{2}L\|{\theta}_{1}-{\theta}_{2}\|. (58)

Regarding (b)(b), it can be shown that

2fi(θ1)2fi(θ2)fi(θ2αfi(θ2))ρζθ1θ2.\displaystyle\|\nabla^{2}f_{i}({\theta}_{1})-\nabla^{2}f_{i}({\theta}_{2})\|\|\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|\leq\rho\zeta\|{\theta}_{1}-{\theta}_{2}\|. (from Assumptions 1 and 2)

Substituting (a)(a) and (b)(b) into (57), we have the result.

Appendix B Proof of Lemma 2

We rewrite the stochastic gradient ~Fi(θ)\tilde{\nabla}F_{i}({\theta}) as follows

~Fi(θ)=(Iα2fi(θ)+δ1)(fi(θαfi(θ))+δ2)\displaystyle\tilde{\nabla}F_{i}({\theta})=\left(I-\alpha\nabla^{2}f_{i}({\theta})+\delta^{*}_{1}\right)\left(\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)+\delta^{*}_{2}\right) (59)

where δ1\delta^{*}_{1} and δ2\delta^{*}_{2} are given by

δ1\displaystyle\delta^{*}_{1} =α(2fi(θ)~2fi(θ,𝒟i′′))\displaystyle=\alpha\left(\nabla^{2}f_{i}({\theta})-\tilde{\nabla}^{2}f_{i}({\theta},\mathcal{D}^{\prime\prime}_{i})\right) (60)
δ2\displaystyle\delta^{*}_{2} =~fi(θα~fi(θ,𝒟i),𝒟i)fi(θαfi(θ)).\displaystyle=\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right). (61)

Note that δ1\delta^{*}_{1} and δ2\delta^{*}_{2} are independent. Due to Assumption 3, we have

𝔼[δ1]\displaystyle\mathbb{E}[\delta^{*}_{1}] =0\displaystyle=0 (62)
𝔼[δ12]\displaystyle\mathbb{E}[\|\delta^{*}_{1}\|^{2}] (ασH)2Di′′.\displaystyle\leq\frac{(\alpha\sigma_{H})^{2}}{D^{\prime\prime}_{i}}. (63)

Next, we proceed to bound the first and second moments of δ2\delta^{*}_{2}. Regarding the first moment, we have

𝔼[δ2]\displaystyle\left\|\mathbb{E}\left[\delta^{*}_{2}\right]\right\| =𝔼[~fi(θα~fi(θ,𝒟i),𝒟i)fi(θα~fi(θ,𝒟i))+fi(θα~fi(θ,𝒟i))fi(θαfi(θ))]\displaystyle=\left\|\mathbb{E}\left[\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}-\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}+\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right]\right\|
=𝔼[fi(θα~fi(θ,𝒟i))fi(θαfi(θ))]\displaystyle=\left\|\mathbb{E}\left[\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right]\right\|
𝔼[fi(θα~fi(θ,𝒟i))fi(θαfi(θ))]\displaystyle\leq\mathbb{E}\left[\left\|\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|\right]
αL𝔼[~fi(θ,𝒟i)fi(θ)]\displaystyle\leq\alpha L\mathbb{E}\left[\left\|\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})-\nabla f_{i}({\theta})\right\|\right] (from the smoothness of fif_{i})
ασGLDi.\displaystyle\leq\frac{\alpha\sigma_{G}L}{\sqrt{D_{i}}}. (from Assumption 3)

Regarding the second moment, we have

𝔼[δ22]\displaystyle\mathbb{E}\left[\|\delta^{*}_{2}\|^{2}\right] 2𝔼[~fi(θα~fi(θ,𝒟i),𝒟i)fi(θα~fi(θ,𝒟i))2]+2𝔼[fi(θα~fi(θ,𝒟i))fi(θαfi(θ))2]\displaystyle\leq 2\mathbb{E}\left[\left\|\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}-\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}\right\|^{2}\right]+2\mathbb{E}\left[\left\|\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|^{2}\right]
2σG2Di+2(αL)2𝔼[~fi(θ,𝒟i)fi(θ)2]\displaystyle\leq\frac{2\sigma^{2}_{G}}{D^{\prime}_{i}}+2(\alpha L)^{2}\mathbb{E}\left[\left\|\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})-\nabla f_{i}({\theta})\right\|^{2}\right] (from the tower rule, smoothness of fif_{i} along with Assumption 3)
2σG2(1Di+(αL)2Di).\displaystyle\leq 2\sigma^{2}_{G}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right). (64)

Based on (59), we have

𝔼[~Fi(θ)Fi(θ)]\displaystyle\left\|\mathbb{E}\left[\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right]\right\| =(Iα2fi(θ))𝔼[δ2]+𝔼[δ1]fi(θαfi(θ))+𝔼[δ1δ2]\displaystyle=\left\|\left(I-\alpha\nabla^{2}f_{i}({\theta})\right)\mathbb{E}[\delta^{*}_{2}]+\mathbb{E}[\delta^{*}_{1}]\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)+\mathbb{E}[\delta^{*}_{1}\delta^{*}_{2}]\right\|
Iα2fi(θ)𝔼[δ2]\displaystyle\leq\left\|I-\alpha\nabla^{2}f_{i}({\theta})\right\|\left\|\mathbb{E}[\delta^{*}_{2}]\right\|
ασGL(1+αL)Di\displaystyle\leq\frac{\alpha\sigma_{G}L(1+\alpha L)}{\sqrt{D_{i}}} (65)

which gives us the first result in Lemma 2.

By the submultiplicative property of matrix norm,

~Fi(θ)Fi(θ)Iα2fi(θ)δ2+fi(θαfi(θ))δ1+δ1δ2.\displaystyle\left\|\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right\|\leq\left\|I-\alpha\nabla^{2}f_{i}({\theta})\right\|\left\|\delta^{*}_{2}\right\|+\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|\left\|\delta^{*}_{1}\right\|+\left\|\delta^{*}_{1}\right\|\left\|\delta^{*}_{2}\right\|. (66)

Thus, we have

~Fi(θ)Fi(θ)23Iα2fi(θ)2δ22+3fi(θαfi(θ))2δ12+3δ12δ22.\displaystyle\left\|\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right\|^{2}\leq 3\left\|I-\alpha\nabla^{2}f_{i}({\theta})\right\|^{2}\left\|\delta^{*}_{2}\right\|^{2}+3\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|^{2}\left\|\delta^{*}_{1}\right\|^{2}+3\left\|\delta^{*}_{1}\right\|^{2}\left\|\delta^{*}_{2}\right\|^{2}. (67)

Taking expectations on both sides, we obtain

𝔼[~Fi(θ)Fi(θ)2]\displaystyle\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right\|^{2}\right] 3(1+αL)2𝔼[δ22]+3ζ2𝔼[δ12]+3𝔼[δ12]𝔼[δ22]\displaystyle\leq 3\left(1+\alpha L\right)^{2}\mathbb{E}\left[\left\|\delta^{*}_{2}\right\|^{2}\right]+3\zeta^{2}\mathbb{E}\left[\left\|\delta^{*}_{1}\right\|^{2}\right]+3\mathbb{E}\left[\left\|\delta^{*}_{1}\right\|^{2}\right]\mathbb{E}\left[\left\|\delta^{*}_{2}\right\|^{2}\right] (using the fact that Iα2fi(θ)1+αL\|I-\alpha\nabla^{2}f_{i}({\theta})\|\leq 1+\alpha L)
6(ασGσH)2Di′′(1Di+(αL)2Di)+6σG2(1+αL)2(1Di+(αL)2Di)+3(αζσH)2Di′′\displaystyle\leq\frac{6(\alpha\sigma_{G}\sigma_{H})^{2}}{D^{\prime\prime}_{i}}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)+6\sigma^{2}_{G}(1+\alpha L)^{2}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)+\frac{3(\alpha\zeta\sigma_{H})^{2}}{D^{\prime\prime}_{i}} (68)

which completes the proof.

Appendix C Proof of Lemma 3

Recall the definition of Fi(θ)F_{i}({\theta}). We have

Fi(θ)Fj(θ)=\displaystyle\left\|\nabla F_{i}({\theta})-F_{j}({\theta})\right\|=\, (Iα2fi(θ))fi(θαfi(θ))(Iα2fj(θ))fj(θαfj(θ))\displaystyle\left\|\left(I-\alpha\nabla^{2}f_{i}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
\displaystyle\leq\, (Iα2fi(θ))fi(θαfi(θ))(Iα2fj(θ))fi(θαfi(θ))\displaystyle\left\|\left(I-\alpha\nabla^{2}f_{i}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|
+(Iα2fj(θ))fi(θαfi(θ))(Iα2fj(θ))fj(θαfj(θ))\displaystyle+\left\|\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
\displaystyle\leq\, α(2fi(θ)2fj(θ))fi(θαfi(θ))A\displaystyle\alpha\underbrace{\left\|\left(\nabla^{2}f_{i}({\theta})-\nabla^{2}f_{j}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|}_{A}
+(Iα2fj(θ))(fi(θαfi(θ))fj(θαfj(θ)))B.\displaystyle+\underbrace{\left\|\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\left(\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right)\right\|}_{B}. (69)

Regarding AA, due to the submultiplicative property of the matrix norm, the following holds

A\displaystyle A 2fi(θ)2fj(θ)fi(θαfi(θ))\displaystyle\leq\left\|\nabla^{2}f_{i}({\theta})-\nabla^{2}f_{j}({\theta})\right\|\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|
ζγH\displaystyle\leq\zeta\gamma_{H} (70)

where the last inequality follows from Assumption 4.

Regarding BB, similarly, we obtain

B\displaystyle B\leq\, Iα2fj(θ)fi(θαfi(θ))fj(θαfj(θ))\displaystyle\left\|I-\alpha\nabla^{2}f_{j}({\theta})\right\|\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
(a)\displaystyle\mathop{\leq}^{\textbf{(a)}}\, (1+αL)fi(θαfi(θ))fj(θαfj(θ))\displaystyle\left(1+\alpha L\right)\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
\displaystyle\leq\, (1+αL)(fi(θαfi(θ))fi(θαfj(θ))\displaystyle\left(1+\alpha L\right)\Big{(}\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{i}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
+fi(θαfj(θ))fj(θαfj(θ)))\displaystyle+\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|\Big{)}
(b)\displaystyle\mathop{\leq}^{\text{(b)}}\, (1+αL)(αLfi(θ)fj(θ)+γG)\displaystyle\left(1+\alpha L\right)\left(\alpha L\left\|\nabla f_{i}({\theta})-f_{j}({\theta})\right\|+\gamma_{G}\right)
\displaystyle\leq\, (1+αL)2γG\displaystyle\left(1+\alpha L\right)^{2}\gamma_{G} (71)

where (a) is derived via triangle inequality, and (b) follows from Assumptions 1 and 4.

Substituting (70) and (71) into (69), we have

Fi(θ)Fj(θ)(1+αL)2γG+αζγH\displaystyle\left\|\nabla F_{i}({\theta})-\nabla F_{j}({\theta})\right\|\leq\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H} (72)

thereby completing the proof.

Appendix D Proof of Lemma 4

Since jj is the straggler among devices, (SP1) can be equivalently transformed to

min𝝂η1i𝒩ιi2ciDiνi2+η2cjDjνjs.t.0νiνi𝑚𝑎𝑥,i𝒩ciDiνjcjDjνi,i𝒩/j.\displaystyle\begin{split}\mathop{\min}_{\bm{\nu}}&\quad\eta_{1}\sum_{i\in\mathcal{N}}\frac{\iota_{i}}{2}c_{i}D_{i}\nu^{2}_{i}+\eta_{2}\frac{c_{j}D_{j}}{\nu_{j}}\\ \mathrm{s.t.}&\quad 0\leq\nu_{i}\leq\nu^{\mathit{max}}_{i},~{}\forall i\in\mathcal{N}\\ &\quad\frac{c_{i}D_{i}\nu_{j}}{c_{j}D_{j}}\leq\nu_{i},~{}\forall i\in\mathcal{N}/j.\end{split} (73)

Fixing νj\nu_{j}, for each i𝒩/ji\in\mathcal{N}/j (we implicitly assume that 𝒩/j\mathcal{N}/j is not empty), we can show that the optimal CPU frequency νi\nu^{*}_{i} of problem (73) can be obtained via solving the following decomposed convex optimization problem

minνiιi2η1ciDiνi2s.t.0νiνi𝑚𝑎𝑥ciDiνjcjDjνi.\displaystyle\begin{split}\mathop{\min}_{\nu_{i}}&\quad\frac{\iota_{i}}{2}\eta_{1}c_{i}D_{i}\nu^{2}_{i}\\ \mathrm{s.t.}&\quad 0\leq\nu_{i}\leq\nu^{\mathit{max}}_{i}\\ &\quad\frac{c_{i}D_{i}\nu_{j}}{c_{j}D_{j}}\leq\nu_{i}.\end{split} (74)

If νjcjDjνi𝑚𝑎𝑥ciDi\nu_{j}\leq\frac{c_{j}D_{j}\nu^{\mathit{max}}_{i}}{c_{i}D_{i}}, the optimal solution of problem (74) is given by

νi=ciDiνjcjDj.\displaystyle\nu^{*}_{i}=\frac{c_{i}D_{i}\nu_{j}}{c_{j}D_{j}}. (75)

Otherwise, the problem is infeasible because the constraints are mutually contradictory. Substituting νi=νi\nu_{i}=\nu^{*}_{i} in problem (73), we have the following problem with respect to νj\nu_{j}

minνjg(νj)=η1(i𝒩/jιi(ciDi)32(cjDj)2+ιjcjDj2)a1νj2+η2cjDja21νjs.t.0νjcjDjνi𝑚𝑎𝑥ciDi,i𝒩.\displaystyle\begin{split}\mathop{\min}_{\nu_{j}}\,&\quad g(\nu_{j})=\underbrace{\eta_{1}\bigg{(}\sum_{i\in\mathcal{N}/j}\frac{\iota_{i}(c_{i}D_{i})^{3}}{2(c_{j}D_{j})^{2}}+\frac{\iota_{j}c_{j}D_{j}}{2}\bigg{)}}_{a_{1}}\nu^{2}_{j}+\underbrace{\eta_{2}c_{j}D_{j}}_{a_{2}}\frac{1}{\nu_{j}}\\ \mathrm{s.t.}&\quad 0\leq\nu_{j}\leq\frac{c_{j}D_{j}\nu^{\mathit{max}}_{i}}{c_{i}D_{i}},~{}\forall i\in\mathcal{N}.\end{split} (76)

We simplify the expression of g(νj)g(\nu_{j}) as g(νj)=a1νj2+a2/νjg(\nu_{j})=a_{1}\nu^{2}_{j}+a_{2}/\nu_{j} where positive constants a1a_{1} and a2a_{2} are defined in (76). Then, the derivative of g(νj)g(\nu_{j}) can be written as

g(νj)=2a1νja2νj2.\displaystyle g^{\prime}(\nu_{j})=2a_{1}\nu_{j}-\frac{a_{2}}{\nu^{2}_{j}}. (77)

Based on (77), the minimum value of g(νj)g(\nu_{j}) is obtained at its stationary point ν¯jg1(0)=a22a13\bar{\nu}_{j}\coloneqq{g^{\prime}}^{-1}(0)=\sqrt[3]{\frac{a_{2}}{2a_{1}}}. Thus, the optimal solution of problem (76) is

νj=min{a22a13,mini𝒩cjDjνi𝑚𝑎𝑥ciDi}.\displaystyle\nu^{*}_{j}=\min\left\{\sqrt[3]{\frac{a_{2}}{2a_{1}}},\min_{i\in\mathcal{N}}\frac{c_{j}D_{j}\nu^{\mathit{max}}_{i}}{c_{i}D_{i}}\right\}. (78)

Combining (75) and (78), we complete the proof.
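A minimal NumPy sketch of this closed form (ours, for illustration; device j is assumed to be the computation straggler, and iota, c, D, nu_max are per-device arrays):

```python
import numpy as np

def optimal_frequencies(eta1, eta2, iota, c, D, nu_max, j):
    """Illustrative sketch of Lemma 4's closed form (75)/(78)."""
    w = c * D                                        # per-device workload c_i * D_i
    a1 = eta1 * (np.sum(np.delete(iota * w**3, j)) / (2.0 * w[j]**2)
                 + iota[j] * w[j] / 2.0)             # coefficient a_1 in (76)
    a2 = eta2 * w[j]                                 # coefficient a_2 in (76)
    nu_j = min((a2 / (2.0 * a1)) ** (1.0 / 3.0),     # stationary point of (76)
               np.min(w[j] * nu_max / w))            # feasibility cap from (76)
    return w * nu_j / w[j]                           # (75): scale the non-stragglers (entry j equals nu_j)
```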

Appendix E Proof of Lemma 5

Given 𝒛\bm{z}^{*} and δ\delta^{*}, pip^{*}_{i} can be obtained via solving the following problem

minpimzi,mpilog2(1+hipiIm+BN0)s.t.0pipi𝑚𝑎𝑥mzi,mSBlog2(1+hipiIm+BN0)δ\displaystyle\begin{split}\mathop{\min}_{p_{i}}&\quad\sum_{m\in\mathcal{M}}\frac{z^{*}_{i,m}p_{i}}{\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m}+BN_{0}}\right)}\\ \mathrm{s.t.}&~{}\,\quad 0\leq p_{i}\leq p^{\mathit{max}}_{i}\\ &\quad\sum_{m\in\mathcal{M}}\frac{z^{*}_{i,m}S}{B\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m}+BN_{0}}\right)}\leq\delta^{*}\end{split} (79)

where we eliminate uiu_{i}, η1\eta_{1}, SS, and BB in the objective. Clearly, if mzi,m=0\sum_{m\in\mathcal{M}}z^{*}_{i,m}=0, then pi=0p^{*}_{i}=0. If the constraints in (79) are mutually contradictory, i.e.

pi𝑚𝑎𝑥<(Imi+BN0)(2SBδ1)hi\displaystyle p^{\mathit{max}}_{i}<\frac{(I_{m^{*}_{i}}+BN_{0})(2^{\frac{S}{B\delta^{*}}}-1)}{h_{i}} (80)

then zi,mz^{*}_{i,m} must be 0, which gives (41).

When there exists mim^{*}_{i} such that zi,mi=1z^{*}_{i,m^{*}_{i}}=1, by denoting p~ihipiImi+BN0\tilde{p}_{i}\coloneqq\frac{h_{i}p_{i}}{I_{m^{*}_{i}}+BN_{0}} and rearranging the terms, we can transform the problem (79) to

minp~ig3(p~i)=p~ilog2(1+p~i)s.t.0p~ihipi𝑚𝑎𝑥Imi+BN02SBδ1p~i.\displaystyle\begin{split}\mathop{\min}_{\tilde{p}_{i}}&\quad g_{3}(\tilde{p}_{i})=\frac{\tilde{p}_{i}}{\log_{2}\left(1+\tilde{p}_{i}\right)}\\ \mathrm{s.t.}&\quad 0\leq\tilde{p}_{i}\leq\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}}\\ &\quad 2^{\frac{S}{B\delta^{*}}}-1\leq\tilde{p}_{i}.\end{split} (81)

We have

g3(p~i)=log2(1+p~i)p~i/((1+p~i)ln2)(log2(1+p~i))2.\displaystyle g^{\prime}_{3}(\tilde{p}_{i})=\frac{\log_{2}(1+\tilde{p}_{i})-\tilde{p}_{i}/\left((1+\tilde{p}_{i})\ln 2\right)}{\left(\log_{2}(1+\tilde{p}_{i})\right)^{2}}. (82)

Denoting the numerator of (82) as f3(p~i)f_{3}(\tilde{p}_{i}), we have f3(0)=0f_{3}(0)=0 and

f3(p~i)=1ln2(1(1+p~i)1(1+p~i)2)>0,whenp~i>0\displaystyle f^{\prime}_{3}(\tilde{p}_{i})=\frac{1}{\ln 2}\left(\frac{1}{(1+\tilde{p}_{i})}-\frac{1}{(1+\tilde{p}_{i})^{2}}\right)>0,\quad\text{when}~{}\tilde{p}_{i}>0 (83)

which implies that g3(p~i)g_{3}(\tilde{p}_{i}) is monotonically increasing with p~i>0\tilde{p}_{i}>0. Thus, recalling the definition of p~i\tilde{p}_{i}, due to (41), we obtain

pi=(Imi+BN0)(2SBδ1)hi\displaystyle p^{*}_{i}=\frac{(I_{m^{*}_{i}}+BN_{0})(2^{\frac{S}{B\delta^{*}}}-1)}{h_{i}} (84)

which completes the proof of (42).
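A small numeric sketch of (84) (ours, for illustration): it returns the smallest transmission power with which device i meets the delay target δ* on its assigned RB.

```python
def min_power_for_delay(h_i, I_m, B, N0, S, delta):
    """Sketch of (84): solve S / (B*log2(1 + h_i*p/(I_m + B*N0))) = delta for p."""
    return (I_m + B * N0) * (2.0 ** (S / (B * delta)) - 1.0) / h_i
```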

Appendix F Proof of Lemma 6

If jj denotes the straggler among all devices and 𝒩\mathcal{N}^{*}\neq\emptyset, we can rewrite (6) for each i𝒩i\in\mathcal{N}^{*} as

mzi,mSBlog2(1+hipiIm+BN0)mzj,mSBlog2(1+hjpjIm+BN0)\displaystyle\sum_{m\in\mathcal{M}}z^{*}_{i,m}\frac{S}{B\log_{2}\left(1+\frac{h_{i}p^{*}_{i}}{I_{m}+BN_{0}}\right)}\leq\sum_{m\in\mathcal{M}}z^{*}_{j,m}\frac{S}{B\log_{2}\left(1+\frac{h_{j}p^{*}_{j}}{I_{m}+BN_{0}}\right)}
\displaystyle\Leftrightarrow\quad log2(1+hjpjImj+BN0)log2(1+hipiImi+BN0)\displaystyle\log_{2}\left(1+\frac{h_{j}p^{*}_{j}}{I_{m^{*}_{j}}+BN_{0}}\right)\leq\log_{2}\left(1+\frac{h_{i}p^{*}_{i}}{I_{m^{*}_{i}}+BN_{0}}\right)
\displaystyle\Leftrightarrow\quad (Imi+BN0)hj(Imj+BN0)hipjpi.\displaystyle\frac{(I_{m^{*}_{i}}+BN_{0})h_{j}}{(I_{m^{*}_{j}}+BN_{0})h_{i}}p^{*}_{j}\leq p^{*}_{i}. (85)

Based on (85), by eliminating the constants BB and SS, we transform (SP2) under fixed 𝒛\bm{z}^{*} into

min{pi}i𝒩i𝒩η1pilog2(1+hipiImi+BN0)+η21log2(1+hjpjImj+BN0)s.t.0pipi𝑚𝑎𝑥,i𝒩(Imi+BN0)hj(Imj+BN0)hipjpi,i𝒩/j.\displaystyle\begin{split}\mathop{\min}_{\{p_{i}\}_{i\in\mathcal{N}^{*}}}&\quad\sum_{i\in\mathcal{N}^{*}}\eta_{1}\frac{p_{i}}{\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m^{*}_{i}}+BN_{0}}\right)}+\eta_{2}\frac{1}{\log_{2}\left(1+\frac{h_{j}p_{j}}{I_{m^{*}_{j}}+BN_{0}}\right)}\\ \mathrm{s.t.}~{}\;&~{}\,\quad 0\leq p_{i}\leq p^{\mathit{max}}_{i},~{}\forall i\in\mathcal{N}^{*}\\ &~{}\quad\frac{(I_{m^{*}_{i}}+BN_{0})h_{j}}{(I_{m^{*}_{j}}+BN_{0})h_{i}}p_{j}\leq p_{i},~{}\forall i\in\mathcal{N}^{*}/j.\end{split} (86)

Due to (85), for each i𝒩/ji\in\mathcal{N}^{*}/j, we can obtain the optimal solution pip^{*}_{i} of (86) (abusing notation) via solving the decomposed problem as follows

minp~ig3(p~i)=p~ilog2(1+p~i)s.t.0p~ihipi𝑚𝑎𝑥Imi+BN0hjImj+BN0pjp~i\displaystyle\begin{split}\mathop{\min}_{\tilde{p}_{i}}\,&\quad g_{3}(\tilde{p}_{i})=\frac{\tilde{p}_{i}}{\log_{2}\left(1+\tilde{p}_{i}\right)}\\ \mathrm{s.t.}&\quad 0\leq\tilde{p}_{i}\leq\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}}\\ &\quad\frac{h_{j}}{I_{m^{*}_{j}}+BN_{0}}p_{j}\leq\tilde{p}_{i}\end{split} (87)

where p~ihipiImi+BN0\tilde{p}_{i}\coloneqq\frac{h_{i}p_{i}}{I_{m^{*}_{i}}+BN_{0}}. Note that in (87) we consider the nontrivial case where 𝒩/j\mathcal{N}/j is not empty. It is easy to see that

g3(p~i)=log2(1+p~i)p~i/((1+p~i)ln2)(log2(1+p~i))2.\displaystyle g^{\prime}_{3}(\tilde{p}_{i})=\frac{\log_{2}(1+\tilde{p}_{i})-\tilde{p}_{i}/\left((1+\tilde{p}_{i})\ln 2\right)}{\left(\log_{2}(1+\tilde{p}_{i})\right)^{2}}. (88)

Denoting the numerator of (88) as f3(p~i)f_{3}(\tilde{p}_{i}), we have f3(0)=0f_{3}(0)=0 and

f3(p~i)=1ln2(1(1+p~i)1(1+p~i)2)>0,whenp~i>0\displaystyle f^{\prime}_{3}(\tilde{p}_{i})=\frac{1}{\ln 2}\left(\frac{1}{(1+\tilde{p}_{i})}-\frac{1}{(1+\tilde{p}_{i})^{2}}\right)>0,\quad\text{when}~{}\tilde{p}_{i}>0 (89)

which implies that g3(p~i)g_{3}(\tilde{p}_{i}) is monotonically increasing with p~i>0\tilde{p}_{i}>0. Thus, if pjhipi𝑚𝑎𝑥(Imj+BN0)hj(Imi+BN0)p_{j}\leq\frac{h_{i}p^{\mathit{max}}_{i}(I_{m^{*}_{j}}+BN_{0})}{h_{j}(I_{m^{*}_{i}}+BN_{0})}, recalling the definition of p~i\tilde{p}_{i}, we have

pi=hj(Imi+BN0)hi(Imj+BN0)pj,i𝒩/j.\displaystyle p^{*}_{i}=\frac{h_{j}(I_{m^{*}_{i}}+BN_{0})}{h_{i}(I_{m^{*}_{j}}+BN_{0})}p_{j},~{}\forall i\in\mathcal{N}^{*}/j. (90)

Otherwise, problem (87) is infeasible.

Similar to (87), denote p~jhjpjImj+BN0\tilde{p}_{j}\coloneqq\frac{h_{j}p_{j}}{I_{m^{*}_{j}}+BN_{0}} and substitute pi=pip_{i}=p^{*}_{i} in (86) (noting that pi=(Imi+BN0)p~jhip_{i}=\frac{(I_{m^{*}_{i}}+BN_{0})\tilde{p}_{j}}{h_{i}}), whereby we have the following problem regarding p~j\tilde{p}_{j}

minp~jg4(p~j)=η1i𝒩Imi+BN0hib1p~jlog2(1+p~j)+η2log2(1+p~j)s.t.0p~jhipi𝑚𝑎𝑥Imi+BN0,i𝒩.\displaystyle\begin{split}\mathop{\min}_{\tilde{p}_{j}}&\quad g_{4}(\tilde{p}_{j})=\underbrace{\eta_{1}\sum_{i\in\mathcal{N}^{*}}\frac{I_{m^{*}_{i}}+BN_{0}}{h_{i}}}_{b_{1}}\cdot\frac{\tilde{p}_{j}}{\log_{2}\left(1+\tilde{p}_{j}\right)}+\frac{\eta_{2}}{\log_{2}\left(1+\tilde{p}_{j}\right)}\\ \mathrm{s.t.}&\quad 0\leq\tilde{p}_{j}\leq\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}},~{}\forall i\in\mathcal{N}^{*}.\end{split} (91)

Denoting positive constant b1b_{1} as in (91), we can write g4(p~j)=b1p~jlog2(1+p~j)+η2log2(1+p~j)g_{4}(\tilde{p}_{j})=\frac{b_{1}\tilde{p}_{j}}{\log_{2}(1+\tilde{p}_{j})}+\frac{\eta_{2}}{\log_{2}(1+\tilde{p}_{j})} and obtain

g4(p~j)\displaystyle g^{\prime}_{4}(\tilde{p}_{j}) =b1g3(p~j)η2(1+p~j)(log2(1+p~j))2ln2\displaystyle=b_{1}g^{\prime}_{3}(\tilde{p}_{j})-\frac{\eta_{2}}{(1+\tilde{p}_{j})\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}\ln 2}
=b1log2(1+p~j)p~j/((1+p~j)ln2)(log2(1+p~j))2η2(1+p~j)(log2(1+p~j))2ln2\displaystyle=b_{1}\frac{\log_{2}(1+\tilde{p}_{j})-\tilde{p}_{j}/\left((1+\tilde{p}_{j})\ln 2\right)}{\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}}-\frac{\eta_{2}}{(1+\tilde{p}_{j})\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}\ln 2}
=b1((1+p~j)log2(1+p~j)ln2p~j)η2(1+p~j)(log2(1+p~j))2ln2.\displaystyle=\frac{b_{1}\big{(}(1+\tilde{p}_{j})\log_{2}(1+\tilde{p}_{j})\ln 2-\tilde{p}_{j}\big{)}-\eta_{2}}{(1+\tilde{p}_{j})\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}\ln 2}. (92)

Next, we show that g4(p~j)g_{4}(\tilde{p}_{j}) has a unique minimum point. Denoting the numerator of (92) as f4(p~j)f_{4}(\tilde{p}_{j}), we have f4(0)=η20f_{4}(0)=-\eta_{2}\leq 0 and

f4(p~j)=b1log2(1+p~j)ln20,whenp~j0\displaystyle f^{\prime}_{4}(\tilde{p}_{j})=b_{1}\log_{2}(1+\tilde{p}_{j})\ln 2\geq 0,\quad\text{when}~{}\tilde{p}_{j}\geq 0 (93)

which implies f4(p~j)f_{4}(\tilde{p}_{j}) is monotonically increasing. Besides, it can be shown that f4(b2)0f_{4}(b_{2})\geq 0, where b2=2(1+max{η2b1,1}1)/ln2b_{2}=2^{(1+\sqrt{\max\{\frac{\eta_{2}}{b_{1}},1\}-1})/\ln 2}. Thus, there must exist a unique point p~j0(0,b2]\tilde{p}^{0}_{j}\in(0,b_{2}] such that g4(p~j0)=f4(p~j0)=0g^{\prime}_{4}(\tilde{p}^{0}_{j})=f_{4}(\tilde{p}^{0}_{j})=0 holds, and p~j0\tilde{p}^{0}_{j} is also the unique minimum point of g4(p~j)g_{4}(\tilde{p}_{j}).

Accordingly, the optimal solution of the problem (91) can be expressed as

p~j=min{p~j0,mini𝒩hipi𝑚𝑎𝑥Imi+BN0}.\displaystyle\tilde{p}^{*}_{j}=\min\left\{\tilde{p}^{0}_{j},\min_{i\in\mathcal{N^{*}}}\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}}\right\}. (94)

Therefore, we complete the proof.

Appendix G Proof of Theorem 1

From Lemma 1 (smoothness of FiF_{i}), for any θ1,θ2d{\theta}_{1},{\theta}_{2}\in\mathbb{R}^{d}, we have

F(θ2)F(θ1)+F(θ1)(θ2θ1)+LF2θ2θ12.\displaystyle F({\theta}_{2})\leq F({\theta}_{1})+\nabla F({\theta}_{1})^{\top}({\theta}_{2}-{\theta}_{1})+\frac{L_{F}}{2}\|{\theta}_{2}-{\theta}_{1}\|^{2}. (95)

Combining (95) with (2), we have

F(θk+1)\displaystyle F({\theta}^{k+1}) F(θk)+F(θk)(θk+1θk)+LF2θk+1θk2\displaystyle\leq F({\theta}^{k})+\nabla F({\theta}^{k})^{\top}({\theta}^{k+1}-{\theta}^{k})+\frac{L_{F}}{2}\|{\theta}^{k+1}-{\theta}^{k}\|^{2}
=F(θk)βF(θk)(1nki𝒩k~Fi(θk))+LFβ221nki𝒩k~Fi(θk)2.\displaystyle=F({\theta}^{k})-\beta\nabla F({\theta}^{k})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right)+\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}. (96)

Denoting

GkβF(θk)(1nki𝒩k~Fi(θk))LFβ221nki𝒩k~Fi(θk)2\displaystyle G^{k}\coloneqq\beta\nabla F({\theta}^{k})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right)-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2} (97)

we have 𝔼[F(θk)F(θk+1)]𝔼[Gk]\mathbb{E}[F({\theta}^{k})-F({\theta}^{k+1})]\geq\mathbb{E}[G^{k}]. Next, we bound 𝔼[Gk]\mathbb{E}[G^{k}] from below.

First, for any i𝒩i\in\mathcal{N}, we rewrite F(θk)\nabla F({\theta}^{k}) as

F(θk)=\displaystyle\nabla F({\theta}^{k})= 1nj𝒩Fj(θk)Fi(θk)Ai+Fi(θk)~Fi(θk)Bi+~Fi(θk)\displaystyle\underbrace{\frac{1}{n}\sum_{j\in\mathcal{N}}\nabla F_{j}({\theta}^{k})-\nabla F_{i}({\theta}^{k})}_{A_{i}}+\underbrace{\nabla F_{i}({\theta}^{k})-\tilde{\nabla}F_{i}({\theta}^{k})}_{B_{i}}+\tilde{\nabla}F_{i}({\theta}^{k}) (98)

and bound 𝔼[Ai2]\mathbb{E}[\|A_{i}\|^{2}] by

𝔼[Ai2]\displaystyle\mathbb{E}[\|A_{i}\|^{2}] =𝔼[Fi(θk)1nj𝒩Fj(θk)2]\displaystyle=\mathbb{E}\bigg{[}\Big{\|}\nabla F_{i}({\theta}^{k})-\frac{1}{n}\sum_{j\in\mathcal{N}}\nabla F_{j}({\theta}^{k})\Big{\|}^{2}\bigg{]}
=𝔼[1nj𝒩(Fj(θk)Fi(θk))2]\displaystyle=\mathbb{E}\bigg{[}\Big{\|}\frac{1}{n}\sum_{j\in\mathcal{N}}\left(\nabla F_{j}({\theta}^{k})-\nabla F_{i}({\theta}^{k})\right)\Big{\|}^{2}\bigg{]}
(a)1nj𝒩𝔼[Fj(θk)Fi(θk)2]\displaystyle\mathop{\leq}^{\text{(a)}}\frac{1}{n}\sum_{j\in\mathcal{N}}\mathbb{E}\left[\left\|\nabla F_{j}({\theta}^{k})-\nabla F_{i}({\theta}^{k})\right\|^{2}\right]
(b)(1+αL)2γG+αζγH\displaystyle\mathop{\leq}^{\text{(b)}}\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H} (99)

where (b) is derived from Lemma 3. Inequality (a) follows from the fact that, for any aida_{i}\in\mathbb{R}^{d} and bib_{i}\in\mathbb{R},

i𝒩kbiai2\displaystyle\left\|\sum_{i\in\mathcal{N}_{k}}b_{i}a_{i}\right\|^{2} =j=1d(i𝒩kbiai(j))2\displaystyle=\sum^{d}_{j=1}\left(\sum_{i\in\mathcal{N}_{k}}b_{i}a^{(j)}_{i}\right)^{2}
j=1d(i𝒩k(ai(j))2)(i𝒩kbi2)\displaystyle\leq\sum^{d}_{j=1}\left(\sum_{i\in\mathcal{N}_{k}}\left(a^{(j)}_{i}\right)^{2}\right)\left(\sum_{i\in\mathcal{N}_{k}}b^{2}_{i}\right)
=(j=1di𝒩k(ai(j))2)(i𝒩kbi2)\displaystyle=\left(\sum^{d}_{j=1}\sum_{i\in\mathcal{N}_{k}}\left(a^{(j)}_{i}\right)^{2}\right)\left(\sum_{i\in\mathcal{N}_{k}}b^{2}_{i}\right)
=(i𝒩kai2)(i𝒩kbi2)\displaystyle=\left(\sum_{i\in\mathcal{N}_{k}}\left\|a_{i}\right\|^{2}\right)\left(\sum_{i\in\mathcal{N}_{k}}b^{2}_{i}\right) (100)

where ai(j)a^{(j)}_{i} denotes the jj-th coordinate of aia_{i}, and the first inequality is derived by using Cauchy-Schwarz inequality.

Regarding 𝔼[Bi2]\mathbb{E}[\|B_{i}\|^{2}], from Lemma 2,

𝔼[Bi2]σFi2.\displaystyle\mathbb{E}[\|B_{i}\|^{2}]\leq\sigma^{2}_{F_{i}}. (101)

Substituting (98) in 𝔼[Gk]\mathbb{E}[G^{k}], we obtain

𝔼[Gk]\displaystyle\mathbb{E}[G^{k}] =𝔼[βF(θk)(1nki𝒩k~Fi(θk))LFβ221nki𝒩k~Fi(θk)2]\displaystyle=\mathbb{E}\left[\beta\nabla F({\theta}^{k})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right)-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right]
=𝔼[βnki𝒩k(Ai+Bi+~Fi(θk))~Fi(θk)LFβ221nki𝒩k~Fi(θk)2]\displaystyle=\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A_{i}+B_{i}+\tilde{\nabla}F_{i}({\theta}^{k})\right)^{\top}\tilde{\nabla}F_{i}({\theta}^{k})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right]
=𝔼[βnki𝒩k(Ai~Fi(θk)+Bi~Fi(θk)+~Fi(θk)2)]LFβ22𝔼[1nki𝒩k~Fi(θk)2].\displaystyle=\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+\left\|\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right)\right]-\frac{L_{F}\beta^{2}}{2}\mathbb{E}\left[\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right]. (102)

Note that, for any random variables a,bda,b\in\mathbb{R}^{d} and for c0c\neq 0,

𝔼[ab]𝔼[2(ca)b2c](c2𝔼[a2]+𝔼[b2]4c2).\displaystyle\mathbb{E}\left[a^{\top}b\right]\geq-\mathbb{E}\left[2\left\|\left(ca\right)^{\top}\frac{b}{2c}\right\|\right]\geq-\left(c^{2}\mathbb{E}\left[\|a\|^{2}\right]+\frac{\mathbb{E}\left[\|b\|^{2}\right]}{4c^{2}}\right). (103)

With g(x)x𝔼[a2]+𝔼[b2]/(4x)g(x)\coloneqq x\mathbb{E}[\|a\|^{2}]+\mathbb{E}[\|b\|^{2}]/(4x), we have

x=argminxg(x)=𝔼[b2]4𝔼[a2]\displaystyle x^{*}=\mathop{\arg\min}_{x}g(x)=\sqrt{\frac{\mathbb{E}\left[\left\|b\right\|^{2}\right]}{4\mathbb{E}\left[\left\|a\right\|^{2}\right]}} (104)

which implies that setting c2=xc^{2}=x^{*} minimizes the right-hand side of (103), i.e., gives the tightest lower bound of this form. Substituting this choice into (103), we have

𝔼[ab]𝔼[a2]𝔼[b2].\displaystyle\mathbb{E}\left[a^{\top}b\right]\geq-\sqrt{\mathbb{E}\left[\left\|a\right\|^{2}\right]\mathbb{E}\left[\left\|b\right\|^{2}\right]}. (105)
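For completeness, the substitution c^{2}=x^{*} evaluates the right-hand side of (103) as

c^{2}\mathbb{E}\left[\|a\|^{2}\right]+\frac{\mathbb{E}\left[\|b\|^{2}\right]}{4c^{2}}=\frac{1}{2}\sqrt{\mathbb{E}\left[\|a\|^{2}\right]\mathbb{E}\left[\|b\|^{2}\right]}+\frac{1}{2}\sqrt{\mathbb{E}\left[\|a\|^{2}\right]\mathbb{E}\left[\|b\|^{2}\right]}=\sqrt{\mathbb{E}\left[\|a\|^{2}\right]\mathbb{E}\left[\|b\|^{2}\right]}

which is exactly the bound in (105).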

Based on (105) and (102), together with the tower rule, we can bound 𝔼[Gk]\mathbb{E}[G^{k}] as follows

𝔼[Gk]=\displaystyle\mathbb{E}[G^{k}]=\; 𝔼[βnki𝒩k(Ai~Fi(θk)+Bi~Fi(θk)+~Fi(θk)2)]LFβ22𝔼[1nki𝒩k~Fi(θk)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+\left\|\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right)\right]-\frac{L_{F}\beta^{2}}{2}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
=\displaystyle=\; 𝔼[βnki𝒩k(𝔼[Ai~Fi(θk)|𝒩k]+𝔼[Bi~Fi(θk)|𝒩k]+~Fi(θk)2)]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(\mathbb{E}\left[A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})\Big{|}\mathcal{N}_{k}\right]+\mathbb{E}\left[B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})\Big{|}\mathcal{N}_{k}\right]+\left\|\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right)\right]
LFβ22𝔼[1nki𝒩k~Fi(θk)2]\displaystyle-\frac{L_{F}\beta^{2}}{2}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
\displaystyle\geq\; 𝔼[βnki𝒩k(𝔼[Ai2|𝒩k]𝔼[~Fi(θk)2|𝒩k]𝔼[Bi2|𝒩k]𝔼[~Fi(θk)2|𝒩k])]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(-\sqrt{\mathbb{E}\left[\left\|A_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}-\sqrt{\mathbb{E}\left[\left\|B_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right]
+β(1LFβ2)𝔼[1nki𝒩k~Fi(θk)2]\displaystyle+\beta\left(1-\frac{L_{F}\beta}{2}\right)\mathbb{E}\bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
\displaystyle\geq\; 𝔼[βnki𝒩k((1+αL)2γG+αζγH+σFi)𝔼[~Fi(θk)2|𝒩k]]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}-\left(\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\sigma_{F_{i}}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right] (from (G) and (101))
+β(1LFβ2)𝔼[1nki𝒩k~Fi(θk)2]\displaystyle+\beta\left(1-\frac{L_{F}\beta}{2}\right)\mathbb{E}\bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
=\displaystyle=\; β𝔼[1nki𝒩k(1LFβ2)~Fi(θk)2((1+αL)2γG+αζγH+σFi)𝔼[~Fi(θk)2|𝒩k]]\displaystyle\beta\mathbb{E}\left[\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(1-\frac{L_{F}\beta}{2}\right)\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}-\left(\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\sigma_{F_{i}}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right] (106)

thereby completing the proof.

Appendix H Proof of Theorem 5

First, it is easy to see that g2(𝒛,𝒑)g_{2}(\bm{z},\bm{p}) is upper bounded by i𝒩ui\sum_{i\in\mathcal{N}}u_{i}. Moreover, due to (43), we have

𝒛t+1=argmax𝒛{i𝒩mzi,m(uiη1δt(Im+BN0)(2SBδt1)hi),s.t.(26)(35)}.\displaystyle\bm{z}^{t+1}=\mathop{\arg\max}_{\bm{z}}\left\{\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}\left(u_{i}-\frac{\eta_{1}\delta^{t}(I_{m}+BN_{0})(2^{\frac{S}{B\delta^{t}}}-1)}{h_{i}}\right),~{}\mathrm{s.t.}~{}\eqref{eq:constraint_zm}-\eqref{eq:constraint_z01}\right\}. (107)

Based on (48), the optimal transmission power 𝒑^t+1\hat{\bm{p}}^{t+1} corresponding to 𝒛t+1\bm{z}^{t+1} is given by

p^it+1={(Imi+BN0)(2SBδt1)hi,ifmzi,mt+1=10,otherwise\displaystyle\hat{p}^{t+1}_{i}=\begin{cases}\frac{(I_{m^{*}_{i}}+BN_{0})(2^{\frac{S}{B\delta^{t}}}-1)}{h_{i}}&,~{}\text{if}~{}\sum_{m\in\mathcal{M}}z^{t+1}_{i,m}=1\\ 0&,~{}\text{otherwise}\end{cases} (108)

where we slightly abuse notation and denote by mim^{*}_{i} the RB allocated to device ii, i.e., zi,mit+1=1z^{t+1}_{i,m^{*}_{i}}=1. Thus, by the optimality of 𝒛t+1\bm{z}^{t+1} in (107), we have g(𝒛t,𝒑t)g(𝒛t+1,𝒑^t+1)g(\bm{z}^{t},\bm{p}^{t})\leq g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1}). From (49), g(𝒛t+1,𝒑^t+1)g(𝒛t+1,𝒑t+1)g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1})\leq g(\bm{z}^{t+1},\bm{p}^{t+1}). Combining the two inequalities, we obtain

g(𝒛t,𝒑t)g(𝒛t+1,𝒑t+1)\displaystyle g(\bm{z}^{t},\bm{p}^{t})\leq g(\bm{z}^{t+1},\bm{p}^{t+1}) (109)

i.e., the sequence {g(𝒛t,𝒑t)}\{g(\bm{z}^{t},\bm{p}^{t})\} is non-decreasing. Since it is also upper bounded, as noted at the beginning of the proof, it converges, thereby completing the proof.
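The argument above is an instance of block-coordinate (alternating) maximization: each block update is performed while the other block remains feasible, so the objective value cannot decrease. The following toy Python sketch illustrates only this monotonicity pattern; the surrogate objective g, the utilities u, the quadratic power penalty, and the closed-form power step are hypothetical stand-ins and do not correspond to the actual objective or the constraints (26)-(35) of the paper.

import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
u = rng.uniform(1.0, 2.0, size=n)   # toy per-device utilities (hypothetical)
A = rng.uniform(0.5, 1.5, size=n)   # toy per-device power targets (hypothetical)

def g(z, p):
    # toy surrogate: reward for selected devices minus a quadratic power penalty
    z = np.asarray(z, dtype=float)
    return float(np.dot(z, u) - np.sum((p - A * z) ** 2))

z = np.zeros(n)
p = np.zeros(n)
values = [g(z, p)]
for _ in range(5):
    # block 1: pick z with p set to its per-candidate optimum (here p = A*z),
    # mirroring the structure of (107)-(108)
    candidates = [np.array(c, dtype=float) for c in itertools.product([0, 1], repeat=n)]
    z = max(candidates, key=lambda c: g(c, A * c))
    # block 2: re-optimize p with z fixed (closed form for this toy objective)
    p = A * z
    values.append(g(z, p))

# the sequence of objective values is non-decreasing, as in (109)
assert all(b >= a - 1e-9 for a, b in zip(values, values[1:]))
print(values)

In this toy, as in (107)-(108), the discrete block is scored with the continuous block already set to its per-candidate optimum, which is what yields the first of the two inequalities used above.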

Appendix I Proof of Corollaries 1 and 2

I-A Proof of Corollary 1

Define the auxiliary variable θk,t1nki𝒩kθik,t{\theta}^{k,t}\coloneqq\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}{\theta}^{k,t}_{i} as a “virtual” global model at local step tt, where θik,t{\theta}^{k,t}_{i} is defined in (2). Similar to the derivation in Appendix G, the following holds

F(θk,t+1)\displaystyle F({\theta}^{k,t+1}) F(θk,t)+F(θk,t)(θk,t+1θk,t)+LF2θk,t+1θk,t2\displaystyle\leq F({\theta}^{k,t})+\nabla F({\theta}^{k,t})^{\top}({\theta}^{k,t+1}-{\theta}^{k,t})+\frac{L_{F}}{2}\|{\theta}^{k,t+1}-{\theta}^{k,t}\|^{2} (from Lemma 1)
=F(θk,t)+F(θk,t)(1nki𝒩kθik,t+1θik,t)+LF2θk,t+1θk,t2\displaystyle=F({\theta}^{k,t})+\nabla F({\theta}^{k,t})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}{\theta}^{k,t+1}_{i}-{\theta}^{k,t}_{i}\right)+\frac{L_{F}}{2}\|{\theta}^{k,t+1}-{\theta}^{k,t}\|^{2} (from the definition of θk,t\theta^{k,t})
=F(θk,t)βF(θk,t)(1nki𝒩k~Fi(θk,t))+LFβ221nki𝒩k~Fi(θk,t)2.\displaystyle=F({\theta}^{k,t})-\beta\nabla F({\theta}^{k,t})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right)+\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}. (from (2))

Denote

Gk,tβnki𝒩kF(θk,t)~Fi(θk,t)LFβ221nki𝒩k~Fi(θk,t)2.\displaystyle G^{k,t}\coloneqq\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\nabla F({\theta}^{k,t})^{\top}\tilde{\nabla}F_{i}({\theta}^{k,t})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}. (110)

We have 𝔼[F(θk,t)F(θk,t+1)]𝔼[Gk,t]\mathbb{E}[F({\theta}^{k,t})-F({\theta}^{k,t+1})]\geq\mathbb{E}[G^{k,t}]. Rewrite F(θk,t)\nabla F({\theta}^{k,t}) for each i𝒩ki\in\mathcal{N}_{k} as follows

F(θk,t)=F(θk,t)F(θik,t)Ci+F(θik,t)Fi(θik,t)Ai+Fi(θik,t)~Fi(θik,t)Bi+~Fi(θik,t).\displaystyle\nabla F({\theta}^{k,t})=\underbrace{\nabla F({\theta}^{k,t})-\nabla F({\theta}^{k,t}_{i})}_{C_{i}}+\underbrace{\nabla F({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i})}_{A_{i}}+\underbrace{\nabla F_{i}({\theta}^{k,t}_{i})-\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})}_{B_{i}}+\tilde{\nabla}F_{i}({\theta}^{k,t}_{i}). (111)

Note that, in (111), AiA_{i} and BiB_{i} are analogous to those in (98) with θk\theta^{k} replaced by θik,t\theta^{k,t}_{i}. Thus, from (99) and (101), 𝔼[Ai2]\mathbb{E}[\|A_{i}\|^{2}] and 𝔼[Bi2]\mathbb{E}[\|B_{i}\|^{2}] are bounded by

𝔼[Ai2]\displaystyle\mathbb{E}[\|A_{i}\|^{2}] (1+αL)2γG+αζγH\displaystyle\leq\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H} (112)
𝔼[Bi2]\displaystyle\mathbb{E}[\|B_{i}\|^{2}] σFi2.\displaystyle\leq\sigma^{2}_{F_{i}}. (113)

Regarding 𝔼[Ci2]\mathbb{E}[\|C_{i}\|^{2}], we can write

𝔼[Ci2]\displaystyle\mathbb{E}[\|C_{i}\|^{2}] =𝔼[F(θk,t)F(θik,t)2]\displaystyle=\mathbb{E}\left[\left\|\nabla F({\theta}^{k,t})-\nabla F({\theta}^{k,t}_{i})\right\|^{2}\right]
LF2𝔼[θk,tθik,t2].\displaystyle\leq L^{2}_{F}\mathbb{E}\left[\left\|{\theta}^{k,t}-{\theta}^{k,t}_{i}\right\|^{2}\right]. (From Lemma 1)

To bound the right-hand side of the above inequality, denote

atmaxi𝒩k{𝔼[θk,tθik,t2]}\displaystyle a_{t}\coloneqq\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t}-{\theta}^{k,t}_{i}\right\|^{2}\right]\right\} (114)

with a0=0a_{0}=0. Then we have

at+1\displaystyle a_{t+1} =maxi𝒩k{𝔼[θk,t+1θik,t+12]}\displaystyle=\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t+1}-{\theta}^{k,t+1}_{i}\right\|^{2}\right]\right\}
=maxi𝒩k{𝔼[θik,tβ~Fi(θik,t)1nkj𝒩k(θjk,tβ~Fj(θjk,t))2]}\displaystyle=\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-\beta\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left({\theta}^{k,t}_{j}-\beta\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right]\right\}
maxi𝒩k{(1+ϵ)𝔼[θik,t1nkj𝒩kθjk,t2]+β2(1+1ϵ)𝔼[~Fi(θik,t)1nkj𝒩k~Fj(θjk,t)2]}\displaystyle\leq\max_{i\in\mathcal{N}_{k}}\left\{(1+\epsilon)\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}{\theta}^{k,t}_{j}\right\|^{2}\right]+\beta^{2}\left(1+\frac{1}{\epsilon}\right)\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]\right\} (from [9, Equation (68)])
(1+ϵ)maxi𝒩k{𝔼[θik,t1nkj𝒩kθjk,t2]}+β2(1+1ϵ)maxi𝒩k{𝔼[~Fi(θik,t)1nkj𝒩k~Fj(θjk,t)2]}\displaystyle\leq(1+\epsilon)\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}{\theta}^{k,t}_{j}\right\|^{2}\right]\right\}+\beta^{2}\left(1+\frac{1}{\epsilon}\right)\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]\right\}
=(1+ϵ)at+β2(1+1ϵ)maxi𝒩k{𝔼[~Fi(θik,t)1nkj𝒩k~Fj(θjk,t)2]Hi}\displaystyle=(1+\epsilon)a_{t}+\beta^{2}\left(1+\frac{1}{\epsilon}\right)\max_{i\in\mathcal{N}_{k}}\Bigg{\{}\underbrace{\mathbb{E}\bigg{[}\bigg{\|}\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\bigg{\|}^{2}\bigg{]}}_{H_{i}}\Bigg{\}} (115)

for any ϵ>0\epsilon>0. For HiH_{i}, we first write

Hi2𝔼[Fi(θik,t)1nkj𝒩kFj(θjk,t)2]Hi,1+2𝔼[1nkj𝒩k(Fj(θjk,t)~Fj(θjk,t))+(~Fi(θik,t)Fi(θik,t))2]Hi,2.\displaystyle H_{i}\leq 2\underbrace{\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\nabla F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]}_{H_{i,1}}+2\underbrace{\mathbb{E}\left[\left\|\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t}_{j})-\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right)+\left(\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i})\right)\right\|^{2}\right]}_{H_{i,2}}. (116)

Next, we bound Hi,1H_{i,1} and Hi,2H_{i,2} separately.

  • Upper Bound of Hi,1H_{i,1}: Based on Lemma 3, we have

    Hi,1=\displaystyle H_{i,1}=\; 𝔼[1nkj𝒩k(Fi(θk,t)Fj(θk,t))+Fi(θik,t)Fi(θk,t)+1nkj𝒩k(Fj(θk,t)Fj(θjk,t))2]\displaystyle\mathbb{E}\left[\left\|\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{i}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t})\right)+\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right]
    \displaystyle\leq\; 2nkj𝒩k𝔼[Fi(θk,t)Fj(θk,t)2]\displaystyle\frac{2}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t})\right\|^{2}\right] (from (G))
    +2𝔼[Fi(θik,t)Fi(θk,t)+1nkj𝒩k(Fj(θk,t)Fj(θjk,t))2]\displaystyle+2\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right]
    \displaystyle\leq\; 2γG2+2𝔼[Fi(θik,t)Fi(θk,t)+1nkj𝒩k(Fj(θk,t)Fj(θjk,t))2]\displaystyle 2\gamma^{2}_{G}+2\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right] (from Lemma 3)
    \displaystyle\leq\; 2γG2+4𝔼[Fi(θik,t)Fi(θk,t)2+1nkj𝒩kFj(θk,t)Fj(θjk,t)2]\displaystyle 2\gamma^{2}_{G}+4\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})\right\|^{2}+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left\|\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right] (using Cauchy-Schwarz inequality similar to (118))
    \displaystyle\leq\; 2γG2+4LF2𝔼[θik,tθk,t2+1nkj𝒩kθk,tθjk,t2]\displaystyle 2\gamma^{2}_{G}+4L^{2}_{F}\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-{\theta}^{k,t}\right\|^{2}+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left\|{\theta}^{k,t}-{\theta}^{k,t}_{j}\right\|^{2}\right]
    \displaystyle\leq\; 2γG2+8LF2at.\displaystyle 2\gamma^{2}_{G}+8L^{2}_{F}a_{t}. (117)
  • Upper Bound of Hi,2H_{i,2}: Due to the Cauchy-Schwarz inequality, the following holds

    j=1nk+1xjyj2(j=1nk+1xj2)(j=1nk+1yj2)\displaystyle\left\|\sum^{n_{k}+1}_{j=1}x_{j}y_{j}\right\|^{2}\leq\left(\sum^{n_{k}+1}_{j=1}\left\|x_{j}\right\|^{2}\right)\left(\sum^{n_{k}+1}_{j=1}\left\|y_{j}\right\|^{2}\right) (118)

    where xj=1nkx_{j}=\frac{1}{\sqrt{n_{k}}} and yj=1nk(Fj(θjk,t)~Fj(θjk,t))y_{j}=\frac{1}{\sqrt{n_{k}}}(\nabla F_{j}({\theta}^{k,t}_{j})-\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})) for 1jnk1\leq j\leq n_{k}; xnk+1=1x_{n_{k}+1}=1 and ynk+1=~Fi(θik,t)Fi(θik,t)y_{n_{k}+1}=\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i}). Thus, we have

    Hi,2\displaystyle H_{i,2} 2𝔼[~Fi(θik,t)Fi(θik,t)2+1nkj𝒩kFj(θjk,t)~Fj(θjk,t)2]\displaystyle\leq 2\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i})\right\|^{2}+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left\|\nabla F_{j}({\theta}^{k,t}_{j})-\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]
    4σF2\displaystyle\leq 4\sigma^{2}_{F} (using Lemma 2)

    where σF=maxi𝒩{σFi}\sigma_{F}=\max_{i\in\mathcal{N}}\{\sigma_{F_{i}}\}.

Based on the above results, substituting (116) into (115), we obtain

at+1\displaystyle a_{t+1} (1+ϵ)at+2β2(1+1ϵ)maxi𝒩k{Hi,1+Hi,2}\displaystyle\leq(1+\epsilon)a_{t}+2\beta^{2}\left(1+\frac{1}{\epsilon}\right)\max_{i\in\mathcal{N}_{k}}\left\{H_{i,1}+H_{i,2}\right\}
(1+ϵ+16β2LF2(1+1ϵ))at+4β2(1+1ϵ)(γG2+2σF2).\displaystyle\leq\left(1+\epsilon+16\beta^{2}L^{2}_{F}\left(1+\frac{1}{\epsilon}\right)\right)a_{t}+4\beta^{2}\left(1+\frac{1}{\epsilon}\right)\left(\gamma^{2}_{G}+2\sigma^{2}_{F}\right). (119)

Note that (119) is essentially the same as [9, Equation (78)]. Therefore, since β[0,1/(10LFτ))\beta\in[0,1/(10L_{F}\tau)), it follows from [9, Corollary F.2] that

𝔼[Ci2]at35β2tτ(γG2+2σF2).\displaystyle\mathbb{E}[\|C_{i}\|^{2}]\leq a_{t}\leq 35\beta^{2}t\tau(\gamma^{2}_{G}+2\sigma^{2}_{F}). (120)
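As an illustrative numerical check (not part of the proof), the Python sketch below unrolls the recursion (119) for t ≤ τ and compares the result against the bound in (120). The parameter ranges, the number of random trials, and in particular the choice ε = 1/(τ−1) are assumptions made here for illustration; the constant 35 in (120) comes from [9, Corollary F.2], whose exact argument is not reproduced.

import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    L_F = rng.uniform(0.1, 10.0)
    tau = int(rng.integers(2, 50))
    beta = rng.uniform(0.0, 1.0) / (10.0 * L_F * tau)  # beta in [0, 1/(10 L_F tau))
    gamma_G2 = rng.uniform(0.0, 5.0)   # stands for gamma_G^2
    sigma_F2 = rng.uniform(0.0, 5.0)   # stands for sigma_F^2
    eps = 1.0 / (tau - 1)              # assumed choice of epsilon
    a = 0.0                            # a_0 = 0
    for t in range(1, tau + 1):
        # one step of recursion (119), taken with equality (worst case)
        a = (1 + eps + 16 * beta**2 * L_F**2 * (1 + 1 / eps)) * a \
            + 4 * beta**2 * (1 + 1 / eps) * (gamma_G2 + 2 * sigma_F2)
        # compare against the bound stated in (120)
        assert a <= 35 * beta**2 * t * tau * (gamma_G2 + 2 * sigma_F2) + 1e-12
print("recursion (119) stays below the bound in (120) for all sampled settings")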

Similar to (106), substituting (111) in 𝔼[Gk,t]\mathbb{E}[G^{k,t}], we obtain

𝔼[Gk,t]=\displaystyle\mathbb{E}[G^{k,t}]=\; 𝔼[βnki𝒩kF(θk,t)~Fi(θk,t)LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\nabla F({\theta}^{k,t})^{\top}\tilde{\nabla}F_{i}({\theta}^{k,t})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
=\displaystyle=\; 𝔼[βnki𝒩k(Ai+Bi+Ci+~Fi(θik,t))~Fi(θk,t)LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A_{i}+B_{i}+C_{i}+\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})\right)^{\top}\tilde{\nabla}F_{i}({\theta}^{k,t})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
=\displaystyle=\; 𝔼[βnki𝒩k(Ai~Fi(θk,t)+Bi~Fi(θk,t)+Ci~Fi(θk,t)+~Fi(θik,t)2)LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})+B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})+C^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})+\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})\right\|^{2}\right)-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
=\displaystyle=\; 𝔼[βnki𝒩k(𝔼[Ai~Fi(θk,t)|𝒩k]+𝔼[Bi~Fi(θk,t)|𝒩k]+𝔼[Ci~Fi(θk,t)|𝒩k]+~Fi(θik,t)2)\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(\mathbb{E}\Big{[}A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{|}\mathcal{N}_{k}\Big{]}+\mathbb{E}\Big{[}B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{|}\mathcal{N}_{k}\Big{]}+\mathbb{E}\Big{[}C^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{|}\mathcal{N}_{k}\Big{]}+\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})\right\|^{2}\right)\right.
LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\left.-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
\displaystyle\geq\; 𝔼[βnki𝒩k(𝔼[Ai2|𝒩k]𝔼[~Fi(θk,t)2|𝒩k]𝔼[Bi2|𝒩k]𝔼[~Fi(θk,t)2|𝒩k]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(-\sqrt{\mathbb{E}\left[\left\|A_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}-\sqrt{\mathbb{E}\left[\left\|B_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right.\right.
𝔼[Ci2|𝒩k]𝔼[~Fi(θk,t)2|𝒩k])]+β(1LFβ2)𝔼[1nki𝒩k~Fi(θk,t)2]\displaystyle-\left.\left.\sqrt{\mathbb{E}\left[\left\|C_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right]+\beta\left(1-\frac{L_{F}\beta}{2}\right)\mathbb{E}\bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{\|}^{2}\bigg{]} (using (105) and (G), and rearranging terms)
\displaystyle\geq\; β𝔼[1nki𝒩k((1LFβ2)~Fi(θk,t)2\displaystyle\beta\mathbb{E}\left[\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(\left(1-\frac{L_{F}\beta}{2}\right)\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{\|}^{2}\right.\right.
((1+αL)2γG+αζγH+σFi+β35tτ(γG2+2σF2))𝔼[~Fi(θk,t)2|𝒩k])].\displaystyle\left.\left.-\left(\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\sigma_{F_{i}}+\beta\sqrt{35t\tau(\gamma^{2}_{G}+2\sigma^{2}_{F})}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right]. (from (112), (113), and (120))

Therefore, the following holds

𝔼[F(θk)F(θk+1)]\displaystyle\mathbb{E}\left[F({\theta}^{k})-F({\theta}^{k+1})\right] =𝔼[t=0τ1F(θk,t)F(θk,t+1)]\displaystyle=\mathbb{E}\left[\sum^{\tau-1}_{t=0}F({\theta}^{k,t})-F({\theta}^{k,t+1})\right]
t=0τ1𝔼[Gk,t]\displaystyle\geq\sum^{\tau-1}_{t=0}\mathbb{E}\left[G^{k,t}\right]
β2𝔼[1nki𝒩kt=0τ1(~Fi(θk,t)22(λ1+λ2Di)𝔼[~Fi(θk,t)2|𝒩k])]\displaystyle\geq\frac{\beta}{2}\mathbb{E}\left[\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\sum^{\tau-1}_{t=0}\left(\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{\|}^{2}-2\left(\lambda_{1}+\frac{\lambda_{2}}{\sqrt{D_{i}}}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right] (substituting (2))

where

λ1\displaystyle\lambda_{1} (1+αL)2γG+αζγH+βτ35(γG2+2σF2)\displaystyle\geq\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\beta\tau\sqrt{35(\gamma^{2}_{G}+2\sigma^{2}_{F})} (121)
λ2\displaystyle\lambda_{2} 6σG2(1+(αL)2)((ασH)2+(1+αL)2)+3(αζσH)2.\displaystyle\geq\sqrt{6\sigma^{2}_{G}\left(1+(\alpha L)^{2}\right)\left((\alpha\sigma_{H})^{2}+(1+\alpha L)^{2}\right)+3(\alpha\zeta\sigma_{H})^{2}}. (122)

This completes the proof.

I-B Proof of Corollary 2

The result of Corollary 2 can be obtained by combining [9, Lemma H.1] and [9, Lemma H.2] with the proof of Lemma 1.