
Efficient Federated Meta-Learning over Multi-Access Wireless Networks

Sheng Yue, Ju Ren, Jiang Xin, Deyu Zhang, Yaoxue Zhang, and Weihua Zhuang. Sheng Yue, Jiang Xin, and Deyu Zhang are with the School of Computer Science and Engineering, Central South University, Changsha, 410083 China. Emails: {sheng.yue, xinjiang, zdy876}@csu.edu.cn. Ju Ren and Yaoxue Zhang are with the Department of Computer Science and Technology, Tsinghua University, Beijing, 100084 China. Emails: {renju, zhangyx}@tsinghua.edu.cn. Weihua Zhuang is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. Email: [email protected].
Abstract

Federated meta-learning (FML) has emerged as a promising paradigm to cope with the data limitation and heterogeneity challenges in today’s edge learning arena. However, its performance is often limited by slow convergence and corresponding low communication efficiency. In addition, since the available radio spectrum and IoT devices’ energy capacity are usually insufficient, it is crucial to control the resource allocation and energy consumption when deploying FML in practical wireless networks. To overcome these challenges, in this paper, we rigorously analyze the contribution of each device to the global loss reduction in each round and develop an FML algorithm (called NUFM) with a non-uniform device selection scheme to accelerate the convergence. After that, we formulate a resource allocation problem integrating NUFM in multi-access wireless systems to jointly improve the convergence rate and minimize the wall-clock time along with energy cost. By deconstructing the original problem step by step, we devise a joint device selection and resource allocation strategy to solve the problem with theoretical guarantees. Further, we show that the computational complexity of NUFM can be reduced from $O(d^2)$ to $O(d)$ (with model dimension $d$) by combining two first-order approximation techniques. Extensive simulation results demonstrate the effectiveness and superiority of the proposed methods in comparison with existing baselines.

Index Terms:
Federated meta-learning, multi-access systems, device selection, resource allocation, efficiency.

I Introduction

The integration of Artificial Intelligence (AI) and Internet-of-Things (IoT) has led to a proliferation of studies on edge intelligence, aiming at pushing AI frontiers to the wireless network edge proximal to IoT devices and data sources [1]. It is expected that edge intelligence will reduce time-to-action latency down to milliseconds for IoT applications while minimizing network bandwidth and offering security guarantees [2]. However, a general consensus is that a single IoT device can hardly realize edge intelligence due to its limited computational and storage capabilities. Accordingly, it is natural to rely on collaboration in edge learning, whereby IoT devices work together to accomplish computation-intensive tasks [3].

Building on a synergy of federated learning [4] and meta-learning [5], federated meta-learning (FML) has been proposed under a common theme of fostering edge-edge collaboration [6, 7, 8]. In FML, IoT devices join forces to learn an initial shared model under the orchestration of a central server such that current or new devices can quickly adapt the learned model to their local datasets via one or a few gradient descent steps. Notably, FML can keep all the benefits of the federated learning paradigm (such as simplicity, data security, and flexibility), while giving a more personalized model for each device to capture the differences among tasks [9]. Therefore, FML has emerged as a promising approach to tackle the heterogeneity challenges in federated learning and to facilitate efficient edge learning [10].

Despite its promising benefits, FML comes with new challenges. On one hand, the number of participating devices can be enormous. Selecting devices uniformly at random, as is often done in existing methods, leads to slow convergence [11, 9]. Although recent studies [12, 13, 14] have characterized the convergence of federated learning and proposed non-uniform device selection mechanisms, they cannot be directly applied to FML problems due to the bias and high-order information in the stochastic gradients. On the other hand, the performance of FML in a wireless environment is highly related to its wall-clock time, including the computation time (determined by local data sizes and devices’ CPU types) and communication time (depending on channel gains, interference, and transmission power) [15, 16]. If not properly controlled, a large wall-clock time can cause unexpected training delays and communication inefficiency. In addition, DNN model training, which involves a large number of data samples and epochs, usually induces high computational cost, especially for sophisticated model structures consisting of millions of parameters [15]. Due to the limited power capacity of IoT devices, energy consumption should also be properly managed to ensure system sustainability and stability [14]. In a nutshell, to efficiently deploy FML in today’s wireless systems, the strategy of device selection and resource allocation must be carefully crafted not only to accelerate the learning process but also to control the training wall-clock time and the energy cost of edge devices. Unfortunately, despite their importance, there are limited studies on these aspects in the current literature.

In this paper, we tackle the above challenges in two steps: 1) we develop an algorithm (called NUFM) with a non-uniform device selection scheme to improve the convergence rate of the vanilla FML algorithm; 2) based on NUFM, we propose a resource allocation strategy (called URAL) that jointly optimizes the convergence speed, wall-clock time, and energy consumption in the context of multi-access wireless systems. More specifically, we first rigorously quantify the contribution of each device to the convergence of FML by deriving a tight lower bound on the one-round reduction of the global loss. Based on this quantitative result, we present a non-uniform device selection scheme that maximizes the loss reduction per round, followed by the NUFM algorithm. Then, we formulate a resource allocation problem for NUFM over wireless networks, capturing the trade-offs among convergence, wall-clock time, and energy cost. To solve this problem, we exploit its special structure and decompose it into two sub-problems. The first one minimizes the computation time by controlling devices’ CPU-cycle frequencies and is solved optimally based on an analysis of the effect of device heterogeneity on the objective. The second sub-problem optimizes the resource block allocation and transmission power management. It is a non-convex mixed-integer non-linear programming (MINLP) problem, for which deriving a closed-form solution is non-trivial. Thus, after deconstructing the problem step by step, we devise an iterative method to solve it and provide a convergence guarantee.

In summary, our main contributions are four-fold.

  • We provide a theoretical characterization of the contribution of an individual device to the convergence of FML in each round, by establishing a tight lower bound on the one-round reduction of the expected global loss. Using this quantitative result, we develop NUFM, a fast-convergent FML algorithm with non-uniform device selection;

  • To embed NUFM in the context of multi-access wireless systems, we formulate a resource allocation problem, capturing the trade-offs among the convergence, wall-clock time, and energy consumption. By decomposing the original problem into two sub-problems and deconstructing sub-problems step by step, we propose a joint device selection and resource allocation algorithm (namely URAL) to solve the problem effectively with theoretical performance guarantees;

  • To reduce the computational complexity, we further integrate our proposed algorithms with two first-order approximation techniques in [9], by which the complexity of a one-step update in NUFM can be reduced from $O(d^2)$ to $O(d)$. We also show that our theoretical results hold in these cases;

  • We provide extensive simulation results on challenging real-world benchmarks (i.e., Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet) to demonstrate the efficacy of our methods.

The remainder of this paper is organized as follows. Section II briefly reviews the related work. Section III introduces the FML problem and standard algorithm. We present the non-uniform device selection scheme in Section IV and adapt the scheme for wireless networks in Section V. Finally, Section VI presents the extension to first-order approximation techniques, followed by the simulation results in Section VII and a conclusion drawn in Section VIII.

II Related Work

Federated learning (FL) [4] has been proposed as a promising technique to facilitate edge-edge collaborative learning [17]. However, due to the heterogeneity in devices, models, and data distributions, a shared global model often fails to capture the individual information of each device, leading to performance degradation in inference or classification [18, 10, 19].

Very recently, based on the advances in meta-learning [5], federated meta-learning (FML) has garnered much attention, which aims to learn a personalized model for each device to cope with the heterogeneity challenges [6, 7, 8, 9, 11]. Chen et al.[6] first introduce an FML method called FedMeta, integrating the model-agnostic meta-learning (MAML) algorithm [5] into the federated learning framework. They show that FML can significantly improve the performance of FedAvg [4]. Jiang et al.[7] analyze the connection between FedAvg and MAML, and empirically demonstrate that FML enables better and more stable personalized performance. From a theoretical perspective, Lin et al.[8] analyze the convergence properties and computational complexity of FML with strongly convex loss functions and exact gradients. Fallah et al.[9] further provide the convergence guarantees in non-convex cases with stochastic gradients. Different from the above gradient descent–based approaches, another recent work [11] develops an ADMM-based FML method and gives its convergence guarantee under non-convex cases. However, due to selecting devices uniformly at random, the existing FML algorithms often suffer from slow convergence and low communication efficiency [9]. Further, deploying FML in practical wireless systems calls for effective resource allocation strategies [15, 20], which is beyond the scope of existing FML literature.

There exists a significant body of work on convergence improvement and resource allocation for FL [21, 22, 23, 24, 13, 25, 26, 27, 15, 28, 29, 14, 30, 31, 32, 33, 34, 35, 36]. Regarding convergence improvement, Nguyen et al. [13] propose a fast-convergent FL algorithm, called FOLB, which achieves a near-optimal lower bound on the overall loss decrease in each round. Note that, while the idea of NUFM is similar to [13], the lower bound for FML is derived along a completely different technical path due to the inherent complexity of the local update. To minimize the convergence time, a probabilistic device selection scheme for FL is designed in [25], which assigns high probabilities to the devices with large effects on the global model. Ren et al. [26] investigate a batch-size selection strategy for accelerating the FL training process. Karimireddy et al. [30] employ the variance reduction technique to develop a new FL algorithm. Based on momentum methods, Yu et al. [27] give an FL algorithm with a linear speedup property. Regarding resource allocation in FL, Dinh et al. [15] embed FL in wireless networks, considering the trade-offs between training time and energy consumption, under the assumption that all devices participate in the whole training process. From a long-term perspective, Xu et al. [28] empirically investigate a device selection scheme jointly with bandwidth allocation for FL, using the Lyapunov optimization method. Chen et al. [14] investigate a device selection problem with “hard” resource constraints to enable the implementation of FL over wireless networks. Wang et al. [29] propose a control algorithm to determine the best trade-off between local updates and global aggregation under a resource budget.

Although extensive research has been carried out on FL, FML has not been treated in much detail. In particular, the existing FL acceleration techniques cannot be directly applied to FML due to the high-order information and biased stochastic gradients in the local update phase (different from FML, FL is a first-order method with unbiased stochastic gradients; see Lemma 2 in Section IV). At the same time, device selection and resource allocation need to be crafted jointly [6], rather than simply plugging existing strategies in.

III Preliminaries and Assumptions

In this section, we introduce federated meta-learning, including the learning problem, standard algorithm, and assumptions for theoretical analysis.

TABLE I: Key Notations
$\theta$: model parameter | $F_i(\cdot)$: meta-loss function
$\alpha$, $\beta$: stepsize and meta-learning rate | $L_i$, $\rho_i$: Lipschitz continuity parameters
$i$, $j$: indexes of user devices and data samples | $\sigma_G$, $\sigma_H$: upper bounds of variances
$k$: index of training rounds | $\gamma_G$, $\gamma_H$: similarity parameters
$t$: index of local update steps | $u_i$: contribution of device $i$ to the global loss reduction
$\tau$: number of local update steps | $\nu_i$: CPU-cycle frequency of device $i$
$P_i$: underlying data distribution of device $i$ | $p_i$: transmission power of device $i$
$\mathcal{D}_i$, $D_i$: dataset of device $i$ and its size | $z_{i,m}$: binary variable indicating whether device $i$ accesses RB $m$
$\mathcal{N}$: set of devices | $\eta_1$, $\eta_2$: weight parameters
$n$: number of devices | $c_i$: CPU cycles for computing one sample on device $i$
$\mathcal{N}_k$: set of participating devices in round $k$ | $h_i$, $I_m$: channel gain of device $i$ and interference in RB $m$
$n_k$: number of participating devices in round $k$ | $B$, $N_0$: bandwidth of each RB and noise power spectral density
$\mathrm{x}$, $\mathrm{y}$: input and corresponding label | $\mathcal{M}$, $M$: set and number of RBs
$\ell_i(\theta;\mathrm{x},\mathrm{y})$: loss of model $\theta$ on sample $(\mathrm{x},\mathrm{y})$ | $E$: total energy consumption
$f_i(\cdot)$: expected loss function | $T$: total latency

III-A Federated Meta-Learning Problem

We consider a set $\mathcal{N}$ of user devices that are all connected to a server. Each device $i\in\mathcal{N}$ has a labeled dataset $\mathcal{D}_i=\{\mathrm{x}^j_i,\mathrm{y}^j_i\}^{D_i}_{j=1}$ that can be accessed only by itself. Here, the tuple $(\mathrm{x}^j_i,\mathrm{y}^j_i)\in\mathcal{X}\times\mathcal{Y}$ is a data sample with input $\mathrm{x}^j_i$ and label $\mathrm{y}^j_i$, and follows an unknown underlying distribution $P_i$. Define $\theta$ as the model parameter, such as the weights of a Deep Neural Network (DNN) model. For device $i$, the loss function of a model parameter $\theta\in\mathbb{R}^d$ is defined as $\ell_i(\theta;\mathrm{x},\mathrm{y})$, which measures the error of model $\theta$ in predicting the true label $\mathrm{y}$ given input $\mathrm{x}$.

Federated meta-learning (FML) looks for a good model initialization (also called a meta-model) such that well-performing models for different devices can be quickly obtained via one or a few gradient descent steps. More specifically, FML aims to solve the following problem

$\min_{\theta\in\mathbb{R}^d} F(\theta) \coloneqq \frac{1}{n}\sum_{i\in\mathcal{N}} f_i\big(\theta-\alpha\nabla f_i(\theta)\big)$  (1)

where $f_i$ represents the expected loss function over the data distribution of device $i$, i.e., $f_i(\theta)\coloneqq\mathbb{E}_{(\mathrm{x},\mathrm{y})\sim P_i}[\ell_i(\theta;\mathrm{x},\mathrm{y})]$, $n=|\mathcal{N}|$ is the number of devices, and $\alpha$ is the stepsize. The advantages of this formulation are two-fold: 1) it gives a personalized solution that can capture the heterogeneity among devices; 2) the meta-model can quickly adapt to new devices via slightly updating it with respect to their own data. Clearly, FML fits edge learning well, where edge devices have insufficient computing power and limited data samples.
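To make the objective (1) concrete, the following minimal Python sketch evaluates the per-device meta-loss and the global objective. The callables `f` and `grad_f` (per-device expected loss and its gradient) and the `devices` list are hypothetical stand-ins for quantities that, in practice, are estimated from local data.

```python
import numpy as np

def meta_loss(theta, f, grad_f, alpha):
    """Per-device meta-loss F_i(theta) = f_i(theta - alpha * grad f_i(theta)) in (1)."""
    theta_adapted = theta - alpha * grad_f(theta)   # one-step adaptation
    return f(theta_adapted)

def global_objective(theta, devices, alpha):
    """F(theta): average of the per-device meta-losses over all devices in N."""
    return np.mean([meta_loss(theta, d["f"], d["grad_f"], alpha) for d in devices])
```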

Next, we review the standard FML algorithm in the literature.

III-B Standard Algorithm

Similar to federated learning, the vanilla FML algorithm solves (1) by repeating two steps, local update and global aggregation [9], as detailed below.

  • Local update: At the beginning of each round $k$, the server first sends the current global model $\theta^k$ to a fraction of devices $\mathcal{N}_k$ chosen uniformly at random with pre-set size $n_k$. Then, each device $i\in\mathcal{N}_k$ updates the received model based on its meta-function $F_i(\theta)\coloneqq f_i(\theta-\alpha\nabla f_i(\theta))$ by running $\tau$ ($\geq 1$) steps of stochastic gradient descent locally (also called mini-batch gradient descent), i.e.,

    $\theta^{k,t+1}_i=\theta^{k,t}_i-\beta\tilde{\nabla}F_i(\theta^{k,t}_i),\quad \mathrm{for}~0\leq t\leq\tau-1$  (2)

    where $\theta^{k,t}_i$ denotes the local model of device $i$ at the $t$-th step of the local update in round $k$, with $\theta^{k,0}_i=\theta^k$, and $\beta>0$ is the meta-learning rate. In (2), the stochastic gradient $\tilde{\nabla}F_i(\theta)$ is given by

    $\tilde{\nabla}F_i(\theta)\coloneqq\big(I-\alpha\tilde{\nabla}^2 f_i(\theta,\mathcal{D}''_i)\big)\tilde{\nabla}f_i\big(\theta-\alpha\tilde{\nabla}f_i(\theta,\mathcal{D}_i),\mathcal{D}'_i\big)$  (3)

    where $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches (we slightly abuse the notation $\mathcal{D}_i$ for a batch of the local dataset of the $i$-th device), and for any batch $\mathcal{D}$, $\tilde{\nabla}f_i(\theta,\mathcal{D})$ and $\tilde{\nabla}^2 f_i(\theta,\mathcal{D})$ are the unbiased estimates of $\nabla f_i(\theta)$ and $\nabla^2 f_i(\theta)$, respectively, i.e.,

    $\tilde{\nabla}f_i(\theta,\mathcal{D})\coloneqq\frac{1}{|\mathcal{D}|}\sum_{(\mathrm{x},\mathrm{y})\in\mathcal{D}}\nabla\ell_i(\theta;\mathrm{x},\mathrm{y})$  (4)
    $\tilde{\nabla}^2 f_i(\theta,\mathcal{D})\coloneqq\frac{1}{|\mathcal{D}|}\sum_{(\mathrm{x},\mathrm{y})\in\mathcal{D}}\nabla^2\ell_i(\theta;\mathrm{x},\mathrm{y}).$  (5)

  • Global aggregation: After updating the local model parameter, each selected device sends its local model $\theta^k_i=\theta^{k,\tau}_i$ to the server. The server updates the global model by averaging over the received models, i.e.,

    $\theta^{k+1}=\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\theta^k_i.$  (6)

It is easy to see that the main difference between federated learning and FML lies in the local update phase: federated learning performs local updates with unbiased gradient estimates, whereas FML uses biased estimates that involve high-order information. Moreover, federated learning can be viewed as a special case of FML with $\alpha=0$.
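The following sketch illustrates one round of the vanilla algorithm, assuming each device exposes hypothetical mini-batch oracles `grad(theta, batch)` and `hess(theta, batch)` and a batch sampler `sample_batches()`; it computes the biased stochastic meta-gradient in (3), runs the local updates (2), and averages as in (6).

```python
import numpy as np

def stochastic_meta_grad(theta, grad, hess, batches, alpha):
    """Biased stochastic meta-gradient of F_i, following (3)."""
    D, D1, D2 = batches                              # independent batches
    theta_adapted = theta - alpha * grad(theta, D)   # inner adaptation step
    g = grad(theta_adapted, D1)
    H = hess(theta, D2)
    return (np.eye(theta.size) - alpha * H) @ g

def fml_round(theta, devices, alpha, beta, tau, n_k, rng):
    """One round of the vanilla FML algorithm: uniform selection, (2), then (6)."""
    selected = rng.choice(len(devices), size=n_k, replace=False)
    local_models = []
    for i in selected:
        d = devices[i]
        theta_i = theta.copy()
        for _ in range(tau):                         # tau local steps of (2)
            g = stochastic_meta_grad(theta_i, d["grad"], d["hess"],
                                     d["sample_batches"](), alpha)
            theta_i -= beta * g
        local_models.append(theta_i)
    return np.mean(local_models, axis=0)             # global aggregation (6)
```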

III-C Assumptions

In this subsection, we list the standard assumptions for the analysis of FML algorithms [9, 8, 11].

Assumption 1 (Smoothness).

The expected loss function $f_i$ corresponding to device $i\in\mathcal{N}$ is twice continuously differentiable and $L_i$-smooth, i.e.,

$\|\nabla f_i(\theta_1)-\nabla f_i(\theta_2)\|\leq L_i\|\theta_1-\theta_2\|,~\forall\theta_1,\theta_2\in\mathbb{R}^d.$  (7)

Besides, its gradient is bounded by a positive constant $\zeta_i$, i.e., $\|\nabla f_i(\theta)\|\leq\zeta_i$.

Assumption 2 (Lipschitz Hessian).

The Hessian of function $f_i$ is $\rho_i$-Lipschitz continuous for each $i\in\mathcal{N}$, i.e.,

$\|\nabla^2 f_i(\theta_1)-\nabla^2 f_i(\theta_2)\|\leq\rho_i\|\theta_1-\theta_2\|,~\forall\theta_1,\theta_2\in\mathbb{R}^d.$  (8)
Assumption 3 (Bounded Variance).

Given any $\theta\in\mathbb{R}^d$, the following facts hold for the stochastic gradient $\nabla\ell_i(\theta;\mathrm{x},\mathrm{y})$ and Hessian $\nabla^2\ell_i(\theta;\mathrm{x},\mathrm{y})$ with $(\mathrm{x},\mathrm{y})\in\mathcal{X}\times\mathcal{Y}$:

$\mathbb{E}_{(\mathrm{x},\mathrm{y})\sim P_i}\big[\|\nabla\ell_i(\theta;\mathrm{x},\mathrm{y})-\nabla f_i(\theta)\|^2\big]\leq\sigma^2_G$  (9)
$\mathbb{E}_{(\mathrm{x},\mathrm{y})\sim P_i}\big[\|\nabla^2\ell_i(\theta;\mathrm{x},\mathrm{y})-\nabla^2 f_i(\theta)\|^2\big]\leq\sigma^2_H.$  (10)
Assumption 4 (Similarity).

For any $\theta\in\mathbb{R}^d$ and $i,j\in\mathcal{N}$, there exist nonnegative constants $\gamma_G\geq 0$ and $\gamma_H\geq 0$ such that the gradients and Hessians of the expected loss functions $f_i(\theta)$ and $f_j(\theta)$ satisfy the following conditions:

$\|\nabla f_i(\theta)-\nabla f_j(\theta)\|\leq\gamma_G$  (11)
$\|\nabla^2 f_i(\theta)-\nabla^2 f_j(\theta)\|\leq\gamma_H.$  (12)

Assumption 2 implies the high-order smoothness of $f_i(\theta)$ needed to handle the second-order information in the local update step (2). Assumption 4 indicates that the variations of gradients and Hessians between different devices are bounded by constants, which captures the similarity between devices’ tasks under non-IID data. It holds for many practical loss functions [37], such as logistic regression and hyperbolic tangent functions. In particular, $\gamma_G$ and $\gamma_H$ can be roughly seen as a distance between the data distributions $P_i$ and $P_j$ [38].

IV Non-Uniform Federated Meta-Learning

Due to the uniform random selection of devices in each round, the convergence rate of the standard FML algorithm is slow. In this section, we present a non-uniform device selection scheme to tackle this challenge.

IV-A Device Contribution Quantification

We begin with quantifying the contribution of each device to the one-round reduction of the global loss, using its dataset size and gradient norms. For convenience, define $\zeta\coloneqq\max_i\zeta_i$, $L\coloneqq\max_i L_i$, and $\rho\coloneqq\max_i\rho_i$. We first provide necessary lemmas before giving the main result.

Lemma 1.

If Assumptions 1 and 2 hold, the local meta-function $F_i$ is smooth with parameter $L_F\coloneqq(1+\alpha L)^2 L+\alpha\rho\zeta$.

Sketch of Proof.

We expand the expression of $\|\nabla F_i(\theta_1)-\nabla F_i(\theta_2)\|$ into two parts by the triangle inequality, followed by bounding each part via Assumptions 1 and 2. The detailed proof is presented in Appendix A of the technical report [39]. ∎

Lemma 1 gives the smoothness of the local meta-function $F_i$ and the global loss function $F$.

Lemma 2.

Suppose that Assumptions 1–3 are satisfied, and $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches with sizes $D_i$, $D'_i$, and $D''_i$, respectively. For any $\theta\in\mathbb{R}^d$, the following holds:

$\big\|\mathbb{E}\big[\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\big]\big\|\leq\frac{\alpha\sigma_G L(1+\alpha L)}{\sqrt{D_i}}$  (13)
$\mathbb{E}\big[\|\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\|^2\big]\leq\sigma^2_{F_i}$  (14)

where $\sigma^2_{F_i}$ is defined as

$\sigma^2_{F_i}\coloneqq 6\sigma^2_G(1+\alpha L)^2\Big(\frac{1}{D'_i}+\frac{(\alpha L)^2}{D_i}\Big)+\frac{3(\alpha\zeta\sigma_H)^2}{D''_i}+\frac{6(\alpha\sigma_G\sigma_H)^2}{D''_i}\Big(\frac{1}{D'_i}+\frac{(\alpha L)^2}{D_i}\Big).$  (15)
Sketch of Proof.

We first obtain

$\big\|\mathbb{E}\big[\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\big]\big\|\leq\big\|\big(I-\alpha\nabla^2 f_i(\theta)\big)\mathbb{E}[\delta^*_2]+\mathbb{E}[\delta^*_1\delta^*_2]+\mathbb{E}[\delta^*_1]\nabla f_i\big(\theta-\alpha\nabla f_i(\theta)\big)\big\|,$
$\mathbb{E}\big[\|\tilde{\nabla}F_i(\theta)-\nabla F_i(\theta)\|^2\big]\leq 3\Big(\mathbb{E}\big[\|\delta^*_1\|^2\big]\mathbb{E}\big[\|\delta^*_2\|^2\big]+\zeta^2\mathbb{E}\big[\|\delta^*_1\|^2\big]+(1+\alpha L)^2\mathbb{E}\big[\|\delta^*_2\|^2\big]\Big),$

where $\delta^*_1$ and $\delta^*_2$ are given by

$\delta^*_1=\alpha\big(\nabla^2 f_i(\theta)-\tilde{\nabla}^2 f_i(\theta,\mathcal{D}''_i)\big)$
$\delta^*_2=\tilde{\nabla}f_i\big(\theta-\alpha\tilde{\nabla}f_i(\theta,\mathcal{D}_i),\mathcal{D}'_i\big)-\nabla f_i\big(\theta-\alpha\nabla f_i(\theta)\big).$

Then, we derive the results by bounding the first and second moments of $\delta^*_1$ and $\delta^*_2$. The detailed proof is presented in Appendix B of the technical report [39]. ∎

Lemma 2 shows that the stochastic gradient $\tilde{\nabla}F_i(\theta)$ is a biased estimate of $\nabla F_i(\theta)$, revealing the challenges in analyzing FML algorithms.

Lemma 3.

If Assumptions 1, 2, and 4 are satisfied, then for any $\theta\in\mathbb{R}^d$ and $i,j\in\mathcal{N}$, we have

$\|\nabla F_i(\theta)-\nabla F_j(\theta)\|\leq(1+\alpha L)^2\gamma_G+\alpha\zeta\gamma_H.$  (16)
Sketch of Proof.

We divide the bound of $\|\nabla F_i(\theta)-\nabla F_j(\theta)\|$ into two independent terms, followed by bounding the two terms separately. The detailed proof is presented in Appendix C of the technical report [39]. ∎

Lemma 3 characterizes the similarity between the local meta-functions, which is critical for analyzing the one-round global loss reduction because it relates the local meta-functions to the global objective. Based on Lemmas 1–3, we are now ready to give our main result.

Theorem 1.

Suppose that Assumptions 1–4 are satisfied, and $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches. If the local update and global aggregation follow (2) and (6), respectively, then the following holds for $\tau=1$:

$\mathbb{E}[F(\theta^k)-F(\theta^{k+1})]\geq\beta\,\mathbb{E}\Bigg[\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\Bigg(\Big(1-\frac{L_F\beta}{2}\Big)\big\|\tilde{\nabla}F_i(\theta^k)\big\|^2-\Big(\sqrt{(1+\alpha L)^2\gamma_G+\alpha\zeta\gamma_H}+\sigma_{F_i}\Big)\sqrt{\mathbb{E}\Big[\big\|\tilde{\nabla}F_i(\theta^k)\big\|^2\,\Big|\,\mathcal{N}_k\Big]}\Bigg)\Bigg]$  (17)

where the outer expectation on the RHS is taken with respect to the selected user set $\mathcal{N}_k$ and the sampled data, and the inner expectation is only with respect to the sampled data.

Sketch of Proof.

Using the smoothness condition of $F_i$, we express the lower bound on the loss reduction as

$\mathbb{E}[F(\theta^k)-F(\theta^{k+1})]\geq\mathbb{E}[G^k],$

where $G^k$ is defined as

$G^k\coloneqq\beta\nabla F(\theta^k)^\top\Big(\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\tilde{\nabla}F_i(\theta^k)\Big)-\frac{L_F\beta^2}{2}\Big\|\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\tilde{\nabla}F_i(\theta^k)\Big\|^2.$

The key step in deriving the desired result is providing a tight lower bound for the product of $\nabla F(\theta^k)$ and $\tilde{\nabla}F(\theta^k)$. The detailed proof is presented in Appendix G of the technical report [39]. ∎

Theorem 1 provides a lower bound on the one-round reduction of the global objective $F$ as a function of the device selection. It implies that different device selections have different impacts on the objective improvement, and it quantifies the contribution of each device to this improvement, depending on the variance of the local meta-functions, task similarities, smoothness, and learning rates. It therefore provides a criterion for selecting devices to accelerate the convergence.

From Theorem 1, we have the following Corollary, which simplifies the above result and extends it to multi-step cases.

Corollary 1.

Suppose that Assumptions 1–4 are satisfied, and $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$ are independent batches with $D_i=D'_i=D''_i$. If the local update and global aggregation follow (2) and (6), respectively, then the following holds for $\beta\in[0,1/L_F)$:

$\mathbb{E}\big[F(\theta^k)-F(\theta^{k+1})\big]\geq\frac{\beta}{2}\mathbb{E}\Bigg[\frac{1}{n_k}\sum_{i\in\mathcal{N}_k}\sum^{\tau-1}_{t=0}\Bigg(\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|^2-2\Big(\lambda_1+\frac{\lambda_2}{\sqrt{D_i}}\Big)\sqrt{\mathbb{E}\Big[\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|^2\,\Big|\,\mathcal{N}_k\Big]}\Bigg)\Bigg]$  (18)

where the positive constants $\lambda_1$ and $\lambda_2$ satisfy

$\lambda_1\geq\sqrt{(1+\alpha L)^2\gamma_G+\alpha\zeta\gamma_H}+\beta\tau\sqrt{35(\gamma^2_G+2\sigma^2_F)}$  (19)
$\lambda^2_2\geq 6\sigma^2_G\big(1+(\alpha L)^2\big)\big((\alpha\sigma_H)^2+(1+\alpha L)^2\big)+3(\alpha\zeta\sigma_H)^2.$  (20)
Sketch of Proof.

The proof is similar to that of Theorem 1, with additional tricks in bounding the product of $\nabla F(\theta^k)$ and $\tilde{\nabla}F(\theta^k)$. The detailed proof is presented in Appendix I of the technical report [39]. ∎

Corollary 1 implies that a device with a large gradient naturally accelerates the global loss decrease, whereas a small dataset size degrades the process due to the corresponding high variance. Besides, as the device dissimilarities grow, the lower bound (18) weakens.

Motivated by Corollary 1, we study the device selection in the following subsection.

IV-B Device Selection

To improve the convergence speed, we aim to maximize the lower bound (18) on the one-round objective reduction. Based on Corollary 1, we define the contribution $u^k_i$ of device $i$ to the convergence in round $k$ as

$u^k_i\coloneqq\sum^{\tau-1}_{t=0}\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|^2-2\Big(\lambda_1+\frac{\lambda_2}{\sqrt{D_i}}\Big)\big\|\tilde{\nabla}F_i(\theta^{k,t}_i)\big\|$  (21)

where we replace the second moment of $\tilde{\nabla}F_i(\theta^{k,t}_i)$ by its sample value $\|\tilde{\nabla}F_i(\theta^{k,t}_i)\|^2$. Then, the device selection problem in round $k$ can be formulated as

$\max_{\{z_i\}}~\sum_{i\in\mathcal{N}}z_i u^k_i$
$\mathrm{s.t.}~\sum_{i\in\mathcal{N}}z_i=n_k$
$\quad~~z_i\in\{0,1\},~\forall i\in\mathcal{N}.$  (22)

In (22), $z_i$ is a binary variable: $z_i=1$ if device $i$ is selected in this round, and $z_i=0$ otherwise. The solution of (22) can be found by the Select algorithm introduced in [40, Chapter 9] with worst-case complexity $O(n)$ (problem (22) is indeed a selection problem). Accordingly, our device selection scheme is presented as follows.

Device selection: Following the local update phase, instead of selecting devices uniformly at random, each device $i$ first computes its contribution scalar $u^k_i$ locally and sends it to the server. After receiving $\{u^k_i\}_{i\in\mathcal{N}}$ from all devices, the server runs the Select algorithm and finds the optimal device set, denoted by $\mathcal{N}^*_k$. Notably, although the constants $\lambda_1$ and $\lambda_2$ in (21) involve unknown parameters such as $L$, $\gamma_G$, and $\gamma_H$, they can be either estimated during training as in [29] or directly tuned as in our simulations.
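Since (22) is a selection problem, the server's step reduces to picking the $n_k$ devices with the largest contribution scalars. A minimal sketch is given below, using NumPy's partial sort in place of the Select algorithm of [40]; the device-id keys are hypothetical.

```python
import numpy as np

def select_devices(u, n_k):
    """Solve (22): return the n_k devices with the largest contribution scalars u_i^k.

    `u` maps device id -> u_i^k as reported to the server; np.argpartition finds the
    n_k largest values in O(n) on average, matching the selection-problem structure.
    """
    ids = np.array(list(u.keys()))
    vals = np.array([u[i] for i in ids])
    top = np.argpartition(-vals, n_k - 1)[:n_k]
    return {int(i) for i in ids[top]}
```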

Based on the device selection scheme, we propose the Non-Uniform Federated Meta-Learning (NUFM) algorithm (depicted in Algorithm 1). In particular, although NUFM requires an additional communication phase to upload $u^k_i$ to the server, the communication overhead is negligible because $u^k_i$ is just a scalar.

Input: $\alpha$, $\beta$, $\lambda_1$, $\lambda_2$
Server initializes model $\theta^0$ and sends it to all devices;
for round $k=0$ to $K-1$ do
      foreach device $i\in\mathcal{N}$ do
            // Local update
            Initialize $\theta^{k,0}_i\leftarrow\theta^k$ and $u^k_i\leftarrow 0$;
            for local step $t=0$ to $\tau-1$ do
                  Compute stochastic gradient $\tilde{\nabla}F_i(\theta^{k,t}_i)$ by (3) using batches $\mathcal{D}_i$, $\mathcal{D}'_i$, and $\mathcal{D}''_i$;
                  Update local model $\theta^{k,t+1}_i$ by (2);
                  Update contribution scalar: $u^k_i\leftarrow u^k_i+\|\tilde{\nabla}F_i(\theta^{k,t}_i)\|^2-2(\lambda_1+\lambda_2/\sqrt{D_i})\|\tilde{\nabla}F_i(\theta^{k,t}_i)\|$;
            end for
            Set $\theta^k_i=\theta^{k,\tau}_i$ and send $u^k_i$ to the server;
      end foreach
      // Device selection
      Once receiving $\{u^k_i\}_{i\in\mathcal{N}}$, the server computes the optimal device selection $\mathcal{N}^*_k$ by solving (22);
      // Global aggregation
      After receiving the local models $\{\theta^k_i\}_{i\in\mathcal{N}^*_k}$, the server computes the global model by (6);
end for
return $\theta^K$
Algorithm 1 Non-Uniform Federated Meta-Learning (NUFM)

V Federated Meta-Learning over Wireless Networks

In this section, we extend NUFM to the context of multi-access wireless systems, where the bandwidth for uplink transmission and the power of IoT devices are limited. First, we present the system model followed by the problem formulation. Then, we decompose the original problem into two sub-problems and devise solutions for each of them with theoretical performance guarantees.

V-A System Model

As illustrated in Fig. 1, we consider a wireless multi-user system, where a set $\mathcal{N}$ of $n$ end devices join forces to carry out federated meta-learning aided by an edge server. Each round consists of two stages: a computation phase and a communication phase. In the computation phase, each device $i\in\mathcal{N}$ downloads the current global model and computes its local model based on its local dataset; in the communication phase, the selected devices transmit their local models to the edge server via a limited number of wireless channels. After that, the edge server runs the global aggregation and starts the next round. Here we do not consider the downlink communication due to the asymmetric uplink-downlink settings in wireless networks. That is, the transmission power at the server (e.g., a base station) and the downlink communication bandwidth are generally sufficient for global meta-model transmission, so the downlink time is usually negligible compared to the uplink data transmission time [15]. Since we focus on the device selection and resource allocation problem in each round, we omit the subscript $k$ for brevity throughout this section.

Figure 1: The architecture of federated meta-learning over a wireless network with multiple user devices and an edge server. Due to limited communication resources, only part of user devices can upload their local models in each training round.

V-A1 Computation Model

We denote by $c_i$ the number of CPU cycles for device $i$ to update the model with one sample, which can be measured offline as prior knowledge [15]. Assume that the batch size of device $i$ used in the local update phase (2) is $D_i$. Then, the number of CPU cycles required for device $i$ to run a one-step local update is $c_i D_i$. We denote the CPU-cycle frequency of device $i$ by $\nu_i$. Thus, the CPU energy consumption of device $i$ during the local update phase can be expressed as

$E^{\mathit{cp}}_i(\nu_i)\coloneqq\frac{\iota_i}{2}\tau_i c_i D_i\nu^2_i$  (23)

where $\iota_i/2$ is the effective capacitance coefficient of the computing chipset of device $i$ [41]. The computation time of device $i$ in a round is

$T^{\mathit{cp}}_i(\nu_i)\coloneqq\frac{\tau_i c_i D_i}{\nu_i}.$  (24)

For simplicity, we set $\tau=1$ in the following.
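For reference, a small sketch of the computation model (23)-(24) is given below; the argument names are placeholders for the per-device constants introduced above.

```python
def computation_cost(nu, c, D, iota, tau=1):
    """Computation energy (23) and time (24) of one device running tau local steps
    at CPU-cycle frequency nu, with per-sample cycles c, batch size D, and capacitance iota."""
    energy = 0.5 * iota * tau * c * D * nu ** 2   # E_i^cp(nu_i) in (23)
    time = tau * c * D / nu                       # T_i^cp(nu_i) in (24)
    return energy, time
```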

V-A2 Communication Model

We consider a multi-access protocol for the devices, namely the orthogonal frequency division multiple access (OFDMA) technique, whereby each device can occupy one uplink resource block (RB) in a communication round to upload its local model. There are $M$ RBs in the system, denoted by $\mathcal{M}=\{1,2,\dots,M\}$. The achievable transmission rate of device $i$ is [25]

$r_i(\bm{z}_i,p_i)\coloneqq\sum_{m\in\mathcal{M}}z_{i,m}B\log_2\Big(1+\frac{h_i p_i}{I_m+BN_0}\Big)$  (25)

where $B$ is the bandwidth of each RB, $h_i$ the channel gain, $N_0$ the noise power spectral density, $p_i$ the transmission power of device $i$, and $I_m$ the interference caused by devices that are located in other service areas and use the same RB. In (25), $z_{i,m}\in\{0,1\}$ is a binary variable associated with the allocation of the $m$-th RB to device $i$: $z_{i,m}=1$ indicates that RB $m$ is allocated to device $i$, and $z_{i,m}=0$ otherwise. Each device can occupy at most one RB while each RB can be accessed by at most one device, so we have

$\sum_{m\in\mathcal{M}}z_{i,m}\leq 1,\quad\forall i\in\mathcal{N}$  (26)
$\sum_{i\in\mathcal{N}}z_{i,m}\leq 1,\quad\forall m\in\mathcal{M}.$  (27)

Due to the fixed dimension of the model parameters, we assume that the model size is constant throughout the learning process, denoted by $S$. If device $i$ is selected, the time for transmitting the model is given by

$T^{\mathit{co}}_i(\bm{z}_i,p_i)\coloneqq\frac{S}{r_i(\bm{z}_i,p_i)}$  (28)

where $\bm{z}_i\coloneqq\{z_{i,m}\mid m\in\mathcal{M}\}$. Besides, the energy consumption of the transmission is

$E^{\mathit{co}}_i(\bm{z}_i,p_i)\coloneqq\sum_{m\in\mathcal{M}}z_{i,m}T^{\mathit{co}}_i(\bm{z}_i,p_i)p_i.$  (29)

If no RB is allocated to device $i$ in the current round, its transmission power and energy consumption are zero.
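The communication model (25), (28), and (29) can be summarized by the following sketch; `z_i` is a hypothetical 0/1 list over RBs for a single device.

```python
import math

def communication_cost(z_i, p_i, h_i, I, B, N0, S):
    """Uplink rate (25), transmission time (28), and transmission energy (29) of one device.

    `z_i` is the 0/1 RB-indicator list of the device; both outputs are zero if no RB is allocated.
    """
    if sum(z_i) == 0:
        return 0.0, 0.0
    rate = sum(z * B * math.log2(1 + h_i * p_i / (I[m] + B * N0))
               for m, z in enumerate(z_i))        # r_i(z_i, p_i) in (25)
    t_co = S / rate                               # T_i^co in (28)
    return t_co, t_co * p_i                       # (T_i^co, E_i^co) with E_i^co from (29)
```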

V-B Problem Formulation

For ease of exposition, we define $\bm{\nu}\coloneqq\{\nu_i\mid i\in\mathcal{N}\}$, $\bm{p}\coloneqq\{p_i\mid i\in\mathcal{N}\}$, and $\bm{z}\coloneqq\{z_{i,m}\mid i\in\mathcal{N},m\in\mathcal{M}\}$. Recalling the procedure of NUFM, the total energy consumption $E(\bm{z},\bm{p},\bm{\nu})$ and wall-clock time $T(\bm{z},\bm{p},\bm{\nu})$ in a round can be expressed as

$E(\bm{z},\bm{p},\bm{\nu})\coloneqq\sum_{i\in\mathcal{N}}\Big(E^{\mathit{cp}}_i(\nu_i)+E^{\mathit{co}}_i(\bm{z}_i,p_i)\Big)$  (30)
$T(\bm{z},\bm{p},\bm{\nu})\coloneqq\max_{i\in\mathcal{N}}T^{\mathit{cp}}_i(\nu_i)+\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}T^{\mathit{co}}_i(\bm{z}_i,p_i)$  (31)

where we neglect the communication time for transmitting the scalar $u_i$. The total contribution to the convergence is

$U(\bm{z})=\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}u_i$  (32)

where $u_i$ is given in (21) (one can regularize $u_i$ by adding a large enough constant in (21) to keep it positive).

We consider the following non-convex mixed-integer non-linear programming (MINLP) problem

(P) $\max_{\bm{z},\bm{p},\bm{\nu}}~U(\bm{z})-\eta_1 E(\bm{z},\bm{p},\bm{\nu})-\eta_2 T(\bm{z},\bm{p},\bm{\nu})$
$\mathrm{s.t.}~0\leq p_i\leq p^{\mathit{max}}_i,~\forall i\in\mathcal{N}$  (33)
$\quad~~0\leq\nu_i\leq\nu^{\mathit{max}}_i,~\forall i\in\mathcal{N}$  (34)
$\quad~~z_{i,m}\in\{0,1\},~\forall i\in\mathcal{N},m\in\mathcal{M}$  (35)
$\quad~~\sum_{m\in\mathcal{M}}z_{i,m}\leq 1,~\forall i\in\mathcal{N}$  (26)
$\quad~~\sum_{i\in\mathcal{N}}z_{i,m}\leq 1,~\forall m\in\mathcal{M}$  (27)

where $\eta_1\geq 0$ and $\eta_2\geq 0$ are weight parameters that capture the Pareto-optimal trade-offs among convergence, latency, and energy consumption, whose values depend on the specific scenario. Constraints (33) and (34) give the feasible regions of the devices’ transmission power levels and CPU-cycle frequencies, respectively. Constraints (26) and (27) restrict each device to access at most one uplink RB and each RB to be allocated to at most one device.

In this formulation, we aim to maximize the convergence speed of FML, while minimizing the energy consumption and wall-clock time in each round. Notably, our solution can adapt to the problem with hard constraints on energy consumption and wall-clock time as in [14] via setting “virtual devices” (see Lemma 4 and Lemma 6).
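To see how the terms fit together, the sketch below evaluates the objective of (P) for a given candidate $(\bm{z},\bm{p},\bm{\nu})$, reusing the computation_cost and communication_cost helpers sketched in the system model; the dictionary keys are hypothetical, and $\tau=1$ as in Section V-A.

```python
def objective_P(z, p, nu, devices, I, B, N0, S, eta1, eta2):
    """Evaluate U(z) - eta1*E(z,p,nu) - eta2*T(z,p,nu) in (P) for a candidate allocation.

    `devices[i]` is a dict with hypothetical keys u, c, D, iota, h;
    `z[i]` is the 0/1 RB-indicator list of device i.
    """
    U, E, t_cp_max, t_co_max = 0.0, 0.0, 0.0, 0.0
    for i, d in enumerate(devices):
        e_cp, t_cp = computation_cost(nu[i], d["c"], d["D"], d["iota"])   # (23), (24)
        t_co, e_co = communication_cost(z[i], p[i], d["h"], I, B, N0, S)  # (28), (29)
        if sum(z[i]) == 1:
            U += d["u"]                                                   # (32)
        E += e_cp + e_co                                                  # (30)
        t_cp_max, t_co_max = max(t_cp_max, t_cp), max(t_co_max, t_co)
    return U - eta1 * E - eta2 * (t_cp_max + t_co_max)                    # (31)
```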

Next, we provide a joint device selection and resource allocation algorithm to solve this problem.

V-C A Joint Device Selection and Resource Allocation Algorithm

Substituting (30), (31), and (32) into problem (P), we can easily decompose the original problem into the following two sub-problems (SP1) and (SP2).

(SP1) $\min_{\bm{\nu}}~g_1(\bm{\nu})=\eta_1\sum_{i\in\mathcal{N}}\frac{\iota_i}{2}c_i D_i\nu^2_i+\eta_2\max_{i\in\mathcal{N}}\frac{c_i D_i}{\nu_i}$
$\qquad\mathrm{s.t.}~0\leq\nu_i\leq\nu^{\mathit{max}}_i,~\forall i\in\mathcal{N}.$

(SP2) $\max_{\bm{z},\bm{p}}~g_2(\bm{z},\bm{p})=\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}u_i-\eta_1\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}\frac{S p_i}{B\log_2\big(1+\frac{h_i p_i}{I_m+BN_0}\big)}-\eta_2\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}\frac{S}{B\log_2\big(1+\frac{h_i p_i}{I_m+BN_0}\big)}$
$\qquad\mathrm{s.t.}~0\leq p_i\leq p^{\mathit{max}}_i,~\forall i\in\mathcal{N}$
$\qquad\quad~\sum_{i\in\mathcal{N}}z_{i,m}\leq 1,~\forall m\in\mathcal{M}$
$\qquad\quad~\sum_{m\in\mathcal{M}}z_{i,m}\leq 1,~\forall i\in\mathcal{N}$
$\qquad\quad~z_{i,m}\in\{0,1\},~\forall i\in\mathcal{N},m\in\mathcal{M}.$

(SP1) aims at controlling the CPU-cycle frequencies for devices to minimize the energy consumption and latency in the computational phase. (SP2) controls the transmission power and RB allocation to maximize the convergence speed while minimizing the transmission cost and communication delay. We provide the solutions to these two sub-problems separately.

V-C1 Solution to (SP1)

Denote the optimal CPU-cycle frequencies of (SP1) by $\bm{\nu}^*=\{\nu^*_i\}_{i\in\mathcal{N}}$. We first give the following lemma to offer insights into this sub-problem.

Lemma 4.

If device $j$ is the straggler among all devices under the optimal CPU frequencies $\bm{\nu}^*$, i.e., $j=\arg\max_{i\in\mathcal{N}}c_i D_i/\nu^*_i$, then the following holds for any $\eta_1,\eta_2>0$:

$\nu^*_i=\begin{cases}\min\Big\{\sqrt[3]{\frac{a_2}{2a_1}},\ \min_{i\in\mathcal{N}}\frac{c_j D_j\nu^{\mathit{max}}_j}{c_i D_i}\Big\},&\mathrm{if}~i=j\\[4pt]\frac{c_i D_i\nu^*_j}{c_j D_j},&\mathrm{otherwise}.\end{cases}$  (36)

The positive constants $a_1$ and $a_2$ in (36) are defined as

$a_1\coloneqq\eta_1\sum_{i\in\mathcal{N}}\frac{\iota_i(c_i D_i)^3}{2(c_j D_j)^2}$  (37)
$a_2\coloneqq\eta_2 c_j D_j.$  (38)
Sketch of Proof.

The derivation involves two steps: expressing $\nu^*_i$ in terms of $\nu^*_j$, and deriving $\nu^*_j$ by solving the corresponding optimization problem. The detailed proof is presented in Appendix D of the technical report [39]. ∎

Lemma 4 implies that if the straggler (the device with the longest computation time) can be determined, then the optimal CPU-cycle frequencies of all devices can be derived in closed form. Intuitively, due to the tension between minimizing the energy consumption and the computation time, once the straggler is fixed, every other device can use the smallest CPU-cycle frequency for which its computation time does not exceed that of the straggler. This leads to the following theorem.

Theorem 2.

Denote by $\bm{\nu}^*_{\mathit{straggler:}j}$ the optimal solution (i.e., (36)) under the assumption that $j$ is the straggler. Then, the global optimal solution of (SP1) can be obtained by

$\bm{\nu}^*=\arg\min_{\bm{\nu}\in\mathcal{V}}g_1(\bm{\nu})$  (39)

where $\mathcal{V}\coloneqq\{\bm{\nu}^*_{\mathit{straggler:}j}\}_{j\in\mathcal{N}}$, and $g_1$ is the objective function in (SP1).

Proof.

The result follows directly from Lemma 4; we omit the details for brevity. ∎

Theorem 2 shows that the optimal solution of (SP1) is the fixed-straggler solution in Lemma 4 that attains the minimum objective $g_1$. Thus, (SP1) can be solved with computational complexity $O(n)$ by comparing the achievable objective values corresponding to different stragglers.
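A sketch of this procedure is given below: it enumerates the candidate straggler $j$, applies the closed form (36)-(38), and keeps the candidate frequencies with the smallest $g_1$ (Theorem 2). The cap on $\nu_j$ is implemented as the largest value keeping every $\nu_i$ within its own limit $\nu_i^{\mathit{max}}$, which is our reading of the min term in (36); the dictionary keys are hypothetical.

```python
def solve_sp1(devices, eta1, eta2):
    """Solve (SP1): enumerate the straggler, apply (36)-(38), keep the best g_1.

    `devices[i]` is a dict with hypothetical keys c, D, iota, nu_max.
    """
    n = len(devices)
    cd = [d["c"] * d["D"] for d in devices]
    best_nu, best_g1 = None, float("inf")
    for j in range(n):
        a1 = eta1 * sum(devices[i]["iota"] * cd[i] ** 3 / (2 * cd[j] ** 2) for i in range(n))
        a2 = eta2 * cd[j]
        # largest nu_j such that nu_i = c_i D_i nu_j / (c_j D_j) stays within nu_i^max for all i
        cap = min(cd[j] * devices[i]["nu_max"] / cd[i] for i in range(n))
        nu_j = min((a2 / (2 * a1)) ** (1.0 / 3.0), cap)
        nu = [cd[i] * nu_j / cd[j] for i in range(n)]
        g1 = eta1 * sum(0.5 * devices[i]["iota"] * cd[i] * nu[i] ** 2 for i in range(n)) \
             + eta2 * max(cd[i] / nu[i] for i in range(n))
        if g1 < best_g1:
            best_nu, best_g1 = nu, g1
    return best_nu
```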

V-C2 Solution to (SP2)

Similar to Section V-C1, we denote the optimal solutions of (SP2) by $\bm{z}^*=\{z^*_{i,m}\mid i\in\mathcal{N},m\in\mathcal{M}\}$ and $\bm{p}^*=\{p^*_i\mid i\in\mathcal{N}\}$, respectively, the optimal set of selected devices by $\mathcal{N}^*\coloneqq\{i\in\mathcal{N}\mid\sum_{m\in\mathcal{M}}z^*_{i,m}=1\}$, and, for each $i\in\mathcal{N}^*$, the RB allocated to $i$ by $m^*_i$, i.e., $z^*_{i,m^*_i}=1$.

It is challenging to derive a closed-form solution for (SP2) because it is a non-convex MINLP problem with a non-differentiable “max” operator in the objective. Thus, in the following, we develop an iterative algorithm to solve this problem and show that it converges to a local optimum.

We begin with analyzing the properties of the optimal solution in the next lemma.

Lemma 5.

Denote the transmission delay under $\bm{z}^*$ and $\bm{p}^*$ by $\delta^*$, i.e.,

$\delta^*\coloneqq\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z^*_{i,m}\frac{S}{B\log_2\big(1+\frac{h_i p^*_i}{I_m+BN_0}\big)}.$  (40)

Then the following relation holds:

$\delta^*\geq\frac{S}{B\log_2\big(1+\frac{h_i p^{\mathit{max}}_i}{I_m+BN_0}\big)}$  (41)

and $p^*_i$ can be expressed as

$p^*_i=\begin{cases}\frac{(I_{m^*_i}+BN_0)(2^{\frac{S}{B\delta^*}}-1)}{h_i},&\mathrm{if}~i\in\mathcal{N}^*\\[4pt]0,&\mathrm{otherwise}.\end{cases}$  (42)
Sketch of Proof.

We prove (41) by contradiction and (42) by solving the corresponding transformed problem. The detailed proof is presented in Appendix E of the technical report [39]. ∎

Lemma 5 indicates that the optimal transmission power can be derived in closed form via (42), given the RB allocation and the transmission delay. Lemma 5 also implies that for any RB allocation strategy $\tilde{\bm{z}}$ and transmission delay $\tilde{\delta}$ (not necessarily optimal), equation (42) provides the “optimal” transmission power under $\tilde{\bm{z}}$ and $\tilde{\delta}$ as long as (41) is satisfied. Based on that, we have the following result.

Theorem 3.

Denote $\mu_{i,m}\coloneqq(I_m+BN_0)(2^{\frac{S}{B\delta^*}}-1)/h_i$. Given the transmission delay $\delta^*$, the optimal RB allocation strategy can be obtained by

$\bm{z}^*=\arg\max_{\bm{z}}\bigg\{\sum_{i,m}z_{i,m}(u_i-e_{i,m}),~\mathrm{s.t.}~(26),(27),(35)\bigg\}$  (43)

where

$e_{i,m}\coloneqq\begin{cases}\eta_1\delta^*\mu_{i,m},&\mathrm{if}~\mu_{i,m}\leq p^{\mathit{max}}_i\\u_i+1,&\mathrm{otherwise}.\end{cases}$  (44)
Proof.

From Lemma 5, Eq. (43) holds if $\mu_{i,m}\leq p^{\mathit{max}}_i$. On the other hand, when $\mu_{i,m}>p^{\mathit{max}}_i$, if device $i$ were selected, the transmission delay would be larger than $\delta^*$ (see the proof of Lemma 5), which contradicts the given condition. Thus, when $\mu_{i,m}>p^{\mathit{max}}_i$, we set $e_{i,m}=u_i+1$, ensuring that device $i$ is not selected. ∎

Theorem 3 shows that the optimal RB allocation strategy can be obtained by solving (43), given the transmission delay $\delta^*$. Naturally, problem (43) can be equivalently transformed into a bipartite matching problem. Consider a bipartite graph $\mathcal{G}$ with source set $\mathcal{N}$ and destination set $\mathcal{M}$. For each $i\in\mathcal{N}$ and $m\in\mathcal{M}$, denote the weight of the edge from node $i$ to node $m$ by $w_{i\rightarrow m}$: if $u_i-e_{i,m}>0$, $w_{i\rightarrow m}=e_{i,m}-u_i$; otherwise, $w_{i\rightarrow m}=\infty$. Therefore, maximizing (43) is equivalent to finding a matching in $\mathcal{G}$ with the minimum sum of weights. This means that we can obtain the optimal RB allocation strategy under a fixed transmission delay via the Kuhn-Munkres algorithm with worst-case complexity $O(Mn^2)$ [42].
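As an illustration, the following sketch solves (43) for a fixed delay using SciPy's Hungarian-algorithm implementation as a stand-in for Kuhn-Munkres; pairs with non-positive benefit $u_i-e_{i,m}$ are treated as zero-weight edges and discarded afterwards, so the corresponding devices remain unselected.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_rbs(u, e):
    """RB allocation (43) for a fixed transmission delay.

    `u` has shape (n,) and `e` has shape (n, M), with e_{i,m} following (44).
    Returns the 0/1 allocation matrix z of shape (n, M).
    """
    gain = u[:, None] - e                       # benefit of assigning device i to RB m
    cost = np.where(gain > 0, -gain, 0.0)       # minimize cost = maximize positive benefit
    rows, cols = linear_sum_assignment(cost)
    z = np.zeros_like(e, dtype=int)
    for i, m in zip(rows, cols):
        if gain[i, m] > 0:                      # keep only assignments with positive benefit
            z[i, m] = 1
    return z
```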

We proceed to show how to iteratively approximate the optimal $\delta^*$, $\bm{p}^*$, and $\bm{z}^*$.

Lemma 6.

Let $j$ denote the communication straggler among all selected devices with respect to the RB allocation $\bm{z}^*$ and transmission power $\bm{p}^*$, i.e., for any $i\in\mathcal{N}^*$,

$\sum_{m\in\mathcal{M}}z^*_{i,m}\underbrace{\frac{S}{B\log_2\big(1+\frac{h_i p^*_i}{I_m+BN_0}\big)}}_{T^{\mathit{co}}_i(\bm{z}^*_i,p^*_i)}\leq\sum_{m\in\mathcal{M}}z^*_{j,m}\underbrace{\frac{S}{B\log_2\big(1+\frac{h_j p^*_j}{I_m+BN_0}\big)}}_{T^{\mathit{co}}_j(\bm{z}^*_j,p^*_j)}.$  (45)

Then, the following holds true:

  1) Define the function $f_4(p)\coloneqq b_1\big((1+p)\log_2(1+p)\ln 2-p\big)-\eta_2$. Then, $f_4(p)$ is monotonically increasing with respect to $p\geq 0$ and has a unique zero point $\tilde{p}^0_j\in(0,b_2]$, where $b_1$ and $b_2$ are defined as

     $b_1\coloneqq\eta_1\sum_{i\in\mathcal{N}^*}\frac{I_{m^*_i}+BN_0}{h_i}$  (46)
     $b_2\coloneqq 2^{\big(1+\sqrt{\max\{\frac{\eta_2}{b_1},1\}-1}\big)/\ln 2};$  (47)

  2) Denote $(\mathit{SNR})_j\coloneqq(I_{m^*_j}+BN_0)/h_j$. For $i\in\mathcal{N}^*$, we have

     $p^*_i=\begin{cases}\min\Big\{\tilde{p}^0_j,\ \min_{i\in\mathcal{N}^*}\frac{h_i p^{\mathit{max}}_i}{I_{m^*_i}+BN_0}\Big\},&\mathrm{if}~i=j\\[4pt]\frac{(\mathit{SNR})_i}{(\mathit{SNR})_j}p^*_j,&\mathrm{otherwise}.\end{cases}$  (48)
Sketch of Proof.

We obtain the first result by analyzing the properties of $f_4$ and derive (48) by solving the corresponding optimization problem. The detailed proof is presented in Appendix F of the technical report [39]. ∎

Lemma 6 indicates that, given the optimal RB allocation strategy $\bm{z}^*$ and the straggler, the optimal transmission power can be derived by (48); unlike Lemma 5, this does not require the corresponding transmission delay $\delta^*$. Notably, in (48), the zero point $\tilde{p}^0_j$ of $f_4$ can be obtained to any required tolerance $\epsilon$ by the bisection method in at most $\log_2(b_2/\epsilon)$ iterations.
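A minimal bisection sketch for locating $\tilde{p}^0_j$ is shown below, assuming $b_1$ and $b_2$ have been computed from (46)-(47).

```python
import math

def zero_of_f4(b1, b2, eta2, eps=1e-6):
    """Bisection for the unique zero of f_4(p) = b1*((1+p)*log2(1+p)*ln2 - p) - eta2
    on (0, b2]; f_4 is increasing and f_4(0) = -eta2 < 0, so the root is bracketed."""
    def f4(p):
        return b1 * ((1 + p) * math.log2(1 + p) * math.log(2) - p) - eta2

    lo, hi = 0.0, b2
    while hi - lo > eps:                 # at most log2(b2/eps) halvings
        mid = 0.5 * (lo + hi)
        if f4(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```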

Similar to Theorem 2, we can find the optimal transmission power by the following theorem, given the RB allocation.

Theorem 4.

Denote by $\bm{p}^*_{\mathit{straggler:}j}$ the optimal solution under the assumption that $j$ is the communication straggler, given a fixed RB allocation $\bm{z}^*$. The corresponding optimal transmission power is given by

$\bm{p}^*=\arg\max_{\bm{p}\in\mathcal{P}}g_2(\bm{z}^*,\bm{p})$  (49)

where $\mathcal{P}\coloneqq\{\bm{p}^*_{\mathit{straggler:}j}\}_{j\in\mathcal{N}}$ and $g_2$ is the objective function defined in (SP2).

Proof.

The result follows directly from Lemma 6; we omit the details for brevity. ∎

Define the communication time corresponding to $\bm{z}$ and $\bm{p}$ as $T^{\mathit{co}}(\bm{z},\bm{p})\coloneqq\max_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}T^{\mathit{co}}_i(\bm{z}_i,p_i)$. Based on Theorems 3 and 4, we have the following Iterative Solution (IVES) algorithm to solve (SP2).

IVES: We initialize the transmission delay $\delta^0$ (based on (42)) as

$\delta^0=\max_{i\in\mathcal{N}}\frac{S}{B\log_2\big(1+\frac{h_i p^{\mathit{max}}_i}{I_m+BN_0}\big)}.$  (50)

In each iteration $t$, we first compute an RB allocation strategy $\bm{z}^t$ by solving (43) with the Kuhn-Munkres algorithm. Then, based on $\bm{z}^t$, we find the corresponding transmission power $\bm{p}^t$ by (49) and update the transmission delay by $\delta^{t+1}=T^{\mathit{co}}(\bm{z}^t,\bm{p}^t)$ before the next iteration. The details of IVES are given in Algorithm 2.

Input: η1\eta_{1}, η2\eta_{2}, SS, BB, {hi}\{h_{i}\}, {Im}\{I_{m}\}
1 Initialize t=0t=0 and δ0\delta^{0} by (50);
2 while not done do
3       Compute RB allocation strategy 𝒛t\bm{z}^{t} under δt\delta^{t} using the Kuhn-Munkres algorithm based on (43);
4       Compute transmission power 𝒑t\bm{p}^{t} by (49);
5       Update δt+1=T𝑐𝑜(𝒛t,𝒑t)\delta^{t+1}=T^{\mathit{co}}(\bm{z}^{t},\bm{p}^{t});
6      
7 end while
return 𝐳\bm{z}^{*}, 𝐩\bm{p}^{*}
Algorithm 2 Iterative Solution (IVES)

Using IVES, we can solve (SP2) in an iterative manner. In the following theorem, we provide the convergence guarantee for IVES.

Theorem 5.

If we solve (SP2) by IVES, then {g2(𝐳t,𝐩t)}\{g_{2}(\bm{z}^{t},\bm{p}^{t})\} monotonically increases and converges to a unique point.

Sketch of Proof.

The result is derived via proving g(𝒛t,𝒑t)g(𝒛t+1,𝒑^t+1)g(\bm{z}^{t},\bm{p}^{t})\leq g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1}) and g(𝒛t+1,𝒑^t+1)g(𝒛t+1,𝒑t+1)g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1})\leq g(\bm{z}^{t+1},\bm{p}^{t+1}). The detailed proof is presented in Appendix H of the technical report [39]. ∎

Although IVES solves (SP2) iteratively, we observe in the simulations that it converges extremely fast (often within only two iterations), resulting in low computational complexity.
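To illustrate the structure of the iteration, the following is a minimal Python sketch of the IVES loop; rb_cost_matrix, optimal_power, and comm_time are hypothetical callables standing in for (43), (49), and T^co, delta0 is the initial delay from (50), and none of these names come from the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres (Hungarian) method

def ives(delta0, rb_cost_matrix, optimal_power, comm_time, max_iter=10, tol=1e-6):
    """Minimal sketch of IVES (Algorithm 2) under user-supplied oracles."""
    delta, z, p = delta0, None, None
    for _ in range(max_iter):
        cost = rb_cost_matrix(delta)              # matching weights of (43) under the current delay
        rows, cols = linear_sum_assignment(cost)  # RB allocation z^t
        z = np.zeros_like(cost)
        z[rows, cols] = 1.0
        p = optimal_power(z)                      # transmission power p^t via (49)
        new_delta = comm_time(z, p)               # delta^{t+1} = T^co(z^t, p^t)
        if abs(new_delta - delta) <= tol:         # in practice ~2 iterations suffice
            break
        delta = new_delta
    return z, p
```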

Combining the solutions of (SP1) and (SP2), we provide the User Selection and Resource Allocation (URAL) algorithm in Algorithm 3 to solve the original problem (P). URAL can simultaneously optimize the convergence speed, training time, and energy consumption via jointly selecting devices and allocating resources. Further, URAL can be directly integrated into the NUFM paradigm in the device selection phase to facilitate the deployment of FML in wireless networks.

Input: η1\eta_{1}, η2\eta_{2}, SS, BB, {hi}\{h_{i}\}, {Im}\{I_{m}\}, {ιi}\{\iota_{i}\}, {ci}\{c_{i}\}, {Di}\{D_{i}\}
Compute 𝝂\bm{\nu}^{*} by (39);
  // Solve (SP1)
Compute 𝒛\bm{z}^{*} and 𝒑\bm{p}^{*} by IVES;
  // Solve (SP2)
return 𝛎\bm{\nu}^{*}, 𝐳\bm{z}^{*}, 𝐩\bm{p}^{*}  
Algorithm 3 User Selection and Resource Allocation (URAL) Algorithm

VI Extension to First-Order Approximations

The computation of the Hessian in the local update (2) may incur a high computational cost for resource-limited IoT devices. In this section, we address this challenge.

There are two common methods used in the literature to reduce the complexity of computing the Hessian [9]:

  1. 1.

    replacing the stochastic gradient by

    ~Fi(θ)~fi(θα~fi(θ,𝒟i),𝒟i);\displaystyle\tilde{\nabla}F_{i}({\theta})\approx\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}; (51)
  2. 2.

    replacing the Hessian-gradient product by

    ~2fi(θ,𝒟i′′)~fi(θα~fi(θ,𝒟i),𝒟i)\displaystyle\tilde{\nabla}^{2}f_{i}({\theta},\mathcal{D}^{\prime\prime}_{i})\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}
    ~fi(θ+ϵg~i,𝒟i′′)~fi(θϵg~i,𝒟i′′)2ϵ\displaystyle\approx\frac{\tilde{\nabla}f_{i}\big{(}{\theta}+\epsilon\tilde{g}_{i},\mathcal{D}^{\prime\prime}_{i}\big{)}-\tilde{\nabla}f_{i}\big{(}{\theta}-\epsilon\tilde{g}_{i},\mathcal{D}^{\prime\prime}_{i}\big{)}}{2\epsilon} (52)

    where g~i=~fi(θα~fi(θ,𝒟i),𝒟i)\tilde{g}_{i}=\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}.

By doing so, the computational complexity of a one-step local update can be reduced from O(d2)O(d^{2}) to O(d)O(d) without sacrificing much learning performance. Next, we show that our results in Theorem 1 still hold in the above two cases.
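As an illustration of the two variants, the following is a minimal Python sketch under generic gradient oracles; the names grad_support, grad_query, and grad_hvp_batch are ours and simply denote stochastic gradients evaluated on D_i, D'_i, and D''_i, respectively.

```python
def meta_gradient_first_order(theta, grad_support, grad_query, grad_hvp_batch,
                              alpha=1e-3, eps=1e-3, use_hessian_estimate=True):
    """Sketch of the two approximations (51) and (52); theta and the oracle
    outputs are assumed to be NumPy-style arrays of the model parameters."""
    inner = theta - alpha * grad_support(theta)      # one inner gradient step on D_i
    g = grad_query(inner)                            # \tilde{g}_i in (52)
    if not use_hessian_estimate:
        return g                                     # variant (51): drop the Hessian term
    # variant (52): central finite difference for the Hessian-gradient product on D''_i
    hvp = (grad_hvp_batch(theta + eps * g) - grad_hvp_batch(theta - eps * g)) / (2.0 * eps)
    return g - alpha * hvp                           # approximates (I - alpha*Hessian) * g
```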

Corollary 2.

Suppose that Assumptions 1-4 are satisfied, and 𝒟i\mathcal{D}_{i}, 𝒟i\mathcal{D}^{\prime}_{i}, and 𝒟i′′\mathcal{D}^{\prime\prime}_{i} are independent batches. If the local update and global aggregation follow (2) and (6) respectively, we have for τ=1\tau=1

𝔼[F(θk)F(θk+1)]\displaystyle\mathbb{E}[F({\theta}^{k})-F({\theta}^{k+1})]\geq\; β𝔼[1nki𝒩k((1LFβ2)~Fi(θk)2\displaystyle\beta\mathbb{E}\Bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Bigg{(}\Big{(}1-\frac{L_{F}\beta}{2}\Big{)}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}
((1+αL)2γG+αζγH+σ~Fi)\displaystyle-\Big{(}\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\tilde{\sigma}_{F_{i}}\Big{)}
×𝔼[~Fi(θk)2|𝒩k])]\displaystyle\times\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\Bigg{)}\Bigg{]} (53)

where σ~Fi\tilde{\sigma}_{F_{i}} is defined as follows

σ~Fi2={4σG2(1Di+(αL)2Di)+2(αLζ)2,ifusing(51)6σG2(α2ϵ2Di′′+(1+2(αL)2)(1Di+(αL)2Di))+2(αρϵ)2ζ4,ifusing(2).\displaystyle\tilde{\sigma}^{2}_{F_{i}}=\begin{cases}4\sigma_{G}^{2}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)+2(\alpha L\zeta)^{2}&,~{}\mathrm{if~{}using}~{}\eqref{eq:first_approx}\\ \begin{aligned} &6\sigma_{G}^{2}\left(\frac{\alpha^{2}}{\epsilon^{2}D^{\prime\prime}_{i}}+\left(1+2(\alpha L)^{2}\right)\right.\\ &\left.\cdot\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)\right)+2(\alpha\rho\epsilon)^{2}\zeta^{4}\end{aligned}&,~{}\mathrm{if~{}using}~{}\eqref{eq:hessian_estimate}.\end{cases} (54)
Proof.

The detailed proof is presented in Appendix I of the technical report [39]. ∎

Corollary 2 indicates that NUFM can be directly combined with the first-order approximation techniques to reduce computational cost. Further, similar to Corollary 1, Corollary 2 can be extended to the multi-step cases.

VII Simulation

This section evaluates the performance of our proposed algorithms by comparing them with existing baselines on real-world datasets. We first present the experimental setup, including the datasets, models, parameters, baselines, and environment. Then we present the results from various aspects.

VII-A Experimental Setup

Figure 2: Comparison of convergence rates under different numbers of participating devices. NUFM significantly accelerates the convergence of the existing FML approach, especially with fewer participating devices. In addition, with more participating devices, the advantages of NUFM weaken, leading to smaller gaps between NUFM and the existing algorithms.

VII-A1 Datasets and Models

We evaluate our algorithms on four widely-used benchmarks, namely Fashion-MNIST [43], CIFAR-10 [44], CIFAR-100 [44], and ImageNet [45]. Specifically, the data is distributed among n=100n=100 devices as follows: a) each device has samples from two random classes; b) the number of samples per class follows a truncated Gaussian distribution 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) with μ=5\mu=5 and σ=5\sigma=5. We select 50% of the devices at random for training and use the rest for testing. For each device, we divide the local dataset into a support set and a query set. We consider 1-shot 2-class classification tasks, i.e., the support set contains only 1 labeled example for each class. We set the stepsizes as α=β=0.001\alpha=\beta=0.001. We use a convolutional neural network (CNN) with max-pooling operations and the Leaky Rectified Linear Unit (Leaky ReLU) activation function, containing three convolutional layers with sizes 32, 64, and 128, respectively, followed by a fully connected layer and a softmax layer. The strides are set to 1 for the convolution operations and 2 for the pooling operations.
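For reference, the following NumPy sketch illustrates this partition; details such as truncating each class count at one sample and drawing indices without replacement are our assumptions rather than specifications from the paper.

```python
import numpy as np

def partition_non_iid(labels, n_devices=100, classes_per_device=2, mu=5, sigma=5, seed=0):
    """Illustrative sketch: each device gets samples from two random classes,
    with a (truncated) Gaussian number of samples per class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    pool = {c: list(rng.permutation(np.where(labels == c)[0])) for c in classes}
    devices = []
    for _ in range(n_devices):
        picked = rng.choice(classes, size=classes_per_device, replace=False)
        idx = []
        for c in picked:
            k = max(1, int(round(rng.normal(mu, sigma))))   # truncated Gaussian count
            idx.extend(pool[c][:k])
            pool[c] = pool[c][k:]
        devices.append(np.array(idx, dtype=int))
    return devices
```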

VII-A2 Baselines

To compare the performance of NUFM, we first consider two existing algorithms, i.e., FedAvg [4] and Per-FedAvg [9]. Further, to validate the effectiveness of URAL in multi-access wireless networks, we use two baselines, called Greedy and Random. In each round, the Greedy strategy determines the CPU-cycle frequency νig\nu^{\mathit{g}}_{i} and transmission power pigp^{\mathit{g}}_{i} for device i𝒩i\in\mathcal{N} by greedily minimizing its individual objective, i.e.,

νig\displaystyle\nu^{\mathit{g}}_{i} =argminνi{η1ιiciDiνi22+η2ciDiνi,s.t.(34)}\displaystyle=\mathop{\arg\min}_{\nu_{i}}\left\{\eta_{1}\frac{\iota_{i}c_{i}D_{i}\nu^{2}_{i}}{2}+\eta_{2}\frac{c_{i}D_{i}}{\nu_{i}},~{}\mathrm{s.t.}~{}\eqref{eq:constraints_nui}\right\} (55)
pig\displaystyle p^{\mathit{g}}_{i} =argminpi{mzi,mg(η1pi+η2)SBlog2(1+hipiIm+BN0),s.t.(33)}\displaystyle=\mathop{\arg\min}_{p_{i}}\left\{\sum_{m\in\mathcal{M}}\frac{z^{\mathit{g}}_{i,m}(\eta_{1}p_{i}+\eta_{2})S}{B\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m}+BN_{0}}\right)},~{}\mathrm{s.t.}~{}\eqref{eq:constraints_pi}\right\} (56)

where {zi,mg}m\{z^{\mathit{g}}_{i,m}\}_{m\in\mathcal{M}} is selected at random (i.e., randomly allocating RBs to selected devices). The Random strategy decides the CPU-cycle frequencies, transmission powers, and RB allocation for the selected devices uniformly at random from the feasible regions.
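For the CPU-cycle frequency in (55), the minimizer has a simple closed form because the objective is convex in ν_i and its stationary point is (η_2/(η_1 ι_i))^{1/3} (the factor c_i D_i cancels). Below is a minimal sketch, assuming that (34) is the box constraint 0 ≤ ν_i ≤ ν_i^max.

```python
def greedy_cpu_frequency(eta1, eta2, iota_i, nu_max_i):
    """Sketch of the per-device Greedy choice in (55): minimize
    eta1*iota_i*c_i*D_i*nu**2/2 + eta2*c_i*D_i/nu over the assumed box constraint."""
    nu_star = (eta2 / (eta1 * iota_i)) ** (1.0 / 3.0)   # unconstrained stationary point
    return min(nu_star, nu_max_i)                       # clip to the frequency cap
```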

VII-A3 Implementation

We implement the code in TensorFlow Version 1.14 on a server with two Intel® Xeon® Gold 5120 CPUs and one Nvidia® Tesla V100 32 GB GPU. The parameters used in the simulation can be found in Table III.

VII-B Experimental Results

VII-B1 Convergence Speed

To demonstrate the improvement of NUFM on the convergence speed, we compare the algorithms on different benchmarks with the same initial model and learning rate. We vary the number of participating devices nkn_{k} from 20 to 40, and set the numbers of local update steps and total communication rounds as τ=1\tau=1 and K=50K=50, respectively. We let λ1=λ2=1\lambda_{1}=\lambda_{2}=1. As illustrated in Fig. 2 and Table II, NUFM significantly improves the convergence speed and corresponding test accuracy of the existing FML approach on all datasets (to make the graphs more legible, we draw symbols every two points in Fig. 2). Clearly, this validates the effectiveness of our proposed device selection scheme that maximizes the lower bound of the one-round global loss reduction. Interestingly, Fig. 2 also indicates that NUFM converges more quickly with relatively fewer participating devices. For example, in round 19, the loss achieved by NUFM with nk=20n_{k}=20 decreases by more than 9% and 20% over those with nk=30n_{k}=30 and nk=40n_{k}=40 on Fashion-MNIST, respectively. The underlying rationale is that relatively fewer “good” devices can provide a larger lower bound on the one-round global loss decrease (note that in (1) the lower bound takes the average over the selected devices), whereas more selected devices in each round generally require more communication resources. Thus, the results reveal the potential of NUFM for resource-limited wireless systems.

TABLE II: Test accuracy after fifty rounds of training.
Algorithm Fashion-MNIST CIFAR-10 CIFAR-100 ImageNet
NUFM 68.04% 58.80% 23.95% 34.04%
Per-FedAvg 62.75% 58.22% 21.49% 30.98%
FedAvg 61.04% 54.31% 10.13% 12.14%

VII-B2 Effect of Local Update Steps

To show the effect of local update steps on the convergence rate of NUFM, we present results with varying numbers of local update steps τ=1,2,,10\tau=1,2,\dots,10 in each round. For clarity of illustration, we compare the loss under different numbers of local steps in round 19 on Fashion-MNIST and CIFAR-100. Fig. 3 shows that fewer local update steps lead to a larger gap between the baselines and NUFM, which verifies the theoretical result that a small number of local steps can slow the convergence of FedAvg and Per-FedAvg [9, Theorem 4.5]. It also implies that NUFM can improve the computational efficiency of local devices.

Figure 3: Effect of local update steps on convergence rates. Fewer local steps lead to larger gaps between NUFM and the existing methods.

VII-B3 Performance of URAL in Wireless Networks

We evaluate the performance of URAL by comparing it with four baselines, namely NUFM-Greedy, NUFM-Random, RU-Greedy, and RU-Random, as detailed below.

  1. a)

    NUFM-Greedy: select devices by NUFM, decide CPU-cycle frequencies, RB allocation, and transmission power by Greedy strategy;

  2. b)

    NUFM-Random: select devices by NUFM, decide CPU-cycle frequencies, RB allocation, and transmission power by Random strategy;

  3. c)

    RU-Greedy: select devices uniformly at random, decide CPU-cycle frequencies, RB allocation, and transmission power by Greedy strategy;

  4. d)

    RU-Random: select devices uniformly at random, decide CPU-cycle frequencies, RB allocation, and transmission power by Random strategy.

TABLE III: Parameters in Simulation
Parameter Value
Step size (α\alpha) and meta–learning rate (β\beta) 0.001
# edge devices (nn) 100
# participating devices (nkn_{k}) in Experiments 1 and 2 An integer varying among {20,30,40}\{20,30,40\}
# local updates (τ\tau) An integer varying among {1,2,,10}\{1,2,\dots,10\}
Hyper-parameters (λ1\lambda_{1} and λ2\lambda_{2}) 1
Weight parameters (η1\eta_{1} and η2\eta_{2}) Real numbers varying among {0.5,1,1.5,2.0,2.5}\{0.5,1,1.5,2.0,2.5\}
# RBs An integer varying among {1,5,10,20,30,40,50}\{1,5,10,20,30,40,50\}
CPU cycles of devices (cic_{i}) Real numbers following U(0,0.25)U(0,0.25)
Maximum CPU-cycle frequencies of devices (νi𝑚𝑎𝑥\nu^{\mathit{max}}_{i}) Real numbers following U(0,2)U(0,2)
Maximum transmission powers of devices (pi𝑚𝑎𝑥p^{\mathit{max}}_{i}) Real numbers following U(0,1)U(0,1)
Interference in RBs (ImI_{m}) Real numbers following U(0,0.8)U(0,0.8)
Model size (SS) and bandwidth (BB) 1
Noise power spectral density (N0N_{0}) 1
Effective capacitance coefficients of devices (ιi\iota_{i}) Real numbers following U(0,1)U(0,1)
Channel gains of devices (hih_{i}) Real numbers following U(0.1,h𝑚𝑎𝑥)U(0.1,h^{\mathit{max}}), where h𝑚𝑎𝑥h^{\mathit{max}} varies among {0.25,0.5,0.75,,2.0}\{0.25,0.5,0.75,\dots,2.0\}

We simulate a wireless system consisting of M=20M=20 RBs and let the channel gain hih_{i} of device ii follow a uniform distribution U(h𝑚𝑖𝑛,h𝑚𝑎𝑥)U(h^{\mathit{min}},h^{\mathit{max}}) with h𝑚𝑖𝑛=0.1h^{\mathit{min}}=0.1 and h𝑚𝑎𝑥=1h^{\mathit{max}}=1. We set S=1S=1, B=1B=1, and N0=1N_{0}=1. The interference of RB mm is drawn from ImU(0,0.8)I_{m}\sim U(0,0.8). We set ιiU(0,1)\iota_{i}\sim U(0,1), ciU(0,0.25)c_{i}\sim U(0,0.25), pi𝑚𝑎𝑥U(0,1)p^{\mathit{max}}_{i}\sim U(0,1), and νi𝑚𝑎𝑥U(0,2)\nu^{\mathit{max}}_{i}\sim U(0,2) for each i𝒩i\in\mathcal{N} to simulate the device heterogeneity. In the following experiments, we run the algorithms on Fashion-MNIST with local update steps τ=1\tau=1 and η1=η2=1\eta_{1}=\eta_{2}=1.

As shown in Fig. 4, URAL can significantly reduce energy consumption and wall-clock time, as compared with the baselines. Perhaps counter-intuitively, the Greedy strategy is not always better than Random, for two reasons. On one hand, the energy cost and wall-clock time depend on the selection of the weight parameters η1\eta_{1} and η2\eta_{2}. The results in Fig. 4 imply that, when η1=η2=1\eta_{1}=\eta_{2}=1, Greedy pays more attention to the wall-clock time than to the energy consumption. Accordingly, Greedy achieves a much lower average delay than Random, but at the cost of extra energy. On the other hand, the wall-clock time and energy cost require joint control with the RB allocation. Although Greedy minimizes the individual objectives (55)-(56), improper RB allocation can cause arbitrary performance degradation. Different from the Greedy- and Random-based baselines, URAL maximizes the joint objective (P) by co-optimizing the CPU-cycle frequencies, transmission power, and RB allocation strategy, so it can alleviate the above issues and achieve better delay and energy control. At the same time, Fig. 4 indicates that URAL converges as fast as NUFM-Greedy and NUFM-Random (the corresponding lines almost overlap), which select devices greedily to accelerate convergence (as in NUFM). Thus, URAL achieves an excellent convergence rate.

Figure 4: Comparison of convergence, energy cost, and wall-clock training time. URAL achieves fast convergence and a short wall-clock time with low energy consumption.

VII-B4 Effect of Resource Blocks

In Fig. 5, we test the performance of URAL under different numbers of RBs, varying the number of RBs MM from 1 to 50. More RBs enable more devices to be selected in each round, leading to larger energy consumption. As shown in Fig. 5, URAL keeps the wall-clock time stable as the number of RBs increases. Meanwhile, URAL can control the power of the devices to avoid a serious waste of energy. It is counter-intuitive that the convergence speed does not always improve as the number of RBs increases, especially for URAL. The reason is the same as that in Section VII-B1: too few selected devices can slow the convergence due to insufficient information provided in each round, while a large number of participating devices may weaken the global loss reduction, as shown in (1). Therefore, URAL can adapt to practical systems with constrained wireless resources by achieving fast convergence with only a small set of devices.

Figure 5: Comparison of convergence, energy cost, and wall-clock time under different numbers of RBs. URAL controls the energy cost and wall-clock time well as more RBs become available. Meanwhile, it can achieve fast convergence with only a small number of RBs.

VII-B5 Effect of Channel Quality

To investigate the effect of channel conditions on performance, we set the number of RBs M=20M=20 and vary the maximum channel gain h𝑚𝑎𝑥h^{\mathit{max}} from 0.25 to 2, and show the corresponding energy consumption and wall-clock training time in Fig. 6. The results indicate that the energy consumption and latency decrease as channel quality improves, because devices can use less power to achieve a relatively large transmission rate.

Figure 6: Effect of channel gains on performance. Worse channel conditions would induce larger transmission power and longer wall-clock time.

VII-B6 Effect of Weight Parameters

We study how the weight parameters η1\eta_{1} and η2\eta_{2} affect the average energy consumption and wall-clock time of URAL in Fig. 7. We first fix η2=1\eta_{2}=1 and vary η1\eta_{1} from 0.5 to 2.5. As expected, the total energy consumption decreases with the increase of η1\eta_{1}, with the opposite trend for the wall-clock time. Then we vary η2\eta_{2} with η1=1\eta_{1}=1. Similarly, a larger η2\eta_{2} leads to lower latency and a higher energy cost. This implies that we can control the levels of wall-clock training time and energy consumption by tuning the weight parameters. In particular, even with a large η1\eta_{1} or η2\eta_{2}, the wall-clock time and energy cost can be kept at low levels. Meanwhile, the convergence rate achieved by URAL is robust to η1\eta_{1} and η2\eta_{2}. Thus, URAL can make full use of the resources (including datasets, bandwidth, and power) and achieve good trade-offs among the convergence rate, latency, and energy consumption.

Figure 7: Effect of weight parameters η1\eta_{1} and η2\eta_{2}. A large η1\eta_{1} achieves lower energy consumption while leading to longer wall-clock time (the average wall-clock time is 18.63 when η1=0.5\eta_{1}=0.5; it is 21.53 when η1=2.5\eta_{1}=2.5). It is the opposite for η2\eta_{2}.

VIII Conclusion

In this paper, we have proposed an FML algorithm, called NUFM, that maximizes the theoretical lower bound of the global loss reduction in each round to accelerate convergence. Aiming at effectively deploying NUFM in wireless networks, we have presented a device selection and resource allocation strategy (URAL), which jointly controls the CPU-cycle frequencies, transmission powers, and RB allocation to optimize the trade-off between energy consumption and wall-clock training time. Moreover, we have integrated the proposed algorithms with two first-order approximation techniques to further reduce the computational complexity on IoT devices. Extensive simulation results demonstrate that the proposed methods outperform the baseline algorithms.

Future work will investigate the trade-off between the local update and global aggregation in FML to minimize the convergence time and energy cost from a long-term perspective. In addition, how to characterize the convergence properties and communication complexity of the NUFM algorithm requires further research.

References

  • [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
  • [2] G. Plastiras, M. Terzi, C. Kyrkou, and T. Theocharides, “Edge intelligence: Challenges and opportunities of near-sensor machine learning applications,” in Proc. IEEE ASAP, 2018, pp. 1–7.
  • [3] X. Zhang, Y. Wang, S. Lu, L. Liu, W. Shi et al., “Openei: An open framework for edge intelligence,” in Proc. IEEE ICDCS, 2019, pp. 1840–1851.
  • [4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS.   PMLR, 2017, pp. 1273–1282.
  • [5] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
  • [6] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning with fast convergence and efficient communication,” arXiv preprint arXiv:1802.07876, 2019.
  • [7] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” arXiv preprint arXiv:1909.12488, 2019.
  • [8] S. Lin, G. Yang, and J. Zhang, “A collaborative learning framework via federated meta-learning,” in Proc. IEEE ICDCS, 2020, pp. 289–299.
  • [9] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proc. NIPS, 2020, pp. 1–12.
  • [10] P. Kairouz et al., “Advances and open problems in federated learning,” arXiv preprint arXiv:1912.04977, 2021.
  • [11] S. Yue, J. Ren, J. Xin, S. Lin, and J. Zhang, “Inexact-admm based federated meta-learning for fast and continual edge learning,” in Proc. ACM MobiHoc, 2021, p. 91–100.
  • [12] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in Proc. IEEE ICC, 2019, pp. 1–7.
  • [13] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, and H. V. Poor, “Fast-convergent federated learning,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 201–218, 2020.
  • [14] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, 2020.
  • [15] C. T. Dinh, N. H. Tran, M. N. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, “Federated learning over wireless networks: Convergence analysis and resource allocation,” IEEE/ACM Trans. Networking, vol. 29, no. 1, pp. 398–409, 2020.
  • [16] J. Liu, J. Ren, Y. Zhang, X. Peng, Y. Zhang, and Y. Yang, “Efficient dependent task offloading for multiple applications in mec-cloud system,” IEEE Trans. Mob. Comput., 2021.
  • [17] Z. M. Fadlullah and N. Kato, “Hcp: Heterogeneous computing platform for federated learning based collaborative content caching towards 6g networks,” IEEE Trans. Emerging Top. Comput., 2020.
  • [18] Q. Wu, K. He, and X. Chen, “Personalized federated learning for intelligent iot applications: A cloud-edge based framework,” IEEE Open J. Comput. Soc., vol. 1, pp. 35–44, 2020.
  • [19] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 1–19, 2019.
  • [20] S. Yue, J. Ren, N. Qiao, Y. Zhang, H. Jiang, Y. Zhang, and Y. Yang, “Todg: Distributed task offloading with delay guarantees for edge computing,” IEEE Trans. Parallel Distrib. Syst., 2021.
  • [21] J. Ren, Y. He, D. Wen, G. Yu, K. Huang, and D. Guo, “Scheduling for cellular federated edge learning with importance and channel awareness,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7690–7703, 2020.
  • [22] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, 2019.
  • [23] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio resource allocation for federated edge learning,” in Proc. IEEE ICC Workshops, 2020, pp. 1–6.
  • [24] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device scheduling and resource allocation for latency constrained wireless federated learning,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 453–467, 2020.
  • [25] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence time optimization for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2457–2471, 2020.
  • [26] J. Ren, G. Yu, and G. Ding, “Accelerating dnn training in wireless federated edge learning systems,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 219–232, 2020.
  • [27] H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization,” in Proc. ICML, 2019, pp. 7184–7193.
  • [28] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, 2020.
  • [29] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas in Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
  • [30] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in Proc. ICML, 2020, pp. 5132–5143.
  • [31] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, “Cost-effective federated learning design,” in Proc. IEEE INFOCOM, 2021.
  • [32] W. Luping, W. Wei, and L. Bo, “Cmfl: Mitigating communication overhead for federated learning,” in Proc. IEEE ICDCS, 2019, pp. 954–964.
  • [33] J. Li, M. Khodak, S. Caldas, and A. Talwalkar, “Differentially private meta-learning,” in Proc. ICLR, 2019.
  • [34] K. C. Sim et al., “Personalization of end-to-end speech recognition on mobile devices for named entities,” in Proc. IEEE ASRU, 2019, pp. 23–30.
  • [35] W. Shi, S. Zhou, and Z. Niu, “Device scheduling with fast convergence for wireless federated learning,” in Proc. IEEE ICC, 2020, pp. 1–6.
  • [36] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
  • [37] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “Fedpd: A federated learning framework with optimal rates and adaptivity to non-iid data,” arXiv preprint arXiv:2005.11418, 2020.
  • [38] A. Fallah, A. Mokhtari, and A. Ozdaglar, “On the convergence theory of gradient-based model-agnostic meta-learning algorithms,” in Proc. AISTATS, 2020, pp. 1082–1092.
  • [39] S. Yue, J. Ren, J. Xin, D. Zhang, Y. Zhang, and W. Zhuang, “Efficient federated meta-learning over multi-access wireless networks,” arXiv preprint arXiv:2108.06453, 2021.
  • [40] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms.   MIT press, 2009.
  • [41] T. D. Burd and R. W. Brodersen, “Processor design for portable systems,” J. VLSI Sig. Proc. Syst. Sig. Image Video Technol., vol. 13, no. 2, pp. 203–221, 1996.
  • [42] E. W. Weisstein, “Hungarian maximum matching algorithm,” https://mathworld.wolfram.com/, 2011.
  • [43] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
  • [44] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Technical Report TR-2009, University of Toronto, 2009.
  • [45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. of IEEE CVPR, 2009, pp. 248–255.

Appendix A Proof of Lemma 1

The proof is standard [11, 9], and we provide it for completeness. Recalling the definition of FiF_{i} as Fi(θ)fi(θαfi(θ))F_{i}(\theta)\coloneqq f_{i}(\theta-\alpha\nabla f_{i}(\theta)), we have

Fi(θ)=(Iα2fi(θ))fi(θαfi(θ)).\displaystyle\nabla F_{i}(\theta)=(I-\alpha\nabla^{2}f_{i}(\theta))\nabla f_{i}(\theta-\alpha\nabla f_{i}(\theta)).

Based on that, for any θ1,θ2d{\theta}_{1},{\theta}_{2}\in\mathbb{R}^{d} we have

Fi(θ1)Fi(θ2)=\displaystyle\|\nabla F_{i}({\theta}_{1})-\nabla F_{i}({\theta}_{2})\|= (Iα2fi(θ1))fi(θ1αfi(θ1))(Iα2fi(θ2))fi(θ2αfi(θ2))\displaystyle\;\|(I-\alpha\nabla^{2}f_{i}({\theta}_{1}))\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-(I-\alpha\nabla^{2}f_{i}({\theta}_{2}))\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|
=\displaystyle= (Iα2fi(θ1))(fi(θ1αfi(θ1))fi(θ2αfi(θ2)))\displaystyle\;\|(I-\alpha\nabla^{2}f_{i}({\theta}_{1}))(\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2})))
+((Iα2fi(θ1))(Iα2fi(θ2)))fi(θ2αfi(θ2))\displaystyle+((I-\alpha\nabla^{2}f_{i}({\theta}_{1}))-(I-\alpha\nabla^{2}f_{i}({\theta}_{2})))\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\| (adding and subtracting the term (Iα2fi(θ1))fi(θ2αfi(θ2)(I-\alpha\nabla^{2}f_{i}(\theta_{1}))\nabla f_{i}(\theta_{2}-\alpha\nabla f_{i}(\theta_{2}))
\displaystyle\leq Iα2fi(θ1)fi(θ1αfi(θ1))fi(θ2αfi(θ2))(a)\displaystyle\;\underbrace{\|I-\alpha\nabla^{2}f_{i}({\theta}_{1})\|\|\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|}_{(a)} (from triangle inequality)
+α2fi(θ1)2fi(θ2)fi(θ2αfi(θ2))(b).\displaystyle+\alpha\underbrace{\|\nabla^{2}f_{i}({\theta}_{1})-\nabla^{2}f_{i}({\theta}_{2})\|\|\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|}_{(b)}. (57)

Then, for (a)(a), the following holds true

Iα2fi(θ1)fi(θ1αfi(θ1))fi(θ2αfi(θ2))\displaystyle\|I-\alpha\nabla^{2}f_{i}({\theta}_{1})\|\|\nabla f_{i}({\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|
\displaystyle\leq\, (1+αL)Lθ1αfi(θ1))θ2+αfi(θ2)\displaystyle(1+\alpha L)L\|{\theta}_{1}-\alpha\nabla f_{i}({\theta}_{1}))-{\theta}_{2}+\alpha\nabla f_{i}({\theta}_{2})\| (from Assumption 1)
\displaystyle\leq\, (1+αL)L(θ1θ2+αfi(θ1)fi(θ2))\displaystyle(1+\alpha L)L(\|{\theta}_{1}-{\theta}_{2}\|+\alpha\|\nabla f_{i}({\theta}_{1})-\nabla f_{i}({\theta}_{2})\|) (from triangle inequality)
\displaystyle\leq\, (1+αL)2Lθ1θ2.\displaystyle(1+\alpha L)^{2}L\|{\theta}_{1}-{\theta}_{2}\|. (58)

Regarding (b)(b), it can be shown that

2fi(θ1)2fi(θ2)fi(θ2αfi(θ2))ρζθ1θ2.\displaystyle\|\nabla^{2}f_{i}({\theta}_{1})-\nabla^{2}f_{i}({\theta}_{2})\|\|\nabla f_{i}({\theta}_{2}-\alpha\nabla f_{i}({\theta}_{2}))\|\leq\rho\zeta\|{\theta}_{1}-{\theta}_{2}\|. (from Assumptions 1 and 2)

Substituting (a)(a) and (b)(b) into (57), we have the result.

Appendix B Proof of Lemma 2

We rewrite the stochastic gradient ~Fi(θ)\tilde{\nabla}F_{i}({\theta}) as follows

~Fi(θ)=(Iα2fi(θ)+δ1)(fi(θαfi(θ))+δ2)\displaystyle\tilde{\nabla}F_{i}({\theta})=\left(I-\alpha\nabla^{2}f_{i}({\theta})+\delta^{*}_{1}\right)\left(\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)+\delta^{*}_{2}\right) (59)

where δ1\delta^{*}_{1} and δ2\delta^{*}_{2} are given by

δ1\displaystyle\delta^{*}_{1} =α(2fi(θ)~2fi(θ,𝒟i′′))\displaystyle=\alpha\left(\nabla^{2}f_{i}({\theta})-\tilde{\nabla}^{2}f_{i}({\theta},\mathcal{D}^{\prime\prime}_{i})\right) (60)
δ2\displaystyle\delta^{*}_{2} =~fi(θα~fi(θ,𝒟i),𝒟i)fi(θαfi(θ)).\displaystyle=\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right). (61)

Note that δ1\delta^{*}_{1} and δ2\delta^{*}_{2} are independent. Due to Assumption 3, we have

𝔼[δ1]\displaystyle\mathbb{E}[\delta^{*}_{1}] =0\displaystyle=0 (62)
𝔼[δ12]\displaystyle\mathbb{E}[\|\delta^{*}_{1}\|^{2}] (ασH)2Di′′.\displaystyle\leq\frac{(\alpha\sigma_{H})^{2}}{D^{\prime\prime}_{i}}. (63)

Next, we proceed to bound the first and second moments of δ2\delta^{*}_{2}. Regarding the first moment, we have

𝔼[δ2]\displaystyle\left\|\mathbb{E}\left[\delta^{*}_{2}\right]\right\| =𝔼[~fi(θα~fi(θ,𝒟i),𝒟i)fi(θα~fi(θ,𝒟i))+fi(θα~fi(θ,𝒟i))fi(θαfi(θ))]\displaystyle=\left\|\mathbb{E}\left[\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}-\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}+\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right]\right\|
=𝔼[fi(θα~fi(θ,𝒟i))fi(θαfi(θ))]\displaystyle=\left\|\mathbb{E}\left[\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right]\right\|
𝔼[fi(θα~fi(θ,𝒟i))fi(θαfi(θ))]\displaystyle\leq\mathbb{E}\left[\left\|\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|\right]
αL𝔼[~fi(θ,𝒟i)fi(θ)]\displaystyle\leq\alpha L\mathbb{E}\left[\left\|\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})-\nabla f_{i}({\theta})\right\|\right] (from the smoothness of fif_{i})
ασGLDi.\displaystyle\leq\frac{\alpha\sigma_{G}L}{\sqrt{D_{i}}}. (from Assumption 3)

Regarding the second moment, we have

𝔼[δ22]\displaystyle\mathbb{E}\left[\|\delta^{*}_{2}\|^{2}\right] 2𝔼[~fi(θα~fi(θ,𝒟i),𝒟i)fi(θα~fi(θ,𝒟i))2]+2𝔼[fi(θα~fi(θ,𝒟i))fi(θαfi(θ))2]\displaystyle\leq 2\mathbb{E}\left[\left\|\tilde{\nabla}f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i}),\mathcal{D}^{\prime}_{i}\big{)}-\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}\right\|^{2}\right]+2\mathbb{E}\left[\left\|\nabla f_{i}\big{(}{\theta}-\alpha\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})\big{)}-\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|^{2}\right]
2σG2Di+2(αL)2𝔼[~fi(θ,𝒟i)fi(θ)2]\displaystyle\leq\frac{2\sigma^{2}_{G}}{D^{\prime}_{i}}+2(\alpha L)^{2}\mathbb{E}\left[\left\|\tilde{\nabla}f_{i}({\theta},\mathcal{D}_{i})-\nabla f_{i}({\theta})\right\|^{2}\right] (from the tower rule, smoothness of fif_{i} along with Assumption 3)
2σG2(1Di+(αL)2Di).\displaystyle\leq 2\sigma^{2}_{G}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right). (64)

Based on (59), we have

𝔼[~Fi(θ)Fi(θ)]\displaystyle\left\|\mathbb{E}\left[\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right]\right\| =(Iα2fi(θ))𝔼[δ2]+𝔼[δ1]fi(θαfi(θ))+𝔼[δ1δ2]\displaystyle=\left\|\left(I-\alpha\nabla^{2}f_{i}({\theta})\right)\mathbb{E}[\delta^{*}_{2}]+\mathbb{E}[\delta^{*}_{1}]\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)+\mathbb{E}[\delta^{*}_{1}\delta^{*}_{2}]\right\|
Iα2fi(θ)𝔼[δ2]\displaystyle\leq\left\|I-\alpha\nabla^{2}f_{i}({\theta})\right\|\left\|\mathbb{E}[\delta^{*}_{2}]\right\|
ασGL(1+αL)Di\displaystyle\leq\frac{\alpha\sigma_{G}L(1+\alpha L)}{\sqrt{D_{i}}} (65)

which gives us the first result in Lemma 2.

By the submultiplicative property of matrix norm,

~Fi(θ)Fi(θ)Iα2fi(θ)δ2+fi(θαfi(θ))δ1+δ1δ2.\displaystyle\left\|\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right\|\leq\left\|I-\alpha\nabla^{2}f_{i}({\theta})\right\|\left\|\delta^{*}_{2}\right\|+\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|\left\|\delta^{*}_{1}\right\|+\left\|\delta^{*}_{1}\right\|\left\|\delta^{*}_{2}\right\|. (66)

Thus, we have

~Fi(θ)Fi(θ)23Iα2fi(θ)2δ22+3fi(θαfi(θ))2δ12+3δ12δ22.\displaystyle\left\|\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right\|^{2}\leq 3\left\|I-\alpha\nabla^{2}f_{i}({\theta})\right\|^{2}\left\|\delta^{*}_{2}\right\|^{2}+3\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|^{2}\left\|\delta^{*}_{1}\right\|^{2}+3\left\|\delta^{*}_{1}\right\|^{2}\left\|\delta^{*}_{2}\right\|^{2}. (67)

Taking expectations on both sides, we obtain

𝔼[~Fi(θ)Fi(θ)2]\displaystyle\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta})-\nabla F_{i}({\theta})\right\|^{2}\right] 3(1+αL)2𝔼[δ22]+3ζ2𝔼[δ12]+3𝔼[δ12]𝔼[δ22]\displaystyle\leq 3\left(1+\alpha L\right)^{2}\mathbb{E}\left[\left\|\delta^{*}_{2}\right\|^{2}\right]+3\zeta^{2}\mathbb{E}\left[\left\|\delta^{*}_{1}\right\|^{2}\right]+3\mathbb{E}\left[\left\|\delta^{*}_{1}\right\|^{2}\right]\mathbb{E}\left[\left\|\delta^{*}_{2}\right\|^{2}\right] (using the fact that Iα2fi(θ)1+αL\|I-\alpha\nabla^{2}f_{i}({\theta})\|\leq 1+\alpha L)
6(ασGσH)2Di′′(1Di+(αL)2Di)+6σG2(1+αL)2(1Di+(αL)2Di)+3(αζσH)2Di′′\displaystyle\leq\frac{6(\alpha\sigma_{G}\sigma_{H})^{2}}{D^{\prime\prime}_{i}}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)+6\sigma^{2}_{G}(1+\alpha L)^{2}\left(\frac{1}{D^{\prime}_{i}}+\frac{(\alpha L)^{2}}{D_{i}}\right)+\frac{3(\alpha\zeta\sigma_{H})^{2}}{D^{\prime\prime}_{i}} (68)

which completes the proof.

Appendix C Proof of Lemma 3

Recall the definition of Fi(θ)F_{i}({\theta}). We have

Fi(θ)Fj(θ)=\displaystyle\left\|\nabla F_{i}({\theta})-F_{j}({\theta})\right\|=\, (Iα2fi(θ))fi(θαfi(θ))(Iα2fj(θ))fj(θαfj(θ))\displaystyle\left\|\left(I-\alpha\nabla^{2}f_{i}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
\displaystyle\leq\, (Iα2fi(θ))fi(θαfi(θ))(Iα2fj(θ))fi(θαfi(θ))\displaystyle\left\|\left(I-\alpha\nabla^{2}f_{i}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|
+(Iα2fj(θ))fi(θαfi(θ))(Iα2fj(θ))fj(θαfj(θ))\displaystyle+\left\|\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
\displaystyle\leq\, α(2fi(θ)2fj(θ))fi(θαfi(θ))A\displaystyle\alpha\underbrace{\left\|\left(\nabla^{2}f_{i}({\theta})-\nabla^{2}f_{j}({\theta})\right)\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|}_{A}
+(Iα2fj(θ))(fi(θαfi(θ))fj(θαfj(θ)))B.\displaystyle+\underbrace{\left\|\left(I-\alpha\nabla^{2}f_{j}({\theta})\right)\left(\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right)\right\|}_{B}. (69)

Regarding AA, due to the submultiplicative property of the matrix norm, the following holds

A\displaystyle A 2fi(θ)2fj(θ)fi(θαfi(θ))\displaystyle\leq\left\|\nabla^{2}f_{i}({\theta})-\nabla^{2}f_{j}({\theta})\right\|\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)\right\|
ζγH\displaystyle\leq\zeta\gamma_{H} (70)

where the last inequality follows from Assumption 4.

Regarding BB, similarly, we obtain

B\displaystyle B\leq\, Iα2fj(θ)fi(θαfi(θ))fj(θαfj(θ))\displaystyle\left\|I-\alpha\nabla^{2}f_{j}({\theta})\right\|\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
(a)\displaystyle\mathop{\leq}^{\textbf{(a)}}\, (1+αL)fi(θαfi(θ))fj(θαfj(θ))\displaystyle\left(1+\alpha L\right)\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
\displaystyle\leq\, (1+αL)(fi(θαfi(θ))fi(θαfj(θ))\displaystyle\left(1+\alpha L\right)\Big{(}\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{i}({\theta})\right)-\nabla f_{i}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|
+fi(θαfj(θ))fj(θαfj(θ)))\displaystyle+\left\|\nabla f_{i}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)-\nabla f_{j}\left({\theta}-\alpha\nabla f_{j}({\theta})\right)\right\|\Big{)}
(b)\displaystyle\mathop{\leq}^{\text{(b)}}\, (1+αL)(αLfi(θ)fj(θ)+γG)\displaystyle\left(1+\alpha L\right)\left(\alpha L\left\|\nabla f_{i}({\theta})-f_{j}({\theta})\right\|+\gamma_{G}\right)
\displaystyle\leq\, (1+αL)2γG\displaystyle\left(1+\alpha L\right)^{2}\gamma_{G} (71)

where (a) is derived via triangle inequality, and (b) follows from Assumptions 1 and 4.

Substituting (70) and (71) into (69), we have

Fi(θ)Fj(θ)(1+αL)2γG+αζγH\displaystyle\left\|\nabla F_{i}({\theta})-\nabla F_{j}({\theta})\right\|\leq\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H} (72)

thereby completing the proof.

Appendix D Proof of Lemma 4

Since jj is the straggler among devices, (SP1) can be equivalently transformed to

min𝝂η1i𝒩ιi2ciDiνi2+η2cjDjνjs.t.0νiνi𝑚𝑎𝑥,i𝒩ciDiνjcjDjνi,i𝒩/j.\displaystyle\begin{split}\mathop{\min}_{\bm{\nu}}&\quad\eta_{1}\sum_{i\in\mathcal{N}}\frac{\iota_{i}}{2}c_{i}D_{i}\nu^{2}_{i}+\eta_{2}\frac{c_{j}D_{j}}{\nu_{j}}\\ \mathrm{s.t.}&\quad 0\leq\nu_{i}\leq\nu^{\mathit{max}}_{i},~{}\forall i\in\mathcal{N}\\ &\quad\frac{c_{i}D_{i}\nu_{j}}{c_{j}D_{j}}\leq\nu_{i},~{}\forall i\in\mathcal{N}/j.\end{split} (73)

Fixing νj\nu_{j}, for each i𝒩/ji\in\mathcal{N}/j (we implicitly assume that 𝒩/j\mathcal{N}/j is not empty), we can show that the optimal CPU frequency νi\nu^{*}_{i} of problem (73) can be obtained via solving the following decomposed convex optimization problem

minνiιi2η1ciDiνi2s.t.0νiνi𝑚𝑎𝑥ciDiνjcjDjνi.\displaystyle\begin{split}\mathop{\min}_{\nu_{i}}&\quad\frac{\iota_{i}}{2}\eta_{1}c_{i}D_{i}\nu^{2}_{i}\\ \mathrm{s.t.}&\quad 0\leq\nu_{i}\leq\nu^{\mathit{max}}_{i}\\ &\quad\frac{c_{i}D_{i}\nu_{j}}{c_{j}D_{j}}\leq\nu_{i}.\end{split} (74)

If νjcjDjνi𝑚𝑎𝑥ciDi\nu_{j}\leq\frac{c_{j}D_{j}\nu^{\mathit{max}}_{i}}{c_{i}D_{i}}, the optimal solution of problem (74) is given by

νi=ciDiνjcjDj.\displaystyle\nu^{*}_{i}=\frac{c_{i}D_{i}\nu_{j}}{c_{j}D_{j}}. (75)

Otherwise, the problem is infeasible because the constraints are mutually contradictory. Substituting νi=νi\nu_{i}=\nu^{*}_{i} in problem (73), we have the following problem with respect to νj\nu_{j}

minνjg(νj)=η1(i𝒩/jιi(ciDi)32(cjDj)2+ιjcjDj2)a1νj2+η2cjDja21νjs.t.0νjcjDjνi𝑚𝑎𝑥ciDi,i𝒩.\displaystyle\begin{split}\mathop{\min}_{\nu_{j}}\,&\quad g(\nu_{j})=\underbrace{\eta_{1}\bigg{(}\sum_{i\in\mathcal{N}/j}\frac{\iota_{i}(c_{i}D_{i})^{3}}{2(c_{j}D_{j})^{2}}+\frac{\iota_{j}c_{j}D_{j}}{2}\bigg{)}}_{a_{1}}\nu^{2}_{j}+\underbrace{\eta_{2}c_{j}D_{j}}_{a_{2}}\frac{1}{\nu_{j}}\\ \mathrm{s.t.}&\quad 0\leq\nu_{j}\leq\frac{c_{j}D_{j}\nu^{\mathit{max}}_{i}}{c_{i}D_{i}},~{}\forall i\in\mathcal{N}.\end{split} (76)

We simplify the expression of g(νj)g(\nu_{j}) as g(νj)=a1νj2+a2/νjg(\nu_{j})=a_{1}\nu^{2}_{j}+a_{2}/\nu_{j} where positive constants a1a_{1} and a2a_{2} are defined in (76). Then, the derivative of g(νj)g(\nu_{j}) can be written as

g(νj)=2a1νja2νj2.\displaystyle g^{\prime}(\nu_{j})=2a_{1}\nu_{j}-\frac{a_{2}}{\nu^{2}_{j}}. (77)

Based on (77), the minimum value of g(νj)g(\nu_{j}) is obtained at its stationary point ν¯jg1(0)=a22a13\bar{\nu}_{j}\coloneqq{g^{\prime}}^{-1}(0)=\sqrt[3]{\frac{a_{2}}{2a_{1}}}. Thus, the optimal solution of problem (76) is

νj=min{a22a13,mini𝒩cjDjνi𝑚𝑎𝑥ciDi}.\displaystyle\nu^{*}_{j}=\min\left\{\sqrt[3]{\frac{a_{2}}{2a_{1}}},\min_{i\in\mathcal{N}}\frac{c_{j}D_{j}\nu^{\mathit{max}}_{i}}{c_{i}D_{i}}\right\}. (78)

Combining (75) and (78), we complete the proof.
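A minimal NumPy sketch of this closed form (ours, for illustration; device j is assumed to be the computation straggler, and iota, c, D, nu_max are per-device arrays):

```python
import numpy as np

def optimal_frequencies(eta1, eta2, iota, c, D, nu_max, j):
    """Illustrative sketch of Lemma 4's closed form (75)/(78)."""
    w = c * D                                        # per-device workload c_i * D_i
    a1 = eta1 * (np.sum(np.delete(iota * w**3, j)) / (2.0 * w[j]**2)
                 + iota[j] * w[j] / 2.0)             # coefficient a_1 in (76)
    a2 = eta2 * w[j]                                 # coefficient a_2 in (76)
    nu_j = min((a2 / (2.0 * a1)) ** (1.0 / 3.0),     # stationary point of (76)
               np.min(w[j] * nu_max / w))            # feasibility cap from (76)
    return w * nu_j / w[j]                           # (75): scale the non-stragglers (entry j equals nu_j)
```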

Appendix E Proof of Lemma 5

Given 𝒛\bm{z}^{*} and δ\delta^{*}, pip^{*}_{i} can be obtained via solving the following problem

minpimzi,mpilog2(1+hipiIm+BN0)s.t.0pipi𝑚𝑎𝑥mzi,mSBlog2(1+hipiIm+BN0)δ\displaystyle\begin{split}\mathop{\min}_{p_{i}}&\quad\sum_{m\in\mathcal{M}}\frac{z^{*}_{i,m}p_{i}}{\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m}+BN_{0}}\right)}\\ \mathrm{s.t.}&~{}\,\quad 0\leq p_{i}\leq p^{\mathit{max}}_{i}\\ &\quad\sum_{m\in\mathcal{M}}\frac{z^{*}_{i,m}S}{B\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m}+BN_{0}}\right)}\leq\delta^{*}\end{split} (79)

where we eliminate uiu_{i}, η1\eta_{1}, SS, and BB in the objective. Clearly, if mzi,m=0\sum_{m\in\mathcal{M}}z^{*}_{i,m}=0, then pi=0p^{*}_{i}=0. If the constraints in (79) are mutually contradictory, i.e.

pi𝑚𝑎𝑥<(Imi+BN0)(2SBδ1)hi\displaystyle p^{\mathit{max}}_{i}<\frac{(I_{m^{*}_{i}}+BN_{0})(2^{\frac{S}{B\delta^{*}}}-1)}{h_{i}} (80)

then zi,mz^{*}_{i,m} must be 0, which gives (41).

When there exists mim^{*}_{i} such that zi,mi=1z^{*}_{i,m^{*}_{i}}=1, by denoting p~ihipiImi+BN0\tilde{p}_{i}\coloneqq\frac{h_{i}p_{i}}{I_{m^{*}_{i}}+BN_{0}} and rearranging the terms, we can transform the problem (79) to

minp~ig3(p~i)=p~ilog2(1+p~i)s.t.0p~ihipi𝑚𝑎𝑥Imi+BN02SBδ1p~i.\displaystyle\begin{split}\mathop{\min}_{\tilde{p}_{i}}&\quad g_{3}(\tilde{p}_{i})=\frac{\tilde{p}_{i}}{\log_{2}\left(1+\tilde{p}_{i}\right)}\\ \mathrm{s.t.}&\quad 0\leq\tilde{p}_{i}\leq\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}}\\ &\quad 2^{\frac{S}{B\delta^{*}}}-1\leq\tilde{p}_{i}.\end{split} (81)

We have

g3(p~i)=log2(1+p~i)p~i/((1+p~i)ln2)(log2(1+p~i))2.\displaystyle g^{\prime}_{3}(\tilde{p}_{i})=\frac{\log_{2}(1+\tilde{p}_{i})-\tilde{p}_{i}/\left((1+\tilde{p}_{i})\ln 2\right)}{\left(\log_{2}(1+\tilde{p}_{i})\right)^{2}}. (82)

Denoting the numerator of (82) as f3(p~i)f_{3}(\tilde{p}_{i}), we have f3(0)=0f_{3}(0)=0 and

f3(p~i)=1ln2(1(1+p~i)1(1+p~i)2)>0,whenp~i>0\displaystyle f^{\prime}_{3}(\tilde{p}_{i})=\frac{1}{\ln 2}\left(\frac{1}{(1+\tilde{p}_{i})}-\frac{1}{(1+\tilde{p}_{i})^{2}}\right)>0,\quad\text{when}~{}\tilde{p}_{i}>0 (83)

which implies that g3(p~i)g_{3}(\tilde{p}_{i}) is monotonically increasing with p~i>0\tilde{p}_{i}>0. Thus, recalling the definition of p~i\tilde{p}_{i}, due to (41), we obtain

pi=(Imi+BN0)(2SBδ1)hi\displaystyle p^{*}_{i}=\frac{(I_{m^{*}_{i}}+BN_{0})(2^{\frac{S}{B\delta^{*}}}-1)}{h_{i}} (84)

which completes the proof of (42).
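A small numeric sketch of (84) (ours, for illustration): it returns the smallest transmission power with which device i meets the delay target δ* on its assigned RB.

```python
def min_power_for_delay(h_i, I_m, B, N0, S, delta):
    """Sketch of (84): solve S / (B*log2(1 + h_i*p/(I_m + B*N0))) = delta for p."""
    return (I_m + B * N0) * (2.0 ** (S / (B * delta)) - 1.0) / h_i
```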

Appendix F Proof of Lemma 6

If jj denotes the straggler among all devices and 𝒩\mathcal{N}^{*}\neq\emptyset, we can rewrite (6) for each i𝒩i\in\mathcal{N}^{*} as

mzi,mSBlog2(1+hipiIm+BN0)mzj,mSBlog2(1+hjpjIm+BN0)\displaystyle\sum_{m\in\mathcal{M}}z^{*}_{i,m}\frac{S}{B\log_{2}\left(1+\frac{h_{i}p^{*}_{i}}{I_{m}+BN_{0}}\right)}\leq\sum_{m\in\mathcal{M}}z^{*}_{j,m}\frac{S}{B\log_{2}\left(1+\frac{h_{j}p^{*}_{j}}{I_{m}+BN_{0}}\right)}
\displaystyle\Leftrightarrow\quad log2(1+hjpjImj+BN0)log2(1+hipiImi+BN0)\displaystyle\log_{2}\left(1+\frac{h_{j}p^{*}_{j}}{I_{m^{*}_{j}}+BN_{0}}\right)\leq\log_{2}\left(1+\frac{h_{i}p^{*}_{i}}{I_{m^{*}_{i}}+BN_{0}}\right)
\displaystyle\Leftrightarrow\quad (Imi+BN0)hj(Imj+BN0)hipjpi.\displaystyle\frac{(I_{m^{*}_{i}}+BN_{0})h_{j}}{(I_{m^{*}_{j}}+BN_{0})h_{i}}p^{*}_{j}\leq p^{*}_{i}. (85)

Based on (85), by eliminating the constants BB and SS, we transform (SP2) under fixed 𝒛\bm{z}^{*} into

min{pi}i𝒩i𝒩η1pilog2(1+hipiImi+BN0)+η21log2(1+hjpjImj+BN0)s.t.0pipi𝑚𝑎𝑥,i𝒩(Imi+BN0)hj(Imj+BN0)hipjpi,i𝒩/j.\displaystyle\begin{split}\mathop{\min}_{\{p_{i}\}_{i\in\mathcal{N}^{*}}}&\quad\sum_{i\in\mathcal{N}^{*}}\eta_{1}\frac{p_{i}}{\log_{2}\left(1+\frac{h_{i}p_{i}}{I_{m^{*}_{i}}+BN_{0}}\right)}+\eta_{2}\frac{1}{\log_{2}\left(1+\frac{h_{j}p_{j}}{I_{m^{*}_{j}}+BN_{0}}\right)}\\ \mathrm{s.t.}~{}\;&~{}\,\quad 0\leq p_{i}\leq p^{\mathit{max}}_{i},~{}\forall i\in\mathcal{N}^{*}\\ &~{}\quad\frac{(I_{m^{*}_{i}}+BN_{0})h_{j}}{(I_{m^{*}_{j}}+BN_{0})h_{i}}p_{j}\leq p_{i},~{}\forall i\in\mathcal{N}^{*}/j.\end{split} (86)

Due to (85), for each i𝒩/ji\in\mathcal{N}^{*}/j, we can obtain the optimal solution pip^{*}_{i} of (86) (abusing notation) via solving the decomposed problem as follows

minp~ig3(p~i)=p~ilog2(1+p~i)s.t.0p~ihipi𝑚𝑎𝑥Imi+BN0hjImj+BN0pjp~i\displaystyle\begin{split}\mathop{\min}_{\tilde{p}_{i}}\,&\quad g_{3}(\tilde{p}_{i})=\frac{\tilde{p}_{i}}{\log_{2}\left(1+\tilde{p}_{i}\right)}\\ \mathrm{s.t.}&\quad 0\leq\tilde{p}_{i}\leq\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}}\\ &\quad\frac{h_{j}}{I_{m^{*}_{j}}+BN_{0}}p_{j}\leq\tilde{p}_{i}\end{split} (87)

where p~ihipiImi+BN0\tilde{p}_{i}\coloneqq\frac{h_{i}p_{i}}{I_{m^{*}_{i}}+BN_{0}}. Note that in (87) we consider the nontrivial case where 𝒩/j\mathcal{N}/j is not empty. It is easy to see that

g3(p~i)=log2(1+p~i)p~i/((1+p~i)ln2)(log2(1+p~i))2.\displaystyle g^{\prime}_{3}(\tilde{p}_{i})=\frac{\log_{2}(1+\tilde{p}_{i})-\tilde{p}_{i}/\left((1+\tilde{p}_{i})\ln 2\right)}{\left(\log_{2}(1+\tilde{p}_{i})\right)^{2}}. (88)

Denoting the numerator of (88) as f3(p~i)f_{3}(\tilde{p}_{i}), we have f3(0)=0f_{3}(0)=0 and

f3(p~i)=1ln2(1(1+p~i)1(1+p~i)2)>0,whenp~i>0\displaystyle f^{\prime}_{3}(\tilde{p}_{i})=\frac{1}{\ln 2}\left(\frac{1}{(1+\tilde{p}_{i})}-\frac{1}{(1+\tilde{p}_{i})^{2}}\right)>0,\quad\text{when}~{}\tilde{p}_{i}>0 (89)

which implies that g3(p~i)g_{3}(\tilde{p}_{i}) is monotonically increasing with p~i>0\tilde{p}_{i}>0. Thus, if pjhipi𝑚𝑎𝑥(Imj+BN0)hj(Imi+BN0)p_{j}\leq\frac{h_{i}p^{\mathit{max}}_{i}(I_{m^{*}_{j}}+BN_{0})}{h_{j}(I_{m^{*}_{i}}+BN_{0})}, recalling the definition of p~i\tilde{p}_{i}, we have

pi=hj(Imi+BN0)hi(Imj+BN0)pj,i𝒩/j.\displaystyle p^{*}_{i}=\frac{h_{j}(I_{m^{*}_{i}}+BN_{0})}{h_{i}(I_{m^{*}_{j}}+BN_{0})}p_{j},~{}\forall i\in\mathcal{N}^{*}/j. (90)

Otherwise, problem (87) is infeasible.

Similar to (87), denote p~jhjpjImj+BN0\tilde{p}_{j}\coloneqq\frac{h_{j}p_{j}}{I_{m^{*}_{j}}+BN_{0}} and substitute pi=pip_{i}=p^{*}_{i} in (86) (noting that pi=(Imi+BN0)p~jhip_{i}=\frac{(I_{m^{*}_{i}}+BN_{0})\tilde{p}_{j}}{h_{i}}), whereby we have the following problem regarding p~j\tilde{p}_{j}

minp~jg4(p~j)=η1i𝒩Imi+BN0hib1p~jlog2(1+p~j)+η2log2(1+p~j)s.t.0p~jhipi𝑚𝑎𝑥Imi+BN0,i𝒩.\displaystyle\begin{split}\mathop{\min}_{\tilde{p}_{j}}&\quad g_{4}(\tilde{p}_{j})=\underbrace{\eta_{1}\sum_{i\in\mathcal{N}^{*}}\frac{I_{m^{*}_{i}}+BN_{0}}{h_{i}}}_{b_{1}}\cdot\frac{\tilde{p}_{j}}{\log_{2}\left(1+\tilde{p}_{j}\right)}+\frac{\eta_{2}}{\log_{2}\left(1+\tilde{p}_{j}\right)}\\ \mathrm{s.t.}&\quad 0\leq\tilde{p}_{j}\leq\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}},~{}\forall i\in\mathcal{N}^{*}.\end{split} (91)

Denoting positive constant b1b_{1} as in (91), we can write g4(p~j)=b1p~jlog2(1+p~j)+η2log2(1+p~j)g_{4}(\tilde{p}_{j})=\frac{b_{1}\tilde{p}_{j}}{\log_{2}(1+\tilde{p}_{j})}+\frac{\eta_{2}}{\log_{2}(1+\tilde{p}_{j})} and obtain

g4(p~j)\displaystyle g^{\prime}_{4}(\tilde{p}_{j}) =b1g3(p~j)η2(1+p~j)(log2(1+p~j))2ln2\displaystyle=b_{1}g^{\prime}_{3}(\tilde{p}_{j})-\frac{\eta_{2}}{(1+\tilde{p}_{j})\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}\ln 2}
=b1log2(1+p~j)p~j/((1+p~j)ln2)(log2(1+p~j))2η2(1+p~j)(log2(1+p~j))2ln2\displaystyle=b_{1}\frac{\log_{2}(1+\tilde{p}_{j})-\tilde{p}_{j}/\left((1+\tilde{p}_{j})\ln 2\right)}{\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}}-\frac{\eta_{2}}{(1+\tilde{p}_{j})\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}\ln 2}
=b1((1+p~j)log2(1+p~j)ln2p~j)η2(1+p~j)(log2(1+p~j))2ln2.\displaystyle=\frac{b_{1}\big{(}(1+\tilde{p}_{j})\log_{2}(1+\tilde{p}_{j})\ln 2-\tilde{p}_{j}\big{)}-\eta_{2}}{(1+\tilde{p}_{j})\left(\log_{2}(1+\tilde{p}_{j})\right)^{2}\ln 2}. (92)

Next, we show that g4(p~j)g_{4}(\tilde{p}_{j}) has a unique minimum point. Denoting the numerator of (92) as f4(p~j)f_{4}(\tilde{p}_{j}), we have f4(0)=η20f_{4}(0)=-\eta_{2}\leq 0 and

f4(p~j)=b1log2(1+p~j)ln20,whenp~j0\displaystyle f^{\prime}_{4}(\tilde{p}_{j})=b_{1}\log_{2}(1+\tilde{p}_{j})\ln 2\geq 0,\quad\text{when}~{}\tilde{p}_{j}\geq 0 (93)

which implies f4(p~j)f_{4}(\tilde{p}_{j}) is monotonically increasing. Besides, it can be shown that f4(b2)0f_{4}(b_{2})\geq 0, where b2=2(1+max{η2b1,1}1)/ln2b_{2}=2^{(1+\sqrt{\max\{\frac{\eta_{2}}{b_{1}},1\}-1})/\ln 2}. Thus, there must exist a unique point p~j0(0,b2]\tilde{p}^{0}_{j}\in(0,b_{2}] such that g4(p~j0)=f4(p~j0)=0g^{\prime}_{4}(\tilde{p}^{0}_{j})=f_{4}(\tilde{p}^{0}_{j})=0 holds, and p~j0\tilde{p}^{0}_{j} is also the unique minimum point of g4(p~j)g_{4}(\tilde{p}_{j}).

Accordingly, the optimal solution of the problem (91) can be expressed as

p~j=min{p~j0,mini𝒩hipi𝑚𝑎𝑥Imi+BN0}.\displaystyle\tilde{p}^{*}_{j}=\min\left\{\tilde{p}^{0}_{j},\min_{i\in\mathcal{N^{*}}}\frac{h_{i}p^{\mathit{max}}_{i}}{I_{m^{*}_{i}}+BN_{0}}\right\}. (94)

Therefore, we complete the proof.

Appendix G Proof of Theorem 1

From Lemma 1 (smoothness of FiF_{i}), for any θ1,θ2d{\theta}_{1},{\theta}_{2}\in\mathbb{R}^{d}, we have

F(θ2)F(θ1)+F(θ1)(θ2θ1)+LF2θ2θ12.\displaystyle F({\theta}_{2})\leq F({\theta}_{1})+\nabla F({\theta}_{1})^{\top}({\theta}_{2}-{\theta}_{1})+\frac{L_{F}}{2}\|{\theta}_{2}-{\theta}_{1}\|^{2}. (95)

Combining (95) with (2), we have

F(θk+1)\displaystyle F({\theta}^{k+1}) F(θk)+F(θk)(θk+1θk)+LF2θk+1θk2\displaystyle\leq F({\theta}^{k})+\nabla F({\theta}^{k})^{\top}({\theta}^{k+1}-{\theta}^{k})+\frac{L_{F}}{2}\|{\theta}^{k+1}-{\theta}^{k}\|^{2}
=F(θk)βF(θk)(1nki𝒩k~Fi(θk))+LFβ221nki𝒩k~Fi(θk)2.\displaystyle=F({\theta}^{k})-\beta\nabla F({\theta}^{k})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right)+\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}. (96)

Denoting

GkβF(θk)(1nki𝒩k~Fi(θk))LFβ221nki𝒩k~Fi(θk)2\displaystyle G^{k}\coloneqq\beta\nabla F({\theta}^{k})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right)-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2} (97)

we have 𝔼[F(θk)F(θk+1)]𝔼[Gk]\mathbb{E}[F({\theta}^{k})-F({\theta}^{k+1})]\geq\mathbb{E}[G^{k}]. Next, we bound 𝔼[Gk]\mathbb{E}[G^{k}] from below.

First, for any i𝒩i\in\mathcal{N}, we rewrite F(θk)\nabla F({\theta}^{k}) as

F(θk)=\displaystyle\nabla F({\theta}^{k})= 1nj𝒩Fj(θk)Fi(θk)Ai+Fi(θk)~Fi(θk)Bi+~Fi(θk)\displaystyle\underbrace{\frac{1}{n}\sum_{j\in\mathcal{N}}\nabla F_{j}({\theta}^{k})-\nabla F_{i}({\theta}^{k})}_{A_{i}}+\underbrace{\nabla F_{i}({\theta}^{k})-\tilde{\nabla}F_{i}({\theta}^{k})}_{B_{i}}+\tilde{\nabla}F_{i}({\theta}^{k}) (98)

and bound 𝔼[Ai2]\mathbb{E}[\|A_{i}\|^{2}] by

𝔼[Ai2]\displaystyle\mathbb{E}[\|A_{i}\|^{2}] =𝔼[Fi(θk)1nj𝒩Fj(θk)2]\displaystyle=\mathbb{E}\bigg{[}\Big{\|}\nabla F_{i}({\theta}^{k})-\frac{1}{n}\sum_{j\in\mathcal{N}}\nabla F_{j}({\theta}^{k})\Big{\|}^{2}\bigg{]}
=𝔼[1nj𝒩(Fj(θk)Fi(θk))2]\displaystyle=\mathbb{E}\bigg{[}\Big{\|}\frac{1}{n}\sum_{j\in\mathcal{N}}\left(\nabla F_{j}({\theta}^{k})-\nabla F_{i}({\theta}^{k})\right)\Big{\|}^{2}\bigg{]}
(a)1nj𝒩𝔼[Fj(θk)Fi(θk)2]\displaystyle\mathop{\leq}^{\text{(a)}}\frac{1}{n}\sum_{j\in\mathcal{N}}\mathbb{E}\left[\left\|\nabla F_{j}({\theta}^{k})-\nabla F_{i}({\theta}^{k})\right\|^{2}\right]
(b)(1+αL)2γG+αζγH\displaystyle\mathop{\leq}^{\text{(b)}}\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H} (99)

where (b) is derived from Lemma 3. Inequality (a) follows from the fact that, for any aida_{i}\in\mathbb{R}^{d} and bib_{i}\in\mathbb{R},

i𝒩kbiai2\displaystyle\left\|\sum_{i\in\mathcal{N}_{k}}b_{i}a_{i}\right\|^{2} =j=1d(i𝒩kbiai(j))2\displaystyle=\sum^{d}_{j=1}\left(\sum_{i\in\mathcal{N}_{k}}b_{i}a^{(j)}_{i}\right)^{2}
j=1d(i𝒩k(ai(j))2)(i𝒩kbi2)\displaystyle\leq\sum^{d}_{j=1}\left(\sum_{i\in\mathcal{N}_{k}}\left(a^{(j)}_{i}\right)^{2}\right)\left(\sum_{i\in\mathcal{N}_{k}}b^{2}_{i}\right)
=(j=1di𝒩k(ai(j))2)(i𝒩kbi2)\displaystyle=\left(\sum^{d}_{j=1}\sum_{i\in\mathcal{N}_{k}}\left(a^{(j)}_{i}\right)^{2}\right)\left(\sum_{i\in\mathcal{N}_{k}}b^{2}_{i}\right)
=(i𝒩kai2)(i𝒩kbi2)\displaystyle=\left(\sum_{i\in\mathcal{N}_{k}}\left\|a_{i}\right\|^{2}\right)\left(\sum_{i\in\mathcal{N}_{k}}b^{2}_{i}\right) (100)

where ai(j)a^{(j)}_{i} denotes the jj-th coordinate of aia_{i}, and the first inequality is derived by using Cauchy-Schwarz inequality.

Regarding 𝔼[Bi2]\mathbb{E}[\|B_{i}\|^{2}], from Lemma 2,

𝔼[Bi2]σFi2.\displaystyle\mathbb{E}[\|B_{i}\|^{2}]\leq\sigma^{2}_{F_{i}}. (101)

Substituting (98) in 𝔼[Gk]\mathbb{E}[G^{k}], we obtain

𝔼[Gk]\displaystyle\mathbb{E}[G^{k}] =𝔼[βF(θk)(1nki𝒩k~Fi(θk))LFβ221nki𝒩k~Fi(θk)2]\displaystyle=\mathbb{E}\left[\beta\nabla F({\theta}^{k})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right)-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right]
=𝔼[βnki𝒩k(Ai+Bi+~Fi(θk))~Fi(θk)LFβ221nki𝒩k~Fi(θk)2]\displaystyle=\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A_{i}+B_{i}+\tilde{\nabla}F_{i}({\theta}^{k})\right)^{\top}\tilde{\nabla}F_{i}({\theta}^{k})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right]
=𝔼[βnki𝒩k(Ai~Fi(θk)+Bi~Fi(θk)+~Fi(θk)2)]LFβ22𝔼[1nki𝒩k~Fi(θk)2].\displaystyle=\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+\left\|\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right)\right]-\frac{L_{F}\beta^{2}}{2}\mathbb{E}\left[\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right]. (102)

Note that, for any random variables a,bda,b\in\mathbb{R}^{d} and for c0c\neq 0,

𝔼[ab]𝔼[2(ca)b2c](c2𝔼[a2]+𝔼[b2]4c2).\displaystyle\mathbb{E}\left[a^{\top}b\right]\geq-\mathbb{E}\left[2\left\|\left(ca\right)^{\top}\frac{b}{2c}\right\|\right]\geq-\left(c^{2}\mathbb{E}\left[\|a\|^{2}\right]+\frac{\mathbb{E}\left[\|b\|^{2}\right]}{4c^{2}}\right). (103)

With g(x)x𝔼[a2]+𝔼[b2]/(4x)g(x)\coloneqq x\mathbb{E}[\|a\|^{2}]+\mathbb{E}[\|b\|^{2}]/(4x), we have

x=argminxg(x)=𝔼[b2]4𝔼[a2]\displaystyle x^{*}=\mathop{\arg\min}_{x}g(x)=\sqrt{\frac{\mathbb{E}\left[\left\|b\right\|^{2}\right]}{4\mathbb{E}\left[\left\|a\right\|^{2}\right]}} (104)

which implies that setting c2=xc^{2}=x^{*} minimizes the right-hand side of (103), i.e., gives the tightest lower bound of this form. Substituting this choice into (103), we have

𝔼[ab]𝔼[a2]𝔼[b2].\displaystyle\mathbb{E}\left[a^{\top}b\right]\geq-\sqrt{\mathbb{E}\left[\left\|a\right\|^{2}\right]\mathbb{E}\left[\left\|b\right\|^{2}\right]}. (105)
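For completeness, the substitution c^{2}=x^{*} evaluates the right-hand side of (103) as

c^{2}\mathbb{E}\left[\|a\|^{2}\right]+\frac{\mathbb{E}\left[\|b\|^{2}\right]}{4c^{2}}=\frac{1}{2}\sqrt{\mathbb{E}\left[\|a\|^{2}\right]\mathbb{E}\left[\|b\|^{2}\right]}+\frac{1}{2}\sqrt{\mathbb{E}\left[\|a\|^{2}\right]\mathbb{E}\left[\|b\|^{2}\right]}=\sqrt{\mathbb{E}\left[\|a\|^{2}\right]\mathbb{E}\left[\|b\|^{2}\right]}

which is exactly the bound in (105).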

Based on (105) and (102), together with the tower rule, we can bound 𝔼[Gk]\mathbb{E}[G^{k}] as follows

𝔼[Gk]=\displaystyle\mathbb{E}[G^{k}]=\; 𝔼[βnki𝒩k(Ai~Fi(θk)+Bi~Fi(θk)+~Fi(θk)2)]LFβ22𝔼[1nki𝒩k~Fi(θk)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})+\left\|\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right)\right]-\frac{L_{F}\beta^{2}}{2}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
=\displaystyle=\; 𝔼[βnki𝒩k(𝔼[Ai~Fi(θk)|𝒩k]+𝔼[Bi~Fi(θk)|𝒩k]+~Fi(θk)2)]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(\mathbb{E}\left[A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})\Big{|}\mathcal{N}_{k}\right]+\mathbb{E}\left[B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k})\Big{|}\mathcal{N}_{k}\right]+\left\|\tilde{\nabla}F_{i}({\theta}^{k})\right\|^{2}\right)\right]
LFβ22𝔼[1nki𝒩k~Fi(θk)2]\displaystyle-\frac{L_{F}\beta^{2}}{2}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
\displaystyle\geq\; 𝔼[βnki𝒩k(𝔼[Ai2|𝒩k]𝔼[~Fi(θk)2|𝒩k]𝔼[Bi2|𝒩k]𝔼[~Fi(θk)2|𝒩k])]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(-\sqrt{\mathbb{E}\left[\left\|A_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}-\sqrt{\mathbb{E}\left[\left\|B_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right]
+β(1LFβ2)𝔼[1nki𝒩k~Fi(θk)2]\displaystyle+\beta\left(1-\frac{L_{F}\beta}{2}\right)\mathbb{E}\bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
\displaystyle\geq\; 𝔼[βnki𝒩k((1+αL)2γG+αζγH+σFi)𝔼[~Fi(θk)2|𝒩k]]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}-\left(\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\sigma_{F_{i}}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right] (from (G) and (101))
+β(1LFβ2)𝔼[1nki𝒩k~Fi(θk)2]\displaystyle+\beta\left(1-\frac{L_{F}\beta}{2}\right)\mathbb{E}\bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\Big{\|}^{2}\bigg{]}
=\displaystyle=\; β𝔼[1nki𝒩k(1LFβ2)~Fi(θk)2((1+αL)2γG+αζγH+σFi)𝔼[~Fi(θk)2|𝒩k]]\displaystyle\beta\mathbb{E}\left[\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(1-\frac{L_{F}\beta}{2}\right)\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}-\left(\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\sigma_{F_{i}}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right] (106)

thereby completing the proof.

Appendix H Proof of Theorem 5

First, it is easy to see that g2(𝒛,𝒑)g_{2}(\bm{z},\bm{p}) is upper bounded by i𝒩ui\sum_{i\in\mathcal{N}}u_{i}. Moreover, due to (43), we have

𝒛t+1=argmax𝒛{i𝒩mzi,m(uiη1δt(Im+BN0)(2SBδt1)hi),s.t.(26)(35)}.\displaystyle\bm{z}^{t+1}=\mathop{\arg\max}_{\bm{z}}\left\{\sum_{i\in\mathcal{N}}\sum_{m\in\mathcal{M}}z_{i,m}\left(u_{i}-\frac{\eta_{1}\delta^{t}(I_{m}+BN_{0})(2^{\frac{S}{B\delta^{t}}}-1)}{h_{i}}\right),~{}\mathrm{s.t.}~{}\eqref{eq:constraint_zm}-\eqref{eq:constraint_z01}\right\}. (107)

Based on (48), the optimal transmission power 𝒑^t+1\hat{\bm{p}}^{t+1} corresponding to 𝒛t+1\bm{z}^{t+1} is given by

p^it+1={(Imi+BN0)(2SBδt1)hi,ifmzi,mt+1=10,otherwise\displaystyle\hat{p}^{t+1}_{i}=\begin{cases}\frac{(I_{m^{*}_{i}}+BN_{0})(2^{\frac{S}{B\delta^{t}}}-1)}{h_{i}}&,~{}\text{if}~{}\sum_{m\in\mathcal{M}}z^{t+1}_{i,m}=1\\ 0&,~{}\text{otherwise}\end{cases} (108)

where we slightly abuse notation and denote by mim^{*}_{i} the RB allocated to device ii, i.e., zi,mit+1=1z^{t+1}_{i,m^{*}_{i}}=1. Thus, by the optimality of 𝒛t+1\bm{z}^{t+1} in (107), we have g(𝒛t,𝒑t)g(𝒛t+1,𝒑^t+1)g(\bm{z}^{t},\bm{p}^{t})\leq g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1}). From (49), g(𝒛t+1,𝒑^t+1)g(𝒛t+1,𝒑t+1)g(\bm{z}^{t+1},\hat{\bm{p}}^{t+1})\leq g(\bm{z}^{t+1},\bm{p}^{t+1}). Combining the two inequalities, we obtain

g(𝒛t,𝒑t)g(𝒛t+1,𝒑t+1)\displaystyle g(\bm{z}^{t},\bm{p}^{t})\leq g(\bm{z}^{t+1},\bm{p}^{t+1}) (109)

i.e., the sequence {g(𝒛t,𝒑t)}\{g(\bm{z}^{t},\bm{p}^{t})\} is non-decreasing. Since it is also upper bounded, as noted at the beginning of the proof, it converges, thereby completing the proof.
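The argument above is an instance of block-coordinate (alternating) maximization: each block update is performed while the other block remains feasible, so the objective value cannot decrease. The following toy Python sketch illustrates only this monotonicity pattern; the surrogate objective g, the utilities u, the quadratic power penalty, and the closed-form power step are hypothetical stand-ins and do not correspond to the actual objective or the constraints (26)-(35) of the paper.

import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
u = rng.uniform(1.0, 2.0, size=n)   # toy per-device utilities (hypothetical)
A = rng.uniform(0.5, 1.5, size=n)   # toy per-device power targets (hypothetical)

def g(z, p):
    # toy surrogate: reward for selected devices minus a quadratic power penalty
    z = np.asarray(z, dtype=float)
    return float(np.dot(z, u) - np.sum((p - A * z) ** 2))

z = np.zeros(n)
p = np.zeros(n)
values = [g(z, p)]
for _ in range(5):
    # block 1: pick z with p set to its per-candidate optimum (here p = A*z),
    # mirroring the structure of (107)-(108)
    candidates = [np.array(c, dtype=float) for c in itertools.product([0, 1], repeat=n)]
    z = max(candidates, key=lambda c: g(c, A * c))
    # block 2: re-optimize p with z fixed (closed form for this toy objective)
    p = A * z
    values.append(g(z, p))

# the sequence of objective values is non-decreasing, as in (109)
assert all(b >= a - 1e-9 for a, b in zip(values, values[1:]))
print(values)

In this toy, as in (107)-(108), the discrete block is scored with the continuous block already set to its per-candidate optimum, which is what yields the first of the two inequalities used above.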

Appendix I Proof of Corollaries 1 and 2

I-A Proof of Corollary 1

Define the auxiliary variable θk,t1nki𝒩kθik,t{\theta}^{k,t}\coloneqq\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}{\theta}^{k,t}_{i} as a “virtual” global model at local step tt, where θik,t{\theta}^{k,t}_{i} is defined in (2). Similar to the derivation in Appendix G, the following holds

F(θk,t+1)\displaystyle F({\theta}^{k,t+1}) F(θk,t)+F(θk,t)(θk,t+1θk,t)+LF2θk,t+1θk,t2\displaystyle\leq F({\theta}^{k,t})+\nabla F({\theta}^{k,t})^{\top}({\theta}^{k,t+1}-{\theta}^{k,t})+\frac{L_{F}}{2}\|{\theta}^{k,t+1}-{\theta}^{k,t}\|^{2} (from Lemma 1)
=F(θk,t)+F(θk,t)(1nki𝒩kθik,t+1θik,t)+LF2θk,t+1θk,t2\displaystyle=F({\theta}^{k,t})+\nabla F({\theta}^{k,t})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}{\theta}^{k,t+1}_{i}-{\theta}^{k,t}_{i}\right)+\frac{L_{F}}{2}\|{\theta}^{k,t+1}-{\theta}^{k,t}\|^{2} (from the definition of θk,t\theta^{k,t})
=F(θk,t)βF(θk,t)(1nki𝒩k~Fi(θk,t))+LFβ221nki𝒩k~Fi(θk,t)2.\displaystyle=F({\theta}^{k,t})-\beta\nabla F({\theta}^{k,t})^{\top}\left(\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right)+\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}. (from (2))

Denote

Gk,tβnki𝒩kF(θk,t)~Fi(θk,t)LFβ221nki𝒩k~Fi(θk,t)2.\displaystyle G^{k,t}\coloneqq\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\nabla F({\theta}^{k,t})^{\top}\tilde{\nabla}F_{i}({\theta}^{k,t})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}. (110)

We have 𝔼[F(θk,t)F(θk,t+1)]𝔼[Gk,t]\mathbb{E}[F({\theta}^{k,t})-F({\theta}^{k,t+1})]\geq\mathbb{E}[G^{k,t}]. Rewrite F(θk,t)\nabla F({\theta}^{k,t}) for each i𝒩ki\in\mathcal{N}_{k} as follows

F(θk,t)=F(θk,t)F(θik,t)Ci+F(θik,t)Fi(θik,t)Ai+Fi(θik,t)~Fi(θik,t)Bi+~Fi(θik,t).\displaystyle\nabla F({\theta}^{k,t})=\underbrace{\nabla F({\theta}^{k,t})-\nabla F({\theta}^{k,t}_{i})}_{C_{i}}+\underbrace{\nabla F({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i})}_{A_{i}}+\underbrace{\nabla F_{i}({\theta}^{k,t}_{i})-\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})}_{B_{i}}+\tilde{\nabla}F_{i}({\theta}^{k,t}_{i}). (111)

Note that, in (111), AiA_{i} and BiB_{i} are analogous to those in (98) with θk\theta^{k} replaced by θik,t\theta^{k,t}_{i}. Thus, from (99) and (101), 𝔼[Ai2]\mathbb{E}[\|A_{i}\|^{2}] and 𝔼[Bi2]\mathbb{E}[\|B_{i}\|^{2}] are bounded by

𝔼[Ai2]\displaystyle\mathbb{E}[\|A_{i}\|^{2}] (1+αL)2γG+αζγH\displaystyle\leq\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H} (112)
𝔼[Bi2]\displaystyle\mathbb{E}[\|B_{i}\|^{2}] σFi2.\displaystyle\leq\sigma^{2}_{F_{i}}. (113)

Regarding 𝔼[Ci2]\mathbb{E}[\|C_{i}\|^{2}], we can write

𝔼[Ci2]\displaystyle\mathbb{E}[\|C_{i}\|^{2}] =𝔼[F(θk,t)F(θik,t)2]\displaystyle=\mathbb{E}\left[\left\|\nabla F({\theta}^{k,t})-\nabla F({\theta}^{k,t}_{i})\right\|^{2}\right]
LF2𝔼[θk,tθik,t2].\displaystyle\leq L^{2}_{F}\mathbb{E}\left[\left\|{\theta}^{k,t}-{\theta}^{k,t}_{i}\right\|^{2}\right]. (From Lemma 1)

To bound the right-hand side of the above inequality, denote

atmaxi𝒩k{𝔼[θk,tθik,t2]}\displaystyle a_{t}\coloneqq\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t}-{\theta}^{k,t}_{i}\right\|^{2}\right]\right\} (114)

with a0=0a_{0}=0. Then we have

at+1\displaystyle a_{t+1} =maxi𝒩k{𝔼[θk,t+1θik,t+12]}\displaystyle=\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t+1}-{\theta}^{k,t+1}_{i}\right\|^{2}\right]\right\}
=maxi𝒩k{𝔼[θik,tβ~Fi(θik,t)1nkj𝒩k(θjk,tβ~Fj(θjk,t))2]}\displaystyle=\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-\beta\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left({\theta}^{k,t}_{j}-\beta\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right]\right\}
maxi𝒩k{(1+ϵ)𝔼[θik,t1nkj𝒩kθjk,t2]+β2(1+1ϵ)𝔼[~Fi(θik,t)1nkj𝒩k~Fj(θjk,t)2]}\displaystyle\leq\max_{i\in\mathcal{N}_{k}}\left\{(1+\epsilon)\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}{\theta}^{k,t}_{j}\right\|^{2}\right]+\beta^{2}\left(1+\frac{1}{\epsilon}\right)\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]\right\} (from [9, Equation (68)])
(1+ϵ)maxi𝒩k{𝔼[θik,t1nkj𝒩kθjk,t2]}+β2(1+1ϵ)maxi𝒩k{𝔼[~Fi(θik,t)1nkj𝒩k~Fj(θjk,t)2]}\displaystyle\leq(1+\epsilon)\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}{\theta}^{k,t}_{j}\right\|^{2}\right]\right\}+\beta^{2}\left(1+\frac{1}{\epsilon}\right)\max_{i\in\mathcal{N}_{k}}\left\{\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]\right\}
=(1+ϵ)at+β2(1+1ϵ)maxi𝒩k{𝔼[~Fi(θik,t)1nkj𝒩k~Fj(θjk,t)2]Hi}\displaystyle=(1+\epsilon)a_{t}+\beta^{2}\left(1+\frac{1}{\epsilon}\right)\max_{i\in\mathcal{N}_{k}}\Bigg{\{}\underbrace{\mathbb{E}\bigg{[}\bigg{\|}\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\bigg{\|}^{2}\bigg{]}}_{H_{i}}\Bigg{\}} (115)

for any ϵ>0\epsilon>0. For HiH_{i}, we first write

Hi2𝔼[Fi(θik,t)1nkj𝒩kFj(θjk,t)2]Hi,1+2𝔼[1nkj𝒩k(Fj(θjk,t)~Fj(θjk,t))+(~Fi(θik,t)Fi(θik,t))2]Hi,2.\displaystyle H_{i}\leq 2\underbrace{\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\nabla F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]}_{H_{i,1}}+2\underbrace{\mathbb{E}\left[\left\|\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t}_{j})-\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right)+\left(\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i})\right)\right\|^{2}\right]}_{H_{i,2}}. (116)

Next, we bound Hi,1H_{i,1} and Hi,2H_{i,2} separately.

  • Upper Bound of Hi,1H_{i,1}: Based on Lemma 3, we have

    Hi,1=\displaystyle H_{i,1}=\; 𝔼[1nkj𝒩k(Fi(θk,t)Fj(θk,t))+Fi(θik,t)Fi(θk,t)+1nkj𝒩k(Fj(θk,t)Fj(θjk,t))2]\displaystyle\mathbb{E}\left[\left\|\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{i}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t})\right)+\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right]
    \displaystyle\leq\; 2nkj𝒩k𝔼[Fi(θk,t)Fj(θk,t)2]\displaystyle\frac{2}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t})\right\|^{2}\right] (from (G))
    +2𝔼[Fi(θik,t)Fi(θk,t)+1nkj𝒩k(Fj(θk,t)Fj(θjk,t))2]\displaystyle+2\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right]
    \displaystyle\leq\; 2γG2+2𝔼[Fi(θik,t)Fi(θk,t)+1nkj𝒩k(Fj(θk,t)Fj(θjk,t))2]\displaystyle 2\gamma^{2}_{G}+2\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left(\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right)\right\|^{2}\right] (from Lemma 3)
    \displaystyle\leq\; 2γG2+4𝔼[Fi(θik,t)Fi(θk,t)2+1nkj𝒩kFj(θk,t)Fj(θjk,t)2]\displaystyle 2\gamma^{2}_{G}+4\mathbb{E}\left[\left\|\nabla F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t})\right\|^{2}+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left\|\nabla F_{j}({\theta}^{k,t})-\nabla F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right] (using Cauchy-Schwarz inequality similar to (118))
    \displaystyle\leq\; 2γG2+4LF2𝔼[θik,tθk,t2+1nkj𝒩kθk,tθjk,t2]\displaystyle 2\gamma^{2}_{G}+4L^{2}_{F}\mathbb{E}\left[\left\|{\theta}^{k,t}_{i}-{\theta}^{k,t}\right\|^{2}+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left\|{\theta}^{k,t}-{\theta}^{k,t}_{j}\right\|^{2}\right]
    \displaystyle\leq\; 2γG2+8LF2at.\displaystyle 2\gamma^{2}_{G}+8L^{2}_{F}a_{t}. (117)
  • Upper Bound of Hi,2H_{i,2}: Due to the Cauchy-Schwarz inequality, the following holds

    j=1nk+1xjyj2(j=1nk+1xj2)(j=1nk+1yj2)\displaystyle\left\|\sum^{n_{k}+1}_{j=1}x_{j}y_{j}\right\|^{2}\leq\left(\sum^{n_{k}+1}_{j=1}\left\|x_{j}\right\|^{2}\right)\left(\sum^{n_{k}+1}_{j=1}\left\|y_{j}\right\|^{2}\right) (118)

    where xj=1nkx_{j}=\frac{1}{\sqrt{n_{k}}} and yj=1nk(Fj(θjk,t)~Fj(θjk,t))y_{j}=\frac{1}{\sqrt{n_{k}}}(\nabla F_{j}({\theta}^{k,t}_{j})-\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})) for 1jnk1\leq j\leq n_{k}; xnk+1=1x_{n_{k}+1}=1 and ynk+1=~Fi(θik,t)Fi(θik,t)y_{n_{k}+1}=\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i}). Thus, we have

    Hi,2\displaystyle H_{i,2} 2𝔼[~Fi(θik,t)Fi(θik,t)2+1nkj𝒩kFj(θjk,t)~Fj(θjk,t)2]\displaystyle\leq 2\mathbb{E}\left[\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})-\nabla F_{i}({\theta}^{k,t}_{i})\right\|^{2}+\frac{1}{n_{k}}\sum_{j\in\mathcal{N}_{k}}\left\|\nabla F_{j}({\theta}^{k,t}_{j})-\tilde{\nabla}F_{j}({\theta}^{k,t}_{j})\right\|^{2}\right]
    4σF2\displaystyle\leq 4\sigma^{2}_{F} (using Lemma 2)

    where σF=maxi𝒩{σFi}\sigma_{F}=\max_{i\in\mathcal{N}}\{\sigma_{F_{i}}\}.

Based on the above results, substituting (116) into (115), we obtain

at+1\displaystyle a_{t+1} (1+ϵ)at+2β2(1+1ϵ)maxi𝒩k{Hi,1+Hi,2}\displaystyle\leq(1+\epsilon)a_{t}+2\beta^{2}\left(1+\frac{1}{\epsilon}\right)\max_{i\in\mathcal{N}_{k}}\left\{H_{i,1}+H_{i,2}\right\}
(1+ϵ+16β2LF2(1+1ϵ))at+4β2(1+1ϵ)(γG2+2σF2).\displaystyle\leq\left(1+\epsilon+16\beta^{2}L^{2}_{F}\left(1+\frac{1}{\epsilon}\right)\right)a_{t}+4\beta^{2}\left(1+\frac{1}{\epsilon}\right)\left(\gamma^{2}_{G}+2\sigma^{2}_{F}\right). (119)

Note that (119) is essentially the same as [9, Equation (78)]. Therefore, since β[0,1/(10LFτ))\beta\in[0,1/(10L_{F}\tau)), it follows from [9, Corollary F.2] that

𝔼[Ci2]at35β2tτ(γG2+2σF2).\displaystyle\mathbb{E}[\|C_{i}\|^{2}]\leq a_{t}\leq 35\beta^{2}t\tau(\gamma^{2}_{G}+2\sigma^{2}_{F}). (120)
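As an illustrative numerical check (not part of the proof), the Python sketch below unrolls the recursion (119) for t ≤ τ and compares the result against the bound in (120). The parameter ranges, the number of random trials, and in particular the choice ε = 1/(τ−1) are assumptions made here for illustration; the constant 35 in (120) comes from [9, Corollary F.2], whose exact argument is not reproduced.

import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    L_F = rng.uniform(0.1, 10.0)
    tau = int(rng.integers(2, 50))
    beta = rng.uniform(0.0, 1.0) / (10.0 * L_F * tau)  # beta in [0, 1/(10 L_F tau))
    gamma_G2 = rng.uniform(0.0, 5.0)   # stands for gamma_G^2
    sigma_F2 = rng.uniform(0.0, 5.0)   # stands for sigma_F^2
    eps = 1.0 / (tau - 1)              # assumed choice of epsilon
    a = 0.0                            # a_0 = 0
    for t in range(1, tau + 1):
        # one step of recursion (119), taken with equality (worst case)
        a = (1 + eps + 16 * beta**2 * L_F**2 * (1 + 1 / eps)) * a \
            + 4 * beta**2 * (1 + 1 / eps) * (gamma_G2 + 2 * sigma_F2)
        # compare against the bound stated in (120)
        assert a <= 35 * beta**2 * t * tau * (gamma_G2 + 2 * sigma_F2) + 1e-12
print("recursion (119) stays below the bound in (120) for all sampled settings")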

Similar to (106), substituting (111) in 𝔼[Gk,t]\mathbb{E}[G^{k,t}], we obtain

𝔼[Gk,t]=\displaystyle\mathbb{E}[G^{k,t}]=\; 𝔼[βnki𝒩kF(θk,t)~Fi(θk,t)LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\nabla F({\theta}^{k,t})^{\top}\tilde{\nabla}F_{i}({\theta}^{k,t})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
=\displaystyle=\; 𝔼[βnki𝒩k(Ai+Bi+Ci+~Fi(θik,t))~Fi(θk,t)LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A_{i}+B_{i}+C_{i}+\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})\right)^{\top}\tilde{\nabla}F_{i}({\theta}^{k,t})-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
=\displaystyle=\; 𝔼[βnki𝒩k(Ai~Fi(θk,t)+Bi~Fi(θk,t)+Ci~Fi(θk,t)+~Fi(θik,t)2)LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})+B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})+C^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})+\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})\right\|^{2}\right)-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
=\displaystyle=\; 𝔼[βnki𝒩k(𝔼[Ai~Fi(θk,t)|𝒩k]+𝔼[Bi~Fi(θk,t)|𝒩k]+𝔼[Ci~Fi(θk,t)|𝒩k]+~Fi(θik,t)2)\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(\mathbb{E}\Big{[}A^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{|}\mathcal{N}_{k}\Big{]}+\mathbb{E}\Big{[}B^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{|}\mathcal{N}_{k}\Big{]}+\mathbb{E}\Big{[}C^{\top}_{i}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{|}\mathcal{N}_{k}\Big{]}+\left\|\tilde{\nabla}F_{i}({\theta}^{k,t}_{i})\right\|^{2}\right)\right.
LFβ221nki𝒩k~Fi(θk,t)2]\displaystyle\left.-\frac{L_{F}\beta^{2}}{2}\left\|\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\tilde{\nabla}F_{i}({\theta}^{k,t})\right\|^{2}\right]
\displaystyle\geq\; 𝔼[βnki𝒩k(𝔼[Ai2|𝒩k]𝔼[~Fi(θk,t)2|𝒩k]𝔼[Bi2|𝒩k]𝔼[~Fi(θk,t)2|𝒩k]\displaystyle\mathbb{E}\left[\frac{\beta}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(-\sqrt{\mathbb{E}\left[\left\|A_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}-\sqrt{\mathbb{E}\left[\left\|B_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right.\right.
𝔼[Ci2|𝒩k]𝔼[~Fi(θk,t)2|𝒩k])]+β(1LFβ2)𝔼[1nki𝒩k~Fi(θk,t)2]\displaystyle-\left.\left.\sqrt{\mathbb{E}\left[\left\|C_{i}\right\|^{2}\Big{|}\mathcal{N}_{k}\right]}\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right]+\beta\left(1-\frac{L_{F}\beta}{2}\right)\mathbb{E}\bigg{[}\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{\|}^{2}\bigg{]} (using (105) and (G), and rearranging terms)
\displaystyle\geq\; β𝔼[1nki𝒩k((1LFβ2)~Fi(θk,t)2\displaystyle\beta\mathbb{E}\left[\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\left(\left(1-\frac{L_{F}\beta}{2}\right)\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{\|}^{2}\right.\right.
((1+αL)2γG+αζγH+σFi+β35tτ(γG2+2σF2))𝔼[~Fi(θk,t)2|𝒩k])].\displaystyle\left.\left.-\left(\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\sigma_{F_{i}}+\beta\sqrt{35t\tau(\gamma^{2}_{G}+2\sigma^{2}_{F})}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right]. (from (112), (113), and (120))

Therefore, the following holds

𝔼[F(θk)F(θk+1)]\displaystyle\mathbb{E}\left[F({\theta}^{k})-F({\theta}^{k+1})\right] =𝔼[t=0τ1F(θk,t)F(θk,t+1)]\displaystyle=\mathbb{E}\left[\sum^{\tau-1}_{t=0}F({\theta}^{k,t})-F({\theta}^{k,t+1})\right]
t=0τ1𝔼[Gk,t]\displaystyle\geq\sum^{\tau-1}_{t=0}\mathbb{E}\left[G^{k,t}\right]
β2𝔼[1nki𝒩kt=0τ1(~Fi(θk,t)22(λ1+λ2Di)𝔼[~Fi(θk,t)2|𝒩k])]\displaystyle\geq\frac{\beta}{2}\mathbb{E}\left[\frac{1}{n_{k}}\sum_{i\in\mathcal{N}_{k}}\sum^{\tau-1}_{t=0}\left(\Big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\Big{\|}^{2}-2\left(\lambda_{1}+\frac{\lambda_{2}}{\sqrt{D_{i}}}\right)\sqrt{\mathbb{E}\Big{[}\big{\|}\tilde{\nabla}F_{i}({\theta}^{k,t})\big{\|}^{2}\Big{|}\mathcal{N}_{k}\Big{]}}\right)\right] (substituting (2))

where

λ1\displaystyle\lambda_{1} (1+αL)2γG+αζγH+βτ35(γG2+2σF2)\displaystyle\geq\sqrt{\left(1+\alpha L\right)^{2}\gamma_{G}+\alpha\zeta\gamma_{H}}+\beta\tau\sqrt{35(\gamma^{2}_{G}+2\sigma^{2}_{F})} (121)
λ2\displaystyle\lambda_{2} 6σG2(1+(αL)2)((ασH)2+(1+αL)2)+3(αζσH)2.\displaystyle\geq\sqrt{6\sigma^{2}_{G}\left(1+(\alpha L)^{2}\right)\left((\alpha\sigma_{H})^{2}+(1+\alpha L)^{2}\right)+3(\alpha\zeta\sigma_{H})^{2}}. (122)

This completes the proof.

I-B Proof of Corollary 2

The result of Corollary 2 can be obtained by combining [9, Lemma H.1] and [9, Lemma H.2] with the proof of Lemma 1.