
Automated Federated Learning in Mobile Edge Networks — Fast Adaptation and Convergence

Chaoqun You,  Kun Guo, 
Gang Feng,  Peng Yang, 
and Tony Q. S. Quek
C. You and T. Quek are with the Singapore University of Technology and Design, 487372, Singapore (e-mail: [email protected] and [email protected]). K. Guo is with East China Normal University, Shanghai 200241, P.R. China (e-mail: [email protected]). G. Feng is with the University of Electronic Science and Technology of China, Chengdu 611731, P.R. China (e-mail: [email protected]). P. Yang is with Beihang University, Beijing 100088, P.R. China (e-mail: [email protected]).
Abstract

Federated Learning (FL) can be used in mobile edge networks to train machine learning models in a distributed manner. Recently, FL has been interpreted within a Model-Agnostic Meta-Learning (MAML) framework, which brings FL significant advantages in fast adaptation and convergence over heterogeneous datasets. However, existing research simply combines MAML and FL without explicitly addressing how much benefit MAML brings to FL and how to maximize such benefit over mobile edge networks. In this paper, we quantify the benefit from two aspects: optimizing FL hyperparameters (i.e., sampled data size and the number of communication rounds) and resource allocation (i.e., transmit power) in mobile edge networks. Specifically, we formulate the MAML-based FL design as an overall learning time minimization problem, under the constraints of model accuracy and energy consumption. Facilitated by the convergence analysis of MAML-based FL, we decompose the formulated problem and then solve it using analytical solutions and the coordinate descent method. With the obtained FL hyperparameters and resource allocation, we design a MAML-based FL algorithm, called Automated Federated Learning (AutoFL), that is able to conduct fast adaptation and convergence. Extensive experimental results verify that AutoFL outperforms other benchmark algorithms regarding the learning time and convergence performance.

Index Terms:
Fast adaptation and convergence, federated learning, model-agnostic meta-learning, mobile edge networks

I Introduction

In recent years, modern mobile user equipments (UEs) such as smartphones and wearable devices have been equipped with advanced sensing and computing capabilities [1]. This gives them access to a wealth of data suitable for model learning, which opens up countless opportunities for meaningful applications such as Artificial Intelligence (AI) medical diagnosis [2] and air quality monitoring [3]. Traditionally, learning models requires data to be processed in a cloud data center [4]. However, due to the long distance between the devices where data is generated and the servers in data centers, cloud-based Machine Learning (ML) for mobile UEs may incur unacceptable latency and communication overhead. Therefore, Mobile Edge Computing (MEC) [5, 6] has been proposed to facilitate the deployment of servers near the base station (BS) in mobile edge networks, so as to bring intelligence to the network edge.

In mobile edge networks, a traditional ML paradigm inevitably requires uploading raw data from mobile UEs to a server deployed near the BS for model learning. However, the unprecedented amount of data created by mobile UEs is private in nature, leading to increasing concerns over data security and user privacy. To address this issue, Federated Learning (FL) has been proposed [7] as a new ML paradigm. FL in mobile edge networks refers to training models across multiple distributed UEs without ever uploading their raw data to the server. In particular, UEs compute local updates to the current global model based on their own local data, which are then aggregated and fed back by an edge server, so that all UEs have access to the same global model for their new local updates. Such a procedure is implemented in one communication round and is repeated until a certain model accuracy is reached.

Despite its promising benefits, FL also comes with new challenges in practice. In particular, the datasets across UEs are heterogeneous. Not only does the number of data samples generated by UEs vary, but these data samples are usually not independent and identically distributed (non-i.i.d). Learning from such heterogeneous data is not easy, as conventional FL algorithms usually develop a common model for all UEs [8–10], so that the global model obtained by minimizing the average loss could perform arbitrarily poorly once applied to the local dataset of a specific UE. That is, the global model derived from conventional FL algorithms may adapt poorly to local UEs. Such weak adaptation further restricts the convergence rate of FL algorithms.

Multiple techniques are emerging as promising solutions to the data heterogeneity problem, such as adding user context [8], transfer learning [9], multi-task learning [10], and Model-Agnostic Meta-Learning (MAML) [11]. Of all these techniques, MAML is the only one that not only addresses the heterogeneous data problem, but also greatly speeds up the FL learning process. MAML is a well-known meta-learning [12] approach that learns from past experience to adapt to new tasks much faster. Such adaptation makes it possible for MAML to develop well-behaved user-specific models. More specifically, MAML aims to find a sensitive initial point, learned from past experience, from which fast adaptation requires only a few data points on each UE. With the learned initial model, MAML dramatically speeds up the learning process by replacing hand-engineered algorithms with an automated, data-driven approach [13]. Fortunately, in both MAML and FL, existing algorithms use a variant of the gradient descent method locally and send an overall update to a coordinator to update the global model. This similarity makes it possible to interpret FL within a MAML framework [14, 15, 16, 17]. Such a simple MAML-based FL is termed Personalized Federated Learning (PFL), and the algorithm that realizes PFL is termed Personalized Federated Averaging (Per-FedAvg). However, Per-FedAvg is a pure ML algorithm, and how to apply it to practical mobile edge networks remains unclear. Recently, several attempts have been made to study the implementation of Per-FedAvg in practice. [18] proposes a framework to execute Per-FedAvg for intelligent IoT applications without considering Per-FedAvg's strength in saving learning time. [19], although it aims to achieve fast learning for IoT applications at the edge, does not consider resource allocation in a practical mobile edge network. Therefore, to what degree a MAML-based FL algorithm expedites FL in mobile edge networks, and under what conditions such benefit can be achieved, remain unexplored.

In order to quantify the benefit MAML brings to FL in mobile edge networks, we consider two aspects to minimize the overall learning time: optimizing learning-related hyperparameters and resource allocation. On the one hand, the typical FL hyperparameters, including the sampled data sizes across UEs and the number of communication rounds, have a significant impact on the overall learning time and thus need to be carefully specified. On the other hand, resource allocation should be considered to account for UEs' practical wireless communication environments and limited battery lifetimes. In particular, due to random channel gain and noise over wireless channels, the transmit power allocated to a UE decides whether the transmitted updates from that UE can be successfully decoded by the edge server. Restricted by the limited battery lifetime, it also determines whether the remaining energy on that UE is sufficient to support its local training.

It is non-trivial to solve the formulated optimization problem along the above two quantitative dimensions, given that the relationship between the variables (i.e., the sampled data sizes across UEs, the number of communication rounds, and the transmit power of UEs) and the model accuracy $\epsilon$ is implicit. Therefore, we start with the convergence analysis of the MAML-based FL algorithm, by which the three variables are bounded as functions of $\epsilon$. After that, the formulated optimization problem can be approximately decoupled into three sub-problems, each of which accounts for one of the three variables. Specifically, solving the first sub-problem yields the required number of communication rounds. As for the other two sub-problems, we use the coordinate descent method [20] to compute the sampled data size and the transmit power of all UEs iteratively. In other words, we first give an initial value of the transmit power of UEs and use it to compute the sampled data sizes in the second sub-problem; then, with the obtained sampled data sizes, we compute the transmit power in the third sub-problem. This process repeats until a certain model accuracy is achieved.

The solution to the optimization problem guides us to the design of Automated Federated Learning (AutoFL). AutoFL uses the results derived from the optimization problem to design the learning process, thereby quantifying as well as maximizing the benefit MAML brings to FL over mobile edge networks. More specifically, in each round $k$, the BS first sends the current model parameter $w_{k}$ to a random subset of UEs. According to the given model accuracy $\epsilon$, the sampled data sizes and the transmit power of the selected UEs are determined by our proposed solution. Each selected UE then trains its local model with the determined sampled data size and transmits the local update to the edge server with the determined transmit power. The server receives local updates from the selected UEs and decides whether these updates can be successfully decoded. The successfully decoded updates are then aggregated to update the global model as $w_{k+1}$. Such a communication round is executed repeatedly until the model accuracy $\epsilon$ is achieved. In this way, AutoFL is able to inherit the advantages of MAML over mobile edge networks, thereby conducting model learning with fast adaptation and convergence.

To summarize, in this paper we make the following contributions:

  • We provide a comprehensive problem formulation for the MAML-based FL design over mobile edge networks, which accounts for the impact of FL hyperparameters and network parameters on the model accuracy and learning time at the same time. Specifically, we jointly optimize the UEs' sampled data size and transmit power, as well as the number of communication rounds, to quantify and maximize the benefit MAML brings to FL, by which the overall learning time is minimized under the constraints of model accuracy $\epsilon$ and energy consumption.

  • We analyse the convergence of the MAML-based FL algorithm as the first step towards solving the formulated problem. The convergence analysis makes the relationships between the optimization variables and the model accuracy explicit, and in particular characterizes the sampled data size, the transmit power, and the number of communication rounds as functions of the model accuracy $\epsilon$.

  • Applying the results of the convergence analysis, we decouple the formulated problem into three sub-problems. By solving these sub-problems, the number of communication rounds is characterized by a closed-form solution, while the sampled data sizes and the transmit power of all UEs in each round are obtained using the coordinate descent method. With the optimized hyperparameters and resource allocation, we further propose AutoFL with fast adaptation and convergence.

  • By conducting extensive experiments on the MNIST and CIFAR-10 datasets, we demonstrate the effectiveness and advantages of AutoFL over Per-FedAvg and FedAvg, two baseline FL algorithms, in terms of learning time, model accuracy, and training loss.

The rest of this paper is organized as follows. We first give the system model and problem formulation in Section II. Then we give the convergence analysis to make the formulated problem tractable in Section III. We recast the optimization problem and then propose the solutions to guide the design of AutoFL in Section IV. Extensive experimental results are presented and discussed in Section V. Finally, we conclude this paper in Section VI.

II System Model and Problem Formulation

Consider a typical mobile edge network with an edge server co-deployed with a BS and $n$ UEs, where the UEs are indexed by $\mathcal{U}=\{1,\dots,n\}$, as illustrated in Fig. 1. In this section, we first explain why the FL problem can be interpreted within the MAML framework. Based on this framework, we then introduce our system model and problem formulation.

Refer to caption
Figure 1: An illustration of mobile edge network.

II-A Interpreting FL within the MAML framework

In FL, we consider a set of $n$ UEs connected to the server via the BS, where each UE has access only to its local data [7]. For a sample data point $\{x,y\}$ with input $x$, the goal of the server is to find the model parameter $w$ that characterizes the output $y$ with loss function $f(w):\mathbb{R}^{m}\rightarrow\mathbb{R}$, such that the value of $f(w)$ is minimized. More specifically, if we define $f_{i}(w):\mathbb{R}^{m}\rightarrow\mathbb{R}$ as the loss function of UE $i$, the goal of the server is to solve

\min_{w\in\mathbb{R}^{m}} f(w) := \frac{1}{n}\sum_{i=1}^{n} f_{i}(w). \qquad (1)

In particular, for each UE $i$, we have

f_{i}(w) := \frac{1}{D_{i}}\sum_{(x,y)\in\mathcal{D}_{i}} l_{i}(w;x,y), \qquad (2)

where $l_{i}(w;x,y)$ is the error between the true label $y\in\mathcal{Y}_{i}$ and the prediction of model $w$ using input $x\in\mathcal{X}_{i}$. Each UE $i$ has a local dataset $\mathcal{D}_{i}=\{x\in\mathcal{X}_{i},y\in\mathcal{Y}_{i}\}$, with $D_{i}=|\mathcal{D}_{i}|$ data samples. Since the datasets captured by the UEs are naturally heterogeneous, the probability distribution of $\mathcal{D}_{i}$ across UEs is not identical.
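To make the notation concrete, the following minimal sketch (in PyTorch) shows how the per-UE loss $f_i(w)$ in (2) and the global objective $f(w)$ in (1) could be evaluated; it assumes a classification loss and hypothetical `model` and `dataset` objects, which are not specified in the paper.

```python
import torch
import torch.nn.functional as F

def local_loss(model, dataset):
    """Empirical loss f_i(w) of one UE: the average of l_i(w; x, y)
    over that UE's local dataset, as in equation (2)."""
    losses = [F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
              for x, y in dataset]
    return torch.stack(losses).mean()

def global_loss(model, local_datasets):
    """Global objective f(w) in equation (1): the average of the
    per-UE losses over all n UEs."""
    return torch.stack([local_loss(model, d) for d in local_datasets]).mean()
```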

MAML, on the other hand, is one of the most attractive techniques in meta-learning. MAML is proposed to learn an initial model that adapts quickly to a new task through one or more gradient steps with only a few data points from that new task. Each task can be regarded as an object with its own dataset to learn from, just like a UE in FL. MAML allows us to replace hand-engineered algorithms with data-driven approaches that learn initial parameters automatically.

To show how the fundamental idea behind the MAML framework [11] can be exploited to design an automated variant of FL, let us first briefly recap the MAML formulation. In MAML, if we regard the tasks as UEs and assume each UE takes the initial model and updates it using one step of gradient descent with respect to its own loss function (a more general case is to perform multiple steps of gradient descent; however, this would incur the expensive cost of computing multiple Hessians, so for simplicity we consider only a single gradient step throughout this paper), problem (1) then changes to

\min_{w\in\mathbb{R}^{m}} F(w) := \frac{1}{n}\sum_{i=1}^{n} f_{i}\bigl(w-\alpha\nabla f_{i}(w)\bigr), \qquad (3)

where $\alpha\geq 0$ is the learning rate at a UE. For UE $i$, its optimization objective $F_{i}(w)$ can be expressed as

F_{i}(w) := f_{i}\bigl(w-\alpha\nabla f_{i}(w)\bigr). \qquad (4)

Such a transformation from problem (1) to (3) implies that the FL problem can be interpreted within the MAML framework. The FL algorithm proposed to solve (3) is termed as Personalized Federated Averaging (Per-FedAvg) [21, 14]. Per-FedAvg is inspired by FedAvg, which is proposed in [7] as a classic and general FL algorithm to solve (1) in a distributed manner.

Per-FedAvg is summarized in Algorithm 1. In each round $k$ ($k=1,\dots,K$), the central BS randomly picks a set of UEs $\mathcal{A}_{k}$ and then sends them the current model parameter $w_{k}$. Each UE first adapts the global model $w_{k}$ to its local data and obtains an intermediate parameter $\theta_{k}^{i}$, where $\theta_{k}^{i}=w_{k}-\alpha\nabla f_{i}(w_{k})$. Then, with $\theta_{k}^{i}$, UE $i$ updates its local model using one or more steps of gradient descent and obtains $w_{k+1}^{i}$. Such local model parameter updates are then sent to the server for model aggregation. The server updates the global model, or meta model, as $w_{k+1}$. This process repeats in the following rounds until a certain model accuracy is achieved or a predefined maximum number of rounds is reached.

Input: UE learning rate $\alpha$; BS learning rate $\beta$; initial model parameter $w_{0}$;
1  for $k=1$ to $K$ do
2      Choose a subset of UEs $\mathcal{A}_{k}$ uniformly at random with size $A_{k}$;
3      BS sends $w_{k}$ to all UEs in $\mathcal{A}_{k}$;
4      for each UE $i\in\mathcal{A}_{k}$ do
5          Compute $\theta_{k}^{i} := w_{k}-\alpha\nabla_{w_{k}}f_{i}(w_{k})$;
6          Compute $w_{k+1}^{i} := w_{k}-\beta\nabla_{w_{k}}F_{i}(\theta_{k}^{i}) = w_{k}-\beta\bigl(I-\alpha\nabla_{w_{k}}^{2}f_{i}(w_{k})\bigr)\nabla_{\theta_{k}^{i}}f_{i}(\theta_{k}^{i})$;
7          UE $i$ sends $w_{k+1}^{i}$ to the central BS;
8      end for
9      BS updates the global model as $w_{k+1}=\frac{1}{A_{k}}\sum_{i\in\mathcal{A}_{k}}w_{k+1}^{i}$;
10 end for
Algorithm 1 Per-FedAvg Pseudo-code

Per-FedAvg is a MAML-based FL algorithm, which suggests a general approach to using the MAML method for solving the data heterogeneity problem in FL. It is proposed with a focus on personalization in FL, a natural feature inherited from MAML. Beyond personalization, MAML is also a few-shot learning approach, requiring only a few samples to learn new tasks. Therefore, Per-FedAvg also inherits this feature of MAML, thereby adapting quickly from only a few samples. To what degree MAML speeds up FL and under what conditions such benefit can be achieved is exactly what this paper addresses.
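To illustrate lines 5 and 6 of Algorithm 1, the following sketch shows how one Per-FedAvg local update could be computed with automatic differentiation. It is only illustrative: it assumes the model parameters are flattened into a single vector and that a hypothetical helper `loss_fn(w, batch)` returns the local loss $f_i(w)$ on a batch, neither of which is specified in the paper.

```python
import torch

def per_fedavg_local_update(w, loss_fn, batch_inner, batch_outer, alpha, beta):
    """One local update of Algorithm 1: theta = w - alpha * grad f_i(w), then
    w_new = w - beta * (I - alpha * Hessian f_i(w)) grad f_i(theta)."""
    w = w.detach().clone().requires_grad_(True)

    # Inner step (line 5): theta depends on w through grad_w, so keep the graph.
    grad_w = torch.autograd.grad(loss_fn(w, batch_inner), w, create_graph=True)[0]
    theta = w - alpha * grad_w

    # Outer gradient g = grad f_i(theta), treated as a constant vector afterwards.
    g = torch.autograd.grad(loss_fn(theta, batch_outer), theta,
                            retain_graph=True)[0].detach()

    # Hessian-vector product H g = d/dw <grad f_i(w), g>.
    hvp = torch.autograd.grad(torch.dot(grad_w, g), w)[0]

    # Meta update (line 6): (I - alpha H) g = g - alpha * H g.
    return (w - beta * (g - alpha * hvp)).detach()
```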

II-B Machine Learning Model

We consider the above described MAML-based FL. In detail, we concentrate on the situation where UEs communicate in a synchronous manner, so as to avoid using outdated parameters for global model update and make high-quality refinement in each round. Meanwhile, for each UE, we consider the case that only one step of stochastic gradient descent (SGD) is performed, following the same setting as [11].

For the considered MAML-based FL, our goal is to optimize the initial model using only a few data points at each UE. Hence, we only obtain an estimate of the desired gradient with SGD. Here, the desired gradient $\nabla F_{i}(w)$ on UE $i$ is computed using all data points in its dataset $\mathcal{D}_{i}$, while the estimated gradient $\tilde{\nabla}F_{i}(w)$ on UE $i$ is computed using SGD with a sampled dataset $\mathcal{D}_{i}^{(\cdot)}\subseteq\mathcal{D}_{i}$. Note that the superscript $(\cdot)$ indicates that different sampled datasets are used to estimate the gradients and the Hessian involved in $\nabla F_{i}(w)$. Meanwhile, the sampled data size is $D_{i}^{(\cdot)}=|\mathcal{D}_{i}^{(\cdot)}|$.

More specifically, in order to solve (3), each UE $i$ computes the desired gradient in round $k$ as follows:

\nabla F_{i}(w_{k}) = \bigl(I-\alpha\nabla^{2}f_{i}(w_{k})\bigr)\nabla f_{i}\bigl(w_{k}-\alpha\nabla f_{i}(w_{k})\bigr). \qquad (5)

At every round, computing the gradient $\nabla f_{i}(w_{k})$ using all data points of UE $i$ is often computationally expensive. Therefore, we take a subset $\mathcal{D}_{i}^{\text{in}}$ of $\mathcal{D}_{i}$ to obtain an unbiased estimate $\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}})$ of $\nabla f_{i}(w_{k})$, which is given by

\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}}) = \frac{1}{D_{i}^{\text{in}}}\sum_{(x,y)\in\mathcal{D}_{i}^{\text{in}}}\nabla l_{i}(w_{k};x,y). \qquad (6)

Similarly, the outer gradient $\nabla f_{i}(\theta_{k}^{i})$ and the Hessian $\nabla^{2}f_{i}(w_{k})$ in (5) can be replaced by their unbiased estimates $\tilde{\nabla}f_{i}(\theta^{i}_{k};\mathcal{D}_{i}^{\text{o}})$ and $\tilde{\nabla}^{2}f_{i}(w_{k};\mathcal{D}_{i}^{\text{h}})$, respectively. Here, $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$ are also sampled from $\mathcal{D}_{i}$. Therefore, using SGD, we finally obtain an estimated local gradient $\tilde{\nabla}F_{i}(w_{k})$ on UE $i$ in round $k$, which is given by

\tilde{\nabla}F_{i}(w_{k}) = \bigl(I-\alpha\tilde{\nabla}^{2}f_{i}(w_{k};\mathcal{D}_{i}^{\text{h}})\bigr)\tilde{\nabla}f_{i}\bigl(w_{k}-\alpha\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}});\mathcal{D}_{i}^{\text{o}}\bigr). \qquad (7)

It is worth noting that $\tilde{\nabla}F_{i}(w_{k})$ is a biased estimator of $\nabla F_{i}(w_{k})$. This is because the stochastic gradient $\tilde{\nabla}f_{i}(w_{k}-\alpha\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}});\mathcal{D}_{i}^{\text{o}})$ contains another stochastic gradient $\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}})$ inside. Hence, to improve the estimation accuracy, the dataset $\mathcal{D}_{i}^{\text{in}}$ used for the inner gradient update is independent of the sampled datasets $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$ used for the outer gradient and Hessian updates, respectively. Meanwhile, in this paper we assume $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$ are also independent of each other.

II-C Communication and Computation Model

When exploring MAML-based FL in realistic mobile edge networks, the communication and computation models should be captured carefully. In particular, we consider that UEs access the BS through a channel partitioning scheme, such as orthogonal frequency division multiple access (OFDMA). In order to successfully upload the local update to the BS, two conditions need to be satisfied: (1) the UE is selected, and (2) the transmitted local update is successfully decoded. In this respect, we first introduce $s_{k}^{i}\in\{0,1\}$ as a selection indicator, where $s_{k}^{i}=1$ indicates the event that UE $i$ is chosen in round $k$, and $s_{k}^{i}=0$ otherwise. Next, we characterize the transmission quality of the wireless links. For the signals transmitted from UE $i$, the SNR received at the BS can be expressed as $\xi_{k}^{i}=\frac{p_{k}^{i}h_{k}^{i}\|c_{i}\|^{-\kappa}}{N_{0}}$, where $p_{k}^{i}$ is the transmit power of UE $i$ during round $k$ and $\kappa$ is the path loss exponent. $h_{k}^{i}\|c_{i}\|^{-\kappa}$ is the channel gain between UE $i$ and the BS, with $c_{i}$ being the distance between UE $i$ and the BS and $h_{k}^{i}$ being the small-scale channel coefficient. $N_{0}$ is the noise power spectral density. In order for the BS to successfully decode the local update from UE $i$, the received SNR is required to exceed a decoding threshold $\phi$, i.e., $\xi_{k}^{i}>\phi$. Assuming that the small-scale channel coefficients across communication rounds follow a Rayleigh distribution, then according to [22], the update success transmission probability, defined as $q_{k}^{i}=\mathbb{P}(s_{k}^{i}=1,\xi_{k}^{i}>\phi)$, can be estimated as follows:

q_{k}^{i} = \mathbb{P}(s_{k}^{i}=1,\ \xi_{k}^{i}>\phi) \approx \frac{1/n}{1+\nu(p_{k}^{i})}, \qquad (8)

where $\nu(p_{k}^{i})=\frac{N_{0}\phi}{p_{k}^{i}}$. Then the achievable uplink rate $r_{k,i}$ of UE $i$ transmitting its local update to the BS in round $k$ is given by

r_{k,i} = B\log_{2}(1+\xi_{k}^{i}), \qquad (9)

where $B$ is the bandwidth allocated to each UE. Based on $r_{k,i}$, the uplink transmission delay of UE $i$ in round $k$ can be specified as follows:

t_{k,i}^{\text{com}} = \frac{Z}{r_{k,i}}, \qquad (10)

where $Z$ is the size of $w_{k}$ in number of bits. Since the transmit power of the BS is much higher than that of the UEs, the downlink transmission delay is much smaller than the uplink transmission delay. Meanwhile, we care more about the transmit power allocation on individual UEs than on the BS, so here we ignore the downlink transmission delay for simplicity.
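As a rough illustration of (8)–(10), the following sketch computes the received SNR, the approximate update success probability, the uplink rate, and the resulting uplink delay for one UE in one round. The function and parameter names are chosen for illustration only.

```python
import math

def uplink_model(p, h, dist, kappa, N0, phi, n, B, Z):
    """SNR, success probability (8), uplink rate (9), and delay (10) for one UE."""
    snr = p * h * dist ** (-kappa) / N0          # xi_k^i
    q = (1.0 / n) / (1.0 + N0 * phi / p)         # approximate success probability
    rate = B * math.log2(1.0 + snr)              # achievable uplink rate (bit/s)
    delay = Z / rate                             # uplink transmission delay (s)
    decoded = snr > phi                          # BS can decode iff SNR exceeds phi
    return snr, q, rate, delay, decoded
```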

Further, we calculate the computation time of each UE, which is consumed by computing the local update. Given the CPU-cycle frequency $\vartheta_{i}$ of UE $i$, the computation time of UE $i$ is expressed as

t_{k,i}^{\text{cmp}} = \frac{c_{i}d_{k}^{i}}{\vartheta_{i}}. \qquad (11)

In (11), $c_{i}$ denotes the number of CPU cycles for UE $i$ to process one data sample, and $d_{k}^{i}=D_{k,i}^{\text{in}}+D_{k,i}^{\text{o}}+D_{k,i}^{\text{h}}$ denotes the sampled data size of UE $i$ in round $k$.

In terms of $t_{k,i}^{\text{com}}$ and $t_{k,i}^{\text{cmp}}$, we then give the energy consumption of each UE $i$ in round $k$, which consists of two parts: (1) the energy for transmitting the local update and (2) the energy for computing the local update. Let $e_{k}^{i}$ denote the energy consumption of UE $i$ in round $k$; then $e_{k}^{i}$ can be computed as follows [23, 24]:

e_{k}^{i} = \frac{\varsigma}{2}\vartheta_{i}^{3}t_{k,i}^{\text{cmp}} + p_{k}^{i}t_{k,i}^{\text{com}}, \qquad (12)

where $\frac{\varsigma}{2}$ is the effective capacitance coefficient of UE $i$'s computing chipset. From (12), we observe that in round $k$, for each UE $i$, both the sampled data size $d_{k}^{i}$ and the transmit power $p_{k}^{i}$ have a significant impact on its energy consumption. This observation motivates us to jointly consider these two variables when quantifying the benefit MAML brings to FL in mobile edge networks.
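A minimal sketch of the per-round computation time (11) and energy consumption (12) for one UE follows; the names `cycles_per_sample`, `freq`, and `t_com` are illustrative placeholders for $c_i$, $\vartheta_i$, and $t_{k,i}^{\text{com}}$.

```python
def per_round_cost(cycles_per_sample, d, freq, varsigma, p, t_com):
    """Computation time (11) and total energy (12) of one UE in one round."""
    t_cmp = cycles_per_sample * d / freq                     # local computation time
    energy = 0.5 * varsigma * freq ** 3 * t_cmp + p * t_com  # computation + transmission energy
    return t_cmp, energy
```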

II-D Problem Formulation

In order to quantify the benefit MAML brings to FL in mobile edge networks, we focus on learning time minimization under the constraints of model accuracy and UEs' energy consumption. In particular, the learning time is the duration over all $K$ communication rounds, in which the duration of round $k$ is determined by the slowest UE as follows:

T_{k}^{\text{round}} = \max_{i\in\mathcal{A}_{k}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\}. \qquad (13)

Note that we can replace $i\in\mathcal{A}_{k}$ with $i\in\mathcal{U}$ in $T_{k}^{\text{round}}$, that is, $T_{k}^{\text{round}}=\max_{i\in\mathcal{U}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\}$. This is because a UE that is not chosen in $\mathcal{A}_{k}$ has $T_{k}^{i}=0$, which does not affect the value of $T_{k}^{\text{round}}$. Let $\mathbf{d}\triangleq\{\mathbf{d}^{1},\dots,\mathbf{d}^{n}\}$ and $\mathbf{p}\triangleq\{\mathbf{p}^{1},\dots,\mathbf{p}^{n}\}$ denote the UEs' sampled data size vector and transmit power vector, respectively, where $\mathbf{d}^{i}\triangleq\{d_{1}^{i},\dots,d_{K}^{i}\}$ and $\mathbf{p}^{i}\triangleq\{p_{1}^{i},\dots,p_{K}^{i}\}$. Then, the problem formulation is given by

\min_{\mathbf{d},\mathbf{p},K} \quad \sum_{k=0}^{K-1}\max_{i\in\mathcal{U}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\} \qquad (P1)
s.t. \quad \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(w_{k})\|^{2}] \leq \epsilon, \qquad (C1.1)
\qquad\ 0 \leq e_{k}^{i} \leq E_{\max}, \quad \forall i\in\mathcal{U},\ k=0,\dots,K-1, \qquad (C1.2)
\qquad\ 0 \leq p_{k}^{i} \leq P_{\max}, \quad \forall i\in\mathcal{U},\ k=0,\dots,K-1, \qquad (C1.3)
\qquad\ 0 \leq d_{k}^{i} \leq D_{i}, \quad \forall i\in\mathcal{U},\ k=0,\dots,K-1. \qquad (C1.4)

In problem (P1), we optimize not only $\mathbf{d}$ and $\mathbf{p}$, but also the total number of rounds $K$. The underlying reason is that both $\mathbf{d}$ and $\mathbf{p}$ affect the duration of each round, thereby impacting the number of rounds $K$ needed to achieve a certain model accuracy $\epsilon$. Besides, (C1.1) characterizes an $\epsilon$-approximate convergence performance. (C1.2) limits the energy consumption of UE $i$ in round $k$ to be no larger than the predefined maximum value $E_{\max}$. (C1.3) bounds the transmit power of UE $i$ in round $k$ by $P_{\max}$. (C1.4) is the sampled data size constraint: the sampled data size of a UE in round $k$ is no larger than the number of data points generated by that UE. The solution to problem (P1) can be exploited to design an optimized MAML-based FL algorithm, which is implemented iteratively to update the global model parameter $w$. According to the communication and computation model, only the local updates from selected UEs that are successfully decoded by the BS contribute to updating the global model parameter. That is, in round $k$, we have

w_{k+1} = \frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}} \sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}\bigl(w_{k}-\beta\nabla_{w_{k}}F_{i}(\theta_{k}^{i})\bigr), \qquad (15)

where $\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}=1$ if the event $\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}$ is true, and it equals zero otherwise.
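For concreteness, a minimal sketch of the aggregation rule (15) follows: the BS averages only the local updates that were both selected and successfully decoded. The argument names are illustrative.

```python
def aggregate(local_updates, selected, snrs, phi):
    """Global update (15): average the local models w_{k+1}^i over the UEs
    that were selected (s_k^i = 1) and whose SNR exceeded the threshold phi."""
    decoded = [w for w, s, snr in zip(local_updates, selected, snrs)
               if s == 1 and snr > phi]
    if not decoded:
        return None  # no successfully decoded update in this round
    return sum(decoded) / len(decoded)
```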

Based on this update rule, (C1.1) related to the MAML-based FL can be analytically analysed to make the relationship between the decision variables and the model accuracy ϵ\epsilon explicit, thereby facilitating solving problem (P1). In this regard, the convergence analysis of the MAML-based FL is given in the following section.

III Convergence Analysis of MAML-based FL

III-A Preliminaries

For generality, we concentrate on the non-convex setting for the loss functions and aim to find an $\epsilon$-approximate first-order stationary point (FOSP) for the loss function minimization problem (3) in the MAML-based FL, which is defined as follows.

Definition 1.

A random vector $w_{\epsilon}\in\mathbb{R}^{m}$ is called an $\epsilon$-FOSP if it satisfies $\mathbb{E}[\|\nabla F(w_{\epsilon})\|^{2}]\leq\epsilon$.

In terms of this Definition, we then elaborate on the assumptions, convergence bounds, and discussions on the convergence analysis, respectively, from which we can reformulate (C1.1) in problem (P1) as an explicit constraint with respect to the optimization variables.

Without loss of generality, the assumptions used for the convergence analysis of the MAML-based FL algorithm are consistent with those of Per-FedAvg [21, 14] and are given in the following.

Assumption 1.

For every UE $i\in\{1,\dots,n\}$, $f_{i}(w)$ is twice continuously differentiable. Its gradient $\nabla f_{i}(w)$ is $L$-Lipschitz continuous, that is,

\|\nabla f_{i}(w)-\nabla f_{i}(u)\| \leq L\|w-u\|, \qquad \forall w,u\in\mathbb{R}^{m}. \qquad (16)
Assumption 2.

For every UE $i\in\{1,\dots,n\}$, the Hessian matrix of $f_{i}(w)$ is $\rho$-Lipschitz continuous, that is,

\|\nabla^{2}f_{i}(w)-\nabla^{2}f_{i}(u)\| \leq \rho\|w-u\|, \qquad \forall w,u\in\mathbb{R}^{m}. \qquad (17)
Assumption 3.

For any $w\in\mathbb{R}^{m}$, $\nabla l_{i}(w;x,y)$ and $\nabla^{2}l_{i}(w;x,y)$, computed w.r.t. a single data point $(x,y)\in\mathcal{X}_{i}\times\mathcal{Y}_{i}$, have bounded variance, that is,

\mathbb{E}_{(x,y)\sim p_{i}}\bigl[\|\nabla l_{i}(w;x,y)-\nabla f_{i}(w)\|^{2}\bigr] \leq \sigma^{2}_{G},
\mathbb{E}_{(x,y)\sim p_{i}}\bigl[\|\nabla^{2}l_{i}(w;x,y)-\nabla^{2}f_{i}(w)\|^{2}\bigr] \leq \sigma^{2}_{H}. \qquad (18)
Assumption 4.

For any $w\in\mathbb{R}^{m}$, the gradient and Hessian matrix of the local loss function $f_{i}(w)$ and the average loss function $f(w)=(1/n)\sum_{i=1}^{n}f_{i}(w)$ have bounded variance, that is,

\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_{i}(w)-\nabla f(w)\|^{2} \leq \gamma_{G}^{2},
\frac{1}{n}\sum_{i=1}^{n}\|\nabla^{2}f_{i}(w)-\nabla^{2}f(w)\|^{2} \leq \gamma_{H}^{2}. \qquad (19)

Before the convergence analysis, we first introduce three lemmas, inherited from [14, 21], quantifying the smoothness of $F_{i}(w)$ and $F(w)$, the deviation between $\nabla F_{i}(w)$ and its estimate $\tilde{\nabla}F_{i}(w)$, and the deviation between $\nabla F_{i}(w)$ and $\nabla F(w)$, respectively.

Lemma 1.

If Assumptions 1-3 hold, then $F_{i}(w)$ is $L_{F}$-Lipschitz continuous with $L_{F}:=4L+\alpha\rho B$. As a result, the average function $F(w)=(1/n)\sum_{i=1}^{n}F_{i}(w)$ is also $L_{F}$-Lipschitz continuous.

Lemma 2.

Recall the gradient estimate $\tilde{\nabla}F_{i}(w)$ shown in (7), which is computed using $\mathcal{D}_{i}^{\text{in}}$, $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$, independently sampled datasets with sizes $D_{i}^{\text{in}}$, $D_{i}^{\text{o}}$ and $D_{i}^{\text{h}}$, respectively. If the conditions in Assumptions 1-3 hold, then for any $\alpha\in(0,1/L]$ and $w\in\mathbb{R}^{m}$, we have

\left\|\mathbb{E}\left[\tilde{\nabla}F_{i}(w)-\nabla F_{i}(w)\right]\right\| \leq \frac{2\alpha L\sigma_{G}}{\sqrt{D^{\text{in}}}}, \qquad (20)
\mathbb{E}\left[\|\tilde{\nabla}F_{i}(w)-\nabla F_{i}(w)\|^{2}\right] \leq \sigma_{F}^{2}, \qquad (21)

where $\sigma_{F}^{2}$ is defined as

\sigma_{F}^{2} := 12\left[B^{2}+\sigma_{G}^{2}\left[\frac{1}{D^{\text{o}}}+\frac{(\alpha L)^{2}}{D^{\text{in}}}\right]\right]\left[1+\sigma_{H}^{2}\frac{\alpha^{2}}{4D^{\text{h}}}\right]-12B^{2}, \qquad (22)

with $D^{\text{in}}=\max_{i\in\mathcal{U}}D_{i}^{\text{in}}$, $D^{\text{o}}=\max_{i\in\mathcal{U}}D_{i}^{\text{o}}$ and $D^{\text{h}}=\max_{i\in\mathcal{U}}D_{i}^{\text{h}}$.

Lemma 3.

Given the loss function $F_{i}(w)$ shown in (4) and $\alpha\in(0,1/L]$, if the conditions in Assumptions 1, 2, and 4 hold, then for any $w\in\mathbb{R}^{m}$, we have

\frac{1}{n}\sum_{i=1}^{n}\|\nabla F_{i}(w)-\nabla F(w)\|^{2} \leq \gamma_{F}^{2}, \qquad (23)

where $\gamma_{F}^{2}$ is defined as

\gamma_{F}^{2} := 3B^{2}\alpha^{2}\gamma_{H}^{2}+192\gamma_{G}^{2}, \qquad (24)

with $\nabla F(w)=(1/n)\sum_{i=1}^{n}\nabla F_{i}(w)$.

III-B Analysis of Convergence Bound

Let $U_{i}=\max_{k=1,\dots,K}U_{k}^{i}$, where $U_{k}^{i}=\frac{\mathbb{P}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}{\sum_{i=1}^{n}\mathbb{P}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}=\frac{q_{k}^{i}}{\sum_{i=1}^{n}q_{k}^{i}}$ denotes the normalized update success probability of UE $i$ in round $k$. Then, with Lemmas 1, 2 and 3, the expected convergence result of the MAML-based FL within the general mobile edge network described in Section II can be obtained by the following theorem.

Theorem 1.

Given the transmit power vector $\mathbf{p}$, the sampled data size vector $\mathbf{d}$, the number of communication rounds $K$, the optimal global loss $F(w_{\epsilon})$, the normalized update success probability $U_{i}$, and $\alpha\in(0,1/L]$, we have the following FOSP condition:

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(\bar{w}_{k})\|^{2}] \leq \frac{4(F(w_{0})-F(w_{\epsilon}))}{\beta K} + \left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}+\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}\right), \qquad (25)

where $\bar{w}_{k}$ is the average of the local updates $w_{k}^{i}$ that are successfully decoded by the BS in round $k$. That is, we have

\bar{w}_{k} = \frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}\,w_{k}^{i}. \qquad (26)
Proof:

See the Appendix. ∎

Unlike the convergence guarantee of Per-FedAvg in [14], which is characterized by $K$ and $\mathbf{d}$, the convergence guarantee obtained from Theorem 1 is characterized by $K$, $\mathbf{d}$ and $U_{i}$, where $U_{i}$ is a function of the transmit power $\mathbf{p}^{i}$. That is, the convergence bound of the proposed MAML-based FL in our paper is described in terms of $K$, $\mathbf{d}$ and $\mathbf{p}$. Therefore, our convergence analysis can combine the FL hyperparameters with the resource allocation in mobile edge networks.

III-C Discussions

From Theorem 1, we are able to characterize $K$, $\mathbf{d}$ and $\mathbf{p}$ with respect to the predefined model accuracy $\epsilon$. According to Theorem 1 and (C1.1) in problem (P1), it is desired that the right-hand side of (25) be equal to or smaller than $\epsilon$. Consequently, we present the following corollary.

Corollary 1.

Suppose the conditions in Theorem 1 are satisfied. If we set the number of total communication rounds as $K=\mathcal{O}(\frac{1}{\epsilon^{3}})$, the global learning rate as $\beta=\mathcal{O}(\epsilon^{2})$, and the number of data samples as $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$, then we find an $\epsilon$-FOSP for the MAML-based FL in problem (P1).

Proof:

Setting $K=\mathcal{O}(\frac{1}{\epsilon^{3}})$ and $\beta=\mathcal{O}(\epsilon^{2})$ ensures that the order of magnitude of the first term on the right-hand side of (25) equals $\mathcal{O}(\epsilon)$.

Then we examine the second term. Since $U_{i}$ is the normalized update success probability over $n$ UEs, its order of magnitude is determined by $n$ rather than $\epsilon$, i.e., $U_{i}=\mathcal{O}(\frac{1}{n})$. Therefore, it is natural to focus on the term $\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}+\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$ for an $\epsilon$-FOSP. By setting $D_{i}^{\text{in}}=D_{i}^{\text{o}}=D_{i}^{\text{h}}=\mathcal{O}(\frac{1}{\epsilon})$ (i.e., $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$), the term $\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$ dominates the value of $\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}+\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$. This is because, in this case, $\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}=\mathcal{O}(\epsilon^{4})$, which is a higher-order infinitesimal of $\epsilon$. Finally, combining the magnitudes of the first and second terms, we conclude that the joint choice of $K=\mathcal{O}(\frac{1}{\epsilon^{3}})$, $\beta=\mathcal{O}(\epsilon^{2})$, and $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$ yields an $\epsilon$-FOSP for the MAML-based FL. ∎
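The following sketch turns Corollary 1 into concrete hyperparameter choices for a target accuracy $\epsilon$. The constant factors (set to 1 here) are illustrative assumptions; Corollary 1 only fixes the orders of magnitude.

```python
import math

def corollary1_hyperparameters(epsilon):
    """Orders of magnitude from Corollary 1: K = O(1/eps^3), beta = O(eps^2),
    and per-UE sampled data size d_i = O(1/eps), with unit constants assumed."""
    K = math.ceil(1.0 / epsilon ** 3)      # total communication rounds
    beta = epsilon ** 2                    # global (BS) learning rate
    d_i = math.ceil(1.0 / epsilon)         # sampled data size per UE per round
    return K, beta, d_i
```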

Remark 1.

It is worth noting that the result in Corollary 1 provides one possible choice of the hyperparameters. Other options may also be valid, as long as they make the order of magnitude of the hyperparameters (w.r.t. the model accuracy $\epsilon$) satisfy the condition in Theorem 1.

Guided by Corollary 1, we can recast (C1.1) in problem (P1) as a series of constraints that give explicit relationships between the optimization variables and the model accuracy $\epsilon$. In this way, we then solve problem (P1) with high efficiency in the next section.

IV Automated Federated Learning (AutoFL)

In this section, we first resort to Theorem 1 and Corollary 1 for problem (P1) decomposition, and then solve the resultant subproblems respectively. Based on the obtained solutions, we finally propose Automated Federated Learning (AutoFL) algorithm, by which the benefit MAML brings to FL can be quantified in mobile edge networks.

IV-A Problem Decoupling

The decomposition of problem (P1) follows the primal decomposition approach [25]. Specifically, the master problem is optimized with respect to the number of communication rounds $K$ and is given as follows:

\min_{K} \quad \sum_{k=0}^{K-1}\max_{i\in\mathcal{U}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\} \qquad (P2)
s.t. \quad \frac{4(F(w_{0})-F(w_{\epsilon}))}{\beta K} \leq \epsilon, \qquad K\in\mathbb{N}^{+}. \qquad (C2.1)

Note that we refer to Theorem 1 and use a contracted version of (C1.1), i.e., (C2.1), as the constraint in problem (P2) for tractability. This contraction replaces $\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(\bar{w}_{k})\|^{2}]$ with its upper bound without the second term. The reason we drop the second term is that we only consider $K$ as the decision variable in (P2), and the second term of the upper bound can be regarded as a constant and thus removed as long as its value is no larger than $\epsilon$. Meanwhile, the above contraction shrinks the feasible region of $K$. Therefore, as long as (C2.1) is satisfied, (C1.1) must be satisfied. Moreover, if there exists an optimal solution $K$ to problem (P2), it is also feasible for the original problem (P1). We can regard the optimal solution to (P2) as an optimal approximation to (P1).

The reason that we consider $K$ as a decision variable is the decoupling of (P1). Once $K$ is determined and set to a constant in the following decoupling, the original objective of minimizing the overall learning time over $K$ rounds can be regarded as minimizing the learning time in each round. As such, problem (P2), which has a slave problem with respect to the sampled data size $\mathbf{d}$ and transmit power $\mathbf{p}$, can be regarded as an approximation of problem (P1).

From Theorem 1, we find that $\mathbf{d}$ and $\mathbf{p}$ are coupled even after (C1.1) is replaced by the right-hand side of (25). In this regard, we optimize $\mathbf{d}$ and $\mathbf{p}$ in the slave problem with the coordinate descent method [20]. That is, the slave problem is first solved with $\mathbf{d}$ as variables and $\mathbf{p}$ as constants, and then solved with $\mathbf{p}$ as variables and $\mathbf{d}$ as constants. This procedure is repeated in an iterative manner. In this way, the $\mathbf{d}$- and $\mathbf{p}$-related slave problems can be further decomposed among rounds with the assistance of Corollary 1. Specifically, in any round $k$, the sampled data size of UE $i$ is optimized by solving problem (P3):

\min_{d_{k}^{i}} \quad \max_{i}\left\{\frac{c_{i}}{\vartheta_{i}}d_{k}^{i}+t_{k,i}^{\text{com}}\right\} \qquad (P3)
s.t. \quad d_{k}^{i} \geq \frac{1}{\epsilon}, \quad \forall i\in\mathcal{U}, \qquad (C3.1)
\qquad\ \frac{\varsigma}{2}c_{i}d_{k}^{i}\vartheta_{i}^{2} + p_{k}^{i}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} \leq E_{\max}, \quad \forall i\in\mathcal{U}, \qquad (C3.2)
\qquad\ 0 \leq d_{k}^{i} \leq D_{i}, \quad \forall i\in\mathcal{U}. \qquad (C3.3)

(C3.1) is obtained from Corollary 1, which shows that $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$ yields an $\epsilon$-FOSP. (C3.1) uses $d_{k}^{i}\geq\frac{1}{\epsilon}$ instead to indicate that as long as $d_{k}^{i}$ is not smaller than $\frac{1}{\epsilon}$, an $\epsilon$-FOSP is guaranteed. This is a reasonable replacement (i.e., from $d_{k}^{i}=\mathcal{O}(\frac{1}{\epsilon})$ to $d_{k}^{i}\geq\frac{1}{\epsilon}$) in the sense that the larger the sampled data size is, the higher the accuracy AutoFL achieves, thereby leading to a smaller $\epsilon$. Note that (C3.1) is also a contraction of (C1.1), and the feasible domain of $d_{k}^{i}$ is also shrunk. That is, the optimal $d_{k}^{i*}$, as long as it exists for (P3), is also feasible for problem (P1) and can be regarded as an optimal approximation to (P1). (C3.2) is the energy consumption constraint on each UE, and (C3.3) indicates that the sampled data size of UE $i$ is no greater than the number of data points generated by UE $i$. Furthermore, solving the following problem determines the transmit power of UE $i$ in round $k$:

\min_{p_{k}^{i}} \quad \max_{i}\left\{t_{k,i}^{\text{cmp}}+\frac{Z}{B\log_{2}(1+\xi_{k}^{i})}\right\} \qquad (P4)
s.t. \quad \left(\sum_{i=1}^{n}{U_{k}^{i}}^{2}\right)\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}} \leq \epsilon, \qquad (C4.1)
\qquad\ \frac{\varsigma}{2}c_{i}d_{k}^{i}\vartheta_{i}^{2} + p_{k}^{i}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} \leq E_{\max}, \quad \forall i\in\mathcal{U}, \qquad (C4.2)
\qquad\ 0 \leq p_{k}^{i} \leq P_{\max}, \quad \forall i\in\mathcal{U}, \qquad (C4.3)

where (C4.1) is obtained from Corollary 1, based on the fact that $(\sum_{i=1}^{n}{U_{k}^{i}}^{2})\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$ dominates the value of the second term on the right-hand side of (25) and therefore must be controlled to guarantee an $\epsilon$-FOSP for the MAML-based FL. Here we replace $U_{i}$ with $U_{k}^{i}$ to show that (C4.1) should be satisfied in each round $k$. The reason that we replace $(\sum_{i=1}^{n}{U_{k}^{i}}^{2})\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}=\mathcal{O}(\epsilon)$ with $(\sum_{i=1}^{n}{U_{k}^{i}}^{2})\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}\leq\epsilon$ is to indicate that, as long as (C4.1) is satisfied, an $\epsilon$-FOSP is guaranteed. Note that (C4.1) is also a contraction of (C1.1). That is, as long as there exists an optimal solution to (P4), this solution is also feasible and close to the optimal solution to (P1). We can regard the optimal solution to (P4) as an optimal approximation to (P1). (C4.2) and (C4.3) are the energy constraint and transmit power constraint on each UE, respectively.

Based on the above decomposition, we can solve the original problem (P1) by solving problem (P2), in which problems (P3) and (P4) are nested and addressed with the coordinate descent method. This decoupling only provides approximately optimal results. However, the evaluation results show that, with the approximate solutions obtained from (P2), (P3) and (P4), the proposed MAML-based FL algorithm always outperforms Per-FedAvg in learning time and convergence performance. Next, we deal with problems (P2), (P3), and (P4), respectively.

IV-B Problem Solution

IV-B1 Communication Round Optimization

Once the transmit power $\mathbf{p}$ and the sampled data size $\mathbf{d}$ are obtained from the slave problem, the objective of the master problem (P2) is monotonically increasing with respect to $K$. Therefore, to minimize the total training time, the optimal value of $K$ should be its lower bound:

K^{*} = \frac{4(F(w_{0})-F(w_{\epsilon}))}{\beta\epsilon}. \qquad (30)

Note that we do not use the formulation of $K^{*}$ to predict the actual optimal number of global communication rounds. It is not practical to determine this value in advance because, in practice, the optimal number of communication rounds can be easily observed once the training loss starts to converge. Many factors can affect the actual value of $K^{*}$, even the version of PyTorch/TensorFlow used in the experiments. Therefore, the theoretical formulation of $K^{*}$ in (30), with the initial global model $w_{0}$, the global optimal loss $F(w_{\epsilon})$, the global learning rate $\beta$, and the model accuracy $\epsilon$ to be decided, is only used as a guidance indicating the order of magnitude of the practical optimal number of communication rounds.
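As a guidance-only calculation, the sketch below evaluates (30) for given values of the initial loss, the (estimated) optimal loss, the global learning rate, and the target accuracy; all arguments are assumptions supplied by the user.

```python
import math

def required_rounds(initial_loss, optimal_loss, beta, epsilon):
    """Lower bound K* in (30); only indicative of the order of magnitude of
    the number of communication rounds actually needed."""
    return math.ceil(4.0 * (initial_loss - optimal_loss) / (beta * epsilon))
```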

IV-B2 Data Sample Size Optimization

It is easy to observe from problem (P3) that for each UE $i$, the optimal $d_{k}^{i}$ lies at the lower bound $\frac{1}{\epsilon}$. However, whether this lower bound can be reached depends on the relationship between $d_{k}^{i}$'s lower bound $\frac{1}{\epsilon}$ and the upper bounds defined by (C3.2) and (C3.3). Specifically, the upper bound defined by (C3.2) is $D_{i}^{\text{p}}:=\frac{2E_{\max}}{\varsigma c_{i}\vartheta_{i}^{2}}-\frac{2p_{k}^{i}Z}{\varsigma c_{i}\vartheta_{i}^{2}B\log_{2}(1+\xi_{k}^{i})}$, while the upper bound defined by (C3.3) is $D_{i}$. Consequently, for each UE $i$, we need to consider two cases, $D_{i}\geq\frac{1}{\epsilon}$ and $D_{i}<\frac{1}{\epsilon}$, within each of which we further distinguish two sub-cases, $D_{i}^{\text{p}}>\frac{1}{\epsilon}$ and $D_{i}^{\text{p}}\leq\frac{1}{\epsilon}$. That is, we discuss the optimal solution of $d_{k}^{i}$ for problem (P3) as follows.

Case 1: In this case, UE $i$ generates sufficient data points, no fewer than the lower bound required for model accuracy $\epsilon$. That is, for UE $i$, we have

0 < \frac{1}{\epsilon} \leq D_{i}. \qquad (31)

Under this case, we need to further discuss the relationship between $D^{\text{p}}_{i}$ and $\frac{1}{\epsilon}$:

  • When $D_{i}^{\text{p}}>\frac{1}{\epsilon}$, the lower bound $\frac{1}{\epsilon}$ of $d_{k}^{i}$ can be reached with enough data points to achieve the model accuracy $\epsilon$. Therefore, the optimal sampled data size is $d_{k}^{i}=\frac{1}{\epsilon}$.

  • When $D_{i}^{\text{p}}\leq\frac{1}{\epsilon}$, the lower bound $\frac{1}{\epsilon}$ of $d_{k}^{i}$ cannot be reached. According to (25) in Theorem 1, the larger $d_{i}$ is, the higher the model accuracy that can be achieved. Therefore, the optimal value of $d_{k}^{i}$ should be equal to $D_{i}^{\text{p}}$.

Case 2: In this case, UE $i$ generates insufficient data points, fewer than the lower bound required for model accuracy $\epsilon$. That is, for UE $i$, we have

0 \leq D_{i} < \frac{1}{\epsilon}. \qquad (32)

Under this case, we also need to further discuss the relationship between $D^{\text{p}}_{i}$ and $\frac{1}{\epsilon}$:

  • When $D^{\text{p}}_{i}>\frac{1}{\epsilon}$, the optimal value of $d_{k}^{i}$ is $D_{i}$;

  • When $D^{\text{p}}_{i}\leq\frac{1}{\epsilon}$, the optimal value of $d_{k}^{i}$ is $\min\{D_{i},D_{i}^{\text{p}}\}$.

The above process of computing $d_{k}^{i}$ for problem (P3) is summarized as the SampledDataSize algorithm, which is omitted due to space limitations; a sketch of the case analysis is given below.
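The following sketch is our reading of the case analysis above (the paper omits the SampledDataSize pseudo-code); it should be taken as an assumption-laden illustration rather than the authors' exact routine, and all argument names are placeholders.

```python
import math

def sampled_data_size(epsilon, D_i, E_max, varsigma, c_i, freq, p, Z, B, snr):
    """Choose d_k^i following the Case 1 / Case 2 analysis of problem (P3)."""
    d_lower = 1.0 / epsilon                                  # lower bound from (C3.1)
    # Upper bound D_i^p implied by the energy constraint (C3.2).
    D_p = (2.0 * E_max / (varsigma * c_i * freq ** 2)
           - 2.0 * p * Z / (varsigma * c_i * freq ** 2 * B * math.log2(1.0 + snr)))
    if D_i >= d_lower:                                       # Case 1: enough local data
        d = d_lower if D_p > d_lower else D_p
    else:                                                    # Case 2: insufficient local data
        d = D_i if D_p > d_lower else min(D_i, D_p)
    return max(0.0, min(d, D_i))                             # keep within (C3.3)
```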

IV-B3 Transmit Power Optimization

Finally, we turn to solving problem (P4), where the transmit power $p_{k}^{i}$ of UE $i$ in round $k$ is optimized with fixed $d_{k}^{i}$. To this end, we first convert the problem into a more concise form. Given that $U_{k}^{i}$ denotes the normalized update success probability, we have $U_{k}^{i}\leq 1$. Then the inequality $\sum_{i}^{n}{U_{k}^{i}}^{2}\leq 1$ always holds. Consequently, we can further transform (C4.1) into $\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D_{i}^{\text{in}}}\leq\epsilon$. Combining the transformed constraint with (C3.1), (C3.2) and (C3.3), we find that as long as $\alpha^{2}L^{2}\sigma_{G}^{2}\leq\min\{1,D_{i}\epsilon,D_{i}^{\text{p}}\epsilon\}$, (C4.1) can always be satisfied. Therefore, problem (P4) can be further transformed into a new optimization problem without constraint (C4.1).

With the above concise formulation of (P4), we analyze the monotonicity of $t_{k,i}^{\text{com}}$ in the objective function of problem (P4). Specifically, the derivative of $t_{k,i}^{\text{com}}$ can be computed as follows:

\frac{\mathrm{d}}{\mathrm{d}p_{k}^{i}}t_{k,i}^{\text{com}} = \frac{\mathrm{d}}{\mathrm{d}p_{k}^{i}}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} = -\frac{Z h_{k}^{i}\|c_{i}\|^{-\kappa}}{B N_{0}(1+\xi_{k}^{i})[\log_{2}(1+\xi_{k}^{i})]^{2}} < 0, \qquad (33)

which means $t_{k,i}^{\text{com}}$ monotonically decreases with $p_{k}^{i}$.

Further, as for constraint (C4.2), the derivative of its left-hand-side term can be shown to be always positive, as follows:

\frac{\mathrm{d}}{\mathrm{d}p_{k}^{i}}\, p_{k}^{i}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} = \frac{Z}{B}\left(\frac{1}{\log_{2}(1+\xi_{k}^{i})}-\frac{\xi_{k}^{i}}{[\log_{2}(1+\xi_{k}^{i})]^{2}(1+\xi_{k}^{i})}\right) > \frac{Z}{B\log_{2}(1+\xi_{k}^{i})}\left(1-\frac{\xi_{k}^{i}}{\frac{\xi_{k}^{i}}{1+\xi_{k}^{i}}(1+\xi_{k}^{i})}\right) = \frac{Z}{B\log_{2}(1+\xi_{k}^{i})}(1-1) = 0, \qquad (34)

where the inequality follows from the fact that $\log_{2}(1+x)>\frac{x}{1+x}$ for $x>0$. Therefore, the left-hand-side term of (C4.2) monotonically increases with $p_{k}^{i}$. This means (C4.2) defines another upper bound on $p_{k}^{i}$, just as $P_{\max}$ does in (C4.3).

From the above analysis, the optimal solution to (P4) lies at the upper bound of $p_{k}^{i}$, which is defined by (C4.2) or (C4.3). This process of computing $p_{k}^{i}$ can be summarized as the PowerComputation algorithm, which is omitted due to space limitations; a sketch is given below.
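Since the left-hand side of (C4.2) is monotone in $p_k^i$, one simple way to locate the upper bound it implies is bisection, capped by $P_{\max}$. The sketch below is our own illustrative implementation (the paper omits the PowerComputation pseudo-code) under the stated assumptions; it also assumes the energy constraint is feasible for some small power.

```python
import math

def power_computation(d, E_max, P_max, varsigma, c_i, freq, Z, B, h, dist, kappa, N0):
    """Pick p_k^i at the upper bound implied by (C4.2) or (C4.3), whichever is tighter."""
    def energy(p):
        snr = p * h * dist ** (-kappa) / N0
        t_com = Z / (B * math.log2(1.0 + snr))
        return 0.5 * varsigma * c_i * d * freq ** 2 + p * t_com   # LHS of (C4.2)

    if energy(P_max) <= E_max:
        return P_max                      # (C4.3) is the binding upper bound
    lo, hi = 1e-9, P_max                  # otherwise bisect (C4.2), which increases in p
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if energy(mid) <= E_max:
            lo = mid
        else:
            hi = mid
    return lo
```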

IV-C AutoFL Implementation

Input: $\phi$, $E_{\max}$, $P_{\max}$, $N_{0}$, $c_{i}$, $B^{\text{up}}$, $\kappa$, $\vartheta_{i}$, $w_{0}$.
1  $p_{-1}^{i} := P_{\max}$;
2  for $k=0,1,\dots,K-1$ do
3      The BS randomly selects a subset of associated UEs $\mathcal{A}_{k}$, where each UE $i\in\mathcal{A}_{k}$ has $s_{k}^{i}=1$;
4      The BS sends $w_{k}$ to all UEs in $\mathcal{A}_{k}$;
5      for $i\in\mathcal{A}_{k}$ do
6          $d_{k}^{i} := \texttt{SampledDataSize}(p_{k-1}^{i})$;
7          $p_{k}^{i} := \texttt{PowerComputation}(d_{k}^{i})$;
8          if $\xi_{k}^{i} > \phi$ then
9              $w_{k+1}^{i} := \texttt{LocalModelTraining}(d_{k}^{i},w_{k})$;
10             UE $i$ sends $w_{k+1}^{i}$ back to the BS;
11         end if
12      end for
13      The BS updates its model using $w_{k+1} := \frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}\,w_{k+1}^{i}$;
14  end for
15  If the model accuracy $\epsilon$ is reached, terminate the algorithm and set $K^{*}:=K$.
Algorithm 2 Automated Federated Learning (AutoFL)

After devising the algorithms to determine the number of communication rounds $K$, the sampled data size $\mathbf{d}$, and the transmit power $\mathbf{p}$, we combine these sub-algorithms to design the AutoFL algorithm, shown in Algorithm 2. Note that, to reduce the computational complexity, we apply the coordinate descent method to problems (P3) and (P4) only once per round. More specifically, given the current transmit power $p_{k-1}^{i}$, problem (P3) with respect to $d_{k}^{i}$ is solved for UE $i\in\mathcal{A}_{k}$ in round $k$ using the SampledDataSize algorithm, as in line 6 of AutoFL. Then, based on the obtained $d_{k}^{i}$, problem (P4) with respect to $p_{k}^{i}$ is addressed for UE $i\in\mathcal{A}_{k}$ in round $k$ with the PowerComputation algorithm, as in line 7 of AutoFL. The result $p_{k}^{i}$ is further used in the next round $k+1$ to compute $d_{k+1}^{i}$, which is in turn used to compute $p_{k+1}^{i}$. This process repeats until the model accuracy $\epsilon$ is achieved. As in line 15 of AutoFL, once the model accuracy is reached, the algorithm is terminated and the required number of rounds $K^{*}$ is output.

V Performance Evaluation

In this section, we evaluate the performance of AutoFL to (1) verify its effectiveness in saving learning time, (2) examine its training loss and accuracy to demonstrate its effectiveness in fast adaptation and convergence, and (3) present its advantages under different network and dataset settings.

V-A Simulation Settings

V-A1 Network settings

TABLE I: System Parameters
Parameter | Value | Parameter | Value | Parameter | Value
$\alpha$ (MNIST) | 0.03 | $B$ | 1 MHz | $E_{\max}$ | 0.003 J
$\beta$ (MNIST) | 0.07 | $\varsigma$ | $10^{-27}$ | $P_{\max}$ | 0.01 W
$\alpha$ (CIFAR-10) | 0.06 | $\kappa$ | 3.8 | $\phi$ | 30 dB
$\beta$ (CIFAR-10) | 0.02 | $\vartheta_{i}$ | $10^{9}$ | $N_{0}$ | $-174$ dBm/Hz

Unless otherwise specified, we consider a mobile edge network that consists of $n=20$ UEs located in a cell of radius $R=200$ m with a BS located at its center. All UEs are assumed to be uniformly distributed in the cell. The Rayleigh distribution parameter of $h_{k}^{i}$ is 40. The other parameters used in the simulations are listed in Table I.

V-A2 Datasets

We evaluate the training performance using two datasets: the MNIST dataset [26] for handwritten digit classification and CIFAR-10 [27]. The network model used for MNIST is a 2-layer NN with a hidden layer of size 100, while the model used for CIFAR-10 is LeNet-5 [28], which has two convolutional layers and three fully connected layers. The datasets are partitioned randomly into 75% for training and 25% for testing. Meanwhile, in the simulations we use both i.i.d and non-i.i.d datasets. To generate i.i.d datasets, the original training set is randomly partitioned into 20 portions and each UE is assigned one portion. As for the non-i.i.d datasets, the original training set is first partitioned into 10 portions according to the labels. Each UE then holds only l of the 10 labels, where l takes a value in {1,…,10}, and is allocated a different local data size in the range [2, 3834]. The value of l reflects the non-i.i.d level of the local datasets: the smaller l is, the higher the non-i.i.d level. Unless otherwise specified, we use l=5 in the evaluation. All experiments are conducted with PyTorch [29] version 1.7.1.
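As an illustration, the label-based non-i.i.d split described above can be generated as follows. This is a sketch under the stated assumptions only: labels is the array of training labels (e.g., obtained from torchvision's MNIST), and the local data sizes are drawn from [2, 3834] as in the text; the exact split used in our experiments may differ in implementation details.

```python
import numpy as np

def non_iid_partition(labels, n_ues=20, l=5, min_size=2, max_size=3834, seed=0):
    """Give each UE samples from only l of the 10 classes, with a randomly
    chosen local data size in [min_size, max_size]."""
    rng = np.random.default_rng(seed)
    by_class = {c: np.flatnonzero(labels == c) for c in range(10)}
    partition = []
    for _ in range(n_ues):
        classes = rng.choice(10, size=l, replace=False)        # the l labels this UE sees
        pool = np.concatenate([by_class[c] for c in classes])
        size = rng.integers(min_size, max_size + 1)
        partition.append(rng.choice(pool, size=min(size, len(pool)), replace=False))
    return partition

# Usage, e.g., with torchvision MNIST labels:
# labels = np.array(torchvision.datasets.MNIST(".", train=True, download=True).targets)
# ue_indices = non_iid_partition(labels, l=5)
```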

Figure 2: Learning time comparisons for (a) i.i.d MNIST, (b) non-i.i.d MNIST, and (c) non-i.i.d CIFAR-10 datasets. For the non-i.i.d MNIST and CIFAR-10 datasets, l=5.
Figure 3: Learning time comparisons between AutoFL and Per-FedAvg with respect to different sampled data sizes using (a) non-i.i.d MNIST and (b) non-i.i.d CIFAR-10 datasets.

V-A3 Baselines

We compare the performance of AutoFL with Per-FedAvg [14] and FedAvg [7]. FedAvg is the first algorithm proposed for FL and the most general FL algorithm, while Per-FedAvg is the most general MAML-based FL algorithm. In Per-FedAvg, we first fix the sampled data size of all UEs to 5. As for the transmit power in Per-FedAvg, all UEs use the maximum power P_max, since with only 5 data samples to train on at each UE, more of the energy budget can be devoted to transmitting the local updates to the BS. As for FedAvg, we consider two settings. In the first, we set the sampled data size of each UE to 5 and the transmit power in all rounds to P_max, which is the same as the Per-FedAvg setting; we name FedAvg with this setting FedAvg-1. In the second, we set the sampled data sizes and transmit powers of the UEs to be the same as in AutoFL; we name FedAvg with this setting FedAvg-2. Later, we vary the sampled data sizes in Per-FedAvg and FedAvg-1 to examine the influence of the number of sampled data points on the performance gap between Per-FedAvg and AutoFL.
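For clarity, the four schemes compared in the rest of this section differ only in how the sampled data size and the transmit power are chosen; the snippet below simply restates the settings above in a compact form (with P_max = 0.01 W from Table I), and introduces no setting beyond those already described.

```python
P_MAX = 0.01  # W, from Table I

baselines = {
    "AutoFL":     {"sampled_data_size": "per round, Algorithm 2", "transmit_power": "per round, Algorithm 2"},
    "Per-FedAvg": {"sampled_data_size": 5,                        "transmit_power": P_MAX},
    "FedAvg-1":   {"sampled_data_size": 5,                        "transmit_power": P_MAX},
    "FedAvg-2":   {"sampled_data_size": "same as AutoFL",         "transmit_power": "same as AutoFL"},
}
```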

V-B Learning Time Comparisons

We first compare the learning time of AutoFL with that of Per-FedAvg, FedAvg-1, and FedAvg-2 with respect to the test accuracy. The results are shown in Fig. 2. We analyse the results from three aspects. (i) First, it is clear that at the same accuracy, AutoFL requires the least learning time. For example, when the algorithms start to converge, Per-FedAvg takes at least twice as much time as AutoFL. This is a reasonable result, as AutoFL is designed to minimize the overall learning time. Meanwhile, as the test accuracy increases, the average learning time of all algorithms grows rapidly. The experimental results verify our theoretical analysis and confirm the effectiveness of AutoFL in reducing the overall learning time. (ii) Second, we observe that the two MAML-based FL algorithms outperform the two conventional FL algorithms, and this advantage is even more pronounced on the non-i.i.d CIFAR-10 dataset than on the MNIST datasets, which testifies to the advantage of MAML in accelerating the training speed of FL. Meanwhile, the two algorithms with optimized sampled data size and transmit power, AutoFL and FedAvg-2, outperform Per-FedAvg and FedAvg-1. This indicates that jointly optimizing the sampled data size and transmit power is a promising way to improve FL performance over mobile edge networks. (iii) Third, the results on i.i.d datasets are more stable than those on non-i.i.d datasets, and the average learning time under i.i.d datasets is smaller than that under non-i.i.d datasets. This is because, in the i.i.d case, the local model of each individual UE is more representative of the learned global model, which is beneficial to the learning performance.

Figure 4: Convergence performance comparisons using (a, d) i.i.d MNIST, (b, e) non-i.i.d MNIST, and (c, f) non-i.i.d CIFAR-10 datasets. For the non-i.i.d MNIST and CIFAR-10 datasets, l=5.

Thereafter, we change the sampled data size of Per-FedAvg to see how the gap between Per-FedAvg and AutoFL changes with the number of data samples used by Per-FedAvg. Note that when the number of samples increases, the transmit power of UEs in Per-FedAvg is not always P_max; therefore, here we use the PowerComputation algorithm to compute the transmit power for the predefined data sample sizes. Fig. 3 shows the results. From Fig. 3 we observe that, no matter how the number of data samples in Per-FedAvg changes, its overall learning time is never smaller than that of AutoFL. As the sampled data size increases from 5 to 40, for the same test accuracy, the average learning time of Per-FedAvg decreases; this trend is more obvious on the CIFAR-10 dataset. However, when the number of sampled data points reaches 60, the learning time of Per-FedAvg suddenly deteriorates. We attribute this to the fact that a larger sampled data size requires a larger transmit power, and once the transmit power of a UE reaches its upper bound, it cannot be increased further. Due to insufficient transmit power, the updates transmitted by such a UE fail to reach the server for the global model update. Therefore, although the number of samples on each UE increases, the learning time may become longer. This result gives us confidence that, although the original problem (P1) is NP-hard and AutoFL only approximates its optimal solution, the approximate solution is still effective in keeping the learning time as small as possible.
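A back-of-the-envelope illustration of this failure mode is given below. It assumes a linear relation between the sampled data size and the transmit power a UE needs to get its update decoded; this relation and its constant are purely illustrative and are not the paper's uplink model.

```python
P_MAX = 0.01  # W, from Table I

def update_delivered(d, power_per_sample=2e-4):
    """Toy model: assume the transmit power needed to deliver a UE's update grows
    linearly with the sampled data size d (power_per_sample is an assumed constant).
    Once the requirement exceeds P_max, the update cannot reach the BS."""
    return power_per_sample * d <= P_MAX

for d in (5, 20, 40, 60):
    status = "delivered" if update_delivered(d) else "lost (power capped at P_max)"
    print(f"d = {d:2d}: update {status}")
```

Under these assumed numbers, the updates are delivered for d = 5, 20, 40 but lost at d = 60, mirroring the sudden deterioration observed in Fig. 3.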

V-C Convergence Performance Comparisons

Next, we compare the convergence performance of AutoFL, Per-FedAvg, FedAvg-1, and FedAvg-2 in terms of training loss and test accuracy: the smaller the training loss, the better the convergence performance, while for the test accuracy, the larger, the better. Fig. 4 shows the convergence performance using the i.i.d MNIST, non-i.i.d MNIST, and non-i.i.d CIFAR-10 datasets, respectively.

From Fig. 4, it is observed that AutoFL outperforms the other three algorithms on both metrics, and this advantage is more obvious on the non-i.i.d CIFAR-10 dataset. Besides, AutoFL has the fastest convergence rate, which is consistent with the learning time results, where AutoFL performs the best. This also indicates that K* in AutoFL is smaller than that in Per-FedAvg and FedAvg. For example, for the non-i.i.d MNIST datasets shown in Fig. 4, AutoFL starts to converge after about 50 rounds, while Per-FedAvg starts to converge after about 70 rounds and the two FedAvg algorithms after about 100 rounds.

V-D Effect of Network and Dataset Settings

Figure 5: Convergence performance w.r.t. the cell radius R and the non-i.i.d level l, respectively, using the MNIST dataset: (a) the average K* vs. radius R; (b) the highest achievable accuracy vs. radius R; (c) the average K* vs. the non-i.i.d level l; (d) the highest achievable accuracy vs. the non-i.i.d level l.

To understand how different network and dataset settings, such as the radius of the cell and the data heterogeneity, affect the convergence of AutoFL, we conduct a set of experiments on the MNIST dataset.

  • Effect of the cell radius R: Figs. 5a and 5b show the average K* and the highest achievable test accuracy under different cell radii R. Here K* is measured as the number of rounds after which the algorithm starts to converge, with a test accuracy standard deviation of ±1%. From Figs. 5a and 5b, we observe that K* increases increasingly fast with R, while the test accuracy decreases increasingly fast with R. The underlying reason is that, as R increases, the number of UEs whose local updates can be successfully decoded at the BS decreases. This reduction in UEs' participation in the global model update inevitably leads to an increased K* and a decreased test accuracy. Besides, comparing AutoFL with Per-FedAvg, and FedAvg-2 with FedAvg-1, we find that optimizing the transmit power and sampled data size of individual UEs is beneficial to FL convergence, especially when the wireless environment deteriorates with increasing R.

  • Effect of the data heterogeneity level l: The non-i.i.d level l is used to measure the data heterogeneity; the smaller l is, the larger the data heterogeneity. Figs. 5c and 5d show the average K* and the highest achievable model accuracy with respect to the non-i.i.d level l. We observe that as the non-i.i.d level decreases (i.e., as l increases), K* decreases while the test accuracy increases. This result is reasonable, as a higher degree of data heterogeneity across UEs has a more negative impact on the learning process. As expected, Figs. 5c and 5d also demonstrate that AutoFL achieves larger gains on more heterogeneous datasets across UEs.

VI Conclusions

In this paper, we have quantified the benefit MAML brings to FL over mobile edge networks. The quantification is achieved from two aspects: the determination of FL hyperparameters (i.e., the sampled data sizes and the number of communication rounds) and the resource allocation (i.e., the transmit power) on individual UEs. In this regard, we have formulated an overall learning time minimization problem, constrained by the model accuracy and the energy consumption at individual UEs. We have solved this optimization problem by first analysing the convergence rate of MAML-based FL, which is used to bound the three variables as functions of the model accuracy \epsilon. With these upper bounds, the optimization problem is decoupled into three sub-problems, each of which considers one of the variables and uses the corresponding upper bound as its constraint. The first sub-problem guides the optimization of the number of communication rounds. The second and third sub-problems are solved using the coordinate descent method to obtain the sampled data size and the transmit power for each UE in each round, respectively. Based on these solutions, we have proposed AutoFL, a MAML-based FL algorithm that not only quantifies the benefit MAML brings to FL but also maximizes such benefit to achieve fast adaptation and convergence over mobile edge networks. Extensive experimental results have verified that AutoFL outperforms Per-FedAvg and FedAvg in terms of fast adaptation and convergence.

Appendix

In order to prove Theorem 1, we first introduce an intermediate result derived from the Lipschitz gradient assumption. That is, if the gradient of f(w) is L-Lipschitz continuous, i.e., \|\nabla f(w)-\nabla f(u)\|\leq L\|w-u\|, then

f(w)\leq f(u)+\nabla f(w)^{\top}(w-u)+\frac{L}{2}\|w-u\|^{2}. (35)
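For completeness, (35) follows from the integral form of the Lipschitz-gradient condition; a brief derivation (standard, not specific to this paper) is

f(w)-f(u)-\nabla f(w)^{\top}(w-u)=\int_{0}^{1}\left(\nabla f(u+s(w-u))-\nabla f(w)\right)^{\top}(w-u)\,ds\leq\int_{0}^{1}L(1-s)\|w-u\|^{2}\,ds=\frac{L}{2}\|w-u\|^{2},

where the inequality uses \|\nabla f(u+s(w-u))-\nabla f(w)\|\leq L(1-s)\|w-u\|.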

Note that although we assume that each UE only performs one step of gradient descent given the current global model parameter, in the appendix we first consider the general case where τ\tau (τ=1,2,\tau=1,2,\dots) steps of local gradient descents are performed. Then for each UE ii, we have

w~k,ti=\displaystyle\tilde{w}_{k,t}^{i}= wk,t1iα~fi(wk,t1i;𝒟iin);\displaystyle w_{k,t-1}^{i}-\alpha\tilde{\nabla}f_{i}(w_{k,t-1}^{i};\mathcal{D}_{i}^{\text{in}}); (36)
wk,ti=\displaystyle w_{k,t}^{i}= wk,t1iβ(Iα~2fi(wk,t1i;𝒟io))~fi(w~k,ti;𝒟ih).\displaystyle w_{k,t-1}^{i}-\beta(I-\alpha\tilde{\nabla}^{2}f_{i}(w_{k,t-1}^{i};\mathcal{D}_{i}^{\text{o}}))\tilde{\nabla}f_{i}(\tilde{w}_{k,t}^{i};\mathcal{D}_{i}^{\text{h}}). (37)

where t=1,\dots,\tau. After deriving the bound on finding an \epsilon-FOSP in this general case, we set \tau=1 to obtain the desired result shown in Theorem 1. To facilitate the proof, we introduce the following lemma, which has been proved in [14].

Lemma 4.

If the conditions in Assumptions 2-4 hold, then for any α[0,1/L]\alpha\in[0,1/L] and any t0t\geq 0, we have

𝔼[1ni=1nwk,tiwk,t2]35β2tτ(2σF2+γF2).\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\|w_{k,t}^{i}-w_{k,t}\|^{2}\right]\leq 35\beta^{2}t\tau(2\sigma_{F}^{2}+\gamma_{F}^{2}). (38)

where wk,t=1ni=1nwk,tiw_{k,t}=\frac{1}{n}\sum_{i=1}^{n}w_{k,t}^{i}.

Recall that \bar{w}_{k,t}=\frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}w_{k,t}^{i}. From Lemma 1, we know that \nabla F(w) is L_{F}-Lipschitz continuous, and thus, by (35), we have

F(w¯k+1,t+1)\displaystyle F(\bar{w}_{k+1,t+1})
\displaystyle\leq F(w¯k+1,t)+F(w¯k+1,t+1)(w¯k+1,t+1w¯k+1,t)\displaystyle F(\bar{w}_{k+1,t})+\nabla F(\bar{w}_{k+1,t+1})^{\top}(\bar{w}_{k+1,t+1}-\bar{w}_{k+1,t})
+LF2w¯k+1,t+1w¯k+1,t2\displaystyle+\frac{L_{F}}{2}\|\bar{w}_{k+1,t+1}-\bar{w}_{k+1,t}\|^{2}
\displaystyle\leq F(w¯k+1,t)βF(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))\displaystyle F(\bar{w}_{k+1,t})-\beta\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)
+LF2β2i=1nUki~Fi(wk+1,ti)2,\displaystyle+\frac{L_{F}}{2}\beta^{2}\left\|\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}, (39)

where the last inequality is obtained given the fact that

w¯k+1,t+1\displaystyle\bar{w}_{k+1,t+1}
=\displaystyle= 1i=1n𝟙{ski=1,ξki>ϕ}i=1n𝟙{ski=1,ξki>ϕ}wk+1,t+1i\displaystyle\frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}w_{k+1,t+1}^{i}
=\displaystyle= 1i=1n𝟙{ski=1,ξki>ϕ}\displaystyle\frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}
i=1n𝟙{ski=1,ξki>ϕ}(wk+1,tiβ~Fi(wk+1,ti))\displaystyle\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}\left(w_{k+1,t}^{i}-\beta\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)
=\displaystyle= w¯k+1,tβi=1n𝟙{ski=1,ξki>ϕ}\displaystyle\bar{w}_{k+1,t}-\frac{\beta}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}
i=1n𝟙{ski=1,ξki>ϕ}~Fi(wk+1,ti)\displaystyle\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})
=\displaystyle= w¯k+1,tβi=1nUki~Fi(wk+1,ti).\displaystyle\bar{w}_{k+1,t}-\beta\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i}). (40)

Taking expectation on both sides of (39) yields

𝔼[F(w¯k+1,t+1)]\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t+1})]
\displaystyle\leq 𝔼[F(w¯k+1,t)]+LF2β2𝔼[i=1nUki~Fi(wk+1,ti)2]\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t})]+\frac{L_{F}}{2}\beta^{2}\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]
β𝔼[F(w¯k+1,t+1)(i=1nUki~Fi(wk+1,ti))].\displaystyle-\beta\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]. (41)

From the above inequality, it is obvious that the key is to bound the term i=1nUki~Fi(wk+1,ti)\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i}). Let

i=1nUki~Fi(wk+1,ti)=X+Y+Z+i=1nUkiFi(w¯k+1,t),\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})=X+Y+Z+\sum_{i=1}^{n}U_{k}^{i}\nabla F_{i}(\bar{w}_{k+1,t}), (42)

where

X=\displaystyle X= i=1nUki(~Fi(wk+1,ti)Fi(wk+1,ti)),\displaystyle\sum_{i=1}^{n}U_{k}^{i}\left(\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i})\right),
Y=\displaystyle Y= i=1nUki(Fi(wk+1,ti)Fi(wk+1,t)),\displaystyle\sum_{i=1}^{n}U_{k}^{i}\left(\nabla F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t})\right),
Z=\displaystyle Z= i=1nUki(Fi(wk+1,t)Fi(w¯k+1,t)).\displaystyle\sum_{i=1}^{n}U_{k}^{i}\left(\nabla F_{i}(w_{k+1,t})-\nabla F_{i}(\bar{w}_{k+1,t})\right). (43)

Let k+1,t\mathcal{F}_{k+1,t} be the information up to round k+1k+1, local step tt. Our next step is to bound the moments of XX, YY, ZZ, conditioning on k+1,t\mathcal{F}_{k+1,t}. Recall the Cauchy-Schwarz inequality

\left\|\sum_{i=1}^{n}a_{i}b_{i}\right\|^{2}\leq\left(\sum_{i=1}^{n}\|a_{i}\|^{2}\right)\left(\sum_{i=1}^{n}\|b_{i}\|^{2}\right). (44)
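The form (44), with vectors a_{i} and scalar weights b_{i}, follows from the triangle inequality and the scalar Cauchy-Schwarz inequality:

\left\|\sum_{i=1}^{n}a_{i}b_{i}\right\|\leq\sum_{i=1}^{n}|b_{i}|\,\|a_{i}\|\leq\left(\sum_{i=1}^{n}b_{i}^{2}\right)^{1/2}\left(\sum_{i=1}^{n}\|a_{i}\|^{2}\right)^{1/2}.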
  • As for XX, consider the Cauchy-Schwarz inequality (44) with ai=~Fi(wk+1,ti)Fi(wk+1,ti)a_{i}=\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i}) and bi=Ukib_{i}=U_{k}^{i}, we obtain

    X2\displaystyle\|X\|^{2}
    \displaystyle\leq (i=1nUi2)(i=1n~Fi(wk+1,ti)Fi(wk+1,ti)2),\displaystyle\left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\sum_{i=1}^{n}\left\|\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i})\right\|^{2}\right), (45)

    where Ui=maxkUkiU_{i}=\max_{k}U_{k}^{i}. Hence, by using Lemma 2 along with the tower rule, we have

    𝔼[X2]=𝔼[𝔼[X2|k+1,t]](i=1nUi2)σF2.\mathbb{E}[\|X\|^{2}]=\mathbb{E}[\mathbb{E}[\|X\|^{2}|\mathcal{F}_{k+1,t}]]\leq\left(\sum_{i=1}^{n}U_{i}^{2}\right)\sigma_{F}^{2}. (46)
  • As for YY, consider the Cauchy-Schwarz inequality (44) with ai=Fi(wk+1,ti)Fi(wk+1,t)a_{i}=\nabla F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}) and bi=Ukib_{i}=U_{k}^{i}, along with the smoothness of FiF_{i}, we obtain

    Y2\displaystyle\|Y\|^{2}
    \displaystyle\leq (i=1nUi2)(i=1nFi(wk+1,ti)Fi(wk+1,t)2)\displaystyle\left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\sum_{i=1}^{n}\left\|\nabla F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t})\right\|^{2}\right)
    \displaystyle\leq LF2(i=1nUi2)i=1nwk+1,tiwk+1,t2.\displaystyle L_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\sum_{i=1}^{n}\|w_{k+1,t}^{i}-w_{k+1,t}\|^{2}. (47)

    Again, taking expectation on both sides of (47) along with the tower rule, we obtain

    𝔼[Y2]\displaystyle\mathbb{E}[\|Y\|^{2}]\leq LF2(i=1nUi2)𝔼[i=1nwk,tiwk,t2]\displaystyle L_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\mathbb{E}\left[\sum_{i=1}^{n}\|w_{k,t}^{i}-w_{k,t}\|^{2}\right]
    \displaystyle\leq 35β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2),\displaystyle 35\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right), (48)

    where the last step is obtained from (38) in Lemma 4 along with the fact that tτ1t\leq\tau-1.

  • As for Z, first recall that for n numbers a_{1},a_{2},\dots,a_{n} with mean \mu=\frac{1}{n}\sum_{i=1}^{n}a_{i} and variance \sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}|a_{i}-\mu|^{2}, if we denote the number of UEs that successfully deliver their local updates for the global model as \delta n (0\leq\delta\leq 1), then according to [14], we have

    𝔼[|i=1nUiaiμ|2]=σ2(1δ)δ(n1).\mathbb{E}\left[\left|\sum_{i=1}^{n}U_{i}a_{i}-\mu\right|^{2}\right]=\frac{\sigma^{2}(1-\delta)}{\delta(n-1)}. (49)

    Using this, we have

    \displaystyle\mathbb{E}\left[\|\bar{w}_{k+1,t}-w_{k+1,t}\|^{2}\,\big|\,\mathcal{F}_{k+1,t}\right]
    \displaystyle\leq (1δ)i=1nwk+1,tiwk+1,t2δ(n1)n.\displaystyle\frac{(1-\delta)\sum_{i=1}^{n}\|w_{k+1,t}^{i}-w_{k+1,t}\|^{2}}{\delta(n-1)n}. (50)

    Therefore, by taking expectation on both sides of (50) along with the tower rule, we have

    𝔼[w¯k+1,twk+1,t2]\displaystyle\mathbb{E}\left[\|\bar{w}_{k+1,t}-w_{k+1,t}\|^{2}\right]
    \displaystyle\leq 35β2(1δ)τ(τ1)(2σF2+γF2)δ(n1),\displaystyle\frac{35\beta^{2}(1-\delta)\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})}{\delta(n-1)}, (51)

    where the inequality is also obtained from (38) in Lemma 4. Next, consider the Cauchy-Schwarz inequality (44) with ai=Fi(wk+1,t)Fi(w¯k+1,t)a_{i}=\nabla F_{i}(w_{k+1,t})-\nabla F_{i}(\bar{w}_{k+1,t}) and bi=Ukib_{i}=U_{k}^{i}, we have

    Z2\displaystyle\|Z\|^{2}
    \displaystyle\leq (i=1nUi2)(i=1nFi(wk+1,t)Fi(w¯k+1,t)2)\displaystyle\left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\sum_{i=1}^{n}\left\|\nabla F_{i}(w_{k+1,t})-\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}\right)
    \displaystyle\leq LF2(i=1nUi2)i=1nwk+1,tw¯k+1,t2.\displaystyle L_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\sum_{i=1}^{n}\|w_{k+1,t}-\bar{w}_{k+1,t}\|^{2}. (52)

    At this point, taking expectation on both sides of (52), along with the use of (51), yields

    𝔼[Z2]\displaystyle\mathbb{E}[\|Z\|^{2}]
    \displaystyle\leq 35β2LF2(1δ)nτ(τ1)(2σF2+γF2)δ(n1)(i=1nUi2).\displaystyle\frac{35\beta^{2}L_{F}^{2}(1-\delta)n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})}{\delta(n-1)}\left(\sum_{i=1}^{n}U_{i}^{2}\right). (53)

Now, getting back to (41), we first lower bound the term

𝔼[F(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))].\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]. (54)

From (42), we have

𝔼\displaystyle\mathbb{E} [F(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))]\displaystyle\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]
=\displaystyle= 𝔼[F(w¯k+1,t+1)(X+Y+Z+i=1nUiFi(w¯k+1,t))]\displaystyle\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(X+Y+Z+\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right)\right]
\displaystyle\geq 𝔼[i=1nUiF(w¯k+1,t)2]𝔼[F(w¯k+1,t+1)X]\displaystyle\mathbb{E}\left[\sum_{i=1}^{n}U_{i}\|\nabla F(\bar{w}_{k+1,t})\|^{2}\right]-\|\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}X\right]\|
14𝔼[F(w¯k+1,t)2]𝔼[Y+Z2].\displaystyle-\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]-\mathbb{E}[\|Y+Z\|^{2}]. (55)

This inequality follows the same reasoning as in [14], using the fact that

𝔼[F(w¯k+1,t)(Y+Z)]\displaystyle\mathbb{E}[\nabla F(\bar{w}_{k+1,t})^{\top}(Y+Z)]
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]+𝔼[Y+Z2].\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\mathbb{E}[\|Y+Z\|^{2}]. (56)
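The inequality (56) is an instance of Young's inequality: for any two vectors a and b,

a^{\top}b\leq\|a\|\,\|b\|\leq\frac{1}{4}\|a\|^{2}+\|b\|^{2},

since \frac{1}{4}x^{2}+y^{2}-xy=\left(\frac{x}{2}-y\right)^{2}\geq 0 for any scalars x,y\geq 0; here a=\nabla F(\bar{w}_{k+1,t}), b=Y+Z, and expectations are taken on both sides.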

Meanwhile, the bounds of the terms on the right-hand side of (55) can be derived similarly to [14]. That is,

𝔼[F(w¯k+1,t+1)X]\displaystyle\|\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}X\right]\|
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]+𝔼[𝔼[X|k+1,t]2]\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\mathbb{E}\left[\left\|\mathbb{E}\left[X|\mathcal{F}_{k+1,t}\right]\right\|^{2}\right]
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
+𝔼[i=1nUi(~Fi(wk+1,ti)Fi(wk+1,ti))2]\displaystyle+\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}(\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i}))\right\|^{2}\right]
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]+(i=1nUi2)4α2L2σG2D.\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}. (57)

Meanwhile, using the Cauchy-Schwarz inequality (44), we have

𝔼[Y+Z2]\displaystyle\mathbb{E}[\|Y+Z\|^{2}]
\displaystyle\leq 2(𝔼[Y2]+𝔼[Z2])\displaystyle 2(\mathbb{E}[\|Y\|^{2}]+\mathbb{E}[\|Z\|^{2}])
\displaystyle\leq 70β2LF2nτ(τ1)(2σF2+γF2)(1+1δδ(n1))(i=1nUi2)\displaystyle 70\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(1+\frac{1-\delta}{\delta(n-1)}\right)\left(\sum_{i=1}^{n}U_{i}^{2}\right)
\displaystyle\leq 140β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2).\displaystyle 140\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right). (58)
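The last step above uses the bound

\frac{1-\delta}{\delta(n-1)}\leq 1\quad\Longleftrightarrow\quad 1\leq\delta n,

which holds as long as at least one UE's local update is successfully received by the BS in each round (i.e., \delta n\geq 1).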

Combining (57) and (58) yields

𝔼[F(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))]\displaystyle\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]
\displaystyle\geq 𝔼[i=1n(Ui12)F(w¯k+1,t)2](i=1nUi2)4α2L2σG2D\displaystyle\mathbb{E}\left[\sum_{i=1}^{n}(U_{i}-\frac{1}{2})\|\nabla F(\bar{w}_{k+1,t})\|^{2}\right]-\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}
140β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle-140\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
\displaystyle\geq 12𝔼[F(w¯k+1,t)2](i=1nUi2)4α2L2σG2D\displaystyle\frac{1}{2}\mathbb{E}\left[\|\nabla F(\bar{w}_{k+1,t})\|^{2}\right]-\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}
140β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2),\displaystyle-140\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right), (59)

where the last step is derived from the fact that \sum_{i=1}^{n}U_{i}\geq\sum_{i=1}^{n}U_{k}^{i}=1. Next, we characterize an upper bound for the other term in (41):

𝔼[i=1nUi~Fi(wk+1,ti)2].\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]. (60)

To do this, we still use the equality (42), that is,

i=1nUi~Fi(wk+1,ti)2\displaystyle\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}
\displaystyle\leq 2X+Y+Z2+2i=1nUiFi(w¯k+1,t)2\displaystyle 2\|X+Y+Z\|^{2}+2\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}
\displaystyle\leq 4X2+4Y+Z2+2i=1nUiFi(w¯k+1,t)2.\displaystyle 4\|X\|^{2}+4\|Y+Z\|^{2}+2\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}. (61)

Hence, taking expectations on both sides of (61), along with (46) and (58), we have

𝔼[i=1nUi~Fi(wk+1,ti)2]\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]
\displaystyle\leq 2𝔼[i=1nUiFi(w¯k+1,t)2]+4σF2(i=1nUi2)\displaystyle 2\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}\right]+4\sigma_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+560β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2).\displaystyle+560\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right). (62)

Again, consider the Cauchy-Schwarz inequality with ai=Fi(wk+1,t)a_{i}=\nabla F_{i}(w_{k+1,t}) and bi=Uib_{i}=U_{i}, we have

i=1nUiFi(w¯k+1,t)2\displaystyle\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}
\displaystyle\leq (i=1nFi(wk+1,t)2)(i=1nUi2).\displaystyle\left(\left\|\sum_{i=1}^{n}\nabla F_{i}(w_{k+1,t})\right\|^{2}\right)\left(\sum_{i=1}^{n}U_{i}^{2}\right). (63)

By Lemma 3, we have

𝔼[Fi(w¯k+1,t)F(w¯k+1,t)2]γF2.\mathbb{E}\left[\|\nabla F_{i}(\bar{w}_{k+1,t})-\nabla F(\bar{w}_{k+1,t})\|^{2}\right]\leq\gamma_{F}^{2}. (64)

Recall the relationship between the expectation of a random vector 𝐱\mathbf{x}, 𝔼(𝐱)\mathbb{E}(\mathbf{x}) and its variance, 𝔻(𝐱)\mathbb{D}(\mathbf{x}), that 𝔻(𝐱)=𝔼(𝐱2)[𝔼(𝐱)]2\mathbb{D}(\mathbf{x})=\mathbb{E}(\mathbf{x}^{2})-[\mathbb{E}(\mathbf{x})]^{2}. Given that 1/ni=1nFi(w¯k+1,t)=F(w¯k+1,t)1/n\sum_{i=1}^{n}\nabla F_{i}(\bar{w}_{k+1,t})=\nabla F(\bar{w}_{k+1,t}), we have

𝔼[i=1nFi(wk+1,t)2]=γF2+𝔼[F(w¯k+1,t)2].\mathbb{E}\left[\left\|\sum_{i=1}^{n}\nabla F_{i}(w_{k+1,t})\right\|^{2}\right]=\gamma_{F}^{2}+\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]. (65)

Combining (62), (63) and (65) yields

𝔼[i=1nUi~Fi(wk+1,ti)2]\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]
\displaystyle\leq 2(i=1nUi2)𝔼[F(w¯k+1,t)2]\displaystyle 2\left(\sum_{i=1}^{n}U_{i}^{2}\right)\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
+560β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle+560\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+2(i=1nUi2)γF2+4σF2(i=1nUi2).\displaystyle+2\left(\sum_{i=1}^{n}U_{i}^{2}\right)\gamma_{F}^{2}+4\sigma_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right). (66)

Substituting (59) and (66) into (41) implies

𝔼\displaystyle\mathbb{E} [F(w¯k+1,t+1)]\displaystyle[F(\bar{w}_{k+1,t+1})]
\displaystyle\leq 𝔼[F(w¯k+1,t)]β[12βLFi=1nUi2]𝔼[F(w¯k+1,t)2]\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t})]-\beta\left[\frac{1}{2}-\beta L_{F}\sum_{i=1}^{n}U_{i}^{2}\right]\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
+140(1+2βLF)β3LF2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle+140(1+2\beta L_{F})\beta^{3}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+LFβ2(i=1nUi2)γF2+β(i=1nUi2)4α2L2σG2D\displaystyle+L_{F}\beta^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\gamma_{F}^{2}+\beta\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}
+2LFβ2σF2(i=1nUi2)\displaystyle+2L_{F}\beta^{2}\sigma_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)
\displaystyle\leq 𝔼[F(w¯k+1,t)]β4𝔼[F(w¯k+1,t)2]+βσT2,\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t})]-\frac{\beta}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\beta\sigma_{T}^{2}, (67)

where

σT2=\displaystyle\sigma_{T}^{2}= 280(βLF)2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle 280(\beta L_{F})^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+βLF(2σF2+γF2)(i=1nUi2)+4α2L2σG2D(i=1nUi2).\displaystyle+\beta L_{F}(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)+\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}\left(\sum_{i=1}^{n}U_{i}^{2}\right). (68)

The last step of (67) is obtained using \beta\leq 1/(2\tau L_{F}). Summing (67) from t=0 to t=\tau-1, we have

𝔼[F(wk+1)]\displaystyle\mathbb{E}[F(w_{k+1})]
\displaystyle\leq 𝔼[F(wk)]βτ4(1τt=0τ1𝔼[F(w¯k+1,t)2])+βτσT2,\displaystyle\mathbb{E}[F(w_{k})]-\frac{\beta\tau}{4}\left(\frac{1}{\tau}\sum_{t=0}^{\tau-1}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]\right)+\beta\tau\sigma_{T}^{2}, (69)

which follows from the fact that \bar{w}_{k+1,\tau}=w_{k+1}. Finally, summing (69) from k=0 to K-1, we have

𝔼[F(wK)]𝔼[F(w0)]\displaystyle\mathbb{E}[F(w_{K})]\leq\mathbb{E}[F(w_{0})]
βτK4(1τKk=0K1t=0τ1𝔼[F(w¯k+1,t)2])+βτKσT2.\displaystyle-\frac{\beta\tau K}{4}\left(\frac{1}{\tau K}\sum_{k=0}^{K-1}\sum_{t=0}^{\tau-1}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]\right)+\beta\tau K\sigma_{T}^{2}. (70)

As a result, we have

1τKk=0K1t=0τ1𝔼[F(w¯k+1,t)2]\displaystyle\frac{1}{\tau K}\sum_{k=0}^{K-1}\sum_{t=0}^{\tau-1}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
\displaystyle\leq 4βτK(F(w0)𝔼[(F(wK))]+βτKσT2)\displaystyle\frac{4}{\beta\tau K}(F(w_{0})-\mathbb{E}[(F(w_{K}))]+\beta\tau K\sigma_{T}^{2})
\displaystyle\leq 4(F(w0)F)βτK+4σT2.\displaystyle\frac{4(F(w_{0})-F^{*})}{\beta\tau K}+4\sigma_{T}^{2}. (71)

Given that we only consider one step of local SGD in this paper, we use τ=1\tau=1, and then the desired result is obtained.
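For concreteness, substituting \tau=1 into (68) and (71) (so that all \tau(\tau-1) terms vanish) gives the specialized bound below; this is our restatement of the \tau=1 case in terms of the quantities defined above:

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|\nabla F(\bar{w}_{k+1,0})\|^{2}\right]\leq\frac{4(F(w_{0})-F^{*})}{\beta K}+4\sigma_{T}^{2},\quad\text{where}\quad\sigma_{T}^{2}=\left(\beta L_{F}(2\sigma_{F}^{2}+\gamma_{F}^{2})+\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}\right)\sum_{i=1}^{n}U_{i}^{2}.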

References

  • [1] M. Ghahramani, M. Zhou, and G. Wang, “Urban sensing based on mobile phone data: approaches, applications, and challenges,” IEEE/CAA Journal of Automatica Sinica, vol. 7, no. 3, pp. 627–637, 2020.
  • [2] R. Pryss, M. Reichert, J. Herrmann, B. Langguth, and W. Schlee, “Mobile crowd sensing in clinical and psychological trials–a case study,” in IEEE International Symposium on Computer-Based Medical Systems, 2015, pp. 23–24.
  • [3] R. K. Ganti, F. Ye, and H. Lei, “Mobile crowdsensing: current state and future challenges,” IEEE Communications Magazine, vol. 49, no. 11, pp. 32–39, 2011.
  • [4] P. Li, J. Li, Z. Huang, T. Li, C.-Z. Gao, S.-M. Yiu, and K. Chen, “Multi-key privacy-preserving deep learning in cloud computing,” Future Generation Computer Systems, vol. 74, pp. 76–85, 2017.
  • [5] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017.
  • [6] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
  • [7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
  • [8] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh, “Three approaches for personalization with applications to federated learning,” arXiv preprint arXiv:2002.10619, 2020.
  • [9] J. Schneider and M. Vlachos, “Mass personalization of deep learning,” arXiv preprint arXiv:1909.02803, 2019.
  • [10] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
  • [11] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International Conference on Machine Learning (ICML), vol. 70, 2017, pp. 1126–1135.
  • [12] J. Vanschoren, “Meta-learning,” in Automated Machine Learning.   Springer, Cham, 2019, pp. 35–61.
  • [13] F. Hutter, L. Kotthoff, and J. Vanschoren, Automated machine learning: methods, systems, challenges.   Springer Nature, 2019.
  • [14] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • [15] Y. Jiang, J. Konečnỳ, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” arXiv preprint arXiv:1909.12488, 2019.
  • [16] Y. Deng, M. M. Kamani, and M. Mahdavi, “Adaptive personalized federated learning,” arXiv preprint arXiv:2003.13461, 2020.
  • [17] C. T. Dinh, N. H. Tran, and T. D. Nguyen, “Personalized federated learning with moreau envelopes,” arXiv preprint arXiv:2006.08848, 2020.
  • [18] Q. Wu, K. He, and X. Chen, “Personalized federated learning for intelligent iot applications: A cloud-edge based framework,” IEEE Open Journal of the Computer Society, vol. 1, pp. 35–44, 2020.
  • [19] S. Yue, J. Ren, J. Xin, S. Lin, and J. Zhang, “Inexact-admm based federated meta-learning for fast and continual edge learning,” in Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 2021, pp. 91–100.
  • [20] S. J. Wright, Coordinate descent algorithms.   Springer, 2015, vol. 151, no. 1.
  • [21] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning: A meta-learning approach,” arXiv preprint arXiv:2002.07948, 2020.
  • [22] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2019.
  • [23] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in IEEE International Conference on Computer Communications (INFOCOM), 2019, pp. 1387–1395.
  • [24] Y. Pei, Z. Peng, Z. Wang, and H. Wang, “Energy-efficient mobile edge computing: three-tier computing under heterogeneous networks,” Hindawi Wireless Communications and Mobile Computing, vol. 2020, 2020.
  • [25] D. P. Palomar and M. Chiang, “A tutorial on decomposition methods for network utility maximization,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 24, no. 8, pp. 1439–1451, 2006.
  • [26] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [27] “The CIFAR-10 dataset,” https://, 2014.
  • [28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8026–8037.