
Automated Federated Learning in Mobile Edge Networks — Fast Adaptation and Convergence

Chaoqun You,  Kun Guo, 
Gang Feng,  Peng Yang, 
and Tony Q. S. Quek
C. You and T. Quek are with the Singapore University of Technology and Design, 487372, Singapore (e-mail: [email protected] and [email protected]). K. Guo is with East China Normal University, Shanghai 200241, P.R. China (e-mail: [email protected]). G. Feng is with the University of Electronic Science and Technology of China, Chengdu 611731, P.R. China (e-mail: [email protected]). P. Yang is with Beihang University, Beijing 100088, P.R. China (e-mail: [email protected]).
Abstract

Federated Learning (FL) can be used in mobile edge networks to train machine learning models in a distributed manner. Recently, FL has been interpreted within a Model-Agnostic Meta-Learning (MAML) framework, which brings FL significant advantages in fast adaptation and convergence over heterogeneous datasets. However, existing research simply combines MAML and FL without explicitly addressing how much benefit MAML brings to FL and how to maximize such benefit over mobile edge networks. In this paper, we quantify the benefit from two aspects: optimizing FL hyperparameters (i.e., sampled data size and the number of communication rounds) and resource allocation (i.e., transmit power) in mobile edge networks. Specifically, we formulate the MAML-based FL design as an overall learning time minimization problem, under the constraints of model accuracy and energy consumption. Facilitated by the convergence analysis of MAML-based FL, we decompose the formulated problem and then solve it using analytical solutions and the coordinate descent method. With the obtained FL hyperparameters and resource allocation, we design a MAML-based FL algorithm, called Automated Federated Learning (AutoFL), that is able to conduct fast adaptation and convergence. Extensive experimental results verify that AutoFL outperforms other benchmark algorithms regarding the learning time and convergence performance.

Index Terms:
Fast adaptation and convergence, federated learning, model-agnostic meta-learning, mobile edge networks

I Introduction

In recent years, modern mobile user equipments (UEs) such as smartphones and wearable devices have been equipped with advanced sensing and computing capabilities [1]. This gives them access to a wealth of data suitable for model learning, which opens up countless opportunities for meaningful applications such as Artificial Intelligence (AI) medical diagnosis [2] and air quality monitoring [3]. Traditionally, learning models requires data to be processed in a cloud data center [4]. However, due to the long distance between the devices where data is generated and the servers in data centers, cloud-based Machine Learning (ML) for mobile UEs may incur unacceptable latency and communication overhead. Therefore, Mobile Edge Computing (MEC) [5, 6] has been proposed to facilitate the deployment of servers near the base station (BS) in mobile edge networks, so as to bring intelligence to the network edge.

In mobile edge networks, a traditional ML paradigm inevitably requires uploading raw data from mobile UEs to a server deployed near the BS for model learning. However, the unprecedented amount of data created by mobile UEs is private in nature, leading to increasing concerns over data security and user privacy. To address this issue, Federated Learning (FL) has been proposed [7] as a new ML paradigm. FL in mobile edge networks refers to training models across multiple distributed UEs without ever uploading their raw data to the server. In particular, UEs compute local updates to the current global model based on their own local data, which are then aggregated and fed back by an edge server, so that all UEs have access to the same global model for their new local updates. Such a procedure is implemented in one communication round and is repeated until a certain model accuracy is reached.

Despite its promising benefits, FL also comes with new challenges in practice. In particular, the datasets across UEs are heterogeneous. Not only does the number of data samples generated by UEs vary, but these data samples are usually not independent and identically distributed (non-i.i.d). Learning from such heterogeneous data is not easy, as conventional FL algorithms usually develop a common model for all UEs [8–10], so that the global model obtained by minimizing the average loss could perform arbitrarily poorly once applied to the local dataset of a specific UE. That is, the global model derived from conventional FL algorithms may adapt poorly to local UEs. Such weak adaptation further restricts the convergence rate of FL algorithms.

Multiple techniques are emerging as promising solutions to the data heterogeneity problem, such as adding user context [8], transfer learning [9], multi-task learning [10], and Model-Agnostic Meta-Learning (MAML) [11]. Of all these techniques, MAML is the only one that not only addresses the heterogeneous data problem, but also greatly speeds up the FL learning process. MAML is a well-known meta-learning [12] approach that learns from past experience to adapt to new tasks much faster. Such adaptation makes it possible for MAML to develop well-behaved user-specific models. More specifically, MAML aims to find a sensitive initial point, learned from past experience, from which fast adaptation requires only a few data points on each UE. With the learned initial model, MAML dramatically speeds up the learning process by replacing hand-engineered algorithms with an automated, data-driven approach [13]. Fortunately, in both MAML and FL, existing algorithms use a variant of the gradient descent method locally and send an overall update to a coordinator to update the global model. This similarity makes it possible to interpret FL within a MAML framework [14, 15, 16, 17]. Such a simple MAML-based FL is termed Personalized Federated Learning (PFL), and the algorithm that realizes PFL is termed Personalized Federated Averaging (Per-FedAvg). However, Per-FedAvg is a pure ML algorithm, and how to apply it to practical mobile edge networks remains unclear. Recently, several attempts have been made to study the implementation of Per-FedAvg in practice. [18] proposes a framework to execute Per-FedAvg for intelligent IoT applications without considering Per-FedAvg's strength in saving learning time. [19], although it aims to achieve fast learning for IoT applications at the edge, does not consider resource allocation in a practical mobile edge network. Therefore, to what degree a MAML-based FL algorithm expedites FL in mobile edge networks, and under what conditions such benefit can be achieved, remain unexplored.

In order to quantify the benefit MAML brings to FL in mobile edge networks, we consider two aspects to minimize the overall learning time: optimizing learning-related hyperparameters and resource allocation. On the one hand, the typical FL hyperparameters, including the sampled data sizes across UEs and the number of communication rounds, have a significant impact on the overall learning time and thus need to be carefully specified. On the other hand, resource allocation should be considered to account for UEs' practical wireless communication environments and limited battery lifetimes. In particular, due to random channel gain and noise over wireless channels, the transmit power allocated to a UE decides whether the transmitted updates from that UE can be successfully decoded by the edge server. Restricted by the limited battery lifetime, it also determines whether the remaining energy on that UE is sufficient to support its local training.

It is non-trivial to solve the formulated optimization problem along the above two quantitative dimensions, given that the relationship between the variables (i.e., the sampled data sizes across UEs, the number of communication rounds, and the transmit power of UEs) and the model accuracy $\epsilon$ is implicit. Therefore, we start with the convergence analysis of the MAML-based FL algorithm, by which the three variables are bounded as functions of $\epsilon$. After that, the formulated optimization problem can be approximately decoupled into three sub-problems, each of which accounts for one of the three variables. Specifically, solving the first sub-problem yields the required number of communication rounds. As for the other two sub-problems, we use the coordinate descent method [20] to compute the sampled data size and the transmit power of all UEs iteratively. In other words, we first give an initial value of the transmit power of UEs and use it to compute the sampled data sizes in the second sub-problem; then, with the obtained sampled data sizes, we compute the transmit power in the third sub-problem. This process repeats until a certain model accuracy is achieved.

The solution to the optimization problem guides us to the design of Automated Federated Learning (AutoFL). AutoFL uses the results derived from the optimization problem to design the learning process, thereby quantifying as well as maximizing the benefit MAML brings to FL over mobile edge networks. More specifically, in each round $k$, the BS first sends the current model parameter $w_{k}$ to a random subset of UEs. According to the given model accuracy $\epsilon$, the sampled data sizes and the transmit power of the selected UEs are determined by our proposed solution. Each selected UE then trains its local model with the determined sampled data size and transmits the local update to the edge server with the determined transmit power. The server receives local updates from the selected UEs and decides whether these updates can be successfully decoded. The successfully decoded updates are then aggregated to update the global model as $w_{k+1}$. Such a communication round is executed repeatedly until the model accuracy $\epsilon$ is achieved. In this way, AutoFL is able to inherit the advantages of MAML over mobile edge networks, thereby conducting model learning with fast adaptation and convergence.

To summarize, in this paper we make the following contributions:

  • We provide a comprehensive problem formulation for the MAML-based FL design over mobile edge networks, which accounts for the impact of FL hyperparameters and network parameters on the model accuracy and learning time at the same time. Specifically, we jointly optimize the UEs' sampled data size and transmit power, as well as the number of communication rounds, to quantify and maximize the benefit MAML brings to FL, by which the overall learning time is minimized under the constraints of model accuracy $\epsilon$ and energy consumption.

  • We analyse the convergence of the MAML-based FL algorithm as the first step towards solving the formulated problem. The convergence analysis makes the relationships between the optimization variables and the model accuracy explicit, and in particular characterizes the sampled data size, the transmit power, and the number of communication rounds as functions of the model accuracy $\epsilon$.

  • Applying the results of the convergence analysis, we decouple the formulated problem into three sub-problems. By solving these sub-problems, the number of communication rounds is characterized by a closed-form solution, while the sampled data sizes and the transmit power of all UEs in each round are obtained using the coordinate descent method. With the optimized hyperparameters and resource allocation, we further propose AutoFL with fast adaptation and convergence.

  • By conducting extensive experiments on the MNIST and CIFAR-10 datasets, we demonstrate the effectiveness and advantages of AutoFL over Per-FedAvg and FedAvg, two baseline FL algorithms, in terms of learning time, model accuracy, and training loss.

The rest of this paper is organized as follows. We first give the system model and problem formulation in Section II. Then we give the convergence analysis to make the formulated problem tractable in Section III. We recast the optimization problem and then propose the solutions to guide the design of AutoFL in Section IV. Extensive experimental results are presented and discussed in Section V. Finally, we conclude this paper in Section VI.

II System Model and Problem Formulation

Consider a typical mobile edge network with an edge server co-deployed with a BS and $n$ UEs, where the UEs are indexed by $\mathcal{U}=\{1,\dots,n\}$, as illustrated in Fig. 1. In this section, we first explain why the FL problem can be interpreted within the MAML framework. Based on this framework, we then introduce our system model and problem formulation.

Refer to caption
Figure 1: An illustration of mobile edge network.

II-A Interpreting FL within the MAML framework

In FL, we consider a set of $n$ UEs connected to the server via the BS, where each UE has access only to its local data [7]. For a sample data point $\{x,y\}$ with input $x$, the goal of the server is to find the model parameter $w$ that characterizes the output $y$ with loss function $f(w):\mathbb{R}^{m}\rightarrow\mathbb{R}$, such that the value of $f(w)$ is minimized. More specifically, if we define $f_{i}(w):\mathbb{R}^{m}\rightarrow\mathbb{R}$ as the loss function of UE $i$, the goal of the server is to solve

\min_{w\in\mathbb{R}^{m}} f(w) := \frac{1}{n}\sum_{i=1}^{n} f_{i}(w). \qquad (1)

In particular, for each UE $i$, we have

f_{i}(w) := \frac{1}{D_{i}}\sum_{(x,y)\in\mathcal{D}_{i}} l_{i}(w;x,y), \qquad (2)

where $l_{i}(w;x,y)$ is the error between the true label $y\in\mathcal{Y}_{i}$ and the prediction of model $w$ using input $x\in\mathcal{X}_{i}$. Each UE $i$ has a local dataset $\mathcal{D}_{i}=\{x\in\mathcal{X}_{i},y\in\mathcal{Y}_{i}\}$, with $D_{i}=|\mathcal{D}_{i}|$ data samples. Since the datasets captured by the UEs are naturally heterogeneous, the probability distribution of $\mathcal{D}_{i}$ across UEs is not identical.
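To make the notation concrete, the following minimal sketch (in PyTorch) shows how the per-UE loss $f_i(w)$ in (2) and the global objective $f(w)$ in (1) could be evaluated; it assumes a classification loss and hypothetical `model` and `dataset` objects, which are not specified in the paper.

```python
import torch
import torch.nn.functional as F

def local_loss(model, dataset):
    """Empirical loss f_i(w) of one UE: the average of l_i(w; x, y)
    over that UE's local dataset, as in equation (2)."""
    losses = [F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
              for x, y in dataset]
    return torch.stack(losses).mean()

def global_loss(model, local_datasets):
    """Global objective f(w) in equation (1): the average of the
    per-UE losses over all n UEs."""
    return torch.stack([local_loss(model, d) for d in local_datasets]).mean()
```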

MAML, on the other hand, is one of the most attractive techniques in meta-learning. MAML is proposed to learn an initial model that adapts quickly to a new task through one or more gradient steps with only a few data points from that new task. Each task can be regarded as an object with its own dataset to learn from, just like a UE in FL. MAML allows us to replace hand-engineered algorithms with data-driven approaches that learn initial parameters automatically.

To show how the fundamental idea behind the MAML framework [11] can be exploited to design an automated variant of FL, let us first briefly recap the MAML formulation. In MAML, if we regard the tasks as UEs and assume each UE takes the initial model and updates it using one step of gradient descent with respect to its own loss function (a more general case is to perform multiple steps of gradient descent; however, this would incur the expensive cost of computing multiple Hessians, so for simplicity we consider only a single gradient step throughout this paper), problem (1) then changes to

\min_{w\in\mathbb{R}^{m}} F(w) := \frac{1}{n}\sum_{i=1}^{n} f_{i}\bigl(w-\alpha\nabla f_{i}(w)\bigr), \qquad (3)

where $\alpha\geq 0$ is the learning rate at a UE. For UE $i$, its optimization objective $F_{i}(w)$ can be expressed as

F_{i}(w) := f_{i}\bigl(w-\alpha\nabla f_{i}(w)\bigr). \qquad (4)

Such a transformation from problem (1) to (3) implies that the FL problem can be interpreted within the MAML framework. The FL algorithm proposed to solve (3) is termed as Personalized Federated Averaging (Per-FedAvg) [21, 14]. Per-FedAvg is inspired by FedAvg, which is proposed in [7] as a classic and general FL algorithm to solve (1) in a distributed manner.

Per-FedAvg is summarized in Algorithm 1. In each round $k$ ($k=1,\dots,K$), the central BS randomly picks a set of UEs $\mathcal{A}_{k}$ and then sends them the current model parameter $w_{k}$. Each UE first adapts the global model $w_{k}$ to its local data and obtains an intermediate parameter $\theta_{k}^{i}$, where $\theta_{k}^{i}=w_{k}-\alpha\nabla f_{i}(w_{k})$. Then, with $\theta_{k}^{i}$, UE $i$ updates its local model using one or more steps of gradient descent and obtains $w_{k+1}^{i}$. Such local model parameter updates are then sent to the server for model aggregation. The server updates the global model, or meta model, as $w_{k+1}$. This process repeats in the following rounds until a certain model accuracy is achieved or a predefined maximum number of rounds is reached.

Input: UE learning rate $\alpha$; BS learning rate $\beta$; initial model parameter $w_{0}$;
1  for $k=1$ to $K$ do
2      Choose a subset of UEs $\mathcal{A}_{k}$ uniformly at random with size $A_{k}$;
3      BS sends $w_{k}$ to all UEs in $\mathcal{A}_{k}$;
4      for each UE $i\in\mathcal{A}_{k}$ do
5          Compute $\theta_{k}^{i} := w_{k}-\alpha\nabla_{w_{k}}f_{i}(w_{k})$;
6          Compute $w_{k+1}^{i} := w_{k}-\beta\nabla_{w_{k}}F_{i}(\theta_{k}^{i}) = w_{k}-\beta\bigl(I-\alpha\nabla_{w_{k}}^{2}f_{i}(w_{k})\bigr)\nabla_{\theta_{k}^{i}}f_{i}(\theta_{k}^{i})$;
7          UE $i$ sends $w_{k+1}^{i}$ to the central BS;
8      end for
9      BS updates the global model as $w_{k+1}=\frac{1}{A_{k}}\sum_{i\in\mathcal{A}_{k}}w_{k+1}^{i}$;
10 end for
Algorithm 1 Per-FedAvg Pseudo-code

Per-FedAvg is a MAML-based FL algorithm, which suggests a general approach to using the MAML method for solving the data heterogeneity problem in FL. It is proposed with a focus on personalization in FL, a natural feature inherited from MAML. Beyond personalization, MAML is also a few-shot learning approach, requiring only a few samples to learn new tasks. Therefore, Per-FedAvg also inherits this feature of MAML, thereby adapting quickly from only a few samples. To what degree MAML speeds up FL and under what conditions such benefit can be achieved is exactly what this paper addresses.
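To illustrate lines 5 and 6 of Algorithm 1, the following sketch shows how one Per-FedAvg local update could be computed with automatic differentiation. It is only illustrative: it assumes the model parameters are flattened into a single vector and that a hypothetical helper `loss_fn(w, batch)` returns the local loss $f_i(w)$ on a batch, neither of which is specified in the paper.

```python
import torch

def per_fedavg_local_update(w, loss_fn, batch_inner, batch_outer, alpha, beta):
    """One local update of Algorithm 1: theta = w - alpha * grad f_i(w), then
    w_new = w - beta * (I - alpha * Hessian f_i(w)) grad f_i(theta)."""
    w = w.detach().clone().requires_grad_(True)

    # Inner step (line 5): theta depends on w through grad_w, so keep the graph.
    grad_w = torch.autograd.grad(loss_fn(w, batch_inner), w, create_graph=True)[0]
    theta = w - alpha * grad_w

    # Outer gradient g = grad f_i(theta), treated as a constant vector afterwards.
    g = torch.autograd.grad(loss_fn(theta, batch_outer), theta,
                            retain_graph=True)[0].detach()

    # Hessian-vector product H g = d/dw <grad f_i(w), g>.
    hvp = torch.autograd.grad(torch.dot(grad_w, g), w)[0]

    # Meta update (line 6): (I - alpha H) g = g - alpha * H g.
    return (w - beta * (g - alpha * hvp)).detach()
```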

II-B Machine Learning Model

We consider the above described MAML-based FL. In detail, we concentrate on the situation where UEs communicate in a synchronous manner, so as to avoid using outdated parameters for global model update and make high-quality refinement in each round. Meanwhile, for each UE, we consider the case that only one step of stochastic gradient descent (SGD) is performed, following the same setting as [11].

For the considered MAML-based FL, our goal is to optimize the initial model using only a few data points at each UE. Hence, we only obtain an estimate of the desired gradient with SGD. Here, the desired gradient $\nabla F_{i}(w)$ on UE $i$ is computed using all data points in its dataset $\mathcal{D}_{i}$, while the estimated gradient $\tilde{\nabla}F_{i}(w)$ on UE $i$ is computed using SGD with a sampled dataset $\mathcal{D}_{i}^{(\cdot)}\subseteq\mathcal{D}_{i}$. Note that the superscript $(\cdot)$ indicates that different sampled datasets are used to estimate the gradients and the Hessian involved in $\nabla F_{i}(w)$. Meanwhile, the sampled data size is $D_{i}^{(\cdot)}=|\mathcal{D}_{i}^{(\cdot)}|$.

More specifically, in order to solve (3), each UE $i$ computes the desired gradient in round $k$ as follows:

\nabla F_{i}(w_{k}) = \bigl(I-\alpha\nabla^{2}f_{i}(w_{k})\bigr)\nabla f_{i}\bigl(w_{k}-\alpha\nabla f_{i}(w_{k})\bigr). \qquad (5)

At every round, computing the gradient $\nabla f_{i}(w_{k})$ using all data points of UE $i$ is often computationally expensive. Therefore, we take a subset $\mathcal{D}_{i}^{\text{in}}$ of $\mathcal{D}_{i}$ to obtain an unbiased estimate $\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}})$ of $\nabla f_{i}(w_{k})$, which is given by

\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}}) = \frac{1}{D_{i}^{\text{in}}}\sum_{(x,y)\in\mathcal{D}_{i}^{\text{in}}}\nabla l_{i}(w_{k};x,y). \qquad (6)

Similarly, the outer gradient $\nabla f_{i}(\theta_{k}^{i})$ and the Hessian $\nabla^{2}f_{i}(w_{k})$ in (5) can be replaced by their unbiased estimates $\tilde{\nabla}f_{i}(\theta^{i}_{k};\mathcal{D}_{i}^{\text{o}})$ and $\tilde{\nabla}^{2}f_{i}(w_{k};\mathcal{D}_{i}^{\text{h}})$, respectively. Here, $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$ are also sampled from $\mathcal{D}_{i}$. Therefore, using SGD, we finally obtain an estimated local gradient $\tilde{\nabla}F_{i}(w_{k})$ on UE $i$ in round $k$, which is given by

\tilde{\nabla}F_{i}(w_{k}) = \bigl(I-\alpha\tilde{\nabla}^{2}f_{i}(w_{k};\mathcal{D}_{i}^{\text{h}})\bigr)\tilde{\nabla}f_{i}\bigl(w_{k}-\alpha\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}});\mathcal{D}_{i}^{\text{o}}\bigr). \qquad (7)

It is worth noting that $\tilde{\nabla}F_{i}(w_{k})$ is a biased estimator of $\nabla F_{i}(w_{k})$. This is because the stochastic gradient $\tilde{\nabla}f_{i}(w_{k}-\alpha\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}});\mathcal{D}_{i}^{\text{o}})$ contains another stochastic gradient $\tilde{\nabla}f_{i}(w_{k};\mathcal{D}_{i}^{\text{in}})$ inside. Hence, to improve the estimation accuracy, the dataset $\mathcal{D}_{i}^{\text{in}}$ used for the inner gradient update is independent of the sampled datasets $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$ used for the outer gradient and Hessian updates, respectively. Meanwhile, in this paper we assume $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$ are also independent of each other.

II-C Communication and Computation Model

When exploring MAML-based FL in realistic mobile edge networks, the communication and computation models should be captured carefully. In particular, we consider that UEs access the BS through a channel partitioning scheme, such as orthogonal frequency division multiple access (OFDMA). In order to successfully upload the local update to the BS, two conditions need to be satisfied: (1) the UE is selected, and (2) the transmitted local update is successfully decoded. In this respect, we first introduce $s_{k}^{i}\in\{0,1\}$ as a selection indicator, where $s_{k}^{i}=1$ indicates the event that UE $i$ is chosen in round $k$, and $s_{k}^{i}=0$ otherwise. Next, we characterize the transmission quality of the wireless links. For the signals transmitted from UE $i$, the SNR received at the BS can be expressed as $\xi_{k}^{i}=\frac{p_{k}^{i}h_{k}^{i}\|c_{i}\|^{-\kappa}}{N_{0}}$, where $p_{k}^{i}$ is the transmit power of UE $i$ during round $k$ and $\kappa$ is the path loss exponent. $h_{k}^{i}\|c_{i}\|^{-\kappa}$ is the channel gain between UE $i$ and the BS, with $c_{i}$ being the distance between UE $i$ and the BS and $h_{k}^{i}$ being the small-scale channel coefficient. $N_{0}$ is the noise power spectral density. In order for the BS to successfully decode the local update from UE $i$, the received SNR is required to exceed a decoding threshold $\phi$, i.e., $\xi_{k}^{i}>\phi$. Assuming that the small-scale channel coefficients across communication rounds follow a Rayleigh distribution, then according to [22], the update success transmission probability, defined as $q_{k}^{i}=\mathbb{P}(s_{k}^{i}=1,\xi_{k}^{i}>\phi)$, can be estimated as follows:

q_{k}^{i} = \mathbb{P}(s_{k}^{i}=1,\ \xi_{k}^{i}>\phi) \approx \frac{1/n}{1+\nu(p_{k}^{i})}, \qquad (8)

where $\nu(p_{k}^{i})=\frac{N_{0}\phi}{p_{k}^{i}}$. Then the achievable uplink rate $r_{k,i}$ of UE $i$ transmitting its local update to the BS in round $k$ is given by

r_{k,i} = B\log_{2}(1+\xi_{k}^{i}), \qquad (9)

where $B$ is the bandwidth allocated to each UE. Based on $r_{k,i}$, the uplink transmission delay of UE $i$ in round $k$ can be specified as follows:

t_{k,i}^{\text{com}} = \frac{Z}{r_{k,i}}, \qquad (10)

where $Z$ is the size of $w_{k}$ in number of bits. Since the transmit power of the BS is much higher than that of the UEs, the downlink transmission delay is much smaller than the uplink transmission delay. Meanwhile, we care more about the transmit power allocation on individual UEs than on the BS, so here we ignore the downlink transmission delay for simplicity.
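As a rough illustration of (8)–(10), the following sketch computes the received SNR, the approximate update success probability, the uplink rate, and the resulting uplink delay for one UE in one round. The function and parameter names are chosen for illustration only.

```python
import math

def uplink_model(p, h, dist, kappa, N0, phi, n, B, Z):
    """SNR, success probability (8), uplink rate (9), and delay (10) for one UE."""
    snr = p * h * dist ** (-kappa) / N0          # xi_k^i
    q = (1.0 / n) / (1.0 + N0 * phi / p)         # approximate success probability
    rate = B * math.log2(1.0 + snr)              # achievable uplink rate (bit/s)
    delay = Z / rate                             # uplink transmission delay (s)
    decoded = snr > phi                          # BS can decode iff SNR exceeds phi
    return snr, q, rate, delay, decoded
```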

Further, we calculate the computation time of each UE, which is consumed by computing the local update. Given the CPU-cycle frequency $\vartheta_{i}$ of UE $i$, the computation time of UE $i$ is expressed as

t_{k,i}^{\text{cmp}} = \frac{c_{i}d_{k}^{i}}{\vartheta_{i}}. \qquad (11)

In (11), $c_{i}$ denotes the number of CPU cycles for UE $i$ to process one data sample, and $d_{k}^{i}=D_{k,i}^{\text{in}}+D_{k,i}^{\text{o}}+D_{k,i}^{\text{h}}$ denotes the sampled data size of UE $i$ in round $k$.

In terms of $t_{k,i}^{\text{com}}$ and $t_{k,i}^{\text{cmp}}$, we then give the energy consumption of each UE $i$ in round $k$, which consists of two parts: (1) the energy for transmitting the local update and (2) the energy for computing the local update. Let $e_{k}^{i}$ denote the energy consumption of UE $i$ in round $k$; then $e_{k}^{i}$ can be computed as follows [23, 24]:

e_{k}^{i} = \frac{\varsigma}{2}\vartheta_{i}^{3}t_{k,i}^{\text{cmp}} + p_{k}^{i}t_{k,i}^{\text{com}}, \qquad (12)

where $\frac{\varsigma}{2}$ is the effective capacitance coefficient of UE $i$'s computing chipset. From (12), we observe that in round $k$, for each UE $i$, both the sampled data size $d_{k}^{i}$ and the transmit power $p_{k}^{i}$ have a significant impact on its energy consumption. This observation motivates us to jointly consider these two variables when quantifying the benefit MAML brings to FL in mobile edge networks.
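A minimal sketch of the per-round computation time (11) and energy consumption (12) for one UE follows; the names `cycles_per_sample`, `freq`, and `t_com` are illustrative placeholders for $c_i$, $\vartheta_i$, and $t_{k,i}^{\text{com}}$.

```python
def per_round_cost(cycles_per_sample, d, freq, varsigma, p, t_com):
    """Computation time (11) and total energy (12) of one UE in one round."""
    t_cmp = cycles_per_sample * d / freq                     # local computation time
    energy = 0.5 * varsigma * freq ** 3 * t_cmp + p * t_com  # computation + transmission energy
    return t_cmp, energy
```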

II-D Problem Formulation

In order to quantify the benefit MAML brings to FL in mobile edge networks, we focus on learning time minimization under the constraints of model accuracy and UEs' energy consumption. In particular, the learning time is the duration over all $K$ communication rounds, in which the duration of round $k$ is determined by the slowest UE as follows:

T_{k}^{\text{round}} = \max_{i\in\mathcal{A}_{k}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\}. \qquad (13)

Note that we can replace $i\in\mathcal{A}_{k}$ with $i\in\mathcal{U}$ in $T_{k}^{\text{round}}$, that is, $T_{k}^{\text{round}}=\max_{i\in\mathcal{U}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\}$. This is because a UE that is not chosen in $\mathcal{A}_{k}$ has $T_{k}^{i}=0$, which does not affect the value of $T_{k}^{\text{round}}$. Let $\mathbf{d}\triangleq\{\mathbf{d}^{1},\dots,\mathbf{d}^{n}\}$ and $\mathbf{p}\triangleq\{\mathbf{p}^{1},\dots,\mathbf{p}^{n}\}$ denote the UEs' sampled data size vector and transmit power vector, respectively, where $\mathbf{d}^{i}\triangleq\{d_{1}^{i},\dots,d_{K}^{i}\}$ and $\mathbf{p}^{i}\triangleq\{p_{1}^{i},\dots,p_{K}^{i}\}$. Then, the problem formulation is given by

\min_{\mathbf{d},\mathbf{p},K} \quad \sum_{k=0}^{K-1}\max_{i\in\mathcal{U}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\} \qquad (P1)
s.t. \quad \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(w_{k})\|^{2}] \leq \epsilon, \qquad (C1.1)
\qquad\ 0 \leq e_{k}^{i} \leq E_{\max}, \quad \forall i\in\mathcal{U},\ k=0,\dots,K-1, \qquad (C1.2)
\qquad\ 0 \leq p_{k}^{i} \leq P_{\max}, \quad \forall i\in\mathcal{U},\ k=0,\dots,K-1, \qquad (C1.3)
\qquad\ 0 \leq d_{k}^{i} \leq D_{i}, \quad \forall i\in\mathcal{U},\ k=0,\dots,K-1. \qquad (C1.4)

In problem (P1), we optimize not only $\mathbf{d}$ and $\mathbf{p}$, but also the total number of rounds $K$. The underlying reason is that both $\mathbf{d}$ and $\mathbf{p}$ affect the duration of each round, thereby impacting the number of rounds $K$ needed to achieve a certain model accuracy $\epsilon$. Besides, (C1.1) characterizes an $\epsilon$-approximate convergence performance. (C1.2) limits the energy consumption of UE $i$ in round $k$ to be no larger than the predefined maximum value $E_{\max}$. (C1.3) bounds the transmit power of UE $i$ in round $k$ by $P_{\max}$. (C1.4) is the sampled data size constraint: the sampled data size of a UE in round $k$ is no larger than the number of data points generated by that UE. The solution to problem (P1) can be exploited to design an optimized MAML-based FL algorithm, which is implemented iteratively to update the global model parameter $w$. According to the communication and computation model, only the local updates from selected UEs that are successfully decoded by the BS contribute to updating the global model parameter. That is, in round $k$, we have

w_{k+1} = \frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}} \sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}\bigl(w_{k}-\beta\nabla_{w_{k}}F_{i}(\theta_{k}^{i})\bigr), \qquad (15)

where $\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}=1$ if the event $\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}$ is true, and it equals zero otherwise.
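For concreteness, a minimal sketch of the aggregation rule (15) follows: the BS averages only the local updates that were both selected and successfully decoded. The argument names are illustrative.

```python
def aggregate(local_updates, selected, snrs, phi):
    """Global update (15): average the local models w_{k+1}^i over the UEs
    that were selected (s_k^i = 1) and whose SNR exceeded the threshold phi."""
    decoded = [w for w, s, snr in zip(local_updates, selected, snrs)
               if s == 1 and snr > phi]
    if not decoded:
        return None  # no successfully decoded update in this round
    return sum(decoded) / len(decoded)
```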

Based on this update rule, (C1.1) related to the MAML-based FL can be analytically analysed to make the relationship between the decision variables and the model accuracy ϵ\epsilon explicit, thereby facilitating solving problem (P1). In this regard, the convergence analysis of the MAML-based FL is given in the following section.

III Convergence Analysis of MAML-based FL

III-A Preliminaries

For generality, we concentrate on the non-convex setting for the loss functions and aim to find an $\epsilon$-approximate first-order stationary point (FOSP) for the loss function minimization problem (3) in the MAML-based FL, which is defined as follows.

Definition 1.

A random vector $w_{\epsilon}\in\mathbb{R}^{m}$ is called an $\epsilon$-FOSP if it satisfies $\mathbb{E}[\|\nabla F(w_{\epsilon})\|^{2}]\leq\epsilon$.

In terms of this Definition, we then elaborate on the assumptions, convergence bounds, and discussions on the convergence analysis, respectively, from which we can reformulate (C1.1) in problem (P1) as an explicit constraint with respect to the optimization variables.

Without loss of generality, the assumptions used for the convergence analysis of the MAML-based FL algorithm are consistent with those of Per-FedAvg [21, 14] and are given in the following.

Assumption 1.

For every UE $i\in\{1,\dots,n\}$, $f_{i}(w)$ is twice continuously differentiable. Its gradient $\nabla f_{i}(w)$ is $L$-Lipschitz continuous, that is,

\|\nabla f_{i}(w)-\nabla f_{i}(u)\| \leq L\|w-u\|, \qquad \forall w,u\in\mathbb{R}^{m}. \qquad (16)
Assumption 2.

For every UE $i\in\{1,\dots,n\}$, the Hessian matrix of $f_{i}(w)$ is $\rho$-Lipschitz continuous, that is,

\|\nabla^{2}f_{i}(w)-\nabla^{2}f_{i}(u)\| \leq \rho\|w-u\|, \qquad \forall w,u\in\mathbb{R}^{m}. \qquad (17)
Assumption 3.

For any $w\in\mathbb{R}^{m}$, $\nabla l_{i}(w;x,y)$ and $\nabla^{2}l_{i}(w;x,y)$, computed w.r.t. a single data point $(x,y)\in\mathcal{X}_{i}\times\mathcal{Y}_{i}$, have bounded variance, that is,

\mathbb{E}_{(x,y)\sim p_{i}}\bigl[\|\nabla l_{i}(w;x,y)-\nabla f_{i}(w)\|^{2}\bigr] \leq \sigma^{2}_{G},
\mathbb{E}_{(x,y)\sim p_{i}}\bigl[\|\nabla^{2}l_{i}(w;x,y)-\nabla^{2}f_{i}(w)\|^{2}\bigr] \leq \sigma^{2}_{H}. \qquad (18)
Assumption 4.

For any $w\in\mathbb{R}^{m}$, the gradient and Hessian matrix of the local loss function $f_{i}(w)$ and the average loss function $f(w)=(1/n)\sum_{i=1}^{n}f_{i}(w)$ have bounded variance, that is,

\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_{i}(w)-\nabla f(w)\|^{2} \leq \gamma_{G}^{2},
\frac{1}{n}\sum_{i=1}^{n}\|\nabla^{2}f_{i}(w)-\nabla^{2}f(w)\|^{2} \leq \gamma_{H}^{2}. \qquad (19)

Before the convergence analysis, we first introduce three lemmas, inherited from [14, 21], quantifying the smoothness of $F_{i}(w)$ and $F(w)$, the deviation between $\nabla F_{i}(w)$ and its estimate $\tilde{\nabla}F_{i}(w)$, and the deviation between $\nabla F_{i}(w)$ and $\nabla F(w)$, respectively.

Lemma 1.

If Assumptions 1-3 hold, then $F_{i}(w)$ is $L_{F}$-Lipschitz continuous with $L_{F}:=4L+\alpha\rho B$. As a result, the average function $F(w)=(1/n)\sum_{i=1}^{n}F_{i}(w)$ is also $L_{F}$-Lipschitz continuous.

Lemma 2.

Recall the gradient estimate $\tilde{\nabla}F_{i}(w)$ shown in (7), which is computed using $\mathcal{D}_{i}^{\text{in}}$, $\mathcal{D}_{i}^{\text{o}}$ and $\mathcal{D}_{i}^{\text{h}}$, independently sampled datasets with sizes $D_{i}^{\text{in}}$, $D_{i}^{\text{o}}$ and $D_{i}^{\text{h}}$, respectively. If the conditions in Assumptions 1-3 hold, then for any $\alpha\in(0,1/L]$ and $w\in\mathbb{R}^{m}$, we have

\left\|\mathbb{E}\left[\tilde{\nabla}F_{i}(w)-\nabla F_{i}(w)\right]\right\| \leq \frac{2\alpha L\sigma_{G}}{\sqrt{D^{\text{in}}}}, \qquad (20)
\mathbb{E}\left[\|\tilde{\nabla}F_{i}(w)-\nabla F_{i}(w)\|^{2}\right] \leq \sigma_{F}^{2}, \qquad (21)

where $\sigma_{F}^{2}$ is defined as

\sigma_{F}^{2} := 12\left[B^{2}+\sigma_{G}^{2}\left[\frac{1}{D^{\text{o}}}+\frac{(\alpha L)^{2}}{D^{\text{in}}}\right]\right]\left[1+\sigma_{H}^{2}\frac{\alpha^{2}}{4D^{\text{h}}}\right]-12B^{2}, \qquad (22)

with $D^{\text{in}}=\max_{i\in\mathcal{U}}D_{i}^{\text{in}}$, $D^{\text{o}}=\max_{i\in\mathcal{U}}D_{i}^{\text{o}}$ and $D^{\text{h}}=\max_{i\in\mathcal{U}}D_{i}^{\text{h}}$.

Lemma 3.

Given the loss function $F_{i}(w)$ shown in (4) and $\alpha\in(0,1/L]$, if the conditions in Assumptions 1, 2, and 4 hold, then for any $w\in\mathbb{R}^{m}$, we have

\frac{1}{n}\sum_{i=1}^{n}\|\nabla F_{i}(w)-\nabla F(w)\|^{2} \leq \gamma_{F}^{2}, \qquad (23)

where $\gamma_{F}^{2}$ is defined as

\gamma_{F}^{2} := 3B^{2}\alpha^{2}\gamma_{H}^{2}+192\gamma_{G}^{2}, \qquad (24)

with $\nabla F(w)=(1/n)\sum_{i=1}^{n}\nabla F_{i}(w)$.

III-B Analysis of Convergence Bound

Let $U_{i}=\max_{k=1,\dots,K}U_{k}^{i}$, where $U_{k}^{i}=\frac{\mathbb{P}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}{\sum_{i=1}^{n}\mathbb{P}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}=\frac{q_{k}^{i}}{\sum_{i=1}^{n}q_{k}^{i}}$ denotes the normalized update success probability of UE $i$ in round $k$. Then, with Lemmas 1, 2 and 3, the expected convergence result of the MAML-based FL within the general mobile edge network described in Section II can be obtained by the following theorem.

Theorem 1.

Given the transmit power vector $\mathbf{p}$, the sampled data size vector $\mathbf{d}$, the number of communication rounds $K$, the optimal global loss $F(w_{\epsilon})$, the normalized update success probability $U_{i}$, and $\alpha\in(0,1/L]$, we have the following FOSP condition:

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(\bar{w}_{k})\|^{2}] \leq \frac{4(F(w_{0})-F(w_{\epsilon}))}{\beta K} + \left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}+\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}\right), \qquad (25)

where $\bar{w}_{k}$ is the average of the local updates $w_{k}^{i}$ that are successfully decoded by the BS in round $k$. That is, we have

\bar{w}_{k} = \frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}\,w_{k}^{i}. \qquad (26)
Proof:

See the Appendix. ∎

Unlike the convergence guarantee of Per-FedAvg in [14], which is characterized by $K$ and $\mathbf{d}$, the convergence guarantee obtained from Theorem 1 is characterized by $K$, $\mathbf{d}$ and $U_{i}$, where $U_{i}$ is a function of the transmit power $\mathbf{p}^{i}$. That is, the convergence bound of the proposed MAML-based FL in our paper is described in terms of $K$, $\mathbf{d}$ and $\mathbf{p}$. Therefore, our convergence analysis can combine the FL hyperparameters with the resource allocation in mobile edge networks.

III-C Discussions

From Theorem 1, we are able to characterize $K$, $\mathbf{d}$ and $\mathbf{p}$ with respect to the predefined model accuracy $\epsilon$. According to Theorem 1 and (C1.1) in problem (P1), it is desired that the right-hand side of (25) be equal to or smaller than $\epsilon$. Consequently, we present the following corollary.

Corollary 1.

Suppose the conditions in Theorem 1 are satisfied. If we set the number of total communication rounds as $K=\mathcal{O}(\frac{1}{\epsilon^{3}})$, the global learning rate as $\beta=\mathcal{O}(\epsilon^{2})$, and the number of data samples as $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$, then we find an $\epsilon$-FOSP for the MAML-based FL in problem (P1).

Proof:

Setting $K=\mathcal{O}(\frac{1}{\epsilon^{3}})$ and $\beta=\mathcal{O}(\epsilon^{2})$ ensures that the order of magnitude of the first term on the right-hand side of (25) equals $\mathcal{O}(\epsilon)$.

Then we examine the second term. Since $U_{i}$ is the normalized update success probability over $n$ UEs, its order of magnitude is determined by $n$ rather than $\epsilon$, i.e., $U_{i}=\mathcal{O}(\frac{1}{n})$. Therefore, it is natural to focus on the term $\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}+\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$ for an $\epsilon$-FOSP. By setting $D_{i}^{\text{in}}=D_{i}^{\text{o}}=D_{i}^{\text{h}}=\mathcal{O}(\frac{1}{\epsilon})$ (i.e., $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$), the term $\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$ dominates the value of $\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}+\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$. This is because, in this case, $\beta L_{F}\sigma_{F}^{2}+\beta L_{F}\gamma_{F}^{2}=\mathcal{O}(\epsilon^{4})$, which is a higher-order infinitesimal of $\epsilon$. Finally, combining the magnitudes of the first and second terms, we conclude that the joint choice of $K=\mathcal{O}(\frac{1}{\epsilon^{3}})$, $\beta=\mathcal{O}(\epsilon^{2})$, and $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$ yields an $\epsilon$-FOSP for the MAML-based FL. ∎
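The following sketch turns Corollary 1 into concrete hyperparameter choices for a target accuracy $\epsilon$. The constant factors (set to 1 here) are illustrative assumptions; Corollary 1 only fixes the orders of magnitude.

```python
import math

def corollary1_hyperparameters(epsilon):
    """Orders of magnitude from Corollary 1: K = O(1/eps^3), beta = O(eps^2),
    and per-UE sampled data size d_i = O(1/eps), with unit constants assumed."""
    K = math.ceil(1.0 / epsilon ** 3)      # total communication rounds
    beta = epsilon ** 2                    # global (BS) learning rate
    d_i = math.ceil(1.0 / epsilon)         # sampled data size per UE per round
    return K, beta, d_i
```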

Remark 1.

It is worth noting that the result in Corollary 1 provides one possible choice of the hyperparameters. Other options may also be valid, as long as they make the order of magnitude of the hyperparameters (w.r.t. the model accuracy $\epsilon$) satisfy the condition in Theorem 1.

Guided by Corollary 1, we can recast (C1.1) in problem (P1) as a series of constraints that give explicit relationships between the optimization variables and the model accuracy $\epsilon$. In this way, we then solve problem (P1) with high efficiency in the next section.

IV Automated Federated Learning (AutoFL)

In this section, we first resort to Theorem 1 and Corollary 1 for problem (P1) decomposition, and then solve the resultant subproblems respectively. Based on the obtained solutions, we finally propose Automated Federated Learning (AutoFL) algorithm, by which the benefit MAML brings to FL can be quantified in mobile edge networks.

IV-A Problem Decoupling

The decomposition of problem (P1) follows the primal decomposition approach [25]. Specifically, the master problem is optimized with respect to the number of communication rounds $K$ and is given as follows:

\min_{K} \quad \sum_{k=0}^{K-1}\max_{i\in\mathcal{U}}\{t_{k,i}^{\text{cmp}}+t_{k,i}^{\text{com}}\} \qquad (P2)
s.t. \quad \frac{4(F(w_{0})-F(w_{\epsilon}))}{\beta K} \leq \epsilon, \qquad K\in\mathbb{N}^{+}. \qquad (C2.1)

Note that we refer to Theorem 1 and use a contracted version of (C1.1), i.e., (C2.1), as the constraint in problem (P2) for tractability. This contraction replaces $\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(\bar{w}_{k})\|^{2}]$ with its upper bound without the second term. The reason we drop the second term is that we only consider $K$ as the decision variable in (P2), and the second term of the upper bound can be regarded as a constant and thus removed as long as its value is no larger than $\epsilon$. Meanwhile, the above contraction shrinks the feasible region of $K$. Therefore, as long as (C2.1) is satisfied, (C1.1) must be satisfied. Moreover, if there exists an optimal solution $K$ to problem (P2), it is also feasible for the original problem (P1). We can regard the optimal solution to (P2) as an optimal approximation to (P1).

The reason that we consider $K$ as a decision variable is the decoupling of (P1). Once $K$ is determined and set to a constant in the following decoupling, the original objective of minimizing the overall learning time over $K$ rounds can be regarded as minimizing the learning time in each round. As such, problem (P2), which has a slave problem with respect to the sampled data size $\mathbf{d}$ and transmit power $\mathbf{p}$, can be regarded as an approximation of problem (P1).

From Theorem 1, we find that $\mathbf{d}$ and $\mathbf{p}$ are coupled even after (C1.1) is replaced by the right-hand side of (25). In this regard, we optimize $\mathbf{d}$ and $\mathbf{p}$ in the slave problem with the coordinate descent method [20]. That is, the slave problem is first solved with $\mathbf{d}$ as variables and $\mathbf{p}$ as constants, and then solved with $\mathbf{p}$ as variables and $\mathbf{d}$ as constants. This procedure is repeated in an iterative manner. In this way, the $\mathbf{d}$- and $\mathbf{p}$-related slave problems can be further decomposed among rounds with the assistance of Corollary 1. Specifically, in any round $k$, the sampled data size of UE $i$ is optimized by solving problem (P3):

\min_{d_{k}^{i}} \quad \max_{i}\left\{\frac{c_{i}}{\vartheta_{i}}d_{k}^{i}+t_{k,i}^{\text{com}}\right\} \qquad (P3)
s.t. \quad d_{k}^{i} \geq \frac{1}{\epsilon}, \quad \forall i\in\mathcal{U}, \qquad (C3.1)
\qquad\ \frac{\varsigma}{2}c_{i}d_{k}^{i}\vartheta_{i}^{2} + p_{k}^{i}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} \leq E_{\max}, \quad \forall i\in\mathcal{U}, \qquad (C3.2)
\qquad\ 0 \leq d_{k}^{i} \leq D_{i}, \quad \forall i\in\mathcal{U}. \qquad (C3.3)

(C3.1) is obtained from Corollary 1, which shows that $d_{i}=\mathcal{O}(\frac{1}{\epsilon})$ yields an $\epsilon$-FOSP. (C3.1) uses $d_{k}^{i}\geq\frac{1}{\epsilon}$ instead to indicate that as long as $d_{k}^{i}$ is not smaller than $\frac{1}{\epsilon}$, an $\epsilon$-FOSP is guaranteed. This is a reasonable replacement (i.e., from $d_{k}^{i}=\mathcal{O}(\frac{1}{\epsilon})$ to $d_{k}^{i}\geq\frac{1}{\epsilon}$) in the sense that the larger the sampled data size is, the higher the accuracy AutoFL achieves, thereby leading to a smaller $\epsilon$. Note that (C3.1) is also a contraction of (C1.1), and the feasible domain of $d_{k}^{i}$ is also shrunk. That is, the optimal $d_{k}^{i*}$, as long as it exists for (P3), is also feasible for problem (P1) and can be regarded as an optimal approximation to (P1). (C3.2) is the energy consumption constraint on each UE, and (C3.3) indicates that the sampled data size of UE $i$ is no greater than the number of data points generated by UE $i$. Furthermore, solving the following problem determines the transmit power of UE $i$ in round $k$:

\min_{p_{k}^{i}} \quad \max_{i}\left\{t_{k,i}^{\text{cmp}}+\frac{Z}{B\log_{2}(1+\xi_{k}^{i})}\right\} \qquad (P4)
s.t. \quad \left(\sum_{i=1}^{n}{U_{k}^{i}}^{2}\right)\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}} \leq \epsilon, \qquad (C4.1)
\qquad\ \frac{\varsigma}{2}c_{i}d_{k}^{i}\vartheta_{i}^{2} + p_{k}^{i}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} \leq E_{\max}, \quad \forall i\in\mathcal{U}, \qquad (C4.2)
\qquad\ 0 \leq p_{k}^{i} \leq P_{\max}, \quad \forall i\in\mathcal{U}, \qquad (C4.3)

where (C4.1) is obtained from Corollary 1, based on the fact that $(\sum_{i=1}^{n}{U_{k}^{i}}^{2})\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}$ dominates the value of the second term on the right-hand side of (25) and therefore must be controlled to guarantee an $\epsilon$-FOSP for the MAML-based FL. Here we replace $U_{i}$ with $U_{k}^{i}$ to show that (C4.1) should be satisfied in each round $k$. The reason that we replace $(\sum_{i=1}^{n}{U_{k}^{i}}^{2})\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}=\mathcal{O}(\epsilon)$ with $(\sum_{i=1}^{n}{U_{k}^{i}}^{2})\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D^{\text{in}}}\leq\epsilon$ is to indicate that, as long as (C4.1) is satisfied, an $\epsilon$-FOSP is guaranteed. Note that (C4.1) is also a contraction of (C1.1). That is, as long as there exists an optimal solution to (P4), this solution is also feasible and close to the optimal solution to (P1). We can regard the optimal solution to (P4) as an optimal approximation to (P1). (C4.2) and (C4.3) are the energy constraint and transmit power constraint on each UE, respectively.

Based on the above decomposition, we can solve the original problem (P1) by solving problem (P2), in which problems (P3) and (P4) are nested and addressed with the coordinate descent method. This decoupling only provides approximately optimal results. However, the evaluation results show that, with the approximate solutions obtained from (P2), (P3) and (P4), the proposed MAML-based FL algorithm always outperforms Per-FedAvg in learning time and convergence performance. Next, we deal with problems (P2), (P3), and (P4), respectively.

IV-B Problem Solution

IV-B1 Communication Round Optimization

Once the transmit power $\mathbf{p}$ and the sampled data size $\mathbf{d}$ are obtained from the slave problem, the objective of the master problem (P2) is monotonically increasing with respect to $K$. Therefore, to minimize the total training time, the optimal value of $K$ should be its lower bound:

K^{*} = \frac{4(F(w_{0})-F(w_{\epsilon}))}{\beta\epsilon}. \qquad (30)

Note that we do not use the formulation of $K^{*}$ to predict the actual optimal number of global communication rounds. It is not practical to determine this value in advance because, in practice, the optimal number of communication rounds can be easily observed once the training loss starts to converge. Many factors can affect the actual value of $K^{*}$, even the version of PyTorch/TensorFlow used in the experiments. Therefore, the theoretical formulation of $K^{*}$ in (30), with the initial global model $w_{0}$, the global optimal loss $F(w_{\epsilon})$, the global learning rate $\beta$, and the model accuracy $\epsilon$ to be decided, is only used as a guidance indicating the order of magnitude of the practical optimal number of communication rounds.
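As a guidance-only calculation, the sketch below evaluates (30) for given values of the initial loss, the (estimated) optimal loss, the global learning rate, and the target accuracy; all arguments are assumptions supplied by the user.

```python
import math

def required_rounds(initial_loss, optimal_loss, beta, epsilon):
    """Lower bound K* in (30); only indicative of the order of magnitude of
    the number of communication rounds actually needed."""
    return math.ceil(4.0 * (initial_loss - optimal_loss) / (beta * epsilon))
```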

IV-B2 Data Sample Size Optimization

It is easy to observe from problem (P3) that for each UE $i$, the optimal $d_{k}^{i}$ lies at the lower bound $\frac{1}{\epsilon}$. However, whether this lower bound can be reached depends on the relationship between $d_{k}^{i}$'s lower bound $\frac{1}{\epsilon}$ and the upper bounds defined by (C3.2) and (C3.3). Specifically, the upper bound defined by (C3.2) is $D_{i}^{\text{p}}:=\frac{2E_{\max}}{\varsigma c_{i}\vartheta_{i}^{2}}-\frac{2p_{k}^{i}Z}{\varsigma c_{i}\vartheta_{i}^{2}B\log_{2}(1+\xi_{k}^{i})}$, while the upper bound defined by (C3.3) is $D_{i}$. Consequently, for each UE $i$, we need to consider two cases, $D_{i}\geq\frac{1}{\epsilon}$ and $D_{i}<\frac{1}{\epsilon}$, within each of which we further distinguish two sub-cases, $D_{i}^{\text{p}}>\frac{1}{\epsilon}$ and $D_{i}^{\text{p}}\leq\frac{1}{\epsilon}$. That is, we discuss the optimal solution of $d_{k}^{i}$ for problem (P3) as follows.

Case 1: In this case, UE $i$ generates sufficient data points, no fewer than the lower bound required for model accuracy $\epsilon$. That is, for UE $i$, we have

0 < \frac{1}{\epsilon} \leq D_{i}. \qquad (31)

Under this case, we need to further discuss the relationship between $D^{\text{p}}_{i}$ and $\frac{1}{\epsilon}$:

  • When $D_{i}^{\text{p}}>\frac{1}{\epsilon}$, the lower bound $\frac{1}{\epsilon}$ of $d_{k}^{i}$ can be reached with enough data points to achieve the model accuracy $\epsilon$. Therefore, the optimal sampled data size is $d_{k}^{i}=\frac{1}{\epsilon}$.

  • When $D_{i}^{\text{p}}\leq\frac{1}{\epsilon}$, the lower bound $\frac{1}{\epsilon}$ of $d_{k}^{i}$ cannot be reached. According to (25) in Theorem 1, the larger $d_{i}$ is, the higher the model accuracy that can be achieved. Therefore, the optimal value of $d_{k}^{i}$ should be equal to $D_{i}^{\text{p}}$.

Case 2: In this case, UE $i$ generates insufficient data points, fewer than the lower bound required for model accuracy $\epsilon$. That is, for UE $i$, we have

0 \leq D_{i} < \frac{1}{\epsilon}. \qquad (32)

Under this case, we also need to further discuss the relationship between $D^{\text{p}}_{i}$ and $\frac{1}{\epsilon}$:

  • When $D^{\text{p}}_{i}>\frac{1}{\epsilon}$, the optimal value of $d_{k}^{i}$ is $D_{i}$;

  • When $D^{\text{p}}_{i}\leq\frac{1}{\epsilon}$, the optimal value of $d_{k}^{i}$ is $\min\{D_{i},D_{i}^{\text{p}}\}$.

The above process of computing $d_{k}^{i}$ for problem (P3) is summarized as the SampledDataSize algorithm, which is omitted due to space limitations; a sketch of the case analysis is given below.
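The following sketch is our reading of the case analysis above (the paper omits the SampledDataSize pseudo-code); it should be taken as an assumption-laden illustration rather than the authors' exact routine, and all argument names are placeholders.

```python
import math

def sampled_data_size(epsilon, D_i, E_max, varsigma, c_i, freq, p, Z, B, snr):
    """Choose d_k^i following the Case 1 / Case 2 analysis of problem (P3)."""
    d_lower = 1.0 / epsilon                                  # lower bound from (C3.1)
    # Upper bound D_i^p implied by the energy constraint (C3.2).
    D_p = (2.0 * E_max / (varsigma * c_i * freq ** 2)
           - 2.0 * p * Z / (varsigma * c_i * freq ** 2 * B * math.log2(1.0 + snr)))
    if D_i >= d_lower:                                       # Case 1: enough local data
        d = d_lower if D_p > d_lower else D_p
    else:                                                    # Case 2: insufficient local data
        d = D_i if D_p > d_lower else min(D_i, D_p)
    return max(0.0, min(d, D_i))                             # keep within (C3.3)
```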

IV-B3 Transmit Power Optimization

Finally, we turn to solving problem (P4), where the transmit power $p_{k}^{i}$ of UE $i$ in round $k$ is optimized with fixed $d_{k}^{i}$. To this end, we first convert the problem into a more concise form. Given that $U_{k}^{i}$ denotes the normalized update success probability, we have $U_{k}^{i}\leq 1$. Then the inequality $\sum_{i}^{n}{U_{k}^{i}}^{2}\leq 1$ always holds. Consequently, we can further transform (C4.1) into $\frac{\alpha^{2}L^{2}\sigma_{G}^{2}}{D_{i}^{\text{in}}}\leq\epsilon$. Combining the transformed constraint with (C3.1), (C3.2) and (C3.3), we find that as long as $\alpha^{2}L^{2}\sigma_{G}^{2}\leq\min\{1,D_{i}\epsilon,D_{i}^{\text{p}}\epsilon\}$, (C4.1) can always be satisfied. Therefore, problem (P4) can be further transformed into a new optimization problem without constraint (C4.1).

With the above concise formulation of (P4), we analyze the monotonicity of $t_{k,i}^{\text{com}}$ in the objective function of problem (P4). Specifically, the derivative of $t_{k,i}^{\text{com}}$ can be computed as follows:

\frac{\mathrm{d}}{\mathrm{d}p_{k}^{i}}t_{k,i}^{\text{com}} = \frac{\mathrm{d}}{\mathrm{d}p_{k}^{i}}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} = -\frac{Z h_{k}^{i}\|c_{i}\|^{-\kappa}}{B N_{0}(1+\xi_{k}^{i})[\log_{2}(1+\xi_{k}^{i})]^{2}} < 0, \qquad (33)

which means $t_{k,i}^{\text{com}}$ monotonically decreases with $p_{k}^{i}$.

Further, as for constraint (C4.2), the derivative of its left-hand-side term can be shown to be always positive, as follows:

\frac{\mathrm{d}}{\mathrm{d}p_{k}^{i}}\, p_{k}^{i}\frac{Z}{B\log_{2}(1+\xi_{k}^{i})} = \frac{Z}{B}\left(\frac{1}{\log_{2}(1+\xi_{k}^{i})}-\frac{\xi_{k}^{i}}{[\log_{2}(1+\xi_{k}^{i})]^{2}(1+\xi_{k}^{i})}\right) > \frac{Z}{B\log_{2}(1+\xi_{k}^{i})}\left(1-\frac{\xi_{k}^{i}}{\frac{\xi_{k}^{i}}{1+\xi_{k}^{i}}(1+\xi_{k}^{i})}\right) = \frac{Z}{B\log_{2}(1+\xi_{k}^{i})}(1-1) = 0, \qquad (34)

where the inequality follows from the fact that $\log_{2}(1+x)>\frac{x}{1+x}$ for $x>0$. Therefore, the left-hand-side term of (C4.2) monotonically increases with $p_{k}^{i}$. This means (C4.2) defines another upper bound on $p_{k}^{i}$, just as $P_{\max}$ does in (C4.3).

From the above analysis, the optimal solution to (P4) lies at the upper bound of $p_{k}^{i}$, which is defined by (C4.2) or (C4.3). This process of computing $p_{k}^{i}$ can be summarized as the PowerComputation algorithm, which is omitted due to space limitations; a sketch is given below.
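Since the left-hand side of (C4.2) is monotone in $p_k^i$, one simple way to locate the upper bound it implies is bisection, capped by $P_{\max}$. The sketch below is our own illustrative implementation (the paper omits the PowerComputation pseudo-code) under the stated assumptions; it also assumes the energy constraint is feasible for some small power.

```python
import math

def power_computation(d, E_max, P_max, varsigma, c_i, freq, Z, B, h, dist, kappa, N0):
    """Pick p_k^i at the upper bound implied by (C4.2) or (C4.3), whichever is tighter."""
    def energy(p):
        snr = p * h * dist ** (-kappa) / N0
        t_com = Z / (B * math.log2(1.0 + snr))
        return 0.5 * varsigma * c_i * d * freq ** 2 + p * t_com   # LHS of (C4.2)

    if energy(P_max) <= E_max:
        return P_max                      # (C4.3) is the binding upper bound
    lo, hi = 1e-9, P_max                  # otherwise bisect (C4.2), which increases in p
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if energy(mid) <= E_max:
            lo = mid
        else:
            hi = mid
    return lo
```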

IV-C AutoFL Implementation

Input: $\phi$, $E_{\max}$, $P_{\max}$, $N_{0}$, $c_{i}$, $B^{\text{up}}$, $\kappa$, $\vartheta_{i}$, $w_{0}$.
1  $p_{-1}^{i} := P_{\max}$;
2  for $k=0,1,\dots,K-1$ do
3      The BS randomly selects a subset of associated UEs $\mathcal{A}_{k}$, where each UE $i\in\mathcal{A}_{k}$ has $s_{k}^{i}=1$;
4      The BS sends $w_{k}$ to all UEs in $\mathcal{A}_{k}$;
5      for $i\in\mathcal{A}_{k}$ do
6          $d_{k}^{i} := \texttt{SampledDataSize}(p_{k-1}^{i})$;
7          $p_{k}^{i} := \texttt{PowerComputation}(d_{k}^{i})$;
8          if $\xi_{k}^{i} > \phi$ then
9              $w_{k+1}^{i} := \texttt{LocalModelTraining}(d_{k}^{i},w_{k})$;
10             UE $i$ sends $w_{k+1}^{i}$ back to the BS;
11         end if
12      end for
13      The BS updates its model using $w_{k+1} := \frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\ \xi_{k}^{i}>\phi\}\,w_{k+1}^{i}$;
14  end for
15  If the model accuracy $\epsilon$ is reached, terminate the algorithm and set $K^{*}:=K$.
Algorithm 2 Automated Federated Learning (AutoFL)

After devising the algorithms to determine the number of communication rounds $K$, the sampled data size $\mathbf{d}$, and the transmit power $\mathbf{p}$, we combine these sub-algorithms to design the AutoFL algorithm, shown in Algorithm 2. Note that, to reduce the computational complexity, we apply the coordinate descent method to problems (P3) and (P4) only once per round. More specifically, given the current transmit power $p_{k-1}^{i}$, problem (P3) with respect to $d_{k}^{i}$ is solved for UE $i\in\mathcal{A}_{k}$ in round $k$ using the SampledDataSize algorithm, as in line 6 of AutoFL. Then, based on the obtained $d_{k}^{i}$, problem (P4) with respect to $p_{k}^{i}$ is addressed for UE $i\in\mathcal{A}_{k}$ in round $k$ with the PowerComputation algorithm, as in line 7 of AutoFL. The result $p_{k}^{i}$ is further used in the next round $k+1$ to compute $d_{k+1}^{i}$, which is in turn used to compute $p_{k+1}^{i}$. This process repeats until the model accuracy $\epsilon$ is achieved. As in line 15 of AutoFL, once the model accuracy is reached, the algorithm is terminated and the required number of rounds $K^{*}$ is output.

V Performance Evaluation

In this section, we evaluate the performance of AutoFL to (1) verify its effectiveness in saving learning time, (2) examine its training loss and accuracy to demonstrate its effectiveness in fast adaptation and convergence, and (3) present its advantages under different network and dataset settings.

V-A Simulation Settings

V-A1 Network settings

TABLE I: System Parameters
Parameter | Value | Parameter | Value | Parameter | Value
$\alpha$ (MNIST) | 0.03 | $B$ | 1 MHz | $E_{\max}$ | 0.003 J
$\beta$ (MNIST) | 0.07 | $\varsigma$ | $10^{-27}$ | $P_{\max}$ | 0.01 W
$\alpha$ (CIFAR-10) | 0.06 | $\kappa$ | 3.8 | $\phi$ | 30 dB
$\beta$ (CIFAR-10) | 0.02 | $\vartheta_{i}$ | $10^{9}$ | $N_{0}$ | $-174$ dBm/Hz

Unless otherwise specified, we consider a mobile edge network that consists of $n=20$ UEs located in a cell of radius $R=200$ m with a BS located at its center. All UEs are assumed to be uniformly distributed in the cell. The Rayleigh distribution parameter of $h_{k}^{i}$ is 40. The other parameters used in the simulations are listed in Table I.

V-A2 Datasets

We evaluate the training performance using two datasets: the MNIST dataset [26] for handwritten digit classification and CIFAR-10 [27]. The network model used for MNIST is a 2-layer NN with a hidden layer of size 100, while the model used for CIFAR-10 is LeNet-5 [28], which has two convolutional layers and three fully connected layers. The datasets are partitioned randomly into 75% for training and 25% for testing. Meanwhile, in the simulations we use both i.i.d and non-i.i.d datasets. To generate i.i.d datasets, the original training set is randomly partitioned into 20 portions and each UE is assigned one portion. As for the non-i.i.d datasets, the original training set is first partitioned into 10 portions according to the labels. Each UE then holds only l of the 10 labels, where l takes a value in {1,…,10}, and is allocated a different local data size in the range [2, 3834]. The value of l reflects the non-i.i.d level of the local datasets: the smaller l is, the higher the non-i.i.d level. Unless otherwise specified, we use l=5 in the evaluation. All experiments are conducted with PyTorch [29] version 1.7.1.
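As an illustration, the label-based non-i.i.d split described above can be generated as follows. This is a sketch under the stated assumptions only: labels is the array of training labels (e.g., obtained from torchvision's MNIST), and the local data sizes are drawn from [2, 3834] as in the text; the exact split used in our experiments may differ in implementation details.

```python
import numpy as np

def non_iid_partition(labels, n_ues=20, l=5, min_size=2, max_size=3834, seed=0):
    """Give each UE samples from only l of the 10 classes, with a randomly
    chosen local data size in [min_size, max_size]."""
    rng = np.random.default_rng(seed)
    by_class = {c: np.flatnonzero(labels == c) for c in range(10)}
    partition = []
    for _ in range(n_ues):
        classes = rng.choice(10, size=l, replace=False)        # the l labels this UE sees
        pool = np.concatenate([by_class[c] for c in classes])
        size = rng.integers(min_size, max_size + 1)
        partition.append(rng.choice(pool, size=min(size, len(pool)), replace=False))
    return partition

# Usage, e.g., with torchvision MNIST labels:
# labels = np.array(torchvision.datasets.MNIST(".", train=True, download=True).targets)
# ue_indices = non_iid_partition(labels, l=5)
```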

Figure 2: Learning time comparisons for (a) i.i.d MNIST, (b) non-i.i.d MNIST, and (c) non-i.i.d CIFAR-10 datasets. For the non-i.i.d MNIST and CIFAR-10 datasets, l=5.
Figure 3: Learning time comparisons between AutoFL and Per-FedAvg with respect to different sampled data sizes using (a) non-i.i.d MNIST and (b) non-i.i.d CIFAR-10 datasets.

V-A3 Baselines

We compare the performance of AutoFL with Per-FedAvg [14] and FedAvg [7]. FedAvg is the first algorithm proposed for FL and the most general FL algorithm, while Per-FedAvg is the most general MAML-based FL algorithm. In Per-FedAvg, we first fix the sampled data size of all UEs to 5. As for the transmit power in Per-FedAvg, all UEs use the maximum power P_max, since with only 5 data samples to train on at each UE, more of the energy budget can be devoted to transmitting the local updates to the BS. As for FedAvg, we consider two settings. In the first, we set the sampled data size of each UE to 5 and the transmit power in all rounds to P_max, which is the same as the Per-FedAvg setting; we name FedAvg with this setting FedAvg-1. In the second, we set the sampled data sizes and transmit powers of the UEs to be the same as in AutoFL; we name FedAvg with this setting FedAvg-2. Later, we vary the sampled data sizes in Per-FedAvg and FedAvg-1 to examine the influence of the number of sampled data points on the performance gap between Per-FedAvg and AutoFL.
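For clarity, the four schemes compared in the rest of this section differ only in how the sampled data size and the transmit power are chosen; the snippet below simply restates the settings above in a compact form (with P_max = 0.01 W from Table I), and introduces no setting beyond those already described.

```python
P_MAX = 0.01  # W, from Table I

baselines = {
    "AutoFL":     {"sampled_data_size": "per round, Algorithm 2", "transmit_power": "per round, Algorithm 2"},
    "Per-FedAvg": {"sampled_data_size": 5,                        "transmit_power": P_MAX},
    "FedAvg-1":   {"sampled_data_size": 5,                        "transmit_power": P_MAX},
    "FedAvg-2":   {"sampled_data_size": "same as AutoFL",         "transmit_power": "same as AutoFL"},
}
```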

V-B Learning Time Comparisons

We first compare the learning time of AutoFL with that of Per-FedAvg, FedAvg-1, and FedAvg-2 with respect to the test accuracy. The results are shown in Fig. 2. We analyse the results from three aspects. (i) First, it is clear that at the same accuracy, AutoFL requires the least learning time. For example, when the algorithms start to converge, Per-FedAvg takes at least twice as much time as AutoFL. This is a reasonable result, as AutoFL is designed to minimize the overall learning time. Meanwhile, as the test accuracy increases, the average learning time of all algorithms grows rapidly. The experimental results verify our theoretical analysis and confirm the effectiveness of AutoFL in reducing the overall learning time. (ii) Second, we observe that the two MAML-based FL algorithms outperform the two conventional FL algorithms, and this advantage is even more pronounced on the non-i.i.d CIFAR-10 dataset than on the MNIST datasets, which testifies to the advantage of MAML in accelerating the training speed of FL. Meanwhile, the two algorithms with optimized sampled data size and transmit power, AutoFL and FedAvg-2, outperform Per-FedAvg and FedAvg-1. This indicates that jointly optimizing the sampled data size and transmit power is a promising way to improve FL performance over mobile edge networks. (iii) Third, the results on i.i.d datasets are more stable than those on non-i.i.d datasets, and the average learning time under i.i.d datasets is smaller than that under non-i.i.d datasets. This is because, in the i.i.d case, the local model of each individual UE is more representative of the learned global model, which is beneficial to the learning performance.

Figure 4: Convergence performance comparisons using (a, d) i.i.d MNIST, (b, e) non-i.i.d MNIST, and (c, f) non-i.i.d CIFAR-10 datasets. For the non-i.i.d MNIST and CIFAR-10 datasets, l=5.

Thereafter, we change the sampled data size of Per-FedAvg to see how the gap between Per-FedAvg and AutoFL changes with the number of data samples used by Per-FedAvg. Note that when the number of samples increases, the transmit power of UEs in Per-FedAvg is not always P_max; therefore, here we use the PowerComputation algorithm to compute the transmit power for the predefined data sample sizes. Fig. 3 shows the results. From Fig. 3 we observe that, no matter how the number of data samples in Per-FedAvg changes, its overall learning time is never smaller than that of AutoFL. As the sampled data size increases from 5 to 40, for the same test accuracy, the average learning time of Per-FedAvg decreases; this trend is more obvious on the CIFAR-10 dataset. However, when the number of sampled data points reaches 60, the learning time of Per-FedAvg suddenly deteriorates. We attribute this to the fact that a larger sampled data size requires a larger transmit power, and once the transmit power of a UE reaches its upper bound, it cannot be increased further. Due to insufficient transmit power, the updates transmitted by such a UE fail to reach the server for the global model update. Therefore, although the number of samples on each UE increases, the learning time may become longer. This result gives us confidence that, although the original problem (P1) is NP-hard and AutoFL only approximates its optimal solution, the approximate solution is still effective in keeping the learning time as small as possible.
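A back-of-the-envelope illustration of this failure mode is given below. It assumes a linear relation between the sampled data size and the transmit power a UE needs to get its update decoded; this relation and its constant are purely illustrative and are not the paper's uplink model.

```python
P_MAX = 0.01  # W, from Table I

def update_delivered(d, power_per_sample=2e-4):
    """Toy model: assume the transmit power needed to deliver a UE's update grows
    linearly with the sampled data size d (power_per_sample is an assumed constant).
    Once the requirement exceeds P_max, the update cannot reach the BS."""
    return power_per_sample * d <= P_MAX

for d in (5, 20, 40, 60):
    status = "delivered" if update_delivered(d) else "lost (power capped at P_max)"
    print(f"d = {d:2d}: update {status}")
```

Under these assumed numbers, the updates are delivered for d = 5, 20, 40 but lost at d = 60, mirroring the sudden deterioration observed in Fig. 3.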

V-C Convergence Performance Comparisons

Next, we compare the convergence performance of AutoFL, Per-FedAvg, FedAvg-1, and FedAvg-2 in terms of training loss and test accuracy: the smaller the training loss, the better the convergence performance, while for the test accuracy, the larger, the better. Fig. 4 shows the convergence performance using the i.i.d MNIST, non-i.i.d MNIST, and non-i.i.d CIFAR-10 datasets, respectively.

From Fig. 4, it is observed that AutoFL outperforms the other three algorithms on both metrics, and this advantage is more obvious on the non-i.i.d CIFAR-10 dataset. Besides, AutoFL has the fastest convergence rate, which is consistent with the learning time results, where AutoFL performs the best. This also indicates that K* in AutoFL is smaller than that in Per-FedAvg and FedAvg. For example, for the non-i.i.d MNIST datasets shown in Fig. 4, AutoFL starts to converge after about 50 rounds, while Per-FedAvg starts to converge after about 70 rounds and the two FedAvg algorithms after about 100 rounds.

V-D Effect of Network and Dataset Settings

Figure 5: Convergence performance w.r.t. the cell radius R and the non-i.i.d level l, respectively, using the MNIST dataset: (a) the average K* vs. radius R; (b) the highest achievable accuracy vs. radius R; (c) the average K* vs. the non-i.i.d level l; (d) the highest achievable accuracy vs. the non-i.i.d level l.

To understand how different network and dataset settings, such as the radius of the cell and the data heterogeneity, affect the convergence of AutoFL, we conduct a set of experiments on the MNIST dataset.

  • Effect of the cell radius R: Figs. 5a and 5b show the average K* and the highest achievable test accuracy under different cell radii R. Here K* is measured as the number of rounds after which the algorithm starts to converge, with a test accuracy standard deviation of ±1%. From Figs. 5a and 5b, we observe that K* increases increasingly fast with R, while the test accuracy decreases increasingly fast with R. The underlying reason is that, as R increases, the number of UEs whose local updates can be successfully decoded at the BS decreases. This reduction in UEs' participation in the global model update inevitably leads to an increased K* and a decreased test accuracy. Besides, comparing AutoFL with Per-FedAvg, and FedAvg-2 with FedAvg-1, we find that optimizing the transmit power and sampled data size of individual UEs is beneficial to FL convergence, especially when the wireless environment deteriorates with increasing R.

  • Effect of the data heterogeneity level l: The non-i.i.d level l is used to measure the data heterogeneity; the smaller l is, the larger the data heterogeneity. Figs. 5c and 5d show the average K* and the highest achievable model accuracy with respect to the non-i.i.d level l. We observe that as the non-i.i.d level decreases (i.e., as l increases), K* decreases while the test accuracy increases. This result is reasonable, as a higher degree of data heterogeneity across UEs has a more negative impact on the learning process. As expected, Figs. 5c and 5d also demonstrate that AutoFL achieves larger gains on more heterogeneous datasets across UEs.

VI Conclusions

In this paper, we have quantified the benefit MAML brings to FL over mobile edge networks. The quantification is achieved from two aspects: the determination of FL hyperparameters (i.e., the sampled data sizes and the number of communication rounds) and the resource allocation (i.e., the transmit power) on individual UEs. In this regard, we have formulated an overall learning time minimization problem, constrained by the model accuracy and the energy consumption at individual UEs. We have solved this optimization problem by first analysing the convergence rate of MAML-based FL, which is used to bound the three variables as functions of the model accuracy \epsilon. With these upper bounds, the optimization problem is decoupled into three sub-problems, each of which considers one of the variables and uses the corresponding upper bound as its constraint. The first sub-problem guides the optimization of the number of communication rounds. The second and third sub-problems are solved using the coordinate descent method to obtain the sampled data size and the transmit power for each UE in each round, respectively. Based on these solutions, we have proposed AutoFL, a MAML-based FL algorithm that not only quantifies the benefit MAML brings to FL but also maximizes such benefit to achieve fast adaptation and convergence over mobile edge networks. Extensive experimental results have verified that AutoFL outperforms Per-FedAvg and FedAvg in terms of fast adaptation and convergence.

Appendix

In order to prove Theorem 1, we first introduce an intermediate result derived from the Lipschitz gradient assumption. That is, if the gradient of f(w) is L-Lipschitz continuous, i.e., \|\nabla f(w)-\nabla f(u)\|\leq L\|w-u\|, then

f(w)\leq f(u)+\nabla f(w)^{\top}(w-u)+\frac{L}{2}\|w-u\|^{2}. (35)
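For completeness, (35) follows from the integral form of the Lipschitz-gradient condition; a brief derivation (standard, not specific to this paper) is

f(w)-f(u)-\nabla f(w)^{\top}(w-u)=\int_{0}^{1}\left(\nabla f(u+s(w-u))-\nabla f(w)\right)^{\top}(w-u)\,ds\leq\int_{0}^{1}L(1-s)\|w-u\|^{2}\,ds=\frac{L}{2}\|w-u\|^{2},

where the inequality uses \|\nabla f(u+s(w-u))-\nabla f(w)\|\leq L(1-s)\|w-u\|.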

Note that although we assume that each UE only performs one step of gradient descent given the current global model parameter, in the appendix we first consider the general case where τ\tau (τ=1,2,\tau=1,2,\dots) steps of local gradient descents are performed. Then for each UE ii, we have

w~k,ti=\displaystyle\tilde{w}_{k,t}^{i}= wk,t1iα~fi(wk,t1i;𝒟iin);\displaystyle w_{k,t-1}^{i}-\alpha\tilde{\nabla}f_{i}(w_{k,t-1}^{i};\mathcal{D}_{i}^{\text{in}}); (36)
wk,ti=\displaystyle w_{k,t}^{i}= wk,t1iβ(Iα~2fi(wk,t1i;𝒟io))~fi(w~k,ti;𝒟ih).\displaystyle w_{k,t-1}^{i}-\beta(I-\alpha\tilde{\nabla}^{2}f_{i}(w_{k,t-1}^{i};\mathcal{D}_{i}^{\text{o}}))\tilde{\nabla}f_{i}(\tilde{w}_{k,t}^{i};\mathcal{D}_{i}^{\text{h}}). (37)

where t=1,\dots,\tau. After deriving the bound on finding an \epsilon-FOSP in this general case, we set \tau=1 to obtain the desired result shown in Theorem 1. To facilitate the proof, we introduce the following lemma, which has been proved in [14].

Lemma 4.

If the conditions in Assumptions 2-4 hold, then for any α[0,1/L]\alpha\in[0,1/L] and any t0t\geq 0, we have

𝔼[1ni=1nwk,tiwk,t2]35β2tτ(2σF2+γF2).\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\|w_{k,t}^{i}-w_{k,t}\|^{2}\right]\leq 35\beta^{2}t\tau(2\sigma_{F}^{2}+\gamma_{F}^{2}). (38)

where wk,t=1ni=1nwk,tiw_{k,t}=\frac{1}{n}\sum_{i=1}^{n}w_{k,t}^{i}.

Recall that \bar{w}_{k,t}=\frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}w_{k,t}^{i}. From Lemma 1, we know that \nabla F(w) is L_{F}-Lipschitz continuous, and thus, by (35), we have

F(w¯k+1,t+1)\displaystyle F(\bar{w}_{k+1,t+1})
\displaystyle\leq F(w¯k+1,t)+F(w¯k+1,t+1)(w¯k+1,t+1w¯k+1,t)\displaystyle F(\bar{w}_{k+1,t})+\nabla F(\bar{w}_{k+1,t+1})^{\top}(\bar{w}_{k+1,t+1}-\bar{w}_{k+1,t})
+LF2w¯k+1,t+1w¯k+1,t2\displaystyle+\frac{L_{F}}{2}\|\bar{w}_{k+1,t+1}-\bar{w}_{k+1,t}\|^{2}
\displaystyle\leq F(w¯k+1,t)βF(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))\displaystyle F(\bar{w}_{k+1,t})-\beta\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)
+LF2β2i=1nUki~Fi(wk+1,ti)2,\displaystyle+\frac{L_{F}}{2}\beta^{2}\left\|\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}, (39)

where the last inequality is obtained given the fact that

w¯k+1,t+1\displaystyle\bar{w}_{k+1,t+1}
=\displaystyle= 1i=1n𝟙{ski=1,ξki>ϕ}i=1n𝟙{ski=1,ξki>ϕ}wk+1,t+1i\displaystyle\frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}w_{k+1,t+1}^{i}
=\displaystyle= 1i=1n𝟙{ski=1,ξki>ϕ}\displaystyle\frac{1}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}
i=1n𝟙{ski=1,ξki>ϕ}(wk+1,tiβ~Fi(wk+1,ti))\displaystyle\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}\left(w_{k+1,t}^{i}-\beta\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)
=\displaystyle= w¯k+1,tβi=1n𝟙{ski=1,ξki>ϕ}\displaystyle\bar{w}_{k+1,t}-\frac{\beta}{\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}}
i=1n𝟙{ski=1,ξki>ϕ}~Fi(wk+1,ti)\displaystyle\sum_{i=1}^{n}\mathds{1}\{s_{k}^{i}=1,\xi_{k}^{i}>\phi\}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})
=\displaystyle= w¯k+1,tβi=1nUki~Fi(wk+1,ti).\displaystyle\bar{w}_{k+1,t}-\beta\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i}). (40)

Taking expectation on both sides of (39) yields

𝔼[F(w¯k+1,t+1)]\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t+1})]
\displaystyle\leq 𝔼[F(w¯k+1,t)]+LF2β2𝔼[i=1nUki~Fi(wk+1,ti)2]\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t})]+\frac{L_{F}}{2}\beta^{2}\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]
β𝔼[F(w¯k+1,t+1)(i=1nUki~Fi(wk+1,ti))].\displaystyle-\beta\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]. (41)

From the above inequality, it is obvious that the key is to bound the term i=1nUki~Fi(wk+1,ti)\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i}). Let

i=1nUki~Fi(wk+1,ti)=X+Y+Z+i=1nUkiFi(w¯k+1,t),\sum_{i=1}^{n}U_{k}^{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})=X+Y+Z+\sum_{i=1}^{n}U_{k}^{i}\nabla F_{i}(\bar{w}_{k+1,t}), (42)

where

X=\displaystyle X= i=1nUki(~Fi(wk+1,ti)Fi(wk+1,ti)),\displaystyle\sum_{i=1}^{n}U_{k}^{i}\left(\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i})\right),
Y=\displaystyle Y= i=1nUki(Fi(wk+1,ti)Fi(wk+1,t)),\displaystyle\sum_{i=1}^{n}U_{k}^{i}\left(\nabla F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t})\right),
Z=\displaystyle Z= i=1nUki(Fi(wk+1,t)Fi(w¯k+1,t)).\displaystyle\sum_{i=1}^{n}U_{k}^{i}\left(\nabla F_{i}(w_{k+1,t})-\nabla F_{i}(\bar{w}_{k+1,t})\right). (43)

Let k+1,t\mathcal{F}_{k+1,t} be the information up to round k+1k+1, local step tt. Our next step is to bound the moments of XX, YY, ZZ, conditioning on k+1,t\mathcal{F}_{k+1,t}. Recall the Cauchy-Schwarz inequality

\left\|\sum_{i=1}^{n}a_{i}b_{i}\right\|^{2}\leq\left(\sum_{i=1}^{n}\|a_{i}\|^{2}\right)\left(\sum_{i=1}^{n}\|b_{i}\|^{2}\right). (44)
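The form (44), with vectors a_{i} and scalar weights b_{i}, follows from the triangle inequality and the scalar Cauchy-Schwarz inequality:

\left\|\sum_{i=1}^{n}a_{i}b_{i}\right\|\leq\sum_{i=1}^{n}|b_{i}|\,\|a_{i}\|\leq\left(\sum_{i=1}^{n}b_{i}^{2}\right)^{1/2}\left(\sum_{i=1}^{n}\|a_{i}\|^{2}\right)^{1/2}.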
  • As for XX, consider the Cauchy-Schwarz inequality (44) with ai=~Fi(wk+1,ti)Fi(wk+1,ti)a_{i}=\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i}) and bi=Ukib_{i}=U_{k}^{i}, we obtain

    X2\displaystyle\|X\|^{2}
    \displaystyle\leq (i=1nUi2)(i=1n~Fi(wk+1,ti)Fi(wk+1,ti)2),\displaystyle\left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\sum_{i=1}^{n}\left\|\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i})\right\|^{2}\right), (45)

    where Ui=maxkUkiU_{i}=\max_{k}U_{k}^{i}. Hence, by using Lemma 2 along with the tower rule, we have

    𝔼[X2]=𝔼[𝔼[X2|k+1,t]](i=1nUi2)σF2.\mathbb{E}[\|X\|^{2}]=\mathbb{E}[\mathbb{E}[\|X\|^{2}|\mathcal{F}_{k+1,t}]]\leq\left(\sum_{i=1}^{n}U_{i}^{2}\right)\sigma_{F}^{2}. (46)
  • As for YY, consider the Cauchy-Schwarz inequality (44) with ai=Fi(wk+1,ti)Fi(wk+1,t)a_{i}=\nabla F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}) and bi=Ukib_{i}=U_{k}^{i}, along with the smoothness of FiF_{i}, we obtain

    Y2\displaystyle\|Y\|^{2}
    \displaystyle\leq (i=1nUi2)(i=1nFi(wk+1,ti)Fi(wk+1,t)2)\displaystyle\left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\sum_{i=1}^{n}\left\|\nabla F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t})\right\|^{2}\right)
    \displaystyle\leq LF2(i=1nUi2)i=1nwk+1,tiwk+1,t2.\displaystyle L_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\sum_{i=1}^{n}\|w_{k+1,t}^{i}-w_{k+1,t}\|^{2}. (47)

    Again, taking expectation on both sides of (47) along with the tower rule, we obtain

    𝔼[Y2]\displaystyle\mathbb{E}[\|Y\|^{2}]\leq LF2(i=1nUi2)𝔼[i=1nwk,tiwk,t2]\displaystyle L_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\mathbb{E}\left[\sum_{i=1}^{n}\|w_{k,t}^{i}-w_{k,t}\|^{2}\right]
    \displaystyle\leq 35β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2),\displaystyle 35\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right), (48)

    where the last step is obtained from (38) in Lemma 4 along with the fact that tτ1t\leq\tau-1.

  • As for Z, first recall that for n numbers a_{1},a_{2},\dots,a_{n} with mean \mu=\frac{1}{n}\sum_{i=1}^{n}a_{i} and variance \sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}|a_{i}-\mu|^{2}, if we denote the number of UEs that successfully deliver their local updates for the global model as \delta n (0\leq\delta\leq 1), then according to [14], we have

    𝔼[|i=1nUiaiμ|2]=σ2(1δ)δ(n1).\mathbb{E}\left[\left|\sum_{i=1}^{n}U_{i}a_{i}-\mu\right|^{2}\right]=\frac{\sigma^{2}(1-\delta)}{\delta(n-1)}. (49)

    Using this, we have

    \displaystyle\mathbb{E}\left[\|\bar{w}_{k+1,t}-w_{k+1,t}\|^{2}\,\big|\,\mathcal{F}_{k+1,t}\right]
    \displaystyle\leq (1δ)i=1nwk+1,tiwk+1,t2δ(n1)n.\displaystyle\frac{(1-\delta)\sum_{i=1}^{n}\|w_{k+1,t}^{i}-w_{k+1,t}\|^{2}}{\delta(n-1)n}. (50)

    Therefore, by taking expectation on both sides of (50) along with the tower rule, we have

    𝔼[w¯k+1,twk+1,t2]\displaystyle\mathbb{E}\left[\|\bar{w}_{k+1,t}-w_{k+1,t}\|^{2}\right]
    \displaystyle\leq 35β2(1δ)τ(τ1)(2σF2+γF2)δ(n1),\displaystyle\frac{35\beta^{2}(1-\delta)\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})}{\delta(n-1)}, (51)

    where the inequality is also obtained from (38) in Lemma 4. Next, consider the Cauchy-Schwarz inequality (44) with ai=Fi(wk+1,t)Fi(w¯k+1,t)a_{i}=\nabla F_{i}(w_{k+1,t})-\nabla F_{i}(\bar{w}_{k+1,t}) and bi=Ukib_{i}=U_{k}^{i}, we have

    Z2\displaystyle\|Z\|^{2}
    \displaystyle\leq (i=1nUi2)(i=1nFi(wk+1,t)Fi(w¯k+1,t)2)\displaystyle\left(\sum_{i=1}^{n}U_{i}^{2}\right)\left(\sum_{i=1}^{n}\left\|\nabla F_{i}(w_{k+1,t})-\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}\right)
    \displaystyle\leq LF2(i=1nUi2)i=1nwk+1,tw¯k+1,t2.\displaystyle L_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\sum_{i=1}^{n}\|w_{k+1,t}-\bar{w}_{k+1,t}\|^{2}. (52)

    At this point, taking expectation on both sides of (52), along with the use of (51), yields

    𝔼[Z2]\displaystyle\mathbb{E}[\|Z\|^{2}]
    \displaystyle\leq 35β2LF2(1δ)nτ(τ1)(2σF2+γF2)δ(n1)(i=1nUi2).\displaystyle\frac{35\beta^{2}L_{F}^{2}(1-\delta)n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})}{\delta(n-1)}\left(\sum_{i=1}^{n}U_{i}^{2}\right). (53)

Now, getting back to (41), we first lower bound the term

𝔼[F(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))].\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]. (54)

From (42), we have

𝔼\displaystyle\mathbb{E} [F(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))]\displaystyle\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]
=\displaystyle= 𝔼[F(w¯k+1,t+1)(X+Y+Z+i=1nUiFi(w¯k+1,t))]\displaystyle\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(X+Y+Z+\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right)\right]
\displaystyle\geq 𝔼[i=1nUiF(w¯k+1,t)2]𝔼[F(w¯k+1,t+1)X]\displaystyle\mathbb{E}\left[\sum_{i=1}^{n}U_{i}\|\nabla F(\bar{w}_{k+1,t})\|^{2}\right]-\|\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}X\right]\|
14𝔼[F(w¯k+1,t)2]𝔼[Y+Z2].\displaystyle-\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]-\mathbb{E}[\|Y+Z\|^{2}]. (55)

This inequality follows the same reasoning as in [14], using the fact that

𝔼[F(w¯k+1,t)(Y+Z)]\displaystyle\mathbb{E}[\nabla F(\bar{w}_{k+1,t})^{\top}(Y+Z)]
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]+𝔼[Y+Z2].\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\mathbb{E}[\|Y+Z\|^{2}]. (56)
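The inequality (56) is an instance of Young's inequality: for any two vectors a and b,

a^{\top}b\leq\|a\|\,\|b\|\leq\frac{1}{4}\|a\|^{2}+\|b\|^{2},

since \frac{1}{4}x^{2}+y^{2}-xy=\left(\frac{x}{2}-y\right)^{2}\geq 0 for any scalars x,y\geq 0; here a=\nabla F(\bar{w}_{k+1,t}), b=Y+Z, and expectations are taken on both sides.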

Meanwhile, the bounds of the terms on the right-hand side of (55) can be derived similarly to [14]. That is,

𝔼[F(w¯k+1,t+1)X]\displaystyle\|\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}X\right]\|
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]+𝔼[𝔼[X|k+1,t]2]\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\mathbb{E}\left[\left\|\mathbb{E}\left[X|\mathcal{F}_{k+1,t}\right]\right\|^{2}\right]
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
+𝔼[i=1nUi(~Fi(wk+1,ti)Fi(wk+1,ti))2]\displaystyle+\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}(\tilde{\nabla}F_{i}(w_{k+1,t}^{i})-\nabla F_{i}(w_{k+1,t}^{i}))\right\|^{2}\right]
\displaystyle\leq 14𝔼[F(w¯k+1,t)2]+(i=1nUi2)4α2L2σG2D.\displaystyle\frac{1}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}. (57)

Meanwhile, using the Cauchy-Schwarz inequality (44), we have

𝔼[Y+Z2]\displaystyle\mathbb{E}[\|Y+Z\|^{2}]
\displaystyle\leq 2(𝔼[Y2]+𝔼[Z2])\displaystyle 2(\mathbb{E}[\|Y\|^{2}]+\mathbb{E}[\|Z\|^{2}])
\displaystyle\leq 70β2LF2nτ(τ1)(2σF2+γF2)(1+1δδ(n1))(i=1nUi2)\displaystyle 70\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(1+\frac{1-\delta}{\delta(n-1)}\right)\left(\sum_{i=1}^{n}U_{i}^{2}\right)
\displaystyle\leq 140β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2).\displaystyle 140\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right). (58)
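The last step above uses the bound

\frac{1-\delta}{\delta(n-1)}\leq 1\quad\Longleftrightarrow\quad 1\leq\delta n,

which holds as long as at least one UE's local update is successfully received by the BS in each round (i.e., \delta n\geq 1).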

Combining (57) and (58) yields

𝔼[F(w¯k+1,t+1)(i=1nUi~Fi(wk+1,ti))]\displaystyle\mathbb{E}\left[\nabla F(\bar{w}_{k+1,t+1})^{\top}\left(\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right)\right]
\displaystyle\geq 𝔼[i=1n(Ui12)F(w¯k+1,t)2](i=1nUi2)4α2L2σG2D\displaystyle\mathbb{E}\left[\sum_{i=1}^{n}(U_{i}-\frac{1}{2})\|\nabla F(\bar{w}_{k+1,t})\|^{2}\right]-\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}
140β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle-140\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
\displaystyle\geq 12𝔼[F(w¯k+1,t)2](i=1nUi2)4α2L2σG2D\displaystyle\frac{1}{2}\mathbb{E}\left[\|\nabla F(\bar{w}_{k+1,t})\|^{2}\right]-\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}
140β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2),\displaystyle-140\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right), (59)

where the last step is derived from the fact that \sum_{i=1}^{n}U_{i}\geq\sum_{i=1}^{n}U_{k}^{i}=1. Next, we characterize an upper bound for the other term in (41):

𝔼[i=1nUi~Fi(wk+1,ti)2].\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]. (60)

To do this, we still use the equality (42), that is,

i=1nUi~Fi(wk+1,ti)2\displaystyle\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}
\displaystyle\leq 2X+Y+Z2+2i=1nUiFi(w¯k+1,t)2\displaystyle 2\|X+Y+Z\|^{2}+2\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}
\displaystyle\leq 4X2+4Y+Z2+2i=1nUiFi(w¯k+1,t)2.\displaystyle 4\|X\|^{2}+4\|Y+Z\|^{2}+2\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}. (61)

Hence, taking expectations on both sides of (61), along with (46) and (58), we have

𝔼[i=1nUi~Fi(wk+1,ti)2]\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]
\displaystyle\leq 2𝔼[i=1nUiFi(w¯k+1,t)2]+4σF2(i=1nUi2)\displaystyle 2\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}\right]+4\sigma_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+560β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2).\displaystyle+560\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right). (62)

Again, consider the Cauchy-Schwarz inequality with ai=Fi(wk+1,t)a_{i}=\nabla F_{i}(w_{k+1,t}) and bi=Uib_{i}=U_{i}, we have

i=1nUiFi(w¯k+1,t)2\displaystyle\left\|\sum_{i=1}^{n}U_{i}\nabla F_{i}(\bar{w}_{k+1,t})\right\|^{2}
\displaystyle\leq (i=1nFi(wk+1,t)2)(i=1nUi2).\displaystyle\left(\left\|\sum_{i=1}^{n}\nabla F_{i}(w_{k+1,t})\right\|^{2}\right)\left(\sum_{i=1}^{n}U_{i}^{2}\right). (63)

By Lemma 3, we have

𝔼[Fi(w¯k+1,t)F(w¯k+1,t)2]γF2.\mathbb{E}\left[\|\nabla F_{i}(\bar{w}_{k+1,t})-\nabla F(\bar{w}_{k+1,t})\|^{2}\right]\leq\gamma_{F}^{2}. (64)

Recall the relationship between the expectation of a random vector 𝐱\mathbf{x}, 𝔼(𝐱)\mathbb{E}(\mathbf{x}) and its variance, 𝔻(𝐱)\mathbb{D}(\mathbf{x}), that 𝔻(𝐱)=𝔼(𝐱2)[𝔼(𝐱)]2\mathbb{D}(\mathbf{x})=\mathbb{E}(\mathbf{x}^{2})-[\mathbb{E}(\mathbf{x})]^{2}. Given that 1/ni=1nFi(w¯k+1,t)=F(w¯k+1,t)1/n\sum_{i=1}^{n}\nabla F_{i}(\bar{w}_{k+1,t})=\nabla F(\bar{w}_{k+1,t}), we have

𝔼[i=1nFi(wk+1,t)2]=γF2+𝔼[F(w¯k+1,t)2].\mathbb{E}\left[\left\|\sum_{i=1}^{n}\nabla F_{i}(w_{k+1,t})\right\|^{2}\right]=\gamma_{F}^{2}+\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]. (65)

Combining (62), (63) and (65) yields

𝔼[i=1nUi~Fi(wk+1,ti)2]\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{n}U_{i}\tilde{\nabla}F_{i}(w_{k+1,t}^{i})\right\|^{2}\right]
\displaystyle\leq 2(i=1nUi2)𝔼[F(w¯k+1,t)2]\displaystyle 2\left(\sum_{i=1}^{n}U_{i}^{2}\right)\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
+560β2LF2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle+560\beta^{2}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+2(i=1nUi2)γF2+4σF2(i=1nUi2).\displaystyle+2\left(\sum_{i=1}^{n}U_{i}^{2}\right)\gamma_{F}^{2}+4\sigma_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right). (66)

Substituting (59) and (66) into (41) implies

𝔼\displaystyle\mathbb{E} [F(w¯k+1,t+1)]\displaystyle[F(\bar{w}_{k+1,t+1})]
\displaystyle\leq 𝔼[F(w¯k+1,t)]β[12βLFi=1nUi2]𝔼[F(w¯k+1,t)2]\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t})]-\beta\left[\frac{1}{2}-\beta L_{F}\sum_{i=1}^{n}U_{i}^{2}\right]\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
+140(1+2βLF)β3LF2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle+140(1+2\beta L_{F})\beta^{3}L_{F}^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+LFβ2(i=1nUi2)γF2+β(i=1nUi2)4α2L2σG2D\displaystyle+L_{F}\beta^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)\gamma_{F}^{2}+\beta\left(\sum_{i=1}^{n}U_{i}^{2}\right)\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}
+2LFβ2σF2(i=1nUi2)\displaystyle+2L_{F}\beta^{2}\sigma_{F}^{2}\left(\sum_{i=1}^{n}U_{i}^{2}\right)
\displaystyle\leq 𝔼[F(w¯k+1,t)]β4𝔼[F(w¯k+1,t)2]+βσT2,\displaystyle\mathbb{E}[F(\bar{w}_{k+1,t})]-\frac{\beta}{4}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]+\beta\sigma_{T}^{2}, (67)

where

σT2=\displaystyle\sigma_{T}^{2}= 280(βLF)2nτ(τ1)(2σF2+γF2)(i=1nUi2)\displaystyle 280(\beta L_{F})^{2}n\tau(\tau-1)(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)
+βLF(2σF2+γF2)(i=1nUi2)+4α2L2σG2D(i=1nUi2).\displaystyle+\beta L_{F}(2\sigma_{F}^{2}+\gamma_{F}^{2})\left(\sum_{i=1}^{n}U_{i}^{2}\right)+\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}\left(\sum_{i=1}^{n}U_{i}^{2}\right). (68)

The last step of (67) is obtained using \beta\leq 1/(2\tau L_{F}). Summing (67) from t=0 to t=\tau-1, we have

𝔼[F(wk+1)]\displaystyle\mathbb{E}[F(w_{k+1})]
\displaystyle\leq 𝔼[F(wk)]βτ4(1τt=0τ1𝔼[F(w¯k+1,t)2])+βτσT2,\displaystyle\mathbb{E}[F(w_{k})]-\frac{\beta\tau}{4}\left(\frac{1}{\tau}\sum_{t=0}^{\tau-1}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]\right)+\beta\tau\sigma_{T}^{2}, (69)

which follows from the fact that \bar{w}_{k+1,\tau}=w_{k+1}. Finally, summing (69) from k=0 to K-1, we have

𝔼[F(wK)]𝔼[F(w0)]\displaystyle\mathbb{E}[F(w_{K})]\leq\mathbb{E}[F(w_{0})]
βτK4(1τKk=0K1t=0τ1𝔼[F(w¯k+1,t)2])+βτKσT2.\displaystyle-\frac{\beta\tau K}{4}\left(\frac{1}{\tau K}\sum_{k=0}^{K-1}\sum_{t=0}^{\tau-1}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]\right)+\beta\tau K\sigma_{T}^{2}. (70)

As a result, we have

1τKk=0K1t=0τ1𝔼[F(w¯k+1,t)2]\displaystyle\frac{1}{\tau K}\sum_{k=0}^{K-1}\sum_{t=0}^{\tau-1}\mathbb{E}[\|\nabla F(\bar{w}_{k+1,t})\|^{2}]
\displaystyle\leq 4βτK(F(w0)𝔼[(F(wK))]+βτKσT2)\displaystyle\frac{4}{\beta\tau K}(F(w_{0})-\mathbb{E}[(F(w_{K}))]+\beta\tau K\sigma_{T}^{2})
\displaystyle\leq 4(F(w0)F)βτK+4σT2.\displaystyle\frac{4(F(w_{0})-F^{*})}{\beta\tau K}+4\sigma_{T}^{2}. (71)

Given that we only consider one step of local SGD in this paper, we use τ=1\tau=1, and then the desired result is obtained.
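For concreteness, substituting \tau=1 into (68) and (71) (so that all \tau(\tau-1) terms vanish) gives the specialized bound below; this is our restatement of the \tau=1 case in terms of the quantities defined above:

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|\nabla F(\bar{w}_{k+1,0})\|^{2}\right]\leq\frac{4(F(w_{0})-F^{*})}{\beta K}+4\sigma_{T}^{2},\quad\text{where}\quad\sigma_{T}^{2}=\left(\beta L_{F}(2\sigma_{F}^{2}+\gamma_{F}^{2})+\frac{4\alpha^{2}L^{2}\sigma_{G}^{2}}{D}\right)\sum_{i=1}^{n}U_{i}^{2}.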

References

  • [1] M. Ghahramani, M. Zhou, and G. Wang, “Urban sensing based on mobile phone data: approaches, applications, and challenges,” IEEE/CAA Journal of Automatica Sinica, vol. 7, no. 3, pp. 627–637, 2020.
  • [2] R. Pryss, M. Reichert, J. Herrmann, B. Langguth, and W. Schlee, “Mobile crowd sensing in clinical and psychological trials–a case study,” in IEEE International Symposium on Computer-Based Medical Systems, 2015, pp. 23–24.
  • [3] R. K. Ganti, F. Ye, and H. Lei, “Mobile crowdsensing: current state and future challenges,” IEEE Communications Magazine, vol. 49, no. 11, pp. 32–39, 2011.
  • [4] P. Li, J. Li, Z. Huang, T. Li, C.-Z. Gao, S.-M. Yiu, and K. Chen, “Multi-key privacy-preserving deep learning in cloud computing,” Future Generation Computer Systems, vol. 74, pp. 76–85, 2017.
  • [5] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017.
  • [6] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
  • [7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
  • [8] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh, “Three approaches for personalization with applications to federated learning,” arXiv preprint arXiv:2002.10619, 2020.
  • [9] J. Schneider and M. Vlachos, “Mass personalization of deep learning,” arXiv preprint arXiv:1909.02803, 2019.
  • [10] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
  • [11] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International Conference on Machine Learning (ICML), vol. 70, 2017, pp. 1126–1135.
  • [12] J. Vanschoren, “Meta-learning,” in Automated Machine Learning.   Springer, Cham, 2019, pp. 35–61.
  • [13] F. Hutter, L. Kotthoff, and J. Vanschoren, Automated machine learning: methods, systems, challenges.   Springer Nature, 2019.
  • [14] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • [15] Y. Jiang, J. Konečnỳ, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” arXiv preprint arXiv:1909.12488, 2019.
  • [16] Y. Deng, M. M. Kamani, and M. Mahdavi, “Adaptive personalized federated learning,” arXiv preprint arXiv:2003.13461, 2020.
  • [17] C. T. Dinh, N. H. Tran, and T. D. Nguyen, “Personalized federated learning with moreau envelopes,” arXiv preprint arXiv:2006.08848, 2020.
  • [18] Q. Wu, K. He, and X. Chen, “Personalized federated learning for intelligent iot applications: A cloud-edge based framework,” IEEE Open Journal of the Computer Society, vol. 1, pp. 35–44, 2020.
  • [19] S. Yue, J. Ren, J. Xin, S. Lin, and J. Zhang, “Inexact-admm based federated meta-learning for fast and continual edge learning,” in Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 2021, pp. 91–100.
  • [20] S. J. Wright, Coordinate descent algorithms.   Springer, 2015, vol. 151, no. 1.
  • [21] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning: A meta-learning approach,” arXiv preprint arXiv:2002.07948, 2020.
  • [22] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2019.
  • [23] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in IEEE International Conference on Computer Communications (INFOCOM), 2019, pp. 1387–1395.
  • [24] Y. Pei, Z. Peng, Z. Wang, and H. Wang, “Energy-efficient mobile edge computing: three-tier computing under heterogeneous networks,” Hindawi Wireless Communications and Mobile Computing, vol. 2020, 2020.
  • [25] D. P. Palomar and M. Chiang, “A tutorial on decomposition methods for network utility maximization,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 24, no. 8, pp. 1439–1451, 2006.
  • [26] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [27] “The CIFAR-10 dataset,” https://, 2014.
  • [28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8026–8037.