This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Toward Multiple Federated Learning Services Resource Sharing in Mobile Edge Networks

Minh N. H. Nguyen1,     Nguyen H. Tran2,     Yan Kyaw Tun1,     Zhu Han3,     Choong Seon Hong1 1Department of Computer Science and Engineering, Kyung Hee University, South Korea 2School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia 3Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77004-4005, USA
Abstract

Federated Learning is a new learning scheme for collaborative training a shared prediction model while keeping data locally on participating devices. In this paper, we study a new model of multiple federated learning services at the multi-access edge computing server. Accordingly, the sharing of CPU resources among learning services at each mobile device for the local training process and allocating communication resources among mobile devices for exchanging learning information must be considered. Furthermore, the convergence performance of different learning services depends on the hyper-learning rate parameter that needs to be precisely decided. Towards this end, we propose a joint resource optimization and hyper-learning rate control problem, namely MS-FEDL, regarding the energy consumption of mobile devices and overall learning time. We design a centralized algorithm based on the block coordinate descent method and a decentralized JP-miADMM algorithm for solving the MS-FEDL problem. Different from the centralized approach, the decentralized approach requires many iterations to obtain but it allows each learning service to independently manage the local resource and learning process without revealing the learning service information. Our simulation results demonstrate the convergence performance of our proposed algorithms and the superior performance of our proposed algorithms compared to the heuristic strategy.

Index Terms:
Federated Learning, resource allocation, multi-access edge computing, decentralized optimization.

I Introduction

Nowadays, following the great success of Machine Learning (ML) and Artificial Intelligence (AI) applications, there are more and more intelligent services that have transformed our lives. This progress has been drastically elevated by the ubiquity of device-generated data that is available to the service operator and stronger computing power at cloud data centers and mobile devices. Recently, the deployment of MEC servers at the edge networks has been acknowledged as one of the key pillars to revolutionize mobile communication by assisting cellular base stations with low latency computing capability. When compared to cloud datacenter, the machine learning training process can be done at the mobile edge network with the help of multi-access edge computing (MEC) servers, resulting in lower communication latency for exchanging learning information. Therefore, these enablers unlock the full potential of edge ML applications for the vision of truly intelligent next-generation communication systems in 6G [1]. However, the ML applications raise a privacy concern in the data collection for training purposes. In many ML applications (e.g., Crowdtracker [2], Waze Carpool [3], etc.), users are required to share their sensitive personal information (i.e., user location, user identity, user photos, etc.) to the server. Furthermore, uploading a massive amount of data throughout radio access links or the Internet to the cloud data centers is costly. Hence, the strong computation capabilities of the increasingly powerful mobile devices empower the local inference and fine-tuning of Deep Neural Networks (DNNs) model on device without sending their training data to the server [4, 5]. On the other hand, using solely the personalized local data could lead to the overfitting problem of the local training models. Thus, sharing the local learning model parameters among user equipments (UEs) equip to build up a generalized global model is the primary idea of a brand-new ML scheme, namely federated learning [6, 7, 8]. The deployment of this learning scheme at the edge networks brings up latency and transmission cost reduction by sending the weight parameters of local learning models to the MEC server instead of sending device-generated data to the cloud and enhances the user privacy compared to the conventional centralized training [9]. Inevitably, the federated learning scheme is one of the vital enablers to bring edge intelligence into reality.

Refer to caption
Figure 1: Multiple Federated Learning services model.

In the typical federated learning scheme, the actual training process is decentralized as each UE constructs a local model based on its local dataset. Then a shared global model is updated at the server by aggregating local learning weight parameters from all UEs such as gradient, and learning weight parameters. After that the updated global model is broadcast back to UEs. As an example, the work of [6] has provided the simplest form of a federated learning algorithm in which the learning parameters of the global model are averaged from local ones at UEs. However, the performance of this learning scheme at the edge networks firmly depends on the allocated computation resources for local training and communication resources for uploading the updated local model parameters to the MEC server. Therefore, the allocation problem for both computation and communication resources is crucial for deploying a federated learning scheme at the edge networks. This type of problem is well-studied in [10, 11] that design resource allocation frameworks regarding learning performance and resource cost. Since these works focus on the model of a single federated learning service in the edge networks, there is the necessity of an extensive study and analysis for the upcoming multiple learning services systems. Indeed, more data consisting of network-level mobile data (e.g., call detail record, radio information, performance indicator, etc.) and app-level data (e.g., device profile, sensing data, photos, video streaming, voice data, etc.) [12] can be collected on user equipment such as mobile devices, wearable devices, augmented reality headset, IoT devices. As a result, it enables many ML applications and services can be deployed on UEs or in the edge networks. In addition to the independent deployment for different learning tasks, multiple deep learning models can be deployed together and provide better performance systems such as the mobile vision system [4, 5].

According to the essential deployment of multiple federated learning services, the computation resources necessarily are shared among these services for the local training while the communication resources are shared among mobile devices for exchanging learning information from different services. Moreover, the performance of learning services depends on the learning parameters that need to be precisely decided regarding the resource allocation cost and the overall learning time. In this paper, we study the under-explored problem - the shared computation, communication resource allocation, and the learning parameter control for multiple federated learning services coexisting at the edge networks.

In Fig. 1, we depict the system of multiple federate learning services where a Federated Learning Orchestrator (FLO) at the MEC server is in charge of the computation and communication resource management, controls the hyper-learning rate parameter of learning services and operates these federated learning services. Accordingly, FLO performs two main processes as follows

Resource allocation process: In this work, we consider the flexible CPU sharing model such as the CPU frequency sharing among virtual machines or containers to perform the local learning updates. Since those virtual instances often require a high deployment cost, we consider the pre-allocating CPU strategy for different services. To capture the trade-off between the energy consumption of mobile devices and overall learning time, we propose a resource optimization problem, namely MS-FEDL that decides the optimal CPU frequency for each learning service and the fraction of total uplink bandwidth for each UE. As shown in Fig. 1, computing (i.e., CPU cycles) and communication (i.e., bandwidth) resources are shared among learning services and UEs, respectively. In addition to resource allocation, it also controls the hyper-learning rate of learning services such as the relative accuracy of the local learning problem at the UEs.

Federated learning process: After the shared computation and communication resources allocation, FLO performs the federated learning process iteratively according to the following steps: training local model, transmitting local learning model to FLO, updating the global model, and broadcasting the global model to UEs until the convergence is observed.

In order to provide an efficient approach for the resource allocation and learning parameter control of multiple federated learning services, in this work, we develop the problem design and analysis for FLO, which can be summarized as follows

  • In Section III, we first propose a resource-sharing model for multiple learning services. Then, we pose the resource allocation and learning parameter control problem for FLO to manage multiple learning services with the MS-FEDL problem, which is in the form of a multi-convex problem.

  • In Section IV, we develop both centralized and decentralized approaches to solve the MS-FEDL problem. Specifically, We first propose a centralized algorithm based on block coordinate descent framework that provides a quick scheme by decoupling the global problem into three subproblems such as CPU allocation, bandwidth allocation, and hyper-learning rate decision subproblems. After that, we develop a decentralized approach based on the JP-miADMM algorithm which allows each learning service to independently manage the resource allocation, local learning process and cooperatively operate under the management of FLO. Although the decentralized approach requires many iterations to convergence, it provides a more flexible and scalable method for resource allocation without revealing the learning service information.

  • In Section V, we provide extensive numerical results to demonstrate the convergence performance of the proposed algorithms. Moreover, we present the performance gain of our proposed algorithms when compared with the heuristic strategy. Finally, we present the conclusions in Section VI.

II Related Works

Many attempts on distributed training over multiple machines have recently given rise to research on decentralized machine learning [13, 11, 14]. However, most of the algorithms in these works are designed for machines having balanced and i.i.d. data, and is connected to high-throughput networks in data centers. With a different scheme, Federated Learning (and related on-device intelligence approaches), which has attracted much attention recently [15, 16, 7, 6, 17, 18, 19], exploits the collaboration of mobile devices that can be large in number, slow and/or unstable in Internet connections, and have non-i.i.d. and unbalanced data locally. In the simplest form of a federated learning algorithm (i.e., FedAvg [6]), the authors provide a simple mechanism of averaging the updated learning parameters of the local model using local data at individual UEs. On the other hand, CoCoA+ [17] provides a general framework under a strong theoretical convergence analysis, in which the local learning problem of UEs is transformed into dual problems and can be solved by any arbitrary solvers. Different from CoCoA+ framework, in our recent work [20], we develop a new federated learning algorithm, namely FEDL that uses an additional Bregman divergence in the learning objective [21] to encourage the local model being close to the global model. Most of these existing works aim to provide the learning algorithms that focus on learning performance and providing the theoretical convergence analysis. However, for the practical deployment in mobile edge networks, a resource allocation problem for federated learning service needs to be carefully designed to manage the computation and communication resources. Furthermore, it is important to control the learning parameters by considering energy consumption and learning time convergence. The work of [11] introduced this type of problem for the distributed gradient descent analysis, in which the authors propose a control algorithm that determines the best trade-off between local updates and global parameter aggregation to minimize the loss function under a given resource budget.

In our previous work [10, 20], we propose a resource allocation problem among UEs and the hyper-learning rate control of learning services in the wireless environment regarding the computation, communication latency, UE energy consumption, and the heterogeneity of UEs. In this paper, we study the extensive design for multiple federated learning services that co-exist at the edge networks with the change in the sharing of bandwidth allocation based on OFDMA instead of transmission power control in our previous work. Also, we consider the additional broadcast time, extra communication overhead, and the time for averaging operation at the edge server. Furthermore, both resource allocation and learning processes can be controlled by a FLO at the MEC server which performs the resource allocation in the centralized or decentralized approaches and is being an aggregator for the global learning update. For the centralized solution approach, the MS-FEDL problem is bi-convex and we adopt the alternative minimization algorithm to provide a solid approach that can help the subproblems of the MS-FEDL problem can be solved by arbitrary convex solvers. When all services from the same owner such as multiple deep learning models can be deployed together and provide better performance mobile vision systems in [4, 5], this approach can be sufficient to provide an efficient resource allocation mechanism in which the sharing of service information in the MS-FEDL problem is not an issue. However, the centralized approach will be limited when scaling up to a large number of users and require the sharing of all information among services. To resolve these problems, we develop another flexible decentralized solution approach based on the combination of multi-convex and parallel setting of ADMM. The conventional ADMM algorithm can be applied in convex problem [22], and later extended to parallelly solve subproblem with JP-ADMM [23]. Recently, the new analysis for the convergence of the multi-convex ADMM problem [24]. To the best of our knowledge, the combination of these two extended algorithms for ADMM hasn’t applied in other works as in our decentralized approach. Our decentralized approach can be useful for independent service providers such as multi-tenant FL services that use the common shared resources from the third-party edge provider. The decisions can be made by each service and shared with FLO only without revealing the learning service information (i.e., dataset information, exchange local updates information between UEs and the MEC server, the number of CPU cycles for each UE to execute one sample of data).

III Multi-Service Federated Learning at the Edge

Algorithm 1 Federated Learning Algorithm (FEDL)[20]
1:Input: w0w^{0}, θ[0,1]\theta\in[0,1], η>0\eta>0.
2:for t=1 to Kgt=1\text{\,to\,}K_{g} do
3:     
Local Training: Each UE nn solves its local learning problem (3) in KlK_{l} rounds to achieve θ\theta-approximation solution wntw_{n}^{t} satisfying (5).
4:     
Communication: All UEs transmit wntw_{n}^{t} and Fn(wnt)\nabla F_{n}(w_{n}^{t}) to the edge server.
5:     
Aggregation and Feedbacks: The edge server updates the global model as in (9) and then fed-backs wtw^{t} and F(wt)\nabla F(w^{t}) to all UEs.

III-A Federated Learning Algorithm Design

In this subsection, we summarize our recent federated learning algorithm design according to [20] as in the detail of Algorithm 1. Accordingly, in the typical setting of federated learning for a general supervised learning problem, given a sample data {xi,yi}𝒟\{x_{i},y_{i}\}\in\cal{D} with input xidx_{i}\in\mathbb{R}^{d}, the learning task is required to train the model parameter wdw\in\mathbb{R}^{d} to predict the correct label yiy_{i} by minimizing the loss function fi(w)f_{i}(w). The training data at UEs can be the usage information or sensing data from the integrated sensors. Different from the conventional centralized learning, the dataset of the federated learning scheme is distributed over a set of NN UEs where each participating UE nn collects training data samples and stores a local dataset 𝒟n\mathcal{D}_{n} such that

𝒟=n=1..N𝒟n;n=1..N𝒟n=.\mathcal{D}=\underset{n=1..N}{\cup}\mathcal{D}_{n};\underset{n=1..N}{\cap}\mathcal{D}_{n}=\emptyset.

The local loss function of the learning problem using the local dataset of UE nn is defined as

Fn(w):=1|𝒟n|i𝒟nfi(w).\displaystyle F_{n}(w)\mathrel{\mathop{:}}=\frac{1}{|\mathcal{D}_{n}|}\sum\nolimits_{i\in\mathcal{D}_{n}}f_{i}(w). (1)
Assumption 1.

The local loss function Fn()F_{n}(\cdot) is LL-smooth and β\beta-strongly convex, n\forall n, respectively, as follows, w,wd\forall w,w^{\prime}\in\mathbb{R}^{d}:

Fn(w)\displaystyle F_{n}(w) Fn(w)+Fn(w),ww+L2ww2\displaystyle\leq F_{n}(w^{\prime})\!+\!\bigl{\langle}\nabla F_{n}(w^{\prime}),w-w^{\prime}\bigr{\rangle}\!+\!\frac{L}{2}\left\lVert w-w^{\prime}\right\rVert^{2}
Fn(w)\displaystyle F_{n}(w) Fn(w)+Fn(w),ww+β2ww2,\displaystyle\geq F_{n}(w^{\prime})\!+\!\bigl{\langle}\nabla F_{n}(w^{\prime}),w-w^{\prime}\bigr{\rangle}\!+\!\frac{\beta}{2}\left\lVert w-w^{\prime}\right\rVert^{2},

where x,x\langle x,x^{\prime}\rangle denotes the inner product of vectors xx and xx^{\prime} and \left\lVert\cdot\right\rVert is Euclidean norm. These strong convexity, smoothness assumptions are also used in [25], and satisfied in popular ML problems such as l2l_{2}-regularized linear regression model with fi(w)=12(xi,wyi)2+β2w2,yif_{i}(w)=\frac{1}{2}(\langle{x_{i}},{w}\rangle-y_{i})^{2}+\frac{\beta}{2}\left\lVert w\right\rVert^{2},y_{i}\in\mathbb{R}, and l2l_{2}-regularized logistic regression with fi(w)=log(1+exp(yixi,w))+β2w2,yi{1,1}f_{i}(w)=\log\bigl{(}1+\exp(-y_{i}\langle{x_{i}},{w}\rangle)\bigr{)}+\frac{\beta}{2}\left\lVert w\right\rVert^{2},y_{i}\in\{-1,1\}. Accordingly, we denote ρ:=Lβ\rho\mathrel{\mathop{:}}=\frac{L}{\beta} as the condition number of Fn()F_{n}(\cdot)’s Hessian matrix.

Then, the global loss function of the global learning problem is as follows

minwdF(w):=n=1N|𝒟n||𝒟|Fn(w).\displaystyle\min_{w\in\mathbb{R}^{d}}F(w)\mathrel{\mathop{:}}=\sum\nolimits_{n=1}^{N}\frac{|\mathcal{D}_{n}|}{|\mathcal{D}|}F_{n}(w). (2)

Accordingly, the global learning model can be obtained by solving the global problem (2) using an iterative update process at the server and UEs in the federated learning scheme. These updates perform alternatively within a number of global rounds (i.e., KgK_{g}) that consists of four following steps at one global round as

  • S1. Local Training: Every UE needs to train a local model by using the local training data 𝒟n\mathcal{D}_{n}.

  • S2. Upload local model: UEs transmit the local learning model and global gradient updates to the server.

  • S3. Update global model: The global model is constructed based on the weight parameters of local models at the server.

  • S4. Broadcast global model: The updated global model and gradient are broadcast to all UEs.

Local Training at UEs: According to [20], instead of solving the local objective in the equation (1), the surrogate problem is solved to attain the local model wntw_{n}^{t} for each global round tt as follows

minwdJnt(w):=DFn(w,wt1)+ηF^(w|wt1),\displaystyle\min_{w\in\mathbb{R}^{d}}J_{n}^{t}(w)\mathrel{\mathop{:}}=D_{F_{n}}(w,w^{t-1})+\eta\,\hat{F}(w|w^{t-1}), (3)

where DFnD_{F_{n}} denotes the Bregman divergence [21] of Fn()F_{n}(\cdot)

DFn(w,wt1)\displaystyle D_{F_{n}}(w,w^{t-1})
:=Fn(w)Fn(wt1)Fn(wt1),wwt1;\displaystyle\quad\mathrel{\mathop{:}}=F_{n}(w)-F_{n}(w^{t-1})-\langle{\nabla F_{n}(w^{t-1})},{w-w^{t-1}}\rangle;

F^(w|wt1)\hat{F}(w|w^{t-1}) denotes the first-order approximation of the global function F()F(\cdot) at wt1w^{t-1}

F^(w|wt1):=F(wt1)+F(wt1),wwt1;\displaystyle\hat{F}(w|w^{t-1})\mathrel{\mathop{:}}=F(w^{t-1})+\langle{\nabla F(w^{t-1})},{w-w^{t-1}}\rangle;

and η>0\eta>0 is the weight that balances between two objectives which is also our controlled learning parameter. The Bregman divergence is the generalized distance between the local model solution ww and the latest global model parameter wt1w^{t-1} (e.g., square Euclidean distance) that is widely applied in machine learning applications, statistics, and information geometry [21]. Thus, the local model at UEs can be constructed by minimizing the surrogate objective with the approximated loss function minimization such that its parameters is close to the latest global model parameter wt1w^{t-1}. Then the equivalent local learning problem is derived as follows

minwdJnt(w)=:Fn(w)+ηF(wt1)Fn(wt1),w.\displaystyle\min_{w\in\mathbb{R}^{d}}J_{n}^{t}(w)\mathrel{\mathop{=}}:F_{n}(w)+\bigl{\langle}\eta\nabla F(w^{t-1})-\nabla F_{n}(w^{t-1}),w\bigr{\rangle}. (4)

Since it is usually difficult to obtain the optimal solution in the learning problem (4), UEs is required find a (possibly weak) solution wntw_{n}^{t} instead. As an analogy from the definition for the relative accuracy in [26, 14] for the approximation, the local weight parameters at all UEs satisfy

Jn(wnt)θJn(wt1),n,\displaystyle\left\lVert\nabla J_{n}(w_{n}^{t})\right\rVert\leq\theta\,\left\lVert\nabla J_{n}(w^{t-1})\right\rVert,\forall n, (5)

where the relative local accuracy θ(0,1)\theta\in(0,1) is common to all UEs. This parameter also defines the quality of the approximation solutions when solving the local learning problem (3), in which θ=0\theta=0 the optimal solution is obtained, while θ=1\theta=1 we have no progress (i.e., wnt=wt1w_{n}^{t}=w^{t-1}). Since the objectives Jnt(w)J_{n}^{t}(w) and Fn()F_{n}(\cdot) have the same Hessian matrix, Jnt(w)J_{n}^{t}(w) is also β\beta-strongly convex and LL-smooth. Accordingly, the gradient descent (GD) method is reasonable to solve (4) and requires KlK_{l} number of local iterations to achieve the accuracy θ\theta-approximation of the solution.

wnk+1=wnkhkJnt(wnk),\displaystyle w^{k+1}_{n}=w^{k}_{n}-h_{k}\nabla J_{n}^{t}(w^{k}_{n}), (6)

where hkh_{k} is a learning rate. Note that each UE holds a small portion of samples, i.e, Dn<<DD_{n}<<D, n\forall n. In case of large DnD_{n}, mini-batch SGD can be used to alleviate the computation burden on UEs, but the convergence rate will be different. We assume that the generated convergent sequence (wnk)k0(w^{k}_{n})_{k\geq 0} for the local model satisfying a linear convergence rate [27] as follows

Jnt(wnk)Jnt(wn)c(1γ)k(Jnt(w0)Jnt(wn)),\displaystyle J_{n}^{t}(w^{k}_{n})-J_{n}^{t}(w^{*}_{n})\leq c(1-\gamma)^{k}\bigl{(}J_{n}^{t}(w_{0})-J_{n}^{t}(w^{*}_{n})\bigr{)}, (7)

where wnw^{*}_{n} is the optimal solution of the local problem (4), and cc and γ(0,1)\gamma\in(0,1) are constants depending on ρ\rho.

Lemma 1.

With Assumption 1 and the assumed linear convergence rate (7) with w0=wt1w_{0}=w^{t-1}, the number of local rounds for solving (3) to achieve a θ\theta-approximation condition (5) is

Kl=2γlogCθ,\displaystyle K_{l}=\frac{2}{\gamma}\log\frac{C}{\theta}, (8)

where C:=cρC\mathrel{\mathop{:}}=c\rho.

Global model updates at the server: Considering a synchronous federated learning scheme, the global model parameter is then updated by aggregating the local model parameter wntw_{n}^{t} from all UEs as follows

wt=n=1N|𝒟n||𝒟|wnt.\displaystyle w^{t}=\sum\nolimits_{n=1}^{N}\frac{|\mathcal{D}_{n}|}{|\mathcal{D}|}w_{n}^{t}. (9)

This updated global model is then broadcast along with F(wt)\nabla F(w^{t}) to all UEs (line 5) to all UEs. The convergence of the global problem (2) is achieved by satisfying

F(wt)F(w)ϵ,tKg,\displaystyle F(w^{t})-F(w^{*})\leq\epsilon,\;\forall t\geq K_{g}, (10)

where ϵ>0\epsilon>0 is an arbitrarily small constant, ww^{*} is the optimal solution of the problem (2). The algorithm enhances data privacy by exchanging learning information only rather than the local training data. The convergence analysis for FEDL is developed in [20].

Theorem 1.

With Assumption 1, the convergence of the FEDL algorithm is achieved with linear rate

F(wt)F(w)(1Θ)k(F(w(0))F(w)),\displaystyle F(w^{t})-F(w^{*})\leq(1-\Theta)^{k}(F(w^{(0)})-F(w^{*})), (11)

where Θ(0,1)\Theta\in(0,1) is defined as

Θ:=η(2(θ1)2(θ+1)θ(3η+2)ρ2(θ+1)ηρ2)2ρ((1+θ)2η2ρ2+1).\displaystyle\Theta\mathrel{\mathop{:}}=\frac{\eta(2(\theta-1)^{2}-(\theta+1)\theta(3\eta+2)\rho^{2}-(\theta+1)\eta\rho^{2})}{2\rho\bigl{(}(1+\theta)^{2}\eta^{2}\rho^{2}+1\bigr{)}}. (12)
Corollary 1.

The number of global rounds for FEDL to achieve the convergence satisfying (10) is

Kg=1ΘlogF(w0)F(w)ϵ,\displaystyle K_{g}=\frac{1}{\Theta}\log\frac{F(w^{0})-F(w^{*})}{\epsilon}, (13)

We show the detail proofs of Lemma 1, Theorem 1, and Corollary 1 in the work [20].

Learning Time Model: According to the convergence analysis of the federated learning algorithm, we obtain the convergence rate and the global rounds depend on the hyper-learning rate η\eta and the relative accuracy of the local learning problem θ\theta as in Corollary 1. Therefore, the total learning time can be defined in the general form as follows

TIME(η,θ)=Kg(Θ)×(c+𝒯(θ)),\text{TIME}(\eta,\theta)=K_{g}(\Theta)\times(c+\mathcal{T}(\theta)), (14)

where cc is the one round of communication time, 𝒯(θ)\mathcal{T}(\theta) is the required time to obtain the relative accuracy θ\theta of the local learning algorithm, and Kg(Θ)K_{g}(\Theta) is the required number of global rounds in the equation (12) and the Θ\Theta is defined in the equation (12). In a common setting of many federated learning frameworks [6, 18], the number of local iterations is often fixed for each UE, thus, the remaining control parameter is the hyper-learning rate η\eta that affects to the number of global rounds. Accordingly, we substitute the constants and get the simplified form as follows

Kg(Θ)=AΘ;\displaystyle K_{g}(\Theta)=\frac{A}{\Theta};
Θ=η(2(θ1)22(θ+1)θρ2ηρ2(θ+1)(3θ+1))2ρ((1+θ)2η2ρ2+1),\displaystyle\Theta=\frac{\eta\big{(}2(\theta-1)^{2}-2(\theta+1)\theta\rho^{2}-\eta\rho^{2}(\theta+1)(3\theta+1)\big{)}}{2\rho\bigl{(}(1+\theta)^{2}\eta^{2}\rho^{2}+1\bigr{)}},
=CηDη22ρ(Bη2+1),\displaystyle=\frac{C\eta-D\eta^{2}}{2\rho\bigl{(}B\eta^{2}+1\bigr{)}}, (15)

where

A:=\displaystyle A\mathrel{\mathop{:}}= logF(w0)F(w)ϵ>0,\displaystyle\log\frac{F(w^{0})-F(w^{*})}{\epsilon}>0,
B:=\displaystyle B\mathrel{\mathop{:}}= (1+θ)2ρ2,\displaystyle(1+\theta)^{2}\rho^{2},
C:=\displaystyle C\mathrel{\mathop{:}}= 2(θ1)22(θ+1)θρ2,\displaystyle 2(\theta-1)^{2}-2(\theta+1)\theta\rho^{2},
D:=\displaystyle D\mathrel{\mathop{:}}= ρ2(θ+1)(3θ+1).\displaystyle\rho^{2}(\theta+1)(3\theta+1).
TABLE I: The summary table of important notations
Var Definition
ss Index denoting FL service
nn Index denoting participating UE
Ds,nD_{s,n} The size of local dataset of the service ss at UE nn
csc_{s} The number of CPU cycles required to process 1 bit of data sample of the service ss
Es,ncmpE_{s,n}^{cmp} The energy consumption of service ss at UE nn to compute one local iteration
Es,ncomE_{s,n}^{com} The energy consumption of service ss at UE nn to transmit the local updates
TscmpT^{cmp}_{s} The local training time for one local iteration of service ss
TscomT^{com}_{s} The transimission time of service ss
BulB^{ul} The total uplink bandwidth
BdlB^{dl} The downlink bandwidth
vsv_{s} The size of local information updates of the service ss
Kl,sK_{l,s} The number of local iterations of service ss
Kg(Θs)K_{g}(\Theta_{s}) The number of global rounds using FEDL
𝒞s\mathcal{C}_{s} The total cost of the learning service ss
ηs\eta_{s} The hyper-learning rate ηs\eta_{s} using FEDL
fs,nf_{s,n} The allocated CPU-cycle frequency for service ss at UE nn
wnw_{n} The allocated fraction of the uplink bandwidth for UE nn

III-B Multi-Service Sharing Model

In this paper, we consider a multi-service federated learning scheme with one Federated Learning Orchestrator (FLO) at the MEC server and a set 𝒩\mathcal{N} of NN UEs as shown in Fig. 1. Each participating UE nn stores a local data set 𝒟s,n\mathcal{D}_{s,n} with size Ds,nD_{s,n} for each federated learning service ss. Then, we can define the total data size of a service ss by Ds=n=1NDs,nD_{s}=\sum_{n=1}^{N}D_{s,n}. The CPU resource of each UE is consumed to perform the local learning problem in (4) for each service ss by using the local data. Therefore, it is crucial to share the CPU resource of each UE amongst the local learning problems of multiple services efficiently. After the local training, all UEs upload their updated local model parameters to the MEC server by using the wireless medium. Hence, it is also important to efficiently share the communication resource (i.e., bandwidth) among the UEs. At the MEC server, FLO is deployed to manage computation (i.e., CPU) and communication resources sharing among learning services and UEs. In addition to resource allocation, FLO also controls the hyper-learning rate of learning services.

III-B1 Local Computation Model

We denote the required number of CPU cycles for each UE to execute one sample of data belong to service ss by csc_{s}, which can be measured offline [28]. The required CPU cycles are directly proportional to the number of samples in the local dataset. Since all samples {xi,yi}i𝒟s,n\{x_{i},y_{i}\}_{i\in\mathcal{D}_{s,n}} have the same size (i.e., number of bits), the number of CPU cycles required for UE nn to run one local iteration of learning service ss is csDs,nc_{s}D_{s,n}. The allocated CPU-cycle frequency for the service ss is denoted by fs,nf_{s,n}. Then the energy consumption of UE nn to compute one local iteration for learning service ss can be expressed as follows [29]

Es,ncmp=i=1csDs,nβn2fs,n2=βn2csDs,nfs,n2,\displaystyle E_{s,n}^{cmp}=\sum\nolimits_{i=1}^{c_{s}D_{s,n}}\frac{\beta_{n}}{2}f_{s,n}^{2}=\frac{\beta_{n}}{2}c_{s}D_{s,n}f_{s,n}^{2}, (16)

where βn/2\beta_{n}/2 is the effective capacitance coefficient of UE nn’s computing chipset. In addition, the computation time per local iteration of the UE nn for a service ss is csDs,nfs,n.\frac{c_{s}D_{s,n}}{f_{s,n}}. Using a synchronous federated learning scheme, local training time for one local iteration of the learning service ss is the same with the computation time of the slowest UE as

Tscmp=maxn𝒩csDs,nfs,n+τs,nm,T^{cmp}_{s}=\max_{n\in\mathcal{N}}\frac{c_{s}D_{s,n}}{f_{s,n}}+\tau^{m}_{s,n}, (17)

where τs,nm\tau^{m}_{s,n} is the extra overhead to access memory. We denote the vector of fs,nf_{s,n} by fsn.f_{s}\in\mathbb{R}^{n}.

III-B2 Communication Model

After processing the local learning problem, UEs need to exchange the local information to FLO on a shared wireless medium via a multi-access protocol (i.e., OFDMA). Therefore, the achievable transmission rate (bps) of UE nn on the given the allocated fraction wnw_{n} of the total uplink bandwidth BulB^{ul} is defined as follows:

rnul(wn)=wnBullog2(1+hnpnN0),\displaystyle r^{ul}_{n}(w_{n})=w_{n}B^{ul}\log_{2}\bigl{(}1+\frac{h_{n}p_{n}}{N_{0}}\bigr{)}, (18)

where N0N_{0} is the background noise, pnp_{n} is the transmission power, and hnh_{n} is the channel gain of the UE nn. Since the dimensions of local models and global gradient updates in line 4 of the Alg. 1 are fixed for all UEs, the data size (in bits) of local information updates does not change and is denoted by vsv_{s} for each learning service ss. Thus, uplink transmission time of each UE nn for a service ss is

τs,nul(wn)=vsrnul(wn).\displaystyle\tau^{ul}_{s,n}(w_{n})=\frac{v_{s}}{r^{ul}_{n}(w_{n})}. (19)

In addition, the downlink broadcast delay should be taken into account to transmit the global changes to all users using the downlink bandwidth BdlB^{dl} as follows

rndl=Bdllog2(1+hnPN0).\displaystyle r^{dl}_{n}=B^{dl}\log_{2}\Bigl{(}1+\frac{h_{n}P}{N_{0}}\Bigr{)}. (20)

Since the global information and the local information has the same size vsv_{s}, then, the downlink transmission time for the updated global information is τs,ndl=vsrndl.\tau^{dl}_{s,n}=\frac{v_{s}}{r^{dl}_{n}}. Thus, the downlink of the service ss is τsdl=maxn𝒩τs,ndl\tau^{dl}_{s}=\max_{n\in\mathcal{N}}\tau^{dl}_{s,n}.

Accordingly, the communication time of a learning service ss consisting of the uplink transmission time and downlink transmission time is defined as

Tscom=maxn𝒩τs,nul(wn)+τsdl+τs,nex,T^{com}_{s}=\max_{n\in\mathcal{N}}\tau^{ul}_{s,n}(w_{n})+\tau^{dl}_{s}+\tau^{ex}_{s,n}, (21)

where τs,nex\tau^{ex}_{s,n} is the extra communications overhead during the transmission (e.g., establishing TCP connection) and assumed to be random constants for each FL service communication.

Furthermore, the energy consumption for the uplink communication of UE nn for the service ss is defined as

Es,ncom=pnτs,nul(wn).E_{s,n}^{com}=p_{n}\tau^{ul}_{s,n}(w_{n}).

III-B3 Global Round Model

As aforementioned in subsection II.A, we have the number of global rounds and the number of local iterations are Kg(Θ)K_{g}(\Theta) and Kl,sK_{l,s}, respectively. Then, the running time of one global round of the learning service ss includes local learning time and transmission time which is defined as follows

Tsgl(Tscmp,Tscom):=Tscom+Tsavg+Kl,sTscmp,\displaystyle T_{s}^{gl}(T^{cmp}_{s},T^{com}_{s})\mathrel{\mathop{:}}=T^{com}_{s}+T^{avg}_{s}+K_{l,s}T^{cmp}_{s}, (22)

where TsavgT^{avg}_{s} is the computation time of the averaging operation. Since the simple averaging operation can be performed very quickly with strong computation capabilities of the edge server and the size of local updates information (i.e., local learning model, global gradient updates) are the same for every global round, TsavgT^{avg}_{s} is assumed to be small constant for each service ss.

Furthermore, in one global round of each service ss, the total energy consumption of all UEs for learning and uplink transmission is expressed as follows

Esgl(fs,w):=n=1NEs,ncom+Kl,sEs,ncmp.\displaystyle E_{s}^{gl}(f_{s},w)\mathrel{\mathop{:}}=\sum\nolimits_{n=1}^{N}E_{s,n}^{com}+K_{l,s}E_{s,n}^{cmp}. (23)

Finally, the total cost of a learning service ss is defined as

𝒞s:=Kg(Θs)(Esgl(fs,w)+κsTsgl(Tscmp,Tscom)),\displaystyle\mathcal{C}_{s}\mathrel{\mathop{:}}=K_{g}(\Theta_{s})\Big{(}E_{s}^{gl}(f_{s},w)+\kappa_{s}T_{s}^{gl}(T^{cmp}_{s},T^{com}_{s})\Big{)},

where κs\kappa_{s} is the trade-off between running time and the energy consumption of UEs that needs to be minimized.

III-C Problem formulation

Since the learning services jointly occupy the shared CPU resource in each UE, and the shared uplink bandwidth resource to upload the weight parameters of local models. Thus, FLO takes a role to manage these shared resources and controls the hyper-learning rate of learning services. To minimize the running time cost and the energy consumption of UEs, we propose the multi-service Federated Learning optimization problem for FLO, MS-FEDL, as follows

min.f,w,η\displaystyle\underset{{f,w,\eta}}{\text{min.}} s𝒮Kg(Θs)(Esgl(fs,w)+κsTsgl(Tscmp,Tscom))\displaystyle\sum_{s\in\mathcal{S}}K_{g}(\Theta_{s})\Big{(}E_{s}^{gl}(f_{s},w)+\kappa_{s}T_{s}^{gl}(T^{cmp}_{s},T^{com}_{s})\Big{)}
s.t. s𝒮fs,n=fntot,n𝒩,(Shared CPU)\displaystyle\sum_{s\in\cal{S}}f_{s,n}=f_{n}^{tot},\forall n\in\mathcal{N},(\textbf{{Shared CPU}}) (24)
n𝒩wn=1,(Shared Bandwidth)\displaystyle\sum_{n\in\mathcal{N}}w_{n}=1,\,(\textbf{{Shared Bandwidth}}) (25)
fs,nfs,min,s𝒮,n𝒩,\displaystyle f_{s,n}\geq f_{s,min},\forall s\in\mathcal{S},\forall n\in\mathcal{N}, (26)
wnwmin,n𝒩,\displaystyle w_{n}\geq w_{min},\forall n\in\mathcal{N}, (27)
0<Θs<1;ηs>0,s𝒮,(Learning parameters)\displaystyle 0<\Theta_{s}<1;\eta_{s}>0,\forall s\in\mathcal{S},(\textbf{{Learning parameters}}) (28)
TscmpcsDs,nfs,n+τs,nm,s𝒮,n𝒩,\displaystyle T^{cmp}_{s}\geq\frac{c_{s}D_{s,n}}{f_{s,n}}+\tau^{m}_{s,n},\forall s\in\mathcal{S},\forall n\in\mathcal{N}, (29)
Tscomτs,nul(wn)+τsdl+τs,nex,s𝒮,n𝒩,\displaystyle T^{com}_{s}\geq\tau^{ul}_{s,n}(w_{n})+\tau^{dl}_{s}+\tau^{ex}_{s,n},\forall s\in\mathcal{S},\forall n\in\mathcal{N}, (30)

where fntotf_{n}^{tot} is the total CPU frequency of UE nn. The main decision variables include the allocated CPU frequency (i.e., f:={fs,n}{f}:=\{f_{s,n}\}) for each service ss at UE nn, the allocated fraction (i.e., w:={wn}{w}:=\{w_{n}\} ) of total uplink bandwidth for each UE nn, and the relative accuracy (i.e., θ:={θs}{\theta}:=\{\theta_{s}\}) of the local learning problem at UEs. According to the constraints in (26), (27), all of the learning services and UEs are required to be allocated at least the minimum amount of CPU frequency and bandwidth to train and upload the learning parameters of local model. Besides, the auxiliary variables TscmpT^{cmp}_{s} is the computational time of one local iteration, and TscomT^{com}_{s} is the uplink transmission time which depends on ff and ww in the constraint (29), (30), respectively. The constraint (24) indicates the shared CPU resource among learning services and the constraint (25) defines the shared uplink bandwidth for each UE. Lastly, the constraint (25) provides the feasible ranges of the learning parameters in the FEDL algorithm for each learning service.

IV Solutions to MS-FEDL

IV-A Centralized Approach

Even though the MS-FEDL problem is non-convex, we later show the specific form of this problem is bi-convex [30]. The convexity proofs of subproblems are shown in the appendices. For solving this type of problem, we adopt the popular technique, Block Coordinate Descent (BCD) algorithm [30]. The block coordinate descent method cyclically solves the optimization problem for each block of variables while fixing the remaining blocks at their last updated values by following the Gauss-Seidel update scheme. In the MS-FEDL problem, there are two blocks of variables such as block of θ\theta and block of f,wf,w. The detail of the centralized algorithm is shown in Algorithm 2.

Given the fixed values f^,w^\hat{f},\hat{w} for the resource allocation variable block f,wf,w and the corresponding computation, communication time Tscmp^,Tscom^\hat{T^{cmp}_{s}},\hat{T^{com}_{s}}, the total cost for each learning service ss (i.e., 𝒞^s:=Esgl(f^s,w^)+κsTsgl(Tscmp^,Tscom^)\hat{\cal{C}}_{s}\mathrel{\mathop{:}}=E_{s}^{gl}(\hat{f}_{s},\hat{w})+\kappa_{s}T_{s}^{gl}(\hat{T^{cmp}_{s}},\hat{T^{com}_{s}})), we have the learning parameter decision problem as follows

SUB1-c ​: Learning Parameter Decision Problem

min.𝜂\displaystyle\underset{{\eta}}{\text{min.}} s𝒮Kg(Θs)𝒞^s\displaystyle\sum_{s\in\mathcal{S}}K_{g}(\Theta_{s})\hat{\cal{C}}_{s}
s.t. 0<Θs<1,s𝒮,\displaystyle 0<\Theta_{s}<1,\forall s\in\mathcal{S},
ηs>0,s𝒮,\displaystyle\eta_{s}>0,\forall s\in\mathcal{S},

where η:={ηs}{\eta}:=\{\eta_{s}\}. In Kg(Θs)K_{g}(\Theta_{s}), Θs\Theta_{s} are the functions of the hyper-learning rate ηs\eta_{s} and it is defined in the equation (15). However, the hyper-learning rate decision subproblem SUB1-c can be decentralized for each learning service ss without any coupling among services in the constraints. Thus, each service ss can make the decision independently by solving the following decentralized subproblem.

SUB1-d ​: Decentralized Learning Parameter Decision

min.ηs\displaystyle\underset{{\eta_{s}}}{\text{min.}} Kg(Θs)𝒞^s\displaystyle K_{g}(\Theta_{s})\hat{\cal{C}}_{s}
s.t. 0<Θs<1,\displaystyle 0<\Theta_{s}<1, (31)
ηs>0.\displaystyle\eta_{s}>0. (32)
Lemma 2.

There exists a unique solution θ\theta^{*} of the convex problem SUB1-d   satisfying the following equation:

ηs=D+D2+BC2BC,\eta_{s}^{*}=\frac{-D+\sqrt{D^{2}+BC^{2}}}{BC},

where the relative accuracy of the local learning problem sufficient small and closed to 0.

Algorithm 2 Centralized algorithm for MS-FEDL
1:FLO updates the information of learning service requirement, UE resources;
2:Compute η\eta^{*} from SUB1-c problem using Lemma 2;
3:Compute ff^{*} from SUB2-c problem given η\eta^{*};
4:Compute ww^{*} from SUB3-c problem given η\eta^{*};
Refer to caption
Figure 2: SUB1-d convexity for three learning services using the similar settings in the next section.

Accordingly, we provide the proof for Lemma 2 in the appendix section. In addition, Fig. 2 illustrates the convexity of the SUB1-d subproblem for three learning services. The optimal solutions are obtained by using Lemma 2 and marked as the circles which are also the lowest values of the objective curves. As an observation, both high and low values of the hyper-learning rate η\eta cause a higher number of global rounds KgK_{g} and higher total cost for each learning service.

According to Lemma 2, the optimal hyper-learning rate solutions do not depend on the total cost for each learning service 𝒞^s\hat{\cal{C}}_{s}. Thus, the optimal solution of this problem is independent to the other decisions. Then, given the optimal learning parameter η\eta^{*} and the corresponding Θs\Theta^{*}_{s}, the problem can be decomposed into two independent sub-problems for CPU frequency allocation and bandwidth allocation as follows

SUB2-c ​: CPU Allocation Problem

min.f,Tcmp\displaystyle\underset{{f,T^{cmp}}}{\text{min.}} s𝒮Kl,sKg(Θs)(n𝒩βn2csDs,nfs,n2+κsTscmp)\displaystyle\sum_{s\in\mathcal{S}}K_{l,s}K_{g}(\Theta^{*}_{s})\Big{(}\sum_{n\in\mathcal{N}}\frac{\beta_{n}}{2}c_{s}D_{s,n}f_{s,n}^{2}+\kappa_{s}T^{cmp}_{s}\Big{)}
s.t. TscmpcsDs,nfs,n+τs,nm,s𝒮,n𝒩,\displaystyle T^{cmp}_{s}\geq\frac{c_{s}D_{s,n}}{f_{s,n}}+\tau^{m}_{s,n},\forall s\in\mathcal{S},\forall n\in\mathcal{N},
s𝒮fs,n=fntot,n𝒩,(Shared CPU)\displaystyle\sum_{s\in\cal{S}}f_{s,n}=f_{n}^{tot},\forall n\in\mathcal{N},(\textbf{{Shared CPU}})
fs,nfs,min,s𝒮,n𝒩,\displaystyle f_{s,n}\geq f_{s,min},\forall s\in\mathcal{S},\forall n\in\mathcal{N},

where Tcmp:={Tscmp}{T^{cmp}}:=\{T^{cmp}_{s}\}.
This problem decides a number of CPU frequency for each learning service at UEs.

SUB3-c ​: Bandwidth Allocation Problem

min.w,Tcom\displaystyle\underset{{w,T^{com}}}{\text{min.}} s𝒮Kg(Θs)(n𝒩pnτs,nul(wn)+κsTscom)\displaystyle\sum_{s\in\mathcal{S}}K_{g}(\Theta^{*}_{s})\Big{(}\sum_{n\in\mathcal{N}}p_{n}\tau^{ul}_{s,n}(w_{n})+\kappa_{s}T^{com}_{s}\Big{)}
s.t. Tscomτs,nul(wn)+τsdl+τs,nex,s𝒮,n𝒩,\displaystyle T^{com}_{s}\geq\tau^{ul}_{s,n}(w_{n})+\tau^{dl}_{s}+\tau^{ex}_{s,n},\forall s\in\mathcal{S},\forall n\in\mathcal{N},
n𝒩wn=1,(Shared Bandwidth)\displaystyle\sum_{n\in\mathcal{N}}w_{n}=1,\,(\textbf{{Shared Bandwidth}})
wnwmin,n𝒩,\displaystyle w_{n}\geq w_{min},\forall n\in\mathcal{N},

where Tcom:={Tscom}{T^{com}}:=\{T^{com}_{s}\}.

Refer to caption
Figure 3: Centralized Deployment.
Refer to caption
Figure 4: Decentralized Deployment.

Using block coordinate descent, FLO solves alternatively three convex subproblems of the hyper-learning rate decision, CPU allocation, bandwidth allocation. Discussion as mentioned above, the optimal hyper-learning rate decision can be obtained independently, then only one iteration is required in block coordinate descent style algorithm. These convex problems that can be easily solved by using a off-the-shelf convex solver (i.e., IpOpt solver [31]). After solving these problems, the resource allocation solutions and the decision of the hyper-learning rate are sent to the UEs to start the learning process. The whole operation process is illustrated in Fig. 3.

IV-B Decentralized Approach

In addition to the centralized algorithm, we develop a decentralized algorithm, which leverages the parallelism structure for subproblems update of Jacobi-Proximal ADMM[23] into the multi-convex ADMM framework[24], namely JP-miADMM. Since the original form of multi-convex ADMM using the conventional Gauss-Seidel scheme does not allow solving the CPU allocation subproblem independently, the integrated Jacobi-Proximal ADMM form provides the parallelism structure for this subproblem. The JP-miADMM algorithm consists of two procedures, such as primal update which can be independently solved by each service ss and dual update takes the role of a coordinator from the solutions of learning services. Note that in the following primal subproblems of JP-miADMM, the objectives comprise the addictive norm-2 terms which are the augmented term that originally is introduced in ADMM and proximal term in Jacobi-Proximal ADMM. These two updates are performed iteratively until the convergence condition is obtained. In this algorithm, we introduce the dual variables v,yv,y, and the auxiliary variable zz that are used in the next subproblems. The detail of the decentralized algorithm is shown in Algorithm 3 by alternatively updating the primal variables, primal residual and dual variables until the convergence conditions in line 14 are obtained. Specifically, the first condition is the condition of CPU allocation based on the Frobenius norm of allocation matrix ff while the second one is based on the vector norm of bandwidth allocation solution ww.

Since the optimal hyper-learning rate decision (i.e., η\eta^{*}) is obtained independently according to the closed-form in Lemma 2 for each learning service, the JP-miADMM algorithm consists of an iterative process on the shared CPU and bandwidth allocation as follows

IV-B1 Primal Update

In the primal update, each service ss solves iteratively its CPU allocation, bandwidth allocation, and hyper-learning rate decision subproblems.

SUB2-d ​: Decentralized CPU Allocation (finding fs(k+1)f_{s}^{(k+1)})

min.Tscmp,fs\displaystyle\underset{T^{cmp}_{s},f_{s}}{\text{min.}} Kl,sKg(Θs)(n𝒩βn2csDs,nfs,n2+κTscmp)+y(k)Tfs\displaystyle K_{l,s}K_{g}(\Theta^{*}_{s})\Big{(}\sum_{n\in\mathcal{N}}\frac{\beta_{n}}{2}c_{s}D_{s,n}f_{s,n}^{2}+\kappa T^{cmp}_{s}\Big{)}+{y}^{(k)T}f_{s}
+ρ12js,j𝒮fj(k)+fsftot2+ν2fsfs(k)2\displaystyle+\frac{\rho_{1}}{2}\big{\|}\sum_{j\neq s,j\in\mathcal{S}}f_{j}^{(k)}+f_{s}-f^{tot}\big{\|}^{2}+\frac{\nu}{2}\|f_{s}-{f}_{s}^{(k)}\|^{2}
s.t. TscmpcsDs,nfs,n+τs,nm,n𝒩,\displaystyle T^{cmp}_{s}\geq\frac{c_{s}D_{s,n}}{f_{s,n}}+\tau^{m}_{s,n},\forall n\in\mathcal{N},
fs,nfs,min,n𝒩,\displaystyle f_{s,n}\geq f_{s,min},\forall n\in\mathcal{N},

where Θs\Theta^{*}_{s} is the function of η\eta^{*} and the decision variable fsf_{s} is the vector of {fs,n}\{f_{s,n}\}. Under the mild conditions, i.e., the splittable objective functions are closed proper convex and the existence of a saddle point satisfying KKT condition, the sufficient condition of JP-ADMM for the global convergence to the saddle point according to Theorem 2.1 in [23] can be guaranteed by choosing parameters such that

ν>ρ1(|𝒮|2α1), and 0<α<2,\nu>\rho_{1}\Big{(}\frac{|\mathcal{S}|}{2-\alpha}-1\Big{)},\text{ and }0<\alpha<2,
Algorithm 3 Decentralized algorithm for MS-FEDL
1:FLO updates the information of learning service requirement, UE resources;
2:Each learning service ss computes ηs\eta^{*}_{s} from Lemma 2;
3:Initialize k=1,f(1),w(1)k=1,f^{(1)},w^{(1)};
4:repeat
5:     Primal update:
6:     for learning service s𝒮s\in\mathcal{S} do
7:         Compute fs(k+1)f_{s}^{(k+1)} from SUB2-d problem given ηs,fs(k)\eta^{*}_{s},f_{s}^{(k)};
8:         Compute ws(k+1)w_{s}^{(k+1)} from SUB3-d problem given ηs,z(k)\eta^{*}_{s},z^{(k)};      
9:     Dual update:
10:     Update the global consensus bandwidth allocation variable z(k+1)z^{(k+1)} in the equation (34);
11:     Update primal residual in the equation (35), (36);
12:     Update dual variable in the equation (37), (38);
13:     k=k+1k=k+1;
14:until f(k+1)f(k)Fϵ1\|f^{(k+1)}-f^{(k)}\|_{F}\leq\epsilon_{1}, z(k+1)z(k)ϵ2\|z^{(k+1)}-z^{(k)}\|\leq\epsilon_{2}.

In the bandwidth allocation subproblem, there is a coupling among services due to all services have the same allocated bandwidth allocation wnw_{n} at UE nn. Thus, we transform the SUB3-c subproblem into the following equivalent problem by introducing the auxiliary consensus variable (i.e., z:={zn}{z}:=\{z_{n}\}) and the consensus constraint (33). This transformation is commonly used to handle global consensus variables in ADMM framework [22].

min.Tscom,w,z\displaystyle\underset{T^{com}_{s},{w,z}}{\text{min.}} s𝒮Kg(Θs)(nθpnτs,nul(ws,n)+κTscom)\displaystyle\sum_{s\in\mathcal{S}}K_{g}(\Theta^{*}_{s})\Big{(}\sum_{n\theta}p_{n}\tau^{ul}_{s,n}(w_{s,n})+\kappa T^{com}_{s}\Big{)}
s.t. Tscomτs,nul(ws,n)+τsdl+τs,nex,s𝒮,n𝒩,\displaystyle T^{com}_{s}\geq\tau^{ul}_{s,n}(w_{s,n})+\tau^{dl}_{s}+\tau^{ex}_{s,n},\forall s\in\mathcal{S},\forall n\in\mathcal{N},
n𝒩ws,n=1,s𝒮,(Shared Bandwidth)\displaystyle\sum_{n\in\mathcal{N}}w_{s,n}=1,\,\forall s\in\mathcal{S},(\textbf{{Shared Bandwidth}})
ws,nwmin,s𝒮,n𝒩,\displaystyle w_{s,n}\geq w_{min},\forall s\in\mathcal{S},\forall n\in\mathcal{N},
ws,n=zn,s𝒮,n𝒩.\displaystyle w_{s,n}=z_{n},\forall s\in\mathcal{S},\forall n\in\mathcal{N}. (33)

Accordingly, each service ss decides the allocated bandwidth ws(k+1)w_{s}^{(k+1)} by solving individually its subproblem as follows

SUB3-d ​: Decentralized Bandwidth Allocation

min.Tscom,ws\displaystyle\underset{T^{com}_{s},{w_{s}}}{\text{min.}} Kg(Θs)(n𝒩pnτs,nul(ws,n)+κTscom)\displaystyle K_{g}(\Theta^{*}_{s})\Big{(}\sum_{n\in\mathcal{N}}p_{n}\tau^{ul}_{s,n}(w_{s,n})+\kappa T^{com}_{s}\Big{)}
+ν(k)T(wsz(k))+ρ22wsz(k)22\displaystyle+\nu^{(k)T}({w_{s}}-{z}^{(k)})+\frac{\rho_{2}}{2}\big{\|}{w_{s}}-{z}^{(k)}\big{\|}_{2}^{2}
s.t. Tscomτs,nul(ws,n)+τsdl+τs,nex,n𝒩,\displaystyle T^{com}_{s}\geq\tau^{ul}_{s,n}(w_{s,n})+\tau^{dl}_{s}+\tau^{ex}_{s,n},\forall n\in\mathcal{N},
n𝒩ws,n=1,(Shared Bandwidth)\displaystyle\sum_{n\in\mathcal{N}}w_{s,n}=1,\,(\textbf{{Shared Bandwidth}})
ws,nwmin,n𝒩,\displaystyle w_{s,n}\geq w_{min},\forall n\in\mathcal{N},

where ws:={ws,n}{w_{s}}:=\{w_{s,n}\}.

The optimal solution of these convex problems can be easily obtained by using a convex solver (i.e., IpOpt solver [31]).

IV-B2 Dual Update

After independently updating the resource allocation, hyper-learning rate decision for each learning service, the dual update is performed to coordinate these solutions and update the dual variables for the next iteration. We first update the global consensus variable znz_{n} of the allocated bandwidth for each UE as follows

zn(k+1)=1|𝒮|s𝒮(ws,n(k+1)+(1/ρ2)νs(k+1)),n𝒩.\displaystyle z_{n}^{(k+1)}=\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\big{(}w_{s,n}^{(k+1)}+(1/\rho_{2})\nu_{s}^{(k+1)}\big{)},\forall n\in\mathcal{N}. (34)

Update primal residual:

r1(k+1)\displaystyle r_{1}^{(k+1)} =s𝒮fs(k+1)ftot,\displaystyle=\sum_{s\in\mathcal{S}}f_{s}^{(k+1)}-f^{tot}, (35)
r2,s(k+1)\displaystyle r_{2,s}^{(k+1)} =ws(k+1)z(k+1),\displaystyle=w_{s}^{(k+1)}-z^{(k+1)}, (36)

where r1,r2r_{1},r_{2} is the vector of NN devices.

Update dual variable:

y(k+1)\displaystyle y^{(k+1)} =y(k)+ρ1r1(k+1),\displaystyle=y^{(k)}+\rho_{1}r_{1}^{(k+1)}, (37)
νs(k+1)\displaystyle\nu_{s}^{(k+1)} =νs(k)+ρ2r2,s(k+1),\displaystyle=\nu_{s}^{(k)}+\rho_{2}r_{2,s}^{(k+1)}, (38)

where y:={yn},νs:={νs,n}y:=\{y_{n}\},\nu_{s}:=\{\nu_{s,n}\}.

Using JP-miADMM, FLO needs Management Aggregator and a particular module for each learning service. First, each FL learning service module performs CPU allocation, bandwidth allocation, and hyper-learning rate decision then sends the CPU and bandwidth allocation decision to Management Aggregator and then running the aggregation process for variable zz, primal residual, and dual variables. This process iteratively performs until the convergence condition is achieved. Then, the resource allocation solutions and the decision of the hyper-learning rate are sent to the UEs that participate in the learning process. The whole process of the algorithm deployment is illustrated in Fig. 4. In the MEC server, each service can run its own virtual instance to manage the resource allocation and learning aggregator. Although the decentralized approach requires many iterations to convergence, it provides a more flexible and scalable approach for the resource allocation without revealing the learning service information (i.e., dataset information, the learning weight parameters exchange between UEs and the MEC server, the number of CPU cycles for each UE to execute one sample of data). The decisions can be made by each service and shared with FLO only.

Refer to caption
Figure 5: Generated distance (in m) and training size (in MB) at each UE.
Refer to caption
(a) Objective.
Refer to caption
(b) Primal residual r1r_{1}.
Refer to caption
(c) Primal residual r2r_{2}.
Refer to caption
(d) CPU allocation at UE 11.
Refer to caption
(e) Hyper-learning rate for each service.
Refer to caption
(f) Bandwidth allocation for each UE.
Figure 6: Convergence experiment.

Note that the chosen parameters ρ1\rho_{1}, ρ2\rho_{2} in the augmented and proximal terms could affect the convergence performance of the decentralized approach. Furthermore, for a particular global convex problem with additional running conditions, JP-ADMM obtains o(1/k)o(1/k) convergence rate according to Theorem 2.2 in [23], where kk denotes the number of iterations. Specifically, xkxk+1Mx2=o(1/k)\|x^{k}-x^{k+1}\|^{2}_{M_{x}}=o(1/k) where xkx^{k} is the primal solution at the iteration kk and MxM_{x} is defined in [23]. Accordingly, the gaps between updated primal variables become gradually smaller throughout the iterative updates and the solutions converge toward the optimal ones. Conventionally, for a global convex problem, JP-ADMM converges faster than the dual decomposition method [23] but still requires a higher number of iterations compared to Gauss-Seidel ADMM as shown in the simulation results of [32]. For the multi-convex case, miADMM can guarantee the global convergence to the Nash point (i.e., stationary point) with the convergence rate o(1/k)o(1/k) [24]. In the next section, we provide numerical results for the convergence performance and the efficiency of the proposed algorithms.

V Performance Evaluation

V-A Numerical Settings

In this work, we assume that three learning services are deployed at the edge networks. Moreover, 50 heterogeneous UEs are positioned within the coverage area of the base station to participate in the federated learning system. Similar to our prior works [20, 10], we consider that the channel gain between the base station and the UE follows the exponential distribution with the mean g0(d0/d)4g_{0}(d_{0}/d)^{4} where g0=40dBg_{0}=-40\,dB, the reference distance d0=1md_{0}=1\,m between BS and UEs. Here, the actual distance dd between the UEs and the base station is uniformly generated between [2,50]m[2,50]\,m. Furthermore, the uplink system bandwidth B=20MHzB=20\,MHz is shared amongst UEs, the Noise power spectral is 1010W10^{-10}\,W, and the transmit power of UEs and BS are 10W10\,W and 40W40\,W, respectively. In this work, we assume that the size of the uploaded local model and downloaded global model is the same and it is set to vs{100,200,300}KBv_{s}\in\{100,200,300\}\,KB for each service.

For the local training model at UEs, we first set the training data size of UEs in each learning service following a uniform distribution in 1020MB10-20\,MB. The maximum computation capacity (i.e., CPU frequency) at each UE is uniformly distributed between [1,2]GHz[1,2]\,GHz. The required CPU cycles csc_{s} to train one bit of data at the UE for each learning service is {50,70,90}cycles\{50,70,90\}\,cycles. Furthermore, the minimum required CPU frequency for each service is fsmin=0.1GHzf^{min}_{s}=0.1\,GHz. We consider that the effective capacitance coefficient is the same for all UEs as βn=21028\beta_{n}=2\cdot 10^{-28} and the trade-off parameter κs\kappa_{s} is set to 0.20.2. For the federated learning parameters, we set L=1,β=0.5,γ=1,L=1,\,\beta=0.5,\gamma=1, and c=1c=1. Then, the relative accuracy of the local problem at UEs for each service is θ{0.07,0.06,0.05}\theta\in\{0.07,0.06,0.05\}. This setting reflects the CPU frequency requirement and model size as above and defines the required number of local iterations for each learning service correspondingly. Finally, for the algorithm setting, we set the convergence thresholds in the algorithms as ϵ1\epsilon_{1} and ϵ2\epsilon_{2} are 10410^{-4} and 10510^{-5}, respectively. Then, the values of the parameters ρ1\rho_{1}, ρ2\rho_{2}, and ν\nu are 10001000, 1010, and 15001500 respectively.

Refer to caption
(a) Time Comparison.
Refer to caption
(b) Energy Comparison.
Refer to caption
(c) Service Cost Comparison.
Figure 7: Cost Comparison.
Refer to caption
(a) Energy consumption of services by increasing κ3\kappa_{3}.
Refer to caption
(b) Total time of services by increasing κ3\kappa_{3}.
Refer to caption
(c) Trade-offs in objectives by reducing κs\kappa_{s}.
Figure 8: The trade-offs in energy consumption and total running time by varying κ\kappa.

V-B Numerical Results

We first illustrate a realization for the random location and local dataset size of UEs as shown in Fig. 5. For this realization, we demonstrate the convergence of the total cost, primal residual, CPU allocation, bandwidth from two solution approaches in Fig. 6. Accordingly, the centralized approach solely requires one iteration to get the optimal solution and the decentralized solution approach performs an iterative update process with many iterations to achieve that convergence condition such as the changes of solutions below small thresholds. Starting from the same initial points, the centralized approach is quickly converged within one iteration while the decentralized approach requires 9595 iterations to achieve the same solution in Fig. 6(a). Even though, the decentralized algorithm needs only 35 iterations to get almost similar cost compared to the optimal one, however, the allocated CPU frequency in Fig. 6(d) needs more iterations to obtain the same optimal solution from the solver or centralized algorithm. Different from a slow convergence of CPU frequency, the bandwidth solutions are quickly converged after 3 iterations and so the primal residual r2r_{2} in the equation (36). In practical usage, we can stop when the primal residual starts converging to zero after 5555 iterations as illustrated in Fig. 6(b), and 6(c). As a result, these solutions still guarantees to be a feasible solution and obtain such a very similar to optimal cost. In light of this observation, we later test the convergence performance of the integrated early stopping strategy in the decentralized algorithm.

Now, we will discuss the characteristic of the optimal solution in Fig. 6 as follows. As an example, by the reason of Service 11 having the smallest local model parameter size to update, it has the highest hyper-learning rate and correspondingly takes more global rounds, less number of local iterations. Thereby, in order to proceed with the local learning and perform more global rounds quickly, Service 1 occupies most of the CPU frequency of UEs. Unlike Service 1, Service 3 has the lowest hyper-learning rate, takes the least CPU frequency, and performs less global rounds. As the learning scheme in this work follows a synchronous federated learning, all UEs have to complete the local update and send the local weight parameters to the server before updating model at the MEC server. Thus, the users who are far from the base station and have more training data to process will require longer time to train, then upload the local model. Accordingly, these devices receive the larger fraction of the uplink system bandwidth to upload their local parameters to the server. As shown in Fig. 5, UE48 is one of the furthest UEs from BS and has a larger amount of local data. Therefore, the largest fraction of the uplink system bandwidth is allocated to UE48 as shown in Fig. 6(f).

Moreover, we compare the optimal learning time, energy consumption, and total cost with two heuristic approaches as shown in Fig. 7. In the first heuristic approach, the CPU frequency of the UEs is equally allocated to the learning services. Moreover, the uplink system bandwidth is equally allocated to the UEs as well to upload the local learning weight parameter to the server. Then, FLO decides solely the optimal hyper-learning rate for each learning service by solving the SUB1-c problem. The second heuristic approach is adopted to proportionally allocate the local CPU based on the local data size (i.e., Ds,nD_{s,n}) of each service at UEs and allocate bandwidth based on the transmission capacity of each UE. From the figures, we observe that the cost including the learning time and energy consumption of UEs for all services is reduced more than 18%18\% and 16%16\% than that of the Heuristic 11 and Heuristic 22 strategies. Among three learning services, Service 1 has the lowest CPU requirement and the smallest size of local model parameters. Therefore, it needs the lowest learning time and energy consumption as well. However, the minimum local performance at UEs of Service 1 compels more global rounds and the lowest values of the hyper-learning rate η\eta than that of Service 2, and 3.

Furthermore, we vary the trade-off parameter to study the effects of the conflicting goals of minimizing the time cost and energy cost of each FL service. As increasing the trade-off parameter (i.e., κs\kappa_{s}), the MS-FEDL gives a higher priority to minimize the running time than that of the energy consumption of UEs. Consequently, Fig. 8(c) shows the decreasing curve of running time. Meanwhile, the services require more energy consumption from UEs. Moreover, the priority of services in terms of running time can be controlled by varying the trade-off parameter κs\kappa_{s} of the service ss. Accordingly, the service which has a higher priority can be set with a higher value of κs\kappa_{s} to reduce the learning time in solving the MS-FEDL problem. In Fig. 8, we only increase the trade-off parameter κ3\kappa_{3} of Service 3 to demonstrate the higher priority of Service 3 than that of the other services. As a result, the total running time of Service 3 can be decreased while the total time of Service 1 and 2 are increased. On the other hand, the energy consumption of UEs to serve Service 3 is increased while the energy consumption of UEs to serve other services is slightly decreased.

To boost the convergence speed, we can apply the early stopping strategy for the decentralized algorithm, namely the Decentralized-ES algorithm, by using the convergence condition on the primal residuals instead of the primal variables as we discuss above. In Fig. 9, we run 100100 realizations for the random location and local dataset size of UEs while keeping the other settings to validate the convergence speed of the decentralized approach using the Jacobi-Proximal scheme for multi-convex ADMM, the early stopping version of the decentralized algorithm and the original multi-convex ADMM algorithm using Gauss-Seidel scheme. Accordingly, we observe that the median values of the required iterations of the decentralized, decentralized-ES and miADMM algorithms for convergence are 9696 iterations, 6767 iterations, and 7070 iterations, respectively. The decentralized algorithm requires higher number of iterations on average than the original miAdMM algorithm. However, the early stopping strategy helps to speed up the convergence rate and obtain nearly the optimal cost. Statistically, there is only a 0.05%0.05\% difference with the optimal cost. Note that, the original miADMM does not fully support the parallel operation in the primal problem and requires cyclic learning service operating based on Gauss-Seidel update scheme. Besides, even though the proposed decentralized algorithms require a higher number of iterations to convergence, they provide privacy preserved and flexible approaches to independently control the learning process and resource allocation for each learning service. Note that, if the stringent time-constrained is required, the centralized algorithm is applicable because it can provide the solution within two iterations.

Refer to caption
Figure 9: Convergence performance of the decentralized algorithms in 100 realizations.

VI Conclusion

In this paper, we analyzed a multi-service federated learning scheme that is managed by a federated learning orchestrator to provide the optimal computation, communication resources and control the learning process. We first formulate the optimization model for computation, communication resource allocation, and the hyper-learning rate decision among learning services regarding the learning time and energy consumption of UEs. We then decompose the proposed multi-convex problem into three convex sub-problems and solve them alternatively by using the block coordinate descent algorithm in the centralized manner. Besides, we develop a decentralized algorithm to preserve the privacy of each learning service without revealing the learning service information (i.e., dataset information, exchange local updates information between UEs and the MEC server, the number of CPU cycles for each UE to execute one sample of data) to FLO. The simulation results demonstrate the superior convergence performance of the centralized algorithm and the efficiency of the proposed approach compared to the heuristic strategy. Furthermore, by experiment, the early stopping strategy can boost the convergence speed of the decentralized algorithm with an infinitesimal higher value than the optimal solution.

To scale up our design, different MEC servers in different cells can independently allocate their resources. The recent works have analyzed the hierarchical federated learning framework in [33, 34] and game-strategic formulation that provide promising directions to extend this work regarding the scalability issue. We advocate the wireless dynamic and packet losses impacts as in [35] on the bounds of global iterations in the FEDL algorithm and MS-FEDL problem are also important to employ FL scheme to the realistic communication scenarios. We leave the possibly extensive analysis of the proposed approaches for our future works.

References

  • [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless Network Intelligence at the Edge,” Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, Nov 2019.
  • [2] “CrowdTracker: Optimized Urban Moving Object Tracking Using Mobile Crowd Sensing,” IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3452–3463, Oct 2018.
  • [3] “Free Community-based GPS, Maps & Traffic Navigation App — Waze,” https://www.waze.com/.
  • [4] A. Mathurz, N. D. Lanezy, S. Bhattacharyaz, A. Boranz, C. Forlivesiz, and F. Kawsarz, “DeepEye: Resource efficient local execution of multiple deep vision models using wearable commodity hardware,” in MobiSys 2017 - Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services.   Association for Computing Machinery, Inc, Jun 2017, pp. 68–81.
  • [5] B. Fang, X. Zeng, and M. Zhang, “NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision,” in Proceedings of the Annual International Conference on Mobile Computing and Networking, MOBICOM.   Association for Computing Machinery, Oct 2018, pp. 115–127.
  • [6] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
  • [7] “Federated Learning: Collaborative Machine Learning without Centralized Training Data,” http://ai.googleblog.com/2017/04/federated-learning-collaborative.html.
  • [8] “We are making on-device AI ubiquitous,” https://www.qualcomm.com/news/onq/2017/08/16/we-are-making-device-ai-ubiquitous, Aug. 2017.
  • [9] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, 2020.
  • [10] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, “Federated Learning over Wireless Networks: Optimization Model Design and Analysis,” in IEEE INFOCOM 2019, Paris, France, Apr. 2019.
  • [11] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, K. Chan, and I. T. J. Watson, “When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning,” in IEEE INFOCOM 2018.
  • [12] C. Zhang, P. Patras, and H. Haddadi, “Deep Learning in Mobile and Wireless Networking: A Survey,” IEEE Communications Surveys and Tutorials, vol. 21, no. 3, pp. 2224–2287, 2019.
  • [13] C. Ma, J. Konečný, M. Jaggi, V. Smith, M. I. Jordan, P. Richtárik, and M. Takáč, “Distributed optimization with arbitrary local solvers,” Optimization Methods and Software, vol. 32, no. 4, pp. 813–848, Jul. 2017.
  • [14] S. J. Reddi, J. Konečnỳ, P. Richtárik, B. Póczós, and A. Smola, “Aide: Fast and communication efficient distributed optimization,” arXiv:1608.06879, 2016.
  • [15] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” arXiv:1610.02527, 2016.
  • [16] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” in NIPS Workshop on Private Multi-Party Machine Learning, 2016. [Online]. Available: https://arxiv.org/abs/1610.05492
  • [17] J. Konečný, Z. Qu, and P. Richtárik, “Semi-stochastic coordinate descent,” Optimization Methods and Software, vol. 32, no. 5, pp. 993–1005, Sep. 2017.
  • [18] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems 2020, 2020, pp. 429–450.
  • [19] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
  • [20] C. Dinh, N. H. Tran, M. N. H. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, “Federated Learning over Wireless Networks: Convergence Analysis and Resource Allocation,” arXiv:1910.13067, 2019.
  • [21] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman Divergences,” Journal of Machine Learning Research, vol. 6, pp. 1705–1749, Dec. 2005.
  • [22] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.
  • [23] W. Deng, M.-J. Lai, Z. Peng, and W. Yin, “Parallel Multi-Block ADMM with o(1/k) Convergence,” Journal of Scientific Computing, vol. 71, no. 2, pp. 712–736, May 2017.
  • [24] J. Wang, L. Zhao, and L. Wu, “Multi-convex inequality-constrained alternating direction method of multipliers,” arXiv:1902.10882, 2019.
  • [25] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive Federated Learning in Resource Constrained Edge Computing Systems,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
  • [26] V. Smith, S. Forte, C. Ma, M. Takáč, M. I. Jordan, and M. Jaggi, “CoCoA: A General Framework for Communication-Efficient Distributed Optimization,” Journal of Machine Learning Research, vol. 18, no. 230, pp. 1–49, 2018.
  • [27] Y. Nesterov, Lectures on Convex Optimization.   Springer International Publishing, 2018, vol. 137.
  • [28] A. P. Miettinen and J. K. Nurminen, “Energy Efficiency of Mobile Clients in Cloud Computing,” in USENIX HotCloud’10, Berkeley, CA, USA, 2010, pp. 4–4.
  • [29] T. D. Burd and R. W. Brodersen, “Processor Design for Portable Systems,” Journal of VLSI Signal Processing System, vol. 13, no. 2-3, pp. 203–221, Aug. 1996.
  • [30] Y. Xu and W. Yin, “A Block Coordinate Descent Method for Regularized Multiconvex Optimization with Applications to Nonnegative Tensor Factorization and Completion,” SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, Jan 2013.
  • [31] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, Mar 2006.
  • [32] T. Lin, S. Ma, and S. Zhang, “On the Global Linear Convergence of the ADMM with MultiBlock Variables,” SIAM Journal on Optimization, vol. 25, no. 3, pp. 1478–1497, Jan 2015.
  • [33] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 8866–8870.
  • [34] L. Liu, J. Zhang, S. Song, and K. B. Letaief, “Client-edge-cloud hierarchical federated learning,” in ICC 2020-2020 IEEE International Conference on Communications (ICC).   IEEE, 2020, pp. 1–6.
  • [35] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” arXiv preprint arXiv:1909.07972, 2019.

-A Proof for Lemma 2

In this proof, we skip the subscription ss for each service.

According to the equation (15), we have Kg(Θ)=A1ΘK_{g}(\Theta)=\frac{A_{1}}{\Theta} and Θ=2ρ(Bη2+1)CηDη2\Theta=\frac{2\rho\bigl{(}B\eta^{2}+1\bigr{)}}{C\eta-D\eta^{2}} where

A:=\displaystyle A\mathrel{\mathop{:}}= logF(w0)F(w)ϵ>0,\displaystyle\log\frac{F(w^{0})-F(w^{*})}{\epsilon}>0,
B:=\displaystyle B\mathrel{\mathop{:}}= (1+θ)2ρ2,\displaystyle(1+\theta)^{2}\rho^{2},
C:=\displaystyle C\mathrel{\mathop{:}}= 2(θ1)22(θ+1)θρ2,\displaystyle 2(\theta-1)^{2}-2(\theta+1)\theta\rho^{2},
D:=\displaystyle D\mathrel{\mathop{:}}= ρ2(θ+1)(3θ+1),\displaystyle\rho^{2}(\theta+1)(3\theta+1),

and θ(0,1)\theta\in(0,1) is defined in (5), ρ=Lβ>1,L,β\rho=\frac{L}{\beta}>1,\,L,\,\beta are defined in Assumption 1.

Given 0<Θ<10<\Theta<1 and 0<η0<\eta,

We have CηDη2>0C>0,η<CD.C\eta-D\eta^{2}>0\Rightarrow C>0,\eta<\frac{C}{D}.

Thus, B,C,D>0B,C,D>0

We first verify the convexity of the SUB1-d problem then derive a closed-form solution for this problem. The problem is strictly convex by validating the first and second derivatives of the objective function as follows

g(η)\displaystyle g(\eta) :=𝒞(η)2ρA1𝒞^s=Bη2+1CηDη2\displaystyle\mathrel{\mathop{:}}=\frac{\mathcal{C}(\eta)}{2\rho A_{1}\hat{\cal{C}}_{s}}=\frac{B\eta^{2}+1}{C\eta-D\eta^{2}}
g(η)\displaystyle\Rightarrow\nabla g(\eta) =2Bη(CηDη2)(Bη2+1)(C2Dη)(CηDη2)2,\displaystyle=\frac{2B\eta(C\eta-D\eta^{2})-(B\eta^{2}+1)(C-2D\eta)}{(C\eta-D\eta^{2})^{2}},
=BCη2+2DηC(CηDη2)2.\displaystyle=\frac{BC\eta^{2}+2D\eta-C}{(C\eta-D\eta^{2})^{2}}. (39)
\displaystyle\Rightarrow 2g(η)\displaystyle\nabla^{2}g(\eta)
=2BCη+2D(CηDη2)22(BCη2+2DηC)(C2Dη)(CηDη2)3;\displaystyle=\frac{2BC\eta+2D}{(C\eta-D\eta^{2})^{2}}-2\frac{(BC\eta^{2}+2D\eta-C)(C-2D\eta)}{(C\eta-D\eta^{2})^{3}};
=2BCη+2D(CηDη2)2+2BCη2(2DηC)+(2DηC)2(CηDη2)3;\displaystyle=\frac{2BC\eta+2D}{(C\eta-D\eta^{2})^{2}}+2\frac{BC\eta^{2}(2D\eta-C)+(2D\eta-C)^{2}}{(C\eta-D\eta^{2})^{3}};
=2D(CηDη2)2+2BCDη3+(2DηC)2(CηDη2)3>0.\displaystyle=\frac{2D}{(C\eta-D\eta^{2})^{2}}+2\frac{BCD\eta^{3}+(2D\eta-C)^{2}}{(C\eta-D\eta^{2})^{3}}>0. (40)

Thus, the problem is strictly convex and has a unique solution.

𝒞(η)=0\displaystyle\nabla\mathcal{C}(\eta)=0
BCη2+2DηC=0\displaystyle\Rightarrow BC\eta^{2}+2D\eta-C=0
(Since η>0),η=D+D2+BC2BC.\displaystyle\Rightarrow(\text{Since }\eta>0),\eta^{*}=\frac{-D+\sqrt{D^{2}+BC^{2}}}{BC}. (41)

Note that, in our prior work, the sufficient small value of θ\theta can guarantee the feasibility of the problem and the closed-form solution (41). Also, This happens when many local iterations are required to fit the local models and causes the small values of θ\theta.

-B Convexity Proof for SUB2-c problem

Firstly let us introduce the Lagrangian function of the SUB2-c problem as follows

(Tcmp,f,λ,μ,β)=s𝒮n𝒩λs,n(csDs,nfs,nTscmp)\displaystyle\mathcal{L}(T^{\textrm{cmp}},f,\lambda,\mu,\beta)=\sum_{s\in\mathcal{S}}\sum_{n\in\mathcal{N}}\lambda_{s,n}\left(\frac{c_{s}D_{s,n}}{f_{s,n}}-T_{s}^{\textrm{cmp}}\right)
+s𝒮Kl,sKg(Θs)(n𝒩αn2csDs,nfs,n2+κsTscmp)\displaystyle+\sum_{s\in\mathcal{S}}K_{l,s}K_{g}(\Theta^{*}_{s})\left(\sum_{n\in\mathcal{N}}\frac{\alpha_{n}}{2}c_{s}D_{s,n}f^{2}_{s,n}+\kappa_{s}T_{s}^{\textrm{cmp}}\right)
+n𝒩μn(n𝒩fs,nfntot)+s𝒮n𝒩λs,n(fs,minfs,n),\displaystyle+\sum_{n\in\mathcal{N}}\mu_{n}\big{(}\sum_{n\in\mathcal{N}}f_{s,n}-f_{n}^{\textrm{tot}}\big{)}+\sum_{s\in\mathcal{S}}\sum_{n\in\mathcal{N}}\lambda_{s,n}(f_{s,min}-f_{s,n}), (42)

where λ,β0,\lambda,\beta\geq 0, and μ\mu are Lagrangian multipliers. The first-order partial derivative of the Lagrangian function in (42) is as follows

fs,n=Kl,sKg(Θs)αncsDs,nfs,nλs,ncsDs,nfs,n2+μnβs,n;\displaystyle\frac{\partial\mathcal{L}}{\partial f_{s,n}}=K_{l,s}K_{g}(\Theta^{*}_{s})\alpha_{n}c_{s}D_{s,n}f_{s,n}-\frac{\lambda_{s,n}c_{s}D_{s,n}}{f^{2}_{s,n}}+\mu_{n}-\beta_{s,n};
Tscmp=Kl,sKg(Θs)κs+n𝒩λs,n.\displaystyle\frac{\partial\mathcal{L}}{\partial T_{s}^{\textrm{cmp}}}=K_{l,s}K_{g}(\Theta^{*}_{s})\kappa_{s}+\sum_{n\in\mathcal{N}}\lambda_{s,n}.

Then, the second-order derivative as follows

2fs,n2=Kl,sKg(Θs)αncsDs,n+2λs,ncsDs,nfs,n3>0;\displaystyle\frac{\partial^{2}\mathcal{L}}{\partial f^{2}_{s,n}}=K_{l,s}K_{g}(\Theta^{*}_{s})\alpha_{n}c_{s}D_{s,n}+\frac{2\lambda_{s,n}c_{s}D_{s,n}}{f^{3}_{s,n}}>0; (43)
2(Tscmp)2=0.\displaystyle\frac{\partial^{2}\mathcal{L}}{\partial(T_{s}^{\textrm{cmp}})^{2}}=0. (44)

The Hessian matrix of the Lagrangian function is positive semi-definite. Therefore, we can conclude that the SUB2-c problem is a convex problem.

-C Convexity Proof for SUB3-c problem

The Lagrangian function of the SUB3-c problem of BW allocation is as follows

(Tcom,w,ν,ζ,δ)=s𝒮Kg(Θs)(n𝒩pnτs,nul(wn)+κsTscom)\displaystyle\mathcal{L}(T^{\textrm{com}},w,\nu,\zeta,\delta)=\sum_{s\in\mathcal{S}}K_{g}(\Theta^{*}_{s})\left(\sum_{n\in\mathcal{N}}p_{n}\tau_{s,n}^{ul}(w_{n})+\kappa_{s}T_{s}^{com}\right)
+s𝒮n𝒩νs,n(τs,nul(wn)+τs,ndlTscom)\displaystyle+\sum_{s\in\mathcal{S}}\sum_{n\in\mathcal{N}}\nu_{s,n}(\tau_{s,n}^{ul}(w_{n})+\tau_{s,n}^{dl}-T_{s}^{com})
+n𝒩ζn(wminwn)+δn𝒩(1wn),\displaystyle+\sum_{n\in\mathcal{N}}\zeta_{n}(w_{min}-w_{n})+\delta\sum_{n\in\mathcal{N}}(1-w_{n}), (45)

where ζ,ν0,\zeta,\nu\geq 0, and δ\delta are Lagrangian multipliers. Then, the first-order derivative of the Lagrangian function in (45) as follows

wn=Kg(Θs)vswn2Bullog2(1+hnpnN0)(pn+νs,n)ζn;\displaystyle\frac{\partial\mathcal{L}}{\partial w_{n}}=\frac{-K_{g}(\Theta^{*}_{s})v_{s}}{w_{n}^{2}B^{ul}\log_{2}(1+\frac{h_{n}p_{n}}{N_{0}})}\left(p_{n}+\nu_{s,n}\right)-\zeta_{n};
Tscom=Kg(Θs)κsn𝒩νs,n\displaystyle\frac{\partial\mathcal{L}}{\partial T_{s}^{\textrm{com}}}=K_{g}(\Theta^{*}_{s})\kappa_{s}-\sum_{n\in\mathcal{N}}\nu_{s,n} (46)

The second-order derivative of is as follows:

2wn2=2Kg(Θs)vswn3Bullog2(1+hnpnN0)(pn+νs,n)>0;\displaystyle\frac{\partial^{2}\mathcal{L}}{\partial w_{n}^{2}}=\frac{2K_{g}(\Theta^{*}_{s})v_{s}}{w_{n}^{3}B^{ul}\log_{2}(1+\frac{h_{n}p_{n}}{N_{0}})}\left(p_{n}+\nu_{s,n}\right)>0; (47)
2(Tscom)2=0.\displaystyle\frac{\partial^{2}\mathcal{L}}{\partial(T_{s}^{\textrm{com}})^{2}}=0. (48)

Therefore, we can conclude that the SUB3-c problem is a convex problem.