Efficient Federated Meta-Learning over Multi-Access Wireless Networks
Abstract
Federated meta-learning (FML) has emerged as a promising paradigm to cope with the data limitation and heterogeneity challenges in today’s edge learning arena. However, its performance is often limited by slow convergence and corresponding low communication efficiency. In addition, since the available radio spectrum and IoT devices’ energy capacity are usually insufficient, it is crucial to control the resource allocation and energy consumption when deploying FML in practical wireless networks. To overcome these challenges, in this paper, we rigorously analyze the contribution of each device to the global loss reduction in each round and develop an FML algorithm (called NUFM) with a non-uniform device selection scheme to accelerate the convergence. After that, we formulate a resource allocation problem integrating NUFM in multi-access wireless systems to jointly improve the convergence rate and minimize the wall-clock time along with the energy cost. By deconstructing the original problem step by step, we devise a joint device selection and resource allocation strategy to solve the problem with theoretical guarantees. Further, we show that the computational complexity of NUFM can be reduced from $\mathcal{O}(d^2)$ to $\mathcal{O}(d)$ (with the model dimension $d$) by combining two first-order approximation techniques. Extensive simulation results demonstrate the effectiveness and superiority of the proposed methods in comparison with existing baselines.
Index Terms:
Federated meta-learning, multi-access systems, device selection, resource allocation, efficiency.
I Introduction
The integration of Artificial Intelligence (AI) and Internet-of-Things (IoT) has led to a proliferation of studies on edge intelligence, aiming at pushing AI frontiers to the wireless network edge proximal to IoT devices and data sources [1]. It is expected that edge intelligence will reduce time-to-action latency down to milliseconds for IoT applications while minimizing network bandwidth and offering security guarantees [2]. However, a general consensus is that a single IoT device can hardly realize edge intelligence due to its limited computational and storage capabilities. Accordingly, it is natural to rely on collaboration in edge learning, whereby IoT devices work together to accomplish computation-intensive tasks [3].
Building on a synergy of federated learning [4] and meta-learning [5], federated meta-learning (FML) has been proposed under a common theme of fostering edge-edge collaboration [6, 7, 8]. In FML, IoT devices join forces to learn an initial shared model under the orchestration of a central server such that current or new devices can quickly adapt the learned model to their local datasets via one or a few gradient descent steps. Notably, FML can keep all the benefits of the federated learning paradigm (such as simplicity, data security, and flexibility), while giving a more personalized model for each device to capture the differences among tasks [9]. Therefore, FML has emerged as a promising approach to tackle the heterogeneity challenges in federated learning and to facilitate efficient edge learning [10].
Despite its promising benefits, FML comes with new challenges. On one hand, the number of participating devices can be enormous. Selecting devices uniformly at random, as often done in existing methods, leads to slow convergence [11, 9]. Although recent studies [12, 13, 14] have characterized the convergence of federated learning and proposed non-uniform device selection mechanisms, they cannot be directly applied to FML problems due to the bias and high-order information in the stochastic gradients. On the other hand, the performance of FML in a wireless environment is highly related to its wall-clock time, including the computation time (determined by local data sizes and devices’ CPU types) and communication time (depending on channel gains, interference, and transmission power) [15, 16]. If not properly controlled, a large wall-clock time can cause unexpected training delay and communication inefficiency. In addition, DNN model training, which involves a large number of data samples and epochs, usually induces high computational cost, especially for sophisticated model structures consisting of millions of parameters [15]. Due to the limited power capacity of IoT devices, energy consumption should also be properly managed to ensure system sustainability and stability [14]. In a nutshell, to efficiently deploy FML in today’s wireless systems, the strategy of device selection and resource allocation must be carefully crafted not only to accelerate the learning process, but also to control the wall-clock training time and the energy cost of edge devices. Unfortunately, despite their importance, there are limited studies on these aspects in the current literature.
In this paper, we tackle the above-mentioned challenges in two steps: 1) We develop an algorithm (called NUFM) with a non-uniform device selection scheme to improve the convergence rate of vanilla FML algorithm; 2) based on NUFM, we propose a resource allocation strategy (called URAL) that jointly optimizes the convergence speed, wall-clock time, and energy consumption in the context of multi-access wireless systems. More specifically, first we rigorously quantify the contribution of each device to the convergence of FML via deriving a tight lower bound on the reduction of one-round global loss. Based on the quantitative results, we present the non-uniform device selection scheme that maximizes the loss reduction per round, followed by the NUFM algorithm. Then, we formulate a resource allocation problem for NUFM over wireless networks, capturing the trade-offs among convergence, wall-clock time, and energy cost. To solve this problem, we exploit its special structure and decompose it into two sub-problems. The first one is to minimize the computation time via controlling devices’ CPU-cycle frequencies, which is solved optimally based on the analysis of the effect of device heterogeneity on the objective. The second sub-problem aims at optimizing the resource block allocation and transmission power management. It is a non-convex mixed-integer non-linear programming (MINLP) problem, and to derive a closed-form solution is a non-trivial task. Thus, after deconstructing the problem step by step, we devise an iterative method to solve it and provide a convergence guarantee.
In summary, our main contributions are as follows.

•  We provide a theoretical characterization of the contribution of an individual device to the convergence of FML in each round, by establishing a tight lower bound on the one-round reduction of the expected global loss. Using this quantitative result, we develop NUFM, a fast-convergent FML algorithm with non-uniform device selection.

•  To embed NUFM in the context of multi-access wireless systems, we formulate a resource allocation problem, capturing the trade-offs among convergence, wall-clock time, and energy consumption. By decomposing the original problem into two sub-problems and deconstructing the sub-problems step by step, we propose a joint device selection and resource allocation algorithm (namely URAL) to solve the problem effectively with theoretical performance guarantees.

•  To reduce the computational complexity, we further integrate our proposed algorithms with two first-order approximation techniques in [9], by which the complexity of a one-step update in NUFM can be reduced from $\mathcal{O}(d^2)$ to $\mathcal{O}(d)$, where $d$ is the model dimension. We also show that our theoretical results hold in these cases.

•  We provide extensive simulation results on challenging real-world benchmarks (i.e., Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet) to demonstrate the efficacy of our methods.
The remainder of this paper is organized as follows. Section II briefly reviews the related work. Section III introduces the FML problem and standard algorithm. We present the non-uniform device selection scheme in Section IV and adapt the scheme for wireless networks in Section V. Finally, Section VI presents the extension to first-order approximation techniques, followed by the simulation results in Section VII and a conclusion drawn in Section VIII.
II Related Work
Federated learning (FL) [4] has been proposed as a promising technique to facilitate edge-edge collaborative learning [17]. However, due to the heterogeneity in devices, models, and data distributions, a shared global model often fails to capture the individual information of each device, leading to performance degradation in inference or classification [18, 10, 19].
Very recently, based on the advances in meta-learning [5], federated meta-learning (FML) has garnered much attention, which aims to learn a personalized model for each device to cope with the heterogeneity challenges [6, 7, 8, 9, 11]. Chen et al.[6] first introduce an FML method called FedMeta, integrating the model-agnostic meta-learning (MAML) algorithm [5] into the federated learning framework. They show that FML can significantly improve the performance of FedAvg [4]. Jiang et al.[7] analyze the connection between FedAvg and MAML, and empirically demonstrate that FML enables better and more stable personalized performance. From a theoretical perspective, Lin et al.[8] analyze the convergence properties and computational complexity of FML with strongly convex loss functions and exact gradients. Fallah et al.[9] further provide the convergence guarantees in non-convex cases with stochastic gradients. Different from the above gradient descent–based approaches, another recent work [11] develops an ADMM-based FML method and gives its convergence guarantee under non-convex cases. However, due to selecting devices uniformly at random, the existing FML algorithms often suffer from slow convergence and low communication efficiency [9]. Further, deploying FML in practical wireless systems calls for effective resource allocation strategies [15, 20], which is beyond the scope of existing FML literature.
There exists a significant body of work on convergence improvement and resource allocation for FL [21, 22, 23, 24, 13, 25, 26, 27, 15, 28, 29, 14, 30, 31, 32, 33, 34, 35, 36]. Regarding convergence improvement, Nguyen et al.[13] propose a fast-convergent FL algorithm, called FOLB, which achieves a near-optimal lower bound for the overall loss decrease in each round. Note that, while the idea of NUFM is similar to [13], the lower bound for FML is derived along a completely different technical path due to the inherent complexity of the local update. To minimize the convergence time, a probabilistic device selection scheme for FL is designed in [25], which assigns high probabilities to the devices with large effects on the global model. Ren et al.[26] investigate a batch size selection strategy for accelerating the FL training process. Karimireddy et al.[30] employ the variance reduction technique to develop a new FL algorithm. Based on momentum methods, Yu et al.[27] give an FL algorithm with a linear speedup property. Regarding resource allocation in FL, Dinh et al.[15] embed FL in wireless networks, considering the trade-offs between training time and energy consumption, under the assumption that all devices participate in the whole training process. From a long-term perspective, Xu et al.[28] empirically investigate a device selection scheme jointly with bandwidth allocation for FL, using the Lyapunov optimization method. Chen et al.[14] investigate a device selection problem with “hard” resource constraints to enable the implementation of FL over wireless networks. Wang et al.[29] propose a control algorithm to determine the best trade-off between local update and global aggregation under a resource budget.
Although extensive research has been carried out on FL, researchers have not treated FML in much detail. In particular, the existing FL acceleration techniques cannot be directly applied to FML due to the high-order information and biased stochastic gradients in the local update phases (unlike FML, FL is a first-order method with unbiased stochastic gradients; see Lemma 2 in Section IV). At the same time, device selection and resource allocation must be crafted jointly [6], rather than simply plugging in existing strategies.
III Preliminaries and Assumptions
In this section, we introduce federated meta-learning, including the learning problem, standard algorithm, and assumptions for theoretical analysis.
Notation | Definition | Notation | Definition
---|---|---|---
 | Model parameter | | Meta–loss function
 | Stepsize and meta–learning rate | | Lipschitz continuity parameters
 | Indexes of user devices and data samples | | Upper bounds of variances
 | Index of training rounds | | Similarity parameters
 | Index of local update steps | | Contribution of device to the global loss reduction
 | Number of local update steps | | CPU-cycle frequency of device
 | Underlying data distribution of device | | Transmission power of device
 | Dataset of device and its size | | Binary variable indicating whether device accesses RB
 | Set of devices | | Weight parameters
 | Number of devices | | CPU cycles for computing one sample on device
 | Set of participating devices in round | | Channel gain of device and interference in RB
 | Number of participating devices in round | | Bandwidth of each RB and noise power spectral density
 | Input and corresponding label | | Set and number of RBs
 | Loss of model on a sample | | Total energy consumption
 | Expected loss function | | Total latency
III-A Federated Meta-Learning Problem
We consider a set of user devices that are all connected to a server. Each device has a labeled dataset that can be accessed only by itself. Here, the tuple is a data sample with input and label , and follows an unknown underlying distribution . Define as the model parameter, such as the weights of a Deep Neural Network (DNN) model. For device , the loss function of a model parameter is defined as , which measures the error of model in predicting the true label given input .
Federated meta-learning (FML) looks for a good model initialization (also called a meta-model) such that well-performing models for different devices can be quickly obtained via one or a few gradient descent steps. More specifically, FML aims to solve the following problem:

$$\min_{w \in \mathbb{R}^d} \; F(w) \triangleq \frac{1}{N}\sum_{i=1}^{N} f_i\big(w - \alpha \nabla f_i(w)\big), \qquad (1)$$

where $f_i$ represents the expected loss function over the data distribution of device $i$, i.e., $f_i(w) = \mathbb{E}_{(x,y)\sim \mathcal{P}_i}\big[\ell(w; x, y)\big]$, $N$ is the number of devices, and $\alpha$ is the stepsize. The advantages of this formulation are two-fold: 1) it gives a personalized solution that can capture the heterogeneity among devices; 2) the meta-model can quickly adapt to new devices via a slight update with respect to their own data. Clearly, FML fits edge learning well, where edge devices have insufficient computing power and limited data samples.
Next, we review the standard FML algorithm in the literature.
III-B Standard Algorithm
Similar to federated learning, the vanilla FML algorithm solves (1) in two repeating steps, local update and global aggregation [9], as detailed below.

•  Local update: At the beginning of each round $k$, the server first sends the current global model $w_k$ to a fraction $\mathcal{A}_k$ of devices chosen uniformly at random with a pre-set size. Then, each selected device $i$ updates the received model based on its meta-function $F_i(w) \triangleq f_i\big(w - \alpha \nabla f_i(w)\big)$ by running $\tau$ ($\tau \ge 1$) steps of stochastic gradient descent locally (also called mini-batch gradient descent), i.e.,

$$w_{k,t}^i = w_{k,t-1}^i - \beta\, \tilde{\nabla} F_i\big(w_{k,t-1}^i\big), \qquad (2)$$

where $w_{k,t}^i$ denotes the local model of device $i$ in the $t$-th step of the local update in round $k$ with $w_{k,0}^i = w_k$, and $\beta$ is the meta–learning rate. In (2), the stochastic gradient is given by

$$\tilde{\nabla} F_i(w) = \Big(I - \alpha\, \tilde{\nabla}^2 f_i(w, \mathcal{D}_i'')\Big)\, \tilde{\nabla} f_i\big(w - \alpha\, \tilde{\nabla} f_i(w, \mathcal{D}_i),\, \mathcal{D}_i'\big), \qquad (3)$$

where $\mathcal{D}_i$, $\mathcal{D}_i'$, and $\mathcal{D}_i''$ are independent batches (we slightly abuse the notation $\mathcal{D}_i$ to denote a batch of the local dataset of the $i$-th device), and for any batch $\mathcal{D}$, $\tilde{\nabla} f_i(w, \mathcal{D})$ and $\tilde{\nabla}^2 f_i(w, \mathcal{D})$ are the unbiased estimates of $\nabla f_i(w)$ and $\nabla^2 f_i(w)$, respectively, i.e.,

$$\tilde{\nabla} f_i(w, \mathcal{D}) = \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} \nabla \ell(w; x, y), \qquad (4)$$
$$\tilde{\nabla}^2 f_i(w, \mathcal{D}) = \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} \nabla^2 \ell(w; x, y). \qquad (5)$$

•  Global aggregation: After updating the local model parameter, each selected device sends its local model $w_{k,\tau}^i$ to the server. The server updates the global model by averaging over the received models, i.e.,

$$w_{k+1} = \frac{1}{|\mathcal{A}_k|}\sum_{i\in\mathcal{A}_k} w_{k,\tau}^i. \qquad (6)$$
It is easy to see that the main difference between federated learning and FML lies in the local update phase: federated learning performs local updates with unbiased gradient estimates, while FML uses a biased estimate that involves high-order information. Besides, federated learning can be considered a special case of FML, i.e., FML with $\alpha = 0$.
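To make the procedure above concrete, the following minimal NumPy sketch instantiates one stochastic local step of the form (2)-(3) and the aggregation (6) for a toy quadratic loss; the helper names, the quadratic loss, and the batch handling are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

# Toy per-sample loss l(w; x, y) = 0.5 * (x @ w - y)^2, so the batch gradient
# is X^T (X w - Y) / n and the batch Hessian is X^T X / n.
def batch_grad(w, X, Y):
    # Unbiased stochastic gradient estimate over a batch (cf. (4)).
    return X.T @ (X @ w - Y) / len(Y)

def batch_hess(w, X, Y):
    # Unbiased stochastic Hessian estimate over a batch (cf. (5)).
    return X.T @ X / len(Y)

def local_meta_step(w, batches, alpha=0.01, beta=0.001):
    """One stochastic local update of the meta-model (cf. (2)-(3)).

    `batches` holds three independent mini-batches, used for the inner
    adaptation gradient, the outer gradient, and the Hessian correction.
    """
    (X1, Y1), (X2, Y2), (X3, Y3) = batches
    w_adapted = w - alpha * batch_grad(w, X1, Y1)            # inner adaptation step
    g_outer = batch_grad(w_adapted, X2, Y2)                  # gradient at the adapted point
    correction = np.eye(len(w)) - alpha * batch_hess(w, X3, Y3)
    return w - beta * correction @ g_outer                   # biased meta-gradient step

def aggregate(local_models):
    # Global aggregation (cf. (6)): the server averages the received local models.
    return np.mean(np.stack(local_models), axis=0)

# Example: two devices with random data, one local step each, then aggregation.
rng = np.random.default_rng(0)
def rand_batch(n=8, d=3):
    X = rng.normal(size=(n, d))
    return X, X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)
local_models = [local_meta_step(w0, (rand_batch(), rand_batch(), rand_batch())) for _ in range(2)]
print(aggregate(local_models))
```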
III-C Assumptions
Assumption 1 (Smoothness).
The expected loss function $f_i$ corresponding to device $i$ is twice continuously differentiable and $L$-smooth, i.e.,

$$\|\nabla f_i(w) - \nabla f_i(u)\| \le L \|w - u\|, \quad \forall\, w, u \in \mathbb{R}^d. \qquad (7)$$

Besides, its gradient is bounded by a positive constant $B$, i.e., $\|\nabla f_i(w)\| \le B$.
Assumption 2 (Lipschitz Hessian).
The Hessian of the function $f_i$ is $\rho$-Lipschitz continuous for each $i$, i.e.,

$$\|\nabla^2 f_i(w) - \nabla^2 f_i(u)\| \le \rho \|w - u\|, \quad \forall\, w, u \in \mathbb{R}^d. \qquad (8)$$
Assumption 3 (Bounded Variance).
Given any $w \in \mathbb{R}^d$, the following facts hold for the stochastic gradients and Hessians with $(x, y) \sim \mathcal{P}_i$:

$$\mathbb{E}\big[\|\nabla \ell(w; x, y) - \nabla f_i(w)\|^2\big] \le \sigma_G^2, \qquad (9)$$
$$\mathbb{E}\big[\|\nabla^2 \ell(w; x, y) - \nabla^2 f_i(w)\|^2\big] \le \sigma_H^2. \qquad (10)$$
Assumption 4 (Similarity).
For any $w \in \mathbb{R}^d$ and any pair of devices $i$ and $j$, there exist nonnegative constants $\gamma_G$ and $\gamma_H$ such that the gradients and Hessians of the expected loss functions $f_i$ and $f_j$ satisfy the following conditions:

$$\|\nabla f_i(w) - \nabla f_j(w)\| \le \gamma_G, \qquad (11)$$
$$\|\nabla^2 f_i(w) - \nabla^2 f_j(w)\| \le \gamma_H. \qquad (12)$$
Assumption 2 implies the high-order smoothness of $f_i$, which is needed to handle the second-order information in the local update step (2). Assumption 4 indicates that the variations of the gradients and Hessians across different devices are bounded by constants, which captures the similarity between devices’ tasks under non-IID data. It holds for many practical loss functions [37], such as logistic regression and hyperbolic tangent functions. In particular, $\gamma_G$ and $\gamma_H$ can be roughly seen as a distance between the data distributions of devices $i$ and $j$ [38].
IV Non-Uniform Federated Meta-Learning
Due to the uniform selection of devices in each round, the convergence rate of the standard FML algorithm is naturally slow. In this section, we present a non-uniform device selection scheme to tackle this challenge.
IV-A Device Contribution Quantification
We begin with quantifying the contribution of each device to the reduction of one-round global loss using its dataset size and gradient norms. For convenience, define , , and . We first provide necessary lemmas before giving the main result.
Sketch of Proof.
Lemma 1 gives the smoothness of the local meta-function and global loss function .
Lemma 2.
Sketch of Proof.
Lemma 2 shows that the stochastic gradient is a biased estimate of , revealing the challenges in analyzing FML algorithms.
Sketch of Proof.
Lemma 3 characterizes the similarities between the local meta-functions, which is critical for analyzing the one-step global loss reduction because it relates the local meta-function and global objective. Based on Lemmas 1–3, we are now ready to give our main result.
Theorem 1.
Suppose that Assumptions 1-4 are satisfied, and the batches used in the local update are independent. If the local update and global aggregation follow (2) and (6), respectively, then the following fact holds true:
(17) |
where the outer expectation on the RHS is taken with respect to the selected user set and the data sample sizes, and the inner expectation is only with respect to the data sample sizes.
Sketch of Proof.
Theorem 1 provides a lower bound on the one-round reduction of the global objective function based on the device selection. It implies that different user selection has varying impacts on the objective improvement and quantifies the contribution of each device to the objective improvement, depending on the variance of local meta-function, task similarities, smoothness, and learning rates. It therefore provides a criterion for selecting users to accelerate the convergence.
From Theorem 1, we have the following Corollary, which simplifies the above result and extends it to multi-step cases.
Corollary 1.
Sketch of Proof.
Corollary 1 implies that a device with a large gradient naturally accelerates the global loss decrease, but a small dataset size degrades the process due to the corresponding high variance. Besides, as the device dissimilarities become large, the lower bound in Corollary 1 weakens.
Motivated by Corollary 1, we study the device selection in the following subsection.
IV-B Device Selection
To improve the convergence speed, we aim to maximize the lower bound in Corollary 1 on the one-round objective reduction. Based on Corollary 1, we define the contribution of device $i$ to the convergence in round $k$, denoted by $\Theta_i^k$, as
(21) |
where we replace the second moment of the local meta-gradient by its sample value. Then, the device selection problem in round $k$ can be formulated as

$$\max_{z_1, \dots, z_N \in \{0,1\}}\ \sum_{i=1}^{N} z_i\, \Theta_i^k \quad \text{s.t.}\ \sum_{i=1}^{N} z_i = m. \qquad (22)$$

In (22), $z_i$ is a binary variable: $z_i = 1$ if device $i$ is selected in this round, and $z_i = 0$ otherwise; $m$ is the pre-set number of participating devices. The solution of (22) can be found by the Select algorithm introduced in [40, Chapter 9] with worst-case complexity $\mathcal{O}(N)$ (problem (22) is indeed a selection problem: it keeps the $m$ devices with the largest contributions). Accordingly, our device selection scheme is presented as follows.
Device selection: Following the local update phase, instead of selecting devices uniformly at random, each device $i$ first computes its contribution scalar $\Theta_i^k$ locally and sends it to the server. After receiving the contribution scalars from all devices, the server runs the Select algorithm and finds the optimal device set, denoted by $\mathcal{A}_k^*$. Notably, although the contribution scalars involve unknown parameters (such as the smoothness, variance, and similarity constants), they can be either estimated during training as in [29] or directly tuned as in our simulations.
Based on the device selection scheme, we propose the Non-Uniform Federated Meta-Learning (NUFM) algorithm (depicted in Algorithm 1). In particular, although NUFM requires an additional communication phase to upload the contribution scalars to the server, the communication overhead is negligible because each $\Theta_i^k$ is just a scalar.
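As a concrete illustration of the selection step, the sketch below keeps the $m$ devices with the largest contribution scalars. The exact expression in (21) involves constants that are estimated or tuned, so the `contribution` function here is a hypothetical stand-in built only from the quantities highlighted by Corollary 1 (gradient norm and dataset size) with tunable weights.

```python
import numpy as np

def contribution(grad_norm_sq, dataset_size, c1=1.0, c2=1.0):
    # Hypothetical stand-in for (21): a larger gradient norm raises the score,
    # while a smaller dataset (hence higher variance) lowers it.
    return c1 * grad_norm_sq - c2 / dataset_size

def select_devices(contributions, m):
    """Solve the selection problem (22): keep the m largest contributions.

    np.argpartition performs a partial sort (introselect), so the top-m indices
    are found in linear time on average, in line with the selection-algorithm
    view discussed in Section IV-B.
    """
    contributions = np.asarray(contributions, dtype=float)
    m = min(m, len(contributions))
    top = np.argpartition(-contributions, m - 1)[:m]
    return sorted(top.tolist())

# Example: the server receives one scalar per device and selects m = 3 of them.
scores = [contribution(g, n) for g, n in [(4.0, 50), (0.5, 10), (2.2, 200), (3.1, 5), (1.0, 80)]]
print(select_devices(scores, m=3))
```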
V Federated Meta-Learning over Wireless Networks
In this section, we extend NUFM to the context of multi-access wireless systems, where the bandwidth for uplink transmission and the power of IoT devices are limited. First, we present the system model followed by the problem formulation. Then, we decompose the original problem into two sub-problems and devise solutions for each of them with theoretical performance guarantees.
V-A System Model
As illustrated in Fig. 1, we consider a wireless multi-user system, where a set of end devices join forces to carry out federated meta-learning aided by an edge server. Each round consists of two stages: the computation phase and the communication phase. In the computation phase, each device downloads the current global model and computes the local model based on its local dataset; in the communication phase, the selected devices transmit their local models to the edge server via a limited number of wireless channels. After that, the edge server runs the global aggregation and starts the next round. Here we do not consider the downlink communication due to the asymmetric uplink-downlink settings in wireless networks. That is, the transmission power at the server (e.g., a base station) and the downlink communication bandwidth are generally sufficient for global meta-model transmission. Thus, the downlink time is usually neglected compared to the uplink data transmission time [15]. Since we focus on the device selection and resource allocation problem in each round, we omit the round subscript for brevity throughout this section.

V-A1 Computation Model
We denote as the CPU cycles for device to update the model with one sample, which can be measured offline as a priori knowledge [15]. Assume that the batch size of device used in local update phase (2) is . Then, the number of CPU cycles required for device to run a one-step local update is . We denote the CPU-cycle frequency of device as . Thus, the CPU energy consumption of device in the computation during the local update phase can be expressed by
(23) |
where is the effective capacitance coefficient of the computing chipset of device [41]. The computational time of device in a round can be denoted as
(24) |
For simplicity, we set in the following.
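As a concrete illustration, the sketch below instantiates the computation model with the dynamic CPU energy model commonly used in the wireless FL literature [15, 41]; the factor of one half, the inclusion of the number of local steps, and the variable names are assumptions rather than the paper's exact expressions (23)-(24).

```python
def computation_cost(cycles_per_sample, batch_size, local_steps, freq, zeta):
    """Hedged sketch of the per-round computation energy and latency of a device.

    cycles_per_sample : CPU cycles needed to process one sample
    batch_size        : number of samples used per local update step
    local_steps       : number of local update steps in the round
    freq              : CPU-cycle frequency (cycles per second)
    zeta              : effective capacitance coefficient of the chipset
    """
    total_cycles = cycles_per_sample * batch_size * local_steps
    energy = 0.5 * zeta * total_cycles * freq ** 2   # dynamic CPU energy consumption
    latency = total_cycles / freq                    # computation time of the round
    return energy, latency

# Example: 2e4 cycles/sample, batch of 32, 5 local steps, 1 GHz CPU.
print(computation_cost(2e4, 32, 5, 1e9, zeta=1e-28))
```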
V-A2 Communication Model
We consider a multi-access protocol for devices, i.e., the orthogonal frequency division multiple access (OFDMA) technique whereby each device can occupy one uplink resource block (RB) in a communication round to upload its local model. There are RBs in the system, denoted by . The achievable transmission rate of device is [25]
(25) |
with being the bandwidth of each RB, the channel gain, the noise power spectral density, the transmission power of device , and the interference caused by the devices that are located in other service areas and use the same RB. In (25), is a binary variable associated with the -th RB allocation for device : indicates that RB is allocated to device , and otherwise. Each device can only occupy one RB at maximum while each RB can be accessed by at most one device, thereby we have
(26) | ||||
(27) |
Due to the fixed dimension of model parameters, we assume that the model sizes of devices are constant throughout the learning process, denoted by . If device is selected, the time duration of transmitting the model is given by
(28) |
where . Besides, the energy consumption of the transmission is
(29) |
If no RB is allocated to a device in the current round, its transmission power and energy consumption are zero.
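Under the OFDMA model in (25)-(29), a selected device's uplink rate, transmission time, and energy follow the usual Shannon-capacity expressions. The sketch below assumes natural units (bandwidth in Hz, power in W, model size in bits) and that the device occupies exactly one RB; the variable names are illustrative.

```python
import math

def uplink_rate(bandwidth, channel_gain, tx_power, interference, noise_psd):
    # Achievable rate on one RB (cf. (25)), in bits per second.
    sinr = channel_gain * tx_power / (interference + bandwidth * noise_psd)
    return bandwidth * math.log2(1.0 + sinr)

def uplink_cost(model_size_bits, bandwidth, channel_gain, tx_power,
                interference, noise_psd):
    """Transmission time (cf. (28)) and energy (cf. (29)) for uploading one local model."""
    rate = uplink_rate(bandwidth, channel_gain, tx_power, interference, noise_psd)
    latency = model_size_bits / rate
    energy = tx_power * latency
    return latency, energy

# Example: a 1 Mbit model over a 1 MHz RB with mild interference.
print(uplink_cost(1e6, 1e6, 0.8, 0.2, 1e-7, 1e-13))
```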
V-B Problem Formulation
For ease of exposition, we define , , and . Recall the procedure of NUFM. The total energy consumption and wall-clock time in a round can be expressed by
(30) | ||||
(31) |
where we neglect the communication time for transmitting the scalar . The total contribution to the convergence is
(32) |
where the per-device contribution is given in (21) (one can regularize it by adding a large enough constant in (21) to keep it positive).
We consider the following non-convex mixed-integer non-linear programming (MINLP) problem:

$$\text{(P)}: \quad \max_{\mathbf{x},\, \mathbf{p},\, \mathbf{f}} \;\; \Theta - \lambda_1 E - \lambda_2 T$$
$$\text{s.t.} \quad 0 \le p_i \le p_i^{\max}, \;\; \forall i, \qquad (33)$$
$$\phantom{\text{s.t.} \quad} 0 \le f_i \le f_i^{\max}, \;\; \forall i, \qquad (34)$$
$$\phantom{\text{s.t.} \quad} x_{i,m} \in \{0, 1\}, \;\; \forall i, m, \qquad (35)$$
$$\phantom{\text{s.t.} \quad} \text{(26), (27)},$$
where $\lambda_1$ and $\lambda_2$ are weight parameters that capture the Pareto-optimal trade-offs among convergence, latency, and energy consumption, whose values depend on the specific scenario. Constraints (33) and (34) give the feasible regions of the devices’ transmission power levels and CPU-cycle frequencies, respectively. Constraints (26) and (27) ensure that each device accesses at most one uplink RB and that each RB is allocated to at most one device.
In this formulation, we aim to maximize the convergence speed of FML, while minimizing the energy consumption and wall-clock time in each round. Notably, our solution can adapt to the problem with hard constraints on energy consumption and wall-clock time as in [14] via setting “virtual devices” (see Lemma 4 and Lemma 6).
Next, we provide a joint device selection and resource allocation algorithm to solve this problem.
V-C A Joint Device Selection and Resource Allocation Algorithm
Substituting (30), (31), and (32) into problem (P), we can easily decompose the original problem into the following two sub-problems (SP1) and (SP2).
(SP1) aims at controlling the CPU-cycle frequencies for devices to minimize the energy consumption and latency in the computational phase. (SP2) controls the transmission power and RB allocation to maximize the convergence speed while minimizing the transmission cost and communication delay. We provide the solutions to these two sub-problems separately.
V-C1 Solution to (SP1)
Denote the optimal CPU-cycle frequencies of (SP1) as . We first give the following lemma to offer insights into this sub-problem.
Lemma 4.
If device is the straggler of all devices with optimal CPU frequencies , i.e., , the following holds true for any
(36) |
The positive constants and in (36) are defined as
(37) | ||||
(38) |
Sketch of Proof.
Lemma 4 implies that if the straggler (the device with the longest computational time) can be determined, then the optimal CPU-cycle frequencies of all devices can be derived in closed form. Intuitively, because minimizing the energy consumption and minimizing the computational time pull in opposite directions, once the straggler is fixed, the other devices can use the smallest CPU-cycle frequencies for which their computational time does not exceed that of the straggler. This leads to the following theorem.
Theorem 2.
Denote by $\mathbf{f}^{(j)}$ the optimal solution (i.e., (36)) under the assumption that device $j$ is the straggler. Then, the globally optimal solution of (SP1) can be obtained by

$$\mathbf{f}^{*} = \mathbf{f}^{(j^{*})}, \quad \text{with } j^{*} \in \arg\min_{j} G_{1}\big(\mathbf{f}^{(j)}\big), \qquad (39)$$

where $j$ ranges over all devices and $G_{1}$ is the objective function in (SP1).
Proof.
The result can be directly obtained from Lemma 4. We omit it for brevity. ∎
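The sketch below mirrors only the structure of Theorem 2 rather than its closed form: for each candidate straggler, the straggler's own frequency is searched over a grid, every other device receives the smallest frequency that still meets the straggler's deadline, and the candidate with the lowest weighted energy-plus-latency objective wins. The energy/latency model and the one-dimensional grid search are assumptions.

```python
import numpy as np

def sp1_by_straggler_enumeration(cycles, f_max, zeta, lam_e=1.0, lam_t=1.0, grid=200):
    """Sketch of the structure of Theorem 2 (not the paper's closed form):
    enumerate the candidate straggler, search its frequency, give every other
    device the smallest frequency meeting the straggler's deadline, keep the best.
    """
    cycles = np.asarray(cycles, dtype=float)   # total CPU cycles per device in the round
    f_max = np.asarray(f_max, dtype=float)     # maximum CPU-cycle frequencies
    best_obj, best_f = np.inf, None
    for j in range(len(cycles)):
        for f_j in np.linspace(f_max[j] / grid, f_max[j], grid):
            T = cycles[j] / f_j                # computation wall-clock time if j is the straggler
            f = cycles / T                     # smallest frequencies meeting the deadline T
            if np.any(f > f_max + 1e-12):
                continue                       # some device cannot meet T, so j is not the straggler
            f[j] = f_j
            energy = np.sum(0.5 * zeta * cycles * f ** 2)
            obj = lam_e * energy + lam_t * T
            if obj < best_obj:
                best_obj, best_f = obj, f.copy()
    return best_f, best_obj

# Example: three heterogeneous devices.
print(sp1_by_straggler_enumeration([1e9, 2e9, 5e8], [2e9, 2e9, 1e9], zeta=1e-28))
```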
V-C2 Solution to (SP2)
Similar to Section V-C1, we denote the optimal transmission powers and RB allocation of (SP2) accordingly, together with the optimal set of selected devices and, for each selected device, the RB allocated to it.
It is challenging to derive a closed-form solution for (SP2) because it is a non-convex MINLP problem with non-differentiable “max” operator in the objective. Thus, in the following, we develop an iterative algorithm to solve this problem and show that the algorithm will converge to a local minimum.
We begin with analyzing the properties of the optimal solution in the next lemma.
Lemma 5.
Denote the transmission delay regarding and as , i.e.,
(40) |
The following relation holds
(41) |
and can be expressed by
(42) |
Sketch of Proof.
Lemma 5 indicates that the optimal transmission power can be derived in closed form via (42), given the RB allocation and transmission delay. Lemma 5 also implies that, for any RB allocation strategy and transmission delay (not necessarily optimal), equation (42) provides the “optimal” transmission power under them as long as (41) is satisfied. Based on that, we have the following result.
Theorem 3.
Denote . Given transmission delay , the optimal RB allocation strategy can be obtained by
(43) |
where
(44) |
Proof.
Theorem 3 shows that the optimal RB allocation strategy can be obtained by solving (43), given the transmission delay. Naturally, problem (43) can be equivalently transformed into a bipartite matching problem. Consider a bipartite graph in which one node set corresponds to the devices and the other to the RBs, and the weight of the edge between a device and an RB is set according to (44). Then, maximizing (43) is equivalent to finding a matching in this graph with the minimum sum of weights. It means that we can obtain the optimal RB allocation strategy under a fixed transmission delay via the Kuhn-Munkres algorithm with cubic worst-case complexity [42].
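Since (43) reduces to a minimum-weight bipartite matching between devices and RBs, it can be solved with the Hungarian (Kuhn-Munkres) method. The sketch below uses SciPy's `linear_sum_assignment` on an assumed weight matrix, with a large constant standing in for pairs that bring no gain; the weight values themselves are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 1e9   # large cost standing in for "assigning this RB brings no gain"

def allocate_rbs(weights):
    """Minimum-weight matching between devices (rows) and RBs (columns).

    weights[i, m] : cost of giving RB m to device i under the current
                    transmission delay (the negative of the per-pair gain in (43)).
    Returns a dict {device: rb}, dropping pairs matched only at cost BIG.
    """
    W = np.asarray(weights, dtype=float)
    rows, cols = linear_sum_assignment(W)        # Kuhn-Munkres, cubic worst case
    return {int(i): int(m) for i, m in zip(rows, cols) if W[i, m] < BIG}

# Example: 4 devices competing for 3 RBs.
weights = np.array([[-5.0, -1.0,  BIG],
                    [-2.0, -4.0, -0.5],
                    [ BIG,  BIG, -3.0],
                    [-1.0, -0.5, -0.2]])
print(allocate_rbs(weights))   # e.g. {0: 0, 1: 1, 2: 2}
```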
We proceed to show how to iteratively approximate the optimal , , and .
Lemma 6.
Let denote the communication straggler among all selected devices with respect to RB allocation and transmission power , i.e., for any ,
(45) |
Then, the following holds true:
-
1)
Define function . Then, is monotonically increasing with respect to , and has unique zero point , where and are denoted as follows
(46) (47) -
2)
Denote . For , we have
(48)
Sketch of Proof.
Lemma 6 indicates that, given the optimal RB allocation strategy and the straggler, the optimal transmission power can be derived by (48), in contrast to Lemma 5, which requires the corresponding transmission delay. Notably, in (48), we can obtain the zero point with any required tolerance $\epsilon$ by the bisection method in $\mathcal{O}(\log(1/\epsilon))$ iterations at most.
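Lemma 6 reduces the straggler's power control to finding the unique zero of a monotonically increasing scalar function, which bisection locates to any tolerance in logarithmically many iterations. The function below is a generic sketch; the toy `h` in the example stands in for the problem-specific function of the lemma.

```python
def bisect_zero(h, lo, hi, tol=1e-9, max_iter=100):
    """Find the unique zero of a monotonically increasing function h on [lo, hi].

    Assumes h(lo) <= 0 <= h(hi); each iteration halves the bracket, so the zero
    is located to tolerance `tol` in O(log((hi - lo) / tol)) iterations.
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if hi - lo < tol:
            return mid
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example with a toy increasing function; in Lemma 6, h would be built from the
# straggler's rate, delay, and energy terms.
print(bisect_zero(lambda p: p ** 3 - 2.0, 0.0, 2.0))  # ~ cube root of 2
```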
Similar to Theorem 2, we can find the optimal transmission power by the following theorem, given the RB allocation.
Theorem 4.
Denote as the optimal solution under the assumption that is the communication straggler, given fixed RB allocation . The corresponding optimal transmission power is given by
(49) |
where and is the objective function defined in (SP2).
Proof.
We can easily obtain the result from Lemma 6, thereby omitting it for brevity. ∎
Define the communication time corresponding to the RB allocation and transmission power accordingly. Based on Theorems 3 and 4, we have the following Iterative Solution (IVES) algorithm to solve (SP2).
IVES: We initialize transmission delay (based on (42)) as follows
(50) |
In each iteration, we first compute an RB allocation strategy by solving (43) with the Kuhn-Munkres algorithm. Then, based on this allocation, we find the corresponding transmission power by (49) and update the transmission delay before the next iteration. The details of IVES are depicted in Algorithm 2.
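Putting Theorems 3 and 4 together, IVES alternates between matching devices to RBs for the current delay and recomputing the optimal powers and the resulting delay for that matching. The sketch below shows only this control flow; `match_rbs` and `optimal_power_and_delay` are placeholders standing in for the Kuhn-Munkres matching of Theorem 3 and the power rule of Theorem 4 given in (43) and (49).

```python
def ives(match_rbs, optimal_power_and_delay, initial_delay, max_iter=20, tol=1e-6):
    """Iterative solution to (SP2): alternate RB matching and power control.

    match_rbs(delay)           -> RB allocation for the given transmission delay
    optimal_power_and_delay(x) -> (transmission powers, resulting delay)
    The loop stops once the delay no longer changes; in the simulations this
    typically happens within a couple of iterations.
    """
    delay = initial_delay
    rb_alloc, powers = None, None
    for _ in range(max_iter):
        rb_alloc = match_rbs(delay)                            # step (i): allocate RBs
        powers, new_delay = optimal_power_and_delay(rb_alloc)  # step (ii): power control
        if abs(new_delay - delay) < tol:
            break
        delay = new_delay
    return rb_alloc, powers, delay

# Toy usage with dummy subroutines that converge immediately.
print(ives(lambda d: {0: 0}, lambda x: ([0.1], 0.5), initial_delay=1.0))
```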
Using IVES, we can solve (SP2) in an iterative manner. In the following theorem, we provide the convergence guarantee for IVES.
Theorem 5.
If we solve (SP2) by IVES, then the objective value monotonically increases and converges to a unique point.
Sketch of Proof.
Although IVES solves (SP2) iteratively, we observe that it can converge extremely fast (often with only two iterations) in the simulation, achieving a low computation complexity.
Combining the solutions of (SP1) and (SP2), we provide the User selection and Resource Allocation (URAL) algorithm in Algorithm 3 to solve the original problem (P). URAL can simultaneously optimize the convergence speed, training time, and energy consumption by jointly selecting devices and allocating resources. Further, URAL can be directly integrated into the NUFM paradigm in the device selection phase to facilitate the deployment of FML in wireless networks.
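At a high level, URAL strings the two sub-problem solvers together in each round. The following control-flow sketch reflects only this structure; `solve_sp1` and `solve_sp2_ives` stand for the procedures of Theorem 2 and Algorithm 2 sketched above, and the rule that the devices holding an RB are the ones that participate is an assumption for illustration.

```python
def ural_round(solve_sp1, solve_sp2_ives):
    """One round of the URAL pipeline (cf. Algorithm 3), as a control-flow sketch.

    solve_sp1()      -> (cpu_freqs, computation_time)
    solve_sp2_ives() -> (rb_allocation, tx_powers, communication_time)
    """
    cpu_freqs, t_cmp = solve_sp1()                        # computation-phase control (SP1)
    rb_alloc, tx_powers, t_com = solve_sp2_ives()         # communication-phase control (SP2)
    selected = sorted(rb_alloc.keys())                    # devices holding an RB participate
    return selected, cpu_freqs, tx_powers, t_cmp + t_com  # wall-clock time of the round

# Toy usage with dummy sub-problem solvers.
print(ural_round(lambda: ([1e9, 2e9], 0.8), lambda: ({0: 1, 1: 0}, [0.2, 0.3], 0.1)))
```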
VI Extension to First-Order Approximations
The computation of the Hessian in the local update (2) may incur a high computational cost for resource-limited IoT devices. In this section, we address this challenge.
There are two common methods used in the literature to reduce the complexity of computing the Hessian [9]:

1. replacing the stochastic gradient $\tilde{\nabla} F_i(w)$ in (2) by its first-order approximation, which ignores the Hessian term, i.e.,
$$\tilde{\nabla} F_i(w) \approx \tilde{\nabla} f_i\big(w - \alpha\, \tilde{\nabla} f_i(w, \mathcal{D}_i),\, \mathcal{D}_i'\big); \qquad (51)$$

2. replacing the Hessian-gradient product by a difference of gradients, i.e.,
$$\tilde{\nabla}^2 f_i(w, \mathcal{D}_i'')\, g_i \approx \frac{\tilde{\nabla} f_i(w + \delta g_i, \mathcal{D}_i'') - \tilde{\nabla} f_i(w - \delta g_i, \mathcal{D}_i'')}{2\delta}, \qquad (52)$$
where $g_i = \tilde{\nabla} f_i\big(w - \alpha\, \tilde{\nabla} f_i(w, \mathcal{D}_i),\, \mathcal{D}_i'\big)$ and $\delta > 0$ is a small constant (see the sketch below).
By doing so, the computational complexity of a one-step local update can be reduced from $\mathcal{O}(d^2)$ to $\mathcal{O}(d)$, where $d$ is the model dimension, without sacrificing too much learning performance. Next, we show that our results in Theorem 1 hold in the above two cases.
Corollary 2.
VII Simulation
This section evaluates the performance of our proposed algorithms by comparing them with existing baselines on real-world datasets. We first present the experimental setup, including the datasets, models, parameters, baselines, and environment. Then we provide our results from various aspects.
VII-A Experimental Setup

VII-A1 Datasets and Models
We evaluate our algorithms on four widely-used benchmarks, namely Fashion-MNIST [43], CIFAR-10 [44], CIFAR-100 [44], and ImageNet [45]. Specifically, the data is distributed among devices as follows: a) each device has samples from two random classes; b) the number of samples per class follows a truncated Gaussian distribution. We select 50% of the devices at random for training, with the rest for testing. For each device, we divide the local dataset into a support set and a query set. We consider 1-shot 2-class classification tasks, i.e., the support set contains only 1 labeled example for each class. We use a convolutional neural network (CNN) with max-pooling operations and the Leaky Rectified Linear Unit (Leaky ReLU) activation function, containing three convolutional layers with sizes 32, 64, and 128, respectively, followed by a fully connected layer and a softmax layer. The strides are set as 1 for the convolution operations and 2 for the pooling operations.
VII-A2 Baselines
To compare the performance of NUFM, we first consider two existing algorithms, i.e., FedAvg [4] and Per-FedAvg [9]. Further, to validate the effectiveness of URAL in multi-access wireless networks, we use two baselines, called Greedy and Random. In each round, the Greedy strategy determines the CPU-cycle frequency and transmission power for device by greedily minimizing its individual objective, i.e.,
(55) | ||||
(56) |
where is selected at random (i.e., randomly allocating RBs to selected devices). The Random strategy decides the CPU-cycle frequencies, transmission powers, and RB allocation for the selected devices uniformly at random from the feasible regions.
VII-A3 Implementation
We implement the code in TensorFlow Version 1.14 on a server with two Intel® Xeon® Gold 5120 CPUs and one NVIDIA® Tesla V100 32 GB GPU. The parameters used in the simulation can be found in Table III.
VII-B Experimental Results
VII-B1 Convergence Speed
To demonstrate the improvement of NUFM on the convergence speed, we compare the algorithms on different benchmarks with the same initial model and learning rate. We vary the number of participating devices from 20 to 40. As illustrated in Fig. 2 and Table II, NUFM significantly improves the convergence speed and corresponding test accuracy of the existing FML approaches on all datasets (to make the graphs more legible, we draw symbols every two points in Fig. 2). Clearly, this validates the effectiveness of our proposed device selection scheme, which maximizes the lower bound on the one-round global loss reduction. Interestingly, Fig. 2 also indicates that NUFM converges more quickly with relatively fewer participating devices. For example, in round 19, the loss achieved by NUFM with the smallest number of participating devices decreases by more than 9% and 20% relative to the two larger settings on Fashion-MNIST. The underlying rationale is that relatively fewer “good” devices can provide a larger lower bound on the one-round global loss decrease (note that the lower bound in Corollary 1 takes the average over the selected devices). More selected devices in each round generally require more communication resources. Thus, the results reveal the potential of NUFM in applications to resource-limited wireless systems.
Algorithm | Fashion-MNIST | CIFAR-10 | CIFAR-100 | ImageNet
---|---|---|---|---
NUFM | 68.04% | 58.80% | 23.95% | 34.04%
Per-FedAvg | 62.75% | 58.22% | 21.49% | 30.98%
FedAvg | 61.04% | 54.31% | 10.13% | 12.14%
VII-B2 Effect of Local Update Steps
To show the effect of local update steps on the convergence rate of NUFM, we present results with varying numbers of local update steps in each round. For clarity of illustration, we compare the loss under different numbers of local steps in round 19 on Fashion-MNIST and CIFAR-100. Fig. 3 shows that fewer local update steps lead to a larger gap between the baselines and NUFM, which verifies the theoretical result that a small number of local steps can slow the convergence of FedAvg and Per-FedAvg [9, Theorem 4.5]. It also implies that NUFM can improve the computational efficiency of local devices.

VII-B3 Performance of URAL in Wireless Networks
We evaluate the performance of URAL by comparing with four baselines, namely NUFM-Greedy, NUFM-Random, RU-Greedy, and RU-Random, as detailed below.
a) NUFM-Greedy: select devices by NUFM; decide CPU-cycle frequencies, RB allocation, and transmission power by the Greedy strategy;

b) NUFM-Random: select devices by NUFM; decide CPU-cycle frequencies, RB allocation, and transmission power by the Random strategy;

c) RU-Greedy: select devices uniformly at random; decide CPU-cycle frequencies, RB allocation, and transmission power by the Greedy strategy;

d) RU-Random: select devices uniformly at random; decide CPU-cycle frequencies, RB allocation, and transmission power by the Random strategy.
Parameter | Value
---|---
Step size and meta–learning rate | 0.001
# edge devices | 100
# participating devices in Experiments 1 and 2 | An integer varying among
# local updates | An integer varying among
Hyper-parameters | 1
Weight parameters | Real numbers varying among
# RBs | An integer varying among
CPU cycles of devices | Real numbers following
Maximum CPU-cycle frequencies of devices | Real numbers following
Maximum transmission powers of devices | Real numbers following
Interference in RBs | Real numbers following
Model size and bandwidth | 1
Noise power spectral density | 1
Effective capacitance coefficients of devices | Real numbers following
Channel gains of devices |
We simulate a wireless system in which the channel gain of each device follows a uniform distribution and the interference on each RB is drawn at random; the remaining system parameters are set as listed in Table III to simulate device heterogeneity. In the following experiments, we run the algorithms on Fashion-MNIST.
As shown in Fig. 4, URAL can significantly reduce energy consumption and wall-clock time compared with the baselines. However, it is counter-intuitive that the Greedy strategy is not always better than Random. There are two reasons. On one hand, the energy cost and wall-clock time depend on the choice of the weight parameters. The results in Fig. 4 imply that, under the chosen weights, Greedy pays more attention to the wall-clock time than to energy consumption. Accordingly, Greedy achieves a much lower average delay than Random, but sacrifices some energy. On the other hand, the wall-clock time and energy cost require joint control with RB allocation. Although Greedy minimizes the individual objectives (55)-(56), improper RB allocation can cause arbitrary performance degradation. Different from the Greedy- and Random-based baselines, URAL maximizes the joint objective (P) by co-optimizing the CPU-cycle frequencies, transmission power, and RB allocation strategy, so it can alleviate the above-mentioned issues and achieve better delay and energy control. At the same time, Fig. 4 indicates that URAL converges as fast as NUFM-Greedy and NUFM-Random (the corresponding lines almost overlap), which select devices greedily to accelerate convergence (as in NUFM). Thus, URAL achieves an excellent convergence rate.

VII-B4 Effect of Resource Blocks
In Fig. 5, we test the performance of URAL under different numbers of RBs, varying the number of RBs from 1 to 50. More RBs enable more devices to be selected in each round, leading to larger energy consumption. As shown in Fig. 5, URAL keeps the wall-clock time stable as the number of RBs increases. Meanwhile, URAL can control the power of the devices to avoid serious waste of energy. Counter-intuitively, the convergence speed does not always improve as the number of RBs increases, especially for URAL. The reason is the same as that in Section VII-B1: too few selected devices can slow the convergence due to insufficient information provided in each round, while a large number of participating devices may weaken the global loss reduction, as indicated by the lower bound in Corollary 1. Therefore, URAL can adapt to practical systems with constrained wireless resources by achieving fast convergence with only a small set of devices.

VII-B5 Effect of Channel Quality
To investigate the effect of channel conditions on performance, we set the number of RBs and vary the maximum channel gain from 0.25 to 2, and show the corresponding energy consumption and wall-clock training time in Fig. 6. The results indicate that the energy consumption and latency decrease as channel quality improves, because devices can use less power to achieve a relatively large transmission rate.

VII-B6 Effect of Weight Parameters
We study how the weight parameters affect the average energy consumption and wall-clock time of URAL in Fig. 7. We first fix the latency weight and vary the energy weight from 0.5 to 2.5. As expected, the total energy consumption decreases as the energy weight increases, with the opposite trend for the wall-clock time. Then we vary the latency weight with the energy weight fixed. Similarly, a larger latency weight leads to lower latency and higher energy cost. This implies that we can control the levels of wall-clock training time and energy consumption by tuning the weight parameters. In particular, even with a large weight on either term, the wall-clock time and energy cost can be kept at low levels. Meanwhile, the convergence rate achieved by URAL is robust to the choice of weights. Thus, URAL can make full use of the resources (including datasets, bandwidth, and power) and achieve great trade-offs among the convergence rate, latency, and energy consumption.

VIII Conclusion
In this paper, we have proposed an FML algorithm, called NUFM, that maximizes a theoretical lower bound on the global loss reduction in each round to accelerate the convergence. Aiming at effectively deploying NUFM in wireless networks, we have presented a device selection and resource allocation strategy (URAL), which jointly controls the CPU-cycle frequencies, transmission power, and RB allocation to optimize the trade-offs among convergence speed, energy consumption, and wall-clock training time. Moreover, we have integrated the proposed algorithms with two first-order approximation techniques to further reduce the computational complexity on IoT devices. Extensive simulation results demonstrate that the proposed methods outperform the baseline algorithms.
Future work will investigate the trade-off between the local update and global aggregation in FML to minimize the convergence time and energy cost from a long-term perspective. In addition, how to characterize the convergence properties and communication complexity of the NUFM algorithm requires further research.
References
- [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
- [2] G. Plastiras, M. Terzi, C. Kyrkou, and T. Theocharidcs, “Edge intelligence: Challenges and opportunities of near-sensor machine learning applications,” in Proc. IEEE ASAP, 2018, pp. 1–7.
- [3] X. Zhang, Y. Wang, S. Lu, L. Liu, W. Shi et al., “Openei: An open framework for edge intelligence,” in Proc. IEEE ICDCS, 2019, pp. 1840–1851.
- [4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS. PMLR, 2017, pp. 1273–1282.
- [5] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
- [6] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning with fast convergence and efficient communication,” arXiv preprint arXiv:1802.07876, 2019.
- [7] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” arXiv preprint arXiv:1909.12488, 2019.
- [8] S. Lin, G. Yang, and J. Zhang, “A collaborative learning framework via federated meta-learning,” in Proc. IEEE ICDCS, 2020, pp. 289–299.
- [9] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proc. NIPS, 2020, pp. 1–12.
- [10] P. Kairouz et al., “Advances and open problems in federated learning,” arXiv preprint arXiv:1912.04977, 2021.
- [11] S. Yue, J. Ren, J. Xin, S. Lin, and J. Zhang, “Inexact-admm based federated meta-learning for fast and continual edge learning,” in Proc. ACM MobiHoc, 2021, p. 91–100.
- [12] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in Proc. IEEE ICC, 2019, pp. 1–7.
- [13] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, and H. V. Poor, “Fast-convergent federated learning,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 201–218, 2020.
- [14] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, 2020.
- [15] C. T. Dinh, N. H. Tran, M. N. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, “Federated learning over wireless networks: Convergence analysis and resource allocation,” IEEE/ACM Trans. Networking, vol. 29, no. 1, pp. 398–409, 2020.
- [16] J. Liu, J. Ren, Y. Zhang, X. Peng, Y. Zhang, and Y. Yang, “Efficient dependent task offloading for multiple applications in mec-cloud system,” IEEE Trans. Mob. Comput., 2021.
- [17] Z. M. Fadlullah and N. Kato, “Hcp: Heterogeneous computing platform for federated learning based collaborative content caching towards 6g networks,” IEEE Trans. Emerging Top. Comput., 2020.
- [18] Q. Wu, K. He, and X. Chen, “Personalized federated learning for intelligent iot applications: A cloud-edge based framework,” IEEE Open J. Comput. Soc., vol. 1, pp. 35–44, 2020.
- [19] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 1–19, 2019.
- [20] S. Yue, J. Ren, N. Qiao, Y. Zhang, H. Jiang, Y. Zhang, and Y. Yang, “Todg: Distributed task offloading with delay guarantees for edge computing,” IEEE Trans. Parallel Distrib. Syst., 2021.
- [21] J. Ren, Y. He, D. Wen, G. Yu, K. Huang, and D. Guo, “Scheduling for cellular federated edge learning with importance and channel awareness,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7690–7703, 2020.
- [22] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, 2019.
- [23] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio resource allocation for federated edge learning,” in Proc. IEEE ICC Workshops, 2020, pp. 1–6.
- [24] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device scheduling and resource allocation for latency constrained wireless federated learning,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 453–467, 2020.
- [25] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence time optimization for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2457–2471, 2020.
- [26] J. Ren, G. Yu, and G. Ding, “Accelerating dnn training in wireless federated edge learning systems,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 219–232, 2020.
- [27] H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization,” in Proc. ICML, 2019, pp. 7184–7193.
- [28] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, 2020.
- [29] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas in Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
- [30] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in Proc. ICML, 2020, pp. 5132–5143.
- [31] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, “Cost-effective federated learning design,” in Proc. IEEE INFOCOM, 2021.
- [32] W. Luping, W. Wei, and L. Bo, “Cmfl: Mitigating communication overhead for federated learning,” in Proc. IEEE ICDCS, 2019, pp. 954–964.
- [33] J. Li, M. Khodak, S. Caldas, and A. Talwalkar, “Differentially private meta-learning,” in Proc. ICLR, 2019.
- [34] K. C. Sim et al., “Personalization of end-to-end speech recognition on mobile devices for named entities,” in Proc. IEEE ASRU, 2019, pp. 23–30.
- [35] W. Shi, S. Zhou, and Z. Niu, “Device scheduling with fast convergence for wireless federated learning,” in Proc. IEEE ICC, 2020, pp. 1–6.
- [36] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
- [37] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “Fedpd: A federated learning framework with optimal rates and adaptivity to non-iid data,” arXiv preprint arXiv:2005.11418, 2020.
- [38] A. Fallah, A. Mokhtari, and A. Ozdaglar, “On the convergence theory of gradient-based model-agnostic meta-learning algorithms,” in Proc. AISTATS, 2020, pp. 1082–1092.
- [39] S. Yue, J. Ren, J. Xin, D. Zhang, Y. Zhang, and W. Zhuang, “Efficient federated meta-learning over multi-access wireless networks,” arXiv preprint arXiv:2108.06453, 2021.
- [40] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms. MIT press, 2009.
- [41] T. D. Burd and R. W. Brodersen, “Processor design for portable systems,” J. VLSI Sig. Proc. Syst. Sig. Image Video Technol., vol. 13, no. 2, pp. 203–221, 1996.
- [42] E. W. Weisstein, “Hungarian maximum matching algorithm,” https://mathworld.wolfram.com/, 2011.
- [43] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
- [44] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Technical Report TR-2009, University of Toronto, 2009.
- [45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. of IEEE CVPR, 2009, pp. 248–255.
Appendix A Proof of Lemma 1
The proof is standard [11, 9], and we provide it for completeness. Recalling the definition of as , we have
Based on that, for any we have
(adding and subtracting the term ) | ||||
(from triangle inequality) | ||||
(57) |
Then, for , the following holds true
(from Assumption 1) | ||||
(from triangle inequality) | ||||
(58) |
Regarding , it can be shown that
(from Assumptions 1 and 2) |
Substituting and into (57), we have the result.
Appendix B Proof of Lemma 2
We rewrite the stochastic gradient as follows
(59) |
where and are given by
(60) | ||||
(61) |
Note that and are independent. Due to Assumption 3, we have
(62) | ||||
(63) |
Next, we proceed to bound the first and second moments of . Regarding the first moment, we have
(from the smoothness of ) | ||||
(from Assumption 3) |
Regarding the second moment, we have
(from the tower rule, smoothness of along with Assumption 3) | ||||
(64) |
Based on (59), we have
(65) |
which gives us the first result in Lemma 2.
By the submultiplicative property of matrix norm,
(66) |
Thus, we have
(67) |
Taking the expectation on both sides, we obtain
(using the fact that ) | ||||
(68) |
which completes the proof.
Appendix C Proof of Lemma 3
Recall the definition of . We have
(69) |
Regarding , due to the submultiplicative property of the matrix norm, the following holds
(70) |
where the last inequality follows from Assumption 4.
Appendix D Proof of Lemma 4
Since is the straggler among devices, (SP1) can be equivalently transformed to
(73) |
Fixing , for each device (we implicitly assume the corresponding set is not empty), we can show that the optimal CPU frequency of problem (73) can be obtained by solving the following decomposed convex optimization problem
(74) |
If , the optimal solution of problem (74) is given by
(75) |
Otherwise, the problem is infeasible because the constraints are mutually contradictory. Substituting in problem (73), we have the following problem with respect to
(76) |
We simplify the expression of as where positive constants and are defined in (76). Then, the derivative of can be written as
(77) |
Based on (77), the minimum value of is obtained at its stationary point . Thus, the optimal solution of problem (76) is
(78) |
Appendix E Proof of Lemma 5
Given and , can be obtained via solving the following problem
(79) |
where we eliminate , , , and in the objective. Clearly, if , then . If the constraints in (79) are mutually contradictory, i.e.
(80) |
then must be 0, which gives (41).
When there exists such that , by denoting and rearranging the terms, we can transform the problem (79) to
(81) |
We have
(82) |
Denoting the numerator of (82) as , we have and
(83) |
which implies that is monotonically increasing with . Thus, recalling the definition of , due to (41), we obtain
(84) |
which completes the proof of (42).
Appendix F Proof of Lemma 6
If denotes the straggler among all devices and , we can rewrite (6) for each as
(85) |
Based on (F), by eliminating constants and , we transform (SP2) under fixed to
(86) |
Due to (F), for each , we can obtain the optimal solution of (86) (abusing notations) via solving the decomposed problem as follows
(87) |
where . Note that in (87) we consider the nontrivial case where is not empty. It is easy to see that
(88) |
Denoting the numerator of (88) as , we have and
(89) |
which implies that is monotonically increasing with . Thus, if , recalling the definition of , we have
(90) |
Otherwise, problem (87) is infeasible.
Similar to (87), denote and substitute into (86) (noting that ), whereby we have the following problem regarding
(91) |
Denoting positive constant as in (91), we can write and obtain
(92) |
Next, we show that it has a unique minimum point. Denoting the numerator of (F) as , we have and
(93) |
which implies is monotonically increasing. Besides, it can be shown that , where . Thus, there must exist a unique point such that holds, and it is also the minimum value of .
Accordingly, the optimal solution of the problem (91) can be expressed as
(94) |
Therefore, we complete the proof.
Appendix G Proof of Theorem 1
From Lemma 1 (smoothness of ), for any , we have
(95) |
Combining (95) with (2), we have
(96) |
Denoting
(97) |
we have . Next, we bound from below.
First, for any , we rewrite as
(98) |
and bound by
(99) |
where (b) is derived from Lemma 3. Inequality (a) follows from the fact that, for any and ,
(100) |
where denotes the -th coordinate of , and the first inequality is derived by using Cauchy-Schwarz inequality.
Appendix H Proof of Theorem 5
First, it is easy to see that is upper bounded by . Besides, due to (43), we have
(107) |
Based on (48), the optimal transmission power corresponding to is given by
(108) |
where we slightly abuse the notation and denote the RB block allocated to as , i.e., . Thus, we have . From (49), . Using these two inequalities, we obtain
(109) |
thereby completing the proof.
Appendix I Proof of Corollaries 1 and 2
I-A Proof of Corollary 1
Denote auxiliary variable as a “virtual” global model in local step ( is defined in (2)). Similar to (G), the following holds
(from Lemma 1) | ||||
(from the definition of ) | ||||
(from (2)) |
Denote
(110) |
We have . Rewrite for each as follows
(111) |
Note that in (111), and are similar to those in (98) with being replaced by . Thus, from (G) and (101), and are bounded by
(112) | ||||
(113) |
Regarding , we can write
(From Lemma 1) |
To bound , denote
(114) |
with . Then we have
(from [9, Equation (68)]) | ||||
(115) |
for any . For , we first write
(116) |
Next, we bound and separately.
•
Upper Bound of : Due to Cauchy-Schwarz inequality, the following holds
(118) where and for ; and . Thus, we have
(using Lemma 2) where .
Based on the above results, substituting (116) into (115), we obtain
(119) |
Note that (I-A) is essentially the same as [9, Equation (78)]. Therefore, due to and [9, Corollary F.2], we have
(120) |