
Resource Consumption for Supporting Federated Learning in Wireless Networks

Yi-Jing Liu, Shuang Qin, Yao Sun, and Gang Feng
Y. Liu, S. Qin, and G. Feng are with the National Key Lab on Communications, University of Electronic Science and Technology of China, Chengdu 611731, P. R. China, and also with the Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou 313001, P. R. China. Y. Sun is with the James Watt School of Engineering, University of Glasgow, Glasgow G12 8QQ, U.K. S. Qin is the corresponding author (e-mail:[email protected]).
Abstract

Federated learning (FL) has recently become one of the hottest research focuses in wireless edge networks, driven by the ever-increasing computing capability of user equipment (UE). In FL, UEs train local machine learning models and transmit them to an aggregator, where a global model is formed and then sent back to the UEs. In wireless networks, local training and model transmission can be unsuccessful due to constrained computing resources, wireless channel impairments, bandwidth limitations, etc., which degrades FL performance in terms of model accuracy and/or training time. Moreover, we need to quantify the benefits and cost of deploying edge intelligence, as model training and transmission consume a certain amount of resources. Therefore, it is imperative to deeply understand the relationship between FL performance and multi-dimensional resources. In this paper, we construct an analytical model to investigate the relationship between the FL model accuracy and the consumed resources in FL empowered wireless edge networks. Based on the analytical model, we explicitly quantify the relationship between the model accuracy and the available computing and communication resources. Numerical results validate the effectiveness of our theoretical modeling and analysis, and demonstrate the trade-off between the communication and computing resources for achieving a certain model accuracy.

Index Terms:
Federated learning, edge intelligence, resource consumption, FL performance.

I Introduction

Edge intelligence is boosted by the unprecedented computing capability of smart devices. Nowadays, more than 10 billion Internet-of-Things (IoT) devices and 5 billion smartphones are in use, equipped with artificial intelligence (AI)-empowered computing modules such as AI chips and graphics processing units (GPUs) [1]. On the one hand, user equipment (UE) can be deployed as computing nodes to support emerging services, such as collaborative tasks, which paves the way for applying AI in wireless edge networks. On the other hand, in the paradigm of machine learning (ML), the powerful computing capability of these UEs allows ML to move away from the conventional approach of acquiring, storing, and training on data in centralized data centers.

Federated learning (FL) has recently been widely acknowledged as one of the most essential enablers to bring edge intelligence into reality, as it facilitates collaborative training of ML models, while enhancing individual user privacy and data security [2, 3]. In FL, ML models are trained locally, therefore raw data remains in the device. Specifically, FL uses an iterative approach that requires a number of global iterations to achieve a certain global model accuracy. In each global iteration, UEs perform several local iterations to reach a local model accuracy [2, 3]. As a result, the implementation of FL in wireless networks can also reduce the costs of transmitting raw data, relieve the burden on backbone networks, and reduce latency for real-time decisions, etc.

While FL offers these attractive and valuable benefits, it also faces many challenges, especially when deployed in wireless edge networks. For example, both local training and model transmission can be unsuccessful due to constrained resources and unstable transmission. Moreover, different from conventional ML approaches, where raw datasets are sent to a central server, only the lightweight model parameters (i.e., weights, gradients, etc.) are exchanged in FL. Nevertheless, the communication cost of FL can still be fairly large and cannot be ignored. The experimental results in [4] show that the model size of a 5-layer convolutional neural network used for MNIST classification is about 4.567 MB per global iteration for 28×28 images, while the model size of ResNet-110 used for CIFAR-10 classification is around 4.6 MB per global iteration for 32×32 images [5]. Therefore, before deploying FL empowered wireless edge networks, we need to answer two fundamental questions: (1) how accurate an ML model trained with FL can be, and (2) how much resource cost is incurred to guarantee a certain required FL performance. Obviously, answering these two questions is of paramount importance for facilitating edge network intelligence. Therefore, we need to deeply understand the relationship between FL performance and the consumed multi-dimensional resources.

In this paper, we theoretically analyze how many resources are needed to support an FL empowered wireless edge network, under the assumption that UEs follow Poisson distributions in both the spatial and temporal domains. We first derive the distributions of the signal-to-interference-plus-noise ratio (SINR) and the signal-to-noise ratio (SNR), the model transmission success probability, and the resource consumption. Then, we evaluate the impact of the amount of resources on FL performance. Numerical results validate the accuracy of our theoretical modeling and analysis. The main contributions of this paper can be summarized as follows.

(1) We develop an analytical model for FL empowered wireless edge networks, where the UE geographical distribution and the arrival rate of interfering UEs are modeled as Poisson Point Processes (PPP).

(2) We theoretically analyze the SINR, the SNR, and the local/global model transmission success probability. Specifically, we derive the probability density functions (PDF) of the SINR and SNR, from which we obtain the transmission success probability of the local/global models.

(3) Based on the analytical model, we derive an explicit expression for the model accuracy as a function of the amount of resources (including communication and computing resources) under the FL framework.

(4) We investigate three specific cases according to the sufficiency of the respective communication and computing resources. We use simulation experiments to validate the effectiveness of our theoretical modeling and analysis, and demonstrate the trade-off between communication and computing resources for achieving a certain machine learning model accuracy.

In the rest of this paper, we review related work in Section II. Then we present the FL empowered edge network model in Section III and the analysis for the communication and computing resource consumption in Section IV. In Section V, the relationship between FL performance and consumed resources is derived. In Section VI, different cases based on the sufficiency of respective communication and computing resources are discussed. Finally, we present the numerical results in Section VII and conclude the paper in Section VIII.

II Related Work

Currently, there is a large body of work on developing various FL algorithms for FL empowered wireless edge networks. The authors of [6] designed an appropriate user selection scheme to minimize FL convergence time and training loss by jointly optimizing user selection and resource block allocation. The authors of [7] proposed a collaborative FL framework to enable UEs to implement FL with less reliance on a central server by aggregating the local FL models received from the associated UEs. In [8], the authors presented a Stackelberg-game-based approach to develop an FL incentive scheme by modeling the incentive-based interaction between a global server and participating UEs. In [9], we proposed a hybrid FL scheme that makes a global UE association decision for heterogeneous models by exploiting two levels of model aggregation. All the aforementioned investigations aimed to facilitate edge intelligence by developing suitable FL algorithms in wireless edge networks. However, they have not explicitly addressed the resource cost under the FL framework. In fact, the communication cost under the FL framework can still be fairly large and cannot be ignored, even though only the lightweight model parameters are exchanged.

To support these improved FL algorithms, and even the legacy FL algorithms, resource efficiency and guaranteed FL performance are an indispensable basis for FL empowered wireless edge intelligence. The authors of [10] developed a low-cost sampling-based algorithm by adapting various control variables to minimize cost components (e.g., learning time, energy consumption). In [10], the authors considered a multivariate control problem for energy-efficient FL to guarantee convergence by designing principles for different optimization goals. The authors of [11] proposed an over-the-air computation based approach to improve communication efficiency by modeling joint device selection and beamforming design as a sparse and low-rank optimization problem. In [12], the authors introduced update-importance-based client scheduling schemes to reduce the required number of model training rounds by selecting a subset of clients for local updates in each round of training. The authors of [13] proposed a convergent over-the-air FL scheme to reduce bandwidth and energy consumption by introducing precoding and scaling upon transmissions to gradually mitigate the effect of the noisy channel. In [14], the authors proposed a federated dropout scheme to enable FL on resource-constrained devices by tackling both the communication and computation resource bottlenecks. All these investigations aimed to achieve resource-efficient FL algorithms in wireless edge networks. However, the vulnerability of wireless links is largely ignored, which directly degrades FL performance by affecting local training and model transmission. Therefore, a deep understanding of the relationship among FL performance, wireless factors, and multi-dimensional resources is essential for enabling wireless edge intelligence.

So far, there has been little attention on quantifying the relationship between FL performance and consumed resources, while considering the vulnerability of wireless links. The authors of [15] investigated the trade-off between the number of local iterations and the number of global iterations to capture the relationship between FL training time and energy consumption. However, they have not considered the unsuccessful model transmission in a real network. In [16], the authors studied a joint learning and communication optimization scheme to minimize an FL loss function, where the limited resources and unstable wireless links were considered. However, they only focused on optimizing FL performance without comprehensively quantifying the relationship between FL performance and consumed resources.

III FL empowered Wireless Network Model

We consider an FL empowered wireless edge network consisting of a central base station (BS) and multiple UEs, as shown in Fig. 1. The UEs can be regarded as local computing nodes for local model training, while the server (e.g., an edge server) co-located with the BS serves as the model aggregator [2, 9, 3]. To quantitatively present the FL empowered wireless network model, we need to model the distribution of the UEs and the arrival rate of the interfering UEs. As one of the most commonly used point processes, the PPP model has been widely used to model the UE distribution and/or the arrival rate of interfering UEs in wireless networks, and a large body of measurement data has validated its accuracy [17, 18, 19, 20]. Nevertheless, other point processes, such as the Poisson cluster process (PCP) [21] and the Cox process [22] for some specific scenarios, can also be used in our analytical model.

Figure 1: The FL empowered wireless edge network.

III-A FL Model

III-A1 Loss Function

Let random variable N denote the number of UEs, which are geographically distributed as a homogeneous PPP with intensity \lambda_{i}, where n denotes the value of N. Similarly, in the following we use a capital letter to denote a random variable and the corresponding lowercase letter to denote its value. For convenience, the frequently used notations are summarized in TABLE I.

TABLE I: List of notations
N: Number of UEs (random variable); n: The value of N
N_{I}: Number of interfering UEs; n_{I}: The value of N_{I}
N_{\cal A}: Number of UEs in the interfering area {\cal A}; n_{\cal A}: The value of N_{\cal A}
\lambda_{i}: UE density; \lambda_{a}: Interfering UE arrival rate
{\cal S}_{i}: Dataset of UE i; S_{i}: The amount of {\cal S}_{i}
t: Index of local iterations, 0\leq t\leq\tau; s_{i}: The value of S_{i}
\tau: Number of local iterations; K: Number of communication rounds
w_{r}: Global model at the r-th round; w_{i}^{r}(t): Local model of UE i
Z_{i}: Computing capacity of UE i; I_{i}: Interference of UE i
P_{\text{up}}: Transmit power of the UE; P_{\text{down}}: Transmit power of the BS
D_{1}: Distance between the UE and the BS; r_{0}: Radius of the BS coverage
\mathbf{D}_{2}: Distance vector for interfering UEs; d_{0}: Radius of the interfering area
\mathit{SINR}_{\text{up}}: SINR of the uplink; \mathit{SNR}_{\text{down}}: SNR of the downlink
\beta_{\text{down}}: SNR threshold; \beta_{\text{up}}: SINR threshold
R_{\text{down}}^{i,r}: Transmission rate (downlink); R_{\text{up}}^{i,r}: Transmission rate (uplink)
b_{\text{down}}^{i,r}: Bandwidth consumption (downlink); b_{\text{up}}^{i,r}: Bandwidth consumption (uplink)
T_{i}: Training time of UE i for one local iteration; c_{i}: Number of CPU cycles for computing a sample

For a specific UE i, it has a local dataset {\cal S}_{i} with S_{i} data samples, where {\cal S}_{i}=\{x_{ik}\in\mathbb{R}^{d},y_{ik}\in\mathbb{R}\}_{k=1}^{S_{i}}. Moreover, we define f_{k}(w_{i}^{r}(t);x_{ik},y_{ik}) as the loss function for data sample k of UE i, where w_{i}^{r}(t) represents the model parameter of UE i at the t-th local iteration during the r-th global iteration. The loss function f_{k}(w_{i}^{r}(t);x_{ik},y_{ik}) differs across FL learning tasks [23]. For example, for linear regression, the loss function is f_{k}(w_{i}^{r}(t);x_{ik},y_{ik})=\frac{1}{2}(x_{ik}^{\text{T}}w_{i}^{r}(t)-y_{ik})^{2}. For a neural network, the loss function could be the mean squared error (i.e., \frac{1}{n}\sum_{i=1}^{n}(y_{ik}-\widehat{y}_{ik})^{2}), where \widehat{y}_{ik} represents the predicted value of y_{ik}. Based on f_{k}(w_{i}^{r}(t);x_{ik},y_{ik}), we define F_{i}(w_{i}^{r}(t)):\mathbb{R}^{m}\rightarrow\mathbb{R} as the local loss function that captures the local training performance, as follows,

F_{i}(w_{i}^{r}(t))\triangleq\frac{1}{S_{i}}\sum_{k\in{\cal S}_{i}}f_{k}(w_{i}^{r}(t);x_{ik},y_{ik}). (1)

In addition, we define F(w_{r}) as the global loss function over all distributed datasets to measure the global training performance, which is expressed by

F(w_{r})\triangleq\frac{\sum_{i=1}^{n}S_{i}F_{i}(w_{i}^{r}(t))}{S}, (2)

where S\triangleq\sum_{i=1}^{n}S_{i}. The goal of the BS is to find a vector w_{r}^{*} satisfying w_{r}^{*}\triangleq\arg\min_{w_{r}}F(w_{r}).
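To make the loss definitions concrete, the following minimal Python sketch (an illustrative example, not part of the analytical model; all variable names and the synthetic data are assumptions) evaluates the per-sample linear-regression loss, the local loss in equation (1), and the global loss in equation (2).

import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                   # number of UEs and model dimension
w = rng.normal(size=d)                        # a candidate model w_i^r(t)
datasets = []                                 # local datasets with imbalanced sizes S_i
for i in range(n):
    S_i = rng.integers(100, 500)
    X = rng.normal(size=(S_i, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=S_i)
    datasets.append((X, y))

def local_loss(w, X, y):
    # F_i(w) = (1/S_i) * sum_k 1/2 (x_ik^T w - y_ik)^2, cf. equation (1)
    return 0.5 * np.mean((X @ w - y) ** 2)

sizes = np.array([len(y) for _, y in datasets])
F_i = np.array([local_loss(w, X, y) for X, y in datasets])
F_global = np.sum(sizes * F_i) / np.sum(sizes)   # weighted average, cf. equation (2)
print(F_i, F_global)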

III-A2 Updating Model

In FL, each global iteration is called a communication round [9], as shown in Fig. 2. A communication round consists of five phases: local model updating, local iterations (also called local training), local model transmission, global model updating, and global model transmission. In the following, we present the details of the local and global model updating respectively.

Figure 2: The process of a communication round.
(1) Local Model Updating: The local model updating can be performed based on a local learning algorithm, such as gradient descent (GD), actor-critic (AC), etc. Specifically, if 0<t\leq\tau, the local model w_{i}^{r}(t) is updated by w_{i}^{r}(t)=w_{r}-\eta\nabla F_{i}(w_{i}^{r}(t-1)), while w_{i}^{r}(t)=w_{r} if t=0, where t represents the index of local iterations and \tau represents the total number of local iterations during a communication round. Moreover, \eta\geq 0 is the step size, and w_{r} represents the global model at the r-th communication round.

(2) Global Model Updating: After \tau local iterations, i.e., t=\tau, UEs will achieve a certain local accuracy and send the local models w_{i}^{r}(t) to the aggregator. Then the global aggregation is performed at the aggregator according to

w_{r}=\frac{\sum_{i=1}^{n}S_{i}w_{i}^{r}(t)}{S},\quad t=\tau. (3)

III-B Computing Resource Consumption Model

For a specific UE i, let Z_{i} denote its computing capacity in cycles/s, and let c_{i} (cycles/sample) denote the number of CPU cycles required for computing one data sample at UE i. T_{i} represents the local computing time (training time) needed for one local iteration. Therefore, similar to [24], the computing resources consumed during one local iteration by UE i are given by Z_{i}=\frac{c_{i}S_{i}}{T_{i}}.
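For illustration, under the assumed values c_{i}=2\times 10^{4} cycles/sample, S_{i}=5000 samples, and T_{i}=0.5 s (hypothetical numbers, not the simulation settings of Section VII), one local iteration requires Z_{i}=\frac{c_{i}S_{i}}{T_{i}}=\frac{2\times 10^{4}\times 5000}{0.5}=2\times 10^{8} cycles/s.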

III-C Communication Resource Consumption Model

III-C1 Uplink

The transmission time for UE i to transmit the local model w_{i}^{r}(t) (uplink direction) is denoted by T_{\text{up}}^{i,r}. Since the dimension of the local models is fixed for all UEs that participate in local training, the data size of the local model on each UE is constant and is denoted by s [24]. The transmission rate of UE i on the wireless channel to the BS during the r-th communication round is denoted by R_{\text{up}}^{i,r}. Therefore, we have \frac{s}{R_{\text{up}}^{i,r}}=T_{\text{up}}^{i,r}, where

R_{\text{up}}^{i,r}=b_{\text{up}}^{i,r}\log_{2}(1+\mathit{SINR}_{\text{up}}(D_{1},N_{I},\mathbf{D}_{2})), (4)

where b_{\text{up}}^{i,r} represents the amount of bandwidth consumed for transmitting the local model w_{i}^{r}(t). In addition, \mathit{SINR}_{\text{up}}(D_{1},N_{I},\mathbf{D}_{2})=\frac{P_{\text{up}}G(D_{1})}{\sum_{j=1}^{N_{I}}PG(D_{2}^{(j)})+\delta^{2}} is the SINR. D_{1} represents the distance between the UE and the BS, \mathbf{D}_{2}=[D_{2}^{(1)},D_{2}^{(2)},\ldots,D_{2}^{(j)},\ldots,D_{2}^{(N_{I})}] denotes the distance vector of all interfering UEs of UE i, N_{I} represents the number of interfering UEs with N_{I}\leq N, \delta^{2} denotes the noise power, and P_{\text{up}} represents the transmit power of the UE. G(\cdot) represents the large-scale channel gain between the BS and the UEs. Indeed, the channel gain model could be either large-scale (e.g., path loss) or small-scale (e.g., Rayleigh fading, Rician fading), which only affects the SINR/SNR. Furthermore, let \beta_{\text{up}} denote the SINR threshold above which the BS can successfully decode the received updates from UE i. Therefore, local model transmission is successful only if \mathit{SINR}_{\text{up}}(D_{1},N_{I},\mathbf{D}_{2})>\beta_{\text{up}}.

III-C2 Downlink

Let us analyze the SNR for the BS to transmit the global model (downlink direction). The transmission time for transmitting the global model w_{r} in the downlink is denoted by T_{\text{down}}^{i,r}. From equation (3), we see that the dimension of the global model w_{r} is the same as that of the local models. Therefore, the data size of the global model that the BS sends to each UE is also equal to s [16]. We assume that the transmission rate of the BS during the r-th communication round is R_{\text{down}}^{i,r}. Therefore, we have \frac{s}{R_{\text{down}}^{i,r}}=T_{\text{down}}^{i,r}, where

R_{\text{down}}^{i,r}=b_{\text{down}}^{i,r}\log_{2}(1+\mathit{SNR}_{\text{down}}(D_{1})), (5)

where b_{\text{down}}^{i,r} represents the bandwidth consumed for transmitting the global model. In addition, \mathit{SNR}_{\text{down}}(D_{1})=\frac{P_{\text{down}}G(D_{1})}{\delta^{2}}, where P_{\text{down}} represents the transmit power of the BS allocated to all UEs. Furthermore, let \beta_{\text{down}} denote the SNR threshold above which UEs can successfully decode the received updates from the BS. Therefore, the global model transmission is successful only if \mathit{SNR}_{\text{down}}(D_{1})>\beta_{\text{down}}.
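As a concrete illustration of equations (4) and (5), the short Python sketch below (the path-loss form, all parameter values, and variable names are assumptions made for illustration only) computes an uplink SINR, a downlink SNR, and the bandwidth needed to deliver a model of size s within a target transmission time.

import numpy as np

P_up, P_down, noise = 0.1, 20.0, 1e-13        # transmit powers (W) and noise power (assumed)
s_bits, T_target = 4.6e6 * 8, 0.05            # model size (bits) and target transmission time (s)

def gain(d, exponent=4.0):
    # assumed large-scale path-loss gain G(d) ~ d^{-exponent}
    return d ** (-exponent)

d1 = 200.0                                    # UE-BS distance (m)
d2 = np.array([350.0, 500.0])                 # distances of interfering UEs (m)
interference = np.sum(P_up * gain(d2))
sinr_up = P_up * gain(d1) / (interference + noise)    # SINR in equation (4)
snr_down = P_down * gain(d1) / noise                  # SNR in equation (5)

# bandwidth needed so that s / R = T_target, i.e., b = s / (T * log2(1 + SINR))
b_up = s_bits / (T_target * np.log2(1.0 + sinr_up))
b_down = s_bits / (T_target * np.log2(1.0 + snr_down))
print(sinr_up, snr_down, b_up, b_down)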

IV Wireless Bandwidth and Computing Resources Consumed for Supporting FL empowered Edge Intelligence

A certain amount of wireless bandwidth is consumed in the uplink/downlink when models are exchanged between the BS and the UEs. Specifically, in the uplink, the UEs send their local models to the BS via some channel partitioning scheme, such as orthogonal frequency division multiplexing (OFDM). In the downlink, the BS sends the global model to individual UEs. Indeed, the wireless transmission environment of the uplink/downlink affects the transmission of the local/global models, and thus affects the global aggregation and local training. In this section, we theoretically analyze the SINR, the SNR, and the wireless bandwidth consumed in the uplink/downlink to support FL empowered wireless edge networks.

IV-A SINR Analysis for Uplink

IV-A1 Probability Density Function (PDF) of SINR

To derive the PDF of the SINR, we investigate the signal power and the interference separately. We have assumed that the UEs are geographically distributed as a homogeneous PPP with intensity \lambda_{i}; thus the number of UEs N is a Poisson random variable with parameter \pi(r_{0})^{2}\lambda_{i}, where r_{0} represents the radius of the BS coverage. For a specific UE i, the signal power S_{\text{up}}=P_{\text{up}}G(D_{1}) is also a random variable, as it depends only on the distance D_{1}, while P_{\text{up}} is fixed for each UE.

Proposition 1.

The PDF of the distance D_{1} between a specific UE and the serving BS is f_{D_{1}}(d_{1})=\frac{2d_{1}}{(r_{0})^{2}}.

Proof:

As the PDF of the location (X,Y) of UE i is f(X,Y)=\frac{1}{\pi(r_{0})^{2}}, the CDF of the distance D_{1}=d_{1} is F_{D_{1}}(d_{1})=\iint_{X^{2}+Y^{2}\leq(d_{1})^{2}}\frac{1}{\pi(r_{0})^{2}}dXdY=\frac{(d_{1})^{2}}{(r_{0})^{2}}. Therefore, the PDF of D_{1}=d_{1} is f_{D_{1}}(d_{1})=F^{\prime}_{D_{1}}(d_{1})=\frac{2d_{1}}{(r_{0})^{2}}. ∎

Therefore, we can obtain the PDF of the signal power (i.e., f_{S}(S=P_{\text{up}}G(D_{1}))=f_{D_{1}}(d_{1})) for deriving the closed-form expression of {\rm Pr}(\mathit{SINR}_{\text{up}}>\beta_{\text{up}}). Next, we investigate the distribution of the received interference in the uplink. Note that only the transmitting UEs located in the interfering area with radius d_{0} contribute to the interference. We assume that the number of UEs within the interfering area is N_{\cal A} (N_{\cal A}\geq N_{I}), which is also a Poisson random variable with parameter \pi(d_{0})^{2}\lambda_{i}. Moreover, the transmission time of the UEs is denoted by t_{\text{up}}, and the UEs transmitting during the interval [-t_{\text{up}},t_{\text{up}}] contribute to interference. For a specific UE, the arrivals of interfering UEs follow a PPP with rate \lambda_{a}, so the number of arrivals during [-t_{\text{up}},t_{\text{up}}] is Poisson with parameter 2t_{\text{up}}\lambda_{a}. Therefore, the probability that a UE in the interfering area is actively transmitting during [-t_{\text{up}},t_{\text{up}}] is {\rm Pr}(active)=1-\text{exp}\{-2t_{\text{up}}\lambda_{a}\}.

Therefore, the probability of the number of interfering UEs N_{I}=n_{I} given N_{\cal A}=n_{\cal A} is

{\rm Pr}(N_{I}=n_{I}|N_{\cal A}=n_{\cal A})=C_{n_{\cal A}}^{n_{I}}(1-\text{exp}\{-2t_{\text{up}}\lambda_{a}\})^{n_{I}}\cdot(\text{exp}\{-2t_{\text{up}}\lambda_{a}\})^{n_{\cal A}-n_{I}}, (6)

where C_{n_{\cal A}}^{n_{I}} is the binomial coefficient. Therefore, the PDF of N_{I} is

f_{N_{I}}(n_{I})={\rm Pr}(N_{I}=n_{I})=\sum_{n_{\cal A}=n_{I}}^{N}{\rm Pr}(N_{I}=n_{I}|N_{\cal A}=n_{\cal A}){\rm Pr}(N_{\cal A}=n_{\cal A}), (7)

where {\rm Pr}(N_{\cal A}=n_{\cal A})=\frac{(\pi(d_{0})^{2}\lambda_{i})^{n_{\cal A}}}{n_{\cal A}!}\text{exp}\{-\pi(d_{0})^{2}\lambda_{i}\}. Based on Proposition 1, we can derive the PDF of the interference I_{i} generated by UE i, i.e., f_{I_{i}}(I_{i}=P_{\text{up}}G(d_{2}^{(i)}))=f_{D_{2}^{(i)}}(d_{2}^{(i)})=\frac{2d_{2}^{(i)}}{(d_{0})^{2}}. As the total interference I(N_{I},\mathbf{D}_{2}) is affected by the number of interfering UEs N_{I} as well as the distances of these interfering UEs \mathbf{D}_{2}, the PDF of I(N_{I},\mathbf{D}_{2}) is as follows,

f_{I}(N_{I}=n_{I},\mathbf{D}_{2}=\mathbf{d}_{2})=f_{N_{I}}(n_{I}){\rm Pr}(\mathbf{D}_{2}=\mathbf{d}_{2}|N_{I}=n_{I})=f_{N_{I}}(n_{I})\left(\frac{2}{(d_{0})^{2}}\right)^{n_{I}}\prod^{n_{I}}_{n=1}d_{2}^{(n)}. (8)

Therefore, the PDF of the SINR can be given by

f_{\mathit{SINR}_{\text{up}}}(D_{1}=d_{1},N_{I}=n_{I},\mathbf{D}_{2}=\mathbf{d}_{2})=f_{D_{1}}(d_{1})f_{I}(N_{I}=n_{I},\mathbf{D}_{2}=\mathbf{d}_{2}). (9)

IV-A2 Transmission Success Probability of Local Models

Local model transmission is successful if \mathit{SINR}_{\text{up}}>\beta_{\text{up}}. Therefore, the transmission success probability of local models is given by

{\rm Pr}(\mathit{SINR}_{\text{up}}>\beta_{\text{up}})=\iiint_{\cal A}f_{\mathit{SINR}_{\text{up}}}d{\cal A}, (10)

where {\cal A} represents the region of (D_{1},N_{I},\mathbf{D}_{2}) that satisfies \mathit{SINR}_{\text{up}}(D_{1},N_{I},\mathbf{D}_{2})>\beta_{\text{up}}. As f_{\mathit{SINR}_{\text{up}}} has already been obtained in equation (9), we only need to determine the integration region {\cal A}.

For the distance between the UE and the serving BS, intuitively \mathit{SINR}_{\text{up}}<\beta_{\text{up}} when D_{1}>d_{0} [20]. Therefore, the feasible range of D_{1} is (0,d_{0}]. Given D_{1}=d_{1}, we can then characterize the number of interfering UEs N_{I} and their locations. Let \overline{n}_{I} represent the mean of the random variable N_{I}. Based on the UE distribution and the interfering UE arrival models, we can derive \overline{n}_{I} as follows,

\overline{n}_{I}\triangleq E(N_{\cal A}){\rm Pr}(active)=\pi(d_{0})^{2}\lambda_{i}\left(1-\text{exp}\{-2t_{\text{up}}\lambda_{a}\}\right). (11)

Therefore, the SINR is only related to D_{1} and \mathbf{D}_{2}, expressed as \mathit{SINR}_{\text{up}}(D_{1},\mathbf{D}_{2})=\frac{P_{\text{up}}G(D_{1})}{\sum_{i=1}^{\overline{n}_{I}}I_{i}+\delta^{2}}, where I_{i}=P_{\text{up}}G(D_{2}^{(i)}) represents the interference generated by UE i. Therefore, we have

{\rm Pr}(\mathit{SINR}_{\text{up}}>\beta_{\text{up}})={\rm Pr}\left(\sum_{i=1}^{\overline{n}_{I}}I_{i}<\frac{P_{\text{up}}G(D_{1})}{\beta_{\text{up}}}-\delta^{2}\right). (12)

In a typical FL framework, the number of UEs involved in local model training is fairly large, say at least hundreds of UEs [25]. Therefore, based on the central limit theorem, \sum_{i=1}^{\overline{n}_{I}}I_{i} follows a normal distribution N(\mu_{I},\sigma_{I}^{2}) [20, 26]. Furthermore, we have \mu_{I}=\overline{n}_{I}E(I_{i}) and \sigma_{I}=\sqrt{\overline{n}_{I}}D(I_{i}), which are the mean and standard deviation of \sum_{i=1}^{\overline{n}_{I}}I_{i}, respectively [26]. More details about \mu_{I} and \sigma_{I} can be found in Appendix A.

Let Y=\frac{I-\mu_{I}}{\sigma_{I}}, where I=\sum_{i=1}^{\overline{n}_{I}}I_{i} and I\sim N(\mu_{I},\sigma_{I}^{2}). Therefore, we have Y\sim N(0,1), where

{\rm Pr}\left(\sum_{i=1}^{\overline{n}_{I}}I_{i}<\frac{P_{\text{up}}G(d_{1})}{\beta_{\text{up}}}-\delta^{2}\right)={\rm Pr}\left(Y<\frac{\frac{P_{\text{up}}G(d_{1})}{\beta_{\text{up}}}-\delta^{2}-\mu_{I}}{\sigma_{I}}\right)=\Phi(\xi(d_{1})), (13)

where \xi(d_{1})=\frac{\frac{P_{\text{up}}G(d_{1})}{\beta_{\text{up}}}-\delta^{2}-\mu_{I}}{\sigma_{I}} and \Phi represents the cumulative distribution function (CDF) of the standard normal distribution. Therefore, we have

{\rm Pr}\left(\mathit{SINR}_{\text{up}}>\beta_{\text{up}}\right)=\iiint_{\cal A}f_{\mathit{SINR}_{\text{up}}}d{\cal A}=\int_{d_{1}=d_{\text{min}}}^{d_{0}}f_{D_{1}}(d_{1})\Phi(\xi(d_{1}))d(d_{1}). (14)
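The success probability in equation (14) can also be checked numerically. The Monte Carlo sketch below (a validation sketch under assumed parameter values and an assumed path-loss exponent, not the simulation code used in Section VII) draws the UE location and the interfering UEs according to the PPP model and estimates Pr(SINR_up > \beta_up) empirically.

import numpy as np

rng = np.random.default_rng(1)
r0, d0 = 1000.0, 100.0                 # BS coverage radius and interfering-area radius (m)
lam_a, t_up = 0.01, 2e-3               # interfering-UE arrival rate and transmission time (assumed)
lam_i = 1e-4                           # UE density (UEs per m^2, assumed)
P_up, noise, beta_up = 0.1, 1e-13, 2.0

def gain(d, exponent=4.0):
    return np.maximum(d, 1.0) ** (-exponent)   # assumed path-loss gain, clipped near the BS

hits, trials = 0, 20000
p_active = 1.0 - np.exp(-2 * t_up * lam_a)     # Pr(active) from Section IV-A
for _ in range(trials):
    d1 = r0 * np.sqrt(rng.uniform())           # Proposition 1: F_{D1}(d1) = d1^2 / r0^2
    n_area = rng.poisson(np.pi * d0 ** 2 * lam_i)
    n_int = rng.binomial(n_area, p_active)     # thinning by the activity probability, cf. (6)
    d2 = d0 * np.sqrt(rng.uniform(size=n_int)) # interferer distances within radius d0
    sinr = P_up * gain(d1) / (np.sum(P_up * gain(d2)) + noise)
    hits += sinr > beta_up
print("empirical Pr(SINR_up > beta_up) ~", hits / trials)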

IV-B SNR Analysis for Downlink

As \mathit{SNR}_{\text{down}}(D_{1})=P_{\text{down}}G(D_{1})/\delta^{2}, we can obtain the PDF of the signal power as f_{S_{\text{down}}}(S_{\text{down}}=P_{\text{down}}G(D_{1}))=f_{D_{1}}(d_{1}), given the transmit power P_{\text{down}} and the noise power \delta^{2}. Therefore, given G(d_{\text{min}}^{\prime})=\delta^{2}\beta_{\text{down}}/P_{\text{down}}, where we assume G(d_{1}) monotonically increases, we have

{\rm Pr}(\mathit{SNR}_{\text{down}}>\beta_{\text{down}})={\rm Pr}\left(\frac{P_{\text{down}}G(D_{1})}{\delta^{2}}>\beta_{\text{down}}\right)=\int_{d_{1}=d_{\text{min}}^{\prime}}^{r_{0}}f_{D_{1}}(d_{1})d(d_{1}). (15)

IV-C Wireless Bandwidth Consumed for Transmitting Local/Global Models

Based on equation (4), the bandwidth consumed for transmitting the local model w_{i}^{r}(t) during the r-th communication round is given by b_{\text{up}}^{i,r}=\frac{s}{T_{\text{up}}^{i,r}\log_{2}(1+\mathit{SINR}_{\text{up}}(D_{1},N_{I},\mathbf{D}_{2}))}. As s and T_{\text{up}}^{i,r} are constant, the PDF of b_{\text{up}}^{i,r} for UE i in the uplink is equal to f_{\mathit{SINR}_{\text{up}}}. Therefore, the mean bandwidth consumed by all UEs for transmitting local models during K communication rounds is as follows,

\overline{B}_{\text{up}}=K\cdot\sum_{i=1}^{n}b_{\text{up}}^{i,r}f_{\mathit{SINR}_{\text{up}}}. (16)

Similarly, the mean bandwidth consumed for transmitting the global model during K communication rounds is given by

\overline{B}_{\text{down}}=K\cdot\sum_{i=1}^{n}b_{\text{down}}^{i,r}f_{\mathit{SNR}_{\text{down}}}. (17)

IV-D Consumed Computing Resources in FL

We assume the processing capacity of all UEs is constant in cycles/sample. Moreover, as different ML models pose different degrees of complexity, we assume all UEs train the same FL task in our analytical model, where the local ML models have the same size and structure. Therefore, the total amount of computing resources needed to support local model training for all UEs is affected by the number of training UEs as well as the amount of data on the UEs. On the one hand, the number of training UEs is affected by the wireless transmission of the global model. On the other hand, many existing studies explicitly indicate that the amounts of data on different UEs are imbalanced, as the data is collected directly and stored persistently [27, 28]. Note that here data imbalance refers to differences in the amount of local data rather than differences in dataset content, so our analytical model is based on i.i.d. data, while the non-i.i.d. data case is left for future work. Therefore, in this subsection we theoretically analyze the computing resource consumption for supporting local training from the perspective of the SNR and imbalanced datasets.

We assume that the amount of data on each UE follows a normal distribution [29, 30], i.e., S_{i}\sim N(\mu_{i},\sigma_{i}^{2}), where \mu_{i} and/or \sigma_{i}^{2} could be different for specific UEs. Indeed, other distributions, such as the Beta distribution and the Gamma distribution, can also be used in our analytical model. Moreover, as the computing resource consumption of UE i for one local iteration is Z_{i}=\frac{c_{i}S_{i}}{T_{i}}, the PDF of Z_{i} is equal to f_{S_{i}}(s_{i}), i.e., f_{Z_{i}}(z_{i})=\frac{1}{\sqrt{2\pi}\sigma_{i}}\text{exp}\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right).

For a specific UE i, if \mathit{SNR}_{\text{down}}>\beta_{\text{down}}, we say UE i successfully receives the global model. In other words, UE i will continue to perform local training in the next communication round and consume a certain amount of computing resources. Let \boldsymbol{\hat{Z}}=\{\hat{Z}_{1},\hat{Z}_{2},...,\hat{Z}_{i},...,\hat{Z}_{n}\} denote the computing resources consumed by all UEs, where \hat{Z}_{i} is set to z_{i} if UE i successfully receives the global model and 0 otherwise. Therefore, we can obtain the PDF of \hat{Z}_{i} as follows,

f_{\hat{Z}_{i}}(\hat{Z}_{i}=z_{i})={\rm Pr}(\hat{Z}_{i}=z_{i}|\mathit{SNR}_{\text{down}}>\beta_{\text{down}})=f_{Z_{i}}(z_{i}){\rm Pr}(\mathit{SNR}_{\text{down}}>\beta_{\text{down}}). (18)

Therefore, based on equation (18), we can derive the mean computing resources consumed by all UEs for one local iteration, which is given by \overline{C}_{\mathit{UE}}=\sum_{i=1}^{n}z_{i}f_{\hat{Z}_{i}}(\hat{Z}_{i}=z_{i}). The total computing resources consumed for local model training are then given by

C_{\text{total}}=\tau K\overline{C}_{\mathit{UE}}, (19)

where \tau and K represent the number of local iterations and communication rounds, respectively. Armed with the above preparation, we now analyze how the resources affect the FL performance by evaluating the local and global model accuracy.
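Before moving on, the short Python sketch below illustrates how equations (18) and (19) can be evaluated numerically (the dataset-size distribution parameters, the downlink success probability value, and all variable names are hypothetical assumptions for illustration).

import numpy as np

rng = np.random.default_rng(2)
n, tau, K = 100, 20, 50                       # UEs, local iterations, communication rounds
p_down = 0.9                                  # assumed Pr(SNR_down > beta_down)
c = rng.uniform(1e4, 4e4, size=n)             # CPU cycles per sample
mu, sigma = rng.uniform(1e3, 1e4, size=n), rng.uniform(0.2, 0.5, size=n)
S = rng.normal(mu, sigma)                     # dataset sizes S_i ~ N(mu_i, sigma_i^2)
T = 0.5                                       # local computing time per iteration (s)

Z = c * S / T                                 # cycles/s per UE, Section III-B
received = rng.uniform(size=n) < p_down       # UEs that successfully receive the global model
C_ue = np.sum(Z * received)                   # realized total per local iteration, an empirical counterpart of C_UE
C_total = tau * K * C_ue                      # equation (19)
print(C_total)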

V The Relationship between FL Performance and Consumed Resources

Indeed, the unsuccessful transmission of local models in the uplink affects the aggregation of the global model, while the unsuccessful transmission in the downlink affects the updating and training of local models. Therefore, it is necessary to analyze how the computing resources and communication resources affect the FL performance by evaluating both the local and global model accuracy.

V-A Local Model Accuracy

In an FL framework, no matter what local machine learning algorithm is used, each UE solves the local optimization problem for local training [24, 15, 31], i.e.,

\min_{h_{i}\in\mathbb{R}^{d}}G_{i}(w_{r},h_{i})\triangleq F_{i}(w_{r}+h_{i})-(\nabla F_{i}(w_{r})-\zeta\nabla F(w_{r}))^{\text{T}}h_{i}, (20)

where \zeta is a constant and h_{i} represents the difference between the global model and the local model of UE i. Without loss of generality, we use the GD algorithm to update the local models, as it can achieve the required high accuracy and facilitates the convergence analysis [24], as follows,

h_{i}^{(r)(t+1)}=h_{i}^{(r)(t)}-\xi\nabla G_{i}(w_{r},h_{i}^{(r)(t)}), (21)

where \xi represents the step size and h_{i}^{(r)(t)} denotes the value of h_{i} at the t-th local iteration given the global model vector w_{r}. Moreover, \nabla G_{i}(w_{r},h_{i}^{(r)(t)}) is the gradient of G_{i}(w_{r},h_{i}) at the point h_{i}=h_{i}^{(r)(t)}. In addition, w_{i}^{r}(t)=w_{r}+h_{i}^{(r)(t)} represents the local model of UE i at the t-th local iteration. For a small step size \xi, we can derive a sequence of solutions h_{i}^{(r)(0)},\cdots,h_{i}^{(r)(t)},\cdots,h_{i}^{(r)(\tau)} which satisfies

G_{i}(w_{r},h_{i}^{(r)(0)})\geq\cdots\geq G_{i}(w_{r},h_{i}^{(r)(t)})\geq\cdots\geq G_{i}(w_{r},h_{i}^{(r)(\tau)}).

To provide the convergence condition for the GD method, we introduce the local model accuracy loss \epsilon_{l} [24, 15], which resembles the approximation factors in [15, 24], as follows,

G_{i}(w_{r},h_{i}^{(r)(t)})-G_{i}(w_{r},h_{i}^{(r)*})\leq\epsilon_{l}(G_{i}(w_{r},h_{i}^{(r)(0)})-G_{i}(w_{r},h_{i}^{(r)*})), (22)

where h_{i}^{(r)*} represents the optimal solution of problem (20). Note that each UE aims to solve the local optimization problem with a target local model accuracy 1-\epsilon_{l}. To achieve the local model accuracy 1-\epsilon_{l} and the global model accuracy 1-\epsilon_{g} introduced below, we first make the following three assumptions on the local loss function F_{i}(w), as in [24, 16, 31].

\bullet Assumption 1: Function F_{i}(w) is L-Lipschitz, i.e., \forall w,w^{\prime}\in\mathbb{R}^{d}, \parallel\nabla F_{i}(w)-\nabla F_{i}(w^{\prime})\parallel\leq L\parallel w-w^{\prime}\parallel.

\bullet Assumption 2: Function F_{i}(w) is \gamma-strongly convex, i.e., \forall w,w^{\prime}\in\mathbb{R}^{d}, F_{i}(w)\geq F_{i}(w^{\prime})+\langle\nabla F_{i}(w^{\prime}),(w-w^{\prime})\rangle+\frac{\gamma}{2}\parallel w-w^{\prime}\parallel^{2}.

\bullet Assumption 3: F_{i}(w) is twice-continuously differentiable, and \gamma I\leq\nabla^{2}F_{i}(w)\leq LI.

Based on the three assumptions, we can obtain the lower bound on the number of local iterations during each communication round, which is shown as Proposition 2.

Proposition 2.

Local model accuracy loss \epsilon_{l} is achieved if \xi<\frac{2}{L} and the GD method runs

\tau\geq\lceil\frac{2}{(2-L\xi)\xi\gamma}\ln\frac{1}{\epsilon_{l}}\rceil

iterations during each communication round at each UE that participates in local training.

Proof:

See Appendix B. ∎

The lower bound in Proposition 2 reflects the growing trend of the number of local iterations with respect to the local model accuracy, which can approximate the consumption of computing resources for training local models.
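For illustration, under the assumed values L=4, \xi=0.25 (<2/L=0.5), \gamma=1, and a target local accuracy loss \epsilon_{l}=0.1 (parameter values chosen purely for illustration), the bound gives \tau\geq\lceil\frac{2}{(2-1)\times 0.25\times 1}\ln 10\rceil=\lceil 8\times 2.303\rceil=19 local iterations per communication round.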

V-B Global Model Accuracy

In FL algorithms, a global model accuracy is also needed. For a specific FL task, we define \epsilon_{g} as its global model accuracy loss (the global model accuracy is 1-\epsilon_{g}), as follows,

F(w_{r}(\hat{\cal S},\mathit{SINR}_{\text{up}}))-F(w_{r}^{*})\leq\epsilon_{g}(F(w_{1})-F(w_{r}^{*})), (23)

where w_{r}^{*} represents the actual optimal solution. Moreover, we provide the following Proposition 3 about the number of communication rounds required to achieve the global model accuracy 1-\epsilon_{g}.

Proposition 3.

Global model accuracy 1-\epsilon_{g} is achieved if the number of communication rounds K satisfies

K\geq\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\rceil,

when running the FL algorithm shown as Algorithm 1 with 0<\zeta<\frac{\gamma}{L}.

Proof:

See Appendix C. ∎

Note that it is very hard to derive a closed-form expression of the global model in each communication round due to the dynamic nature of the wireless channel and the uncertainty of multiple random variables. Therefore, we assume the amount of data on each UE is fixed to facilitate the proof of Proposition 3. In addition, from Proposition 2 and Proposition 3, we can see that there is a trade-off between the number of communication rounds and the number of local iterations, characterized by \epsilon_{l}: a small \epsilon_{l} leads to a large \tau yet a small K, from which we can jointly approximate the communication and computing resources consumed by training FL tasks. The details can be found in the next section.
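To illustrate the trade-off numerically (with assumed constants L=4, \gamma=1, \xi=0.25, \zeta=0.2<\gamma/L, and \epsilon_{g}=0.1, used only for illustration): choosing \epsilon_{l}=0.1 requires \tau\geq 19 local iterations per round but only K\geq\lceil 32\ln 10/(0.9\times 0.2)\rceil=410 communication rounds, whereas relaxing to \epsilon_{l}=0.5 reduces the local work to \tau\geq 6 but raises the communication cost to K\geq\lceil 32\ln 10/(0.5\times 0.2)\rceil=737 rounds.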

Algorithm 1 : FL Algorithm.

Input: The required local model accuracy loss \epsilon_{l}, the required global model accuracy loss \epsilon_{g}.
Output: The global model w_{r}, the number of local iterations \tau, and the number of communication rounds K.

1:  Initialization: local models w_{i}^{1}(0)=0, the global model g_{1}=0.
2:  for r=1,2,... do
3:     Each UE calculates \nabla F_{i}(w_{r}).
4:     Each UE sends \nabla F_{i}(w_{r}) to the BS.
5:     The BS calculates \nabla F(w_{r})=\frac{\sum_{i=1}^{n}\hat{{\cal S}_{i}}\nabla F_{i}(w_{r})}{\sum_{i=1}^{n}\hat{{\cal S}_{i}}}.
6:     The BS broadcasts \nabla F(w_{r}) to each UE.
7:     Parallel for each UE i=1,2,...,n
8:     Initialization: the local iteration number t=0, and set h_{i}^{(r)(0)}=0.
9:     Repeat
10:     Every V steps set h_{i}^{(r)*}=h_{i}^{(r)(t)}.
11:     Update h_{i}^{(r)(t+1)}=h_{i}^{(r)(t)}-\xi\nabla G_{i}(w_{r},h_{i}^{(r)(t)}).
12:     Set w_{i}^{r}(t)=w_{r}+h_{i}^{(r)(t)}.
13:     if \frac{G_{i}(w_{r},h_{i}^{(r)(t)})-G_{i}(w_{r},h_{i}^{(r)*})}{G_{i}(w_{r},h_{i}^{(r)(0)})-G_{i}(w_{r},h_{i}^{(r)*})}>\epsilon_{l} then
14:        Set t=t+1
15:     else
16:        Each UE i sends w_{i}^{r}(t) to the BS.
17:     end if
18:     The BS calculates w_{r}=\frac{\sum_{i=1}^{n}\hat{{\cal S}_{i}}w_{i}^{r}(t)}{\sum_{i=1}^{n}\hat{{\cal S}_{i}}}
19:     The BS sends w_{r} to all UEs.
20:     if \frac{F(w_{r}(\hat{\cal S},\mathit{SINR}_{\text{up}}))-F(g^{*})}{F(g_{0})-F(g^{*})}<\epsilon_{g} then
21:        Break;
22:     end if
23:  end for
24:  Set \tau=t, K=r.
25:  Output w_{r}, \tau, K.
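The listing below is a minimal Python sketch of Algorithm 1 for a synthetic linear-regression task (the data, the parameter values, and the gradient-norm stopping rule used in place of the exact accuracy-loss test in line 13 are all illustrative assumptions, not the authors' implementation).

import numpy as np

rng = np.random.default_rng(0)
d, n_ues = 10, 20
w_true = rng.normal(size=d)
data = []
for i in range(n_ues):
    S_i = rng.integers(50, 200)                       # imbalanced local dataset sizes
    X = rng.normal(size=(S_i, d))
    y = X @ w_true + 0.1 * rng.normal(size=S_i)
    data.append((X, y))
sizes = np.array([len(y) for _, y in data], dtype=float)

def grad_Fi(w, X, y):
    # gradient of the local squared loss F_i in equation (1)
    return X.T @ (X @ w - y) / len(y)

zeta, xi, tol, K = 0.1, 0.05, 1e-3, 30
w_r = np.zeros(d)
for r in range(K):                                    # communication rounds (lines 2-23)
    g_local = [grad_Fi(w_r, X, y) for X, y in data]   # lines 3-4
    g_global = sum(s * g for s, g in zip(sizes, g_local)) / sizes.sum()  # line 5
    local_models = []
    for (X, y), g_i in zip(data, g_local):            # lines 7-17, run at each UE
        h = np.zeros(d)
        for t in range(500):
            grad_G = grad_Fi(w_r + h, X, y) - (g_i - zeta * g_global)  # gradient of G_i in (20)
            h -= xi * grad_G                          # update (21), line 11
            if np.linalg.norm(grad_G) < tol:          # proxy for the local accuracy test, line 13
                break
        local_models.append(w_r + h)
    w_r = sum(s * w for s, w in zip(sizes, local_models)) / sizes.sum()  # aggregation (3), line 18
print(np.linalg.norm(w_r - w_true))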

VI Discussion of Three Cases

In general, the resources that can be used for training FL tasks in a wireless edge network are limited, because 1) communication and computing resources at the wireless edge are limited and precious, and 2) resource consumption increases quickly with the widespread use of smart terminals. In this section, we discuss three specific cases corresponding to different levels of sufficiency of the communication and computing resources. Furthermore, we derive the explicit expression of the model accuracy under the FL framework as a function of the amount of consumed resources, based on the sufficiency of the respective communication and computing resources.

VI-A Case 1: Sufficient Communication Resources and Computing Resources

When both communication resources and computing resources are sufficient, we can approximate the amount of communication/computing resources needed for the FL algorithm based on Proposition 2 and Proposition 3. Specifically, the bandwidth needed for transmitting local models in the uplink should meet

\begin{split}\overline{B}_{\text{up}}\geq\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\rceil\cdot\sum_{i=1}^{n}b_{\text{up}}^{i,r}f_{\mathit{SINR}_{\text{up}}}=\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\rceil\cdot\sum_{i=1}^{n}\frac{s}{T_{\text{up}}^{i,r}\log_{2}(1+\frac{P_{\text{up}}G(d_{1})}{\sum_{j=1}^{N_{I}}PG(d_{2}^{(j)})+\delta^{2}})}\cdot\left(\frac{2d_{1}}{(d_{0})^{2}}\right)\cdot\\ \sum_{n_{\cal A}=n_{I}}^{N}C_{n_{\cal A}}^{n_{I}}(1-\text{exp}\{-2t_{\text{up}}\lambda_{a}\})^{n_{I}}\cdot(\text{exp}\{-2t_{\text{up}}\lambda_{a}\})^{n_{\cal A}-n_{I}}\cdot\frac{(\pi(d_{0})^{2}\lambda_{i})^{n_{\cal A}}}{n_{\cal A}!}\text{exp}\{-\pi(d_{0})^{2}\lambda_{i}\}\cdot\left(\frac{2}{(d_{0})^{2}}\right)^{n_{I}}\prod^{n_{I}}_{n=1}d_{2}^{(n)}.\end{split} (24)

Similarly, based on equation (17), we can obtain the bandwidth needed for transmitting the global model in the downlink, as follows,

\begin{split}\overline{B}_{\text{down}}\geq\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\rceil\cdot\sum_{i=1}^{n}b_{\text{down}}^{i,r}f_{\mathit{SNR}_{\text{down}}}=\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\rceil\cdot\sum_{i=1}^{n}\frac{s}{T_{\text{down}}^{i,r}\log_{2}(1+\frac{P_{\text{down}}G(d_{1})}{\delta^{2}})}\cdot\frac{2d_{1}}{r_{0}^{2}}.\end{split} (25)

Furthermore, based on equation (19), given the local model accuracy loss \epsilon_{l} with \xi<\frac{2}{L}, the total amount of computing resources should meet the following constraint,

\begin{split}C_{\text{total}}\geq\lceil\frac{2}{(2-L\xi)\xi\gamma}\ln\frac{1}{\epsilon_{l}}\rceil\cdot\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\rceil\cdot\sum_{i=1}^{n}\frac{c_{i}S_{i}}{T_{i}}\cdot\frac{1}{\sqrt{2\pi}\sigma_{i}}\text{exp}\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right)\cdot\frac{(d_{0})^{2}-(d_{\text{min}})^{2}}{(r_{0})^{2}}.\end{split} (26)

VI-B Case 2: Sufficient Computing Resources and Insufficient Communication Resources

When computing resources are sufficient while communication resources are insufficient, we aim to reduce bandwidth consumption by reducing the number of communication rounds. In this case, the number of local iterations still follows Proposition 2, as computing resources are sufficient. However, Proposition 3 may not be satisfied due to the lack of communication resources, which decreases the number of communication rounds K. As a result, the required global model accuracy cannot be achieved. Specifically, the maximal number of communication rounds K_{\text{max}} is limited by the communication resources, i.e., K_{\text{max}}=\lfloor\min\{\frac{B_{\text{down}}^{\text{max}}}{\overline{B}_{\text{down}}^{r}},\frac{B_{\text{up}}^{\text{max}}}{\overline{B}_{\text{up}}^{r}}\}\rfloor, where B_{\text{up}}^{\text{max}} and B_{\text{down}}^{\text{max}} are the maximal available bandwidth that can be used for FL in the uplink and downlink respectively, and \overline{B}_{\text{up}}^{r}=\sum_{i=1}^{n}b_{\text{up}}^{i,r}f_{\mathit{SINR}_{\text{up}}} and \overline{B}_{\text{down}}^{r}=\sum_{i=1}^{n}b_{\text{down}}^{i,r}f_{\mathit{SNR}_{\text{down}}} are the mean bandwidth consumption in the uplink and downlink during one global iteration, respectively. To achieve the required global accuracy 1-\epsilon_{g} when the number of communication rounds is limited, based on Appendix C, we first give the following relationship,

F\left(g_{r}\left(\hat{\cal S},\mathit{SINR}_{\text{up}}\right)\right)-F\left(g^{*}\right)\leq\text{exp}\left(-K\left(\frac{(1-\epsilon_{l})\gamma^{2}\zeta}{2L^{2}}\right)\right)\left(F(g_{0})-F\left(g^{*}\right)\right), (27)

where we reasonably expect that the real achieved global accuracy loss \widetilde{\epsilon}_{g} can be expressed by

\widetilde{\epsilon}_{g}=\text{exp}\left(-K\left(\frac{(1-\widetilde{\epsilon}_{l})\gamma^{2}\zeta}{2L^{2}}\right)\right). (28)

Therefore, we have K=\lceil\frac{2L^{2}\ln\frac{1}{\widetilde{\epsilon}_{g}}}{(1-\widetilde{\epsilon}_{l})\gamma^{2}\zeta}\rceil, where \widetilde{\epsilon}_{l} is the realistic local model accuracy loss. Moreover, we can see that when \widetilde{\epsilon}_{g} is fixed, K will decrease if \widetilde{\epsilon}_{l} decreases. If we want to reduce K, and thus the bandwidth consumption, while keeping \widetilde{\epsilon}_{g} unchanged, we should decrease \widetilde{\epsilon}_{l} by increasing the number of local iterations. As a result, the computing resource consumption will increase. In other words, there exists a trade-off to some extent between the communication and computing resources for achieving a certain ML model accuracy. In addition, from the perspective of communication resources, the number of communication rounds K should satisfy K\leq K_{\text{max}}. Therefore, we have \widetilde{\epsilon}_{l}\leq 1-\frac{2L^{2}\ln\frac{1}{\widetilde{\epsilon}_{g}}}{K_{\text{max}}\gamma^{2}\zeta}.

Therefore, based on Proposition 2, the number of local iterations should meet

\tau\geq\lceil\frac{2}{(2-L\xi)\xi\gamma}\ln\frac{K_{\text{max}}\gamma^{2}\zeta}{K_{\text{max}}\gamma^{2}\zeta-2L^{2}\ln\frac{1}{\widetilde{\epsilon}_{g}}}\rceil. (29)

As a result, the total computing resources consumed for the FL task are given by

\begin{split}C_{\text{total}}\geq\sum_{i=1}^{n}\frac{c_{i}S_{i}}{T_{i}}\cdot\frac{1}{\sqrt{2\pi}\sigma_{i}}\text{exp}\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right)\cdot\frac{(d_{0})^{2}-(d_{\text{min}})^{2}}{(r_{0})^{2}}\cdot\lceil\frac{2}{(2-L\xi)\xi\gamma}\ln\frac{K_{\text{max}}\gamma^{2}\zeta}{K_{\text{max}}\gamma^{2}\zeta-2L^{2}\ln\frac{1}{\widetilde{\epsilon}_{g}}}\rceil\cdot\lceil\frac{2L^{2}\ln\frac{1}{\widetilde{\epsilon}_{g}}}{(1-\widetilde{\epsilon}_{l})\gamma^{2}\zeta}\rceil.\end{split} (30)
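To illustrate Case 2 numerically, with the same assumed constants as before (L=4, \gamma=1, \xi=0.25, \zeta=0.2, \widetilde{\epsilon}_{g}=0.1) and an assumed bandwidth budget allowing K_{\text{max}}=500 rounds: K_{\text{max}}\gamma^{2}\zeta=100 and 2L^{2}\ln(1/\widetilde{\epsilon}_{g})\approx 73.7, so the local accuracy loss must satisfy \widetilde{\epsilon}_{l}\leq 1-73.7/100\approx 0.26, which by equation (29) requires \tau\geq\lceil 8\ln(100/26.3)\rceil=11 local iterations per round; tightening the bandwidth budget (smaller K_{\text{max}}) pushes the admissible \widetilde{\epsilon}_{l} down and the local computing load up.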

VI-C Case 3: Sufficient Communication Resources and Insufficient Computing Resources

When communication resources are sufficient while computing resources are insufficient, we aim to reduce the computing resource consumption by reducing the number of local iterations. As communication resources are sufficient, the number of communication rounds still follows Proposition 3, while Proposition 2 may not be satisfied due to the lack of computing resources, which decreases the number of local iterations. As a result, the required local model accuracy cannot be achieved. In addition, from the perspective of computing resources, the number of local iterations should satisfy \tau\cdot\overline{C}_{\text{UE}}\leq\sum_{i=1}^{n}C_{i}, where C_{i} represents the maximal computing resources that can be used for local training on UE i. To achieve the required local accuracy when the number of local iterations is limited, we give the following relationship based on Appendix B,

G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\leq\text{exp}\left(-\tau\frac{\left(2-L\xi\right)\xi\gamma}{2}\right)\cdot\left(G_{i}\left(w_{r},h_{i}^{(r)(0)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\right), (31)

from which we can reasonably expect that the real local model accuracy loss is \widetilde{\epsilon}_{l}=\text{exp}\left(-\tau\frac{\left(2-L\xi\right)\xi\gamma}{2}\right), and hence the number of local iterations is \tau=\lceil\frac{2\ln\frac{1}{\widetilde{\epsilon}_{l}}}{(2-L\xi)\xi\gamma}\rceil.

Therefore, when \xi<\frac{2}{L}, we have \widetilde{\epsilon}_{l}\geq\text{exp}\left(\frac{(L\xi-2)\xi\gamma\sum_{i=1}^{n}C_{i}}{2\overline{C}_{\text{UE}}}\right). In other words, when the total amount of available computing resources decreases, the lower bound of \widetilde{\epsilon}_{l} increases. Moreover, based on \overline{C}_{\mathit{UE}} in Section IV-D, we can derive the lower bound of the number of communication rounds, as follows,

K\geq\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{\left(1-\text{exp}\left(\frac{(L\xi-2)\xi\gamma\sum_{i=1}^{n}C_{i}}{2\sum_{i=1}^{n}\frac{c_{i}S_{i}}{T_{i}}\cdot\frac{1}{\sqrt{2\pi}\sigma_{i}}\text{exp}\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right)}\right)\right)\gamma^{2}\zeta}\rceil. (32)

Therefore, the bandwidth for transmitting local models and the global model in the uplink and the downlink are respectively given by

\begin{split}\overline{B}_{\text{up}}\geq\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{\left(1-\text{exp}\left(\frac{(L\xi-2)\xi\gamma\sum_{i=1}^{n}C_{i}}{2\sum_{i=1}^{n}\frac{c_{i}S_{i}}{T_{i}}\cdot\frac{1}{\sqrt{2\pi}\sigma_{i}}\text{exp}\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right)}\right)\right)\gamma^{2}\zeta}\rceil\cdot\sum_{i=1}^{n}\frac{s}{T_{\text{up}}^{i,r}\log_{2}(1+\frac{P_{\text{up}}G(d_{1})}{\sum_{j=1}^{N_{I}}PG(d_{2}^{(j)})+\delta^{2}})}\cdot\\ \left(\frac{2d_{1}}{(d_{0})^{2}}\right)\cdot\sum_{n_{\cal A}=n_{I}}^{N}C_{n_{\cal A}}^{n_{I}}(1-\text{exp}\{-2t_{\text{up}}\lambda_{a}\})^{n_{I}}\cdot(\text{exp}\{-2t_{\text{up}}\lambda_{a}\})^{n_{\cal A}-n_{I}}\cdot\\ \frac{(\pi(d_{0})^{2}\lambda_{i})^{n_{\cal A}}}{n_{\cal A}!}\text{exp}\{-\pi(d_{0})^{2}\lambda_{i}\}\cdot\left(\frac{2}{(d_{0})^{2}}\right)^{n_{I}}\prod^{n_{I}}_{n=1}d_{2}^{(n)},\end{split} (33)

\overline{B}_{\text{down}}\geq\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{\left(1-\text{exp}\left(\frac{(L\xi-2)\xi\gamma\sum_{i=1}^{n}C_{i}}{2\sum_{i=1}^{n}\frac{c_{i}S_{i}}{T_{i}}\cdot\frac{1}{\sqrt{2\pi}\sigma_{i}}\text{exp}\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right)}\right)\right)\gamma^{2}\zeta}\rceil\cdot\sum_{i=1}^{n}\frac{s}{T_{\text{down}}^{i,r}\log_{2}(1+\frac{P_{\text{down}}G(d_{1})}{\delta^{2}})}\cdot\frac{2d_{1}}{(r_{0})^{2}}. (34)
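As a numerical illustration of Case 3 (again with the assumed constants L=4, \gamma=1, \xi=0.25, \zeta=0.2, \epsilon_{g}=0.1), suppose the computing budget only allows \sum_{i=1}^{n}C_{i}/\overline{C}_{\text{UE}}=16 local iterations per round. Then \widetilde{\epsilon}_{l}\geq\text{exp}(-0.25\times 16\times 1/2)=e^{-2}\approx 0.135, and equation (32) gives K\geq\lceil 73.7/(0.865\times 0.2)\rceil\approx 427 communication rounds; in other words, the scarcer the computing resources, the more communication rounds (and bandwidth) are needed.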

Therefore, based on the aforementioned analysis, we provide Proposition 4 on the resource consumption in the three cases discussed above.

Proposition 4.
(1) Case 1 - Sufficient Communication and Computing Resources: To achieve the required global and local model accuracies $\epsilon_{g}$ and $\epsilon_{l}$, the consumption of bandwidth in the uplink satisfies

\overline{B}_{\text{up}}\geq\Big\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\Big\rceil\cdot\left(\frac{2}{(d_{0})^{2}}\right)^{n_{I}}\prod_{n=1}^{n_{I}}d_{2}^{(n)}\cdot\sum_{i=1}^{n}\frac{s}{T_{\text{up}}^{i,r}\log_{2}\left(1+\frac{P_{\text{up}}G(d_{1})}{\sum_{j=1}^{N_{I}}PG(d_{2}^{(j)})+\delta^{2}}\right)}\cdot\frac{2d_{1}}{(d_{0})^{2}}\cdot\sum_{n_{\mathcal{A}}=n_{I}}^{N}C_{n_{\mathcal{A}}}^{n_{I}}\left(1-\exp\{1-2t_{\text{up}}\lambda_{a}\}\right)^{n_{I}}\left(\exp\{1-2t_{\text{up}}\lambda_{a}\}\right)^{n_{\mathcal{A}}-n_{I}}\cdot\frac{\left(\pi(d_{0})^{2}\lambda_{i}\right)^{n_{\mathcal{A}}}}{n_{\mathcal{A}}!}\exp\{-\pi(d_{0})^{2}\lambda_{i}\},

the consumption of bandwidth in the downlink satisfies

\overline{B}_{\text{down}}\geq\Big\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\Big\rceil\cdot\frac{2d_{1}}{(r_{0})^{2}}\cdot\sum_{i=1}^{n}\frac{s}{T_{\text{down}}^{i,r}\log_{2}\left(1+\frac{P_{\text{down}}G(d_{1})}{\delta^{2}}\right)},

and the consumption of computing resources used for local training satisfies

C_{\text{total}}\geq\Big\lceil\frac{2}{(2-L\xi)\xi\gamma}\ln\frac{1}{\epsilon_{l}}\Big\rceil\cdot\Big\lceil\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}\Big\rceil\cdot\sum_{i=1}^{n}\frac{c_{i}S_{i}}{T_{i}}\cdot\frac{1}{\sqrt{2\pi}\sigma_{i}}\exp\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right)\cdot\frac{(d_{0})^{2}-(d_{\text{min}})^{2}}{(r_{0})^{2}}.

(2) Case 2 - Sufficient Computing Resources and Insufficient Communication Resources: To achieve the required global model accuracy $\epsilon_{g}$, the consumption of computing resources satisfies

C_{\text{total}}\geq\sum_{i=1}^{n}\frac{c_{i}S_{i}}{T_{i}}\cdot\frac{1}{\sqrt{2\pi}\sigma_{i}}\exp\left(-\frac{(s_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right)\cdot\frac{(d_{0})^{2}-(d_{\text{min}})^{2}}{(r_{0})^{2}}\cdot\Big\lceil\frac{2}{(2-L\xi)\xi\gamma}\ln\frac{K_{\text{max}}\gamma^{2}\zeta}{K_{\text{max}}\gamma^{2}\zeta-2L^{2}\ln\frac{1}{\widetilde{\epsilon}_{g}}}\Big\rceil\cdot\Big\lceil\frac{2L^{2}\ln\frac{1}{\widetilde{\epsilon}_{g}}}{(1-\widetilde{\epsilon}_{l})\gamma^{2}\zeta}\Big\rceil.

(3) Case 3 - Sufficient Communication Resources and Insufficient Computing Resources: To achieve the required global model accuracy $\epsilon_{g}$, the consumption of bandwidth in the uplink satisfies (33), while the consumption of bandwidth in the downlink satisfies (34).
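To make the iteration-count factors in these expressions concrete, the following minimal Python sketch (not the paper's implementation; the function names are illustrative) evaluates the two ceiling terms that appear above: the number of local iterations per communication round given by Proposition 2 and the number of communication rounds given by Proposition 3. The example values $L=\gamma=\xi=\zeta=1/10$ and $\epsilon_{l}=\epsilon_{g}=0.2$ are taken from the simulation setting in Section VII; since these are sufficient-condition bounds, they can be looser than the convergence behavior observed in the experiments.

import math

def local_iterations(eps_l, L=0.1, gamma=0.1, xi=0.1):
    """Lower bound on local GD iterations per round: 2/((2 - L*xi)*xi*gamma) * ln(1/eps_l)."""
    assert xi < 2.0 / L, "the step size must satisfy xi < 2/L"
    return math.ceil(2.0 / ((2.0 - L * xi) * xi * gamma) * math.log(1.0 / eps_l))

def global_rounds(eps_g, eps_l, L=0.1, gamma=0.1, zeta=0.1):
    """Lower bound on communication rounds: 2*L^2*ln(1/eps_g) / ((1 - eps_l)*gamma^2*zeta)."""
    assert 0.0 < zeta < gamma / L, "zeta must satisfy 0 < zeta < gamma/L"
    return math.ceil(2.0 * L ** 2 * math.log(1.0 / eps_g) / ((1.0 - eps_l) * gamma ** 2 * zeta))

# Example with eps_l = eps_g = 0.2 and L = gamma = xi = zeta = 1/10.
tau = local_iterations(eps_l=0.2)        # local iterations per communication round
K = global_rounds(eps_g=0.2, eps_l=0.2)  # communication rounds
print(tau, K)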

VII Numerical Results and Discussion

In this section, we verify our analytical modeling via numerical simulations by (1) verifying the analytical results for the transmission success probability (uplink and downlink) and the resource consumption (bandwidth and computing resources); (2) measuring the performance of the FL settings; and (3) examining the trade-off between computing and communication resources under the FL framework.

VII-A Simulation Setting

We consider an FL empowered wireless network composed of multiple randomly placed UEs and one central BS with a cloud server that serves as the FL model aggregator. The coverage of the BS is a circular area with a radius of 1 km, and the radius of the interfering area is set to 100 m. The transmit power of the UEs and the serving BS is set to 20 dBm and 43 dBm, respectively [20], and the noise power is set to -173 dBm [20]. The density of interfering UEs $\lambda_{a}$ is set to $1~\text{UE}/\text{m}^{2}$, and $t_{\text{up}}$ is randomly chosen within $[1,3]$ ms. The path loss is modeled as $g(D_{1})=34+40\log(D_{1})$ [9]. The number of CPU cycles required for computing one data sample is randomly chosen within $[1,4]\times 10^{4}$ cycles/sample [24]. $\mu_{i}$ and $\sigma_{i}$ are randomly chosen within $[1000,10000]$ and $[0.2,0.5]$, respectively.

We consider using FL to solve a multi-class classification problem over the MNIST dataset [32], where each UE's local dataset is randomly split into training and testing sets with a 75-25 ratio [28]. Moreover, we use a fully connected two-layer network built with PyTorch, where the sizes of the input, hidden, and output layers are set to 784 ($28\times 28$), 40, and 10, respectively. The activation function is ReLU, as it can greatly accelerate the convergence of gradient descent and increase the number of activated neurons [33, 34]. A constant learning rate is the conventional choice [35, 36, 37]; inspired by the hyper-parameter analysis and the corresponding experimental results in [35, 36], we set the learning rate $\xi=0.03$. According to our neural network settings, the transmitted model size $s$ is around 1.156 MB when using 32-bit floating-point representation. In addition, we set $L=1/10$, $\gamma=1/10$, $\xi=1/10$, $\zeta=1/10$ [24].
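To make the network and training configuration concrete, the following is a minimal PyTorch sketch of the local model described above: a fully connected 784-40-10 network with ReLU activations, trained with plain gradient descent at a constant learning rate of 0.03. The class and function names are illustrative; this is not the code used to produce the results in this section.

import torch
import torch.nn as nn

class LocalMLP(nn.Module):
    """Two-layer fully connected network: 784 -> 40 -> 10 with ReLU."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28 * 28, 40),  # input layer (784) to hidden layer (40)
            nn.ReLU(),
            nn.Linear(40, 10),       # hidden layer to output layer (10 classes)
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

def local_training(model, data_loader, num_epochs):
    """Run several epochs of plain SGD on one UE's local MNIST split."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.03)  # constant learning rate
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_epochs):
        for images, labels in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model.state_dict()  # local parameters to be sent to the aggregator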

VII-B Simulation Results

VII-B1 Verifying Analytical Results

First, we examine the local and global model transmission success probabilities with varying UE density. In these two simulations, based on the PPP model, we randomly generate 30 point distributions for each UE density (in intervals of 0.1), and the simulation results of ${\rm Pr}(\text{SINR}_{\text{up}}>\beta_{\text{up}})$ and ${\rm Pr}(\text{SNR}_{\text{down}}>\beta_{\text{down}})$ are averaged over these 30 channel instances for each UE density. Fig. 3 and Fig. 4 show ${\rm Pr}(\text{SINR}_{\text{up}}>\beta_{\text{up}})$ in the uplink and ${\rm Pr}(\text{SNR}_{\text{down}}>\beta_{\text{down}})$ in the downlink, respectively, for both analytical and simulation results with varying UE density under different threshold parameters. The analytical results of ${\rm Pr}(\text{SINR}_{\text{up}}>\beta_{\text{up}})$ and ${\rm Pr}(\text{SNR}_{\text{down}}>\beta_{\text{down}})$ are computed based on equations (14) and (15), respectively. From Fig. 3 and Fig. 4, we can see that the analytical curves match the simulation results closely for both the uplink and the downlink. Moreover, the smaller the threshold ($\beta_{\text{up}}$ in Fig. 3, $\beta_{\text{down}}$ in Fig. 4), the larger the transmission success probability.
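For concreteness, the following is a minimal Monte Carlo sketch of how such a verification can be organized: for each random topology, interfering UEs are drawn from a PPP of density $\lambda_{a}$ inside the interfering area, the path-loss model $34+40\log(D)$ is applied, and ${\rm Pr}(\text{SINR}_{\text{up}}>\beta_{\text{up}})$ is estimated by averaging over the random draws. The interferer and tagged-UE placements, the number of draws, and the specific values in the example call are illustrative assumptions rather than the paper's exact simulation code.

import numpy as np

rng = np.random.default_rng(0)

def linear_gain(d_m):
    """Linear channel gain corresponding to the path loss 34 + 40*log10(d) dB."""
    pl_db = 34.0 + 40.0 * np.log10(np.maximum(d_m, 1.0))
    return 10.0 ** (-pl_db / 10.0)

def uplink_success_prob(beta_db, lam_a, n_drops=2000,
                        r_cell=1000.0, r_intf=100.0,
                        p_ue_dbm=20.0, noise_dbm=-173.0):
    """Monte Carlo estimate of Pr(SINR_up > beta_up), averaged over random topologies."""
    beta = 10.0 ** (beta_db / 10.0)
    p_ue = 10.0 ** ((p_ue_dbm - 30.0) / 10.0)           # UE transmit power in watts
    noise = 10.0 ** ((noise_dbm - 30.0) / 10.0)         # noise power in watts
    successes = 0
    for _ in range(n_drops):
        d1 = r_cell * np.sqrt(rng.uniform())            # tagged UE uniform in the cell
        n_i = rng.poisson(lam_a * np.pi * r_intf ** 2)  # PPP: Poisson number of interferers
        d2 = r_intf * np.sqrt(rng.uniform(size=n_i))    # interferer distances (assumed geometry)
        interference = np.sum(p_ue * linear_gain(d2))
        sinr = p_ue * linear_gain(d1) / (interference + noise)
        successes += float(sinr > beta)
    return successes / n_drops

# Illustrative call; the threshold and density values are placeholders.
p_succ = uplink_success_prob(beta_db=-15.0, lam_a=1e-3)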

Figure 3: Comparison of the probability of successful transmission in the uplink.
Figure 4: Comparison of the probability of successful transmission in the downlink.
Figure 5: Comparison of bandwidth consumption in the uplink.

Next, we examine the bandwidth consumption in the uplink and the downlink for both analytical and simulation results with respect to the global accuracy loss $\epsilon_{g}$. We first randomly select a UE density within $[1,2]$ and randomly generate 10 point distributions under that UE density. Then, we train the same FL task (i.e., classification on the MNIST dataset) for each point distribution, and the simulation results are averaged over the 10 point distributions. Fig. 5 and Fig. 6 show how the bandwidth consumption in the uplink and the downlink, respectively, changes with the global accuracy loss. From Fig. 5 and Fig. 6, we can see that the analytical curves match the simulation results closely for both the uplink and the downlink. Moreover, the bandwidth consumption in both the uplink and the downlink decreases as the global accuracy loss increases. In addition, we find that a lower local accuracy (i.e., a larger $\epsilon_{l}$) leads to more bandwidth consumption to guarantee a specific global accuracy when training on i.i.d. data. Specifically, as shown in Fig. 5, in the uplink, the amount of bandwidth consumed to guarantee $\epsilon_{l}=0.3$ is $0.144\times 10^{3}$ Mbps more than that to guarantee $\epsilon_{l}=0.2$ on average, while the amount of bandwidth consumed to guarantee $\epsilon_{l}=0.4$ is $0.089\times 10^{3}$ Mbps more than that to guarantee $\epsilon_{l}=0.3$ on average. As shown in Fig. 6, in the downlink, the amount of bandwidth consumed to guarantee $\epsilon_{l}=0.3$ is $0.54\times 10^{4}$ Mbps more than that to guarantee $\epsilon_{l}=0.2$ on average, while the amount of bandwidth consumed to guarantee $\epsilon_{l}=0.4$ is $0.28\times 10^{4}$ Mbps more than that to guarantee $\epsilon_{l}=0.3$ on average. The reason is that a lower local accuracy requires more communication rounds to aggregate the local models to achieve a certain global accuracy, and thus consumes more bandwidth.

Figure 6: Comparison of bandwidth consumption in the downlink.
Figure 7: Comparison of the computing resource consumption.
Figure 8: Local training during each communication round.

In the following, we examine the computing resource consumption for both analytical and simulation results with respect to the density of UEs. Specifically, the analytical results of the computing resource consumption are computed based on equation (19), while the simulation results are averaged over 10 randomly generated point distributions for each UE density. Fig. 7 shows how the computing resource consumption changes with the density of UEs. From Fig. 7, we can see that the amount of computing resource consumption first increases and then decreases with the density of UEs. Specifically, the amount of computing resource consumption increases with the UE density when it is approximately below 2 (i.e., $\lambda_{i}\leq 2$), as the number of UEs that participate in local training increases with the UE density. When approximately $\lambda_{i}\geq 2$, the amount of computing resource consumption decreases with the UE density, as poor SNR causes some UEs to fail to successfully receive the global model. As a result, the number of UEs that participate in local training decreases in the next communication round, and thus the amount of computing resource consumption decreases. Moreover, we also find that achieving a higher local accuracy requires more computing resources to train the local models. Specifically, the amount of computing resources consumed to guarantee $\epsilon_{l}=0.2$ is $0.27\times 10^{12}$ cycles/s more than that to guarantee $\epsilon_{l}=0.3$ on average, while the amount of computing resources consumed to guarantee $\epsilon_{l}=0.3$ is $15\times 10^{12}$ cycles/s more than that to guarantee $\epsilon_{l}=0.4$ on average.

VII-B2 Measuring the performance of FL settings

First, we examine the convergence property through simulation experiments. In this experiment, the UE density is randomly chosen within $[1,2]$ and the data points are randomly generated (the same settings are used for the following simulations). Moreover, we set $\epsilon_{g}=0.2$, $\beta_{\text{up}}=-15$ dB, and $\beta_{\text{down}}=15$ dB. As shown in Fig. 8, we randomly choose 3 UEs to observe the evolution of the local optimization function, which converges in about 40 epochs. In addition, as shown in Fig. 9, the global loss function converges in around 12 communication rounds.

Figure 9: Convergence.
Figure 10: Comparison of the global accuracy loss.
Figure 11: The relationship between training accuracy and testing accuracy.

Next, we examine the global accuracy loss versus the number of communication rounds with the local accuracy loss fixed at $\epsilon_{l}=0.2$. In this simulation, we still set $\beta_{\text{up}}=-15$ dB and $\beta_{\text{down}}=15$ dB. Fig. 10 shows how the global model accuracy loss changes with the number of communication rounds. From Fig. 10, we can see that the global model accuracy loss decreases with the number of communication rounds. Moreover, the difference between the actual global accuracy loss and the target value of 0.2 is always within 0.1 when the learning converges. Note that the SINR and SNR practically affect the global aggregation and the local training, respectively.

Next, we examine whether the well-trained model is effective on the test datasets. In this simulation experiment, the test datasets are drawn from the same distribution as the training data. We randomly select 3 UEs and calculate the testing accuracy every 2 communication rounds. As shown in Fig. 11, the testing accuracy increases with the training accuracy, where $\epsilon_{l}$ denotes the local training accuracy and $\hat{\epsilon}_{l}$ denotes the testing accuracy. We can also see that, in general, the difference between the training accuracy and the testing accuracy is within $[0,0.1]$.

VII-B3 Examining the trade-off between the computing resources and communication resources under FL framework

First, we examine the relationship between the global model accuracy and the available bandwidth in the uplink. In this simulation experiment, we first assume that the bandwidth in the downlink and the computing resources are sufficient, and then we fix the required local model accuracy ($0.1, 0.2, 0.3$) to verify the relationship between the global model accuracy and the amount of available bandwidth in the uplink. From Fig. 12, we can see that the global model accuracy increases sharply in the beginning and then increases slowly with the amount of available bandwidth in the uplink, as the number of local UEs that can participate in global aggregation quickly increases with the amount of bandwidth in the beginning. Once the amount of bandwidth becomes sufficient, it has little effect on the transmission success probability of the local models, and thus the global model accuracy remains fairly steady. Moreover, in the beginning, a higher local model accuracy (lower $\epsilon_{l}$) leads to a higher global model accuracy. Specifically, the global model accuracy when $\epsilon_{l}=0.1$ is $8.5\%$ higher than that when $\epsilon_{l}=0.2$, while the global model accuracy when $\epsilon_{l}=0.2$ is $17.2\%$ higher than that when $\epsilon_{l}=0.3$.

Figure 12: The relationship between available bandwidth in the uplink and global model accuracy.
Figure 13: The relationship between available computing resources and local model accuracy.
Figure 14: The trade-off between the computing resources and bandwidth.

After that, we examine the relationship between the local model accuracy and the computing resources, as shown in Fig. 13, where we randomly select 3 different UEs and assume that the bandwidth in the uplink/downlink is sufficient. From Fig. 13, we can see that the local model accuracy increases quickly in the beginning and then remains fairly steady as the amount of computing resources grows. The reason is similar to that in Fig. 12, i.e., more computing resources lead to more local iterations in the beginning, while sufficient computing resources have little additional effect on the local iterations. Note that the fluctuations of the curves in Fig. 13 arise because the local model accuracy in Fig. 13 is recorded per local iteration, whereas the global accuracy in Fig. 12 is recorded per communication round, each composed of 30 local iterations.

Finally, we examine the trend of the global model accuracy with respect to the amount of computing resources and the amount of bandwidth. Fig. 14 shows the relationship among the global model accuracy, computing resources, and bandwidth, where the trade-off between the amount of computing resources and the amount of bandwidth is verified. As shown in Fig. 14, both the amount of bandwidth and the amount of computing resources affect the global model accuracy, where we can flexibly adjust the amount of computing resources and the amount of bandwidth to guarantee a specific global model accuracy. Specifically, when we fix the amount of bandwidth used for transmitting the local models, we can increase the global model accuracy by increasing the amount of computing resources. When we fix the amount of computing resources used for local training, we can increase the global model accuracy by increasing the amount of bandwidth.

VIII Conclusion

Wireless edge network intelligence enabled by FL has been widely acknowledged as a promising means to address a wide range of challenging network issues. In this paper, we have theoretically analyzed the model accuracy that can be achieved using FL and the amount of resources consumed to guarantee a certain local/global accuracy. Specifically, we have derived explicit expressions for the model accuracy under the FL framework as a function of the amount of computing and communication resources in FL empowered wireless edge networks. Numerical results validate the effectiveness of our theoretical modeling. The modeling and results provide a fundamental understanding of the trade-off between the learning performance and the consumed resources, which is useful for promoting FL empowered wireless network edge intelligence.

Appendix A Calculate μI\mu_{I} and σI\sigma_{I}

\begin{split}
\mu_{I}=\overline{n}_{I}E(I_{i})=\overline{n}_{I}\int_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}P_{\text{up}}g(d_{2}^{(i)})f_{D_{2}^{(i)}}(d_{2}^{(i)})\,d(d_{2}^{(i)})=\overline{n}_{I}\int_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}P_{\text{up}}g(d_{2}^{(i)})\cdot\frac{2d_{2}^{(i)}}{(d_{0})^{2}}\,d(d_{2}^{(i)})\\
=\frac{2P_{\text{up}}\overline{n}_{I}}{(d_{0})^{2}}\int_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}g(d_{2}^{(i)})\,d_{2}^{(i)}\,d(d_{2}^{(i)})\overset{(b)}{=}\frac{2P_{\text{up}}\overline{n}_{I}}{(d_{0})^{2}}\int_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}d_{2}^{(i)}\,dG(d_{2}^{(i)})\\
=\frac{2P_{\text{up}}\overline{n}_{I}}{(d_{0})^{2}}\Big[d_{2}^{(i)}G(d_{2}^{(i)})\Big|_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}-\int_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}G(d_{2}^{(i)})\,d(d_{2}^{(i)})\Big],
\end{split}

where $\int_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}G(d_{2}^{(i)})\,d(d_{2}^{(i)})=G_{2}(d_{0})-G_{2}(d_{\text{min}})$. Therefore, $\mu_{I}$ can be given by

\mu_{I}=\frac{2P_{\text{up}}\overline{n}_{I}}{(d_{0})^{2}}\left\{\left[d_{0}G(d_{0})-d_{\text{min}}G(d_{\text{min}})\right]-\left[G_{2}(d_{0})-G_{2}(d_{\text{min}})\right]\right\},

where dmind_{\text{min}} is the minimum distance between UEs and BS, and we define g()=Gg(\cdot)=G^{\prime} and G()=G2()G(\cdot)={G}_{2}^{\prime}(\cdot).

\sigma_{I}=\sqrt{\overline{n}_{I}}D(I_{i})=\sqrt{\overline{n}_{I}}\left[E(I_{i}^{2})-E^{2}(I_{i})\right]=\sqrt{\overline{n}_{I}}\left[\int_{d_{2}^{(i)}=d_{\text{min}}}^{d_{0}}2P_{\text{up}}^{2}g^{2}(d_{2}^{(i)})\frac{d_{2}^{(i)}}{(d_{0})^{2}}\,d(d_{2}^{(i)})-\left(\frac{\mu_{I}}{\overline{n}_{I}}\right)^{2}\right].
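As a sanity check of the integration-by-parts identity above, the following sketch compares the closed-form expression for $\mu_{I}$ with direct numerical integration, assuming the linear gain corresponding to the path-loss model $34+40\log_{10}(d)$ dB, i.e., $g(d)=10^{-3.4}d^{-4}$, so that $G$ and $G_{2}$ are its successive antiderivatives as defined above. The values of $P_{\text{up}}$, $\overline{n}_{I}$, $d_{\text{min}}$, and $d_{0}$ are illustrative placeholders.

import numpy as np

P_up = 0.1            # 20 dBm expressed in watts (illustrative)
n_bar_I = 30.0        # mean number of interferers (illustrative)
d_min, d_0 = 1.0, 100.0

g = lambda d: 10 ** (-3.4) * d ** (-4.0)            # linear gain g(d)
G = lambda d: -(10 ** (-3.4) / 3.0) * d ** (-3.0)   # antiderivative of g
G2 = lambda d: (10 ** (-3.4) / 6.0) * d ** (-2.0)   # antiderivative of G

# Direct numerical integration of n_bar_I * E(I_i) = n_bar_I * int P_up g(d) 2d/d_0^2 dd.
d = np.linspace(d_min, d_0, 200001)
integrand = P_up * g(d) * 2.0 * d / d_0 ** 2
mu_direct = n_bar_I * np.sum(0.5 * (integrand[:-1] + integrand[1:]) * np.diff(d))

# Closed form: (2 P_up n_bar_I / d_0^2) * {[d_0 G(d_0) - d_min G(d_min)] - [G_2(d_0) - G_2(d_min)]}.
mu_closed = (2.0 * P_up * n_bar_I / d_0 ** 2) * (
    (d_0 * G(d_0) - d_min * G(d_min)) - (G2(d_0) - G2(d_min))
)

print(mu_direct, mu_closed)  # the two values should agree closely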

Appendix B Proof of Proposition 2

First, based on the definition of $G_{i}(\cdot)$ in problem (20), we have $\nabla^{2}G_{i}(\cdot)=\nabla^{2}F_{i}(\cdot)$. Therefore, $G_{i}(\cdot)$ also satisfies the $L$-smoothness, $\gamma$-strong convexity, and twice-differentiability assumptions, i.e.,

\parallel\nabla G_{i}(w)-\nabla G_{i}(w^{\prime})\parallel\leq L\parallel w-w^{\prime}\parallel, \quad (B.1)
\parallel\nabla G_{i}(w)-\nabla G_{i}(w^{\prime})\parallel\geq\gamma\parallel w-w^{\prime}\parallel, \quad (B.2)
\gamma I\leq\nabla^{2}G_{i}(w)\leq LI, \quad (B.3)
G_{i}(w)\geq G_{i}(w^{\prime})+\nabla G_{i}(w^{\prime})^{\text{T}}(w-w^{\prime})+\frac{\gamma}{2}\parallel w-w^{\prime}\parallel^{2}. \quad (B.4)

Then, given $w_{r}$, we rewrite $G_{i}(\cdot)$ using the second-order Taylor expansion as

\begin{split}
G_{i}\left(w_{r},h_{i}^{(r)(t+1)}\right)=G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)+\left(h_{i}^{(r)(t+1)}-h_{i}^{(r)(t)}\right)^{\text{T}}\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)+\\
\frac{1}{2}\left(h_{i}^{(r)(t+1)}-h_{i}^{(r)(t)}\right)^{\text{T}}\nabla^{2}G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\left(h_{i}^{(r)(t+1)}-h_{i}^{(r)(t)}\right).
\end{split}

As in the GD method, we have

h_{i}^{(r)(t+1)}=h_{i}^{(r)(t)}-\xi\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right).

Therefore, based on (B.1) and (B.3), we have

\begin{split}
G_{i}\left(w_{r},h_{i}^{(r)(t+1)}\right)=G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-\xi\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\parallel^{2}+\frac{\xi^{2}}{2}\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)^{\text{T}}\nabla^{2}G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\\
\overset{(B.3)}{\leq}G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-\xi\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\parallel^{2}+\frac{L\xi^{2}}{2}\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\parallel^{2}\\
=G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-\frac{\left(2-L\xi\right)\xi}{2}\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\parallel^{2}.
\end{split} \quad (B.5)

Next, we find a lower bound on $\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\parallel^{2}$. For the optimal $h_{i}^{(r)^{*}}$ of $G_{i}\left(w_{r},h_{i}^{(r)}\right)$, we always have $\nabla G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)=0$. Therefore, we have

\begin{split}
\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)\parallel^{2}=\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-\nabla G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\parallel^{2}\\
\overset{(B.2)}{\geq}\gamma\parallel\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-\nabla G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\parallel\parallel(w_{r},h_{i}^{(r)(t)})-(w_{r},h_{i}^{(r)^{*}})\parallel\\
\geq\gamma\left(\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-\nabla G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\right)^{\text{T}}\left((w_{r},h_{i}^{(r)(t)})-(w_{r},h_{i}^{(r)^{*}})\right)\\
=\gamma\nabla G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)^{\text{T}}\left((w_{r},h_{i}^{(r)(t)})-(w_{r},h_{i}^{(r)^{*}})\right)\\
\overset{(B.4)}{\geq}\gamma\left(G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\right).
\end{split} \quad (B.6)

Therefore, combining (B.5) and (B.6), we have

\begin{split}
G_{i}\left(w_{r},h_{i}^{(r)(t+1)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\leq G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)-\frac{\left(2-L\xi\right)\xi\gamma}{2}\left(G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\right)\\
\overset{(\text{c})}{\leq}\left(1-\frac{\left(2-L\xi\right)\xi\gamma}{2}\right)^{(\tau+1)}\left(G_{i}\left(w_{r},h_{i}^{(r)(0)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\right)\\
\leq\exp\left(-(\tau+1)\frac{\left(2-L\xi\right)\xi\gamma}{2}\right)\left(G_{i}\left(w_{r},h_{i}^{(r)(0)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\right),
\end{split}

where (c) can be obtained from $1-x\leq\exp(-x)$. Therefore, to ensure that $G_{i}\left(w_{r},h_{i}^{(r)(t)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\leq\epsilon_{l}\left(G_{i}\left(w_{r},h_{i}^{(r)(0)}\right)-G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)\right)$, we have

\epsilon_{l}\geq\exp\left(-\tau\frac{\left(2-L\xi\right)\xi\gamma}{2}\right).

Therefore, when $\xi<\frac{2}{L}$ (so that $(2-L\xi)\xi\gamma>0$), we have

\tau\geq\frac{2}{(2-L\xi)\xi\gamma}\ln\left(\frac{1}{\epsilon_{l}}\right).

Appendix C Proof of Proposition 3

Under Assumption 1, Assumption 2, and Assumption 3, the following conditions hold:

\frac{1}{L}\parallel\nabla F_{i}(w_{r}+h_{i})-\nabla F_{i}(w_{r})\parallel^{2}\leq\left(\nabla F_{i}(w_{r}+h_{i})-\nabla F_{i}(w_{r})\right)^{\text{T}}h_{i}\leq\frac{1}{\gamma}\parallel\nabla F_{i}(w_{r}+h_{i})-\nabla F_{i}(w_{r})\parallel^{2}, \quad (C.1)
\parallel\nabla F(w)\parallel^{2}\geq\gamma\left(F(w)-F(w^{*})\right). \quad (C.2)

The proof of (C.2) is similar to that of (B.6) in Appendix B. For (C.1), based on the Lagrange mean value theorem, there always exists a $w$ such that

\nabla F_{i}(w_{r}+h_{i})-\nabla F_{i}(w_{r})=\nabla^{2}F_{i}(w)h_{i}. \quad (C.3)

For the optimal solution of the local optimization problem, we always have

\nabla F_{i}\left(w_{r}+h_{i}^{(r)^{*}}\right)-\left(\nabla F_{i}(w_{r})-\zeta\nabla F(w_{r})\right)=0. \quad (C.4)

With the above equalities and inequalities, we now start to prove Proposition 3.

\begin{split}
F\left(w_{r+1}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)=F\left(\frac{\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}w_{i}(r\tau)}{\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\right)=F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)+\frac{\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}h_{i}^{(r)(\tau)}}{\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\right)\\
\overset{(A1)}{\leq}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)+\frac{1}{\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\nabla F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)^{\text{T}}h_{i}^{(r)(\tau)}+\frac{L}{2\left(\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\right)^{2}}\parallel\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}h_{i}^{(r)(\tau)}\parallel^{2}.
\end{split} \quad (C.5)

As

G_{i}(w_{r},h_{i})=F_{i}(w_{r}+h_{i})-\left(\nabla F_{i}(w_{r})-\zeta\nabla F(w_{r})\right)^{\text{T}}h_{i}, \quad (C.6)
\frac{1}{\left(\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\right)^{2}}\parallel\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}h_{i}^{(r)(\tau)}\parallel^{2}\overset{\hat{\mathcal{S}}_{i}\geq 0}{\leq}\frac{1}{\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\parallel h_{i}^{(r)(\tau)}\parallel^{2}. \quad (C.7)

Therefore, we have

\begin{split}
F\left(w_{r+1}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)\leq F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)+\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)(\tau)}\right)-\\
F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)+h_{i}^{(r)(\tau)}\right)+\nabla F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)^{\text{T}}h_{i}^{(r)(\tau)}\Big\}+\frac{L}{2\left(\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\right)^{2}}\parallel\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}h_{i}^{(r)(\tau)}\parallel^{2}\\
\overset{(A2)}{\leq}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)+\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)(\tau)}\right)-\\
F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{\gamma}{2}\parallel h_{i}^{(r)(\tau)}\parallel^{2}\Big\}+\frac{L}{2\left(\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\right)^{2}}\parallel\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}h_{i}^{(r)(\tau)}\parallel^{2}\\
\overset{(C.7)}{\leq}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)+\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)(\tau)}\right)-F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{\gamma-L\zeta}{2}\parallel h_{i}^{(r)(\tau)}\parallel^{2}\Big\}\\
\overset{(C.6)}{=}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)+\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)(\tau)}\right)-G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)^{*}}\right)-\\
G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),0\right)+G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)^{*}}\right)-\frac{\gamma-L\zeta}{2}\parallel h_{i}^{(r)(\tau)}\parallel^{2}\Big\}\\
\leq F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{\frac{\gamma-L\zeta}{2}\parallel h_{i}^{(r)(\tau)}\parallel^{2}+\left(1-\epsilon_{l}\right)\left[G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),0\right)-G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)^{*}}\right)\right]\Big\}.
\end{split}

According to equation (C.6), we can calculate $G_{i}\left(w_{r},0\right)$ and $G_{i}\left(w_{r},h_{i}^{(r)^{*}}\right)$ as follows:

G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),0\right)=F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right),
\begin{split}
G_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right),h_{i}^{(r)^{*}}\right)=F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)+h_{i}^{(r)^{*}}\right)\\
-\left[\nabla F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\zeta\nabla F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)\right]^{\text{T}}h_{i}^{(r)^{*}}.
\end{split}

Therefore, we have

\begin{split}
F\left(w_{r+1}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)\leq F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{\frac{\gamma-L\zeta}{2}\parallel h_{i}^{(r)(\tau)}\parallel^{2}+\\
\left(1-\epsilon_{l}\right)\big\{F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)+h_{i}^{(r)^{*}}\right)+\left[\nabla F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\zeta\nabla F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)\right]^{\text{T}}h_{i}^{(r)^{*}}\big\}\Big\}\\
\overset{(C.4)}{=}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{\frac{\gamma-L\zeta}{2}\parallel h_{i}^{(r)(\tau)}\parallel^{2}+\left(1-\epsilon_{l}\right)\big\{F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\\
F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)+h_{i}^{(r)^{*}}\right)+\nabla F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)+h_{i}^{(r)^{*}}\right)^{\text{T}}h_{i}^{(r)^{*}}\big\}\Big\}\\
\overset{(A2)}{\leq}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{1}{\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\Big\{\frac{\gamma-L\zeta}{2}\parallel h_{i}^{(r)(\tau)}\parallel^{2}+\frac{(1-\epsilon_{l})\gamma}{2}\parallel h_{i}^{(r)^{*}}\parallel^{2}\Big\}\\
\leq F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{(1-\epsilon_{l})\gamma}{2\zeta\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\parallel h_{i}^{(r)^{*}}\parallel^{2}\\
\overset{(A1)}{\leq}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{(1-\epsilon_{l})\gamma}{2\zeta L^{2}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\parallel\nabla F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)+h_{i}^{(r)^{*}}\right)-\nabla F_{i}\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)\parallel^{2}\\
\overset{(C.4)}{=}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{(1-\epsilon_{l})\gamma\zeta}{2L^{2}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\parallel\nabla F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)\parallel^{2}\\
\overset{(C.2)}{\leq}F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{(1-\epsilon_{l})\gamma\zeta}{2L^{2}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}}\sum_{i=1}^{n}\hat{\mathcal{S}}_{i}\gamma\left(F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-F\left(w^{*}\right)\right)\\
=F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-\frac{(1-\epsilon_{l})\gamma^{2}\zeta}{2L^{2}}\left(F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-F\left(w^{*}\right)\right).
\end{split}

Therefore, based on the above inequality, we have

\begin{split}
F\left(w_{r+1}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-F\left(w^{*}\right)\leq F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-F\left(w^{*}\right)-\frac{(1-\epsilon_{l})\gamma^{2}\zeta}{2L^{2}}\left(F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-F\left(w^{*}\right)\right)\\
=\left(1-\frac{(1-\epsilon_{l})\gamma^{2}\zeta}{2L^{2}}\right)\left(F\left(w_{r}\left(\hat{\mathcal{S}},\mathit{SINR}_{\text{up}}\right)\right)-F\left(w^{*}\right)\right)\\
\leq\left(1-\frac{(1-\epsilon_{l})\gamma^{2}\zeta}{2L^{2}}\right)^{(r+1)}\left(F(w_{0})-F\left(w^{*}\right)\right)\overset{(\text{c})}{\leq}\exp\left(-(r+1)\frac{(1-\epsilon_{l})\gamma^{2}\zeta}{2L^{2}}\right)\left(F(w_{0})-F\left(w^{*}\right)\right).
\end{split}

To guarantee the global accuracy, i.e., $F(w_{r})-F(w^{*})\leq\epsilon_{g}\left(F(w_{0})-F(w^{*})\right)$, we have

\epsilon_{g}\geq\exp\left(-r\frac{(1-\epsilon_{l})\gamma^{2}\zeta}{2L^{2}}\right).

Therefore, when $0<\zeta<\frac{\gamma}{L}$, we have $r\geq\frac{2L^{2}\ln\frac{1}{\epsilon_{g}}}{(1-\epsilon_{l})\gamma^{2}\zeta}$.

References

  • [1] B. Jovanović, “Internet of Things Statistics for 2021 – Taking Things Apart.” [Online]. Available: https://dataprot.net/statistics/iot-statistics/
  • [2] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated Learning in Mobile Edge Networks: A Comprehensive Survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020.
  • [3] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated Machine Learning: Concept and Applications,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–19, 2019.
  • [4] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim, “Communication-efficient On-device Machine Learning: Federated Distillation and Augmentation under Non-iid Private Data,” arXiv preprint arXiv:1811.11479, 2018.
  • [5] C. He and M. Annavaram, “Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge,” in Proceedings of Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.
  • [6] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence Time Optimization for Federated Learning over Wireless Networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 4, pp. 2457–2471, 2020.
  • [7] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Wireless Communications for Collaborative Federated Learning,” IEEE Communications Magazine, vol. 58, no. 12, pp. 48–54, 2020.
  • [8] L. U. Khan, S. R. Pandey, N. H. Tran, W. Saad, Z. Han, M. N. H. Nguyen, and C. S. Hong, “Federated Learning for Edge Networks: Resource Optimization and Incentive Mechanism,” IEEE Communications Magazine, vol. 58, no. 10, pp. 88–93, 2020.
  • [9] Y.-J. Liu, G. Feng, Y. Sun, S. Qin, and Y.-C. Liang, “Device Association for RAN Slicing based on Hybrid Federated Deep Reinforcement Learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 12, pp. 15 731–15 745, 2020.
  • [10] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, “Cost-Effective Federated Learning Design,” in Proceedings of IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–10.
  • [11] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated Learning via Over-the-Air Computation,” IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2022–2035, 2020.
  • [12] W. Xia, W. Wen, K.-K. Wong, T. Q. Quek, J. Zhang, and H. Zhu, “Federated-Learning-Based Client Scheduling for Low-Latency Wireless Communications,” IEEE Wireless Communications, vol. 28, no. 2, pp. 32–38, 2021.
  • [13] T. Sery, N. Shlezinger, K. Cohen, and Y. C. Eldar, “COTAF: Convergent Over-the-Air Federated Learning,” pp. 1–6, 2020.
  • [14] D. Wen, K.-J. Jeon, and K. Huang, “Federated Dropout–A Simple Approach for Enabling Federated Learning on Resource Constrained Devices,” arXiv preprint arXiv:2109.15258, 2021.
  • [15] C. T. Dinh, N. H. Tran, M. N. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, “Federated Learning over Wireless Networks: Convergence Analysis and Resource Allocation,” IEEE/ACM Transactions on Networking, vol. 29, no. 1, pp. 398–409, 2020.
  • [16] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A Joint Learning and Communications Framework for Federated Learning Over Wireless Networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 269–283, 2021.
  • [17] I. Flint, H.-B. Kong, N. Privault, P. Wang, and D. Niyato, “Analysis of Heterogeneous Wireless Networks using Poisson Hard-core Hole Process,” IEEE Transactions on Wireless Communications, vol. 16, no. 11, pp. 7152–7167, 2017.
  • [18] A. M. Hunter, J. G. Andrews, and S. Weber, “Transmission Capacity of Ad Hoc Networks with Spatial Diversity,” IEEE Transactions on Wireless Communications, vol. 7, no. 12, pp. 5058–5071, 2008.
  • [19] S. Weber, J. G. Andrews, and N. Jindal, “An Overview of the Transmission Capacity of Wireless Networks,” IEEE Transactions on Communications, vol. 58, no. 12, pp. 3593–3604, 2010.
  • [20] Y. Sun, L. Zhang, G. Feng, B. Yang, B. Cao, and M. A. Imran, “Blockchain-Enabled Wireless Internet of Things: Performance Analysis and Optimal Communication Node Deployment,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 5791–5802, 2019.
  • [21] Y. J. Chun, M. O. Hasna, and A. Ghrayeb, “Modeling heterogeneous cellular networks interference using poisson cluster processes,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 10, pp. 2182–2195, 2015.
  • [22] V. V. Chetlur and H. S. Dhillon, “Coverage Analysis of a Vehicular Network Modeled as Cox Process Driven by Poisson Line Process,” IEEE Transactions on Wireless Communications, vol. 17, no. 7, pp. 4401–4416, 2018.
  • [23] C. Hennig and M. Kutlukaya, “Some Thoughts about the Design of Loss Functions,” REVSTAT–Statistical Journal, vol. 5, no. 1, pp. 19–39, 2007.
  • [24] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy Efficient Federated Learning over Wireless Communication Networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 3, pp. 1935–1949, 2020.
  • [25] J. Wang, Z. Charles, Z. Xu, G. Joshi, H. B. McMahan, M. Al-Shedivat, G. Andrew, S. Avestimehr, K. Daly, D. Data et al., “A field Guide to Federated Optimization,” arXiv preprint arXiv:2107.06917, 2021.
  • [26] P. L. Hsu and H. Robbins, “Complete Convergence and the Law of Large Numbers,” in Proceedings of the National Academy of Sciences, vol. 33, no. 2, pp. 25–31, 1947.
  • [27] T. H. Hsu, H. Qi, and M. Brown, “Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification,” CoRR, vol. abs/1909.06335, 2019. [Online]. Available: http://arxiv.org/abs/1909.06335
  • [28] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, pp. 1273–1282, 20–22 Apr 2017. [Online]. Available: http://proceedings.mlr.press/v54/mcmahan17a.html
  • [29] Y. Gao, M. Kim, S. Abuadbba, Y. Kim, C. Thapa, K. Kim, S. A. Camtep, H. Kim, and S. Nepal, “End-to-End Evaluation of Federated Learning and Split Learning for Internet of Things,” in Proceedings of 2020 IEEE Computer Society International Symposium on Reliable Distributed Systems (SRDS), pp. 91–100, 2020.
  • [30] Y. Liu, S. Garg, J. Nie, Y. Zhang, Z. Xiong, J. Kang, and M. S. Hossain, “Deep Anomaly Detection for Time-series Data in Industrial IoT: a Communication-efficient on-device Federated Learning Approach,” IEEE Internet of Things Journal, vol. 8, no. 8, pp. 6348–6358, 2020.
  • [31] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling Policies for Federated Learning in Wireless Networks,” IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2020.
  • [32] Y. LeCun, “The MNIST Database of Handwritten Digits,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [33] Y. Li and Y. Yuan, “Convergence Analysis of Two-layer Neural Networks with ReLU Activation,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 597–607.
  • [34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet Classification with Deep Convolutional Neural Networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
  • [35] C. Darken and J. E. Moody, “Note on Learning Rate Schedules for Stochastic Optimization,” in Proceedings of the 4th International Conference on Neural Information Processing Systems, vol. 91, 1990, pp. 832–838.
  • [36] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the Variance of the Adaptive Learning Rate and Beyond,” in Proceedings of International Conference on Learning Representations, 2019.
  • [37] Y.-J. Liu, G. Feng, J. Wang, Y. Sun, and S. Qin, “Access Control for RAN Slicing based on Federated Deep Reinforcement Learning,” in Proceedings of ICC 2021 - IEEE International Conference on Communications, pp. 1–6, 2021.