Federated Learning with Erroneous Communication Links

Mahyar Shirvanimoghaddam , , Ayoob Salari ,, , Yifeng Gao, Aradhika Guha The authors are with the School of Electrical and Information Engineering, The University of Sydney, NSW 2006, Australia. E-mails: [email protected], [email protected], [email protected], [email protected].

Abstract

In this paper, we consider the federated learning (FL) problem in the presence of communication errors. We model the link between the devices and the central node (CN) by a packet erasure channel, where the local parameters from devices are either erased or received correctly by CN with probability $\epsilon$ and $1-\epsilon$ , respectively. We proved that the FL algorithm in the presence of communication errors, where the CN uses the past local update if the fresh one is not received from a device, converges to the same global parameter as that the FL algorithm converges to without any communication error. We provide several simulation results to validate our theoretical analysis. We also show that when the dataset is uniformly distributed among devices, the FL algorithm that only uses fresh updates and discards missing updates might converge faster than the FL algorithm that uses past local updates.

Index Terms:

Convexity, federated learning, gradient descent, short packet communications, smoothness.

I Introduction

Internet of things (IoT) applications and services have become popular in recent years due to major technological advancements in sensing, communications, and computation [1]. In many IoT applications, devices operate with extremely low power due to the size, cost, and technological limitations. Traditionally, the devices upload their data to a central node (CN), where all data gathered will be analyzed together. However, such an approach is not feasible in many IoT settings, due to privacy issues, limited power, and usually insufficient communication bandwidth [2]. An alternative solution is to use federated learning (FL) [3], where each user is responsible for computing the updates to the current global model based on its own local training data and transmitting them to the CN. Using the updates from the devices, CN improves the global model and redistributes the updated global model to the users. This process is repeated until the convergence is achieved [4].

In FL, communication between devices and CN can be significantly slower than local computing [5, 6]. Since devices need to communicate with CN repeatedly until the convergence is acquired, there exists a trade-off between local computational power and communication overhead. As we increase the number of local iterations, the required communication between users and CN reduces, and vice versa. The effect of large communication overhead on the scalability of FL and trade-off between local updates and global aggregation was investigated in [7, 8]. Another challenge of FL is the system heterogeneity in which different devices have different computational capabilities, power levels, and storage capacities. In [9], wight-based federated averaging was proposed to tackle this issue.

The majority of works on FL examined a simplified model for the communication channel, where transmissions are error-free but rate-limited [10, 11, 12]. However, since many IoT devices are operating with low power, have limited computational capabilities, and send short packets, the wireless channel between IoT devices and CN is usually erroneous. Recently, a few studies considered FL over a fading wireless channel [13]. In [14], to enhance the learning accuracy over a fading channel, a new scheme was proposed, in which at each iteration, based on the channel state, only one device is selected for transmission using a capacity-achieving channel code. However, there is still a lack of full understanding of the effect of communication error on the accuracy and convergence of FL approaches. This is particularly important for IoT applications, since the channel between devices and CN is usually weak. Due to the large number of devices, the local updates from some of the devices may not reach the CN, which deteriorates the FL performance. In this paper, we shed some light on the effect of communication errors on the convergence and accuracy of FL algorithms.

We consider a distributed learning problem, where the links between the devices and the CN is modeled by packet erasure channels. In particular, we assume the local updates sent by devices face communication errors; that is, the update is erased or received successfully with a certain probability. We consider two scenarios to calculate the global parameter, where the CN only uses the received fresh local updates and discard missing updates or reuse past updates for devices with missing updates. We further analyse the convergence of these approaches, and prove that by using old local updates in case of errors, the FL algorithm employing gradient descent (GD) converges, and the global parameter will converge to the optimal global parameter. This means that the CN does not necessarily needs fresh updates in every communication round of FL to calculate the global parameter. Instead it can reuse past updates in case of error and continue the FL without jeopardizing the global accuracy. We provide simulation results to verify our analysis and further discuss the convergence behaviour for various datasets with different statistical properties.

The rest of the paper is organized as follows. In Section II, we explain the system model and describe different FL approaches in the presence of communication errors. In Section III, we analyze the convergence and accuracy of the FL approaches in the presence of communication error. Simulation results are provided in Section IV. Finally, Section V concludes the paper.

II System Model

We consider a general federated learning problem, where $N$ device with local datasets, $\mathcal{D}_{1}$ , $\mathcal{D}_{2}$ , $\cdots$ , $\mathcal{D}_{N}$ , communicate with the central node (CN). The channel between the $i^{th}$ device and CN is modeled by a packet erasure channel with erasure probability $\epsilon_{i}$ ; that is a packet sent from the device to the CN is erased with probability $\epsilon_{i}$ and received correctly with probability $1-\epsilon_{i}$ . The erasure events for all channels are independent of each other. This model is particularly important for massive IoT scenarios, where many IoT devices communicate with a CN using short packets and due to limited power at devices, the communication channel is mostly erroneous. In particular, let us assume that the local updates from each device is a packet of length $k$ bits, which is encoded using a channel code of rate $R=k/n$ , where $n$ is the codeword length. By using the normal approximation bound [15], the packet error rate at the receiver, when the signal-to-noise ratio is $\gamma$ , is given by:

\displaystyle\epsilon\approx Q\left(\frac{n\log_{2}(1+\gamma)-k+\log_{2}(n)}{\sqrt{nV(\gamma)}}\right),

(1)

where $Q(.)$ is the standard $Q$ -function and $V(\gamma)=\left(1-(1+\gamma)^{-2}\right)\log_{2}^{2}(e)$ is the channel dispersion [15]. The short packet communication can then be modeled by the packet erasure channel with erasure probability $\epsilon$ , when $k$ , $n$ , and $\gamma$ is known. Using this simplified model, the channel quality can be characterized by a single parameter $\epsilon$ .

We also assume that the downlink channel, i.e., the channel from the CN to devices, is error-free since the CN can transmit with high power; therefore, the error in the downlink channel can be appropriately mitigated.

II-A Federated Learning in the error-free scenario

In FL, each device $i$ calculates the local update and sends the parameters to the CN. Then, CN aggregates all the parameters received from all nodes and calculates the general parameters. In particular, let $w^{(t)}$ denote the local parameter calculated at device $i$ at time instant $t$ . By using the gradient descent (GD) method with learning rate $\eta$ and local loss function $F_{i}:\mathbb{R}^{d}\to\mathbb{R}$ , $w^{(t)}_{i}$ can be calculated as follows:

w^{(t)}_{i}=w^{(t-1)}-\eta\nabla F_{i}(w^{(t-1)}).

(2)

Upon receiving all local parameters (assuming that $\epsilon_{i}=0$ for all devices) from all devices, the CN calculates the global parameter as follows:

w^{(t)}=\frac{1}{D}\sum_{i=1}^{N}D_{i}w^{(t)}_{i},

(3)

where $D_{i}=|\mathcal{D}_{i}|$ is the size of dataset $\mathcal{D}_{i}$ and $D=\sum_{i=1}^{N}D_{i}$ .

II-B Federated Learning in the presence of communication error

Here, we consider that the channel between the nodes and the CN is erroneous and that the local updates calculated from some of the nodes may not reach the CN.

II-B1 FL with erroneous communication and no memory at CN

In this scenario, the CN calculates the global parameter using the received local updates only. The global update, in this case, is calculated as follows:

\displaystyle w^{(t)}=\frac{1}{\sum_{i\in\mathcal{S}(t)}D_{i}}\sum_{i\in\mathcal{S}(t)}D_{i}w^{(t)}_{i},

(4)

where $\mathcal{S}(t)$ is the set of all nodes that their updates have been successfully received at CN at time instant $t$ .

II-B2 FL with erroneous communication and reuse of old local updates

We consider another scenario, in which the CN uses the previous updates received from the devices in case their new updates are not received. In this case, the global update is calculated as follows:

\displaystyle w^{(t)}=\frac{1}{D}\left(\sum_{i\in\mathcal{S}(t)}D_{i}w^{(t)}_{i}+\sum_{j\in\mathcal{F}(t)}D_{j}w^{(t-1)}_{j}\right),

(5)

where $\mathcal{F}(t)$ is the set of all nodes that their updates have not been received at CN at time instant $t$ . Here, we assumed that the previous local updates for missing nodes are always available at the CN. This assumption is valid as long as the erasure probability is low.

Refer to caption — (a) MSE vs. Comm. round.

Fig. 1 shows a comparison between the FL schemes with and without communication errors. As can be seen in this figure, when CN does not have a memory, the overall mean squared error (MSE) fluctuates and the FL does not converge. However, when CN can store previous updates from the devices, it can reuse them if the fresh updates are missing (see (5)). In this case, the FL algorithm converges to the same MSE as that for FL without any communication error.

III Performance Analysis of the FL Algorithm in the Presence of Communication Error

For the simplicity of the analysis, we assume that the size of all local datasets are the same, i.e., $D_{i}=D/N$ , for $i=1,\cdots,N$ . When the CN does not have memory, all updates which have not been received will be discarded from the global aggregation step. Let us, for a moment, assume that the number of local iterations at the nodes is sufficiently large. That is, when each node performs a local update, the GD at that device converges. In this case, we define $w^{(t)}_{i}=w_{i}$ for $i=1,\cdots,N$ . By using (4), the global update at time instant $t$ is given by:

\displaystyle w^{(t)}=\frac{1}{|\mathcal{S}(t)|}\sum_{i\in\mathcal{S}(t)}w_{i}.

(6)

Since the link between the $i^{th}$ device and CN is an erasure channel with erasure probability $\epsilon_{i}$ , the probability mass function $(\mathrm{pmf})$ of $w^{(t)}$ can be calculated as follows:

\displaystyle\mathrm{Prob}\left\{w^{(t)}=\frac{\sum_{i=1}^{N}I_{i}w_{i}}{\sum_{i=1}^{N}I_{i}}\right\}=\prod_{i=1}^{N}\epsilon_{i}^{1-I_{i}}(1-\epsilon_{i})^{I_{i}},

(7)

where $I_{i}\in\{0,1\}$ denote the erasure event for the link between the $i^{th}$ device and CN, i.e., $\mathrm{Prob}\{I_{i}=1\}=1-\epsilon_{i}$ and $\mathrm{Prob}\{I_{i}=0\}=\epsilon_{i}$ . From (7) it can be easily observed that the global parameter always fluctuates due to the random nature of the erasure events. This can be clearly seen in Fig. 1(a).

III-A Performance Analysis of the FL algorithm with erroneous communication and reuse of old local updates

Here, we assume that $F_{i}(.)$ , for all $i\in\{1,\cdots,N\}$ , is convex and $L$ -smooth. For function $f:\mathbb{R}^{d}\to\mathbb{R}$ that is convex and $L$ -smooth, the following relations hold, $\forall x,y\in\mathbb{R}^{d}$ [16]:

$\displaystyle f(y)$	$\displaystyle\leq f(x)+\nabla f(x)(y-x)^{\prime}+\frac{L}{2}\|\|y-x\|\|_{2}^{2},$	(8)
$\displaystyle f(x^{*})$	$\displaystyle-f(x)\leq-\frac{1}{2L}\|\|\nabla f(x)\|\|_{2}^{2},~{}x^{*}=\arg\min_{x}f(x),$	(9)
$\displaystyle\|\|\nabla f($	$\displaystyle x)-\nabla f(y)\|\|_{2}\leq L\|\|x-y\|\|_{2},$	(10)
$\displaystyle\left(\nabla f(\right.$	$\displaystyle x)-\left.\nabla f(y)\right)(x-y)^{\prime}\geq\frac{1}{L}\|\|\nabla f(x)-\nabla f(y)\|\|_{2}^{2},$	(11)

where $(.)^{\prime}$ denote the matrix transpose operand.

Lemma 1.

Let $F_{i}:\mathbb{R}^{d}\to\mathbb{R}$ , for all $i\in\{1,\cdots,N\}$ , is convex and $L$ -smooth. $F_{\mathcal{G}}(x)=(1/N)\sum_{i\in\mathcal{G}}F_{i}(x)$ is also convex and $\frac{|\mathcal{G}|L}{N}$ -smooth.

Proof:

This can be easily proved from the sub-additivity of the norm and additivity of the gradient [16]. ∎

From Lemma 1, it can be easily proved that $F(x)=(1/N)\sum_{i=1}^{N}F_{i}(x)$ is also convex and $L$ -smooth. In the following theorem, we show that the FL algorithm in the presence of communication error, where the CN uses the past local updates in case of communication error, converges to the global minima of the global loss function and accordingly the optimal global parameter.

Theorem 1.

Let us consider a FL problem with $N$ devices, where the channel from each device to the CN is modeled by an erasure channel with erasure probability $\epsilon$ . We assume that the local loss function $F_{i}(x)$ at device $i$ is convex and $L$ -smooth. We further assume that $||\nabla F(x)-\nabla F(y)||_{2}\geq\mu||x-y||_{2}$ , for all $x,y\in\mathbb{R}^{d}$ , where $F(x)=\frac{1}{N}\sum_{i=1}^{N}F_{i}(x)$ . Let $\delta_{t}=||w^{(t)}-w^{*}||_{2}^{2}$ , where $w^{*}=\arg\min_{w}F(w)$ , and $\bar{\delta}_{t+1}=\frac{1}{t+1}\sum_{i=0}^{t}\delta_{i}$ . For the FL algorithm (5), when $\epsilon\leq\frac{\mu}{2L}$ and $\eta=\frac{1}{L}$ , $\bar{\delta}_{k}$ is upper bounded by:

\displaystyle\bar{\delta}_{t}\leq\frac{F(w^{(0)})-F(w^{*})}{t\beta^{2}},~{}~{}\text{for}~{}t>0,

(12)

where $\beta^{2}=\frac{\mu^{2}}{2L}-2L\epsilon^{2}$ .

Proof:

The proof is provided in Appendix A. ∎

Theorem 1 states that the gap to the global minima ( $w^{*}$ ) decreases with the iteration number $t$ and converges to zero, when $t$ is arbitrary large, i.e., $\lim_{t\to\infty}\bar{\delta}_{t}=0$ . it is also easy to show that $\lim_{t\to\infty}\delta_{t}=0$ , since otherwise, if $\delta_{t}\geq\epsilon$ , where $\epsilon>0$ , $\bar{\delta}_{t}$ will be always bounded above $\epsilon$ . It is important to note that Theorem 1 does not state that $\delta_{t}$ is a decreasing function of $t$ . Instead it shows that in the limit that $t$ is sufficiently large, the global parameter, $w^{(t)}$ , converges to the optimal global parameter $w^{*}$ , even in the presence of communication error.

IV Results and Discussion

In this section, we focus on the FL algorithm, where the CN uses old local updates received from devices when their fresh updates are not available due to communication error. Fig. 2 shows the overall MSE of the FL algorithm at various erasure probabilities, when $N=3$ . We use the same dataset as in Fig. 1(b). As can be seen in Fig. 2, when the erasure rate is small, i.e., $\epsilon=0.1$ , the performance of the FL with reuse of old updates closely approaches that of the FL without communication errors. However, when the erasure probability increases, the MSE in the early iterations has a significant gap to the ideal FL case with no error. In all scenarios, even when the erasure rate is large, i.e., $\epsilon=0.5$ , the FL algorithm with reuse of old updates in the presence of error converges to that of the ideal FL without error.

Fig. 3(a) shows the performance of FL algorithm with reuse of old updates at various erasure probabilities, when the number of devices is $N=10$ . The datasets are shown in Fig. 3(b). The data is created by using a non-linear model $y=x^{2}+z$ , where $z\sim\mathcal{N}(0,\sigma^{2})$ , is additive white Gaussian noise. As can be seen in Fig. 3(b), the linear regression curve for each local dataset has a different slope; therefore, losing any dataset due to error will result in a significant change in the global parameter. This model is of particular importance for IoT scenarios, where devices are distributed in a field at various locations; therefore, their measurements are location/time dependent and accordingly their datasets are non-iid. As can be seen in Fig. 3(a), when the CN reuses past local updates in case of error, the MSE performance of the FL algorithm converges to that of the ideal FL algorithm without error.

It is important to note that when the dataset is uniformly distributed among devices and the local parameters are not significantly different, even without reusing old local updates, the FL algorithm converges. An example is provided in Fig. 4, where the dataset is uniformly distributed between 3 devices. In this case, the local parameter from all devices are relatively close to each other. Therefore, missing some devices’ updates does not affect the overall performance. As can be seen in Fig. 4(a), the FL algorithm without memory at the CN outperforms the FL algorithm with reusing local updates, when the dataset is uniformly distributed among devices. This can be explained for an extreme case, when $D$ is relatively large, no error occurs up until the $i$ th communication round, and all local updates are the same in each round. In this case, by using (4), we will have $w^{(t)}=w^{(t)}_{i}=w^{(t-1)}-\eta\nabla F_{i}(w^{(t-1)})$ . Since the datasets are uniformly distributed, $w^{(t)}=w^{(t-1)}-\eta\nabla F(w^{(t-1)})$ , even if an error occurs. However, if in the case of error we reuse old updates, by using (5) the global parameter will be $w^{(t)}=w^{(t)}_{i}-\epsilon\left(w^{(t)}_{i}-w^{(t-1)}_{i}\right)$ . This is equivalent to $w^{(t)}=w^{(t-1)}-\eta\nabla F(w^{(t-1)})-\epsilon\left(w^{(t)}_{i}-w^{(t-1)}_{i}\right)$ , which means that the global parameter will have a gap, proportional to the erasure rate, to that of the error-free case. Therefore, for uniformly distributed datasets, by reusing old updates in case of communication error, the FL algorithm takes longer to converge.

We further consider an image classification task using a real dataset from MNIST [17], which consists of 4000 handwritten images of the numbers $0$ to $3$ . The example runs in parallel using 4 workers (i.e., devices), each processing images of a single digit. In particular, worker $i$ has 700 handwritten images of number $i-1$ as its training set. The validation and test set at the CN each has 150 images of each number $0$ to $3$ . Similar to [17], at each worker, we use a CNN with two $5\times 5$ convolution layers (the first with 32 channels, the second with 64, each followed with $2\times 2$ max pooling), a fully connected layer with 512 units and ReLu activation, and a final softmax output layer. We also use the stochastic GD optimizer with learning rate $0.001$ . Fig. 5 shows the accuracy of the FL algorithm after each communication round in the presence of communication error. As can be seen in this figure, when the CN uses the old local updates if the fresh updates are erased due to communication error, the FL algorithm converges to the same level of accuracy as that for the FL algorithm without any communication error. However, when the CN does not store past local updates and only uses fresh updates and discard missing updates in the aggregation stage, the accuracy changes significantly with the communication round.

V Conclusions

In this paper, we studied the federated learning algorithm in the presence of communication errors. We modelled the channel between the devices and the central node (CN) by packet erasure channels. We presented two approaches to deal with error events. That is when an error occurs, and local updates are not received at CN, the CN can either calculate the global parameter by using only the fresh local updates received correctly or reuse the old updates for missing local updates. We proved that the FL algorithm with reusing local updates in case of error converges to the same global parameters as that with FL without any communication error. This means that the CN does not need to wait for the retransmission of missing updates. Instead, it can reuse the past local updates to calculate the global parameters and continue the FL algorithm. This approach converges to the same result as FL achieves with no communication error. This is of practical importance as IoT devices can significantly save energy by relaxing the communication reliability requirements without jeopardizing the overall learning accuracy. We further provided simulation results to verify our theoretical results. We also highlighted that when the dataset is uniformly distributed among devices, the FL algorithm without memory at the CN converges faster than that with reusing local updates.

Appendix A Proof of Theroem 1

Assuming that $D_{i}=D/N$ , (5) can be written as follows:

$\displaystyle w^{(t+1)}$	$\displaystyle=\frac{1}{N}\sum_{i\in\mathcal{S}(t)}w^{(t)}_{i}+\frac{1}{N}\sum_{j\in\mathcal{F}(t)}w^{(t-1)}_{j}$	(13)
	$\displaystyle=\frac{1}{N}\sum_{i\in\mathcal{S}(t)}\left(w^{(t)}-\eta\nabla F_{i}(w^{(t)})\right)$
	$\displaystyle+\frac{1}{N}\sum_{j\in\mathcal{F}(t)}\left(w^{(t-1)}-\eta\nabla F_{j}(w^{(t-1)})\right)$
	$\displaystyle=w^{(t)}-\eta\nabla F(w^{(t)})+\frac{\|\mathcal{F}(t)\|}{N}\left(w^{(t-1)}-w^{(t)}\right)$
	$\displaystyle+\eta\nabla F_{\mathcal{F}}(w^{(t)})-\eta\nabla F_{\mathcal{F}}(w^{(t-1)}),$	(14)

where $F_{\mathcal{F}}(x)=\frac{|\mathcal{F}(x)|}{N}\sum_{j\in\mathcal{F}(t)}F_{j}(x)$ . Since $F(x)$ is convex and $L$ -smooth, by using (8), we have:

	$\displaystyle F($	$\displaystyle w^{(t+1)})\leq F(w^{(t)})+\nabla F(w^{(t)})\left(w^{(t+1)}-w^{(t)}\right)^{\prime}$
		$\displaystyle+\frac{L}{2}\|\|w^{(t+1)}-w^{(t)}\|\|_{2}^{2}$
		$\displaystyle\overset{\eqref{eq:iteraton}}{=}F(w^{(t)})-\eta\|\|\nabla F(w^{(t)})\|\|_{2}^{2}$
		$\displaystyle+\frac{\|\mathcal{F}(t)\|}{N}\nabla F(w^{(t)})(w^{(t-1)}-w^{(t)})^{\prime}$
		$\displaystyle+\eta\nabla F(w^{(t)})(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))^{\prime}$
		$\displaystyle+\frac{L}{2}\eta^{2}\|\|\nabla F(w^{(t)})\|\|_{2}^{2}+\frac{L\|\mathcal{F}(t)\|^{2}}{2N^{2}}\|\|w^{(t-1)}-w^{(t)}\|\|_{2}^{2}$
		$\displaystyle+\frac{L}{2}\eta^{2}\|\|\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)})\|\|_{2}^{2}$
		$\displaystyle-\frac{L\|\mathcal{F}(t)\|}{N}\eta\nabla F(w^{(t)})(w^{(t-1)}-w^{(t)})^{\prime}$
		$\displaystyle-L\eta^{2}\nabla F(w^{(t)})(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))^{\prime}$
		$\displaystyle-\frac{\eta L\|\mathcal{F}(t)\|}{N}(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))(w^{(t)}-w^{(t-1)})^{\prime}$

Now, assuming that $\eta=\frac{1}{L}$ and due to the fact that $|\mathcal{F}_{F}|\approx\epsilon N$ , when $N$ is sufficiently large, this can be simplified to:

	$\displaystyle F(w^{(t+1)})\leq F(w^{(t)})-\frac{1}{2L}\|\|\nabla F(w^{(t)})\|\|_{2}^{2}$
	$\displaystyle+\frac{L\epsilon^{2}\|\|w^{(t)}-w^{(t-1)}\|\|_{2}^{2}}{2}+\frac{\|\|\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)})\|\|_{2}^{2}}{2L}$
	$\displaystyle-\epsilon(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))(w^{(t)}-w^{(t-1)})^{\prime}$
	$\displaystyle\overset{(a)}{\leq}F(w^{(t)})-\frac{1}{2L}\|\|\nabla F(w^{(t)})\|\|_{2}^{2}+\frac{L\epsilon^{2}}{2}\|\|w^{(t)}-w^{(t-1)}\|\|_{2}^{2}$
	$\displaystyle-\frac{1}{2L}\|\|\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)})\|\|_{2}^{2}$
	$\displaystyle\leq F(w^{(t)})-\frac{\|\|\nabla F(w^{(t)})\|\|_{2}^{2}}{2L}+\frac{L}{2}\epsilon^{2}\|\|w^{(t)}-w^{(t-1)}\|\|_{2}^{2},$		(15)

where step $(a)$ follows from (11) and Lemma 1, which indicates that $F_{\mathcal{F}}(.)$ is convex and $Le$ -smooth. Since we assumed that $||\nabla F(x)-\nabla F(y)||_{2}\geq\mu||x-y||_{2}$ , for all $x,y\in\mathbb{R}^{d}$ , we have:

	$\displaystyle F(w^{(t+1)})$	$\displaystyle\leq F(w^{(t)})-\frac{\mu^{2}}{2L}\|\|w^{(t)}-w^{*}\|\|_{2}^{2}$
		$\displaystyle+\frac{L}{2}\epsilon^{2}\|\|w^{(t-1)}-w^{(t)}\|\|_{2}^{2}.$		(16)

It is easy to show that $||w^{(t-1)}-w^{(t)}||_{2}^{2}\leq 2\left(\delta_{t}+\delta_{t-1}\right)$ . We can further simplify (16) as follows:

\displaystyle F(w^{(t+1)})\leq F(w^{(t)})+\left(L\epsilon^{2}-\frac{\mu^{2}}{2L}\right)\delta_{t}+L\epsilon^{2}\delta_{t-1}.

(17)

Summing up both sides over $t=1,\cdots,k$ , and using telescopic cancellation, we have:

	$\displaystyle F(w^{(k+1)})$	$\displaystyle\leq F(w^{(0)})+(2L\epsilon^{2}-\frac{\mu^{2}}{2L})\sum_{i=1}^{k-1}\delta_{i}$
		$\displaystyle+(L\epsilon^{2}-\frac{\mu^{2}}{2L})(\delta_{k}+\delta_{0}),$		(18)

where we assumed that the first global update ( $t=1$ ), is calculated without any communications error. That is $F(w^{(1)})\leq F(w^{0})-\frac{\mu}{2L}\delta_{0}$ . Assuming that $\epsilon\leq\frac{\mu}{2L}$ , we have $\frac{\mu^{2}}{2L}-2L\epsilon^{2}<\frac{\mu^{2}}{2L}-L\epsilon^{2}$ . Therefore, (18) is simplified to:

\displaystyle F(w^{(k+1)})

\displaystyle\leq F(w^{(0)})-\beta^{2}(k+1)\bar{\delta}_{k+1},

(19)

where $\beta^{2}=\frac{\mu^{2}}{2L}-2L\epsilon^{2}$ . By rearranging the above inequality, we have:

\displaystyle\bar{\delta}_{k+1}\leq\frac{F(w^{(0)})-F(w^{(k+1)})}{(k+1)\beta^{2}}.

(20)

Since $F(w^{*})\leq F(w^{(k+1)})$ , We will have:

\displaystyle\bar{\delta}_{k+1}\leq\frac{F(w^{(0)})-F(w^{*})}{(k+1)\beta^{2}}.

(21)

References

[1] C. Zhang, Y. Xie, H. Bai, B. Yu, W. Li, and Y. Gao, “A survey on federated learning,” Knowledge-Based Syst., vol. 216, p. 106775, 2021.
[2] A. Imteaj, U. Thakker, S. Wang, J. Li, and M. H. Amini, “A survey on federated learning for resource-constrained IoT devices,” IEEE Internet Things J., vol. 9, no. 1, pp. 1–24, 2021.
[3] O. A. Wahab, A. Mourad, H. Otrok, and T. Taleb, “Federated machine learning: Survey, multi-level classification, desirable criteria and future directions in communication and networking systems,” IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 1342–1397, 2021.
[4] R. Jin, X. He, and H. Dai, “Communication efficient federated learning with energy awareness over wireless networks,” IEEE Trans. Wireless Commun., 2022.
[5] G. Zhu, Y. Du, D. Gündüz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 2120–2135, 2021.
[6] Z. Zhao, C. Feng, W. Hong, J. Jiang, C. Jia, T. Q. S. Quek, and M. Peng, “Federated learning with non-iid data in wireless networks,” IEEE Trans. Wireless Commun., vol. 21, no. 3, pp. 1927–1942, 2022.
[7] S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney, “Giant: Globally improved approximate newton method for distributed optimization,” in Proc. Intl. Conf. Neural Inf. Process. Syst., 2018, pp. 2338–2348.
[8] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
[9] V.-D. Nguyen, S. K. Sharma, T. X. Vu, S. Chatzinotas, and B. Ottersten, “Efficient federated learning algorithm for resource allocation in wireless IoT networks,” IEEE Internet Things J., vol. 8, no. 5, pp. 3394–3409, 2021.
[10] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of update aware device scheduling for federated learning at the wireless edge,” IEEE Trans. Wireless Commun., vol. 20, no. 6, pp. 3643–3658, 2021.
[11] J. Ren, G. Yu, and G. Ding, “Accelerating DNN training in wireless federated edge learning systems,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 219–232, 2020.
[12] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, 2019.
[13] H. Zhang and L. Hanzo, “Federated learning assisted multi-UAV networks,” IEEE Trans. Veh. Technol., vol. 69, no. 11, pp. 14 104–14 109, 2020.
[14] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
[15] Y. Polyanskiy, H. V. Poor, and S. Verdu, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[16] S. Bubeck, “Convex optimization: Algorithms and complexity,” Found. Trends Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
[17] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics, 2017, pp. 1273–1282.

	$\displaystyle F($	$\displaystyle w^{(t+1)})\leq F(w^{(t)})+\nabla F(w^{(t)})\left(w^{(t+1)}-w^{(t)}\right)^{\prime}$
		$\displaystyle+\frac{L}{2}\|\|w^{(t+1)}-w^{(t)}\|\|_{2}^{2}$
		$\displaystyle\overset{\eqref{eq:iteraton}}{=}F(w^{(t)})-\eta\|\|\nabla F(w^{(t)})\|\|_{2}^{2}$
		$\displaystyle+\frac{\|\mathcal{F}(t)\|}{N}\nabla F(w^{(t)})(w^{(t-1)}-w^{(t)})^{\prime}$
		$\displaystyle+\eta\nabla F(w^{(t)})(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))^{\prime}$
		$\displaystyle+\frac{L}{2}\eta^{2}\|\|\nabla F(w^{(t)})\|\|_{2}^{2}+\frac{L\|\mathcal{F}(t)\|^{2}}{2N^{2}}\|\|w^{(t-1)}-w^{(t)}\|\|_{2}^{2}$
		$\displaystyle+\frac{L}{2}\eta^{2}\|\|\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)})\|\|_{2}^{2}$
		$\displaystyle-\frac{L\|\mathcal{F}(t)\|}{N}\eta\nabla F(w^{(t)})(w^{(t-1)}-w^{(t)})^{\prime}$
		$\displaystyle-L\eta^{2}\nabla F(w^{(t)})(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))^{\prime}$
		$\displaystyle-\frac{\eta L\|\mathcal{F}(t)\|}{N}(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))(w^{(t)}-w^{(t-1)})^{\prime}$

	$\displaystyle F(w^{(t+1)})\leq F(w^{(t)})-\frac{1}{2L}\|\|\nabla F(w^{(t)})\|\|_{2}^{2}$
	$\displaystyle+\frac{L\epsilon^{2}\|\|w^{(t)}-w^{(t-1)}\|\|_{2}^{2}}{2}+\frac{\|\|\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)})\|\|_{2}^{2}}{2L}$
	$\displaystyle-\epsilon(\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)}))(w^{(t)}-w^{(t-1)})^{\prime}$
	$\displaystyle\overset{(a)}{\leq}F(w^{(t)})-\frac{1}{2L}\|\|\nabla F(w^{(t)})\|\|_{2}^{2}+\frac{L\epsilon^{2}}{2}\|\|w^{(t)}-w^{(t-1)}\|\|_{2}^{2}$
	$\displaystyle-\frac{1}{2L}\|\|\nabla F_{\mathcal{F}}(w^{(t)})-\nabla F_{\mathcal{F}}(w^{(t-1)})\|\|_{2}^{2}$
	$\displaystyle\leq F(w^{(t)})-\frac{\|\|\nabla F(w^{(t)})\|\|_{2}^{2}}{2L}+\frac{L}{2}\epsilon^{2}\|\|w^{(t)}-w^{(t-1)}\|\|_{2}^{2},$		(15)