This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

OledFL: Unleashing the Potential of Decentralized Federated Learning via Opposite Lookahead Enhancement

Qinglun Li, Miao Zhang, Mengzhu Wang, Quanjun Yin and Li Shen Qinglun Li is with the National University of Defense Technology, [email protected]. Miao Zhang is with the National University of Defense Technology, [email protected]. Mengzhu Wang is with the Hebei University of Technology, Tianjin, 300000, China, [email protected]. Quanjun Yin is with the National University of Defense Technology, [email protected]. Li Shen is with School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China. [email protected]. Manuscript received April XX, XXXX; revised August XX, XXXX.
Abstract

Decentralized Federated Learning (DFL) surpasses Centralized Federated Learning (CFL) in terms of faster training, privacy preservation, and light communication, making it a promising alternative in the field of federated learning. However, DFL still exhibits significant disparities with CFL in terms of generalization ability such as rarely theoretical understanding and degraded empirical performance due to severe inconsistency. In this paper, we enhance the consistency of DFL by developing an opposite lookahead enhancement technique (Ole), yielding OledFL to optimize the initialization of each client in each communication round, thus significantly improving both the generalization and convergence speed. Moreover, we rigorously establish its convergence rate in non-convex setting and characterize its generalization bound through uniform stability, which provides concrete reasons why OledFL can achieve both the fast convergence speed and high generalization ability. Extensive experiments conducted on the CIFAR10 and CIFAR100 datasets with Dirichlet and Pathological distributions illustrate that our OledFL can achieve up to 5% performance improvement and 8×\times speedup, compared to the most popular DFedAvg optimizer in DFL.

Index Terms:
Decentralized Federated Learning, Non-convex Optimization, Acceleration, Convergence Analysis, Generalization Analysis.

I Introduction

Federated Learning (FL) is a novel distributed machine learning paradigm that ensures privacy protection [1, 2, 3]. It enables multiple participants to collaborate on training models without sharing their raw data. Currently, most research efforts [4, 5, 6, 7, 8, 9] have focused on Centralized Federated Learning (CFL). However, the presence of a central server in CFL introduces challenges such as communication burden, single point of failure [10], and privacy breaches [11]. In contrast, Distributed Federated Learning (DFL) offers improved privacy protection [12], faster model training [13], and robustness to slow client devices [14]. Therefore, DFL has emerged as a popular alternative solution [10, 13].

However, there still exists a significant performance gap between CFL and existing DFL methods, which we attribute to the decentralized communication approach of DFL. The lack of central server coordination results in increasing inconsistency between the model parameters 𝐱i\mathbf{x}_{i} of client ii and 𝐱j\mathbf{x}_{j} of client jj as the communication rounds increase, the severe inconsistency leads to a significant gap between the final output of the model 𝐱¯\bar{\mathbf{x}} and the global optimum 𝐱\mathbf{x}^{*}, leading to performance disparities compared to CFL. Although existing methods such as DFedAvgM [15] and DFedSAM [16] have improved the performance of DFL algorithms by introducing new local optimizers to accelerate convergence and address heterogeneous data overfitting, they have not tackled the issue from the critical perspective of communication discrepancies between DFL and CFL, resulting in a persistent performance gap between DFL and CFL. Additionally, existing theoretical analyses of DFL methods are limited to convergence analysis, while the generalization theory is still absent.

In this work, we propose a plug-in method named the opposite Lookahead enhancement (Ole) technique to enhance the consistency of DFL, dubbed OledFL that reduces the performance gap between CFL and DFL. OledFL framework can seamlessly integrate existing DFL optimizers to significantly improve the convergence speed and generalization performance of DFL empirically. We also theoretically prove that OledFL can enhance the convergence speed and reduce generalization error compared to the original DFL methods. Specifically, OledFL performs a retraction operation during the initialization phase of optimizing each client model parameter 𝐱it\mathbf{x}_{i}^{t} (see Figure 1). This ensures that each client will not stray too far from 𝐱it\mathbf{x}_{i}^{t} during the subsequent optimization process. By adopting this initialization operation for each client before optimization, the mutual consistency among them is naturally strengthened.

Theoretically, we jointly analyze the optimization error and generalization error by introducing excess error. Specifically, we demonstrate that in a non-convex setting, OledFL-SGD (integrating OledFL with DFedAvg) and OledFL-SAM (integrating OledFL with DFedSAM) exhibit an optimization error and convergence rate of 𝒪(1KT)\mathcal{O}(\frac{1}{\sqrt{KT}}), and the theoretical analysis indicates that “Ole” can reduce the algorithm’s optimization error (Remark 2). Additionally, through uniform stability analysis, we assess the algorithm’s generalization error and find that “Ole” can significantly reduce generalization error (Remark 3). For the experiments on CIFAR10&100 datasets under Dirichlet and Pathological distributions, OledFL consistently demonstrates notable improvements in convergence speed (see Table III) and generalization performance (see Table II) over existing DFL methods and outperforms state-of-the-art CFL methods such as FedSAM [17] and SCAFFOLD [18].

In summary, our main contributions are three-fold:

  • We propose an opposite Lookahead enhancement (Ole) technique to address the inconsistency issue in DFL, dubbed OledFL. OledFL can seamlessly integrate existing DFL optimizers such as DFedAvg and DFedSAM to significantly improve their convergence speed and generalization performance, thereby reducing the performance gap between CFL and DFL. Furthermore, we provide a comprehensive explanation of the role of Ole, including intuitive (Figure 1), theoretical (Remark 3), and experimental interpretations (Section 5 & Figure 4).

  • We establish a pioneering generalization analysis (Section IV) in the field of DFL, which theoretically demonstrates the effectiveness of OledFL from the perspective of optimization error bounds and generalization error bounds.

  • We conduct extensive experiments on CIFAR10&100 datasets under Dirichlet and Pathological distributions, respectively, which demonstrate that OledFL significantly enhances the convergence speed and generalization performance of existing DFL methods. With the assistance of Ole, it can achieve up to 5% performance improvement and 8×\times speedup, compared to the most popular DFedAvg optimizer in DFL. This notably reduces the gap between CFL and DFL (Table II & Figure 2).

II Related work

TABLE I: the theoretical differences between the baseline methods and the OledFL. Note that OledFL supports generalization analysis.
Method Non-convex Convergence Analysis Generalization Analysis Multi-local Update No bounded gradient
DPSGD [13, 19] \checkmark \checkmark \checkmark ×\times \checkmark
DFedAvg [15] \checkmark \checkmark ×\times \checkmark ×\times
DFedAvgM [15] \checkmark \checkmark ×\times \checkmark ×\times
DFedSAM [16] \checkmark \checkmark ×\times \checkmark \checkmark
OledFL [ours] \checkmark \checkmark \checkmark \checkmark \checkmark

Below, we will briefly review the most relevant work to our research, which includes Decentralized Federated Learning (DFL), acceleration techniques in optimization, and theoretical guarantees of DFL.

Decentralized Federated Learning (DFL). To mitigate the communication burden on the server in centralized scenarios, decentralized communication methods distribute the communication load to each node while maintaining overall communication complexity equivalent to that in centralized scenarios [13]. Additionally, decentralized communication methods afford improved privacy protection compared to CFL [20, 21, 22]. DFL has emerged as a promising field of research, recognized as a challenge in various review articles in recent years [23, 24]. Within DFL, Sun et al. [15] extend the FedAvg algorithm [1] to decentralized scenarios and complement it with local momentum acceleration to enhance convergence. Furthermore, Dai et al. [25] introduce sparse training into DFL to reduce communication and computation costs, while shi et al. [16] apply SAM to DFL and enhance the consistency among clients by incorporating Multiple Gossip Steps. These endeavors gradually improve the performance of DFL from different perspectives. However, a significant performance gap still exists compared to CFL. For further related work on DFL, please refer to the survey papers [11, 26, 23] and their references therein.

Acceleration Techniques for Deep Learning. In deep learning, momentum and restart are typical acceleration techniques. The momentum techniques focus more on the optimization of the optimizer design, while restart techniques focus more on the selection of initialization points. In Federated Learning (FL), one type of method used in the centralized FL (CFL) domain is the global momentum, such as FedCM [27] and MimeLite [28], which estimates the global momentum at the server and applies it to each client update, thereby alleviating the problem of client heterogeneity. Another type is the local momentum used in DFL, such as the DFedAvgM algorithm proposed by [15], which utilizes local momentum to accelerate the convergence process. Restart techniques have nearly negligible computational cost but significantly enhance the effectiveness of the algorithm. Zhou [29] introduces the Lookahead optimizer, which improves learning stability and reduces the variance of its inner optimizer through a kk-steps forward, 11 step back approach. Additionally, lin et al.[30] develop a universal framework called Catalyst, which can accelerate first-order optimization methods such as SAG [31], SAGA [32], and SVRG [33]. Subsequently, Trimbach et al.[34] extend Catalyst to the field of distributed learning and used Catalyst to accelerate the DSGD algorithm [35].

Theoretical Guarantees of DFL. A more comprehensive comparison is presented in Table I. In the aspect of convergence analysis, current DFL works such as DFedAvg [15], DFedAvgM [15], and DFedSAM [16] have only concentrated on the convergence analysis. DFedSAM, by abandoning the gradient bounded assumption in DFedAvg, has demonstrated an advancement in the convergence analysis of the algorithm. However, in machine learning, algorithms are often evaluated based on their ability to perform well on new, unseen data. This is known as generalization performance, and algorithms that exhibit higher generalization performance are considered to be state-of-the-art. However, in the field of DFL, there is currently a lack of analysis on how well-proposed algorithms generalize to new data. This results in a lack of theoretical support for why these algorithms perform well.

III Methodology

In this section, we begin by elucidating the meanings of several notations. We then introduce the problem setup, and then, we propose OledFL and compare it with existing methods to demonstrate the novelty of OledFL. Finally, we will elucidate the relation with Chebyshev Acceleration and provide an effective and intuitive explanation of the OledFL.

III-A Notations

Let mm be the total number of clients. TT represents the number of communication rounds. ()i,kt(\cdot)_{i,k}^{t} indicates variable ()(\cdot) at the kk-th iteration of the tt-th round in the ii-th client. 𝐱\mathbf{x} denotes the model parameters. The communication topology between clients can be represented as graph denoted as 𝒢=(𝒩,,𝐖)\mathcal{G}=(\mathcal{N},\mathcal{E},{\bf W}), where 𝒩={1,2,,m}\mathcal{N}=\{1,2,\ldots,m\} represents the set of clients, 𝒩×𝒩\mathcal{E}\subseteq\mathcal{N}\times\mathcal{N} denotes the links between clients, wi,jw_{i,j} represents the weight of the link between client jj and client ii and 𝐖=[wi,j][0,1]m×m{\bf W}=[w_{i,j}]\in[0,1]^{m\times m} represents the mixing matrix. The inner product of two vectors is denoted by ,\langle\cdot,\cdot\rangle, and \|\cdot\| represents the Euclidean norm of a vector. Other symbols will be explained in their respective contexts.

III-B Problem Setup

In this paper, we consider a network of mm clients whose objective is to jointly solve the following distributed population risk FF minimization problem:

min𝐱dF(𝐱):=1mi=1mFi(𝐱),Fi(𝐱)=𝔼ξ𝒟iFi(𝐱;ξ)\small\min_{{\bf x}\in\mathbb{R}^{d}}F({\bf x}):=\frac{1}{m}\sum_{i=1}^{m}F_{i}({\bf x}),~{}~{}F_{i}({\bf x})=\mathbb{E}_{\xi\sim\mathcal{D}_{i}}F_{i}({\bf x};\xi) (1)

where 𝒟i\mathcal{D}_{i} represents the data distribution in the ii-th client, which exhibits heterogeneity across clients. Each client’s local objective function Fi(𝐱;ξ)F_{i}({\bf x};\xi) is associated with the training data samples ξ\xi. We denote 𝐱𝒟=argmin𝐱F(𝐱)\mathbf{x}_{\mathcal{D}}^{\star}=\arg\min_{\mathbf{x}}F(\mathbf{x}) as the optimal value of (1). Unlike CFL, we address (1) by enabling clients to collaborate through a mixing matrix 𝐖\mathbf{W} in a decentralized manner, leveraging peer-to-peer communication among clients without the need for server coordination.

Practically, we consider the empirical risk minimization of the non-convex finite-sum problem in DFL as:

min𝐱df(𝐱):=1mi=1mfi(𝐱),fi(𝐱)=1Sizj𝒮ifi(𝐱;zj)\small\min_{{\bf x}\in\mathbb{R}^{d}}f({\bf x}):=\frac{1}{m}\sum_{i=1}^{m}f_{i}({\bf x}),~{}~{}f_{i}({\bf x})=\frac{1}{S_{i}}\sum_{z_{j}\in\mathcal{S}_{i}}f_{i}({\bf x};z_{j}) (2)

where each client stores a private dataset Si={zj}S_{i}=\{z_{j}\}, with zjz_{j} drawn from an unknown distribution 𝒟i\mathcal{D}_{i}. We denote 𝐱=argmin𝐱f(𝐱)\mathbf{x}^{\star}=\arg\min_{\mathbf{x}}f(\mathbf{x}) as the optimal value of problem (2).

III-C OledFL Algorithm

Refer to caption
(a)
Figure 1: Simulate the optimization process diagrams for two clients under the DFedAvg and OledFL algorithms. Due to the presence of 𝐱it𝐱i,Kt1\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1} (Ole), where 𝐱i,Kt1𝐱i\mathbf{x}_{i,K}^{t-1}\approx\mathbf{x}_{i}^{*}, then the Ole initial point in OledFL represents taking a step back along the direction from the optimize starting point to the local optimum of the client. Furthermore, from the length of the dashed lines in the figure, it is evident that Ole significantly reduces the inconsistency during the optimization process.

The most popular decentralized FL optimizers for solving the problem (2) are DFedSAM [16] and DFedAvg [15]. However, the decentralized approach, lacking central server coordination, leads to increased client inconsistency. To enhance consistency, we design the opposite lookahead enhanced (Ole) initialization method before performing local optimization at each client:

𝐱i,0t=𝐱it+β(𝐱it𝐱i,Kt1)𝐎𝐥𝐞\mathbf{x}_{i,0}^{t}=\mathbf{x}_{i}^{t}+\beta\underbrace{(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})}_{\mathbf{Ole}} (3)

where 𝐱it=jwi,j𝐱j,Kt1\mathbf{x}_{i}^{t}=\sum_{j}w_{i,j}\mathbf{x}_{j,K}^{t-1} is the aggregated model. Intuitively, at each initialization, clients obtain the opposite direction of the most recent iteration through Ole. This ensures that the values obtained in the subsequent optimization, 𝐱i,Kt\mathbf{x}_{i,K}^{t}, do not deviate too far from the initial value 𝐱it\mathbf{x}_{i}^{t}, thereby strengthening the consistency among clients. A more intuitive explanation of the effect of Ole is shown in Figure 1. The complete algorithm is presented in Algorithm 1 111Here we provide the default form of OledFL that is composed with SAM. In the experiments, we will instantiate OledFL as OledFL-SGD (by setting λ=0\lambda=0) and OledFL-SAM if necessary..

Algorithm 1 OledFL Algorithm
0:  Initialize η,λ>0\eta,\lambda>0 and 𝐱i0=𝐱i,K1=𝐱0d\mathbf{x}_{i}^{0}\!=\!\mathbf{x}_{i,K}^{-1}\!=\!\mathbf{x}^{0}\!\in\!\mathbb{R}^{d} for all nodes.
0:  model average parameters 𝐱¯t\bar{\mathbf{x}}^{t}.
1:  for t=0,1,2,,T1t=0,1,2,\cdots,T-1 do
2:     for client ii in parallel do
3:        set 𝐱i,0t=𝐱it+β(𝐱it𝐱i,Kt1)\mathbf{x}_{i,0}^{t}=\mathbf{x}_{i}^{t}+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})
4:        for k=0,1,2,,K1k=0,1,2,\cdots,K-1 do
5:           sample a minibatch εi,kt\varepsilon_{i,k}^{t} and do
6:           estimate stochastic gradient: 𝐠i,k,1t=fi(𝐱i,kt;εi,kt)\mathbf{g}_{i,k,1}^{t}=\nabla f_{i}(\mathbf{x}_{i,k}^{t};\varepsilon_{i,k}^{t})
7:           update extra step: 𝐱˘i,kt=𝐱i,kt+λ𝐠i,k,1t𝐠i,k,1\breve{\mathbf{x}}_{i,k}^{t}=\mathbf{x}_{i,k}^{t}+\lambda\frac{\mathbf{g}_{i,k,1}^{t}}{\|\mathbf{g}_{i,k,1}\|}
8:           estimate stochastic gradient: 𝐠i,kt=fi(𝐱˘i,kt;εi,kt)\mathbf{g}_{i,k}^{t}=\nabla f_{i}(\breve{\mathbf{x}}_{i,k}^{t};\varepsilon_{i,k}^{t})
9:           perform SGD step: 𝐱i,k+1t=𝐱i,ktη𝐠i,kt\mathbf{x}_{i,k+1}^{t}=\mathbf{x}_{i,k}^{t}-\eta\mathbf{g}_{i,k}^{t}
10:        end for
11:        𝐳it=𝐱i,Kt\mathbf{z}_{i}^{t}=\mathbf{x}_{i,K}^{t}
12:        Mix the received model 𝐳jt\mathbf{z}_{j}^{t} with mixing matrix 𝐖{\bf W} (Refer to Definition 1): 𝐱it+1=jwi,j𝐳jt\mathbf{x}_{i}^{t+1}=\sum_{j}w_{i,j}\mathbf{z}_{j}^{t}
13:     end for
14:  end for

Discuss on Lookhead Optimizer. Zhang et al. [36] propose the lookahead optimizer with the initial optimization value denoted as 𝐱0t\mathbf{x}_{0}^{t}, whose core iterative form is given by:

𝐱0t=ϕt1+β(𝐱Kt1ϕt1)\mathbf{x}_{0}^{t}=\phi^{t-1}+\beta(\mathbf{x}_{K}^{t-1}-\phi^{t-1}) (4)

It optimizes the sequence of “fast weights” 𝐱\mathbf{x} through an internal loop KK times and then utilizes these fast weights to determine the initial search direction of the “slow weights” ϕ\phi, which reduces variance and facilitates rapid convergence of lookahead in practice. When we omit the subscript ii in (3) and set ϕt1=jwi,j𝐱j,Kt1\phi^{t-1}=\sum_{j}w_{i,j}\mathbf{x}_{j,K}^{t-1} in (4), the Ole component in (3) is exactly opposite to the symbol in (4), which is the origin of the name “Opposite lookahead enhancement”.

Discuss on Catalyst. In distributed optimization, lin et al.[30] propose a generic acceleration scheme for a large class of optimization methods with the initial optimization value denoted as 𝐱0t\mathbf{x}_{0}^{t}, whose core iterative form as follows:

𝐱0t=𝐱t+βt(𝐱t𝐱t1)\mathbf{x}_{0}^{t}=\mathbf{x}^{t}+\beta^{t}(\mathbf{x}^{t}-\mathbf{x}^{t-1}) (5)

Through (5), significant acceleration in the convergence of algorithms such as SAG [31], SAGA [32], and SVRG [33] can be achieved. By comparing (5) with (3), it is evident that the initialization coefficient β\beta in (3) is a simpler fixed value. When considering the subscript ii in (5) [34], the subtraction in Ole is performed using the client’s most recently obtained parameter value 𝐱i,Kt1\mathbf{x}_{i,K}^{t-1} rather than the value after the previous communication 𝐱it1\mathbf{x}_{i}^{t-1}. This difference leads to a fundamental distinction between catalyst and ole, where ole initializes the point in the opposite direction of the current point and the local optimal value point, as shown in Figure 1 (OledFL), while catalyst, similar to the momentum method, initializes the point in the direction of 𝐱t𝐱t1\mathbf{x}^{t}-\mathbf{x}^{t-1} from the current point.

Discussion of the relation with Chebyshev Acceleration (CA): Chebyshev Acceleration has been widely utilized to expedite the attainment of consensus among networked agents during the distributed optimization process [37]. When CA is integrated with stochastic gradient accumulation employing a large batch size at each iteration, it can ensure optimal convergence [38, 39]. A key idea of CA involves transforming the mixing matrix from 𝐖\bf W to 𝐖~\tilde{\bf{W}}. Here, 𝐖~\widetilde{\bf W} is expressed as:

𝐖~:=((1+η)𝐖η𝐈𝐈0)\widetilde{\bf W}:=\begin{pmatrix}(1+\eta)\bf W&-\eta\bf I\\ \bf I&0\end{pmatrix}

In simple terms, CA transforms the communication topology 𝐖\bf W into (1+η)𝐖η𝐈(1+\eta)\bf W-\eta\bf I in the communication period, where η<1\eta<1. It can be proven that 𝐖~\widetilde{\bf W} facilitates faster achievement of consensus among multiple clients compared to 𝐖\bf W [40, 41]. The update form of Ole is equivalent to aggregation using the communication matrix (1+β)𝐖β𝐈(1+\beta)\bf W-\beta\bf I, demonstrating a connection between Ole and the core idea of CA. With the relationship to CA, we will constrain β<1\beta<1 in Algorithm 1. Figure 1 shows that Ole enhances consistency, and the relationship between Ole and CA can explain that the acceleration effect of CA originates from the enhancement of consistency among clients.

IV Theoretical Analysis

In this section, we begin by presenting the theoretical analysis, which offers a comprehensive examination of the combined performance in optimization and generalization. Then, we introduce the necessary assumptions utilized in our proofs. At last, we outline the main theorems, encompassing both optimization and generalization, respectively.

IV-A Excess Risk Error

In existing literature on DFL, the most of analyses focus on the studies from the onefold perspective of convergence but ignore learning its impact on generality [16, 15, 42]. Additionally, some studies in distributed learning exclusively analyze the algorithm’s generalization while overlooking the effect on convergence [43, 44, 45]. To offer a more comprehensive examination of the joint performance of both optimization and generalization in FL, we introduce the well-known concept of excess risk in our analysis.

Firstly, we define 𝐱¯T\bar{\mathbf{x}}^{T} as the final model generated by the OledFL method after TT communication rounds, where 𝐱¯T=1mi=1m𝐱iT\bar{\mathbf{x}}^{T}=\frac{1}{m}\sum_{i=1}^{m}\mathbf{x}_{i}^{T}. In comparison with f(𝐱¯T)f(\bar{\mathbf{x}}^{T}), our primary focus is on the efficiency of F(𝐱¯T)F(\bar{\mathbf{x}}^{T}) which corresponds to its generalization performance and test accuracy. Consequently, we analyze 𝔼[F(𝐱¯T)]\mathbb{E}[F(\bar{\mathbf{x}}^{T})] based on the excess risk E\mathcal{E}_{E} as:

E\displaystyle\mathcal{E}_{E} =𝔼[F(𝐱¯T)]𝔼[f(𝐱)]\displaystyle=\mathbb{E}[F(\bar{\mathbf{x}}^{T})]-\mathbb{E}[f(\mathbf{x}^{\star})] (6)
=𝔼[F(𝐱¯T)f(𝐱¯T)]G: generalization error+𝔼[f(𝐱¯T)f(𝐱)]O: optimization error\displaystyle=\underbrace{\mathbb{E}[F(\bar{\mathbf{x}}^{T})-f(\bar{\mathbf{x}}^{T})]}_{\mathcal{E}_{G}:\text{ generalization error}}+\underbrace{\mathbb{E}[f(\bar{\mathbf{x}}^{T})-f(\mathbf{x}^{\star})]}_{\mathcal{E}_{O}:\text{ optimization error}}

Generally, 𝔼[f(𝐱)]\mathbb{E}[f(\mathbf{x}^{\star})] is expected to be very small and even to zero if the model could fit the dataset. Thus E\mathcal{E}_{E} could be considered as the joint efficiency of the generated model 𝐱¯T\bar{\mathbf{x}}^{T}. Thereinto, G\mathcal{E}_{G} means the different performance of 𝐱¯T\bar{\mathbf{x}}^{T} between the training dataset and the test dataset, and O\mathcal{E}_{O} means the similarity between 𝐱¯T\bar{\mathbf{x}}^{T} and optimization optimum 𝐱\mathbf{x}^{\star}. From the perspective of the excess risk, E\mathcal{E}_{E} approximates our focus 𝔼[F(𝐱¯T)]\mathbb{E}[F(\bar{\mathbf{x}}^{T})]. It is worth noting that, in non-convex settings, E\mathcal{E}_{E} is measured via gradient residual 𝔼f(𝐱¯T)2\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{T})\|^{2} rather than f(𝐱¯T)f(𝐱)f(\bar{\mathbf{x}}^{T})-f(\mathbf{x}^{\star}). We will investigate the optimization and generalization performance respectively in the following subsections.

IV-B Definition And Assumptions

Below, we introduce the definition of the mixing matrix and several assumptions utilized in our analysis.

Definition 1

(Gossip/Mixing matrix). [Definition 1, [15]] The gossip matrix 𝐖=[wi,j][0,1]m×m{\bf W}=[w_{i,j}]\in[0,1]^{m\times m} is assumed to have the following properties: (i) (Graph) If iji\neq j and (i,j)𝒱(i,j)\notin{\cal V}, then wi,j=0w_{i,j}=0, otherwise, wi,j>0w_{i,j}>0; (ii) (Symmetry) 𝐖=𝐖{\bf W}={\bf W}^{\top}; (iii) (Null space property) null{𝐈𝐖}=span{𝟏}\mathrm{null}\{{\bf I}-{\bf W}\}=\mathrm{span}\{\bf 1\}; (iv) (Spectral property) 𝐈𝐖𝐈{\bf I}\succeq{\bf W}\succ-{\bf I}. Under these properties, the eigenvalues of 𝐖\bf{\bf W} satisfy 1=|ψ1(𝐖))|>|ψ2(𝐖))||ψm(𝐖))|1=|\psi_{1}({\bf W)})|>|\psi_{2}({\bf W)})|\geq\dots\geq|\psi_{m}({\bf W)})|. Furthermore, we define ψ:=max{|ψ2(𝐖)|,|ψm(𝐖))|}\psi:=\max\{|\psi_{2}({\bf W)}|,|\psi_{m}({\bf W)})|\} and 1ψ(0,1]1-\psi\in(0,1] as the spectral gap of 𝐖\bf W.

Assumption 1

(L-Smoothness) The non-convex function fif_{i} satisfies the smoothness property for all i[m]i\in[m], i.e., fi(𝐱)fi(𝐲)L𝐱𝐲\|\nabla f_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{y})\|\leq L\|\mathbf{x}-\mathbf{y}\|, for all 𝐱,𝐲d\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}.

Assumption 2

(Bounded Variance) The stochastic gradient 𝐠i,kt=fi(𝐱i,kt,εi,kt)\mathbf{g}_{i,k}^{t}=\nabla f_{i}(\mathbf{x}_{i,k}^{t},\varepsilon_{i,k}^{t}) with the randomly sampled data εi,kt\varepsilon_{i,k}^{t} on the local client ii is unbiased and with bounded variance, i.e., 𝔼[𝐠i,kt]=fi(𝐱i,kt)\mathbb{E}[\mathbf{g}_{i,k}^{t}]=\nabla f_{i}(\mathbf{x}_{i,k}^{t}) and 𝔼𝐠i,ktfi(𝐱i,kt)2σl2\mathbb{E}\|\mathbf{g}_{i,k}^{t}-\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}\leq\sigma_{l}^{2}, for all 𝐱i,ktd\mathbf{x}_{i,k}^{t}\in\mathbb{R}^{d}.

Assumption 3

(Bounded Heterogeneity) For all 𝐱d\mathbf{x}\in\mathbb{R}^{d}.the heterogeneous similarity is bounded on the gradient norm as 𝔼fi(w)2G2+B2𝔼f(w)2\mathbb{E}\|\nabla f_{i}(w)\|^{2}\leq G^{2}+B^{2}\mathbb{E}\|\nabla f(w)\|^{2}, where G0G\geq 0 and B1B\geq 1 are two constants.

Assumption 4

(Lipschitz Continuity). The global function fsatisfies the LGL_{G}-Lipschitz property, i.e. for 𝐱1,𝐱2\forall\mathbf{x}_{1},\mathbf{x}_{2}, f(𝐱1)f(𝐱2)LG𝐱1𝐱2.\|f(\mathbf{x}_{1})-f(\mathbf{x}_{2})\|\leq L_{G}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|.

Definition 1 is commonly used to describe the communication topology in DFL, where it can also be viewed as a Markov transition matrix. The term 1ψ1-\psi measures the speed of convergence to the equilibrium state. Assumptions 1-3 are mild and commonly used to analyze the non-convex objective DFL [15, 16]. Assumption 4 is utilized to bound the uniform stability for the non-convex objective [43, 46, 47]. It is important to note that when describing the algorithm’s convergence speed, we only use assumptions 1-3; whereas in describing the algorithm’s generalization bound, we utilize assumptions 1, 2 and 4. This is because assumption 4 implies bounded gradients, i.e., fi(𝐱)LG\|\nabla f_{i}(\mathbf{x})\|\leq L_{G}, while assumption 3 is more general than the bounded gradient assumption.

IV-C Optimization Analysis

In this part, we present the optimization error and convergence rate of OledFL under the assumptions 1-3. All the proofs can be found in the Appendix B-A.

Theorem 1

Under Assumption 1 - 3, let the learning rate satisfy η1K3/2LB\eta\leq\frac{1}{K^{3/2}LB} where K2K\geq 2, let the Ole parameter βmin{10(1ψ)40,530}\beta\leq\min\{\frac{\sqrt{10}(1-\psi)}{40},\frac{\sqrt{5}}{30}\}, and after training TT rounds, the averaged model parameters generated by our proposed algorithm satisfies:

1Tt=0T1𝔼f(𝐱¯t)2𝔼[f(𝐱¯0)f(𝐱¯T)]κηKT+ζ(η,L,ψ)σl2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}\leq\frac{\mathbb{E}[f(\bar{\mathbf{x}}^{0})-f(\bar{\mathbf{x}}^{T})]}{\kappa\eta KT}+\zeta(\eta,L,\psi)\sigma_{l}^{2}
+α(η,K,L,ψ)G2+ϕ(K,η,L)λ2χ(L,ψ,ΔT,T)β2\displaystyle+\alpha(\eta,K,L,\psi)G^{2}+\phi(K,\eta,L)\lambda^{2}-\chi(L,\psi,\Delta^{T},T)\beta^{2}

where κ(0,1)\kappa\in(0,1) is a constant and α(η,K,L,ψ)=9κη2K2L2(1+24(1ψ)2)\alpha(\eta,K,L,\psi)=\frac{9}{\kappa}\eta^{2}K^{2}L^{2}\left(1+\frac{24}{(1-\psi)^{2}}\right), ϕ(K,η,L)=ηLκ(1+9K2ηL)\phi(K,\eta,L)=\frac{\eta L}{\kappa}(1+9K^{2}\eta L), χ(L,ψ,ΔT,T)=(3+72(1ψ)2)L2ΔTκ(1γ)T\chi(L,\psi,\Delta^{T},T)=\left(3+\frac{72}{(1-\psi)^{2}}\right)\frac{L^{2}\Delta^{T}}{\kappa(1-\gamma)T}, which the consistency term ΔT=1mi=1m𝔼𝐱i,KT1𝐱iT2\Delta^{T}=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i,K}^{T-1}-\mathbf{x}_{i}^{T}\|^{2} and ζ(η,L,ψ)=1κηL(2+36(1ψ)2)\zeta(\eta,L,\psi)=\frac{1}{\kappa}\eta L\left(2+\frac{36}{(1-\psi)^{2}}\right).

Further, by selecting the proper learning rate η=𝒪(1KT)\eta=\mathcal{O}(\frac{1}{\sqrt{KT}}) and let D=f(𝐱¯0)f(𝐱¯)D=f(\bar{\mathbf{x}}^{0})-f(\bar{\mathbf{x}}^{*}) as the initialization bias, thenthe averaged model parameters 𝐱¯t\bar{\mathbf{x}}^{t} satisfies:

1Tt=0T1𝔼f(𝐱¯t)2=𝒪(DKT+KL2T(1ψ)2G2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}=\mathcal{O}\left(\frac{D}{\sqrt{KT}}+\frac{KL^{2}}{T(1-\psi)^{2}}G^{2}\right.
+LTK(1ψ)2σl2+LTKλ2)\displaystyle\qquad\qquad\qquad+\left.\frac{L}{\sqrt{T}K(1-\psi)^{2}}\sigma_{l}^{2}+\frac{L}{\sqrt{T}K}\lambda^{2}\right)
Remark 1

When setting λ=0\lambda=0 in Theorem 1, we can obtain the optimization error and convergence rate of OledFL with local SGD. Moreover, by setting λ=𝒪(1T)\lambda=\mathcal{O}(\frac{1}{\sqrt{T}}), the term LTKλ2=𝒪(1KT3/2)\frac{L}{\sqrt{T}K}\lambda^{2}=\mathcal{O}(\frac{1}{KT^{3/2}}) in the convergence rate, as generated by local SAM, can be neglected in comparison to the dominant term 𝒪(1KT)\mathcal{O}(\frac{1}{\sqrt{KT}}). Furthermore, OledFL achieves a convergence rate of 𝒪(1KT)\mathcal{O}(\frac{1}{\sqrt{KT}}), which has been demonstrated as the optimal rate in stochastic methods in DFL under general assumptions. Additionally, in the dominant term of the convergence rate 𝒪(DKT+LTK(1ψ)2σl2)\mathcal{O}(\frac{D}{\sqrt{KT}}+\frac{L}{\sqrt{T}K(1-\psi)^{2}}\sigma_{l}^{2}), it can be observed that better topological connectivity leads to a faster convergence rate. We will verify this conclusion in Section V-C. Moreover, a larger number of local epochs KK also leads to a faster convergence rate, as confirmed in Section V-D (see Figure 8 (b)).

Remark 2

Theorem 1 provides a general convergence bound for the OledFL. When β=0\beta=0, it degrades to the vanilla DFedAvg method [15]. Similar to DFedAvg, it is influenced by the initialization bias DD and the intrinsic variance σl\sigma_{l}. However, with the adoption of initialization, the consistency term ΔT\Delta^{T} can assist in reducing the upper bound of the convergence rate. Furthermore, from Theorem 1, it can be observed that a larger β\beta leads to a smaller optimization error. On the other hand, βmin{10(1ψ)40,530}\beta\leq\min\{\frac{\sqrt{10}(1-\psi)}{40},\frac{\sqrt{5}}{30}\} provides an upper bound. By neglecting specific numerical relations and focusing on the associated relations, we can obtain an important conclusion: better-connected communication topology (implying a smaller 1ψ1-\psi) corresponds to a smaller optimal value of β\beta for the algorithm. Section V-C verifies our conclusion (see Figure 7).

TABLE II: Top 1 test accuracy (%) on two datasets in both IID and non-IID settings.
Algorithm CIFAR-10     CIFAR-100
Dir 0.3 Dir 0.6 IID     Dir 0.3 Dir 0.6 IID
FedAvg 78.10 ±\pm 0.81 79.00 ±\pm 1.04 81.16 ±\pm 0.20     54.50 ±\pm 0.70 55.69 ±\pm 0.72 56.94 ±\pm 0.41
FedSAM 80.22 ±\pm 0.70 81.43 ±\pm 0.28 82.83 ±\pm 0.23     57.76 ±\pm 0.36 58.62 ±\pm 0.51 59.97 ±\pm 0.19
SCAFFOLD 78.39 ±\pm 0.35 79.85 ±\pm 0.18 81.75 ±\pm 0.23     60.23 ±\pm 0.30 61.42 ±\pm 0.52 62.75 ±\pm 0.31
D-PSGD 59.76 ±\pm 0.04 60.03 ±\pm 0.13 62.93 ±\pm 0.12     55.68 ±\pm 0.20 56.68 ±\pm 0.10 57.60 ±\pm 0.12
DFedAvg 77.25 ±\pm 0.12 77.83 ±\pm 0.11 79.97 ±\pm 0.08     58.17 ±\pm 0.10 58.22 ±\pm 0.50 59.00 ±\pm 0.31
DFedAvgM 79.30 ±\pm 0.24 80.66 ±\pm 0.07 82.72 ±\pm 0.20     57.79 ±\pm 0.29 58.13 ±\pm 0.39 58.90 ±\pm 0.47
DFedSAM 79.37 ±\pm 0.07 80.47 ±\pm 0.09 82.14 ±\pm 0.09     57.67 ±\pm 0.20 58.55 ±\pm 0.23 59.53 ±\pm 0.24
OledFL-SGD 82.20 ±\pm 0.20 82.72 ±\pm 0.22 84.10 ±\pm 0.24     60.00 ±\pm 0.15 60.65 ±\pm 0.30 62.58 ±\pm 0.21
OledFL-SAM 84.45 ±\pm 0.19 84.70 ±\pm 0.10 85.75 ±\pm 0.20     61.22 ±\pm 0.12 61.78 ±\pm 0.22 63.55 ±\pm 0.20
Algorithm CIFAR-10     CIFAR-100
Path 2 Path 4 Path 6     Path 10 Path 20 Path 30
FedAvg 69.01 ±\pm 0.96 75.80 ±\pm 1.01 76.76 ±\pm 1.11     52.12 ±\pm 0.70 55.70 ±\pm 0.63 57.14 ±\pm 0.26
FedSAM 69.33 ±\pm 1.32 75.78 ±\pm 1.04 77.55 ±\pm 0.68     52.46 ±\pm 0.69 55.75 ±\pm 0.55 57.60 ±\pm 0.36
SCAFFOLD 64.28 ±\pm 2.12 78.08 ±\pm 0.39 80.35 ±\pm 0.19     54.17 ±\pm 0.51 58.32 ±\pm 0.67 60.50 ±\pm 0.35
D-PSGD 59.53 ±\pm 0.40 63.80 ±\pm 0.22 64.86 ±\pm 0.48     52.66 ±\pm 0.58 55.79 ±\pm 0.11 56.82 ±\pm 0.35
DFedAvg 74.73 ±\pm 0.26 77.50 ±\pm 0.25 79.00 ±\pm 0.25     54.72 ±\pm 0.19 57.57 ±\pm 0.11 58.30 ±\pm 0.24
DFedAvgM 75.10 ±\pm 0.27 79.50 ±\pm 0.33 81.07 ±\pm 0.27     48.19 ±\pm 0.45 53.60 ±\pm 0.64 53.95 ±\pm 0.85
DFedSAM 75.08 ±\pm 0.11 79.85 ±\pm 0.09 81.36 ±\pm 0.11     53.86 ±\pm 0.17 57.58 ±\pm 0.22 58.80 ±\pm 0.21
OledFL-SGD 78.20 ±\pm 0.24 81.54 ±\pm 0.25 83.10 ±\pm 0.30     55.03 ±\pm 0.11 58.87 ±\pm 0.13 59.75 ±\pm 0.10
OledFL-SAM 79.58 ±\pm 0.18 83.45 ±\pm 0.21 84.90 ±\pm 0.18     56.68 ±\pm 0.16 60.22 ±\pm 0.24 61.57 ±\pm 0.21

IV-D Generalization Analysis

In this section, we primarily present a generalization analysis of our OledFL method under two gradient properties (bounded variance and Lipschitz continuity). We utilize uniform stability analysis, which is widely adopted in previous literature, to analyze the generalization error of OledFL. Next, we introduce the definition of uniform stability and then demonstrate the effectiveness of OledFL. All detailed proofs can be available in the Appendix B-B.

In DFL framework, we suppose there are mm clients participating in the training process as a set 𝒞={i}i=1m\mathcal{C}=\{i\}_{i=1}^{m}. Each client has a local dataset 𝒮i={zj}j=1S\mathcal{S}_{i}=\{z_{j}\}_{j=1}^{S} with total SS data sampled from a specific unknown distribution 𝒟i\mathcal{D}_{i}. Now we define a re-sampled dataset 𝒮i~\widetilde{\mathcal{S}_{i}} which only differs from the dataset 𝒮i\mathcal{S}_{i} on the jj^{*}-th data. We replace the 𝒮i\mathcal{S}_{i^{*}} with 𝒮~i\widetilde{\mathcal{S}}_{i^{*}} and keep other m1m-1 local dataset, which composes a new set 𝒞~\widetilde{\mathcal{C}}. 𝒞\mathcal{C} only differs from the 𝒞~\widetilde{\mathcal{C}} at jj^{*}-th data on the ii^{*}-th client. Then, based on these two sets, OledFL could generate two solutions, 𝐱¯t\bar{\mathbf{x}}^{t} and 𝐱¯~t\widetilde{\bar{\mathbf{x}}}^{t} respectively, after tt rounds. By bounding the difference according to these two models, we can obtain stability and generalization efficiency.

Definition 2

(Uniform Stability [47]) For these two models 𝐱¯T\bar{\mathbf{x}}^{T} and 𝐱¯~T\widetilde{\bar{\mathbf{x}}}^{T} generated as introduced above, a general method satisfies ϵ\epsilon-uniformly stability if:

supzj{𝒟i}𝔼[f(𝐱¯T;zj)f(𝐱¯~T;zj)]ϵ.\operatorname*{sup}_{z_{j}\sim\{\mathcal{D}_{i}\}}\mathbb{E}[f(\bar{\mathbf{x}}^{T};z_{j})-f(\widetilde{\bar{\mathbf{x}}}^{T};z_{j})]\leq\epsilon. (7)

Moreover, if a general method satisfies ϵ\epsilon-uniformly stability, then its generalization error could also be bounded [48, 47]

Gsupzj{𝒟i}𝔼[f(𝐱¯T;zj)f(𝐱¯~T;zj)]ϵ.\mathcal{E}_{G}\leq\operatorname*{sup}_{z_{j}\sim\{\mathcal{D}_{i}\}}\mathbb{E}[f(\bar{\mathbf{x}}^{T};z_{j})-f(\widetilde{\bar{\mathbf{x}}}^{T};z_{j})]\leq\epsilon.
Theorem 2

Under Assumption 2, 3, and 4, let all conditions in the optimization process be satisfied, let the learning rate be selected as η=𝒪(1t)=ct\eta=\mathcal{O}(\frac{1}{t})=\frac{c}{t} where cc is a constant and β1ψ4m+1ψ\beta\leq\frac{1-\psi}{4\sqrt{m}+1-\psi}, let t0t_{0} be a specific round to firstly select the different data sample, and let U=sup𝐱,z{f(𝐱;z)}U=\sup_{\mathbf{x},z}\{f({\mathbf{x}};z)\} be the upper bound, for arbitrary data sample zz followed the joint distribution 𝒟i\mathcal{D}_{i}, we have:

𝔼f(𝐱¯T+1;z)f(𝐱¯~T+1;z)Ut0S+(2LG(LG+Sσl)(1+2β)SL\displaystyle\mathbb{E}\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\|\leq\frac{Ut_{0}}{S}+\Big{(}\frac{2L_{G}(L_{G}+S\sigma_{l})}{(1+2\beta)SL}
+2LG(1+β)K(σl+LG)α(1+2β)Cλ)(Tt0)cKL\displaystyle\quad+\frac{2L_{G}(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}C_{\lambda}\Big{)}\left(\frac{T}{t_{0}}\right)^{cKL}

where SS denotes the amount of data owned by each client, α=14mβ(1ψ)(1β),Cλ:=ln1λλln1λλ+ln21λ16λλln1λ8+2λln1λ,(λ0);Cλ=0,(λ=0).\alpha=1-\frac{4\sqrt{m}\beta}{(1-\psi)(1-\beta)},C_{\lambda}:=\ln\frac{1}{\lambda}\frac{\lambda^{\ln\frac{1}{\lambda}}}{\lambda}+\frac{\ln^{2}\frac{1}{\lambda}}{16\lambda}\lambda^{\frac{\ln\frac{1}{\lambda}}{8}}+\frac{2}{\lambda\ln\frac{1}{\lambda}},(\lambda\neq 0);C_{\lambda}=0,(\lambda=0).\\ Furthermore, to minimize the stability errors, we can select the proper observation point t0=TcKL1+cKL((2LG(LG+Sσl)(1+2β)SL+2LG(1+β)K(σl+LG)α(1+2β)Cλ)ScKLU)11+cKLt_{0}\!=\!T^{\frac{cKL}{1\!+\!cKL}}\!\!\left((\frac{2L_{G}(L_{G}+S\sigma_{l})}{(1\!+\!2\beta)SL}\!+\!\frac{2L_{G}(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}C_{\lambda})\frac{ScKL}{U}\!\right)\!^{\frac{1}{1+cKL}} we then have

𝔼f(𝐱¯T+1;z)f(𝐱¯~T+1;z)2TcKL1+cKL(2(LG+Sσl)(1+2β)SL\displaystyle\mathbb{E}\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\|\leq 2T^{\frac{cKL}{1+cKL}}\Big{(}\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}
+2(1+β)K(σl+LG)α(1+2β)Cλ)11+cKL(US)cKL1+cKL\displaystyle\quad+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}C_{\lambda}\Big{)}^{\frac{1}{1+cKL}}\left(\frac{U}{S}\right)^{\frac{cKL}{1+cKL}}
Remark 3

We provide further discussion on the impact of β\beta on the convergence rate and generalization performance of the algorithm. Firstly, from the negative coefficient in the last term of Theorem 1, it can be inferred that a larger β\beta value will reduce the optimization error, thereby accelerating the convergence rate. As depicted in Figure 8 (a), with the increase in β\beta, the convergence curve becomes steeper, demonstrating a faster convergence speed. Additionally, the expression of the generalization error derived from Theorem 2, (2(LG+Sσl)(1+2β)SL+2(1+β)K(σl+LG)α(1+2β)Cλ)11+cKL(\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}C_{\lambda})^{\frac{1}{1+cKL}} indicates that as β\beta increases, the terms 11+2β\frac{1}{1+2\beta} and 1+β1+2β\frac{1+\beta}{1+2\beta} decrease. This implies that with an increase in β\beta, the generalization error of the algorithm will decrease, thereby enhancing its generalization performance. It is evident from Figure 8 (a) that as β\beta increases, the maximum value of the convergence curve (indicating generalization performance) also increases. In summary, a larger β\beta will simultaneously enhance the convergence speed and generalization of the algorithm. Furthermore, from Theorem 2, it can be observed that better topological connectivity (smaller CλC_{\lambda}) leads to a smaller generalization error, which is validated in Section V-C (see Figure 6).

Remark 4

Theorem 2 establishes the generalization error G\mathcal{E}_{G} of OledFL. When β=0\beta=0, it reduces to the vanilla DFedAvg [15], the generalization error is given by 𝒪((TS)cKL1+cKL(1+KCλ)11+cKL)\mathcal{O}\left((\frac{T}{S})^{\frac{cKL}{1+cKL}}(1+KC_{\lambda})^{\frac{1}{1+cKL}}\right), which fills the gap in the generalization performance of DFedAvg [15]. From the generalization error, we mainly focus on the terms of the total number of data samples SS, training length TT and KK. Compared with the generalization bound of D-SGD proposed by [43], 𝒪((1+Cλ)TcL1+cL)\mathcal{O}((1+C_{\lambda})T^{\frac{cL}{1+cL}}), the local epochs KK in DFedAvg can enhance its generalization, which is consistent with the observation in Figure 2&3.

V Experiment

In this section, we conduct extensive experiments to verify the effectiveness of the proposed OledFL.

TABLE III: Communication rounds for each method achieving target accuracy on the CIFAR-10 dataset.
Methods Dir 0.3 Dir 0.6 IID
Acc@75 Acc@77 Acc@80 Acc@75 Acc@77 Acc@80 Acc@78 Acc@80 Acc@82
FedAvg 141 (1.3×\times) 244 (1.7×\times) >500 111 (1.4×\times) 166 (1.7×\times) 485 (1.0×\times) 150 (1.2×\times) 243 (2.0×\times) >500
FedSAM 141 (1.3×\times) 202 (2.1×\times) 402 (1.2×\times) 121 (1.3×\times) 165 (1.7×\times) 296 (1.7×\times) 142 (1.2×\times) 199 (2.4×\times) 363 (0.8 ×\times)
SCAFFOLD 264 (0.7×\times) 356 (1.2×\times) >500 202 (0.8×\times) 262 (1.1×\times) 470 (1.1×\times) 180 (1.0×\times) 273 (1.8×\times) >500
D-PSGD >500 >500 >500 >500 >500 >500 >500 >500 >500
DFedAvg 179 (1.0×\times) 419 (1.0×\times) >500 152 (1.0×\times) 283 (1.0×\times) >500 176 (1.0×\times) 479 (1.0×\times) >500
DFedAvgM 93 (1.9×\times) 141 (3.0×\times) >500 64 (2.4×\times) 99 (2.9×\times) 305 (1.6×\times) 59 (3.0×\times) 117 (4.1×\times) 303 (1.7×\times)
DFedSAM 187 (1.0×\times) 265 (1.6×\times) >500 155 (1.0×\times) 203 (1.4×\times) 414 (1.2×\times) 143 (1.2×\times) 212 (2.3×\times) 452 (1.1×\times)
OledFL-SGD 54 (3.3×\times) 77 (5.4×\times) 146 (3.4×\times) 41 (3.7×\times) 58 (4.9×\times) 110 (4.5×\times) 37 (4.8×\times) 59 (8.1×\times) 91 (5.5×\times)
OledFL-SAM 57 (3.1×\times) 68 (6.2×\times) 104 (4.8×\times) 45 (2.9×\times) 53 (5.3×\times) 82 (6.1×\times) 48 (3.7×\times) 63 (7.6×\times) 90 (5.6×\times)
Methods Path 2 Path 4 Path 6
Acc@55 Acc@65 Acc@70 Acc@65 Acc@70 Acc@75 Acc@70 Acc@75 Acc@80
FedAvg 137 (0.5×\times) 283 (0.5×\times) >500 113 (0.5×\times) 180 (0.5×\times) 327 (0.6×\times) 129 (0.5×\times) 208 (0.5×\times) >500
FedSAM 156 (0.4×\times) 313 (0.4×\times) 453 (0.5×\times) 119 (0.5×\times) 172 (0.5×\times) 333 (0.6×\times) 132(0.4×\times) 225 (0.5×\times) >500
SCAFFOLD 183 (0.3×\times) 439 (0.3 ×\times) >500 122 (0.5×\times) 172 (0.5 ×\times) 287 (0.7 ×\times) 114 (0.5 ×\times) 182 (0.6 ×\times) 439 (1.1×\times)
D-PSGD 340 (0.2×\times) >500 >500 >500 >500 >500 >500 >500 >500
DFedAvg 62 (1.0×\times) 139 (1.0×\times) 223 (1.0×\times) 59 (1.0×\times) 92 (1.0×\times) 187 (1.0×\times) 59 (1.0×\times) 113 (1.0×\times) >500
DFedAvgM 126 (0.5×\times) 179 (0.8×\times) 244 (0.9×\times) 46 (1.3×\times) 64 (1.4×\times) 119 (1.6×\times) 32 (1.8×\times) 65 (1.7×\times) 257 (1.9×\times)
DFedSAM 75 (0.8×\times) 157 (0.9×\times) 248 (0.9×\times) 75 (0.8×\times) 111 (0.8×\times) 187 (1.0×\times) 84 (0.7×\times) 128 (0.9×\times) 328 (1.5×\times)
OledFL-SGD 34 (1.8×\times) 66 (2.1×\times) 94 (2.4×\times) 25 (2.4×\times) 36 (2.6×\times) 57 (3.3×\times) 24 (2.5×\times) 37 (3.1×\times) 89 (5.6×\times)
OledFL-SAM 44 (1.4×\times) 88 (1.6×\times) 117 (1.9×\times) 30 (2.0×\times) 41 (2.2×\times) 63 (3.0×\times) 31 (1.9×\times) 43 (2.6×\times) 83 (6.0×\times)

V-A Experiment Setup

Dataset. We evaluate the proposed OledFL on CIFAR-10&100 datasets [49] in both IID and non-IID settings. To simulate non-IID data distribution among federated clients, we utilize the Dirichlet [50] and Pathological distribution [51]. Specifically, in the Dirichlet distribution, the local data of each client is partitioned by sampling label ratios from the Dirichlet distribution Dir(α\alpha). A smaller value of α\alpha indicates a higher degree of non-IID. In our experiments, we set α=0.3\alpha=0.3 and α=0.6\alpha=0.6 to represent different levels of non-IID. In the Pathological distribution, the local data of each client is partitioned by sampling label ratios from the Pathological distribution Path(α\alpha), where the value of α\alpha represents the number of classes owned by each client. For example, in the CIFAR-10 dataset, we set α=2\alpha=2 to indicate that each client possesses only 2 randomly selected classes out of the 10 available. In our experimental setup, we respectively set α=2\alpha=2, α=4\alpha=4, and α=6\alpha=6 for the CIFAR-10 dataset, and α=10\alpha=10, α=20\alpha=20, and α=30\alpha=30 for the CIFAR-100 dataset.

Baselines. The baselines used for comparison include several state-of-the-art (SOTA) methods in both the CFL and DFL settings. Specifically, the centralized baselines include FedAvg [1], FedSAM [52, 17], and SCAFFOLD [18] . In the decentralized setting, D-PSGD [13], DFedAvg, and DFedAvgM [15], along with DFedSAM [16], are used for comparison. Note that both our baseline and proposed algorithms are designed to operate synchronously.

Implementation Details. The total number of clients is set to 100, with 10% of the clients participating in the communication, creating a random bidirectional topology. For decentralized methods, all clients perform the local iteration step, while for centralized methods, only the participating clients perform the local update [16]. The local learning rate is initialized to 0.1 with a decay rate of 0.998 per communication round for all experiments. For SAM-based algorithms, such as DFedSAM and OledFL-SAM, we set the perturbation weight as λ=0.1\lambda=0.1. As for β\beta, we set β=0.99\beta=0.99 for CIFAR-10 and β=0.8\beta=0.8 for CIFAR-100 in the random topology. The maximum number of communication rounds is set to 500 for all experiments on CIFAR-10&100. Additionally, all ablation studies are conducted on the CIFAR-10 dataset with a data partition method of Dir 0.3 and 500 communication rounds.

Communication Configurations. To ensure a fair comparison between decentralized and centralized approaches, we have implemented a dynamic and time-varying connection topology for DFL methods. This approach ensures that the number of connections in each round does not exceed the number of connections in the centralized server, thus enabling the matching of communication volume between the decentralized and centralized methods as [25]. To ensure a fair comparison, we regulate the number of neighbors for each client in DFL using the client participation rate. Here, we set each client to randomly select 10 neighbors during each communication round. Other communication topologies, such as Grid, are generated according to corresponding rules.

V-B Performance Evaluation

Refer to caption
(a) CIFAR-10-Path
Refer to caption
(b) CIFAR-10-Dir
Figure 2: Test accuracy of all baselines on CIFAR-10 in both IID and different non-IID settings.
Refer to caption
(a) CIFAR-100-Path
Refer to caption
(b) CIFAR-100-Dir
Figure 3: Test accuracy of all baselines on CIFAR-100 in both IID and different non-IID settings.

Figures 2 and 3 display the comparison between the baseline method and OledFL on the CIFAR10&100 datasets under Dirichlet and Pathological distributions. From the figures, it is evident that OledFL can converge rapidly with fewer communication rounds, and OledFL-SAM also surpasses SCAFFOLD in terms of generalization performance. It is worth noting that, under the Pathological data distribution, CFL methods exhibit instability during training, whereas DFL methods provide a more stable training process.

Performance Analysis. In Table II, we conduct a series of experiments to compare the performance of our method with baseline methods. The results show that under various non-IID settings, our method consistently outperforms other methods, for example, on the CIFAR-10 dataset, our method outperforms the previous SOTA method DFedSAM by at least 4.5% in both Dirichlet and Pathological distributions. Even in the IID case, our method also outperforms DFedSAM by at least 3.5%. Similarly, these results are also observed on the more complex CIFAR-100 dataset, where compared to DFedSAM, our method provides at least a 3% improvement under Dirichlet distribution and at least a 2.6% improvement under Pathological distribution.

Impact of non-IID levels (α\alpha). The robustness of OledFL in various non-IID settings is evident from Figure 2& 3 and Table II. A smaller value of alpha (α\alpha) indicates a higher level of non-IID, which makes the task of optimizing a consensus problem more challenging. However, our algorithm consistently outperforms all baselines across different levels of non-IID. Furthermore, methods based on the SAM optimizer consistently achieve higher accuracy across different levels of non-IID. For instance, on the CIFAR-10 dataset, under the Dirichlet distribution, the performance of our algorithm OledFL-SAM only decreases by 1.3% from the IID to Dir 0.3, which significantly outperforms DFedSAM (2.76%), DFedAvgM (3.42%), DFedAvg (2.72%), D-PSGD (3.17%), and FedAvg (3.06%), respectively. This demonstrates the robustness of our method to heterogeneous data.

Refer to caption
(a)
Figure 4: The comparison of loss landscapes between DFedSAM and OledFL-SAM. Whereas the wireframe represents the loss landscape of OledFL-SAM, the surface represents the loss landscape of OledFL. It is clear that OledFL-SAM can find smoother minima.

Impact of Ole Parameter β\beta. In Table II and Figure 2&3, we observe a significant improvement in generalization and convergence speed due to Ole initialization. Taking the example of CIFAR-10 under Dirichlet 0.3, parameter initialization helps the DFedAvg algorithm achieve an improvement of around 4% in accuracy, and the DFedSAM algorithm achieves an improvement of around 5% in accuracy. In terms of convergence speed, parameter initialization helps both the DFedAvg and DFedSAM algorithms achieve at least 3×\times speedup, and in some cases, even up to 8×\times speedup, which coincides with our theoretical analysis and demonstrates the efficacy of our proposed OledFL (see Remark 3).

Refer to caption
(a) DFedSAM-3D
Refer to caption
(b) OledFL-SAM-3D
Refer to caption
(c) DFedSAM-2D
Refer to caption
(d) OledFL-SAM-2D
Figure 5: (a) and (b) depict the comparison of loss landscapes between DFedSAM and OledFL-SAM, while (c) and (d) show the contour plots of the loss landscapes of DFedSAM and OledFL-SAM. From (c) and (d), it can be observed that OledFL-SAM optimizes deeper than DFedSAM. As shown in Figure 4, a comparison in (a) and (b) of Figure 5 indicates that OledFL-SAM is able to find a flatter loss surface.

Explanation of the Effectiveness of Ole. To explore the effectiveness of Ole, we plot the loss landscapes and corresponding contour maps of OledFL-SAM and DFedSAM on the CIFAR10 dataset under Dirichlet 0.3, as shown in Figure 5. It is evident that OledFL-SAM can find parameters with lower loss values. Furthermore, from Figure 4, it can be observed that the loss landscape of OledFL-SAM is flatter. According to the research findings of Keskar et al. [53] and Dziugaite et al. [54], flatter loss surfaces lead to better generalization performance. This observation can explain why OledFL-SAM may exhibit superior generalization performance. This is also consistent with the theoretical analysis results in Remark 3.

V-C Topology-aware Performance

Below, we explore the effects of topologies on different DFL methods on CIFAR-10 dataset with Dirichlet α=0.3\alpha=0.3.

TABLE IV: Top 1 test accuracy (%) in various network topologies compared with decentralized algorithms on CIFAR-10.
Algorithm Ring Grid Exp Full
D-PSGD 51.24 52.06 55.58 65.46
DFedAvg 63.30 74.72 77.11 78.52
DFedAvgM 65.49 76.89 78.01 80.14
DFedSAM 65.12 77.86 78.88 81.04
OledFL-SGD 71.78 81.05 78.73 81.52
OledFL-SAM 74.48 82.95 79.99 83.13
Refer to caption
(a)
Figure 6: Accuracy of different DFL algorithms with different decentralized topologies on the test dataset.
TABLE V: ψ\psi value of different topology. Here 0 means the extra term will disappear and N/A means the term will diverge.
Topology ψ\psi 11ψ\frac{1}{1-\psi}
Fully connected 0 0
Exponential 121+lnm1-\frac{2}{1+\ln{m}} 𝒪(m)\mathcal{O}(m)
Grid 11mlnm1-\frac{1}{m\ln{m}} 𝒪(mlnm)\mathcal{O}(m\ln{m})
Ring 116π23m21-\frac{16\pi^{2}}{3m^{2}} 𝒪(m2)\mathcal{O}(m^{2})

Impact of Sparse Connectivity ψ\psi. Each client in the network only communicates with its predetermined neighbors, and the specific communication pattern is determined by the corresponding topology. In Table V, the degree of sparse connectivity ψ\psi follows the order: Ring >> Grid >> Exponential >> Full-connected [16, 45, 55]. From Figure 6 and Table IV, It can be observed that there is a general trend: as the sparse connectivity ψ\psi decreases, the accuracy of our proposed algorithm on the test set increases. This can be attributed to the fact that a well-designed communication topology enables clients to obtain better initial optimization points through communication, leading to improved results. Furthermore, OledFL consistently achieves higher test set accuracy compared to other DFL baselines across various topologies. For example, Ole can enhance DFedSAM’s performance by over 9% on a Ring topology. Additionally, from Figure 6, it can be observed that the convergence rate and generalization performance of OledFL generally increases with better topological connectivity, confirming the conclusion of Remarks 1&3.

Refer to caption
(a)
Figure 7: Optimal β\beta for Different Topologies in OledFL. It can be observed that OledFL-SGD exhibits a decreasing relationship between the optimal β\beta under different topologies and the second largest eigenvalue ψ\psi of the communication topology. This confirms Remark 2 in Theorem 1.

Relationship between β\beta and ψ\psi. From Section IV, we have concluded that tighter topological connectivity (indicating smaller ψ\psi) corresponds to smaller initial parameter values β\beta. To validate this theoretical result, we conduct experiments on the OledFL algorithm using the parameter set β={0.1,0.2,,0.9}\beta=\{0.1,0.2,\ldots,0.9\} in each of the four topologies presented in Table V to select the optimal β\beta corresponding to the algorithm’s performance. The results of the optimal β\beta values under different topologies are presented in Figure 7. By comparing the ψ\psi-β\beta relationships in Figure 7 with the corresponding values in Table V, the experimental results validate our conclusion in Remark 2 of Theorem 1.

V-D Ablation Study

We verify the influence of each component and hyperparameter in OledFL. All the ablation studies are conducted with the “Random” topology, which is consistent with the communication configuration used in Section V-B.

Refer to caption
(a)
Figure 8: Hyperparameter Sensitivity: local iterate KK, Ole parameter β\beta, number of participated clients mm, perturbation radius λ\lambda.

Impact of Ole Parameter β\beta. The convergence curves under different Ole coefficients after 500 communication rounds are displayed in Figure 8 (a) on the CIFAR-10 dataset under the Dirichlet 0.3 distribution. A larger β\beta value implies a greater deviation of each client’s initialization point from the corresponding optimal point. We verify the performance curves of OledFL-SAM under the β\beta parameter set {0.1,0.5,0.7,0.95,0.99}\{0.1,0.5,0.7,0.95,0.99\}. It can be seen that under the random bidirectional topology, the optimal β\beta is 0.99. Furthermore, it is clear that as β\beta increases, the algorithm’s convergence speed and generalization performance both improve. When β>1\beta>1, the algorithm diverges, which is consistent with the upper bound of β\beta obtained in the discussion of its relationship with CA in Section III-C.

Impact of Client Number mm. In Figure 8 (c), we present the performance with different numbers of participants, m={50,100,200}m=\{50,100,200\}. We observe that, with an increase in the number of local data, OledFL achieves the best performance when m=50m=50 or 100, while a decrease in performance is observed when mm is relatively large. This can be attributed to the fact that a smaller number of clients has a larger number of local training samples, leading to improved training performance.

Impact of Perturbation Radius λ\lambda. The perturbation radius λ\lambda is an additional factor influencing the convergence of OledFL. It controls the size of the perturbation radius, where a larger perturbation radius implies a more ambiguous direction of parameter descent, thereby affecting the convergence speed. We evaluate OledFL’s generalization performance using different values of λ\lambda from the set {0,0.01,0.05,0.1,0.2,0.3}\{0,0.01,0.05,0.1,0.2,0.3\}. Figure 8 (d) illustrates the highest accuracy achieved when λ=0.1\lambda=0.1. Additionally, we observe that a larger λ\lambda leads to a slower convergence speed, which is consistent with our previous analysis.

Impact of local epochs KK. KK represents the number of optimization rounds performed by each client. A larger value of KK is more likely to lead to the ”client drift” phenomenon, resulting in increased inconsistency between clients and thereby affecting the algorithm’s performance. We assess the generalization performance of OledFL using different values of KK from the set {1,2,5,10}\{1,2,5,10\}. Figure 8 (b) demonstrates the highest accuracy achieved when K=5K=5. Additionally, we observe that a larger KK leads to faster convergence in Figure 8 (b), consistent with our theoretically proven 𝒪(1KT)\mathcal{O}(\frac{1}{\sqrt{KT}}).

VI Conlusion

In this paper, we propose a plug-in method named OledFL, which integrates existing DFL optimizers to enhance the consistency among clients and significantly improve convergence speed and generalization performance at almost negligible computational cost. Theoretically, we are the first to conduct a joint analysis of algorithm convergence and generalization in the field of DFL, and we demonstrate the effectiveness of OledFL in reducing optimization error and generalization error. Moreover, a comprehensive explanation of the mechanism of action of Ole has been provided, encompassing intuitive, experimental, and theoretical perspectives. Furthermore, certain conclusions derived from theoretical deductions have been validated experimentally, thus offering additional insights into the Ole plugin methodology. Finally, extensive experiments on the CIFAR10&100 datasets under Dirichlet and Pathological distributions demonstrate that OledFL can significantly reduce the performance gap between CFL and DFL, and even surpass CFL optimizers such as FedSAM and SCAFFOLD, which is crucial for further advancement in the field of DFL.

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics.   PMLR, 2017, pp. 1273–1282.
  • [2] B. Gu, A. Xu, Z. Huo, C. Deng, and H. Huang, “Privacy-preserving asynchronous vertical federated learning algorithms for multiparty collaborative learning.” IEEE Transactions on Neural Networks and Learning Systems, p. 6103–6115, Nov 2022. [Online]. Available: http://dx.doi.org/10.1109/tnnls.2021.3072238
  • [3] P. Zhou, K. Wang, L. Guo, S. Gong, and B. Zheng, “A privacy-preserving distributed contextual federated online learning framework with big data support in social recommender systems,” IEEE Transactions on Knowledge and Data Engineering, p. 1–1, Jan 2019. [Online]. Available: http://dx.doi.org/10.1109/tkde.2019.2936565
  • [4] S. Zhou and G. Y. Li, “Federated learning via inexact admm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [5] Y. Sun, L. Shen, T. Huang, L. Ding, and D. Tao, “Fedspeed: Larger local interval, less communication round, and higher generalization accuracy,” arXiv preprint arXiv:2302.10429, 2023.
  • [6] T. Li, A. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” arXiv: Learning,arXiv: Learning, Dec 2018.
  • [7] D. A. E. Acar, Y. Zhao, R. M. Navarro, M. Mattina, P. N. Whatmough, and V. Saligrama, “Federated learning based on dynamic regularization,” arXiv preprint arXiv:2111.04263, 2021.
  • [8] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “Fedpd: A federated learning framework with adaptivity to non-iid data,” IEEE Transactions on Signal Processing, vol. 69, pp. 6055–6070, 2021.
  • [9] R. Dai, X. Yang, Y. Sun, L. Shen, X. Tian, M. Wang, and Y. Zhang, “Fedgamma: Federated learning with global sharpness-aware minimization,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [10] M. Chen, Y. Xu, H. Xu, and L. Huang, “Enhancing decentralized federated learning for non-iid data on heterogeneous devices,” in 2023 IEEE 39th International Conference on Data Engineering (ICDE).   IEEE, 2023, pp. 2289–2302.
  • [11] E. Gabrielli, G. Pica, and G. Tolomei, “A survey on decentralized federated learning,” arXiv preprint arXiv:2308.04604, 2023.
  • [12] E. Cyffers and A. Bellet, “Privacy amplification by decentralization,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 5334–5353.
  • [13] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
  • [14] G. Neglia, G. Calbi, D. Towsley, and G. Vardoyan, “The role of network topology for distributed machine learning,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications.   IEEE, 2019, pp. 2350–2358.
  • [15] T. Sun, D. Li, and B. Wang, “Decentralized federated averaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [16] Y. Shi, L. Shen, K. Wei, Y. Sun, B. Yuan, X. Wang, and D. Tao, “Improving the model consistency of decentralized federated learning,” arXiv preprint arXiv:2302.04083, 2023.
  • [17] Z. Qu, X. Li, R. Duan, Y. Liu, B. Tang, and Z. Lu, “Generalized federated learning via sharpness aware minimization,” in International Conference on Machine Learning.   PMLR, 2022, pp. 18 250–18 280.
  • [18] S. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” International Conference on Machine Learning,International Conference on Machine Learning, Jul 2020.
  • [19] T. Sun, D. Li, and B. Wang, “Stability and generalization of decentralized stochastic gradient descent,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 9756–9764.
  • [20] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, “Federated learning. morgan & claypool publishers,” 2019.
  • [21] A. Lalitha, S. Shekhar, T. Javidi, and F. Koushanfar, “Fully decentralized federated learning,” in Third workshop on Bayesian Deep Learning (NeurIPS), 2018.
  • [22] A. Lalitha, O. C. Kilinc, T. Javidi, and F. Koushanfar, “Peer-to-peer federated learning on graphs,” arXiv preprint arXiv:1901.11173, 2019.
  • [23] E. T. M. Beltrán, M. Q. Pérez, P. M. S. Sánchez, S. L. Bernal, G. Bovet, M. G. Pérez, G. M. Pérez, and A. H. Celdrán, “Decentralized federated learning: Fundamentals, state-of-the-art, frameworks, trends, and challenges,” arXiv preprint arXiv:2211.08413, 2022.
  • [24] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and trends® in machine learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [25] R. Dai, L. Shen, F. He, X. Tian, and D. Tao, “Dispfl: Towards communication-efficient personalized federated learning via decentralized sparse training,” arXiv preprint arXiv:2206.00187, 2022.
  • [26] L. Yuan, L. Sun, P. S. Yu, and Z. Wang, “Decentralized federated learning: A survey and perspective,” arXiv preprint arXiv:2306.01603, 2023.
  • [27] J. Xu, S. Wang, L. Wang, and A. C.-C. Yao, “Fedcm: Federated learning with client-level momentum,” arXiv preprint arXiv:2106.10874, 2021.
  • [28] S. P. Karimireddy, M. Jaggi, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “Mime: Mimicking centralized stochastic algorithms in federated learning,” arXiv preprint arXiv:2008.03606, 2020.
  • [29] P. Zhou, H. Yan, X. Yuan, J. Feng, and S. Yan, “Towards understanding why lookahead generalizes better than sgd and beyond,” Neural Information Processing Systems,Neural Information Processing Systems, Dec 2021.
  • [30] H. Lin, J. Mairal, and Z. Harchaoui, “A universal catalyst for first-order optimization,” Advances in neural information processing systems, vol. 28, 2015.
  • [31] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,” Mathematical Programming, vol. 162, pp. 83–112, 2017.
  • [32] A. Defazio, F. Bach, and S. Lacoste-Julien, “Saga: A fast incremental gradient method with support for non-strongly convex composite objectives,” Advances in neural information processing systems, vol. 27, 2014.
  • [33] L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,” SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057–2075, 2014.
  • [34] E. Trimbach and A. Rogozin, “An acceleration of decentralized sgd under general assumptions with low stochastic noise,” in International Conference on Mathematical Optimization Theory and Operations Research.   Springer, 2021, pp. 117–128.
  • [35] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized sgd with changing topology and local updates,” in International Conference on Machine Learning.   PMLR, 2020, pp. 5381–5393.
  • [36] M. Zhang, J. Lucas, J. Ba, and G. E. Hinton, “Lookahead optimizer: k steps forward, 1 step back,” Advances in neural information processing systems, vol. 32, 2019.
  • [37] K. Scaman, F. Bach, S. Bubeck, Y. Lee, and L. Massoulié, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” Le Centre pour la Communication Scientifique Directe - HAL - Diderot,Le Centre pour la Communication Scientifique Directe - HAL - Diderot, Feb 2017.
  • [38] Y. Lu and C. Sa, “Optimal complexity in decentralized training,” Cornell University - arXiv,Cornell University - arXiv, Jun 2020.
  • [39] K. Yuan, X. Huang, Y. Chen, X. Zhang, Y. Zhang, and P. Pan, “Revisiting optimal convergence rate for smooth and non-convex stochastic decentralized optimization,” Oct 2022.
  • [40] K. Huang, S. Pu, and A. Nedić, “An accelerated distributed stochastic gradient method with momentum,” arXiv preprint arXiv:2402.09714, 2024.
  • [41] Z. Song, L. Shi, S. Pu, and M. Yan, “Optimal gradient tracking for decentralized optimization,” Cornell University - arXiv,Cornell University - arXiv, Oct 2021.
  • [42] Q. Li, L. Shen, G. Li, Q. Yin, and D. Tao, “Dfedadmm: Dual constraints controlled model inconsistency for decentralized federated learning,” arXiv preprint arXiv:2308.08290, 2023.
  • [43] T. Sun, D. Li, and B. Wang, “Stability and generalization of the decentralized stochastic gradient descent,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, p. 9756–9764, Sep 2022. [Online]. Available: http://dx.doi.org/10.1609/aaai.v35i11.17173
  • [44] M. Zhu, L. Shen, B. Du, and D. Tao, “Stability and generalization of the decentralized stochastic gradient descent ascent algorithm,” Oct 2023.
  • [45] T. Zhu, F. He, L. Zhang, Z. Niu, M. Song, and D. Tao, “Topology-aware generalization of decentralized sgd,” in International Conference on Machine Learning.   PMLR, 2022, pp. 27 479–27 503.
  • [46] Y. Sun, L. Shen, and D. Tao, “Understanding how consistency works in federated learning via stage-wise relaxed initialization,” arXiv preprint arXiv:2306.05706, 2023.
  • [47] M. Hardt, B. Recht, and Y. Singer, “Train faster, generalize better: stability of stochastic gradient descent,” International Conference on Machine Learning,International Conference on Machine Learning, Jun 2016.
  • [48] Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami, “Stability of sgd: Tightness analysis and improved bounds,” Cornell University - arXiv,Cornell University - arXiv, Feb 2021.
  • [49] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [50] H.-H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” arXiv: Learning,arXiv: Learning, Sep 2019.
  • [51] M. Zhang, K. Sapra, S. Fidler, S. Yeung, and J. Alvarez, “Personalized federated learning with first order model optimization,” Learning,Learning, Dec 2020.
  • [52] D. Caldarola, B. Caputo, and M. Ciccone, “Improving generalization in federated learning by seeking flat minima,” in European Conference on Computer Vision.   Springer, 2022, pp. 654–672.
  • [53] N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” International Conference on Learning Representations,International Conference on Learning Representations, Sep 2016.
  • [54] G. Dziugaite and D. Roy, “Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data,” Uncertainty in Artificial Intelligence,Uncertainty in Artificial Intelligence, Mar 2017.
  • [55] B. Ying, K. Yuan, Y. Chen, H. Hu, P. Pan, and W. Yin, “Exponential graph is provably efficient for decentralized deep training,” Oct 2021.
  • [56] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [57] H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-iid federated learning,” arXiv: Learning,arXiv: Learning, Jan 2021.

In this part, we provide supplementary materials including the proof of the optimization and generalization error. Here’s the table of contents for the Appendix.

  • Appendix B: Proof of the theoretical analysis.

    • Appendix B-A: Proof for optimization error.

    • Appendix B-B: Proof for generalization error.

Appendix A Supplementary Material for Experiment

Appendix B Proof of the theoretical analysis.

B-A Proofs for the Optimization Error

In this part, we prove the training error for our proposed method. We assume the objective f(𝐱):=1mi=1mfi(𝐱)f({\bf x}):=\frac{1}{m}\sum_{i=1}^{m}f_{i}({\bf x}) is LL-smooth w.r.t 𝐱\mathbf{x}. Then we could upper bound the training error in the FL. Some useful notations in the proof are introduced in the Table VI. Then we introduce some important lemmas used in the proof.

TABLE VI: Some abbreviations of the used terms in the proof of bounded training error.
Notation Formulation Description
𝐱i,kt\mathbf{x}_{i,k}^{t} - parameters at kk-th iteration in round tt on client ii
𝐱it\mathbf{x}_{i}^{t} - global parameters in round tt on client ii
V1tV_{1}^{t} 1mi=1mk=0K1𝔼𝐱i,kt𝐱it2\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\mathbf{x}_{i}^{t}\|^{2} averaged norm of the local updates in round tt
V2tV_{2}^{t} 𝔼𝐱¯t+1𝐱¯t2\mathbb{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|^{2} norm of the average global updates in round tt
Δt\Delta^{t} 1mi=1m𝔼𝐱i,Kt1𝐱it2\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i,K}^{t-1}-\mathbf{x}_{i}^{t}\|^{2} inconsistency / divergence term in round tt
DD f(𝐱0)f(𝐱)f(\mathbf{x}^{0})-f(\mathbf{x}^{\star}) bias between the initialization state and optimal

B-A1 Important Lemmas

Lemma 1

(Bounded local updates) We first bound the local training updates in the local training. Under the Assumptions stated, the averaged norm of the local updates of total mm clients could be bounded as:

V1t6Kβ2Δt+3K2η2(σl2+6λ2+6KG2)+18K3η2B21mi=1m𝔼f(𝐱it)2.\displaystyle V_{1}^{t}\leq 6K\beta^{2}\Delta^{t}+3K^{2}\eta^{2}\left(\sigma_{l}^{2}+6\lambda^{2}+6KG^{2}\right)+18K^{3}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}. (8)
Proof 1

V1tV_{1}^{t} measures the norm of the local offset during the local training stage. It could be bounded by two major steps. Firstly, we bound the separated term on the single client ii at iteration kk as:

𝔼t𝐱it𝐱i,kt2\displaystyle\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k}^{t}\|^{2} (9)
=𝔼t𝐱it𝐱i,k1t+η(gi,k1tfi(𝐱˘i,k1t)+fi(𝐱˘i,k1t)fi(𝐱i,k1t)+fi(𝐱i,k1t)fi(𝐱it)+fi(𝐱it))2\displaystyle=\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k-1}^{t}+\eta\left(g_{i,k-1}^{t}-\nabla f_{i}(\breve{\mathbf{x}}_{i,k-1}^{t})+\nabla f_{i}(\breve{\mathbf{x}}_{i,k-1}^{t})-\nabla f_{i}(\mathbf{x}_{i,k-1}^{t})+\nabla f_{i}(\mathbf{x}_{i,k-1}^{t})-\nabla f_{i}(\mathbf{x}_{i}^{t})+\nabla f_{i}(\mathbf{x}_{i}^{t})\right)\|^{2}
(1+12K1)𝔼t𝐱it𝐱i,k1t2+6η2K𝔼tfi(𝐱i,k1t)fi(𝐱it)2+6λ2η2\displaystyle\leq\left(1+\frac{1}{2K-1}\right)\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k-1}^{t}\|^{2}+6\eta^{2}K\mathbb{E}_{t}\|\nabla f_{i}(\mathbf{x}_{i,k-1}^{t})-\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+6\lambda^{2}\eta^{2}
+6η2K𝔼tfi(𝐱it)2+η2𝔼tgi,k1tfi(𝐱i,k1t)2\displaystyle\quad+6\eta^{2}K\mathbb{E}_{t}\|\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+\eta^{2}\mathbb{E}_{t}\|g_{i,k-1}^{t}-\nabla f_{i}(\mathbf{x}_{i,k-1}^{t})\|^{2}
(1+12K1+6η2KL2)𝔼t𝐱it𝐱i,k1t2+6η2K𝔼tfi(𝐱it)2+η2σl2+6λ2η2\displaystyle\leq\left(1+\frac{1}{2K-1}+6\eta^{2}KL^{2}\right)\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k-1}^{t}\|^{2}+6\eta^{2}K\mathbb{E}_{t}\|\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+\eta^{2}\sigma_{l}^{2}+6\lambda^{2}\eta^{2}
(1+1K1)𝔼t𝐱it𝐱i,k1t2+6η2K𝔼tfi(𝐱it)2+η2σl2+6λ2η2\displaystyle\leq\left(1+\frac{1}{K-1}\right)\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k-1}^{t}\|^{2}+6\eta^{2}K\mathbb{E}_{t}\|\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+\eta^{2}\sigma_{l}^{2}+6\lambda^{2}\eta^{2}

where the learning rate is required η36KL\eta\leq\frac{\sqrt{3}}{6KL}.

Computing the average of the separated term on client ii, we have:

1mi=1m𝔼t𝐱it𝐱i,kt2\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k}^{t}\|^{2} (10)
(1+1K1)1mi=1m𝔼t𝐱it𝐱i,k1t2+6η2K1mi=1m𝔼tfi(𝐱it)2+η2σl2+6λ2η2\displaystyle\leq\left(1+\frac{1}{K-1}\right)\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k-1}^{t}\|^{2}+6\eta^{2}K\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+\eta^{2}\sigma_{l}^{2}+6\lambda^{2}\eta^{2}
(1+1K1)1mi=1m𝔼t𝐱it𝐱i,k1t2+η2σl2+6Kη2G2+6Kη2B21mi=1m𝔼tf(𝐱it)2+6λ2η2\displaystyle\leq\left(1+\frac{1}{K-1}\right)\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k-1}^{t}\|^{2}+\eta^{2}\sigma_{l}^{2}+6K\eta^{2}G^{2}+6K\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}+6\lambda^{2}\eta^{2}

Unrolling the aggregated term on iteration k<Kk<K. When local interval K2,(1+1k1)k(1+1K1)K4K\geq 2,\left(1+\frac{1}{k-1}\right)^{k}\leq\left(1+\frac{1}{K-1}\right)^{K}\leq 4. Then we have:

1mi=1m𝔼t𝐱it𝐱i,kt2\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k}^{t}\|^{2} (11)
τ=0k1(1+1K1)τ(η2σl2+6λ2η2+6Kη2G2+6Kη2B21mi=1m𝔼tf(𝐱it)2)\displaystyle\leq\sum_{\tau=0}^{k-1}\left(1+\frac{1}{K-1}\right)^{\tau}(\eta^{2}\sigma_{l}^{2}+6\lambda^{2}\eta^{2}+6K\eta^{2}G^{2}+6K\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2})
+(1+1K1)k1mi=1m𝔼t𝐱it𝐱i,0t2\displaystyle\quad+\left(1+\frac{1}{K-1}\right)^{k}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,0}^{t}\|^{2}
3(K1)(η2σl2+6λ2η2+6Kη2G2+6Kη2B21mi=1m𝔼tf(𝐱it)2)+4β21mi=1m𝔼t𝐱it𝐱i,Kt12\displaystyle\leq 3(K-1)(\eta^{2}\sigma_{l}^{2}+6\lambda^{2}\eta^{2}+6K\eta^{2}G^{2}+6K\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2})+4\beta^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1}\|^{2}
3η2K(σl2+6λ2+6KG2)+18K2η2B21mi=1m𝔼tf(𝐱it)2+4β2Δt\displaystyle\leq 3\eta^{2}K(\sigma_{l}^{2}+6\lambda^{2}+6KG^{2})+18K^{2}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}+4\beta^{2}\Delta^{t}

Summing the iteration on K=0,1,2,,K1K=0,1,2,\cdots,K-1.

1mi=1mk=0K1𝔼t𝐱it𝐱i,kt23η2K2(σl2+6λ2+6KG2)+18K3η2B21mi=1m𝔼tf(𝐱it)2+4Kβ2Δt\displaystyle\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}_{t}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,k}^{t}\|^{2}\leq 3\eta^{2}K^{2}(\sigma_{l}^{2}+6\lambda^{2}+6KG^{2})+18K^{3}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{t}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}+4K\beta^{2}\Delta^{t} (12)

This completes the proof.

Lemma 2

(Bounded global updates) Under assumptions stated above, the norm of the global update of clients could be bounded as:

V2tKη2σl2+2η2m2𝔼i=1mk=0K1fi(𝐱i,kt)2+2η2λ2\displaystyle V_{2}^{t}\leq K\eta^{2}\sigma_{l}^{2}+2\frac{\eta^{2}}{m^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+2\eta^{2}\lambda^{2} (13)
Proof 2
𝔼𝐱¯t+1𝐱¯t2\displaystyle\mathbb{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|^{2} =𝔼1mi=1m(𝐱it+1𝐱it)2\displaystyle=\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}(\mathbf{x}_{i}^{t+1}-\mathbf{x}_{i}^{t})\|^{2} (14)
=𝔼1mi=1m(j=1mwi,j𝐱j,Kt𝐱it)2\displaystyle=\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}(\sum_{j=1}^{m}w_{i,j}\mathbf{x}_{j,K}^{t}-\mathbf{x}_{i}^{t})\|^{2}

As 𝐖\mathbf{W} is a doubly stochastic matrix, we have i=1mj=1mwi,jt𝐱j,Kt=j=1mi=1mwi,jt𝐱jt+12=j=1m𝐱j,Kt=i=1m𝐱i,Kt\sum_{i=1}^{m}\sum_{j=1}^{m}w_{i,j}^{t}\mathbf{x}_{j,K}^{t}=\sum_{j=1}^{m}\sum_{i=1}^{m}w_{i,j}^{t}\mathbf{x}_{j}^{t+\frac{1}{2}}=\sum_{j=1}^{m}\mathbf{x}_{j,K}^{t}=\sum_{i=1}^{m}\mathbf{x}_{i,K}^{t}. Then we have

𝔼𝐱¯t+1𝐱¯t2\displaystyle\mathbb{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|^{2} (15)
=𝔼1mi=1m(𝐱i,Kt𝐱it)2\displaystyle=\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}(\mathbf{x}_{i,K}^{t}-\mathbf{x}_{i}^{t})\|^{2}
=𝔼1mi=1m(k=0K1η𝐠i,kt+β(𝐱it𝐱i,Kt1))2\displaystyle=\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}\left(\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})\right)\|^{2}
=𝔼1mi=1m(k=0K1η(𝐠i,kt±fi(𝐱˘i,kt)±fi(𝐱i,kt))+β(𝐱it𝐱i,Kt1))2\displaystyle=\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}\left(\sum_{k=0}^{K-1}\eta(\mathbf{g}_{i,k}^{t}\pm\nabla f_{i}(\breve{\mathbf{x}}_{i,k}^{t})\pm\nabla f_{i}(\mathbf{x}_{i,k}^{t}))+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})\right)\|^{2}
=η21mi=1mk=0K1𝔼𝐠i,ktfi(𝐱i,kt)2+2𝔼1mi=1m(ηk=0K1fi(𝐱i,kt)+β(𝐱it𝐱i,Kt1))2+2η2λ2\displaystyle=\eta^{2}\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|\mathbf{g}_{i,k}^{t}-\nabla f_{i}({\mathbf{x}}_{i,k}^{t})\|^{2}+2\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}\left(\eta\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})\right)\|^{2}+2\eta^{2}\lambda^{2}
Kη2σl2+2𝔼1mi=1m(ηk=0K1fi(𝐱i,kt)+β(𝐱it𝐱i,Kt1))2+2η2λ2\displaystyle\leq K\eta^{2}\sigma_{l}^{2}+2\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}\left(\eta\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})\right)\|^{2}+2\eta^{2}\lambda^{2}

Because of i=1m𝐱i,Kt1=i=1mj=1mwi,j𝐱j,Kt1=i=1m𝐱it\sum_{i=1}^{m}\mathbf{x}_{i,K}^{t-1}=\sum_{i=1}^{m}\sum_{j=1}^{m}w_{i,j}\mathbf{x}_{j,K}^{t-1}=\sum_{i=1}^{m}\mathbf{x}_{i}^{t}, we have i=1m(𝐱it𝐱i,Kt1)=0\sum_{i=1}^{m}(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})=0. Then, we get

𝔼𝐱¯t+1𝐱¯t2\displaystyle\mathbb{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|^{2} Kη2σl2+2𝔼1mi=1mηk=0K1fi(𝐱i,kt)2+2η2λ2\displaystyle\leq K\eta^{2}\sigma_{l}^{2}+2\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}\eta\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+2\eta^{2}\lambda^{2} (16)
=Kη2σl2+2η2m2𝔼i=1mk=0K1fi(𝐱i,kt)2+2η2λ2\displaystyle=K\eta^{2}\sigma_{l}^{2}+2\frac{\eta^{2}}{m^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+2\eta^{2}\lambda^{2}

Since V2t=𝔼𝐱¯t+1𝐱¯t2V_{2}^{t}=\mathbb{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|^{2}, we have completed the proof.

Lemma 3

[Lemma 4, [13]] For any t+t\in\mathbb{Z}^{+}, the mixing matrix 𝐖m{\bf W}\in\ \mathbb{R}^{m} satisfies 𝐖t𝐏opψt,\|{\bf W}^{t}-{\bf P}\|_{\emph{op}}\leq\psi^{t}, where ψ:=max{|ψ2(𝐖)|,|ψ𝐦(𝐖)|}\psi:=\max\{|\psi_{2}(\bf W)|,|\psi_{m}(\bf W)|\} and for a matrix 𝐀{\bf A}, we denote its spectral norm as 𝐀op\|{\bf A}\|_{\emph{op}}. Furthermore, 𝟏:=[1,1,,1]m{\bf 1}:=[1,1,\ldots,1]^{\top}\in\mathbb{R}^{m} and

𝐏:=𝟏𝟏mm×m.{\bf P}:=\frac{\mathbf{1}\mathbf{1}^{\top}}{m}\in\mathbb{R}^{m\times m}.

In [Proposition 1, [56]], the author also proved that 𝐖t𝐏opCψt\|{\bf W}^{t}-{\bf P}\|_{\textrm{op}}\leq C\psi^{t} for some C>0C>0 that depends on the matrix.

Lemma 4

Let {𝐱it}t0\{\mathbf{x}_{i}^{t}\}_{t\geq 0} be generated by our proposed Algorithm for all i{1,2,,m}i\in\{1,2,...,m\} and any learning rate 16KL>ηl>0\frac{1}{6KL}>\eta_{l}>0, we have following bound:

1mi=1m𝔼[𝐱it𝐱¯t2]C1t(1ψ)2.\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[\|\mathbf{x}_{i}^{t}-\overline{\mathbf{x}}^{t}\|^{2}]\leq\frac{C_{1}^{t}}{(1-\psi)^{2}}. (17)

Where C1t=6Kβ2Δt+3K2η2(σl2+6λ2+6KG2)+18K3η2B21mi=1m𝔼f(𝐱it)2C_{1}^{t}=6K\beta^{2}\Delta^{t}+3K^{2}\eta^{2}\left(\sigma_{l}^{2}+6\lambda^{2}+6KG^{2}\right)+18K^{3}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}.

Proof 3

Following [Lemma D.5, [16]], we denote 𝐙t:=[𝐳1t,𝐳2t,,𝐳mt]m×d{\bf Z}^{t}:=\begin{bmatrix}{\bf z}_{1}^{t},{\mathbf{z}}_{2}^{t},\ldots,{\bf z}_{m}^{t}\end{bmatrix}^{\top}\in\mathbb{R}^{m\times d}. With these notation, we have

𝐗t+1=𝐖𝐙t=𝐖𝐗tζt,\displaystyle{\bf X}^{t+1}={\bf W}{\bf Z}^{t}={\bf W}{\bf X}^{t}-{\bf\zeta}^{t}, (18)

where ζt:=𝐖𝐗t𝐖𝐙t{\bf\zeta}^{t}:={\bf W}{\bf X}^{t}-{\bf W}{\bf Z}^{t}. The iteration equation (29) can be rewritten as the following expression

𝐗t=𝐖t𝐗0j=0t1𝐖t1jζj.\displaystyle{\bf X}^{t}={\bf W}^{t}{\bf X}^{0}-\sum_{j=0}^{t-1}{\bf W}^{t-1-j}{\bf\zeta}^{j}. (19)

Obviously, it follows

𝐖𝐏=𝐏𝐖=𝐏.{\bf W}{\bf P}={\bf P}{\bf W}={\bf P}. (20)

According to Lemma 3, it holds

𝐖t𝐏ψt.\|{\bf W}^{t}-{\bf P}\|\leq\psi^{t}.

Multiplying both sides of equation (19) with 𝐏{\bf P} and using equation (20), we then get

𝐏𝐗t=𝐏𝐗0j=0t1𝐏ζj=j=0t1𝐏ζj,\displaystyle{\bf P}{\bf X}^{t}={\bf P}{\bf X}^{0}-\sum_{j=0}^{t-1}{\bf P}{\bf\zeta}^{j}=-\sum_{j=0}^{t-1}{\bf P}{\bf\zeta}^{j}, (21)

where we used initialization 𝐗0=0{\bf X}^{0}=\textbf{0}. Then, we are led to

𝐗t𝐏𝐗t=j=0t1(𝐏𝐖t1j)ζj\displaystyle\|{\bf X}^{t}-{\bf P}{\bf X}^{t}\|=\|\sum_{j=0}^{t-1}({\bf P}-{\bf W}^{t-1-j}){\bf\zeta}^{j}\| (22)
j=0t1𝐏𝐖t1jopζjj=0t1ψt1jζj.\displaystyle\leq\sum_{j=0}^{t-1}\|{\bf P}-{\bf W}^{t-1-j}\|_{\textrm{op}}\|{\bf\zeta}^{j}\|\leq\sum_{j=0}^{t-1}\psi^{t-1-j}\|{\bf\zeta}^{j}\|.

With Cauchy inequality,

𝔼𝐗t𝐏𝐗t2𝔼(j=0t1ψt1j2ψt1j2ζj)2\displaystyle\mathbb{E}\|{\bf X}^{t}-{\bf P}{\bf X}^{t}\|^{2}\leq\mathbb{E}(\sum_{j=0}^{t-1}\psi^{\frac{t-1-j}{2}}\cdot\psi^{\frac{t-1-j}{2}}\|{\bf\zeta}^{j}\|)^{2}
(j=0t1ψt1j)(j=0t1ψt1j𝔼ζj2)\displaystyle\leq(\sum_{j=0}^{t-1}\psi^{t-1-j})(\sum_{j=0}^{t-1}\psi^{t-1-j}\mathbb{E}\|{\bf\zeta}^{j}\|^{2})

Direct calculation gives us

𝔼ζj2𝐖2𝔼𝐗j𝐙j2𝔼𝐗j𝐙j2.\mathbb{E}\|{\bf\zeta}^{j}\|^{2}\leq\|{\bf W}\|^{2}\cdot\mathbb{E}\|{\bf X}^{j}-{\bf Z}^{j}\|^{2}\leq\mathbb{E}\|{\bf X}^{j}-{\bf Z}^{j}\|^{2}.

With Lemma 1, for any jj:

𝔼𝐗j𝐙j2m(6Kβ2Δt+3K2η2(σl2+6λ2+6KG2)+18K3η2B21mi=1m𝔼f(𝐱it)2)\displaystyle\mathbb{E}\|{\bf X}^{j}-{\bf Z}^{j}\|^{2}\leq m(6K\beta^{2}\Delta^{t}+3K^{2}\eta^{2}\left(\sigma_{l}^{2}+6\lambda^{2}+6KG^{2}\right)+18K^{3}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}) (23)

Thus, we get:

𝔼𝐗t𝐏𝐗t2mC1t(1ψ)2.\displaystyle\mathbb{E}\|{\bf X}^{t}-{\bf P}{\bf X}^{t}\|^{2}\leq\frac{mC_{1}^{t}}{(1-\psi)^{2}}.

where C1t=6Kβ2Δt+3K2η2(σl2+6λ2+6KG2)+18K3η2B21mi=1m𝔼f(𝐱it)2C_{1}^{t}=6K\beta^{2}\Delta^{t}+3K^{2}\eta^{2}\left(\sigma_{l}^{2}+6\lambda^{2}+6KG^{2}\right)+18K^{3}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}.

The fact that 𝐗t𝐏𝐗t=(𝐱1t𝐱t¯𝐱1t𝐱t¯𝐱mt𝐱t¯){\bf X}^{t}-{\bf P}{\bf X}^{t}=\left(\begin{array}[]{c}{\bf x}_{1}^{t}-\overline{{\bf x}^{t}}\\ {\bf x}_{1}^{t}-\overline{{\bf x}^{t}}\\ \vdots\\ {\bf x}_{m}^{t}-\overline{{\bf x}^{t}}\\ \end{array}\right) then proves the result.

Lemma 5

(Bounded divergence term) Under assumptions stated above, β2min{(1ψ)230K,160(2+K(1ψ)2)}\beta^{2}\leq min\{\frac{(1-\psi)^{2}}{30K},\frac{1}{60(2+\frac{K}{(1-\psi)^{2}})}\} and η1KL\eta\leq\frac{1}{KL}, The divergence term Δt\Delta^{t} could be bounded as the recursion of:

ΔtΔtΔt+11γ+5μ(1γ)C2t\Delta^{t}\leq\frac{\Delta^{t}-\Delta^{t+1}}{1-\gamma}+\frac{5}{\mu(1-\gamma)}C_{2}^{t}

Where γμ=5β2(25+6K(1ψ)2),μ=130K(1ψ)2β2,C2t=3(2K(1ψ)2+1)η2Kσl2+4(9K(1ψ)2+1)η2K2G2+2η2m2𝔼i=1mk=0K1fi(𝐱i,kt)2+4(9K(1ψ)2+1)η2K2B21mi=1m𝔼f(𝐱it)2+4(2+9K2(1ψ)2)λ2η2.\gamma\mu=5\beta^{2}\left(25+\frac{6K}{(1-\psi)^{2}}\right),\mu=1-\frac{30K}{(1-\psi)^{2}}\beta^{2},C_{2}^{t}=3\left(\frac{2K}{(1-\psi)^{2}}+1\right)\eta^{2}K\sigma_{l}^{2}+4\left(\frac{9K}{(1-\psi)^{2}}+1\right)\eta^{2}K^{2}G^{2}+2\frac{\eta^{2}}{m^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+4\left(\frac{9K}{(1-\psi)^{2}}+1\right)\eta^{2}K^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}+4\left(2+\frac{9K^{2}}{(1-\psi)^{2}}\right)\lambda^{2}\eta^{2}.

Proof 4

The divergence term measures the inconsistency level in the FL framework. According to the local updates, we have the following recursive formula:

𝐱it+1𝐱i,Ktlocal bias in round t+1=β(𝐱i,Kt1𝐱it)local bias in nound t+(𝐱it+1𝐱it)+k=0K1η𝐠i,kt.\displaystyle\underbrace{\mathbf{x}_{i}^{t+1}-\mathbf{x}_{i,K}^{t}}_{\textit{local bias in round t}+1}=\beta\underbrace{(\mathbf{x}_{i,K}^{t-1}-\mathbf{x}_{i}^{t})}_{\textit{local bias in nound t}}+(\mathbf{x}_{i}^{t+1}-\mathbf{x}_{i}^{t})+\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}. (24)

By taking the squared norm and expectation on both sides, we have:

𝔼𝐱it+1𝐱i,Kt2\displaystyle\mathbb{E}\|\mathbf{x}_{i}^{t+1}-\mathbf{x}_{i,K}^{t}\|^{2}
=𝔼β(𝐱i,Kt1𝐱it)+𝐱it+1𝐱¯t+1+𝐱¯t𝐱it+𝐱¯t+1𝐱¯t+k=0K1η𝐠i,kt2\displaystyle=\mathbb{E}\|\beta(\mathbf{x}_{i,K}^{t-1}-\mathbf{x}_{i}^{t})+\mathbf{x}_{i}^{t+1}-\bar{\mathbf{x}}^{t+1}+\bar{\mathbf{x}}^{t}-\mathbf{x}_{i}^{t}+\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}+\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}\|^{2}
5β2𝔼𝐱i,Kt1𝐱it2+5𝔼𝐱¯t+1𝐱¯t2V2t+5𝔼𝐱it+1𝐱¯t+12+5𝔼𝐱¯t𝐱it2+5𝔼k=0K1η𝐠i,kt2\displaystyle\leq 5\beta^{2}\mathbb{E}\|\mathbf{x}_{i,K}^{t-1}-\mathbf{x}_{i}^{t}\|^{2}+5\underbrace{\mathbb{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|^{2}}_{V_{2}^{t}}+5\mathbb{E}\|\mathbf{x}_{i}^{t+1}-\bar{\mathbf{x}}^{t+1}\|^{2}+5\mathbb{E}\|\bar{\mathbf{x}}^{t}-\mathbf{x}_{i}^{t}\|^{2}+5\mathbb{E}\|\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}\|^{2}

The second term in the above inequality is V2tV_{2}^{t} we have bounded in lemma 2. The third and fourth terms have been bounded in Lemma 4. Then we bound the stochastic gradients term. We have:

𝔼k=0K1η𝐠i,kt2\displaystyle\mathbb{E}\|\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}\|^{2} =η2𝔼k=0K1𝐠i,kt2\displaystyle=\eta^{2}\mathbb{E}\|\sum_{k=0}^{K-1}\mathbf{g}_{i,k}^{t}\|^{2}
η2𝔼k=0K1(𝐠i,ktfi(𝐱˘i,kt))2+2η2𝔼k=0K1fi(𝐱i,kt)2+2η2λ2\displaystyle\leq\eta^{2}\mathbb{E}\|\sum_{k=0}^{K-1}\left(\mathbf{g}_{i,k}^{t}-\nabla f_{i}(\breve{\mathbf{x}}_{i,k}^{t})\right)\|^{2}+2\eta^{2}\mathbb{E}\|\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+2\eta^{2}\lambda^{2}
η2Kσl2+2η2Kk=0K1𝔼fi(𝐱i,kt)fi(𝐱it)+fi(𝐱it)2+2η2λ2\displaystyle\leq\eta^{2}K\sigma_{l}^{2}+2\eta^{2}K\sum_{k=0}^{K-1}\mathbb{E}\|\nabla f_{i}(\mathbf{x}_{i,k}^{t})-\nabla f_{i}(\mathbf{x}_{i}^{t})+\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+2\eta^{2}\lambda^{2}
η2Kσl2+4η2KL2k=0K1𝔼𝐱i,kt𝐱it2+4η2Kk=0K1𝔼fi(𝐱it)2+2η2λ2\displaystyle\leq\eta^{2}K\sigma_{l}^{2}+4\eta^{2}KL^{2}\sum_{k=0}^{K-1}\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\mathbf{x}_{i}^{t}\|^{2}+4\eta^{2}K\sum_{k=0}^{K-1}\mathbb{E}\|\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+2\eta^{2}\lambda^{2}

Taking the average on client i, we have:

1mi=1m𝔼k=0K1η𝐠i,kt2\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}\|^{2} η2Kσl2+4η2KL21mi=1mk=0K1𝔼𝐱i,kt𝐱it2V1t+4η2Kmi=1mk=0K1𝔼fi(𝐱it)2+2η2λ2\displaystyle\leq\eta^{2}K\sigma_{l}^{2}+4\eta^{2}KL^{2}\underbrace{\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\mathbf{x}_{i}^{t}\|^{2}}_{V_{1}^{t}}+\frac{4\eta^{2}K}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla f_{i}(\mathbf{x}_{i}^{t})\|^{2}+2\eta^{2}\lambda^{2}
η2Kσl2+4η2KL2V1t+4η2K2G2+4η2K2B21mi=1m𝔼f(𝐱it)2+2η2λ2\displaystyle\leq\eta^{2}K\sigma_{l}^{2}+4\eta^{2}KL^{2}{V_{1}^{t}}+4\eta^{2}K^{2}G^{2}+4\eta^{2}K^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}+2\eta^{2}\lambda^{2}

Combining this and the squared norm inequality, we have:

Δt+1=1mi=1m𝔼𝐱it+1𝐱i,Kt2\displaystyle\Delta^{t+1}=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t+1}-\mathbf{x}_{i,K}^{t}\|^{2}
5β2Δt+5V2t+5C1t(1ψ)2+5C1t+1(1ψ)2+51mi=1m𝔼k=0K1η𝐠i,kt2\displaystyle\leq 5\beta^{2}\Delta^{t}+5{V_{2}^{t}}+\frac{5C_{1}^{t}}{(1-\psi)^{2}}+\frac{5C_{1}^{t+1}}{(1-\psi)^{2}}+5\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}\|^{2}
5β2Δt+5(1(1ψ)2+4η2KL2)V1t+51(1ψ)2V1t+1+20η2λ2\displaystyle\leq 5\beta^{2}\Delta^{t}+5\left(\frac{1}{(1-\psi)^{2}}+4\eta^{2}KL^{2}\right)V_{1}^{t}+5\frac{1}{(1-\psi)^{2}}V_{1}^{t+1}+20\eta^{2}\lambda^{2}
+10η2Kσl2+10η2m2𝔼i=1mk=0K1fi(𝐱i,kt)2+20η2K2G2+20η2K2B21mi=1m𝔼f(𝐱it)2\displaystyle\quad+10\eta^{2}K\sigma_{l}^{2}+10\frac{\eta^{2}}{m^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+20\eta^{2}K^{2}G^{2}+20\eta^{2}K^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}
(a)5(1+24η2K2L2+6K(1ψ)2)β2Δt+30K(1ψ)2β2Δt+1+15(2K(1ψ)2+1)η2Kσl2+20(9K2(1ψ)2+2)λ2η2\displaystyle\overset{(a)}{\leq}5\left(1+24\eta^{2}K^{2}L^{2}+\frac{6K}{(1-\psi)^{2}}\right)\beta^{2}\Delta^{t}+\frac{30K}{(1-\psi)^{2}}\beta^{2}\Delta^{t+1}+15\left(\frac{2K}{(1-\psi)^{2}}+1\right)\eta^{2}K\sigma_{l}^{2}+20\left(\frac{9K^{2}}{(1-\psi)^{2}}+2\right)\lambda^{2}\eta^{2}
+20(9K(1ψ)2+1)η2K2G2+10η2m2𝔼i=1mk=0K1fi(𝐱i,kt)2+20(9K(1ψ)2+1)η2K2B21mi=1m𝔼f(𝐱it)2\displaystyle\quad+20\left(\frac{9K}{(1-\psi)^{2}}+1\right)\eta^{2}K^{2}G^{2}+10\frac{\eta^{2}}{m^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+20\left(\frac{9K}{(1-\psi)^{2}}+1\right)\eta^{2}K^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}

Where (a) uses Lemma 1 and η\eta is a very small value, i.e., we have omitted terms of 𝒪(η4)\mathcal{O}(\eta^{4}). Let C2tC_{2}^{t} denote the term that is independent of Δt\Delta^{t},i.e., C2t=3(2K(1ψ)2+1)η2Kσl2+4(9K(1ψ)2+1)η2K2G2+2η2m2𝔼i=1mk=0K1fi(𝐱i,kt)2+4(9K(1ψ)2+1)η2K2B21mi=1m𝔼f(𝐱it)2+4(2+9K2(1ψ)2)λ2η2C_{2}^{t}=3\left(\frac{2K}{(1-\psi)^{2}}+1\right)\eta^{2}K\sigma_{l}^{2}+4\left(\frac{9K}{(1-\psi)^{2}}+1\right)\eta^{2}K^{2}G^{2}+2\frac{\eta^{2}}{m^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+4\left(\frac{9K}{(1-\psi)^{2}}+1\right)\eta^{2}K^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}+4\left(2+\frac{9K^{2}}{(1-\psi)^{2}}\right)\lambda^{2}\eta^{2}. Then we have:

(130K(1ψ)2β2)Δt+15(1+24η2K2L2+6K(1ψ)2)β2Δt+5C2t\displaystyle\left(1-\frac{30K}{(1-\psi)^{2}}\beta^{2}\right)\Delta^{t+1}\leq 5\left(1+24\eta^{2}K^{2}L^{2}+\frac{6K}{(1-\psi)^{2}}\right)\beta^{2}\Delta^{t}+5C_{2}^{t}

Let 30K(1ψ)2β2<1\frac{30K}{(1-\psi)^{2}}\beta^{2}<1 where β<1ψ30K\beta<\frac{1-\psi}{\sqrt{30K}} . Let μ=(130K(1ψ)2β2)\mu=\left(1-\frac{30K}{(1-\psi)^{2}}\beta^{2}\right) and dividing both sides by μ\mu, we obtain:

Δt+15(1+24η2K2L2+6K(1ψ)2)β2(130K(1ψ)2β2)Δt+5(130K(1ψ)2β2)C2t\Delta^{t+1}\leq\frac{5\left(1+24\eta^{2}K^{2}L^{2}+\frac{6K}{(1-\psi)^{2}}\right)\beta^{2}}{\left(1-\frac{30K}{(1-\psi)^{2}}\beta^{2}\right)}\Delta^{t}+\frac{5}{\left(1-\frac{30K}{(1-\psi)^{2}}\beta^{2}\right)}C_{2}^{t}

Utilizing the learning rate condition η1KL\eta\leq\frac{1}{KL}, we obtain η2K2L21\eta^{2}K^{2}L^{2}\leq 1, then let γ=5β2(25+6K(1ψ)2)130K(1ψ)2β2<1\gamma=\frac{5\beta^{2}\left(25+\frac{6K}{(1-\psi)^{2}}\right)}{1-\frac{30K}{(1-\psi)^{2}}\beta^{2}}<1 where β2160(2+K(1ψ)2)\beta^{2}\leq\frac{1}{60(2+\frac{K}{(1-\psi)^{2}})}, thus we add (1γ)Δt(1-\gamma)\Delta^{t} on both sides and get the recursive formulation:

ΔtΔtΔt+11γ+5μ(1γ)C2t\Delta^{t}\leq\frac{\Delta^{t}-\Delta^{t+1}}{1-\gamma}+\frac{5}{\mu(1-\gamma)}C_{2}^{t}

B-A2 Expanding the Smoothness Inequality for the Non-convex Objective

For the non-convex and LL-smooth function ff , we firstly expand the smoothness inequality at round tt as:

𝔼[f(𝐱¯t+1)f(𝐱¯t)]\displaystyle\mathbb{E}[f(\bar{\mathbf{x}}^{t+1})-f(\bar{\mathbf{x}}^{t})] (25)
𝔼f(𝐱¯t),𝐱¯t+1𝐱¯t+L2𝔼𝐱¯t+1𝐱¯t2V2t\displaystyle\leq\mathbb{E}\langle\nabla f(\bar{\mathbf{x}}^{t}),\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\rangle+\frac{L}{2}\underbrace{\mathbb{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|^{2}}_{V_{2}^{t}}
=𝔼f(𝐱¯t),1mi=1m(𝐱i,Kt𝐱it)+LV2t2\displaystyle=\mathbb{E}\langle\nabla f(\bar{\mathbf{x}}^{t}),\frac{1}{m}\sum_{i=1}^{m}(\mathbf{x}_{i,K}^{t}-\mathbf{x}_{i}^{t})\rangle+\frac{LV_{2}^{t}}{2}
=𝔼f(𝐱¯t),1mi=1m(𝐱i,Kt𝐱i,0t+β(𝐱it𝐱i,Kt1))+LV2t2\displaystyle=\mathbb{E}\langle\nabla f(\bar{\mathbf{x}}^{t}),\frac{1}{m}\sum_{i=1}^{m}\left(\mathbf{x}_{i,K}^{t}-\mathbf{x}_{i,0}^{t}+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})\right)\rangle+\frac{LV_{2}^{t}}{2}
=η𝔼f(𝐱¯t),1mi=1mk=0K1fi(𝐱i,kt)1mi=1mk=0K1fi(𝐱¯t)+Kf(𝐱¯t)+LV2t2\displaystyle=-\eta\mathbb{E}\langle\nabla f(\bar{\mathbf{x}}^{t}),\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\bar{\mathbf{x}}^{t})+K\nabla f(\bar{\mathbf{x}}^{t})\rangle+\frac{LV_{2}^{t}}{2}
=ηK𝔼f(𝐱¯t)2+𝔼ηKf(𝐱¯t),ηK1mi=1mk=0K1(fi(𝐱¯t)fi(𝐱i,kt))+LV2t2\displaystyle=-\eta K\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\mathbb{E}\langle\sqrt{\eta K}\nabla f(\bar{\mathbf{x}}^{t}),\sqrt{\frac{\eta}{K}}\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\left(\nabla f_{i}(\bar{\mathbf{x}}^{t})-\nabla f_{i}(\mathbf{x}_{i,k}^{t})\right)\rangle+\frac{LV_{2}^{t}}{2}
ηK𝔼f(𝐱¯t)2+ηK2𝔼f(𝐱¯t)2+η2mi=1mk=0K1𝔼fi(𝐱¯t)fi(𝐱i,kt)2\displaystyle\leq-\eta K\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\frac{\eta}{2m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla f_{i}(\bar{\mathbf{x}}^{t})-\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}
η2Km2𝔼i=1mk=0K1fi(𝐱i,kt)2+LV2t2\displaystyle\quad-\frac{\eta}{2Km^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+\frac{LV_{2}^{t}}{2}
ηK2𝔼f(𝐱¯t)2+ηL221mi=1mk=0K1𝔼𝐱¯t𝐱i,kt2V1tη2Km2𝔼i=1mk=0K1fi(𝐱i,kt)2+LV2t2\displaystyle\leq-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\frac{\eta L^{2}}{2}\underbrace{\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|\bar{\mathbf{x}}^{t}-\mathbf{x}_{i,k}^{t}\|^{2}}_{V_{1}^{t}}-\frac{\eta}{2Km^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+\frac{LV_{2}^{t}}{2}
ηK2𝔼f(𝐱¯t)2+ηL2V1t2η2Km2𝔼i=1mk=0K1fi(𝐱i,kt)2+LV2t2\displaystyle\leq-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\frac{\eta L^{2}V_{1}^{t}}{2}-\frac{\eta}{2Km^{2}}\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+\frac{LV_{2}^{t}}{2}

According to Lemma 1 and lemma 2 to bound the V1tV_{1}^{t} and V2tV_{2}^{t} , we can get the following recursive formula:

𝔼[f(𝐱¯t+1)f(𝐱¯t)]\displaystyle\mathbb{E}[f(\bar{\mathbf{x}}^{t+1})-f(\bar{\mathbf{x}}^{t})]
ηK2𝔼f(𝐱¯t)2+ηL22(6Kβ2Δt+3K2η2(σl2+6KG2+6λ2)+18K3η2B21mi=1m𝔼f(𝐱it)2)\displaystyle\leq-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\frac{\eta L^{2}}{2}\left(6K\beta^{2}\Delta^{t}+3K^{2}\eta^{2}\left(\sigma_{l}^{2}+6KG^{2}+6\lambda^{2}\right)+18K^{3}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}\right)
+η2LK2σl2+(η2Lm2η2Km2)𝔼i=1mk=0K1fi(𝐱i,kt)2+Lη2λ2\displaystyle\quad+\frac{\eta^{2}LK}{2}\sigma_{l}^{2}+\left(\frac{\eta^{2}L}{m^{2}}-\frac{\eta}{2Km^{2}}\right)\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+L\eta^{2}\lambda^{2}
ηK2𝔼f(𝐱¯t)2+3ηL2Kβ2Δt+9η3K3L2G2+9η3K3L2B21mi=1m𝔼f(𝐱it)2\displaystyle\leq-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+3\eta L^{2}K\beta^{2}\Delta^{t}+9\eta^{3}K^{3}L^{2}G^{2}+9\eta^{3}K^{3}L^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}
+η2LK2(1+3ηLK)σl2+(η2Lm2η2Km2)𝔼i=1mk=0K1fi(𝐱i,kt)2+(1+9K2ηL)Lη2λ2\displaystyle\quad+\frac{\eta^{2}LK}{2}\left(1+3\eta LK\right)\sigma_{l}^{2}+\left(\frac{\eta^{2}L}{m^{2}}-\frac{\eta}{2Km^{2}}\right)\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+(1+9K^{2}\eta L)L\eta^{2}\lambda^{2}

Furthermore, with Lemma 4, we can get:

1mi=1m𝔼f(𝐱it)2\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\left\|\nabla f(\mathbf{x}_{i}^{t})\right\|^{2} 2L2i=1m𝐱it𝐱¯t2m+2𝔼f(𝐱¯t)2\displaystyle\leq 2L^{2}\frac{\sum_{i=1}^{m}\left\|\mathbf{x}_{i}^{t}-\overline{\mathbf{x}}^{t}\right\|^{2}}{m}+2\mathbb{E}\left\|\nabla f(\overline{\mathbf{x}}^{t})\right\|^{2}
2L2C1t(1ψ)2+2𝔼f(𝐱¯t)2\displaystyle\leq 2L^{2}\frac{C_{1}^{t}}{(1-\psi)^{2}}+2\mathbb{E}\left\|\nabla f(\overline{\mathbf{x}}^{t})\right\|^{2}

Where C1t=6Kβ2Δt+3K2η2(σl2+6λ2+6KG2)+18K3η2B21mi=1m𝔼f(𝐱it)2C_{1}^{t}=6K\beta^{2}\Delta^{t}+3K^{2}\eta^{2}\left(\sigma_{l}^{2}+6\lambda^{2}+6KG^{2}\right)+18K^{3}\eta^{2}B^{2}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(\mathbf{x}_{i}^{t})\|^{2}, Therefore, we have:

1mi=1m𝔼f(𝐱it)28KL2β2Δt+6η2K2L2(σl2+4KG2)+2(1ψ)2𝔼f(𝐱¯t)2(1ψ)224η2K3B2L2\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\left\|\nabla f(\mathbf{x}_{i}^{t})\right\|^{2}\leq\frac{8KL^{2}\beta^{2}\Delta^{t}+6\eta^{2}K^{2}L^{2}(\sigma_{l}^{2}+4KG^{2})+2(1-\psi)^{2}\mathbb{E}\left\|\nabla f(\overline{\mathbf{x}}^{t})\right\|^{2}}{(1-\psi)^{2}-24\eta^{2}K^{3}B^{2}L^{2}}

And then, (25) can be represented as

𝔼[f(𝐱¯t+1)f(𝐱¯t)]\displaystyle\mathbb{E}[f(\bar{\mathbf{x}}^{t+1})-f(\bar{\mathbf{x}}^{t})] (26)
ηK(1218η2K2L2B2)𝔼f(𝐱¯t)2+ηL2Kβ2(3+72η2K3L2B2(1ψ)2)Δt+9η3K3L2(1+24η2K3L2B2(1ψ)2)G2\displaystyle\leq-\eta K(\frac{1}{2}-18\eta^{2}K^{2}L^{2}B^{2})\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\eta L^{2}K\beta^{2}\left(3+\frac{72\eta^{2}K^{3}L^{2}B^{2}}{(1-\psi)^{2}}\right)\Delta^{t}+9\eta^{3}K^{3}L^{2}\left(1+\frac{24\eta^{2}K^{3}L^{2}B^{2}}{(1-\psi)^{2}}\right)G^{2}
+η2LK(12+32ηLK+54η3K4L3B2(1ψ)2)σl2+(η2Lm2η2Km2)𝔼i=1mk=0K1fi(𝐱i,kt)2+(1+9K2ηL)Lη2λ2\displaystyle\quad+\eta^{2}LK\left(\frac{1}{2}+\frac{3}{2}\eta LK+\frac{54\eta^{3}K^{4}L^{3}B^{2}}{(1-\psi)^{2}}\right)\sigma_{l}^{2}+\left(\frac{\eta^{2}L}{m^{2}}-\frac{\eta}{2Km^{2}}\right)\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+(1+9K^{2}\eta L)L\eta^{2}\lambda^{2}

Where we use vary small η\eta [16]. Furthermore, in Lemma 5, μμγ=160Kβ2(1ψ)275β2\mu-\mu\gamma=1-\frac{60K\beta^{2}}{(1-\psi)^{2}}-75\beta^{2}. By setting β2min{(1ψ)2240K,1300}\beta^{2}\leq\min\{\frac{(1-\psi)^{2}}{240K},\frac{1}{300}\}, we can obtain μμγ>12\mu-\mu\gamma>\frac{1}{2}. Next, with Lemma 5, we get:

ηL2Kβ2(3+72η2K3L2B2(1ψ)2)Δt\displaystyle\eta L^{2}K\beta^{2}\left(3+\frac{72\eta^{2}K^{3}L^{2}B^{2}}{(1-\psi)^{2}}\right)\Delta^{t} ηL2Kβ2(3+72η2K3L2B2(1ψ)2)ΔtΔt+11γ+𝒪(η3)\displaystyle\leq\eta L^{2}K\beta^{2}\left(3+\frac{72\eta^{2}K^{3}L^{2}B^{2}}{(1-\psi)^{2}}\right)\frac{\Delta^{t}-\Delta^{t+1}}{1-\gamma}+\mathcal{O}(\eta^{3}) (27)

We omit the terms of 𝒪(η3)\mathcal{O}(\eta^{3}) in (27) and substitute (27) into (26) to obtain:

𝔼[f(𝐱¯t+1)f(𝐱¯t)]\displaystyle\mathbb{E}[f(\bar{\mathbf{x}}^{t+1})-f(\bar{\mathbf{x}}^{t})]
ηK(1218η2K2L2B2)𝔼f(𝐱¯t)2+ηL2Kβ2(3+72η2K3L2B2(1ψ)2)ΔtΔt+11γ+9η3K3L2(1+24η2K3L2B2(1ψ)2)G2\displaystyle\leq-\eta K(\frac{1}{2}-18\eta^{2}K^{2}L^{2}B^{2})\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}+\eta L^{2}K\beta^{2}\left(3+\frac{72\eta^{2}K^{3}L^{2}B^{2}}{(1-\psi)^{2}}\right)\frac{\Delta^{t}-\Delta^{t+1}}{1-\gamma}+9\eta^{3}K^{3}L^{2}\left(1+\frac{24\eta^{2}K^{3}L^{2}B^{2}}{(1-\psi)^{2}}\right)G^{2}
+η2LK(12+32ηLK+54η3K4L3B2(1ψ)2)σl2+(η2Lm2η2Km2)𝔼i=1mk=0K1fi(𝐱i,kt)2+(1+9K2ηL)Lη2λ2\displaystyle\quad+\eta^{2}LK\left(\frac{1}{2}+\frac{3}{2}\eta LK+\frac{54\eta^{3}K^{4}L^{3}B^{2}}{(1-\psi)^{2}}\right)\sigma_{l}^{2}+\left(\frac{\eta^{2}L}{m^{2}}-\frac{\eta}{2Km^{2}}\right)\mathbb{E}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|^{2}+(1+9K^{2}\eta L)L\eta^{2}\lambda^{2}

Firstly, to remove the gradient term, we follow the [57, 18] and let η2Lm2η2Km20\frac{\eta^{2}L}{m^{2}}-\frac{\eta}{2Km^{2}}\leq 0, then learning rate η12KL\eta\leq\frac{1}{2KL}. Then, according to the [57], there exists a positive constant κ(0,1)\kappa\in(0,1) such that 1218η2K2L2B2κ>0\frac{1}{2}-18\eta^{2}K^{2}L^{2}B^{2}\geq\kappa>0 when η16KLB\eta\leq\frac{1}{6KLB}. Also, when η1K3/2LB\eta\leq\frac{1}{K^{3/2}LB}, we have η2K3L2B21\eta^{2}K^{3}L^{2}B^{2}\leq 1. Therefore, we have:

κηK𝔼f(𝐱¯t)2\displaystyle\kappa\eta K\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2} 𝔼[f(𝐱¯t)f(𝐱¯t+1)]+ηL2Kβ2(3+72(1ψ)2)ΔtΔt+11γ\displaystyle\leq\mathbb{E}[f(\bar{\mathbf{x}}^{t})-f(\bar{\mathbf{x}}^{t+1})]+\eta L^{2}K\beta^{2}\left(3+\frac{72}{(1-\psi)^{2}}\right)\frac{\Delta^{t}-\Delta^{t+1}}{1-\gamma} (28)
+9η3K3L2(1+24(1ψ)2)G2+η2LK(2+36(1ψ)2)σl2+(1+9K2ηL)Lη2λ2\displaystyle\quad+9\eta^{3}K^{3}L^{2}\left(1+\frac{24}{(1-\psi)^{2}}\right)G^{2}+\eta^{2}LK\left(2+\frac{36}{(1-\psi)^{2}}\right)\sigma_{l}^{2}+(1+9K^{2}\eta L)L\eta^{2}\lambda^{2}

B-A3 Proof of Theorem 1

Theorem 3

Under Assumption 1 - 3, let the learning rate satisfy η1K3/2LB\eta\leq\frac{1}{K^{3/2}LB} where K2K\geq 2, let the relaxation coefficient βmin{10(1ψ)40,530}\beta\leq min\{\frac{\sqrt{10}(1-\psi)}{40},\frac{\sqrt{5}}{30}\} , and after training TT rounds, the averaged model parameters generated by our proposed algorithm satisfies:

1Tt=0T1𝔼f(𝐱¯t)2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2} 𝔼[f(𝐱¯0)f(𝐱¯T)]κηKT+9η2K2L2(1+24(1ψ)2)G2κ+ηL(2+36(1ψ)2)σl2κ\displaystyle\leq\frac{\mathbb{E}[f(\bar{\mathbf{x}}^{0})-f(\bar{\mathbf{x}}^{T})]}{\kappa\eta KT}+9\eta^{2}K^{2}L^{2}\left(1+\frac{24}{(1-\psi)^{2}}\right)\frac{G^{2}}{\kappa}+\eta L\left(2+\frac{36}{(1-\psi)^{2}}\right)\frac{\sigma_{l}^{2}}{\kappa}
+(1+9K2ηL)Lη2λ2L2β2(3+72(1ψ)2)ΔTκ(1γ)T\displaystyle\quad+(1+9K^{2}\eta L)L\eta^{2}\lambda^{2}-L^{2}\beta^{2}\left(3+\frac{72}{(1-\psi)^{2}}\right)\frac{\Delta^{T}}{\kappa(1-\gamma)T}

Where κ(0,1)\kappa\in(0,1) is a constant.

Further, by selecting the proper learning rate η=𝒪(1KT)\eta=\mathcal{O}(\frac{1}{\sqrt{KT}}) and let D=f(𝐱¯0)f(𝐱¯)D=f(\bar{\mathbf{x}}^{0})-f(\bar{\mathbf{x}}^{*}) as the initialization bias, thenthe averaged model parameters 𝐱¯t\bar{\mathbf{x}}^{t} satisfies:

1Tt=0T1𝔼f(𝐱¯t)2=𝒪(DKT+KL2T(1ψ)2G2+LTK(1ψ)2σl2+LTKλ2)\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}=\mathcal{O}(\frac{D}{\sqrt{KT}}+\frac{KL^{2}}{T(1-\psi)^{2}}G^{2}+\frac{L}{\sqrt{T}K(1-\psi)^{2}}\sigma_{l}^{2}+\frac{L}{TK}\lambda^{2})
Proof 5

According to the expansion of the smoothness inequality (28), we have:

κηK𝔼f(𝐱¯t)2\displaystyle\kappa\eta K\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2} 𝔼[f(𝐱¯t)f(𝐱¯t+1)]+ηL2Kβ2(3+72(1ψ)2)ΔtΔt+11γ\displaystyle\leq\mathbb{E}[f(\bar{\mathbf{x}}^{t})-f(\bar{\mathbf{x}}^{t+1})]+\eta L^{2}K\beta^{2}\left(3+\frac{72}{(1-\psi)^{2}}\right)\frac{\Delta^{t}-\Delta^{t+1}}{1-\gamma}
+9η3K3L2(1+24(1ψ)2)G2+η2LK(2+36(1ψ)2)σl2+(1+9K2ηL)Lη2λ2\displaystyle\quad+9\eta^{3}K^{3}L^{2}\left(1+\frac{24}{(1-\psi)^{2}}\right)G^{2}+\eta^{2}LK\left(2+\frac{36}{(1-\psi)^{2}}\right)\sigma_{l}^{2}+(1+9K^{2}\eta L)L\eta^{2}\lambda^{2}

Taking the accumulation from 0 to T1T-1, we have:

1Tt=0T1𝔼f(𝐱¯t)2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}
𝔼[f(𝐱¯0)f(𝐱¯T)]κηKT+L2β2(3+72(1ψ)2)Δ0ΔTκ(1γ)T\displaystyle\leq\frac{\mathbb{E}[f(\bar{\mathbf{x}}^{0})-f(\bar{\mathbf{x}}^{T})]}{\kappa\eta KT}+L^{2}\beta^{2}\left(3+\frac{72}{(1-\psi)^{2}}\right)\frac{\Delta^{0}-\Delta^{T}}{\kappa(1-\gamma)T}
+9η2K2L2(1+24(1ψ)2)G2κ+ηL(2+36(1ψ)2)σl2κ+(1+9K2ηL)Lηλ2κ\displaystyle\quad+9\eta^{2}K^{2}L^{2}\left(1+\frac{24}{(1-\psi)^{2}}\right)\frac{G^{2}}{\kappa}+\eta L\left(2+\frac{36}{(1-\psi)^{2}}\right)\frac{\sigma_{l}^{2}}{\kappa}+(1+9K^{2}\eta L)L\eta\frac{\lambda^{2}}{\kappa}
𝔼[f(𝐱¯0)f(𝐱¯T)]κηKT+9η2K2L2(1+24(1ψ)2)G2κ+ηL(2+36(1ψ)2)σl2κ+(1+9K2ηL)Lηλ2κ\displaystyle\leq\frac{\mathbb{E}[f(\bar{\mathbf{x}}^{0})-f(\bar{\mathbf{x}}^{T})]}{\kappa\eta KT}+9\eta^{2}K^{2}L^{2}\left(1+\frac{24}{(1-\psi)^{2}}\right)\frac{G^{2}}{\kappa}+\eta L\left(2+\frac{36}{(1-\psi)^{2}}\right)\frac{\sigma_{l}^{2}}{\kappa}+(1+9K^{2}\eta L)L\eta\frac{\lambda^{2}}{\kappa}
L2β2(3+72(1ψ)2)ΔTκ(1γ)T\displaystyle\quad-L^{2}\beta^{2}\left(3+\frac{72}{(1-\psi)^{2}}\right)\frac{\Delta^{T}}{\kappa(1-\gamma)T}

We select the learning rate η=𝒪(1KT)\eta=\mathcal{O}(\frac{1}{\sqrt{KT}}) and let D=f(𝐱¯0)f(𝐱¯)D=f(\bar{\mathbf{x}}^{0})-f(\bar{\mathbf{x}}^{*}) as the initialization bias, then we have:

1Tt=0T1𝔼f(𝐱¯t)2=𝒪(DKT+KL2T(1ψ)2G2+LTK(1ψ)2σl2+LTKλ2)\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{\mathbf{x}}^{t})\|^{2}=\mathcal{O}(\frac{D}{\sqrt{KT}}+\frac{KL^{2}}{T(1-\psi)^{2}}G^{2}+\frac{L}{\sqrt{T}K(1-\psi)^{2}}\sigma_{l}^{2}+\frac{L}{\sqrt{T}K}\lambda^{2})

This completes the proof.

B-B Proofs for the Generalization Error

In this part, we prove the generalization error for our proposed method. We assume the objective function ff is LL-smooth and LGL_{G}-Lipschitz as defined in [47, 29]. We follow the uniform stability to upper bound the generalization error in the DFL.

We suppose there are mm clients participating in the training process as a set 𝒞={i}i=1m\mathcal{C}=\{i\}_{i=1}^{m}. Each client has a local dataset 𝒮i={zj}j=1S\mathcal{S}_{i}=\{z_{j}\}_{j=1}^{S} with total SS data sampled from a specific unknown distribution 𝒟i\mathcal{D}_{i}. Now we define a re-sampled dataset 𝒮i~\widetilde{\mathcal{S}_{i}} which only differs from the dataset 𝒮i\mathcal{S}_{i} on the jj^{*}-th data. We replace the 𝒮i\mathcal{S}_{i^{*}} with 𝒮~i\widetilde{\mathcal{S}}_{i^{*}} and keep other m1m-1 local dataset, which composes a new set 𝒞~\widetilde{\mathcal{C}}. 𝒞\mathcal{C} only differs from the 𝒞~\widetilde{\mathcal{C}} at jj^{*}-th data on the ii^{*}-th client. Then, based on these two sets, our method could generate two output models, 𝐱¯t\bar{\mathbf{x}}^{t} and 𝐱¯~t\widetilde{\bar{\mathbf{x}}}^{t} respectively, after tt training rounds. We first introduce some notations used in the proof of the generalization error.

TABLE VII: Some abbreviations of the used terms in the proof of bounded stability error.
Notation Formulation Description
𝐱¯t\bar{\mathbf{x}}^{t} - parameters trained with set 𝒞\mathcal{C}
𝐱¯~t\widetilde{\bar{\mathbf{x}}}^{t} - parameters trained with set 𝒞~\widetilde{\mathcal{C}}
δkt\delta_{k}^{t} 1mi=1m𝔼𝐱i,kt𝐱~i,kt\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\| stability difference at kk-iteration on tt-round

Then we introduce some important lemmas in our proofs.

B-B1 Important Lemmas

Lemma 6

(Lemma 3.11 in [47]) We follow the definition in [47, 29] to upper bound the uniform stability term after each communication round in DFL paradigm. Different from their vanilla calculations, DFL considers the finite-sum function on heterogeneous clients. Let non-negative objective ff is LL-smooth and LGL_{G}-Lipschitz. After training TT rounds on 𝒞\mathcal{C} and 𝒞~\widetilde{\mathcal{C}}. our method generates two models 𝐱¯T+1\bar{\mathbf{x}}^{T+1} and 𝐱¯~T+1\widetilde{\bar{\mathbf{x}}}^{T+1} respectively. For each data zz and every t0{1,2,3,,S}t_{0}\in\{1,2,3,\cdots,S\}, we have:

𝔼f(𝐱¯T+1;z)f(𝐱¯~T+1;z)LGmi=1m𝔼[𝐱i,KT𝐱~i,KT|ξ]+Ut0S\mathbb{E}\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\|\leq\frac{L_{G}}{m}\sum_{i=1}^{m}\mathbb{E}\left[\|\mathbf{x}_{i,K}^{T}-\widetilde{\mathbf{x}}_{i,K}^{T}\||\xi\right]+\frac{Ut_{0}}{S}
Proof 6

Let ξ=1\xi=1 denote the event 𝐱¯t0𝐱¯~t0=0\|\bar{\mathbf{x}}^{t_{0}}-\widetilde{\bar{\mathbf{x}}}^{t_{0}}\|=0 and U=sup𝐱,z{f(𝐱;z)}U=\sup_{\mathbf{x},z}\{f({\mathbf{x}};z)\}, we have:

𝐄f(𝐱¯T+1;z)f(𝐱¯~T+1;z)\displaystyle\mathbf{E}\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\|
=P({ξ})𝔼[f(𝐱¯T+1;z)f(𝐱¯~T+1;z)|ξ]+P({ξc})𝔼[f(𝐱¯T+1;z)f(𝐱¯~T+1;z)|ξc]\displaystyle=P(\{\xi\})\mathbb{E}\left[\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\||\xi\right]+P(\{\xi^{c}\})\mathbb{E}\left[\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\||\xi^{c}\right]
𝔼[f(𝐱¯T+1;z)f(𝐱¯~T+1;z)|ξ]+P({ξc})sup𝐱,z{f(𝐱;z)}\displaystyle\leq\mathbb{E}\left[\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\||\xi\right]+P(\{\xi^{c}\})\sup_{\mathbf{x},z}\{f({\mathbf{x}};z)\}
LG𝔼[𝐱¯T+1𝐱¯~T+1|ξ]+P({ξc})U\displaystyle\leq L_{G}\mathbb{E}\left[\|{\bar{\mathbf{x}}}^{T+1}-\widetilde{\bar{\mathbf{x}}}^{T+1}\||\xi\right]+P(\{\xi^{c}\})U
=LG𝔼[1mi=1m(𝐱i,KT𝐱~i,KT)|ξ]+P({ξc})U\displaystyle=L_{G}\mathbb{E}\left[\|\frac{1}{m}\sum_{i=1}^{m}(\mathbf{x}_{i,K}^{T}-\widetilde{\mathbf{x}}_{i,K}^{T})\||\xi\right]+P(\{\xi^{c}\})U
LGmi=1m𝔼[𝐱i,KT𝐱~i,KT|ξ]+P({ξc})U\displaystyle\leq\frac{L_{G}}{m}\sum_{i=1}^{m}\mathbb{E}\left[\|\mathbf{x}_{i,K}^{T}-\widetilde{\mathbf{x}}_{i,K}^{T}\||\xi\right]+P(\{\xi^{c}\})U

Before the jj^{*}-th data on ii^{*}-th client is sampled, the iterative states are identical on both 𝒞\mathcal{C} and 𝒞~\widetilde{\mathcal{C}}. Let j~\widetilde{j} is the index of the first different sampling, if j~>t0\widetilde{j}>t_{0}, then ξ=1\xi=1 hold for t0t_{0}. Therefore, we have:

P({ξc})=P({ξ=0})=P(j~t0)t0SP(\{\xi^{c}\})=P(\{\xi=0\})=P(\widetilde{j}\leq t_{0})\leq\frac{t_{0}}{S}

We complete the proof.

Lemma 7

(Lemma 1.1 in [29]) Different from their calculations, we prove similar inequalities on ff in the stochastic optimization. Under Assumption 1 and 2, the local updates satisfy 𝐱i,k+1t=𝐱i,ktη𝐠i,kt\mathbf{x}_{i,k+1}^{t}=\mathbf{x}_{i,k}^{t}-\eta\mathbf{g}_{i,k}^{t} on 𝒞\mathcal{C} and 𝐱~i,k+1t=𝐱~i,ktη𝐠~i,kt\widetilde{\mathbf{x}}_{i,k+1}^{t}=\widetilde{\mathbf{x}}_{i,k}^{t}-\eta\widetilde{\mathbf{g}}_{i,k}^{t} on 𝒞~\widetilde{\mathcal{C}}. If at kk-th iteration on each round, we sample the same data in 𝒞\mathcal{C} and 𝒞~\widetilde{\mathcal{C}}, then we have:

𝔼𝐱i,k+1t𝐱~i,k+1t(1+ηL)𝔼𝐱i,kt𝐱~i,kt+2ησl\mathbb{E}\|\mathbf{x}_{i,k+1}^{t}-\widetilde{\mathbf{x}}_{i,k+1}^{t}\|\leq(1+\eta L)\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+2\eta\sigma_{l}
Proof 7

In each round tt, by the triangle inequality and omitting the same data zz, we have:

𝔼𝐱i,k+1t𝐱~i,k+1t\displaystyle\mathbb{E}\|\mathbf{x}_{i,k+1}^{t}-\widetilde{\mathbf{x}}_{i,k+1}^{t}\|
=𝔼𝐱i,ktηgi,kt𝐱~i,ktηg~i,kt\displaystyle=\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\eta g_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}-\eta\widetilde{g}_{i,k}^{t}\|
𝔼wi,kt𝐱~i,kt+η𝔼gi,ktg~i,kt\displaystyle\leq\mathbb{E}\|w_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+\eta\mathbb{E}\|g_{i,k}^{t}-\widetilde{g}_{i,k}^{t}\|
𝔼𝐱i,kt𝐱~i,kt+η𝔼gi,ktfi(𝐱i,kt)+η𝔼g~i,ktfi(𝐱~i,kt)\displaystyle\leq\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+\eta\mathbb{E}\|g_{i,k}^{t}-\nabla f_{i}(\mathbf{x}_{i,k}^{t})\|+\eta\mathbb{E}\|\widetilde{g}_{i,k}^{t}-\nabla f_{i}(\widetilde{\mathbf{x}}_{i,k}^{t})\|
(1+ηL)𝔼𝐱i,kt𝐱~i,kt+2ησl\displaystyle\leq(1+\eta L)\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+2\eta\sigma_{l}

The final inequality adopts assumptions of 𝔼gi,ktfi(wi,kt)𝔼gi,ktfi(wi,kt)2σl\mathbb{E}\|g_{i,k}^{t}-\nabla f_{i}(w_{i,k}^{t})\|\leq\sqrt{\mathbb{E}\|g_{i,k}^{t}-\nabla f_{i}(w_{i,k}^{t})\|^{2}}\leq\sigma_{l}. This completes the proof.

Lemma 8

(Lemma 1.2 in [29]) Different from their calculations, we prove similar inequalities on ff in the stochastic optimization. Under Assumption 1, 4 and 2, the local updates satisfy 𝐱i,k+1t=𝐱i,ktη𝐠i,kt\mathbf{x}_{i,k+1}^{t}=\mathbf{x}_{i,k}^{t}-\eta\mathbf{g}_{i,k}^{t} on 𝒞\mathcal{C} and 𝐱~i,k+1t=𝐱~i,ktη𝐠~i,kt\widetilde{\mathbf{x}}_{i,k+1}^{t}=\widetilde{\mathbf{x}}_{i,k}^{t}-\eta\widetilde{\mathbf{g}}_{i,k}^{t} on 𝒞~\widetilde{\mathcal{C}}. If at kk-th iteration on each round, we sample the different data in 𝒞\mathcal{C} and 𝒞~\widetilde{\mathcal{C}}, then we have:

𝔼𝐱i,k+1t𝐱~i,k+1t𝔼𝐱i,kt𝐱~i,kt+2η(σl+LG)\mathbb{E}\|\mathbf{x}_{i,k+1}^{t}-\widetilde{\mathbf{x}}_{i,k+1}^{t}\|\leq\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+2\eta(\sigma_{l}+L_{G})
Proof 8

In each round tt, let by the triangle inequality and denoting the different data as zz and z~\widetilde{z}, we have:

𝔼𝐱i,k+1t𝐱~i,k+1t\displaystyle\mathbb{E}\|\mathbf{x}_{i,k+1}^{t}-\widetilde{\mathbf{x}}_{i,k+1}^{t}\|
=𝔼𝐱i,ktηgi,kt𝐱~i,ktηg~i,kt\displaystyle=\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\eta g_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}-\eta\widetilde{g}_{i,k}^{t}\|
𝔼𝐱i,kt𝐱~i,kt+η𝔼gi,ktg~i,kt\displaystyle\leq\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+\eta\mathbb{E}\|g_{i,k}^{t}-\widetilde{g}_{i,k}^{t}\|
=𝔼𝐱i,kt𝐱~i,kt+η𝔼gi,ktfi(𝐱i,kt;z)g~i,ktfi(𝐱~i,kt;z~)+fi(𝐱i,kt;z)fi(𝐱~i,kt;z~)\displaystyle=\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+\eta\mathbb{E}\|g_{i,k}^{t}-\nabla f_{i}(\mathbf{x}_{i,k}^{t};z)-\widetilde{g}_{i,k}^{t}-\nabla f_{i}(\widetilde{\mathbf{x}}_{i,k}^{t};\widetilde{z})+\nabla f_{i}(\mathbf{x}_{i,k}^{t};z)-\nabla f_{i}(\widetilde{\mathbf{x}}_{i,k}^{t};\widetilde{z})\|
𝔼𝐱i,kt𝐱~i,kt+2ησl+η𝔼fi(𝐱i,kt;z)fi(𝐱~i,kt;z~)\displaystyle\leq\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+2\eta\sigma_{l}+\eta\mathbb{E}\|\nabla f_{i}(\mathbf{x}_{i,k}^{t};z)-\nabla f_{i}(\widetilde{\mathbf{x}}_{i,k}^{t};\widetilde{z})\|
𝔼𝐱i,kt𝐱~i,kt+2η(σl+LG).\displaystyle\leq\mathbb{E}\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\|+2\eta(\sigma_{l}+L_{G}).

The final inequality adopts the LGL_{G}-Lipschitz. This completes the proof.

Lemma 9

Given the stepsize β1ψ4m+1ψ\beta\leq\frac{1-\psi}{4\sqrt{m}+1-\psi}, we have following bound:

𝐄𝐱it𝐱¯t1+βα(1β)K(σl+LG)j=0t1ηjψt1j\mathbf{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|\leq\frac{1+\beta}{\alpha(1-\beta)}K(\sigma_{l}+L_{G})\sum_{j=0}^{t-1}\eta_{j}\psi^{t-1-j}

Where α=14mβ(1ψ)(1β)\alpha=1-\frac{4\sqrt{m}\beta}{(1-\psi)(1-\beta)}.

Proof 9

Following [Lemma 4, [15]], we denote 𝐙t:=[𝐳1t,𝐳2t,,𝐳mt]m×d{\bf Z}^{t}:=\begin{bmatrix}{\bf z}_{1}^{t},{\mathbf{z}}_{2}^{t},\ldots,{\bf z}_{m}^{t}\end{bmatrix}^{\top}\in\mathbb{R}^{m\times d}. With these notation, we have

𝐗t+1=𝐖𝐙t=𝐖𝐗tζt,\displaystyle{\bf X}^{t+1}={\bf W}{\bf Z}^{t}={\bf W}{\bf X}^{t}-{\bf\zeta}^{t}, (29)

where ζt:=𝐖𝐗t𝐖𝐙t{\bf\zeta}^{t}:={\bf W}{\bf X}^{t}-{\bf W}{\bf Z}^{t}. Following [Lemma 8, [43]], we have:

𝐄(𝕀𝐏)𝐗t+1ψ𝐄(𝕀𝐏)𝐗t+2𝐄ζt\displaystyle\mathbf{E}\|(\mathbb{I}-\mathbf{P})\mathbf{X}^{t+1}\|\leq\psi\mathbf{E}\|(\mathbb{I}-\mathbf{P})\mathbf{X}^{t}\|+2\mathbf{E}\|\zeta^{t}\| (30)

Assuming 𝐄𝐱it𝐱¯tD\mathbf{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|\leq D , it means that 𝐄(𝕀𝐏)𝐗tmD\mathbf{E}\|(\mathbb{I}-\mathbf{P})\mathbf{X}^{t}\|\leq\sqrt{m}D. Next, we will bound 𝐄𝐱i,kt𝐱it\mathbf{E}\|\mathbf{x}_{i,k}^{t}-\mathbf{x}_{i}^{t}\|. According to the equation 𝐱i,kt𝐱it=β(𝐱it𝐱i,Kt1)ηj=0k1𝐠i,jt\mathbf{x}_{i,k}^{t}-\mathbf{x}_{i}^{t}=\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})-\eta\sum_{j=0}^{k-1}\mathbf{g}_{i,j}^{t} and Assumption 4, we have:

𝐄𝐱i,kt𝐱itβ𝐄𝐱it𝐱i,Kt1+Kη(σl+LG)\displaystyle\mathbf{E}\|\mathbf{x}_{i,k}^{t}-\mathbf{x}_{i}^{t}\|\leq\beta\mathbf{E}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1}\|+K\eta(\sigma_{l}+L_{G})

Next, we need to bound 𝐄𝐱it𝐱i,Kt1\mathbf{E}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1}\|. According to (24), we have:

𝐄𝐱it+1𝐱i,Kt\displaystyle\mathbf{E}\|\mathbf{x}_{i}^{t+1}-\mathbf{x}_{i,K}^{t}\| β𝐄𝐱it𝐱i,Kt1+𝐄𝐱it+1𝐱it+1+𝐄ηk=0K1𝐠i,kt\displaystyle\leq\beta\mathbf{E}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1}\|+\mathbf{E}\|\mathbf{x}_{i}^{t+1}-\mathbf{x}_{i}^{t+1}\|+\mathbf{E}\|\eta\sum_{k=0}^{K-1}\mathbf{g}_{i,k}^{t}\|
β𝐄𝐱it𝐱i,Kt1+𝐄𝐱it+1𝐱¯t+1+𝐄𝐱it𝐱¯t+𝐄𝐱¯t+1𝐱¯t+𝐄ηk=0K1𝐠i,kt\displaystyle\leq\beta\mathbf{E}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1}\|+\mathbf{E}\|\mathbf{x}_{i}^{t+1}-\bar{\mathbf{x}}^{t+1}\|+\mathbf{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|+\mathbf{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|+\mathbf{E}\|\eta\sum_{k=0}^{K-1}\mathbf{g}_{i,k}^{t}\|
(a)β𝐄𝐱it𝐱i,Kt1+2D+2Kη(σl+LG)\displaystyle\overset{(a)}{\leq}\beta\mathbf{E}\|\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1}\|+2D+2K\eta(\sigma_{l}+L_{G})
2D+2Kη(σl+LG)1β\displaystyle\leq\frac{2D+2K\eta(\sigma_{l}+L_{G})}{1-\beta}

Where (a) uses 𝐄𝐱it𝐱¯tD\mathbf{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|\leq D, 𝐄ηk=0K1𝐠i,ktKη(σl+LG)\mathbf{E}\|\eta\sum_{k=0}^{K-1}\mathbf{g}_{i,k}^{t}\|\leq K\eta(\sigma_{l}+L_{G}) and 𝐄𝐱¯t+1𝐱¯t=𝐄1mi=1mk=0K1η𝐠i,ktKη(σl+LG)\mathbf{E}\|\bar{\mathbf{x}}^{t+1}-\bar{\mathbf{x}}^{t}\|=\mathbf{E}\|\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\eta\mathbf{g}_{i,k}^{t}\|\leq K\eta(\sigma_{l}+L_{G}). Then we get

𝐄𝐱i,kt𝐱it2β1β(D+Kη(σl+LG))+Kη(σl+LG)=2β1βD+(1+β1β)Kη(σl+LG)\displaystyle\mathbf{E}\|\mathbf{x}_{i,k}^{t}-\mathbf{x}_{i}^{t}\|\leq\frac{2\beta}{1-\beta}\left(D+K\eta(\sigma_{l}+L_{G})\right)+K\eta(\sigma_{l}+L_{G})=\frac{2\beta}{1-\beta}D+\left(\frac{1+\beta}{1-\beta}\right)K\eta(\sigma_{l}+L_{G})

This implies:

𝐄ζtm(2β1βD+(1+β1β)Kη(σl+LG))\mathbf{E}\|\zeta^{t}\|\leq\sqrt{m}\left(\frac{2\beta}{1-\beta}D+\left(\frac{1+\beta}{1-\beta}\right)K\eta(\sigma_{l}+L_{G})\right)

According to (30), we have:

𝐄(𝕀𝐏)𝐗t2m(2β(1ψ)(1β)D+1+β1βK(σl+LG)j=0t1ηjψt1j)\displaystyle\mathbf{E}\|(\mathbb{I}-\mathbf{P})\mathbf{X}^{t}\|\leq 2\sqrt{m}\left(\frac{2\beta}{(1-\psi)(1-\beta)}D+\frac{1+\beta}{1-\beta}K(\sigma_{l}+L_{G})\sum_{j=0}^{t-1}\eta_{j}\psi^{t-1-j}\right) (31)

Letting the term on the right-hand side of (31) be denoted as DD, we obtain that, when β1ψ4m+1ψ\beta\leq\frac{1-\psi}{4\sqrt{m}+1-\psi} and let α=14mβ(1ψ)(1β)\alpha=1-\frac{4\sqrt{m}\beta}{(1-\psi)(1-\beta)}, we have

𝐄𝐱it𝐱¯t1+βα(1β)K(σl+LG)j=0t1ηjψt1j\mathbf{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|\leq\frac{1+\beta}{\alpha(1-\beta)}K(\sigma_{l}+L_{G})\sum_{j=0}^{t-1}\eta_{j}\psi^{t-1-j}

We have completed the proof.

Lemma 10

(Lemma 5 in [43]) For any 0<ψ<10<\psi<1 and t+t\in\mathbb{Z}^{+}, it holds

j=0t1ψt1jj+1Cλt\sum_{j=0}^{t-1}\frac{\psi^{t-1-j}}{j+1}\leq\frac{C_{\lambda}}{t}

with Cλ:={ln1λλln1λλ+ln21λ16λλln1λ8+2λln1λλ0,0,λ=0.\left.C_{\lambda}:=\left\{\begin{array}[]{cc}\ln\frac{1}{\lambda}\frac{\lambda^{\ln\frac{1}{\lambda}}}{\lambda}+\frac{\ln^{2}\frac{1}{\lambda}}{16\lambda}\lambda^{\frac{\ln\frac{1}{\lambda}}{8}}+\frac{2}{\lambda\ln\frac{1}{\lambda}}\lambda\neq 0,\\ 0,\lambda=0.\end{array}\right.\right.

B-B2 Bounded Uniform Stability

According to Lemma 6, we firstly bound the recursive stability on kk in one round. If the sampled data is the same, we can adopt Lemma 7. Otherwise, we adopt Lemma 8. Thus we can bound the first term in Lemma 6 as:

δk+1t\displaystyle\delta_{k+1}^{t} =1mi=1m𝔼[𝐱i,k+1t𝐱~i,k+1t|ξ]\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\left[\|\mathbf{x}_{i,k+1}^{t}-\widetilde{\mathbf{x}}_{i,k+1}^{t}\||\xi\right]
=P(z)1mi=1m𝔼[𝐱i,k+1t𝐱~i,k+1t|ξ,z]+P(z~)1mi=1m𝔼[𝐱i,k+1t𝐱~i,k+1t|ξ,z~]\displaystyle=P(z)\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\left[\|\mathbf{x}_{i,k+1}^{t}-\widetilde{\mathbf{x}}_{i,k+1}^{t}\||\xi,z\right]+P(\widetilde{z})\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\left[\|\mathbf{x}_{i,k+1}^{t}-\widetilde{\mathbf{x}}_{i,k+1}^{t}\||\xi,\widetilde{z}\right]
S1mSi=1m((1+ηL)𝔼[𝐱i,kt𝐱~i,kt|ξ]+2ησl)+1mSi=1m(𝔼[𝐱i,kt𝐱~i,kt|ξ]+2η(σl+LG))\displaystyle\leq\frac{S-1}{mS}\sum_{i=1}^{m}\left((1+\eta L)\mathbb{E}\left[\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\||\xi\right]+2\eta\sigma_{l}\right)+\frac{1}{mS}\sum_{i=1}^{m}\left(\mathbb{E}\left[\|\mathbf{x}_{i,k}^{t}-\widetilde{\mathbf{x}}_{i,k}^{t}\||\xi\right]+2\eta(\sigma_{l}+L_{G})\right)
(1+ηL)δkt+2ηLGS+2ησl\displaystyle\leq(1+\eta L)\delta_{k}^{t}+\frac{2\eta L_{G}}{S}+2\eta\sigma_{l}

Balancing the LHS and RHS, we have the following recursive formulation:

δk+1t+2(LG+Sσl)SL(1+ηL)(δkt+2(LG+Sσl)SL)\delta_{k+1}^{t}+\frac{2(L_{G}+S\sigma_{l})}{SL}\leq(1+\eta L)\left(\delta_{k}^{t}+\frac{2(L_{G}+S\sigma_{l})}{SL}\right)

Therefore, in one single communication round, by generally defining learning rate η=ηkt\eta=\eta_{k}^{t}:

δKt+2(LG+Sσl)SL(k=0K1(1+ηktL))(δ0t+2(LG+Sσl)SL)\delta_{K}^{t}+\frac{2(L_{G}+S\sigma_{l})}{SL}\leq\left(\prod_{k=0}^{K-1}(1+\eta_{k}^{t}L)\right)\left(\delta_{0}^{t}+\frac{2(L_{G}+S\sigma_{l})}{SL}\right)

The next important relationship is to measure the δKt1\delta_{K}^{t-1} and δ0t\delta_{0}^{t}. According to the update rule 𝐱i,0t=𝐱it+β(𝐱it𝐱i,Kt1)\mathbf{x}_{i,0}^{t}=\mathbf{x}_{i}^{t}+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1}), we have the difference follows:

𝐱i,0t𝐱~i,0t\displaystyle\mathbf{x}_{i,0}^{t}-\widetilde{\mathbf{x}}_{i,0}^{t} =𝐱it𝐱~it+β(𝐱it𝐱i,Kt1)β(𝐱~it𝐱~i,Kt1)\displaystyle=\mathbf{x}_{i}^{t}-\widetilde{\mathbf{x}}_{i}^{t}+\beta(\mathbf{x}_{i}^{t}-\mathbf{x}_{i,K}^{t-1})-\beta(\widetilde{\mathbf{x}}_{i}^{t}-\widetilde{\mathbf{x}}_{i,K}^{t-1})
=(1+β)(𝐱it𝐱~it)β(𝐱i,Kt1𝐱~i,Kt1)\displaystyle=(1+\beta)(\mathbf{x}_{i}^{t}-\widetilde{\mathbf{x}}_{i}^{t})-\beta(\mathbf{x}_{i,K}^{t-1}-\widetilde{\mathbf{x}}_{i,K}^{t-1})

By taking the expectation on the l2l_{2} norm, we have:

δ0t=1mi=1m𝔼𝐱i,0t𝐱~i,0t\displaystyle\delta_{0}^{t}=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i,0}^{t}-\widetilde{\mathbf{x}}_{i,0}^{t}\| 1+βmi=1m𝔼𝐱it𝐱~it+βmi=1m𝔼𝐱i,Kt1𝐱~i,Kt1\displaystyle\leq\frac{1+\beta}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t}-\widetilde{\mathbf{x}}_{i}^{t}\|+\frac{\beta}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i,K}^{t-1}-\widetilde{\mathbf{x}}_{i,K}^{t-1}\|
(1+β)1mi=1m𝔼𝐱it𝐱~itR1+βδKt1\displaystyle\leq(1+\beta)\underbrace{\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t}-\widetilde{\mathbf{x}}_{i}^{t}\|}_{R_{1}}+\beta\delta_{K}^{t-1}

Next, we bound R1.

R1\displaystyle R_{1} =1mi=1m𝔼𝐱it𝐱~it\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t}-\widetilde{\mathbf{x}}_{i}^{t}\|
1mi=1m𝔼𝐱it𝐱¯t+1mi=1m𝔼𝐱~it𝐱¯~t+1mi=1m𝔼𝐱¯~t𝐱¯t\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|+\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\widetilde{\mathbf{x}}_{i}^{t}-\widetilde{\bar{\mathbf{x}}}^{t}\|+\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\widetilde{\bar{\mathbf{x}}}^{t}-\bar{\mathbf{x}}^{t}\|
=1mi=1m𝔼𝐱it𝐱¯t+1mi=1m𝔼𝐱~it𝐱¯~t+𝔼1mi=1m(𝐱i,Kt1𝐱~i,Kt1)\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|+\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\widetilde{\mathbf{x}}_{i}^{t}-\widetilde{\bar{\mathbf{x}}}^{t}\|+\mathbb{E}\|\frac{1}{m}\sum_{i=1}^{m}(\mathbf{x}_{i,K}^{t-1}-\widetilde{\mathbf{x}}_{i,K}^{t-1})\|
1mi=1m𝔼𝐱it𝐱¯t+1mi=1m𝔼𝐱~it𝐱¯~t+1mi=1m𝔼𝐱i,Kt1𝐱~i,Kt1\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|+\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\widetilde{\mathbf{x}}_{i}^{t}-\widetilde{\bar{\mathbf{x}}}^{t}\|+\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i,K}^{t-1}-\widetilde{\mathbf{x}}_{i,K}^{t-1}\|
1mi=1m𝔼𝐱it𝐱¯t+1mi=1m𝔼𝐱~it𝐱¯~t+δKt1\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|+\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\widetilde{\mathbf{x}}_{i}^{t}-\widetilde{\bar{\mathbf{x}}}^{t}\|+\delta_{K}^{t-1}
2Dt+δKt1\displaystyle\leq 2D_{t}+\delta_{K}^{t-1}

Where we used Lemma 9 and Lemma 10 to establish the final inequality and let the learning rate be the same selection as it in the optimization of 𝒪(1t)=ct\mathcal{O}(\frac{1}{t})=\frac{c}{t}, then we get Dt=1αK(σl+LG)CλtD_{t}=\frac{1}{\alpha}K(\sigma_{l}+L_{G})\frac{C_{\lambda}}{t}. Finally we get

δ0t(1+2β)δKt1+2(1+β)Dt\delta_{0}^{t}\leq(1+2\beta)\delta_{K}^{t-1}+2(1+\beta)D_{t}

By denoting ϕ(t)=k=0K1(1+ηktL)\phi(t)=\prod_{k=0}^{K-1}(1+\eta_{k}^{t}L)be the combination of learning rate ηkt\eta_{k}^{t}, we can provide an upper bound of the recursive formulation as:

δKt+2(LG+Sσl)SL(k=0K1(1+ηktL))(δ0t+2(LG+Sσl)SL)ϕ(t)((1+2β)δKt1+2(1+β)Dt+2(LG+Sσl)SL)\delta_{K}^{t}+\frac{2(L_{G}+S\sigma_{l})}{SL}\leq\left(\prod_{k=0}^{K-1}(1+\eta_{k}^{t}L)\right)\left(\delta_{0}^{t}+\frac{2(L_{G}+S\sigma_{l})}{SL}\right)\leq\phi(t)\left((1+2\beta)\delta_{K}^{t-1}+2(1+\beta)D_{t}+\frac{2(L_{G}+S\sigma_{l})}{SL}\right)

To balance the constant part, assuming the learning rate is decayed by communication round tt which indicates ϕ(t)ϕ(t1)\phi(t)\leq\phi(t-1) and let 1+2βϕ(t1)ϕ(t)1+2\beta\leq\frac{\phi(t-1)}{\phi(t)} be the upper bound and small learning rates ηkt\eta_{k}^{t} ensure that 4β(LG+Sσl)SL2(1+β)Dt\frac{4\beta(L_{G}+S\sigma_{l})}{SL}\geq 2(1+\beta)D_{t}, then we have the following recursive formulation:

δKt+ϕ(t)1(1+2β)ϕ(t)1A+ϕ(t)(1+2β)ϕ(t)1Ctϕ(t1)(δKt1+ϕ(t1)1(1+2β)ϕ(t1)1A+ϕ(t1)(1+2β)ϕ(t1)1Ct1)\delta_{K}^{t}+\frac{\phi(t)-1}{(1+2\beta)\phi(t)-1}A+\frac{\phi(t)}{(1+2\beta)\phi(t)-1}C_{t}\leq\phi(t-1)\left(\delta_{K}^{t-1}+\frac{\phi(t-1)-1}{(1+2\beta)\phi(t-1)-1}A+\frac{\phi(t-1)}{(1+2\beta)\phi(t-1)-1}C_{t-1}\right)

Where A=2(LG+Sσl)SL,Ct=2(1+β)Dt=2(1+β)αK(σl+LG)CλtA=\frac{2(L_{G}+S\sigma_{l})}{SL},C_{t}=2(1+\beta)D_{t}=\frac{2(1+\beta)}{\alpha}K(\sigma_{l}+L_{G})\frac{C_{\lambda}}{t}, Unrolling from t01t_{0}-1 to TT, we have:

δKT+1\displaystyle\delta_{K}^{T+1} (τ=t01Tϕ(τ))(δKt01+ϕ(t01)1(1+2β)ϕ(t01)1A+ϕ(t01)(1+2β)ϕ(t01)1Ct01)\displaystyle\leq\left(\prod_{\tau=t_{0}-1}^{T}\phi(\tau)\right)\left(\delta_{K}^{t_{0}-1}+\frac{\phi(t_{0}-1)-1}{(1+2\beta)\phi(t_{0}-1)-1}A+\frac{\phi(t_{0}-1)}{(1+2\beta)\phi(t_{0}-1)-1}C_{t_{0}-1}\right)
(ϕ(T+1)1(1+2β)ϕ(T+1)1A+ϕ(T+1)(1+2β)ϕ(T+1)1Ct01)\displaystyle\quad-\left(\frac{\phi(T+1)-1}{(1+2\beta)\phi(T+1)-1}A+\frac{\phi(T+1)}{(1+2\beta)\phi(T+1)-1}C_{t_{0}-1}\right)
(τ=t01Tϕ(τ))(ϕ(t01)1(1+2β)ϕ(t01)1A+ϕ(t01)(1+2β)ϕ(t01)1Ct01)\displaystyle\leq\left(\prod_{\tau=t_{0}-1}^{T}\phi(\tau)\right)\left(\frac{\phi(t_{0}-1)-1}{(1+2\beta)\phi(t_{0}-1)-1}A+\frac{\phi(t_{0}-1)}{(1+2\beta)\phi(t_{0}-1)-1}C_{t_{0}-1}\right)
(τ=t01Tk=0K1(1+ηkτL))(A+Ct01(1+2β))\displaystyle\leq\left(\prod_{\tau=t_{0}-1}^{T}\prod_{k=0}^{K-1}(1+\eta_{k}^{\tau}L)\right)\left(\frac{A+C_{t_{0}-1}}{(1+2\beta)}\right)
exp(τ=t01Tk=0K1ηkτL)(2(LG+Sσl)(1+2β)SL+2(1+β)K(σl+LG)α(1+2β)Cλt0)\displaystyle\leq\exp{\left(\sum_{\tau=t_{0}-1}^{T}\sum_{k=0}^{K-1}\eta_{k}^{\tau}L\right)}\left(\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}\frac{C_{\lambda}}{t_{0}}\right)

According to the preceding selection of the learning rate, 𝒪(1t)=ct\mathcal{O}(\frac{1}{t})=\frac{c}{t}, we have:

δKT\displaystyle\delta_{K}^{T} exp(τ=t01T1k=0K1ηkτL)(2(LG+Sσl)(1+2β)SL+2(1+β)K(σl+LG)α(1+2β)Cλt0)\displaystyle\leq\exp{\left(\sum_{\tau=t_{0}-1}^{T-1}\sum_{k=0}^{K-1}\eta_{k}^{\tau}L\right)}\left(\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}\frac{C_{\lambda}}{t_{0}}\right)
exp(τ=t01T1cKLτ)(2(LG+Sσl)(1+2β)SL+2(1+β)K(σl+LG)α(1+2β)Cλt0)\displaystyle\leq\exp{\left(\sum_{\tau=t_{0}-1}^{T-1}\frac{cKL}{\tau}\right)}\left(\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}\frac{C_{\lambda}}{t_{0}}\right)
exp(τ=t0TcKLτ𝑑τ)(2(LG+Sσl)(1+2β)SL+2(1+β)K(σl+LG)α(1+2β)Cλt0)\displaystyle\leq\exp\left(\int_{\tau=t_{0}}^{T}\frac{cKL}{\tau}d\tau\right)\left(\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}\frac{C_{\lambda}}{t_{0}}\right)
(Tt0)cKL(2(LG+Sσl)(1+2β)SL+2(1+β)K(σl+LG)α(1+2β)Cλt0)\displaystyle\leq\left(\frac{T}{t_{0}}\right)^{cKL}\left(\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}\frac{C_{\lambda}}{t_{0}}\right)

To summarize the above inequalities and the Lemma 6, we have:

𝔼f(𝐱¯T+1;z)f(𝐱¯~T+1;z)LGδKT+Ut0S\displaystyle\mathbb{E}\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\|\leq L_{G}\delta_{K}^{T}+\frac{Ut_{0}}{S}
LG(Tt0)cKL(2(LG+Sσl)(1+2β)SL+2(1+β)K(σl+LG)α(1+2β)Cλt0)+Ut0S\displaystyle\leq L_{G}\left(\frac{T}{t_{0}}\right)^{cKL}\left(\frac{2(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}\frac{C_{\lambda}}{t_{0}}\right)+\frac{Ut_{0}}{S}

Setting t0=TcKL1+cKL((2LG(LG+Sσl)(1+2β)SL+2LG(1+β)K(σl+LG)α(1+2β)Cλ)ScKLU)11+cKLt_{0}=T^{\frac{cKL}{1+cKL}}\left((\frac{2L_{G}(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2L_{G}(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}C_{\lambda})\frac{ScKL}{U}\right)^{\frac{1}{1+cKL}}, We then have

𝔼f(𝐱¯T+1;z)f(𝐱¯~T+1;z)2TcKL1+cKL(2LG(LG+Sσl)(1+2β)SL+2LG(1+β)K(σl+LG)α(1+2β)Cλ)11+cKL(US)cKL1+cKL\displaystyle\mathbb{E}\|f({\bar{\mathbf{x}}}^{T+1};z)-f(\widetilde{\bar{\mathbf{x}}}^{T+1};z)\|\leq 2T^{\frac{cKL}{1+cKL}}\left(\frac{2L_{G}(L_{G}+S\sigma_{l})}{(1+2\beta)SL}+\frac{2L_{G}(1+\beta)K(\sigma_{l}+L_{G})}{\alpha(1+2\beta)}C_{\lambda}\right)^{\frac{1}{1+cKL}}\left(\frac{U}{S}\right)^{\frac{cKL}{1+cKL}}