
Continuous-Time Analysis of Federated Averaging

Tom Overman    Diego Klabjan
Abstract

Federated averaging (FedAvg) is a popular algorithm for horizontal federated learning (FL), where samples are gathered across different clients and are not shared with each other or a central server. Extensive convergence analysis of FedAvg exists for the discrete iteration setting, guaranteeing convergence for a range of loss functions and varying levels of data heterogeneity. We extend this analysis to the continuous-time setting where the global weights evolve according to a multivariate stochastic differential equation (SDE), which is the first time FedAvg has been studied from the continuous-time perspective. We use techniques from stochastic processes to establish convergence guarantees under different loss functions, some of which are more general than existing work in the discrete setting. We also provide conditions for which FedAvg updates to the server weights can be approximated as normal random variables. Finally, we use the continuous-time formulation to reveal generalization properties of FedAvg.


1 Introduction

Federated learning (FL) is a popular privacy-preserving machine learning framework allowing a server to train a central model without accessing data locked on clients. In FL, each client holds a subset of the overall data and is not willing to share this data with the server; however, the clients can send model weights to the server. In horizontal FL studied herein, each client holds a subset of samples but each client has access to all features. A popular algorithm for horizontal FL is FedAvg (McMahan et al., 2017), which is the focus in this work.

FedAvg works by averaging the model weights of all clients periodically. Specifically, each client trains a local model for a specific number of iterations using only the data on the client. Typically, this local training is done using batch stochastic gradient descent. Then, the clients send their local model weights to the server, where the server averages the model weights and sends them back to the clients. Understanding the conditions for FedAvg to converge and what factors influence convergence and generalization is of great interest to federated learning practitioners and has been the focus of significant work (Li et al., 2019; Karimireddy et al., 2020).

A major challenge in FL is dealing with non-IID data across clients (heterogeneous setting). The local updates on each client significantly diverge due to the different data distributions they are trained on, resulting in slower convergence. Many improvements to FedAvg exist, such as SCAFFOLD (Karimireddy et al., 2020), which reduces the impact of the data drift during local training and speeds up training in heterogeneous settings, and RADFed (Xue et al., 2022), which handles non-IID data by delaying aggregation with specialized redistribution rounds. Despite the abundance of new algorithms being introduced to tackle new challenges in FL, FedAvg remains an important algorithm that is the foundation of most new approaches.

The continuous-time perspective of stochastic gradient descent (SGD) has provided a framework for developing more compact proofs of convergence than the discrete process (Orvieto & Lucchi, 2019) and has allowed the study of generalization properties. Specifically, it has been shown through the continuous-time representation of SGD how the learning rate and batch size influence the width of local minima converged to during SGD (Jastrzebski et al., 2018). This local minima width impacts the model’s generalization ability to unseen data (Keskar et al., 2017). The continuous-time representation also allows for studying first exit times of SGD from local minima and how this depends on the width of the local minima (Nguyen et al., 2019). Our work is focused on developing a continuous-time representation of FedAvg. It is our hope that this formulation provides a framework for new interesting analyses, just as the continuous-time representation did for SGD.

We focus on a theoretical analysis of FedAvg. We formulate a continuous-time representation of FedAvg in the form of a stochastic differential equation (SDE) and use this formulation to prove convergence properties. The convergence proofs are relatively compact and the proof framework may be extended to other FL algorithms. We show convergence of FedAvg to a stationary point for general, non-convex loss functions and demonstrate that decaying only the server learning rate is likely not sufficient; the client-side learning rate must decay at appropriate rates. We show convergence of FedAvg to the global minimum for weakly quasi-convex loss functions. To the best of our knowledge, weak quasi-convexity has not been studied for FedAvg up to this point. Next, we show that the server weight updates in FedAvg can be approximated as normal random variables under certain assumptions, even for heterogeneous data with extensive local updates before averaging. This is a surprising result and can assist in further analyses of FL algorithms. Finally, we use our continuous-time approach with a quadratic single variate form of each client's loss landscape to determine how different FedAvg hyperparameters affect the trade-off between minimizing expected loss and the ability to escape poorly-generalizing local minima.

Our contributions are as follows.

  1.

    We are the first to formulate an SDE that models the evolution of server weights in continuous time during FedAvg. Existing work (Zhang et al., 2023) for modeling distributed learning algorithms uses ordinary differential equations without stochasticity.

  2.

    Using the continuous-time formulation, we devise convergence proofs of FedAvg in deterministic and stochastic cases. We show convergence under a certain normality assumption to a stationary point for non-convex loss functions and convergence to a global minimum for weakly quasi-convex functions. To the best of our knowledge, no other works in either deterministic or stochastic regimes have shown global convergence of FedAvg for weakly quasi-convex loss functions, which are more general than convex functions and allow for locally concave regions. The new proof framework we provide allows for relatively compact proofs of FedAvg and can inspire compact proofs of other FL algorithms.

  3.

    We show that the server weight updates converge in distribution to a normal distribution as the number of clients grows, even in non-IID data settings with many local client iterations between server averaging steps. This justifies the normality assumption in the convergence results. We demonstrate this through the Lyapunov central limit theorem.

  4.

    Using a quadratic single variate loss function for each client’s loss landscape, we uncover dynamics of how various hyperparameters in FedAvg affect the trade-off between generalization and the optimality gap of the expected loss in the continuous-time setting.

In Section 2, we discuss related work on the analysis of FedAvg in the discrete case and continuous-time analysis of SGD. In Section 3, we formulate the SDE that models the evolution of server weights during FedAvg. In Section 4, we provide convergence guarantees of FedAvg for non-convex and weakly quasi-convex loss functions using the SDE formulation. In Section 5, we provide conditions for which the server weight updates can be approximated as normally distributed; this is a key assumption for the SDE formulation to be an Itô process. In Section 6, we analyze the case where each client’s local loss function is a quadratic; this allows us to examine how various parameters, such as the client learning rate and the number of local iterations, affect the trade-off between minimizing the loss function and generalization.

2 Related Work

Extensive work has been done in analyzing stochastic gradient descent (SGD) from the continuous-time perspective. Although theory of the discrete process of stochastic gradient descent has existed for a long time, continuous-time perspectives have added value in understanding the behavior of SGD, particularly in the area of generalization (Mandt et al., 2015). Using the continuous-time perspective, it has been shown that learning rate magnitude and batch size affect the sharpness of the local minima that SGD converges to (Jastrzebski et al., 2018), and the sharpness of local minima has been linked to generalization ability (Keskar et al., 2017). Furthermore, there has been work in developing concise convergence proofs for the continuous SDE form of SGD (Orvieto & Lucchi, 2019). While the previously mentioned works form a Brownian motion representation with normally-distributed noise, there is also work on forming Lévy processes where the noise follows the more general $\alpha$-stable distribution that can have heavy tails (Nguyen et al., 2019). This Lévy process is used to describe first-exit times of SGD under heavy-tailed noise and demonstrates why SGD prefers converging to wide local minima which generalize better than narrow minima. All of these works take the actual discrete stochastic gradient descent process and form the continuous-time SDE that models the evolution of the weights over time.

Furthermore, the discrete process of FedAvg is well-studied. Convergence has been proven for the convex case, even when the data distribution across different clients is non-IID (Li et al., 2019). However, the convergence rate is slower when the data is highly non-IID and the number of local iterations before averaging is large, due to "client drift" (Karimireddy et al., 2020).

Analyzing federated learning from the continuous-time perspective is less studied. There has been work on analyzing the continuous-time system of a broad class of distributed optimization algorithms using a control-theory perspective; however, this work assumes full gradients and thus has no stochastic component (Zhang et al., 2023). Furthermore, their framework does not work for the FedAvg algorithm specifically because FedAvg cannot be mapped to the double-feedback system they developed. Since most implementations of FedAvg use stochastic gradients on the clients, we focus on developing and analyzing an SDE that models the evolution of the global weights while assuming that the client updates use noisy, stochastic gradients. Recent work (Mehrjou, 2021) analyzes the continuous-time limit of local client updates to make connections to game theory, but does not form a continuous-time representation of the server weights and does not study convergence of the global weights. Existing work (Glasgow et al., 2022) uses the continuous-time limit of each client's local SGD iterates to uncover iterate bias, but the authors do not form the continuous-time SDE for the evolution of weights on the server, and thus their approach is very different from ours.

3 Continuous-Time Formulation

We are given $N$ samples, with the loss of sample $i$ being $F_i(w)$, where $w\in\mathbb{R}^d$ are the model weights. Furthermore, we assume the samples are gathered across $Q$ clients in the horizontal federated learning fashion. We refer to each local client component of the overall objective function as $F^k(w)=\frac{1}{N_k}\sum_{i\in I_k}F_i(w)$, where $I_k$ is the set of samples that belong to client $k$ and $N_k=|I_k|$. We state the loss function as $F(w)=\sum_{k=1}^{Q}p_kF^k(w)$, where $p_k$ is the weight of client $k$, usually set to $p_k=\frac{N_k}{N}$. We denote the copy of model weights on client $k$ as $w^k$. The goal is to solve $\min_w F(w)$.

Client weights are updated using standard SGD, except for iterations where averaging occurs. We can write the evolution of client weights as

w^k_T=w^k_{T-1}-\eta_{k,T-1}\frac{1}{S}\sum_{i\in S_{k,T}}\nabla F_i(w^k_{T-1}),\quad T\neq T_0+E,

and

w^k_T=\sum_{\hat{k}=1}^{Q}p_{\hat{k}}\bigg(w_{T-E}^{\hat{k}}+\hat{\eta}_{0,T-1}\Big(w^{\hat{k}}_{T-1}-\eta_{\hat{k},T-1}\frac{1}{S}\sum_{i\in S_{\hat{k},T}}\nabla F_i(w^{\hat{k}}_{T-1})-w_{T-E}^{\hat{k}}\Big)\bigg),\quad T=T_0+E,

where $S$ is the batch size, $S_{\hat{k},T}$ is a random batch drawn from the available samples in $I_{\hat{k}}$ at iteration $T$, $T_0$ is the most recent iteration where averaging occurred, $\eta_{\hat{k},T-1}$ is the learning rate on client $\hat{k}$ and may vary over iterations, $\hat{\eta}_{0,T-1}$ is the global server learning rate, which may also vary over iterations, and $E$ is the number of local SGD updates before averaging. When $T=T_0+E$ and averaging occurs, the most recent averaging iteration is incremented as $T_0\leftarrow T_0+E$. It is important to note that the averaging step does not imply that clients have access to the updates of other clients; this average is actually computed on the server and sent back to each client, and on this time step every client holds the same-valued weights. A common choice of server learning rate is $\hat{\eta}_{0,T-1}=1$, which simplifies the update schedule to

w^k_T=w^k_{T-1}-\eta_{k,T-1}\frac{1}{S}\sum_{i\in S_{k,T}}\nabla F_i(w^k_{T-1}),\quad T\neq T_0+E,

and

w^k_T=\sum_{\hat{k}=1}^{Q}p_{\hat{k}}\bigg(w^{\hat{k}}_{T-1}-\eta_{\hat{k},T-1}\frac{1}{S}\sum_{i\in S_{\hat{k},T}}\nabla F_i(w^{\hat{k}}_{T-1})\bigg),\quad T=T_0+E.

At iteration $T=T_0+E$, each client performs one more step of SGD, then sends updated weights to the server, where the server averages the updates along with an optional server-side learning rate $\hat{\eta}_{0,T-1}$ that controls how much to update the server weights. While the mathematical expression for the case when $T=T_0+E$ shows gradient passing, this is not the case in an actual implementation, where it is equivalent to sending the updated weights after each client's local updates. The server then sends this average back to every client. We study the convergence of global server weights in this work, which we denote as $w_{0,T}$ for iteration $T$. Updates to the server weights only occur every $E$ iterations during the averaging step; otherwise they remain the same as at the most recent averaging iteration.
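To make the update schedule concrete, the following is a minimal NumPy sketch of one FedAvg round as described above: $E$ local SGD steps per client followed by weighted averaging with an optional server learning rate. The per-sample gradient function grad_fi and the data layout are hypothetical placeholders, and the sketch is illustrative rather than an implementation used for any experiments.

import numpy as np

def fedavg_round(w_server, client_data, grad_fi, eta_client, eta_server=1.0,
                 E=5, S=32, rng=np.random.default_rng(0)):
    """One FedAvg round: E local SGD steps per client, then weighted averaging.

    client_data: list of arrays of samples, one per client
    grad_fi: function (w, sample) -> gradient of the per-sample loss F_i at w
    """
    N = sum(len(d) for d in client_data)
    new_weights, p = [], []
    for data in client_data:
        w = w_server.copy()                      # every client starts from the server weights
        for _ in range(E):                       # E local SGD iterations
            idx = rng.choice(len(data), size=min(S, len(data)), replace=False)
            g = np.mean([grad_fi(w, x) for x in data[idx]], axis=0)
            w = w - eta_client * g
        new_weights.append(w)
        p.append(len(data) / N)                  # p_k = N_k / N
    avg = sum(pk * wk for pk, wk in zip(p, new_weights))
    # server step with learning rate eta_server (eta_server = 1 recovers plain averaging)
    return w_server + eta_server * (avg - w_server)

Setting eta_server to 1 recovers the plain averaging variant discussed above.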

As shown in (Jastrzebski et al., 2018), the stochastic gradients, $\mathcal{G}_T^k$, in the local client updates can be assumed to be normally distributed as follows

\mathcal{G}_T^k=\frac{1}{S}\sum_{i\in S_{k,T+1}}\nabla F_i(w^k_T)\sim\mathcal{N}\big(\nabla F^k(w_T^k),\Sigma_k(w_T^k)\big)

where $\Sigma_k(w_T^k)=\big(\frac{1}{S}-\frac{1}{N_k}\big)\frac{1}{N_k-1}\sum_{i=1}^{N_k}\big(\nabla F_i(w_T^k)-\nabla F^k(w_T^k)\big)\big(\nabla F_i(w_T^k)-\nabla F^k(w_T^k)\big)^T$. We can think of this as a normal random variable centered around the full gradient of the local loss function.
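For intuition, $\Sigma_k(w)$ can be estimated directly from per-sample gradients on a client. The sketch below, again with a hypothetical grad_fi, computes the empirical version of the formula above.

import numpy as np

def gradient_noise_covariance(w, client_samples, grad_fi, S):
    """Empirical Sigma_k(w) = (1/S - 1/N_k) * 1/(N_k - 1) * sum_i (g_i - g_bar)(g_i - g_bar)^T."""
    G = np.stack([grad_fi(w, x) for x in client_samples])   # N_k x d matrix of per-sample gradients
    g_bar = G.mean(axis=0)                                   # full local gradient of F^k at w
    centered = G - g_bar
    N_k = len(client_samples)
    return (1.0 / S - 1.0 / N_k) * (centered.T @ centered) / (N_k - 1)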

Therefore, we can write our discrete updates in the time region where no averaging occurs as

w^k_{T+1}=w^k_T-\eta_{k,T}\mathcal{G}_T^k
=w^k_T-\eta_{k,T}\mathcal{N}\big(\nabla F^k(w_T^k),\Sigma_k(w_T^k)\big)
=w^k_T-\eta_{k,T}\nabla F^k(w_T^k)+\eta_{k,T}\mathcal{N}\big(0,\Sigma_k(w_T^k)\big).

Expanding this for $E$ time steps, we get

w_{T+E}^k=w_T^k+\sum_{i=0}^{E-1}\eta_{k,T+i}\big(N_{T+i}^k-G_{T+i}^k\big)

where $N^k_{T+i}\sim\mathcal{N}\big(0,\Sigma_k(w_{T+i}^k)\big)$ and $G_{T+i}^k=\nabla F^k(w_{T+i}^k)$. Notice that the expressions for $G^k_{T+i}$ and $N^k_{T+i}$ depend on $w^k_{T+i}$ and not $w^k_T$. This results in the full expression being a very complicated recurrence relation.

We can re-index as

w_T^k=w_{T-E}^k+\sum_{i=0}^{E-1}\eta_{k,T-E+i}\big(N^k_{T-E+i}-G^k_{T-E+i}\big).

We assume that at iteration $T$ (and thus also $T-nE$ for all integers $n$ such that $T-nE\geq 0$), the server aggregates across clients, and we write the aggregated server weights at this time as $w_{0,T}$. Using the averaging technique specified in the FedAvg update schedule, we get

w_{0,T}=w_{0,T-E}+\hat{\eta}_{0,T}\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_{k,T-E+i}\big(N^k_{T-E+i}-G^k_{T-E+i}\big)\Big].

We split the global learning rate, $\hat{\eta}_{0,T}$, into the product of a constant term, $h$, used for lifting the difference equation to continuous time and the learning rate, $\eta_{0,T}$, which depends on iteration $T$. We write this as $\hat{\eta}_{0,T}=h\eta_{0,T}$. Thus, we rewrite the difference equation as

w_{0,T}=w_{0,T-E}+h\eta_{0,T}\underbrace{\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_{k,T-E+i}\big(N^k_{T-E+i}-G^k_{T-E+i}\big)\Big]}_{A_T}.
Assumption 3.1.

$A_T$ is a normally distributed random variable such that $A_T\sim\mathcal{N}(M_T,V_T)$. Both $M_T$ and $V_T$ are functions of $w_{T-E}^k$.

We further discuss Assumption 3.1 in Section 5. We can then write the evolution of the iterates of the global server weights as

w_{0,T}-w_{0,T-E}=h\eta_{0,T}M_T+h\eta_{0,T}\mathcal{N}(0,V_T)    (1)

where $M_T=\mathbb{E}[A_T]$ and $V_T=\mathbb{E}[A_TA_T^T]-M_TM_T^T$.

According to the FedAvg server update schedule, the global server weights only change every $E$ iterations as indexed by $T$; thus $w_{0,T}-w_{0,T-E}$ in (1) represents the difference in a single update of the server weights. So, we can view this as the discretization of an SDE using the Euler-Maruyama method with a step size of $h$. This discretization is more accurate when $h$ is small. We now form the SDE as

dw_0(t)=\eta_0(t)\hat{M}(w_0(t))dt+\eta_0(t)\sqrt{h}\hat{V}^{1/2}(w_0(t))dB(t)    (2)

where $B(t)$ is a standard Brownian motion, $\hat{M}(w_0(t))=\mathbb{E}[\hat{A}(w_0(t))]$, $\hat{V}^{1/2}(w_0(t))\big(\hat{V}^{1/2}(w_0(t))\big)^T=\hat{V}(w_0(t))$, $\hat{V}(w_0(t))=\mathbb{E}[\hat{A}(w_0(t))\hat{A}(w_0(t))^T]-\hat{M}(w_0(t))\hat{M}(w_0(t))^T$, and rewriting $A_T$ in continuous time gives

\hat{A}(w_0(t))=\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_k(t)\big(N^k(t,i)-G^k(t,i)\big)\Big]

where $N^k(t,i)=\mathcal{N}\big(0,\Sigma_k(w^k(t,i))\big)$, $G^k(t,i)=\nabla F^k(w^k(t,i))$, $w^k(t,i)=w^k(t,i-1)-\eta_k(t)[G^k(t,i-1)+N^k(t,i-1)]$, and $w^k(t,i=0)=w_0(t)$ for all clients $k=1,\ldots,Q$.

We show in Lemma A.4 that the drift term, $\eta_0(t)\hat{M}(w_0(t))$, in (2) is Lipschitz continuous, which is an important property for our convergence proofs.
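Since (1) is exactly an Euler-Maruyama step of (2) with step size $h$, the SDE can be simulated numerically once the drift and diffusion are available. The sketch below assumes hypothetical callables drift_M, cov_V, and eta0 returning $\hat{M}(w_0)$, $\hat{V}(w_0)$, and $\eta_0(t)$; it is a generic integrator sketch, not code from this paper.

import numpy as np

def euler_maruyama(w0, drift_M, cov_V, eta0, h, n_steps, rng=np.random.default_rng(0)):
    """Simulate dw = eta0(t) M_hat(w) dt + eta0(t) sqrt(h) V_hat^{1/2}(w) dB(t) with step size h."""
    w, t = w0.copy(), 0.0
    path = [w.copy()]
    for _ in range(n_steps):
        M = drift_M(w)                                                  # M_hat(w_0(t))
        V_half = np.linalg.cholesky(cov_V(w) + 1e-12 * np.eye(len(w)))  # V_hat^{1/2}(w_0(t))
        dB = np.sqrt(h) * rng.standard_normal(len(w))                   # Brownian increment over [t, t+h]
        w = w + eta0(t) * M * h + eta0(t) * np.sqrt(h) * V_half @ dB
        t += h
        path.append(w.copy())
    return np.array(path)

One Euler-Maruyama step reproduces (1): the increment is $h\eta_0 M + h\eta_0\hat{V}^{1/2}\xi$ with $\xi\sim\mathcal{N}(0,I)$.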

4 Convergence Proofs

We use the formulation of (2) in Section 3, and similar tools used in (Orvieto & Lucchi, 2019) to form convergence proofs for FedAvg in various settings. While the techniques used in convergence proofs for continuous-time SGD are similar to our approaches for FedAvg, the case for FedAvg that we show is much more complicated.

Assumption 4.1.

Each $F_i(w)$ is a $\mu$-smooth function. We require each $F_i(w)$ to be twice differentiable over its entire domain, with $||\nabla F_i(w)||_\infty\leq L$ and $||\text{diag}(H)||_\infty\leq L$ over the entire domain, where $H$ is the Hessian of $F_i(w)$.

Assumption 4.2.

The learning rates on each client are the same, may depend on $t$, and can be written as $\eta_k(t)=\eta(t)$ for all $k=1,2,\ldots,Q$.

As a consequence of Assumption 4.2, we rewrite

\hat{A}(w_0(t))=\eta(t)\underbrace{\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\big(N^k(t,i)-G^k(t,i)\big)\Big]}_{\hat{A}_1(w_0(t))}=\eta(t)\hat{A}_1(w_0(t))

and

\hat{V}(w_0(t))=\mathbb{E}\big[\hat{A}(w_0(t))\hat{A}(w_0(t))^T\big]-\hat{M}(w_0(t))\hat{M}(w_0(t))^T=\eta(t)^2\bigg(\underbrace{\mathbb{E}\big[\hat{A}_1\hat{A}_1^T\big]-\mathbb{E}[\hat{A}_1]\mathbb{E}[\hat{A}_1]^T}_{\hat{V}_1(w_0(t))}\bigg)=\eta(t)^2\hat{V}_1(w_0(t)).
Assumption 4.3.

We assume that the variances of the server updates are bounded for all $t$. More precisely, $V^*=\max_t||\hat{V}_1(w_0(t))||_S<\infty$, where $||\cdot||_S$ is the spectral norm.

Assumption 4.4.

We assume $\Sigma_k(w_T^k)=\Sigma_k$, where $\Sigma_k\in\mathbb{R}^{d\times d}$ does not vary over iterations and may be different for each client $k$.

Assumption 4.4 is similar to the assumptions made in the continuous-time SGD literature (Mandt et al., 2015).

4.1 Non-convex loss functions

We prove convergence to a stationary point for general, non-convex loss functions.

Theorem 4.5.

We assume Assumptions 3.1, 4.1, 4.2, 4.3, and 4.4 are met, and the server learning rate is $\eta_0(t)=1$. For a random time point $\tilde{t}\in[0,t]$ that follows the distribution $\frac{\eta(\tilde{t})}{\int_0^t\eta(s)ds}$, we have

\mathbb{E}_{\tilde{t},\mathcal{G}}||\nabla F(w_0(\tilde{t}))||^2\leq\frac{F(w_0(0))-F(w_0^*)}{E\varphi(t)}+\frac{1}{E\varphi(t)}\int_0^t\Big[C_1\eta(t')^2+\frac{h\eta(t')^2V^*L}{2}\Big]dt'    (3)

where $C_1=\frac{E^2L\mu\sum_{k=1}^{Q}p_k[L+\sqrt{\text{Tr}(\Sigma_k)}]}{2}$, $\varphi(t)=\int_0^t\eta(t')dt'$, $w_0^*=\text{argmin}_{w_0}F(w_0)$, and the expectation $\mathbb{E}_{\tilde{t},\mathcal{G}}$ is over the random time point $\tilde{t}$ and the stochasticity in the gradients $\mathcal{G}$.

Proof.

Proof provided in Section A.3. ∎

Theorem 4.5 provides a general expression that can be used to easily derive convergence rates for a variety of learning rates $\eta(t)$. We obtain more concrete convergence rates for specific choices of $\eta(t)$ and asymptotic rates for intervals of $\eta(t)$ in Corollary 4.6.
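As an illustration of how rates follow from (3), the right-hand side can be evaluated by numerical quadrature for any choice of $\eta(t)$. The sketch below does this for $\eta(t)=1/\sqrt{t+1}$ with placeholder values for the constants (which in reality depend on the problem instance); the printed ratio levels off, consistent with the $\mathcal{O}(\log(t)/\sqrt{t})$ rate in Corollary 4.6.

import numpy as np

# placeholder constants; in Theorem 4.5 they are determined by the problem instance
F_gap, E, C1, h, V_star, L = 1.0, 5, 1.0, 0.01, 1.0, 1.0
eta = lambda t: 1.0 / np.sqrt(t + 1.0)

def trapezoid(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def bound_rhs(t, n_grid=2_000_001):
    """Right-hand side of (3), evaluated by the trapezoidal rule."""
    s = np.linspace(0.0, t, n_grid)
    phi = trapezoid(eta(s), s)                                          # phi(t) = int_0^t eta(s) ds
    integral = trapezoid(C1 * eta(s) ** 2 + 0.5 * h * eta(s) ** 2 * V_star * L, s)
    return F_gap / (E * phi) + integral / (E * phi)

for t in (1e2, 1e4, 1e6):
    rhs = bound_rhs(t)
    print(f"t={t:9.0e}  bound={rhs:.5f}  bound*sqrt(t)/log(t)={rhs * np.sqrt(t) / np.log(t):.4f}")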

Corollary 4.6.

For a random time point $\tilde{t}\in[0,t]$ that follows the distribution $\frac{\eta(\tilde{t})}{\int_0^t\eta(s)ds}$, and with $\eta(t)=\frac{1}{t+1}$, we have

\mathbb{E}_{\tilde{t},\mathcal{G}}||\nabla F(w_0(\tilde{t}))||^2\leq\frac{F(w_0(0))-F(w_0^*)}{E}\cdot\frac{1}{\log(t+1)}+\frac{C_1+\frac{hV^*L}{2}}{E}\cdot\frac{1}{\log(t+1)},

for $\eta(t)=\frac{1}{\sqrt{t+1}}$, we have

\mathbb{E}_{\tilde{t},\mathcal{G}}||\nabla F(w_0(\tilde{t}))||^2\leq\frac{F(w_0(0))-F(w_0^*)}{E(2\sqrt{t+1}-2)}+\frac{[C_1+\frac{hV^*L}{2}]\log(t+1)}{E(2\sqrt{t+1}-2)},

and with $\eta(t)=1/(t+1)^b$, we have

\mathbb{E}_{\tilde{t},\mathcal{G}}||\nabla F(w_0(\tilde{t}))||^2\leq\begin{cases}\mathcal{O}(\frac{1}{t^b})&0<b<\frac{1}{2}\\ \mathcal{O}(\frac{\log(t)}{\sqrt{t}})&b=\frac{1}{2}\\ \mathcal{O}(\frac{1}{t^{1-b}})&\frac{1}{2}<b<1\\ \mathcal{O}(\frac{1}{\log(t)})&b=1.\end{cases}
Proof.

Proof provided in Section A.4. ∎

Next, we show in Corollary 4.7 that convergence to a stationary point requires a decreasing learning rate on the clients, $\eta(t)$. Interestingly, a decreasing global server learning rate, $\eta_0(t)$, by itself is likely not sufficient for convergence. This is because the bound on the expected client drift from Lemma A.1 in the appendix depends on the client-side learning rate $\eta(t)$.

Corollary 4.7.

For a random time point $\tilde{t}\in[0,t]$ that follows the distribution $\frac{\eta(\tilde{t})}{\int_0^t\eta(s)ds}$, with a constant client learning rate $\eta(t)=\eta_c$ and a decreasing global server-side learning rate $\eta_0(t)=1/(t+1)$, we have

\mathbb{E}_{\tilde{t},\mathcal{G}}||\nabla F(w_0(\tilde{t}))||^2\leq\frac{F(w_0(0))-F(w_0^*)+\eta_c^2hV^*L/2}{E\eta_c\log(t+1)}+\eta_cC_1.
Proof.

Proof provided in Section A.5. ∎

Corollary 4.7 shows that despite a decreasing server-side learning rate, FedAvg is only guaranteed to converge to a neighborhood of a stationary point that depends on the size of the constant client-side learning rate η(t)=ηc\eta(t)=\eta_{c}. The algorithm might still converge to a stationary point since we provide an upper bound.

4.2 Weakly quasi-convex loss functions

We prove convergence to the global minimum for weakly quasi-convex loss functions as defined in Assumption 4.8.

Assumption 4.8.

Function $F_i$ is weakly quasi-convex if for some $w^*$ and $\tau>0$

\langle\nabla F_i(w),w-w^*\rangle\geq\tau\big(F_i(w)-F_i(w^*)\big)

holds for every $w$.

This class of functions is more general than the convex functions studied in previous discrete analyses of FedAvg. In fact, weakly quasi-convex functions can have locally concave regions, and it has been shown that some non-convex LSTMs induce weakly quasi-convex loss functions (Hardt et al., 2018). To the best of our knowledge, this is the first work to provide convergence results of FedAvg for weakly quasi-convex loss functions.
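As a concrete (hypothetical) illustration, the inequality in Assumption 4.8 can be checked numerically on a grid. The single-variate function $F(w)=w^2+3\sin^2(w)$ below has locally concave regions yet satisfies weak quasi-convexity with $\tau\approx 0.5$ around $w^*=0$; this example is ours, not one taken from the cited works.

import numpy as np

# Hypothetical 1-D example: F(w) = w^2 + 3 sin^2(w), global minimizer w* = 0.
# We estimate the largest tau with <F'(w), w - w*> >= tau (F(w) - F(w*)) on a grid.
F = lambda w: w**2 + 3.0 * np.sin(w)**2
dF = lambda w: 2.0 * w + 3.0 * np.sin(2.0 * w)

w_star = 0.0
grid = np.linspace(-10, 10, 200001)
grid = grid[np.abs(grid) > 1e-6]                     # avoid 0/0 at the minimizer
ratios = dF(grid) * (grid - w_star) / (F(grid) - F(w_star))
print("estimated tau:", ratios.min())                # roughly 0.5 for this example
print("locally concave somewhere:", np.any(2.0 + 6.0 * np.cos(2.0 * grid) < 0))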

Theorem 4.9.

We assume Assumptions 3.1, 4.1, 4.2, 4.3, 4.4, and 4.8 are met, and the server learning rate is a constant $\eta_0(t)=\eta_0$. For a random time point $\tilde{t}\in[0,t]$ that follows the distribution $\frac{\eta(\tilde{t})}{\int_0^t\eta(s)ds}$, we have

\mathbb{E}_{\tilde{t},\mathcal{G}}\big[F(w_0(\tilde{t}))-F(w_0^*)\big]\leq\frac{||w_0(0)-w_0^*||}{\tau\varphi(t)}+\frac{C_2}{\tau\varphi(t)}\int_0^t\bigg[\eta(s)^2\bigg(LE\int_0^s\eta(t')dt'+\sqrt{h}V^*\sqrt{\int_0^s\eta(t')^2dt'}\bigg)\bigg]ds+\frac{C_3}{\tau\varphi(t)}\int_0^t\eta(s)^2ds

where $C_2=\mu E^2\sum_{k=1}^{Q}p_k[L+\sqrt{\text{Tr}(\Sigma_k)}]$, $C_3=\big[\frac{dh\eta_0^2V^*}{2}+\eta_0C_2||w_0(0)-w_0^*||\big]$, and $w_0^*$ is the global minimum point.

Proof.

Proof provided in Section A.6. ∎

Theorem 4.9 provides a general expression that can be used to easily derive convergence rates for a variety of learning rates $\eta(t)$. We obtain a more concrete convergence rate for a specific choice of $\eta(t)$ in Corollary 4.10.

Corollary 4.10.

If we choose a random time point $\tilde{t}\in[0,t]$ that follows the distribution $\frac{\eta(\tilde{t})}{\int_0^t\eta(s)ds}$, the server learning rate is a constant $\eta_0(t)=\eta_0$, and $\eta(t)=\frac{1}{t+1}$, we have

\mathbb{E}_{\tilde{t},\mathcal{G}}\big[F(w_0(\tilde{t}))-F(w_0^*)\big]\leq\frac{\eta_0^2C_2LE}{\tau\eta_0}\cdot\frac{t-\log(t+1)}{t\log(t+1)}+\frac{||w_0(0)-w_0^*||+C_3+\eta_0^2C_2\sqrt{h}V^*}{\tau\eta_0\log(t+1)}=\mathcal{O}\bigg(\frac{1}{\log(t)}\bigg).
Proof.

Proof provided in Section A.7. ∎

5 On the Assumption of Normally Distributed Server Updates

In order for the SDE specified in (2) to be an Itô process, we require $A_T$ to be normally distributed. We show in this section that this is a reasonable assumption in many cases, such as in the regime of a very large number of clients.

Large IID Client Setting Assuming data is independent and identically distributed across clients, the weights $p_k$ should be approximately equal, and assuming bounded variance (Assumption 4.3), the conditions of the traditional central limit theorem are met. Thus, as the number of clients $Q$ goes to $\infty$, we have $A_T\xrightarrow{\mathcal{D}}\mathcal{N}(M_T,V_T)$.

Large non-IID Client Setting We now show that we can approximate $A_T$ as a normal random variable under certain conditions even if the data is not identically distributed across clients. The proof hinges on the Lyapunov central limit theorem, which generalizes the traditional central limit theorem to cases where the random variables are not identically distributed (Billingsley, 2012).

Assumption 5.1.

The covariance matrix $\Sigma_k(w_t^k)$ is a diagonal matrix for each client $k=1,\ldots,Q$.

Assumption 5.2.

We require

0<C\leq\Big(p_k^2\eta_{k,t}^2\mathbb{E}\big[(N_{k,i}+R_{k,i})^2\big]\Big)^2-\frac{1}{2Q}\sum_{j=1}^{Q}\Big(p_k^2\eta_{k,t}^2\mathbb{E}\big[(N_{k,i}+R_{k,i})^2\big]-p_j^2\eta_{j,t}^2\mathbb{E}\big[(N_{j,i}+R_{j,i})^2\big]\Big)^2

to hold for all clients $k=1,\ldots,Q$ and coordinates $i=1,\ldots,d$, where $\eta_{k,t}$ is the client-side learning rate on client $k$. Values $N_{k,i}$ and $R_{k,i}$ are the $i$-th coordinates of the vectors $N_k$ and $R_k$, respectively, where $N_k=\sum_{i=0}^{E-1}N^k_{t-E+i}$ and $R_k=\sum_{i=0}^{E-1}\big(\mathbb{E}[G^k_{t-E+i}]-G^k_{t-E+i}\big)$.

Assumption 5.2 requires that, for every client $k$, the square of the second moment of $N_k+R_k$ be greater than the squared differences between the second moments of $N_k+R_k$ and $N_j+R_j$, averaged over all clients $j$. This essentially requires the second moments to not be too different across clients.

Assumption 5.3.

The fourth-order mixed moments of $N_k$ and $R_k$ are bounded for all clients $k=1,\ldots,Q$. More formally, we require $\big|\mathbb{E}[N_{k,i}^uR_{k,i}^v]\big|\leq D$ for all clients $k=1,\ldots,Q$, coordinates $i=1,\ldots,d$, and $0\leq u,v\leq 4$ with $u+v=4$.

We note that Assumption 5.3 is guaranteed to hold if $N_k$ and $R_k$ have bounded support, which is usually the case in practice because gradients are typically clipped.

Theorem 5.4.

With Assumptions 5.1, 5.2, and 5.3, as the number of clients $Q$ goes to $\infty$, we have $A_T\xrightarrow{\mathcal{D}}\mathcal{N}(M_T,V_T)$.

Proof.

The proof uses the Lyapunov Central Limit Theorem, which is more flexible than the standard Central Limit Theorem and allows for non-identically distributed random variables as long as the Lyapunov condition is met. The full proof is provided in Section A.8. ∎
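A quick way to build intuition for Theorem 5.4 is to simulate a coordinate of $A_T$ with deliberately skewed, non-identically distributed per-client terms (hypothetical stand-ins for $N_k+R_k$) and watch the skewness and excess kurtosis shrink toward zero as $Q$ grows. The sketch below does exactly that; it is an illustration, not part of the proof.

import numpy as np

rng = np.random.default_rng(0)
E, eta = 5, 0.1

def simulate_A(Q, n_trials=20000):
    """One coordinate of A_T = sum_k p_k * sum_{i<E} eta * (per-step client noise) for Q clients.

    The per-client terms are skewed and non-identically distributed, so each client's
    contribution is far from Gaussian, yet the weighted sum becomes approximately
    normal as Q grows, as in the Lyapunov central limit theorem.
    """
    scale = rng.uniform(0.5, 2.0, size=Q)            # per-client heterogeneity (hypothetical)
    p = np.full(Q, 1.0 / Q)
    steps = scale * (rng.exponential(1.0, size=(n_trials, E, Q)) - 1.0)  # centered, skewed noise
    client_sums = eta * steps.sum(axis=1)            # shape (n_trials, Q)
    return client_sums @ p

for Q in (2, 10, 100, 1000):
    A = simulate_A(Q)
    z = (A - A.mean()) / A.std()
    print(f"Q={Q:5d}  skew={np.mean(z**3):+.3f}  excess kurtosis={np.mean(z**4) - 3:+.3f}")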

Section 6 further demonstrates the normality assumption being met when each client’s loss landscape is a quadratic form.

6 Analysis in the Quadratic Case

Quadratic Client Loss Landscape We can show that the server updates in FedAvg are normally distributed if we assume the loss landscape of each client follows a different quadratic form.

Assumption 6.1.

Each client's loss landscape is $F^k(w)=\frac{1}{2}(w-a_k)^TU_k(w-a_k)$ for some $U_k\in\mathbb{R}^{d\times d}$ and $a_k\in\mathbb{R}^d$.

With Assumption 6.1, the loss landscape on the server is $F(w)=\frac{1}{2}\sum_{k=1}^{Q}p_k(w-a_k)^TU_k(w-a_k)$. This global loss function can be rewritten, up to an additive constant, as

F(w)=\frac{1}{2}(w-a)^T\bigg(\sum_{k=1}^{Q}p_kU_k\bigg)(w-a),

where $a=\big(\sum_{k=1}^{Q}p_kU_k\big)^{-1}\sum_{k=1}^{Q}p_kU_ka_k$, which is clearly another quadratic form.
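The aggregated minimizer $a$ can be computed directly from the client quadratics. The sketch below generates hypothetical $U_k$ and $a_k$, forms $a=(\sum_k p_kU_k)^{-1}\sum_k p_kU_ka_k$, and checks that the gradient of the aggregated loss vanishes there.

import numpy as np

rng = np.random.default_rng(0)
Q, d = 4, 3
# hypothetical client quadratics F^k(w) = 0.5 (w - a_k)^T U_k (w - a_k)
U = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(Q)]   # symmetric positive definite
a = [rng.normal(size=d) for _ in range(Q)]
p = np.full(Q, 1.0 / Q)

U_bar = sum(pk * Uk for pk, Uk in zip(p, U))                # sum_k p_k U_k
a_star = np.linalg.solve(U_bar, sum(pk * Uk @ ak for pk, Uk, ak in zip(p, U, a)))

# gradient of the aggregated loss, sum_k p_k U_k (w - a_k), vanishes at a_star
grad = sum(pk * Uk @ (a_star - ak) for pk, Uk, ak in zip(p, U, a))
print("||grad F(a)|| =", np.linalg.norm(grad))               # ~ 0 up to round-off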

Theorem 6.2.

With Assumptions 4.4 and 6.1, a server learning rate of $\eta_0(t)=1$, and a constant client-side learning rate of $\eta_k(t)=\eta$, the local weight vector on client $k$ after $E$ local updates is distributed as

w_{t+E}^k\sim w_{0,t}-\eta\sum_{j=0}^{E-1}(I-\eta U_k)^jU_k(w_{0,t}-a_k)-\eta\sum_{j=0}^{E-1}\sum_{i=1}^{j}(I-\eta U_k)^{j-i}U_k\mathcal{N}(0,\Sigma_k)+E\eta\mathcal{N}(0,\Sigma_k),

where $w_{0,t}$ is the shared server weight vector from the most recent averaging step.

Proof.

Proof provided in Section A.9. ∎

Theorem 6.2 shows that no matter how many local updates are performed on a client, the resulting final update to the weights is normally distributed. Therefore, after averaging, the updates to the global weights are also normally distributed.
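As a sanity check on Theorem 6.2, the deterministic part of the distribution can be compared against simulated local SGD on a single quadratic client. The sketch below uses hypothetical $U_k$, $a_k$, and $\Sigma_k$ and verifies that the empirical mean of $w^k_{t+E}$ matches $w_{0,t}-\eta\sum_{j=0}^{E-1}(I-\eta U_k)^jU_k(w_{0,t}-a_k)$.

import numpy as np

rng = np.random.default_rng(0)
d, E, eta = 2, 6, 0.05
Uk = np.diag([1.0, 3.0])                  # hypothetical quadratic curvature
ak = np.array([2.0, -1.0])                # hypothetical client minimizer
Sigma_k = 0.1 * np.eye(d)                 # constant gradient-noise covariance (Assumption 4.4)
w0 = np.zeros(d)                          # server weights at the last averaging step

# closed-form mean of w^k_{t+E} from Theorem 6.2
mean_pred = w0 - eta * sum(np.linalg.matrix_power(np.eye(d) - eta * Uk, j) @ (Uk @ (w0 - ak))
                           for j in range(E))

# Monte Carlo: E noisy local SGD steps; the gradient of F^k is U_k (w - a_k)
n_trials = 200000
w = np.tile(w0, (n_trials, 1))
L_chol = np.linalg.cholesky(Sigma_k)
for _ in range(E):
    noise = rng.standard_normal((n_trials, d)) @ L_chol.T
    w = w - eta * ((w - ak) @ Uk.T + noise)
print("predicted mean :", mean_pred)
print("empirical mean :", w.mean(axis=0))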

Generalization Analysis of Single Variate Quadratic Loss Landscape

Using the evolution of local weights for quadratic loss functions provided in Theorem 6.2, assuming the single variate case, and using the same procedure for building the SDE as in Section 3, we obtain the SDE for the global weights as

dw_0(t)=-\bigg[\sum_{k=1}^{Q}p_k\sum_{j=0}^{E-1}(1-\eta U_k)^jU_k(w_0(t)-a_k)\bigg]dt+\sqrt{\eta}\bigg[E+\sum_{k=1}^{Q}\sum_{j=0}^{E-1}\sum_{i=1}^{j}p_k\sqrt{\Sigma_k}(1-\eta U_k)^{j-i}U_k\bigg]dB(t).    (4)

The SDE in (4) is a linear SDE, which allows us to find analytical solutions as shown in Theorem 6.3.

Theorem 6.3.

The solution to the SDE specified in (4) is normally distributed as

w_0(t)\sim\mathcal{N}(m_0(t),v_0(t)),

where

m_0(t)=C_4+(w_0(0)-C_4)e^{-(\sum_{k=1}^{Q}p_k\sum_{j=0}^{E-1}(1-\eta U_k)^jU_k)t},\qquad C_4=\frac{\sum_{k=1}^{Q}p_k\sum_{j=0}^{E-1}(1-\eta U_k)^jU_ka_k}{\sum_{k=1}^{Q}p_k\sum_{j=0}^{E-1}(1-\eta U_k)^jU_k},

and

v_0(t)=\frac{-\eta\big[E+\sum_{k=1}^{Q}\sum_{j=0}^{E-1}\sum_{i=1}^{j}p_k\sqrt{\Sigma_k}(1-\eta U_k)^{j-i}U_k\big]^2}{2\sum_{k=1}^{Q}p_k\sum_{j=0}^{E-1}(1-\eta U_k)^jU_k}\cdot\bigg(\exp\Big(-\big(\sum_{k=1}^{Q}p_k\sum_{j=0}^{E-1}(1-\eta U_k)^jU_k\big)t\Big)-1\bigg).
Proof.

Proof is provided in Section A.11. ∎

We are interested in how various hyperparameters influence the expected loss and the variance of the solution distribution shown in Theorem 6.3. In the limit as $\eta\rightarrow 0$ and $t\rightarrow\infty$, the solution approaches $\mathcal{N}\big(\frac{\sum_{k=1}^{Q}p_kU_ka_k}{\sum_{k=1}^{Q}p_kU_k},0\big)$, which is the true global minimum with zero variance. This matches the behavior of SGD: as the learning rate is decreased, the expected loss approaches the true minimum, but the variance of the solution decreases and the iterates may thus be unable to escape poor local minima (Bishop, 1995; Jastrzebski et al., 2018).

As the number of local iterations before communication, $E$, grows, the mean of the solution diverges from the true solution. Furthermore, as $E$ grows, the variance of the solution increases as $\mathcal{O}(E^2)$. This causes the expected solution to have a high bias with respect to the true solution, but an increase in $E$ may allow FedAvg to escape poorly-generalizing local minima. This demonstrates the trade-off between expected loss and the ability to escape poor local minima.
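The trade-off described above can be read off numerically by evaluating $C_4$ and the $t\rightarrow\infty$ limit of $v_0(t)$ from Theorem 6.3 for different $\eta$ and $E$. The sketch below does this for hypothetical single-variate client quadratics; the bias of the stationary mean relative to the true minimizer grows with $E$ and shrinks with $\eta$, while the stationary variance grows roughly as $E^2$ and shrinks with $\eta$.

import numpy as np

# hypothetical single-variate client quadratics (scalars U_k > 0, minimizers a_k)
p = np.array([0.5, 0.5])
U = np.array([1.0, 3.0])
a = np.array([-1.0, 2.0])
Sigma = np.array([0.2, 0.2])              # constant per-client gradient-noise variance

def stationary_mean_and_var(eta, E):
    """Evaluate C_4 and the t -> infinity limit of v_0(t) as stated in Theorem 6.3."""
    A = sum(p[k] * sum((1 - eta * U[k])**j * U[k] for j in range(E)) for k in range(len(p)))
    num = sum(p[k] * sum((1 - eta * U[k])**j * U[k] * a[k] for j in range(E)) for k in range(len(p)))
    C4 = num / A
    B = E + sum(p[k] * np.sqrt(Sigma[k]) * (1 - eta * U[k])**(j - i) * U[k]
                for k in range(len(p)) for j in range(E) for i in range(1, j + 1))
    return C4, eta * B**2 / (2 * A)

true_min = (p * U * a).sum() / (p * U).sum()          # minimizer of the aggregated quadratic
for eta in (0.2, 0.05, 0.01):
    for E in (1, 5, 20):
        C4, v_inf = stationary_mean_and_var(eta, E)
        print(f"eta={eta:.2f} E={E:2d}  bias={abs(C4 - true_min):.4f}  stationary var={v_inf:.4f}")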

This reasoning demonstrates how our proposed continuous-time limit of FedAvg can be used to explore behaviors of FL algorithms such as generalization ability. We leave further exploration of these ideas to future work, such as exploring the first exit time of the continuous-time formulation of FedAvg from a local minimum, similar to the existing work in SGD (Nguyen et al., 2019).

7 Conclusion

We have developed a continuous-time SDE that models the evolution of server weights during FedAvg. Using this formulation, we prove convergence to stationary points for general non-convex loss functions and are the first to show global convergence for weakly quasi-convex loss functions. We discuss the various assumptions required for the SDE to be an Itô process and demonstrate that these assumptions are reasonable. We demonstrate that the updates to the server weights can be approximated as a normal random variable in many cases, even if data is not identically distributed across clients. Finally, we provide a generalization analysis using our continuous-time formulation and quadratic client losses. Our SDE formulation serves as a framework for future efforts in analyzing FedAvg and other FL algorithms from the continuous-time perspective.

Impact Statement

This paper is focused on mathematically analyzing the underlying behavior of federated learning from a new perspective.

References

  • Billingsley (2012) Billingsley, P. Probability and Measure. Wiley Series in Probability and Statistics. Wiley, 2012.
  • Bishop (1995) Bishop, C. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
  • Eriksson et al. (2004) Eriksson, K., Johnson, C., and Estep, D. Vector-Valued Functions of Several Real Variables, pp.  789–814. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
  • Glasgow et al. (2022) Glasgow, M. R., Yuan, H., and Ma, T. Sharp bounds for federated averaging (local SGD) and continuous perspective. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp.  9050–9090. PMLR, 28–30 Mar 2022.
  • Hardt et al. (2018) Hardt, M., Ma, T., and Recht, B. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(29):1–44, 2018.
  • Jastrzebski et al. (2018) Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three factors influencing minima in SGD, 2018. URL https://arxiv.org/abs/1711.04623.
  • Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • Karimireddy et al. (2020) Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  5132–5143. PMLR, 13–18 Jul 2020.
  • Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
  • Krichene & Bartlett (2017) Krichene, W. and Bartlett, P. L. Acceleration and averaging in stochastic descent dynamics. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Li et al. (2019) Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2019.
  • Mandt et al. (2015) Mandt, S., Hoffman, M. D., and Blei, D. M. Continuous-time limit of stochastic gradient descent revisited. Advances in Neural Information Processing Systems, 2015.
  • McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp.  1273–1282. PMLR, 20–22 Apr 2017.
  • Mehrjou (2021) Mehrjou, A. Federated learning as a mean-field game, 2021. URL https://arxiv.org/abs/2107.03770.
  • Mertikopoulos & Staudigl (2018) Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, 2018.
  • Movellan (2011) Movellan, J. R. Tutorial on stochastic differential equations. MPLab Tutorials Version, 6, 2011.
  • Nguyen et al. (2019) Nguyen, T. H., Simsekli, U., Gurbuzbalaban, M., and Richard, G. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Orvieto & Lucchi (2019) Orvieto, A. and Lucchi, A. Continuous-time models for stochastic optimization algorithms. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Xue et al. (2022) Xue, Y., Klabjan, D., and Luo, Y. Aggregation delayed federated learning. In 2022 IEEE International Conference on Big Data (Big Data), pp.  85–94, 2022.
  • Zhang et al. (2023) Zhang, X., Hong, M., and Elia, N. Understanding a class of decentralized and federated optimization algorithms: A multirate feedback control perspective. SIAM Journal on Optimization, 33(2):652–683, 2023.

Appendix A Proofs

For a nice review of the mathematical preliminaries of SDEs, see the starting sections of the Supplementary Materials in (Orvieto & Lucchi, 2019).
We introduce a bound on the expected drift between client weights and the server weights. Lemma A.1 is used in the proofs of Theorems 4.5 and 4.9. The proof follows similar ideas as (Li et al., 2019) but has several key differences due to the continuous-time setting.

Lemma A.1.

With Assumptions 4.1, 4.2, and 4.4, the expected drift between the global server weights and the weights of client $k$ is bounded as

\mathbb{E}||w_0(t)-w^k(t,i)||\leq i\eta(t)\big[L+\sqrt{\text{Tr}(\Sigma_k)}\big].

A.1 Proof of Lemma A.1

Proof.

We first examine the form of $w^k(t,i)$ as

w^k(t,i)=w_0(t)+\sum_{j=0}^{i-1}\eta(t)\big[N^k(t,j)-G^k(t,j)\big].

Now we form the bound

\mathbb{E}||w_0(t)-w^k(t,i)||=\mathbb{E}\Big|\Big|w_0(t)-w_0(t)-\sum_{j=0}^{i-1}\eta(t)\big[N^k(t,j)-G^k(t,j)\big]\Big|\Big|
=\mathbb{E}\Big|\Big|\sum_{j=0}^{i-1}\eta(t)\big[G^k(t,j)-N^k(t,j)\big]\Big|\Big|
\leq\sum_{j=0}^{i-1}\eta(t)\big[\mathbb{E}||\nabla F^k(w^k(t,j))||+\mathbb{E}||N^k(t,j)||\big]
\leq\sum_{j=0}^{i-1}\eta(t)\big[L+\sqrt{\text{Tr}(\Sigma_k)}\big]
=i\eta(t)\big[L+\sqrt{\text{Tr}(\Sigma_k)}\big].\;\;\;∎

A.2 Other Lemmas

We first include some helpful lemmas that have been proved in other works or are straightforward to show.

Lemma A.2.

For two symmetric $d\times d$ matrices $A$ and $B$, the following holds:

\text{Tr}(AB)\leq d||A||_S||B||_S.
Proof.

(Mertikopoulos & Staudigl, 2018), (Krichene & Bartlett, 2017), and (Orvieto & Lucchi, 2019) all provide proofs of this statement. ∎

Lemma A.3.

The function $f(t')=\frac{\eta(t')}{\varphi(t)}$ defined over the support $t'\in[0,t]$ is a valid probability density function.

Proof.

We define the learning rate $\eta(t)$ as strictly positive, and thus the resulting probability density function is non-negative. We also show that

\int_{-\infty}^{\infty}f(t')dt'=\int_0^t\frac{\eta(t')}{\varphi(t)}dt'=\frac{1}{\varphi(t)}\underbrace{\int_0^t\eta(t')dt'}_{\varphi(t)}=1.

Therefore, the probability density function integrated over its entire support is 1, and the necessary conditions of a probability density function are met. ∎

Lemma A.4.

The drift term, $\eta_0(t)\hat{M}(w_0(t))$, in (2) is Lipschitz continuous.

Proof.

We require Assumption 4.1. The drift term is a vector-valued function, and it suffices to prove Lipschitz continuity for each component of the function to prove Lipschitz continuity of the whole vector-valued function (Eriksson et al., 2004). Furthermore, it is well known that an everywhere differentiable function is Lipschitz continuous if and only if the absolute value of its derivative is bounded by a finite value over the entire input domain. We prove this lemma by combining these two facts and proving the bounded derivative, component-wise, for $\hat{M}(t)$.

\hat{M}(w_0(t))=\mathbb{E}\bigg[\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_k(t)\big(N^k(t,i)-G^k(t,i)\big)\Big]\bigg]=-\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_k(t)\mathbb{E}[G^k(t,i)]\Big]=-\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_k(t)\mathbb{E}[\nabla F^k(w^k(t,i))]\Big]

where $N^k(t,i)=\mathcal{N}\big(0,\Sigma_k(w^k(t,i))\big)$, $G^k(t,i)=\nabla F^k(w^k(t,i))$, $w^k(t,i)=w^k(t,i-1)-\eta_k(t)[G^k(t,i-1)+N^k(t,i-1)]$, and $w^k(t,i=0)=w_0(t)$ for all $k=1,\ldots,Q$.

We examine the expression at the component level and index the $j$-th component. We use the notation $F'(x)=\frac{d}{dx}F(x)$. We also change $F^k(\cdot)$ to $F_k(\cdot)$ for readability when using the derivative notation. We have

\hat{M}(t)_j=-\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_k(t)\mathbb{E}\Big[\frac{d}{dw_j^k(t,i)}F_k(w_j^k(t,i))\Big]\Big]=-\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_k(t)\mathbb{E}[F'_k(w_j^k(t,i))]\Big].

Now we need to show $\big|\frac{d}{dw^k_j(t,0)}\hat{M}(t)_j\big|\leq M$ for some finite $M$. Note that the derivative is with respect to $w^k_j(t,0)$, but the function arguments are $w^k_j(t,i)$, so the chain rule needs to be applied repeatedly. We have

\frac{d}{dw^k_j(t,0)}F'_k(w_j^k(t,i))=F''_k(w_j^k(t,i))\cdot\frac{dw_j^k(t,i)}{dw^k_j(t,0)}
=F''_k(w_j^k(t,i))\cdot\frac{d}{dw^k_j(t,0)}\Big(w_j^k(t,i-1)-\eta_k(t)\big[F'_k(w^k_j(t,i-1))+N_j^k(t,i-1)\big]\Big)
=F''_k(w_j^k(t,i))\cdot\Big(\frac{dw_j^k(t,i-1)}{dw^k_j(t,0)}-\eta_k(t)F''_k(w^k_j(t,i-1))\cdot\frac{dw_j^k(t,i-1)}{dw^k_j(t,0)}\Big)
=F''_k(w_j^k(t,i))\cdot\Big(\frac{dw_j^k(t,i-1)}{dw^k_j(t,0)}\big[1-\eta_k(t)F''_k(w^k_j(t,i-1))\big]\Big).

Applying the same recursion at the previous local iteration, we have

\frac{dw_j^k(t,i-1)}{dw^k_j(t,0)}=\frac{dw_j^k(t,i-2)}{dw^k_j(t,0)}\big[1-\eta_k(t)F''_k(w^k_j(t,i-2))\big].

Combining, we get

\frac{d}{dw^k_j(t,0)}F'_k(w_j^k(t,i))=F''_k(w_j^k(t,i))\cdot\bigg[\Big(\frac{dw_j^k(t,i-2)}{dw^k_j(t,0)}\big[1-\eta_k(t)F''_k(w^k_j(t,i-2))\big]\Big)\big[1-\eta_k(t)F''_k(w^k_j(t,i-1))\big]\bigg].

Since $\frac{dw_j^k(t,0)}{dw^k_j(t,0)}=1$, we can expand as

\frac{d}{dw^k_j(t,0)}F'_k(w_j^k(t,i))=F''_k(w_j^k(t,i))\cdot\bigg[\Big(\frac{dw_j^k(t,i-2)}{dw^k_j(t,0)}\big[1-\eta_k(t)F''_k(w^k_j(t,i-2))\big]\Big)\big[1-\eta_k(t)F''_k(w^k_j(t,i-1))\big]\bigg]
=F''_k(w_j^k(t,i))\cdot\bigg[\prod_{l=1}^{i-1}\big[1-\eta_k(t)F''_k(w^k_j(t,i-l))\big]\bigg].

Returning to the original expression for $\hat{M}(t)_j$, we have

\frac{d}{dw^k_j(t,0)}\hat{M}(t)_j=-\sum_{k=1}^{Q}p_k\Big[\sum_{i=0}^{E-1}\eta_k(t)F''_k(w_j^k(t,i))\cdot\Big[\prod_{l=1}^{i-1}\big[1-\eta_k(t)F''_k(w^k_j(t,i-l))\big]\Big]\Big].

We now bound the absolute value of this derivative as

\Big|\frac{d}{dw^k_j(t,0)}\hat{M}(t)_j\Big|\leq\sum_{k=1}^{Q}p_k\sum_{i=0}^{E-1}\eta_k(t)|F''_k(w_j^k(t,i))|\cdot\Big[\prod_{l=1}^{i-1}\big[1+\eta_k(t)|F''_k(w^k_j(t,i-l))|\big]\Big]
\leq\sum_{k=1}^{Q}p_k\sum_{i=0}^{E-1}\eta_k(t)L\cdot\Big[\prod_{l=1}^{i-1}\big[1+\eta_k(t)L\big]\Big]
\leq\sum_{k=1}^{Q}p_k\sum_{i=0}^{E-1}\big[1+\eta_k(t)L\big]^{i}.

This holds for all components; therefore, we can conclude that the drift term, $\eta_0(t)\hat{M}(t)$, is Lipschitz continuous. ∎

A.3 Proof of Theorem 4.5

Proof.

The proof structure is inspired by the structure of the continuous-time proofs for standard SGD, but it has several additional complexities due to the averaging of client weights to determine the server weights after a certain amount of time. The main steps are finding a suitable energy function of the stochastic process, bounding the infinitesimal diffusion generator, and using Dynkin's formula to complete the proof.

We first write out $\hat{M}(t)$ as

\hat{M}(t)=-\sum_{k=1}^{Q}p_k\Big(\sum_{i=0}^{E-1}\eta_k(t)\mathbb{E}\big[G^k(t,i)\big]\Big).

A good resource for reviewing the background mathematics used in this proof is in the Supplementary materials of (Orvieto & Lucchi, 2019).

Using a similar approach as in (Orvieto & Lucchi, 2019), we define an appropriate energy function and then use Dynkin's formula to complete the proof. We are able to use all of the machinery for Itô diffusions because our SDE is an Itô diffusion. In particular, we show in Lemma A.4 that the drift term is Lipschitz continuous, and the proof of Lemma 1 in the supplementary materials of (Orvieto & Lucchi, 2019) proves Lipschitz continuity of the volatility matrix $\eta_0(t)\sqrt{h}\hat{V}^{1/2}(t)$.

We define $\partial_x(\cdot)$ as the vector of first derivatives with respect to each component of $x$ and $\partial_{xx}(\cdot)$ as the matrix of partial derivatives of $\partial_x(\cdot)$ with respect to each component of $x$. These are just the gradient and Hessian with respect to the arguments of the energy function, but we use notation different from $\nabla$ to prevent confusion with $\nabla F(w_0)$, the gradient of the loss function. The infinitesimal generator for an Itô diffusion of the form $dX(t)=f(t)dt+\sigma(t)dB(t)$ is defined as

\mathscr{A}(\cdot)=\partial_t(\cdot)+\langle\partial_x(\cdot),f(t)\rangle+\frac{1}{2}\text{Tr}\big(\sigma(t)\sigma(t)^T\partial_{xx}(\cdot)\big).

Following (Orvieto & Lucchi, 2019) and using Itô's lemma and the definition of an Itô diffusion, we obtain Dynkin's formula as

\mathbb{E}[\mathcal{E}(X(t),t)]-\mathcal{E}(x_0,0)=\mathbb{E}\Big[\int_0^t\mathscr{A}\mathcal{E}(X(t'),t')dt'\Big].    (5)

Dynkin's formula is vital to our convergence proofs, along with defining the correct energy function $\mathcal{E}(\cdot)$.
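To make Dynkin's formula (5) concrete, here is a small Monte Carlo check on a hypothetical one-dimensional Ornstein-Uhlenbeck process $dX=-Xdt+\sigma dB$ with energy function $\mathcal{E}(x)=x^2$, for which the generator is $\mathscr{A}\mathcal{E}(x)=-2x^2+\sigma^2$; both sides of (5) are estimated from simulated paths and agree up to Monte Carlo and discretization error. This example is only an illustration and is not used in the proof.

import numpy as np

rng = np.random.default_rng(0)
x0, sigma, T, dt, n_paths = 1.0, 0.5, 1.0, 1e-3, 50000
n_steps = int(T / dt)

X = np.full(n_paths, x0)
running_integral = np.zeros(n_paths)
for _ in range(n_steps):
    running_integral += (-2.0 * X**2 + sigma**2) * dt      # generator A E(X) = -2 X^2 + sigma^2
    X = X - X * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

lhs = np.mean(X**2) - x0**2                                 # E[E(X(T))] - E(x0)
rhs = np.mean(running_integral)                             # E[ int_0^T A E(X(s)) ds ]
print(f"lhs = {lhs:.4f}   rhs = {rhs:.4f}")                 # the two sides should nearly coincide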

We choose the energy function $\mathcal{E}(w_0):\mathbb{R}^d\rightarrow\mathbb{R}_+$ as $\mathcal{E}(w_0)=F(w_0)-F(w_0^*)$. It is important to note that we define the energy function as a function of $w_0$ alone rather than of both $w_0$ and $t$; thus the first term in Dynkin's formula is not applicable. This follows the same approach as (Orvieto & Lucchi, 2019).

We then bound the expectation of the stochastic integral of the infinitesimal generator of the process $\{\mathcal{E}(w_0(t))\}_{t\geq 0}$ as

\mathscr{A}\mathcal{E}(w_0(t))=\underbrace{\langle\partial_{w_0}(\mathcal{E}(w_0(t))),\eta_0(t)\hat{M}(t)\rangle}_{B_1}+\underbrace{\frac{1}{2}\text{Tr}\big(h(\eta_0(t))^2\hat{V}(t)\partial_{w_0w_0}(\mathcal{E}(w_0(t)))\big)}_{B_2}.

We first bound $B_1$ (for simplicity we assume $\eta_k(t)=\eta(t)$ for all clients) as

B_1=\Big\langle\nabla F(w_0(t)),-\eta_0(t)\sum_{k=1}^{Q}p_k\eta(t)\Big(\sum_{i=0}^{E-1}\mathbb{E}[G^k(t,i)]\Big)\Big\rangle
=-\eta_0(t)\sum_{k=1}^{Q}p_k\eta(t)\sum_{i=0}^{E-1}\nabla F(w_0(t))^T\mathbb{E}[G^k(t,i)]
=-\eta_0(t)\sum_{i=0}^{E-1}\eta(t)\nabla F(w_0(t))^T\sum_{k=1}^{Q}p_k\mathbb{E}[G^k(t,i)].

We first define a function $R(t)$ such that $||\nabla F(w_0(t))||^2=\nabla F(w_0(t))^T\big(\sum_{k=1}^{Q}p_k\mathbb{E}[G^k(t,i)]\big)+R(t)$.

We then bound $R(t)$ as

R(t)=||\nabla F(w_0(t))||^2-\nabla F(w_0(t))^T\Big(\sum_{k=1}^{Q}p_k\mathbb{E}[G^k(t,i)]\Big)
=\nabla F(w_0(t))^T\Big(\nabla F(w_0(t))-\sum_{k=1}^{Q}p_k\mathbb{E}[G^k(t,i)]\Big)
\leq\Big|\nabla F(w_0(t))^T\Big(\nabla F(w_0(t))-\sum_{k=1}^{Q}p_k\mathbb{E}[G^k(t,i)]\Big)\Big|
\leq\underbrace{||\nabla F(w_0(t))||}_{\leq L}\cdot\Big|\Big|\nabla F(w_0(t))-\sum_{k=1}^{Q}p_k\mathbb{E}[G^k(t,i)]\Big|\Big|\;\;\;\text{(Cauchy-Schwarz)}
\leq L\Big|\Big|\nabla F(w_0(t))-\sum_{k=1}^{Q}p_k\mathbb{E}[G^k(t,i)]\Big|\Big|
=L\Big|\Big|\mathbb{E}\sum_{k=1}^{Q}p_k\big[\nabla F^k(w_0(t))-\nabla F^k(w^k(t,i))\big]\Big|\Big|
\leq L\sum_{k=1}^{Q}p_k\Big|\Big|\mathbb{E}\big[\nabla F^k(w_0(t))-\nabla F^k(w^k(t,i))\big]\Big|\Big|
\leq L\sum_{k=1}^{Q}p_k\mathbb{E}\big|\big|\nabla F^k(w_0(t))-\nabla F^k(w^k(t,i))\big|\big|\;\;\;\text{(Jensen's inequality)}
\leq L\mu\sum_{k=1}^{Q}p_k\mathbb{E}||w_0(t)-w^k(t,i)||\;\;\;\text{(each $F^k$ is $\mu$-smooth)}
\leq L\mu i\eta(t)\sum_{k=1}^{Q}p_k\big[L+\sqrt{\text{Tr}(\Sigma_k)}\big]\;\;\;\text{(Lemma A.1)}.

It follows that

-\nabla F(w_{0}(t))^{T}\Big(\sum_{k=1}^{Q}p_{k}\mathbb{E}[G^{k}(t,i)]\Big) &=R(t)-||\nabla F(w_{0}(t))||^{2} \\
&\leq L\mu i\eta(t)\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}]-||\nabla F(w_{0}(t))||^{2}.
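As an illustrative aside, the following Python sketch simulates local SGD on a toy quadratic client loss and compares the empirical client drift $\mathbb{E}||w_{0}-w^{k}(t,i)||$ against a Lemma A.1-style bound $i\eta\,(L+\sqrt{\text{Tr}(\Sigma_{k})})$, where $L$ is taken here as an empirical bound on the local gradient norms; the client loss, noise covariance, and constants are toy values, not the paper's setup.

```python
import numpy as np

# Illustrative check of the client-drift bound used above: after i local SGD
# steps on a toy quadratic client loss, E||w_0 - w^k(t,i)|| should stay below
# i * eta * (L + sqrt(Tr Sigma_k)), with L an empirical gradient-norm bound.
rng = np.random.default_rng(8)
d, E, eta, n_paths = 2, 8, 0.05, 100_000
U = np.array([[3.0, 0.5], [0.5, 1.5]])     # toy client curvature (Hessian of F^k)
a = np.array([1.0, -0.5])                  # toy client minimizer a_k
Sigma = 0.04 * np.eye(d)                   # toy gradient-noise covariance Sigma_k
w0 = np.zeros(d)                           # server weights at the start of the round

w = np.tile(w0, (n_paths, 1))
L_emp = 0.0
for i in range(1, E + 1):
    full_grad = (w - a) @ U.T                                   # grad F^k(w) = U_k (w - a_k)
    L_emp = max(L_emp, np.linalg.norm(full_grad, axis=1).max()) # empirical gradient bound
    noise = rng.standard_normal((n_paths, d)) @ np.linalg.cholesky(Sigma).T
    w = w - eta * (full_grad + noise)                           # one local SGD step
    drift = np.linalg.norm(w - w0, axis=1).mean()               # E||w_0 - w^k(t,i)||
    bound = i * eta * (L_emp + np.sqrt(np.trace(Sigma)))
    print(f"i={i}  drift={drift:.4f}  bound={bound:.4f}")
```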

We return to our bound on $B_{1}$ as

B1\displaystyle B_{1} η0(t)i=0E1η(t)[Lμiη(t)k=1Qpk[L+Tr(Σk)]F(w0(t))2]\displaystyle\leq\eta_{0}(t)\sum_{i=0}^{E-1}\eta(t)\Big{[}L\mu i\eta(t)\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}]-||\nabla F(w_{0}(t))||^{2}\Big{]}
=Eη0(t)η(t)F(w0(t))2+η0(t)η(t)2Lμk=1Qpk[L+Tr(Σk)]i=0E1i\displaystyle=-E\eta_{0}(t)\eta(t)||\nabla F(w_{0}(t))||^{2}+\eta_{0}(t)\eta(t)^{2}L\mu\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}]\sum_{i=0}^{E-1}i
=Eη0(t)η(t)F(w0(t))2+η0(t)η(t)2Lμk=1Qpk[L+Tr(Σk)](12E(E1))\displaystyle=-E\eta_{0}(t)\eta(t)||\nabla F(w_{0}(t))||^{2}+\eta_{0}(t)\eta(t)^{2}L\mu\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}](\frac{1}{2}E(E-1))
Eη0(t)η(t)F(w0(t))2+12E2η0(t)η(t)2Lμk=1Qpk[L+Tr(Σk)].\displaystyle\leq-E\eta_{0}(t)\eta(t)||\nabla F(w_{0}(t))||^{2}+\frac{1}{2}E^{2}\eta_{0}(t)\eta(t)^{2}L\mu\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}].

Now we bound $B_{2}$ as

B2\displaystyle B_{2} =12Tr(h(η0(t))2V^(t)w0w0((w0(t))))\displaystyle=\frac{1}{2}\text{Tr}(h(\eta_{0}(t))^{2}\hat{V}(t)\partial_{w_{0}w_{0}}(\mathcal{E}(w_{0}(t))))
=h(η0(t))22Tr(V^(t)2F(w0(t))).\displaystyle=\frac{h(\eta_{0}(t))^{2}}{2}\text{Tr}\bigg{(}\hat{V}(t)\nabla^{2}F(w_{0}(t))\bigg{)}.

Using Lemma A.2, we can bound the trace of a product of symmetric matrices by the product of spectral norms. Both $\hat{V}(t)$ and $\nabla^{2}F(w_{0}(t))$ are symmetric by construction. So we have

B2\displaystyle B_{2} h(η0(t))22V^(t)S2F(w0(t))S.\displaystyle\leq\frac{h(\eta_{0}(t))^{2}}{2}||\hat{V}(t)||_{S}||\nabla^{2}F(w_{0}(t))||_{S}.

Then by Assumptions 4.1, 4.2, and 4.3, we have

B2\displaystyle B_{2} h(η0(t)η(t))22V^1(t)S2F(w0(t))S\displaystyle\leq\frac{h(\eta_{0}(t)\eta(t))^{2}}{2}||\hat{V}_{1}(t)||_{S}||\nabla^{2}F(w_{0}(t))||_{S}
h(η0(t)η(t))2VL2.\displaystyle\leq\frac{h(\eta_{0}(t)\eta(t))^{2}V^{*}L}{2}.

We return to the bound on the infinitesimal generator as

\mathscr{A}\mathcal{E}(w_{0}(t)) &=\underbrace{\langle\partial_{w_{0}}(\mathcal{E}(w_{0}(t))),\eta_{0}(t)\hat{M}(t)\rangle}_{B_{1}}+\underbrace{\frac{1}{2}\text{Tr}(h(\eta_{0}(t))^{2}\hat{V}(t)\partial_{w_{0}w_{0}}(\mathcal{E}(w_{0}(t))))}_{B_{2}} \\
&\leq-E\eta_{0}(t)\eta(t)||\nabla F(w_{0}(t))||^{2}+\underbrace{\frac{E^{2}L\mu\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}]}{2}}_{C_{1}}\eta_{0}(t)\eta(t)^{2}+\frac{h(\eta_{0}(t)\eta(t))^{2}V^{*}L}{2}.

Then we use Dynkin’s formula (Equation 5) and substitute $\eta_{0}(t)=1$ to obtain the following expression

\mathbb{E}[\mathcal{E}(w_{0}(t))]-\mathcal{E}(w_{0}(0))\leq\mathbb{E}\bigg[\int_{0}^{t}\Big[-E\eta(t^{\prime})||\nabla F(w_{0}(t^{\prime}))||^{2}+C_{1}\eta(t^{\prime})^{2}+\frac{h\eta(t^{\prime})^{2}V^{*}L}{2}\Big]dt^{\prime}\bigg]. (6)

We follow a similar approach as (Orvieto & Lucchi, 2019) and define $\varphi(t)=\int_{0}^{t}\eta(t^{\prime})dt^{\prime}$. We then consider the probability density function $f(t^{\prime})=\frac{\eta(t^{\prime})}{\varphi(t)}$ on $[0,t]$; Lemma A.3 shows this is a valid probability density function. We define a random variable $\hat{t}\in[0,t]$ with probability density function $\frac{\eta(t^{\prime})}{\varphi(t)}$. We then use a similar trick to (Johnson & Zhang, 2013) and (Orvieto & Lucchi, 2019) and use the law of the unconscious statistician to obtain

\mathbb{E}_{\tilde{t}}||\nabla F(w_{0}(\hat{t}))||^{2}=\frac{1}{\varphi(t)}\int_{0}^{t}\eta(t^{\prime})||\nabla F(w_{0}(t^{\prime}))||^{2}dt^{\prime} (7)

where $\mathbb{E}_{\tilde{t}}$ denotes the expectation over the random time point $\hat{t}$.
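As an illustrative aside, the following Python sketch verifies the reweighting identity in Equation 7 for the assumed choice $\eta(t)=\frac{1}{t+1}$ and an arbitrary test function $g$ standing in for $||\nabla F(w_{0}(t))||^{2}$, sampling $\hat{t}$ from the density $\eta(t^{\prime})/\varphi(t)$ via its inverse CDF.

```python
import numpy as np

# Illustrative check of Equation 7: sampling t_hat with density eta(t')/phi(t)
# turns the eta-weighted time integral into an expectation. Here eta(t)=1/(t+1)
# and g is an arbitrary test function, both illustrative choices.
rng = np.random.default_rng(1)
T = 10.0
phi = np.log(T + 1.0)                          # phi(T) for eta(t) = 1/(t+1)
g = lambda t: np.sin(t) ** 2 + 0.1 * t         # arbitrary nonnegative test function

# Inverse-CDF sampling: CDF(t') = log(t'+1)/log(T+1), so t' = (T+1)**u - 1.
u = rng.uniform(size=500_000)
t_hat = (T + 1.0) ** u - 1.0
mc_estimate = g(t_hat).mean()

# Right-hand side of Equation 7 by trapezoidal quadrature.
ts = np.linspace(0.0, T, 200_001)
integrand = g(ts) / (ts + 1.0)
quad = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(ts)) / phi
print(f"Monte Carlo E[g(t_hat)] = {mc_estimate:.4f}, (1/phi) * integral = {quad:.4f}")
```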

We substitute Equation 7 into Inequality 6 and get

\underbrace{\mathbb{E}[\mathcal{E}(w_{0}(t))]}_{E_{1}(t)}-\mathcal{E}(w_{0}(0))\leq-E\varphi(t)\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\hat{t}))||^{2}+\int_{0}^{t}\Big[C_{1}\eta(t^{\prime})^{2}+\frac{h\eta(t^{\prime})^{2}V^{*}L}{2}\Big]dt^{\prime} (8)

where $\mathbb{E}_{\mathcal{G},\tilde{t}}$ is the expectation over the random time point $\tilde{t}$ and the stochastic gradients $\mathcal{G}$.

We notice that $E_{1}(t)=\mathbb{E}[F(w_{0}(t))]-F(w_{0}^{*})\geq 0$, so we can safely drop this term from the inequality.

So we can rewrite Equation 8 as

\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\hat{t}))||^{2} &\leq\frac{\mathcal{E}(w_{0}(0))}{E\varphi(t)}+\frac{1}{E\varphi(t)}\int_{0}^{t}\Big[C_{1}\eta(t^{\prime})^{2}+\frac{h\eta(t^{\prime})^{2}V^{*}L}{2}\Big]dt^{\prime} \\
&=\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\varphi(t)}+\frac{1}{E\varphi(t)}\int_{0}^{t}\Big[C_{1}\eta(t^{\prime})^{2}+\frac{h\eta(t^{\prime})^{2}V^{*}L}{2}\Big]dt^{\prime}.

A.4 Proof of Corollary 4.6

Proof.

We first examine the case where $\eta(t)=\frac{1}{t+1}$.

We have

φ(t)\displaystyle\varphi(t) =0tη(t)𝑑t\displaystyle=\int_{0}^{t}\eta(t^{\prime})dt^{\prime}
=0t1t+1𝑑t\displaystyle=\int_{0}^{t}\frac{1}{t^{\prime}+1}dt^{\prime}
=log(t+1)|0t\displaystyle=\log(t^{\prime}+1)\bigg{|}_{0}^{t}
=log(t+1).\displaystyle=\log(t+1).

Therefore we find

𝔼𝒢,t~F(w0(t~))2\displaystyle\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\tilde{t}))||^{2} F(w0(0))F(w0)Eφ(t)+1Eφ(t)0t[C1η(t)2+hη(t)2VL2]𝑑t\displaystyle\leq\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\varphi(t)}+\frac{1}{E\varphi(t)}\int_{0}^{t}[C_{1}\eta(t^{\prime})^{2}+\frac{h\eta(t^{\prime})^{2}V^{*}L}{2}]dt^{\prime}
=F(w0(0))F(w0)Elog(t+1)+1Elog(t+1)[C10tη(t)2𝑑t=tt+1+hVL20tη(t)2𝑑t]\displaystyle=\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\log(t+1)}+\frac{1}{E\log(t+1)}\bigg{[}C_{1}\underbrace{\int_{0}^{t}\eta(t^{\prime})^{2}dt^{\prime}}_{=\frac{t}{t+1}}+\frac{hV^{*}L}{2}\int_{0}^{t}\eta(t^{\prime})^{2}dt^{\prime}\bigg{]}
F(w0(0))F(w0)Elog(t+1)+C1+hVL2Elog(t+1)\displaystyle\leq\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\log(t+1)}+\frac{C_{1}+\frac{hV^{*}L}{2}}{E\log(t+1)}
=F(w0(0))F(w0)+C1+hVL2E1log(t+1).\displaystyle=\frac{F(w_{0}(0))-F(w_{0}^{*})+C_{1}+\frac{hV^{*}L}{2}}{E}\frac{1}{\log(t+1)}.

Now we examine $\eta(t)=\frac{1}{\sqrt{t+1}}$. We have $\varphi(t)=\int_{0}^{t}\eta(t^{\prime})dt^{\prime}=2\sqrt{t+1}-2$, and we find

𝔼𝒢,t~F(w0(t~))2\displaystyle\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\tilde{t}))||^{2} F(w0(0))F(w0)Eφ(t)+1Eφ(t)0t[C1η(t)2+hη(t)2VL2]𝑑t\displaystyle\leq\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\varphi(t)}+\frac{1}{E\varphi(t)}\int_{0}^{t}[C_{1}\eta(t^{\prime})^{2}+\frac{h\eta(t^{\prime})^{2}V^{*}L}{2}]dt^{\prime}
=F(w0(0))F(w0)E(2t+12)+1E(2t+12)[C10tη(t)2𝑑t=log(t+1)+hVL20tη(t)2𝑑t]\displaystyle=\frac{F(w_{0}(0))-F(w_{0}^{*})}{E(2\sqrt{t+1}-2)}+\frac{1}{E(2\sqrt{t+1}-2)}\bigg{[}C_{1}\underbrace{\int_{0}^{t}\eta(t^{\prime})^{2}dt^{\prime}}_{=\log(t+1)}+\frac{hV^{*}L}{2}\int_{0}^{t}\eta(t^{\prime})^{2}dt^{\prime}\bigg{]}
F(w0(0))F(w0)E(2t+12)+[C1+hVL2]log(t+1)E(2t+12).\displaystyle\leq\frac{F(w_{0}(0))-F(w_{0}^{*})}{E(2\sqrt{t+1}-2)}+\frac{[C_{1}+\frac{hV^{*}L}{2}]\log(t+1)}{E(2\sqrt{t+1}-2)}.

Now we examine the asymptotic rates. This proof is the same as in (Orvieto & Lucchi, 2019). The general form of the inequality is

𝔼𝒢,t~F(w0(t~))2F(w0(0))F(w0)Eφ(t)+1Eφ(t)0t[C1η(t)2+hη(t)2VL2]𝑑t.\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\tilde{t}))||^{2}\leq\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\varphi(t)}+\frac{1}{E\varphi(t)}\int_{0}^{t}[C_{1}\eta(t^{\prime})^{2}+\frac{h\eta(t^{\prime})^{2}V^{*}L}{2}]dt^{\prime}.

The first term on the RHS of the inequality is the deterministic term and, for learning rates of the form $\eta(t)=\frac{1}{(t+1)^{b}}$, has rate $\mathcal{O}(t^{b-1})$ for $0<b<1$ and $\mathcal{O}(1/\log(t))$ for $b=1$. The stochastic term is $\mathcal{O}(t^{-b})$ for $b\in(0,1/2)$, $\mathcal{O}(\frac{\log(t)}{\sqrt{t}})$ for $b=1/2$, $\mathcal{O}(t^{b-1})$ for $b\in(1/2,1)$, and $\mathcal{O}(\frac{1}{\log{t}})$ for $b=1$.
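As an illustrative aside, the following Python sketch evaluates the two terms of the general bound for the assumed schedule $\eta(t)=\frac{1}{(t+1)^{b}}$ using the closed-form integrals of $(t+1)^{-p}$, so the decay of the deterministic and stochastic terms can be read off numerically; the values of $b$ and $t$ are arbitrary.

```python
import numpy as np

# Illustrative evaluation of the bound's two terms for eta(t) = (t+1)**(-b):
# the deterministic term scales like 1/phi(t) and the stochastic term like
# (int_0^t eta(t')^2 dt') / phi(t).
def power_integral(t, p):
    """int_0^t (s+1)**(-p) ds in closed form."""
    if np.isclose(p, 1.0):
        return np.log(t + 1.0)
    return ((t + 1.0) ** (1.0 - p) - 1.0) / (1.0 - p)

for b in (0.25, 0.5, 0.75, 1.0):
    for t in (1e2, 1e4, 1e6):
        phi = power_integral(t, b)
        det_term = 1.0 / phi
        stoch_term = power_integral(t, 2.0 * b) / phi
        print(f"b={b:4.2f}  t={t:8.0e}  det={det_term:.3e}  stoch={stoch_term:.3e}")
```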

A.5 Proof of Corollary 4.7

Proof.

We examine the convergence for the choice of client learning rate $\eta(t)=\eta_{c}$ and server learning rate $\eta_{0}(t)=\frac{1}{t+1}$. Most of the proof follows the same steps as the proof of Theorem 4.5, but we return to the inequality for the infinitesimal generator as

\mathscr{A}\mathcal{E}(w_{0}(t)) &\leq-E\eta_{0}(t)\eta(t)||\nabla F(w_{0}(t))||^{2}+\underbrace{\frac{E^{2}L\mu\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}]}{2}}_{C_{1}}\eta_{0}(t)\eta(t)^{2}+\frac{h(\eta_{0}(t)\eta(t))^{2}V^{*}L}{2} \\
&=-E\eta_{0}(t)\eta_{c}||\nabla F(w_{0}(t))||^{2}+C_{1}\eta_{0}(t)\eta_{c}^{2}+\frac{h(\eta_{0}(t)\eta_{c})^{2}V^{*}L}{2}.

We use Dynkin’s formula (Equation 5) to obtain

\mathbb{E}[\mathcal{E}(w_{0}(t))]-\mathcal{E}(w_{0}(0))\leq\mathbb{E}\bigg[\int_{0}^{t}\Big[-E\eta_{c}\eta_{0}(t^{\prime})||\nabla F(w_{0}(t^{\prime}))||^{2}+C_{1}\eta_{0}(t^{\prime})\eta_{c}^{2}+\frac{h\eta_{0}(t^{\prime})^{2}\eta_{c}^{2}V^{*}L}{2}\Big]dt^{\prime}\bigg]. (9)

Now we follow steps similar to the final steps of the proof for Theorem 4.5.

We set $\varphi(t)=\int_{0}^{t}\eta_{0}(t^{\prime})dt^{\prime}$. We then substitute the analog of Equation 7 (with $\eta_{0}$ in place of $\eta$) into Inequality 9 and get

𝔼[(w0(t))]E1(t)(w0(0))Eηcφ(t)𝔼𝒢,t~F(w0(t^))2+ηc20t[C1η0(t)+hη0(t)2VL2]𝑑t.\underbrace{\mathbb{E}[\mathcal{E}(w_{0}(t))]}_{E_{1}(t)}-\mathcal{E}(w_{0}(0))\leq-E\eta_{c}\varphi(t)\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\hat{t}))||^{2}+\eta_{c}^{2}\int_{0}^{t}[C_{1}\eta_{0}(t^{\prime})+\frac{h\eta_{0}(t^{\prime})^{2}V^{*}L}{2}]dt^{\prime}. (10)

We notice that $E_{1}(t)=\mathbb{E}[F(w_{0}(t))]-F(w_{0}^{*})\geq 0$, so we can safely drop this term from the inequality.

So we can rewrite Equation 10 as

𝔼𝒢,t~F(w0(t^))2\displaystyle\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\hat{t}))||^{2} (w0(0))Eηcφ(t)+ηcEφ(t)0t[C1η0(t)+hη0(t)2VL2]𝑑t\displaystyle\leq\frac{\mathcal{E}(w_{0}(0))}{E\eta_{c}\varphi(t)}+\frac{\eta_{c}}{E\varphi(t)}\int_{0}^{t}[C_{1}\eta_{0}(t^{\prime})+\frac{h\eta_{0}(t^{\prime})^{2}V^{*}L}{2}]dt^{\prime}
=F(w0(0))F(w0)Eηcφ(t)+ηcEφ(t)0t[C1η0(t)+hη0(t)2VL2]𝑑t.\displaystyle=\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\eta_{c}\varphi(t)}+\frac{\eta_{c}}{E\varphi(t)}\int_{0}^{t}[C_{1}\eta_{0}(t^{\prime})+\frac{h\eta_{0}(t^{\prime})^{2}V^{*}L}{2}]dt^{\prime}.

Now we substitute $\eta_{0}(t)=\frac{1}{t+1}$ and get

𝔼𝒢,t~F(w0(t^))2\displaystyle\mathbb{E}_{\mathcal{G},\tilde{t}}||\nabla F(w_{0}(\hat{t}))||^{2} F(w0(0))F(w0)Eηclog(t+1)+ηcElog(t+1)[C1log(t+1)+hVL2].\displaystyle\leq\frac{F(w_{0}(0))-F(w_{0}^{*})}{E\eta_{c}\log(t+1)}+\frac{\eta_{c}}{E\log(t+1)}[C_{1}\log(t+1)+\frac{hV^{*}L}{2}].
=F(w0(0))F(w0)+ηc2hVL/2Eηclog(t+1)+C1ηcE\displaystyle=\frac{F(w_{0}(0))-F(w_{0}^{*})+\eta_{c}^{2}hV^{*}L/2}{E\eta_{c}\log(t+1)}+\frac{C_{1}\eta_{c}}{E}
=F(w0(0))F(w0)+ηc2hVL/2Eηclog(t+1)+EηcLμk=1Qpk[L+Tr(Σk)]2.\displaystyle=\frac{F(w_{0}(0))-F(w_{0}^{*})+\eta_{c}^{2}hV^{*}L/2}{E\eta_{c}\log(t+1)}+\frac{E\eta_{c}L\mu\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}]}{2}.
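As an illustrative aside, the following Python sketch evaluates the shape of this bound: a term decaying like $1/\log(t+1)$ plus a floor proportional to the constant client learning rate $\eta_{c}$; all numerical constants in the sketch are illustrative placeholders, not values derived from the paper.

```python
import numpy as np

# Illustrative evaluation of the Corollary 4.7 bound shape: a 1/log(t+1)
# decaying term plus a floor proportional to eta_c. The constants below
# (delta0 ~ F(w_0(0)) - F(w_0^*), C1, h, Vstar, L, E) are placeholders.
delta0, C1, h, Vstar, L, E = 1.0, 0.5, 0.1, 1.0, 2.0, 5
for eta_c in (0.1, 0.01):
    for t in (1e2, 1e4, 1e6):
        decaying = (delta0 + eta_c**2 * h * Vstar * L / 2) / (E * eta_c * np.log(t + 1))
        floor = C1 * eta_c / E
        print(f"eta_c={eta_c:5.2f}  t={t:8.0e}  bound={decaying + floor:.4f}  floor={floor:.4f}")
```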

A.6 Proof of Theorem 4.9

Proof.

This proof proceeds similarly to the proof of Theorem 4.5: we find a suitable energy function, bound the infinitesimal generator, and then use Dynkin’s formula to complete the bound.

For this case, we choose the energy function $\mathcal{E}(w_{0}):\mathbb{R}^{d}\rightarrow\mathbb{R}_{+}$ as $\mathcal{E}(w_{0})=\frac{1}{2}||w_{0}-w_{0}^{*}||^{2}$.

We then bound the infinitesimal generator of the process $\{\mathcal{E}(w_{0}(t))\}_{t\geq 0}$, which appears inside the expected stochastic integral in Dynkin’s formula, as

\mathscr{A}\mathcal{E}(w_{0}(t))=\underbrace{\langle\partial_{w_{0}}(\mathcal{E}(w_{0}(t))),\eta_{0}(t)\hat{M}(t)\rangle}_{B_{1}}+\underbrace{\frac{1}{2}\text{Tr}(h(\eta_{0}(t))^{2}\hat{V}(t)\partial_{w_{0}w_{0}}(\mathcal{E}(w_{0}(t))))}_{B_{2}}.

We first examine $B_{1}$ as

B_{1} &=\Big\langle w_{0}(t)-w_{0}^{*},-\eta_{0}(t)\sum_{k=1}^{Q}p_{k}\eta(t)\sum_{i=0}^{E-1}\mathbb{E}[G^{k}(t,i)]\Big\rangle \\
&=-\eta_{0}(t)\sum_{k=1}^{Q}p_{k}\eta(t)\sum_{i=0}^{E-1}(w_{0}(t)-w_{0}^{*})^{T}\mathbb{E}[\nabla F^{k}(w^{k}(t,i))].

We define $R(t)$ such that

(w_{0}(t)-w_{0}^{*})^{T}\mathbb{E}[\nabla F^{k}(w^{k}(t,i))]+R(t)=(w_{0}(t)-w_{0}^{*})^{T}\nabla F^{k}(w_{0}(t)). (11)

Now we bound $R(t)$ as

R(t)\displaystyle R(t) =(w0(t)w0)T(Fk(w0(t))𝔼[Fk(wk(t,i))])\displaystyle=(w_{0}(t)-w_{0}^{*})^{T}(\nabla F^{k}(w_{0}(t))-\mathbb{E}[\nabla F^{k}(w^{k}(t,i))])
|(w0(t)w0)T(Fk(w0(t))𝔼[Fk(wk(t,i))])|\displaystyle\leq|(w_{0}(t)-w_{0}^{*})^{T}(\nabla F^{k}(w_{0}(t))-\mathbb{E}[\nabla F^{k}(w^{k}(t,i))])|
w0(t)w0Fk(w0(t))𝔼[Fk(wk(t,i))](Cauchy-Schwarz)\displaystyle\leq||w_{0}(t)-w_{0}^{*}||\cdot||\nabla F^{k}(w_{0}(t))-\mathbb{E}[\nabla F^{k}(w^{k}(t,i))]||\;\;\;\text{(Cauchy-Schwarz)}
w0(t)w0𝔼[Fk(w0(t))Fk(wk(t,i))]\displaystyle\leq||w_{0}(t)-w_{0}^{*}||\cdot||\mathbb{E}[\nabla F^{k}(w_{0}(t))-\nabla F^{k}(w^{k}(t,i))]||
w0(t)w0𝔼Fk(w0(t))Fk(wk(t,i))(Jensen’s Inequality)\displaystyle\leq||w_{0}(t)-w_{0}^{*}||\cdot\mathbb{E}||\nabla F^{k}(w_{0}(t))-\nabla F^{k}(w^{k}(t,i))||\;\;\;\text{(Jensen's Inequality)}
μw0(t)w0𝔼w0(t)wk(t,i)(μ-smooth)\displaystyle\leq\mu||w_{0}(t)-w_{0}^{*}||\cdot\mathbb{E}||w_{0}(t)-w^{k}(t,i)||\;\;\;(\mu\text{-smooth})
μiη(t)[L+Tr(Σk)]w0(t)w0(Lemma A.1)\displaystyle\leq\mu i\eta(t)[L+\sqrt{\text{Tr}(\Sigma_{k})}]||w_{0}(t)-w_{0}^{*}||\;\;\;(\text{Lemma \ref{bounded_drift}})
=μiη(t)[L+Tr(Σk)]w0(0)+(w0(t)w0(0))w0.\displaystyle=\mu i\eta(t)[L+\sqrt{\text{Tr}(\Sigma_{k})}]||w_{0}(0)+(w_{0}(t)-w_{0}(0))-w_{0}^{*}||.

Now we need to find a bound on $||w_{0}(0)+(w_{0}(t)-w_{0}(0))-w_{0}^{*}||$. Specifically, we need to bound $||w_{0}(t)-w_{0}(0)||$, which we do as follows

w0(t)w0(0)\displaystyle||w_{0}(t)-w_{0}(0)|| =w0(0)w0(0)+0tη0(t)M^(t)𝑑t+0thη0(t)V^1/2(t)𝑑B(t)\displaystyle=||w_{0}(0)-w_{0}(0)+\int_{0}^{t}\eta_{0}(t^{\prime})\hat{M}(t^{\prime})dt^{\prime}+\int_{0}^{t}\sqrt{h}\eta_{0}(t^{\prime})\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||
0tη0(t)M^(t)𝑑t+0thη0(t)V^1/2(t)𝑑B(t).\displaystyle\leq\int_{0}^{t}\eta_{0}(t^{\prime})||\hat{M}(t^{\prime})||dt^{\prime}+||\int_{0}^{t}\sqrt{h}\eta_{0}(t^{\prime})\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||.

We assume a constant server learning rate $\eta_{0}(t)=\eta_{0}$. We examine $\int_{0}^{t}\eta_{0}(t^{\prime})||\hat{M}(t^{\prime})||dt^{\prime}$ as follows

0tη0(t)M^(t)𝑑t\displaystyle\int_{0}^{t}\eta_{0}(t^{\prime})||\hat{M}(t^{\prime})||dt^{\prime} =0tη0(t)η(t)k=1Qpki=0E1𝔼[Gk(t,i)]𝑑t\displaystyle=\int_{0}^{t}\eta_{0}(t^{\prime})||\eta(t^{\prime})\sum_{k=1}^{Q}p_{k}\sum_{i=0}^{E-1}\mathbb{E}[G^{k}(t^{\prime},i)]||dt^{\prime}
0tη0(t)η(t)k=1Qpki=0E1𝔼Fk(wk(t,i))dt\displaystyle\leq\int_{0}^{t}\eta_{0}(t^{\prime})\eta(t^{\prime})\sum_{k=1}^{Q}p_{k}\sum_{i=0}^{E-1}\mathbb{E}||\nabla F^{k}(w_{k}(t^{\prime},i))||dt^{\prime}
0tη0(t)η(t)LE𝑑t\displaystyle\leq\int_{0}^{t}\eta_{0}(t^{\prime})\eta(t^{\prime})LEdt^{\prime}
=η0LE0tη(t)𝑑t.\displaystyle=\eta_{0}LE\int_{0}^{t}\eta(t^{\prime})dt^{\prime}.

Therefore we have the following

||w_{0}(t)-w_{0}(0)||\leq\eta_{0}LE\int_{0}^{t}\eta(t^{\prime})dt^{\prime}+\sqrt{h}\eta_{0}\Big|\Big|\int_{0}^{t}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})\Big|\Big|.

Returning to $R(t)$, we have

R(t)\leq\mu i\eta(t)[L+\sqrt{\text{Tr}(\Sigma_{k})}]\bigg[||w_{0}(0)-w_{0}^{*}||+\eta_{0}LE\int_{0}^{t}\eta(t^{\prime})dt^{\prime}+\sqrt{h}\eta_{0}\Big|\Big|\int_{0}^{t}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})\Big|\Big|\bigg].

Returning to Equation 11, it follows that

(w0(t)w0)T𝔼[Fk(wk(t,i))]\displaystyle-(w_{0}(t)-w_{0}^{*})^{T}\mathbb{E}[\nabla F^{k}(w^{k}(t,i))] =R(t)(w0(t)w0)T(Fk(w0(t)))\displaystyle=R(t)-(w_{0}(t)-w_{0}^{*})^{T}(\nabla F^{k}(w_{0}(t)))
μiη(t)[L+Tr(Σk)][||w0(0)w0||+η0LE0tη(t)dt\displaystyle\leq\mu i\eta(t)[L+\sqrt{\text{Tr}(\Sigma_{k})}]\bigg{[}||w_{0}(0)-w_{0}^{*}||+\eta_{0}LE\int_{0}^{t}\eta(t^{\prime})dt^{\prime}
+hη0||0tV^1/2(t)dB(t)||](w0(t)w0)T(Fk(w0(t)))\displaystyle+\sqrt{h}\eta_{0}||\int_{0}^{t}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{]}-(w_{0}(t)-w_{0}^{*})^{T}(\nabla F^{k}(w_{0}(t)))
μiη(t)[L+Tr(Σk)][||w0(0)w0||+η0LE0tη(t)dt\displaystyle\leq\mu i\eta(t)[L+\sqrt{\text{Tr}(\Sigma_{k})}]\bigg{[}||w_{0}(0)-w_{0}^{*}||+\eta_{0}LE\int_{0}^{t}\eta(t^{\prime})dt^{\prime}
+hη0||0tV^1/2(t)dB(t)||]τ(F(w0(t))F(w0))(Assumption 4.8).\displaystyle+\sqrt{h}\eta_{0}||\int_{0}^{t}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{]}-\tau(F(w_{0}(t))-F(w_{0}^{*}))\;\;\;(\text{Assumption \ref{wqc}}).

Returning to the original bound on $B_{1}$, we have

B1\displaystyle B_{1} η0k=1Qpkη(t)i=0E1(μiη(t)[L+Tr(Σk)][||w0(0)w0||+η0LE0tη(t)dt+hη0||0tV^1/2(t)dB(t)||]\displaystyle\leq\eta_{0}\sum_{k=1}^{Q}p_{k}\eta(t)\sum_{i=0}^{E-1}\bigg{(}\mu i\eta(t)[L+\sqrt{\text{Tr}(\Sigma_{k})}]\bigg{[}||w_{0}(0)-w_{0}^{*}||+\eta_{0}LE\int_{0}^{t}\eta(t^{\prime})dt^{\prime}+\sqrt{h}\eta_{0}||\int_{0}^{t}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{]}
τ(F(w0(t))F(w0))).\displaystyle-\tau(F(w_{0}(t))-F(w_{0}^{*}))\bigg{)}.

Now we examine $B_{2}$ as

B2\displaystyle B_{2} =12Tr(h(η0(t))2V^(t)w0w0((w0(t))))\displaystyle=\frac{1}{2}\text{Tr}(h(\eta_{0}(t))^{2}\hat{V}(t)\partial_{w_{0}w_{0}}(\mathcal{E}(w_{0}(t)))) (12)
=hη022Tr(V^(t)).\displaystyle=\frac{h\eta_{0}^{2}}{2}\text{Tr}(\hat{V}(t)). (13)

From Lemma 3 in the supplementary materials of (Orvieto & Lucchi, 2019) and the same approach that we use in the proof of Theorem 4.5, we have that $\text{Tr}(\hat{V}(t))\leq d\eta(t)^{2}V^{*}$, where $d$ is the dimensionality of the weights $w_{0}$. Therefore we have

B2\displaystyle B_{2} dhη02η(t)2V2.\displaystyle\leq\frac{dh\eta_{0}^{2}\eta(t)^{2}V^{*}}{2}. (14)

We return to the bound on the infinitesimal generator as

𝒜(w0(t))\displaystyle\mathscr{A}\mathcal{E}(w_{0}(t)) η0k=1Qpkη(t)i=0E1(μiη(t)[L+Tr(Σk)][||w0(0)w0||+η0LE0tη(t)dt+hη0||0tV^1/2(t)dB(t)||]\displaystyle\leq\eta_{0}\sum_{k=1}^{Q}p_{k}\eta(t)\sum_{i=0}^{E-1}\bigg{(}\mu i\eta(t)[L+\sqrt{\text{Tr}(\Sigma_{k})}]\bigg{[}||w_{0}(0)-w_{0}^{*}||+\eta_{0}LE\int_{0}^{t}\eta(t^{\prime})dt^{\prime}+\sqrt{h}\eta_{0}||\int_{0}^{t}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{]}
τ(F(w0(t))F(w0)))+dhη02η(t)2V2.\displaystyle-\tau(F(w_{0}(t))-F(w_{0}^{*}))\bigg{)}+\frac{dh\eta_{0}^{2}\eta(t)^{2}V^{*}}{2}.

As in the proof of Theorem 4.5, we now use Dynkin’s Formula to get

𝔼[(w0(t))](w0(0))\displaystyle\mathbb{E}[\mathcal{E}(w_{0}(t))]-\mathcal{E}(w_{0}(0)) 𝔼[0t[η0k=1Qpkη(s)i=0E1(μiη(s)[L+Tr(Σk)][||w0(0)w0||+η0LE0sη(t)dt\displaystyle\leq\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta_{0}\sum_{k=1}^{Q}p_{k}\eta(s)\sum_{i=0}^{E-1}\bigg{(}\mu i\eta(s)[L+\sqrt{\text{Tr}(\Sigma_{k})}]\bigg{[}||w_{0}(0)-w_{0}^{*}||+\eta_{0}LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}
+hη0||0sV^1/2(t)dB(t)||]τ(F(w0(s))F(w0)))+dhη02η(s)2V2]ds]\displaystyle+\sqrt{h}\eta_{0}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{]}-\tau(F(w_{0}(s))-F(w_{0}^{*}))\bigg{)}+\frac{dh\eta_{0}^{2}\eta(s)^{2}V^{*}}{2}\bigg{]}ds\Bigg{]}
𝔼[0t[η0η(s)(μE2k=1Qpk[L+Tr(Σk)]C2η(s)[||w0(0)w0||+η0LE0sη(t)dt\displaystyle\leq\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta_{0}\eta(s)\bigg{(}\underbrace{\mu E^{2}\sum_{k=1}^{Q}p_{k}[L+\sqrt{\text{Tr}(\Sigma_{k})}]}_{C_{2}}\eta(s)\bigg{[}||w_{0}(0)-w_{0}^{*}||+\eta_{0}LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}
+hη0||0sV^1/2(t)dB(t)||]τ(F(w0(s))F(w0)))+dhη02η(s)2V2]ds]\displaystyle+\sqrt{h}\eta_{0}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{]}-\tau(F(w_{0}(s))-F(w_{0}^{*}))\bigg{)}+\frac{dh\eta_{0}^{2}\eta(s)^{2}V^{*}}{2}\bigg{]}ds\Bigg{]}
=𝔼[0t[η0C2η(s)2(η0LE0sη(t)dt+hη0||0sV^1/2(t)dB(t)||)]ds]\displaystyle=\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta_{0}C_{2}\eta(s)^{2}\bigg{(}\eta_{0}LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}+\sqrt{h}\eta_{0}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{)}\bigg{]}ds\Bigg{]}
+[dhη02V2+η0C2||w0(0)w0||]C3𝔼[0tη(s)2ds]𝔼[0t[η0η(s)τ(F(w0(s))F(w0))]ds]\displaystyle+\underbrace{\bigg{[}\frac{dh\eta_{0}^{2}V^{*}}{2}+\eta_{0}C_{2}||w_{0}(0)-w_{0}^{*}||\bigg{]}}_{C_{3}}\mathbb{E}\Bigg{[}\int_{0}^{t}\eta(s)^{2}ds\Bigg{]}-\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta_{0}\eta(s)\tau(F(w_{0}(s))-F(w_{0}^{*}))\bigg{]}ds\Bigg{]}
=η02C2𝔼[0t[η(s)2(LE0sη(t)dt+h||0sV^1/2(t)dB(t)||)]ds]\displaystyle=\eta_{0}^{2}C_{2}\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta(s)^{2}\bigg{(}LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}+\sqrt{h}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{)}\bigg{]}ds\Bigg{]}
+C3𝔼[0tη(s)2ds]τη0𝔼[0t[η(s)(F(w0(s))F(w0))]ds].\displaystyle+C_{3}\mathbb{E}\Bigg{[}\int_{0}^{t}\eta(s)^{2}ds\Bigg{]}-\tau\eta_{0}\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta(s)(F(w_{0}(s))-F(w_{0}^{*}))\bigg{]}ds\Bigg{]}.

We examine the term with $\hat{V}^{1/2}(t^{\prime})$ as

𝔼[0tη(s)2h||0sV^1/2(t)dB(t)||ds]=0tη(s)2h𝔼[||0sV^1/2(t)dB(t)||]ds,\displaystyle\mathbb{E}\Bigg{[}\int_{0}^{t}\eta(s)^{2}\sqrt{h}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||ds\Bigg{]}=\int_{0}^{t}\eta(s)^{2}\sqrt{h}\mathbb{E}\bigg{[}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\bigg{]}ds,

and we can exchange the expectation and the integral because $\eta(s)^{2}\sqrt{h}\,\mathbb{E}\big[||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||\big]<\infty$. From the multivariate Itô isometry and Jensen’s inequality for concave functions (such as the square root function), we have that

𝔼||0sV^1/2(t)dB(t)||\displaystyle\mathbb{E}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})|| 𝔼||0sV^1/2(t)dB(t)||2\displaystyle\leq\sqrt{\mathbb{E}||\int_{0}^{s}\hat{V}^{1/2}(t^{\prime})dB(t^{\prime})||^{2}}
𝔼0s||V^1/2(t)||2dt\displaystyle\leq\sqrt{\mathbb{E}\int_{0}^{s}||\hat{V}^{1/2}(t^{\prime})||^{2}dt^{\prime}}
=𝔼0s||η(t)V1^1/2(t)||F2dt\displaystyle=\sqrt{\mathbb{E}\int_{0}^{s}||\eta(t^{\prime})\hat{V_{1}}^{1/2}(t^{\prime})||_{F}^{2}dt^{\prime}}
=𝔼0sη(t)2||V1^1/2(t)||F2dt\displaystyle=\sqrt{\mathbb{E}\int_{0}^{s}\eta(t^{\prime})^{2}||\hat{V_{1}}^{1/2}(t^{\prime})||_{F}^{2}dt^{\prime}}
0sη(t)2(V)2dt\displaystyle\leq\sqrt{\int_{0}^{s}\eta(t^{\prime})^{2}(V^{*})^{2}dt^{\prime}}
=V0sη(t)2dt,\displaystyle=V^{*}\sqrt{\int_{0}^{s}\eta(t^{\prime})^{2}dt^{\prime}},

where $||\cdot||_{F}$ is the Frobenius norm.
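As an illustrative aside, the following Python sketch checks the chain of inequalities above, $\mathbb{E}||\int_{0}^{s}\sigma(t^{\prime})dB(t^{\prime})||\leq\sqrt{\int_{0}^{s}||\sigma(t^{\prime})||_{F}^{2}dt^{\prime}}$, by Monte Carlo for an arbitrary deterministic matrix-valued integrand $\sigma(t^{\prime})$; the integrand and all constants are illustrative choices.

```python
import numpy as np

# Illustrative Monte Carlo check of
# E||int_0^s sigma(t') dB(t')|| <= sqrt(int_0^s ||sigma(t')||_F^2 dt')
# (Jensen + Ito isometry) for an arbitrary deterministic integrand sigma(t').
rng = np.random.default_rng(2)
d, s, n_steps, n_paths = 3, 1.0, 500, 20000
dt = s / n_steps
ts = np.linspace(0.0, s, n_steps, endpoint=False)

def sigma(t):
    # arbitrary deterministic d x d integrand
    return np.diag([1.0 + t, 0.5, np.exp(-t)])

acc = np.zeros((n_paths, d))          # running Ito integral per path
frob_sq_integral = 0.0
for t in ts:
    S = sigma(t)
    frob_sq_integral += np.sum(S**2) * dt
    acc += (np.sqrt(dt) * rng.standard_normal((n_paths, d))) @ S.T
lhs = np.mean(np.linalg.norm(acc, axis=1))
rhs = np.sqrt(frob_sq_integral)
print(f"E||int sigma dB|| ~= {lhs:.4f}  <=  {rhs:.4f}")
```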

Returning to the bound from Dynkin’s Formula we have

𝔼[(w0(t))]0(w0(0))\displaystyle\underbrace{\mathbb{E}[\mathcal{E}(w_{0}(t))]}_{\geq 0}-\mathcal{E}(w_{0}(0)) η02C20t[η(s)2(LE0sη(t)dt+hV0sη(t)2dt)]ds\displaystyle\leq\eta_{0}^{2}C_{2}\int_{0}^{t}\bigg{[}\eta(s)^{2}\bigg{(}LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}+\sqrt{h}V^{*}\sqrt{\int_{0}^{s}\eta(t^{\prime})^{2}dt^{\prime}}\bigg{)}\bigg{]}ds
+C30tη(s)2dsτη0𝔼[0t[η(s)(F(w0(s))F(w0))]ds].\displaystyle+C_{3}\int_{0}^{t}\eta(s)^{2}ds-\tau\eta_{0}\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta(s)(F(w_{0}(s))-F(w_{0}^{*}))\bigg{]}ds\Bigg{]}.

Rearranging we obtain

𝔼[0t[η(s)(F(w0(s))F(w0))]ds]\displaystyle\mathbb{E}\Bigg{[}\int_{0}^{t}\bigg{[}\eta(s)(F(w_{0}(s))-F(w_{0}^{*}))\bigg{]}ds\Bigg{]} (w0(0))τη0+η02C2τη00t[η(s)2(LE0sη(t)dt+hV0sη(t)2dt)]ds\displaystyle\leq\frac{\mathcal{E}(w_{0}(0))}{\tau\eta_{0}}+\frac{\eta_{0}^{2}C_{2}}{\tau\eta_{0}}\int_{0}^{t}\bigg{[}\eta(s)^{2}\bigg{(}LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}+\sqrt{h}V^{*}\sqrt{\int_{0}^{s}\eta(t^{\prime})^{2}dt^{\prime}}\bigg{)}\bigg{]}ds
+C3τη00tη(s)2ds.\displaystyle+\frac{C_{3}}{\tau\eta_{0}}\int_{0}^{t}\eta(s)^{2}ds.

Using the same trick as in the proof of Theorem 4.5, where we create a random variable $\hat{t}\in[0,t]$ with probability density function $\frac{\eta(t^{\prime})}{\varphi(t)}$, we can substitute

\varphi(t)\mathbb{E}_{\tilde{t}}[F(w_{0}(\hat{t}))-F(w_{0}^{*})]=\int_{0}^{t}\eta(t^{\prime})(F(w_{0}(t^{\prime}))-F(w_{0}^{*}))dt^{\prime},

where $\varphi(t)=\int_{0}^{t}\eta(t^{\prime})dt^{\prime}$.

We finally find

\mathbb{E}_{\mathcal{G},\tilde{t}}[F(w_{0}(\hat{t}))-F(w_{0}^{*})] &\leq\frac{||w_{0}(0)-w_{0}^{*}||^{2}}{2\tau\eta_{0}\varphi(t)}+\frac{\eta_{0}^{2}C_{2}}{\tau\eta_{0}\varphi(t)}\int_{0}^{t}\bigg[\eta(s)^{2}\bigg(LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}+\sqrt{h}V^{*}\sqrt{\int_{0}^{s}\eta(t^{\prime})^{2}dt^{\prime}}\bigg)\bigg]ds \\
&+\frac{C_{3}}{\tau\eta_{0}\varphi(t)}\int_{0}^{t}\eta(s)^{2}ds.

A.7 Proof of Corollary 4.10

Proof.

We examine the case where $\eta(t)=\frac{1}{t+1}$, for which $\varphi(t)=\log(t+1)$. We obtain

\mathbb{E}_{\mathcal{G},\tilde{t}}[F(w_{0}(\hat{t}))-F(w_{0}^{*})] &\leq\frac{||w_{0}(0)-w_{0}^{*}||^{2}}{2\tau\eta_{0}\varphi(t)}+\frac{\eta_{0}^{2}C_{2}}{\tau\eta_{0}\varphi(t)}\int_{0}^{t}\bigg[\eta(s)^{2}\bigg(LE\int_{0}^{s}\eta(t^{\prime})dt^{\prime}+\sqrt{h}V^{*}\sqrt{\int_{0}^{s}\eta(t^{\prime})^{2}dt^{\prime}}\bigg)\bigg]ds \\
&+\frac{C_{3}}{\tau\eta_{0}\varphi(t)}\int_{0}^{t}\eta(s)^{2}ds \\
&=\frac{||w_{0}(0)-w_{0}^{*}||^{2}}{2\tau\eta_{0}\log(t+1)}+\frac{\eta_{0}^{2}C_{2}}{\tau\eta_{0}\log(t+1)}\int_{0}^{t}\bigg(LE\frac{\log(s+1)}{(s+1)^{2}}+\frac{\sqrt{h}V^{*}}{(s+1)^{2}}\sqrt{\frac{s}{s+1}}\bigg)ds \\
&+\frac{C_{3}}{\tau\eta_{0}\log(t+1)}\int_{0}^{t}\eta(s)^{2}ds \\
&\leq\frac{||w_{0}(0)-w_{0}^{*}||^{2}/2+C_{3}+\eta_{0}^{2}C_{2}\sqrt{h}V^{*}}{\tau\eta_{0}\log(t+1)}+\frac{\eta_{0}^{2}C_{2}LE}{\tau\eta_{0}}\cdot\frac{t-\log(t+1)}{t\log(t+1)}.

A.8 Proof of Theorem 5.4

Proof.

Lyapunov CLT. We first describe the Lyapunov central limit theorem. Assume we have a sequence of random variables $X_{1},X_{2},...,X_{n}$ which are independent but not necessarily identically distributed. These random variables have possibly different means $\mu_{i}$ and variances $\sigma_{i}^{2}$. To satisfy the Lyapunov condition, we need to show that for some $\delta>0$

\lim_{n\rightarrow\infty}\frac{1}{s_{n}^{2+\delta}}\sum_{i=1}^{n}\mathbb{E}[|X_{i}-\mu_{i}|^{2+\delta}]=0 (15)

where $s_{n}^{2}=\sum_{i=1}^{n}\sigma_{i}^{2}$.

The quantity we study is $A(t)$, which is formulated as

A(t)=\sum_{k=1}^{Q}\underbrace{p_{k}\Big[\sum_{i=0}^{E-1}\eta_{k}(t)(N^{k}_{t-E+i}-G^{k}_{t-E+i})\Big]}_{A_{k}} (16)

where $N^{k}_{t-E+i}\sim\mathcal{N}(0,\Sigma_{k}(w_{t-E+i}^{k}))$ and $G_{t-E+i}^{k}=\nabla F^{k}(w_{t-E+i}^{k})$.

For simplicity, the proof below assumes $A_{k}$ is one-dimensional. However, from Assumption 5.1, we have that $\Sigma_{k}(w_{t-E+i}^{k})$ is diagonal. This means we can apply the argument to each component of the vector $A_{k}$; each component then tends to a normal random variable, and the components are independent of each other because $\Sigma_{k}(w_{t-E+i}^{k})$ is diagonal. Since the components are independent and normally distributed, $A(t)$ tends to a multivariate normal distribution.

To satisfy the Lyapunov condition, we need to show that for some choice of $\delta$ the following holds

\lim_{Q\rightarrow\infty}\frac{1}{s_{Q}^{2+\delta}}\sum_{k=1}^{Q}\mathbb{E}\Big[\big|A_{k}-\mathbb{E}[A_{k}]\big|^{2+\delta}\Big]=0.

We choose $\delta=2$ and examine $\mathbb{E}\big[|A_{k}-\mathbb{E}[A_{k}]|^{2+\delta}\big]$ as

𝔼[|Ak𝔼[Ak]|2+δ]\displaystyle\mathbb{E}\bigg{[}\big{|}A_{k}-\mathbb{E}[A_{k}]\big{|}^{2+\delta}\bigg{]} =𝔼[|pk[i=0E1ηk(t)(NktE+iGktE+i)]𝔼[pk[i=0E1ηk(t)(NktE+iGktE+i)]]|4]\displaystyle=\mathbb{E}\bigg{[}\Big{|}p_{k}[\sum_{i=0}^{E-1}\eta_{k}(t)(N^{k}_{t-E+i}-G^{k}_{t-E+i})]-\mathbb{E}[p_{k}[\sum_{i=0}^{E-1}\eta_{k}(t)(N^{k}_{t-E+i}-G^{k}_{t-E+i})]]\Big{|}^{4}\bigg{]}
=pk4ηk(t)4𝔼[|i=0E1(NktE+i)=Nk+i=0E1(𝔼[GktE+i]GktE+i)=Rk|4]\displaystyle=p_{k}^{4}\eta_{k}(t)^{4}\mathbb{E}\bigg{[}\Big{|}\underbrace{\sum_{i=0}^{E-1}(N^{k}_{t-E+i})}_{=N_{k}}+\underbrace{\sum_{i=0}^{E-1}(\mathbb{E}[G^{k}_{t-E+i}]-G^{k}_{t-E+i})}_{=R_{k}}\Big{|}^{4}\bigg{]}
=pk4ηk(t)4𝔼[(Nk+Rk)4]\displaystyle=p_{k}^{4}\eta_{k}(t)^{4}\mathbb{E}\bigg{[}\Big{(}N_{k}+R_{k}\Big{)}^{4}\bigg{]}
=pk4ηk(t)4𝔼[Nk4+4Nk3Rk+6Nk2Rk2+4NkRk3+Rk4]\displaystyle=p_{k}^{4}\eta_{k}(t)^{4}\mathbb{E}\bigg{[}N_{k}^{4}+4N_{k}^{3}R_{k}+6N_{k}^{2}R_{k}^{2}+4N_{k}R_{k}^{3}+R_{k}^{4}\bigg{]}
=pk4ηk(t)4[𝔼[Nk4]+4𝔼[Nk3Rk]+6𝔼[Nk2Rk2]+𝔼[4NkRk3]+𝔼[Rk4]].\displaystyle=p_{k}^{4}\eta_{k}(t)^{4}\bigg{[}\mathbb{E}[N_{k}^{4}]+4\mathbb{E}[N_{k}^{3}R_{k}]+6\mathbb{E}[N_{k}^{2}R_{k}^{2}]+\mathbb{E}[4N_{k}R_{k}^{3}]+\mathbb{E}[R_{k}^{4}]\bigg{]}.

We have that $s_{Q}^{4}=\Big(\sum_{k=1}^{Q}\mathbb{E}\big[(A_{k}-\mathbb{E}[A_{k}])^{2}\big]\Big)^{2}=\Big(\sum_{k=1}^{Q}p_{k}^{2}\eta_{k}(t)^{2}\mathbb{E}\big[(N_{k}+R_{k})^{2}\big]\Big)^{2}$.

We can use the following identity, which states that for real $x_{i}$,

\sum_{i=1}^{n}x_{i}^{2}=\frac{(\sum_{i=1}^{n}x_{i})^{2}+\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_{i}-x_{j})^{2}}{n}.

We quickly prove this equation as follows

(i=1nxi)2+12i=1nj=1n(xixj)2n\displaystyle\frac{(\sum_{i=1}^{n}x_{i})^{2}+\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_{i}-x_{j})^{2}}{n} =i=1nxi2+i=1njixixj+12i=1nj=1n(xi22xixj+xj2)n\displaystyle=\frac{\sum_{i=1}^{n}x_{i}^{2}+\sum_{i=1}^{n}\sum_{j\neq i}x_{i}x_{j}+\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_{i}^{2}-2x_{i}x_{j}+x_{j}^{2})}{n}
=(n+1)i=1nxi2+i=1njixixji=1nj=1nxixjn\displaystyle=\frac{(n+1)\sum_{i=1}^{n}x_{i}^{2}+\sum_{i=1}^{n}\sum_{j\neq i}x_{i}x_{j}-\sum_{i=1}^{n}\sum_{j=1}^{n}x_{i}x_{j}}{n}
=(n+1)i=1nxi2+i=1n(jixixjj=1nxixj)n\displaystyle=\frac{(n+1)\sum_{i=1}^{n}x_{i}^{2}+\sum_{i=1}^{n}(\sum_{j\neq i}x_{i}x_{j}-\sum_{j=1}^{n}x_{i}x_{j})}{n}
=(n+1)i=1nxi2i=1nxi2n\displaystyle=\frac{(n+1)\sum_{i=1}^{n}x_{i}^{2}-\sum_{i=1}^{n}x_{i}^{2}}{n}
=i=1nxi2.\displaystyle=\sum_{i=1}^{n}x_{i}^{2}.
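As a quick numerical sanity check (illustrative only), the following Python snippet verifies this identity on random inputs.

```python
import numpy as np

# Quick numerical check of the finite-sum identity above on random inputs.
rng = np.random.default_rng(3)
x = rng.uniform(0.1, 2.0, size=50)
n = len(x)
lhs = np.sum(x**2)
rhs = (np.sum(x)**2 + 0.5 * np.sum((x[:, None] - x[None, :])**2)) / n
print(f"lhs = {lhs:.8f}, rhs = {rhs:.8f}")
```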

Using this identity and substituting $x_{k}=p_{k}^{2}\eta_{k}(t)^{2}\mathbb{E}\big[(N_{k}+R_{k})^{2}\big]$, we get

\Big(\sum_{k=1}^{Q}p_{k}^{2}\eta_{k}(t)^{2}\mathbb{E}\big[(N_{k}+R_{k})^{2}\big]\Big)^{2} &=Q\sum_{k=1}^{Q}\Big(p_{k}^{2}\eta_{k}(t)^{2}\mathbb{E}\big[(N_{k}+R_{k})^{2}\big]\Big)^{2} \\
&-\frac{1}{2}\sum_{i=1}^{Q}\sum_{j=1}^{Q}\Big(p_{i}^{2}\eta_{i}(t)^{2}\mathbb{E}\big[(N_{i}+R_{i})^{2}\big]-p_{j}^{2}\eta_{j}(t)^{2}\mathbb{E}\big[(N_{j}+R_{j})^{2}\big]\Big)^{2}.

From our assumptions on the similarity of the variance across clients, we have for some $C>0$

\Big(\sum_{k=1}^{Q}p_{k}^{2}\eta_{k}(t)^{2}\mathbb{E}\big[(N_{k}+R_{k})^{2}\big]\Big)^{2} &=Q\sum_{k=1}^{Q}C \\
&=CQ^{2}.

We have

1sQ4k=1Q𝔼[|Ak𝔼[Ak]|4]\displaystyle\frac{1}{s_{Q}^{4}}\sum_{k=1}^{Q}\mathbb{E}\bigg{[}\big{|}A_{k}-\mathbb{E}[A_{k}]\big{|}^{4}\bigg{]} =1CQ2k=1Q𝔼[|Ak𝔼[Ak]|4]\displaystyle=\frac{1}{CQ^{2}}\sum_{k=1}^{Q}\mathbb{E}\bigg{[}\big{|}A_{k}-\mathbb{E}[A_{k}]\big{|}^{4}\bigg{]}
=1CQ2k=1Qpk4ηk(t)4[𝔼[Nk4]+4𝔼[Nk3Rk]+6𝔼[Nk2Rk2]+𝔼[4NkRk3]+𝔼[Rk4]]\displaystyle=\frac{1}{CQ^{2}}\sum_{k=1}^{Q}p_{k}^{4}\eta_{k}(t)^{4}\bigg{[}\mathbb{E}[N_{k}^{4}]+4\mathbb{E}[N_{k}^{3}R_{k}]+6\mathbb{E}[N_{k}^{2}R_{k}^{2}]+\mathbb{E}[4N_{k}R_{k}^{3}]+\mathbb{E}[R_{k}^{4}]\bigg{]}
1CQ2k=1Qpk4ηk(t)4[|𝔼[Nk4]|+4|𝔼[Nk3Rk]|+6|𝔼[Nk2Rk2]|+4|𝔼[NkRk3]|+|𝔼[Rk4]|]\displaystyle\leq\frac{1}{CQ^{2}}\sum_{k=1}^{Q}p_{k}^{4}\eta_{k}(t)^{4}\bigg{[}|\mathbb{E}[N_{k}^{4}]|+4|\mathbb{E}[N_{k}^{3}R_{k}]|+6|\mathbb{E}[N_{k}^{2}R_{k}^{2}]|+4|\mathbb{E}[N_{k}R_{k}^{3}]|+|\mathbb{E}[R_{k}^{4}]|\bigg{]}
1CQ2k=1Qpk4ηk(t)4[D+4D+6D+4D+D]\displaystyle\leq\frac{1}{CQ^{2}}\sum_{k=1}^{Q}p_{k}^{4}\eta_{k}(t)^{4}\bigg{[}D+4D+6D+4D+D\bigg{]}
16DQCQ2.\displaystyle\leq\frac{16DQ}{CQ^{2}}.

Since $0\leq\frac{1}{s_{Q}^{4}}\sum_{k=1}^{Q}\mathbb{E}\big[|A_{k}-\mathbb{E}[A_{k}]|^{4}\big]\leq\frac{16D}{CQ}$, and $\lim_{Q\rightarrow\infty}0=0$ and $\lim_{Q\rightarrow\infty}\frac{16D}{CQ}=0$, by the squeeze theorem we know that $\lim_{Q\rightarrow\infty}\frac{1}{s_{Q}^{4}}\sum_{k=1}^{Q}\mathbb{E}\big[|A_{k}-\mathbb{E}[A_{k}]|^{4}\big]=0$. Thus, the Lyapunov condition holds.
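As an illustrative aside, the following Python sketch mimics the aggregation step with a simplified client term $A_{k}=p_{k}\eta(N_{k}+R_{k})$, where $N_{k}$ is Gaussian and $R_{k}$ is a centered, skewed remainder standing in for the stochastic-gradient part; as $Q$ grows, the standardized sum loses its skewness and excess kurtosis, consistent with the Lyapunov argument. The construction and constants are illustrative and are not the paper's exact $A_{k}$.

```python
import numpy as np

# Illustrative demonstration of the CLT behavior of the aggregated update:
# A_k = p_k * eta * (N_k + R_k) with Gaussian N_k and a centered, skewed R_k.
rng = np.random.default_rng(4)
eta, n_rep = 0.1, 50_000
for Q in (2, 10, 100):
    p = np.full(Q, 1.0 / Q)
    N = rng.standard_normal((n_rep, Q))
    R = rng.exponential(1.0, size=(n_rep, Q)) - 1.0   # centered, non-Gaussian
    A = (p * eta * (N + R)).sum(axis=1)               # aggregated update over Q clients
    z = (A - A.mean()) / A.std()
    skew, ex_kurt = np.mean(z**3), np.mean(z**4) - 3.0
    print(f"Q={Q:4d}  skewness={skew:+.3f}  excess kurtosis={ex_kurt:+.3f}")
```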

A.9 Proof of Theorem 6.2

Proof.

We wish to show that after $E$ local iterations of SGD, the local weights evolve as

w_{t+E}^{k}\sim w_{t}^{0}-\eta\sum_{j=0}^{E-1}(I-\eta U_{k})^{j}U_{k}(w_{t}^{0}-a_{k})-\eta\sum_{j=0}^{E-1}\sum_{i=1}^{j}(I-\eta U_{k})^{j-i}U_{k}\mathcal{N}(0,\Sigma_{k})+E\eta\mathcal{N}(0,\Sigma_{k}).

First observe the evolution for a few iterations as

wt+1k\displaystyle w_{t+1}^{k} wt0η𝒩(Uk(wt0ak),Σk)\displaystyle\sim w_{t}^{0}-\eta\mathcal{N}(U_{k}(w_{t}^{0}-a_{k}),\Sigma_{k})
wt+2k|wt+1k\displaystyle w_{t+2}^{k}|w_{t+1}^{k} wt+1kη𝒩(Uk(wt+1kak),Σk)\displaystyle\sim w_{t+1}^{k}-\eta\mathcal{N}(U_{k}(w_{t+1}^{k}-a_{k}),\Sigma_{k})
wt0ηUk(wt0ak)ηUk(wt+1kak)+2η𝒩(0,Σk)\displaystyle\sim w_{t}^{0}-\eta U_{k}(w_{t}^{0}-a_{k})-\eta U_{k}(w_{t+1}^{k}-a_{k})+2\eta\mathcal{N}(0,\Sigma_{k})
wt+3k|(wt+2k,wt+1k)\displaystyle w_{t+3}^{k}|(w_{t+2}^{k},w_{t+1}^{k}) wt0ηUk(wt0ak)ηUk(wt+1kak)+2η𝒩(0,Σk)ηUk(wt+2kak)+η𝒩(0,Σk).\displaystyle\sim w_{t}^{0}-\eta U_{k}(w_{t}^{0}-a_{k})-\eta U_{k}(w_{t+1}^{k}-a_{k})+2\eta\mathcal{N}(0,\Sigma_{k})-\eta U_{k}(w_{t+2}^{k}-a_{k})+\eta\mathcal{N}(0,\Sigma_{k}).

Extending to $E$ time steps forward, we have

wt+Ek|(wt+1k,,wt+E1k)wt0ηj=0E1Uk(wt+jkak)+Eη𝒩(0,Σk).\displaystyle w_{t+E}^{k}|(w_{t+1}^{k},...,w_{t+E-1}^{k})\sim w_{t}^{0}-\eta\sum_{j=0}^{E-1}U_{k}(w_{t+j}^{k}-a_{k})+E\eta\mathcal{N}(0,\Sigma_{k}). (17)

In general we have

Uk(wt+jkak)\displaystyle U_{k}(w_{t+j}^{k}-a_{k}) =(1ηUk)Uk(wt+j1kak)+Uk𝒩(0,Σk).\displaystyle=(1-\eta U_{k})U_{k}(w_{t+j-1}^{k}-a_{k})+U_{k}\mathcal{N}(0,\Sigma_{k}).

Starting from $j=1$, we have

Uk(wt+1kak)\displaystyle U_{k}(w_{t+1}^{k}-a_{k}) =(1ηUk)Uk(wt0ak)+Uk𝒩(0,Σk)\displaystyle=(1-\eta U_{k})U_{k}(w_{t}^{0}-a_{k})+U_{k}\mathcal{N}(0,\Sigma_{k})
=(UkηUk2)wt0+(Uk+ηUk2)ak+Uk𝒩(0,Σk).\displaystyle=(U_{k}-\eta U_{k}^{2})w_{t}^{0}+(-U_{k}+\eta U_{k}^{2})a_{k}+U_{k}\mathcal{N}(0,\Sigma_{k}).

Now for $j=2$, we have

Uk(wt+2kak)\displaystyle U_{k}(w_{t+2}^{k}-a_{k}) =(1ηUk)[(1ηUk)Uk(wt0ak)+Uk𝒩(0,Σk)]+Uk𝒩(0,Σk).\displaystyle=(1-\eta U_{k})[(1-\eta U_{k})U_{k}(w_{t}^{0}-a_{k})+U_{k}\mathcal{N}(0,\Sigma_{k})]+U_{k}\mathcal{N}(0,\Sigma_{k}).

Iterating this recursion out to general $j$, we have

Uk(wt+jkak)=(IηUk)jUk(wt0ak)+i=1j(IηUk)jiUk𝒩(0,Σk).\displaystyle U_{k}(w_{t+j}^{k}-a_{k})=(I-\eta U_{k})^{j}U_{k}(w_{t}^{0}-a_{k})+\sum_{i=1}^{j}(I-\eta U_{k})^{j-i}U_{k}\mathcal{N}(0,\Sigma_{k}). (18)

Combining equations 17 and 18, we have

wt+Ekwt0ηj=0E1(IηUk)jUk(wt0ak)ηj=0E1i=1j(IηUk)jiUk𝒩(0,Σk)+Eη𝒩(0,Σk).\displaystyle w_{t+E}^{k}\sim w_{t}^{0}-\eta\sum_{j=0}^{E-1}(I-\eta U_{k})^{j}U_{k}(w_{t}^{0}-a_{k})-\eta\sum_{j=0}^{E-1}\sum_{i=1}^{j}(I-\eta U_{k})^{j-i}U_{k}\mathcal{N}(0,\Sigma_{k})+E\eta\mathcal{N}(0,\Sigma_{k}).
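As an illustrative aside, the following Python sketch runs $E$ local SGD steps with Gaussian gradient noise on a toy quadratic client loss and compares the empirical mean of $w_{t+E}^{k}$ with the deterministic part of the expression above, $w_{t}^{0}-\eta\sum_{j=0}^{E-1}(I-\eta U_{k})^{j}U_{k}(w_{t}^{0}-a_{k})$; the curvature $U_{k}$, minimizer $a_{k}$, noise scale, and starting point are arbitrary toy values.

```python
import numpy as np

# Illustrative check of the mean of the local-SGD evolution on a toy quadratic
# client loss F^k(w) = 0.5 (w - a_k)^T U_k (w - a_k) with Gaussian gradient noise.
rng = np.random.default_rng(5)
d, E, eta, n_paths = 2, 5, 0.1, 200_000
U = np.array([[2.0, 0.3], [0.3, 1.0]])          # toy client curvature U_k
a = np.array([1.0, -1.0])                        # toy client minimizer a_k
Sigma_sqrt = 0.2 * np.eye(d)                     # toy gradient-noise scale
w0 = np.array([0.5, 0.5])                        # server weights w_t^0

# Monte Carlo local SGD.
w = np.tile(w0, (n_paths, 1))
for _ in range(E):
    grad = (w - a) @ U.T + rng.standard_normal((n_paths, d)) @ Sigma_sqrt.T
    w = w - eta * grad
empirical_mean = w.mean(axis=0)

# Deterministic part of the closed-form expression: w0 - eta * sum_j (I - eta U)^j U (w0 - a).
M = np.eye(d)
pred = w0.copy()
for _ in range(E):
    pred = pred - eta * M @ U @ (w0 - a)
    M = M @ (np.eye(d) - eta * U)
print("empirical mean:", np.round(empirical_mean, 4))
print("predicted mean:", np.round(pred, 4))
```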

A.10 Global loss function form for quadratic assumption

With the quadratic assumption, we can rewrite the server loss function as

F(w)=\frac{1}{2}(w-a)^{T}\Big(\sum_{k=1}^{Q}p_{k}U_{k}\Big)(w-a).

We must find $a$. We have that $\arg\min_{w}F(w)=a$, $\nabla F(w)=\sum_{k=1}^{Q}p_{k}U_{k}(w-a_{k})$, and $\nabla_{w}F(a)=0$. We find

wF(a)=0\displaystyle\nabla_{w}F(a)=0 =k=1QpkUk(aak)\displaystyle=\sum_{k=1}^{Q}p_{k}U_{k}(a-a_{k})
=k=1QpkUkak=1QpkUkak.\displaystyle=\sum_{k=1}^{Q}p_{k}U_{k}a-\sum_{k=1}^{Q}p_{k}U_{k}a_{k}.

Now we solve for $a$ as

k=1QpkUka=k=1QpkUkak\displaystyle\sum_{k=1}^{Q}p_{k}U_{k}a=\sum_{k=1}^{Q}p_{k}U_{k}a_{k}
a=(k=1QpkUk)1k=1QpkUkak.\displaystyle a=(\sum_{k=1}^{Q}p_{k}U_{k})^{-1}\sum_{k=1}^{Q}p_{k}U_{k}a_{k}.

Therefore, the server loss function is also a quadratic with minimizer $w_{0}^{*}=(\sum_{k=1}^{Q}p_{k}U_{k})^{-1}\sum_{k=1}^{Q}p_{k}U_{k}a_{k}$ and Hessian $H_{0}=\sum_{k=1}^{Q}p_{k}U_{k}$.
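As an illustrative aside, the following Python sketch builds a random instance of the quadratic setting and checks that $a=(\sum_{k}p_{k}U_{k})^{-1}\sum_{k}p_{k}U_{k}a_{k}$ indeed satisfies $\nabla F(a)=0$; the dimensions and random draws are arbitrary.

```python
import numpy as np

# Illustrative check that a = (sum_k p_k U_k)^{-1} sum_k p_k U_k a_k satisfies
# grad F(a) = 0 for quadratic client losses F^k(w) = 0.5 (w - a_k)^T U_k (w - a_k).
rng = np.random.default_rng(6)
Q, d = 4, 3
p = rng.dirichlet(np.ones(Q))
Us, aks = [], []
for _ in range(Q):
    A = rng.standard_normal((d, d))
    Us.append(A @ A.T + np.eye(d))               # symmetric positive definite U_k
    aks.append(rng.standard_normal(d))

H0 = sum(pk * Uk for pk, Uk in zip(p, Us))       # server Hessian
rhs = sum(pk * Uk @ ak for pk, Uk, ak in zip(p, Us, aks))
a = np.linalg.solve(H0, rhs)

grad_at_a = sum(pk * Uk @ (a - ak) for pk, Uk, ak in zip(p, Us, aks))
print("||grad F(a)|| =", np.linalg.norm(grad_at_a))   # should be ~0
```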

A.11 Proof of Theorem 6.3

For a linear SDE of the form $dX(t)=(a(t)+A(t)X(t))dt+b(t)dB(t)$, the solution is normally distributed as $\mathcal{N}(m(t),v(t))$. The mean $m(t)$ is the solution of the ordinary differential equation (ODE) $\frac{dm(t)}{dt}=A(t)m(t)+a(t)$ with initial condition $m(0)=X(0)$. The variance $v(t)$ is the solution of the ODE $\frac{dv(t)}{dt}=2A(t)v(t)+b(t)^{2}$ with initial condition $v(0)=0$ (Movellan, 2011). These ODEs are straightforward to solve, and we obtain the solution in Theorem 6.3.
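As an illustrative aside, the following Python sketch simulates a one-dimensional linear SDE with constant coefficients via Euler-Maruyama and compares the empirical mean and variance at time $T$ with the solutions of the two ODEs above; all coefficients are illustrative choices.

```python
import numpy as np

# Illustrative check for dX = (a + A*X) dt + b dB with constant coefficients:
# the mean ODE m' = A m + a gives m(t) = (m0 + a/A) e^{A t} - a/A, and the
# variance ODE v' = 2 A v + b^2 gives v(t) = (b^2 / (2A)) (e^{2 A t} - 1).
rng = np.random.default_rng(7)
a_c, A_c, b_c, x0, T = 0.5, -1.0, 0.3, 2.0, 1.0
n_steps, n_paths = 2000, 100_000
dt = T / n_steps

x = np.full(n_paths, x0)
for _ in range(n_steps):
    x += (a_c + A_c * x) * dt + b_c * np.sqrt(dt) * rng.standard_normal(n_paths)

m_ode = (x0 + a_c / A_c) * np.exp(A_c * T) - a_c / A_c
v_ode = (b_c**2 / (2 * A_c)) * (np.exp(2 * A_c * T) - 1.0)
print(f"empirical mean {x.mean():.4f} vs ODE mean {m_ode:.4f}")
print(f"empirical var  {x.var():.4f} vs ODE var  {v_ode:.4f}")
```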