Continuous-Time Analysis of Federated Averaging
Abstract
Federated averaging (FedAvg) is a popular algorithm for horizontal federated learning (FL), where samples are gathered across different clients and are not shared with each other or a central server. Extensive convergence analysis of FedAvg exists for the discrete iteration setting, guaranteeing convergence for a range of loss functions and varying levels of data heterogeneity. We extend this analysis to the continuous-time setting, where the global weights evolve according to a multivariate stochastic differential equation (SDE); this is the first time FedAvg has been studied from the continuous-time perspective. We use techniques from stochastic processes to establish convergence guarantees under different loss functions, some of which are more general than those in existing work in the discrete setting. We also provide conditions under which FedAvg updates to the server weights can be approximated as normal random variables. Finally, we use the continuous-time formulation to reveal generalization properties of FedAvg.
1 Introduction
Federated learning (FL) is a popular privacy-preserving machine learning framework allowing a server to train a central model without accessing data locked on clients. In FL, each client holds a subset of the overall data and is not willing to share this data with the server; however, the clients can send model weights to the server. In horizontal FL studied herein, each client holds a subset of samples but each client has access to all features. A popular algorithm for horizontal FL is FedAvg (McMahan et al., 2017), which is the focus in this work.
FedAvg works by averaging the model weights of all clients periodically. Specifically, each client trains a local model for a specific number of iterations using only the data on the client. Typically, this local training is done using batch stochastic gradient descent. Then, the clients send their local model weights to the server, where the server averages the model weights and sends them back to the clients. Understanding the conditions for FedAvg to converge and what factors influence convergence and generalization is of great interest to federated learning practitioners and has been the focus of significant work (Li et al., 2019; Karimireddy et al., 2020).
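To make the update rule concrete, below is a minimal Python sketch of FedAvg. The function names, the toy quadratic client losses, and all hyperparameter values are our own illustrative choices, not part of the algorithm's original specification.

```python
import numpy as np

def fedavg(client_grads, w0, num_rounds=50, local_steps=5,
           client_lr=0.05, server_lr=1.0, weights=None, rng=None):
    """Minimal FedAvg sketch: local SGD on each client, periodic averaging.

    client_grads: list of stochastic-gradient callables, one per client.
    weights: aggregation weights p_k (defaults to uniform).
    """
    rng = rng or np.random.default_rng(0)
    n = len(client_grads)
    p = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
    w_server = np.array(w0, dtype=float)
    for _ in range(num_rounds):
        local = []
        for grad in client_grads:
            w = w_server.copy()           # each round starts from the broadcast weights
            for _ in range(local_steps):  # E local SGD steps
                w -= client_lr * grad(w, rng)
            local.append(w)
        # Server step: move toward the weighted average of the client models.
        avg = sum(pk * wk for pk, wk in zip(p, local))
        w_server += server_lr * (avg - w_server)  # server_lr = 1 recovers plain averaging
    return w_server

# Toy usage: two clients with heterogeneous quadratic losses F_k(w) = 0.5*(w - c_k)^2.
grads = [lambda w, rng, c=c: (w - c) + 0.1 * rng.standard_normal(w.shape)
         for c in (np.array([1.0]), np.array([-1.0]))]
print(fedavg(grads, w0=np.zeros(1)))  # converges near the average of the minima, 0
```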
A major challenge in FL is dealing with non-IID data across clients (heterogeneous setting). The local updates on each client significantly diverge due to the different data distributions they are trained on, resulting in slower convergence. Many improvements to FedAvg exist, such as SCAFFOLD (Karimireddy et al., 2020), which reduces the impact of the data drift during local training and speeds up training in heterogeneous settings, and RADFed (Xue et al., 2022), which handles non-IID data by delaying aggregation with specialized redistribution rounds. Despite the abundance of new algorithms being introduced to tackle new challenges in FL, FedAvg remains an important algorithm that is the foundation of most new approaches.
The continuous-time perspective of stochastic gradient descent (SGD) has provided a framework for developing more compact convergence proofs than the discrete process (Orvieto & Lucchi, 2019) and has allowed the study of generalization properties. Specifically, the continuous-time representation of SGD has been used to show how the learning rate and batch size influence the width of the local minima that SGD converges to (Jastrzebski et al., 2018). This width of local minima impacts the model's ability to generalize to unseen data (Keskar et al., 2017). The continuous-time representation also allows for studying first exit times of SGD from local minima and how these depend on the width of the local minima (Nguyen et al., 2019). Our work is focused on developing a continuous-time representation of FedAvg. It is our hope that this formulation provides a framework for new, interesting analyses, just as the continuous-time representation did for SGD.
We focus on a theoretical analysis of FedAvg. We formulate a continuous-time representation of FedAvg in the form of a stochastic differential equation (SDE) and use this formulation to prove convergence properties. The convergence proofs are relatively compact, and the proof framework may be extended to other FL algorithms. We show convergence of FedAvg to a stationary point for general, non-convex loss functions and demonstrate that decaying only the server learning rate is likely not sufficient; the client-side learning rate must decay at a suitable rate. We show convergence of FedAvg to the global minimum for weakly quasi-convex loss functions. To the best of our knowledge, weak quasi-convexity has not been studied for FedAvg up to this point. Next, we show that the server weight updates in FedAvg can be approximated as normal random variables under certain assumptions, even for heterogeneous data with extensive local updates before averaging. This is a surprising result and can assist in further analyses of FL algorithms. Finally, we use our continuous-time approach with a quadratic single variate form of each client's loss landscape to determine how different FedAvg hyperparameters affect the trade-off between minimizing expected loss and the ability to escape poorly-generalizing local minima.
Our contributions are as follows.
1. We are the first to formulate an SDE that models the evolution of server weights in continuous time during FedAvg. Existing work (Zhang et al., 2023) for modeling distributed learning algorithms uses ordinary differential equations without stochasticity.
2. Using the continuous-time formulation, we devise convergence proofs of FedAvg in deterministic and stochastic cases. We show convergence under a certain normality assumption to a stationary point for non-convex loss functions and convergence to a global minimum for weakly quasi-convex functions. To the best of our knowledge, no other works in either deterministic or stochastic regimes have shown global convergence of FedAvg for weakly quasi-convex loss functions, which are more general than convex functions and allow for locally concave regions. The new proof framework we provide allows for relatively compact proofs of FedAvg and can inspire compact proofs of other FL algorithms.
3. We show that the server weight updates converge in distribution to a normal distribution as the number of clients grows, even in non-IID data settings with many local client iterations without server averaging. This justifies the normality assumption in the convergence results. We demonstrate this through the Lyapunov central limit theorem.
4. Using a quadratic single variate loss function for each client's loss landscape, we uncover dynamics of how various hyperparameters in FedAvg affect the trade-off between generalization and the optimality gap of the expected loss in the continuous-time setting.
In Section 2, we discuss related work on the analysis of FedAvg in the discrete case and continuous-time analysis of SGD. In Section 3, we formulate the SDE that models the evolution of server weights during FedAvg. In Section 4, we provide convergence guarantees of FedAvg for non-convex and weakly quasi-convex loss functions using the SDE formulation. In Section 5, we provide conditions for which the server weight updates can be approximated as normally distributed; this is a key assumption for the SDE formulation to be an Itô process. In Section 6, we analyze the case where each client’s local loss function is a quadratic; this allows us to examine how various parameters, such as the client learning rate and the number of local iterations, affect the trade-off between minimizing the loss function and generalization.
2 Related Work
Extensive work has been done in analyzing stochastic gradient descent (SGD) from the continuous-time perspective. Although theory of the discrete process of stochastic gradient descent has existed for a long time, continuous-time perspectives have added value in understanding the behavior of SGD, particularly in the area of generalization (Mandt et al., 2015). Using the continuous-time perspective, it has been shown that learning rate magnitude and batch size affect the sharpness of the local minima that SGD converges to (Jastrzebski et al., 2018), and the sharpness of local minima has been linked to generalization ability (Keskar et al., 2017). Furthermore, there has been work in developing concise convergence proofs for the continuous SDE form of SGD (Orvieto & Lucchi, 2019). While the previously mentioned works form a Brownian motion representation with normally-distributed noise, there is also work on forming Lévy processes where the noise follows the more general $\alpha$-stable distribution, which can have heavy tails (Nguyen et al., 2019). This Lévy process is used to describe first-exit times of SGD under heavy-tailed noise and demonstrates why SGD prefers converging to wide local minima, which generalize better than narrow minima. All of these works take the actual discrete stochastic gradient descent process and form the continuous-time SDE that models the evolution of the weights over time.
Furthermore, the discrete process of FedAvg is well-studied. Convergence has been proven for the convex case, even when the data distribution across different clients is non-IID (Li et al., 2019). However, convergence is slower when data is highly non-IID and the number of local iterations before averaging is large, due to "client drift" (Karimireddy et al., 2020).
Analyzing federated learning from the continuous-time perspective is less studied. There has been work on analyzing the continuous-time system of a broad class of distributed optimization algorithms using a control-theory perspective; however, this work assumes full gradients and thus has no stochastic component (Zhang et al., 2023). Furthermore, their framework does not apply to the FedAvg algorithm specifically, because FedAvg cannot be mapped to the double-feedback system they developed. Since most implementations of FedAvg use stochastic gradients on the clients, we focus on developing and analyzing an SDE that models the evolution of the global weights while assuming that the client updates use noisy, stochastic gradients. Recent work (Mehrjou, 2021) analyzes the continuous-time limit of local client updates to make connections to game theory, but does not form a continuous-time representation of the server weights and does not study convergence of the global weights. Existing work (Glasgow et al., 2022) uses the continuous-time limit of each client's local SGD iterates to uncover iterate bias, but the authors do not form the continuous-time SDE for the evolution of the weights on the server, and thus their approach is very different from ours.
3 Continuous-Time Formulation
We are given $n$ samples, with the loss of sample $i$ being $f_i(w)$, where $w \in \mathbb{R}^d$ are the model weights. Furthermore, we assume the samples are gathered across $N$ clients in the horizontal federated learning fashion. We refer to each local client component of the overall objective function as $F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w)$, where $\mathcal{P}_k$ is the set of samples that belong to client $k$ and $n_k = |\mathcal{P}_k|$. We state the loss function as $F(w) = \sum_{k=1}^{N} p_k F_k(w)$, where $p_k$ is the weight of client $k$, usually set to $n_k / n$. We denote the copy of model weights on client $k$ as $w^k$. The goal is to solve $\min_w F(w)$.
Client weights are updated using standard SGD, except for iterations where averaging occurs. We can write the evolution of client weights as
\[
w_{t+1}^k = w_t^k - \eta_t \nabla F_k(w_t^k; \xi_t^k) \quad \text{when } E \nmid (t+1),
\]
and
\[
w_{t+1}^k = w_{t_0} - \eta_g \sum_{j=1}^{N} p_j \Bigl( w_{t_0} - \bigl( w_t^j - \eta_t \nabla F_j(w_t^j; \xi_t^j) \bigr) \Bigr) \quad \text{when } E \mid (t+1),
\]
where $B$ is the batch size, $\xi_t^k$ is a random batch drawn from the available samples in $\mathcal{P}_k$ at iteration $t$, $t_0$ is the most recent iteration where averaging occurred, $\eta_t$ is the learning rate on client $k$ and may vary over iterations, $\eta_g$ is the global server learning rate, which may vary over iterations, and $E$ is the number of local SGD updates before averaging. When $E \mid (t+1)$ and averaging occurs, the most recent iteration of averaging is incremented as $t_0 \leftarrow t+1$. It is important to note that the averaging step does not imply that clients have access to the updates of the other clients; this average is actually computed on the server and sent back to each client, and on this time step every client holds the same-valued weights. A common choice of server learning rate is $\eta_g = 1$, which simplifies this update schedule to $w_{t+1}^k = w_t^k - \eta_t \nabla F_k(w_t^k; \xi_t^k)$ when $E \nmid (t+1)$, and $w_{t+1}^k = \sum_{j=1}^{N} p_j \bigl( w_t^j - \eta_t \nabla F_j(w_t^j; \xi_t^j) \bigr)$ when $E \mid (t+1)$.
At iteration $t$ with $E \mid (t+1)$, each client performs one more step of SGD, then sends updated weights to the server, where the server averages the updates along with an optional server-side learning rate $\eta_g$ that controls how much to update the server weights. While the mathematical expression for the case when $E \mid (t+1)$ shows gradient passing, this is not the case in an actual implementation, where it is equivalent to sending the updated weights after each client's local updates. The server then sends this average back to every client. We study the convergence of the global server weights in this work, which we denote as $\overline{w}_t$ for iteration $t$. Updates to the server weights only occur every $E$ iterations during the averaging step; otherwise they remain the same as at the most recent averaging iteration.
As shown in (Jastrzebski et al., 2018), the stochastic gradients, $\nabla F_k(w_t^k; \xi_t^k)$, in the local client updates can be assumed to be normally distributed as follows:
\[
\nabla F_k(w_t^k; \xi_t^k) = \nabla F_k(w_t^k) + \frac{1}{\sqrt{B}} z_t^k,
\]
where $z_t^k \sim \mathcal{N}\bigl(0, \Sigma_k(w_t^k)\bigr)$. We can think of this as a normal random variable centered around the full gradient of the local loss function.
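To make the noise model concrete, here is a small numerical sketch (our own construction, with illustrative names and a toy one-dimensional loss) comparing the empirical minibatch-gradient variance to the modeled $\Sigma_k(w)/B$ scaling:

```python
import numpy as np

rng = np.random.default_rng(1)

# One client's data and a 1-D quadratic per-sample loss f_i(w) = 0.5 * (w - x_i)^2,
# so the per-sample gradient at w is (w - x_i).
x = rng.normal(loc=2.0, scale=1.5, size=10_000)
w, B = 0.0, 32                                  # current weight and batch size

full_grad = np.mean(w - x)                      # gradient of the client loss F_k at w
batch = rng.choice(x, size=B, replace=False)
sgd_grad = np.mean(w - batch)                   # one stochastic gradient draw
print(sgd_grad, full_grad)                      # noisy estimate vs. exact gradient

# Model: sgd_grad ~ full_grad + z / sqrt(B) with z ~ N(0, Sigma_k(w)),
# where Sigma_k(w) is the per-sample gradient variance at w.
sigma2 = np.var(w - x)
draws = np.array([np.mean(w - rng.choice(x, size=B, replace=False))
                  for _ in range(5_000)])
print(np.var(draws), sigma2 / B)                # empirical vs. modeled variance match
```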
Therefore, we can write our discrete updates in the time region where no averaging occurs as
\[
w_{t+1}^k = w_t^k - \eta_t \Bigl( \nabla F_k(w_t^k) + \frac{1}{\sqrt{B}} z_t^k \Bigr).
\]
Expanding this for $E$ time steps, we get
\[
w_{t_0+E}^k = w_{t_0} - \sum_{i=0}^{E-1} \eta_{t_0+i} \Bigl( \nabla F_k(w_{t_0+i}^k) + \frac{1}{\sqrt{B}} z_{t_0+i}^k \Bigr),
\]
where $\nabla F_k(w_{t_0+i}^k)$ and $z_{t_0+i}^k$ are evaluated at the intermediate iterates. Notice that these expressions depend on $w_{t_0+i}^k$ and not $w_{t_0}$. This results in the full expression being a very complicated recurrence relation.
We can re-index by the averaging round $m$, with $t = mE$, as
\[
w_{(m+1)E}^k = w_{mE} - \sum_{i=0}^{E-1} \eta_{mE+i} \Bigl( \nabla F_k(w_{mE+i}^k) + \frac{1}{\sqrt{B}} z_{mE+i}^k \Bigr).
\]
We assume that at iteration $t = mE$ (and thus for all integers $m \ge 0$ such that $E \mid t$), the server aggregates across clients, and we write the aggregated server weights at this time as $\overline{w}_m$. Using the averaging technique specified in the FedAvg update schedule we get
\[
\overline{w}_{m+1} = \overline{w}_m - \eta_g \sum_{k=1}^{N} p_k \sum_{i=0}^{E-1} \eta_{mE+i} \Bigl( \nabla F_k(w_{mE+i}^k) + \frac{1}{\sqrt{B}} z_{mE+i}^k \Bigr).
\]
We split the global learning rate, $\eta_g$, into the product of a constant term, $h$, used for lifting the difference equation to continuous time, and a learning rate, $\tilde{\eta}_m$, which depends on the iteration $m$. We write this as $\eta_g = h \tilde{\eta}_m$. Thus, we rewrite the difference equation as
\[
\overline{w}_{m+1} = \overline{w}_m - h \tilde{\eta}_m \sum_{k=1}^{N} p_k \sum_{i=0}^{E-1} \eta_{mE+i} \Bigl( \nabla F_k(w_{mE+i}^k) + \frac{1}{\sqrt{B}} z_{mE+i}^k \Bigr).
\]
Assumption 3.1.
The aggregated update $u_m = \sum_{k=1}^{N} p_k \sum_{i=0}^{E-1} \eta_{mE+i} \bigl( \nabla F_k(w_{mE+i}^k) + \frac{1}{\sqrt{B}} z_{mE+i}^k \bigr)$ is a normally distributed random variable such that $u_m \sim \mathcal{N}\bigl( b(\overline{w}_m), \Sigma(\overline{w}_m) \bigr)$. Both $b$ and $\Sigma$ are functions of $\overline{w}_m$.
We further discuss Assumption 3.1 in Section 5. We can then write the evolution of iterates of the global server weights as
\[
\overline{w}_{m+1} = \overline{w}_m - h \tilde{\eta}_m\, b(\overline{w}_m) - h \tilde{\eta}_m\, \sigma(\overline{w}_m)\, \zeta_m, \tag{1}
\]
where $\zeta_m \sim \mathcal{N}(0, I)$ and $\sigma(\overline{w}_m) \sigma(\overline{w}_m)^\top = \Sigma(\overline{w}_m)$.
According to the FedAvg server update schedule, the global server weights only change every $E$ iterations as indexed by $m$; thus $\overline{w}_{m+1} - \overline{w}_m$ in (1) represents the difference in a single update of the server weights. So, we can view (1) as the discretization of an SDE using the Euler-Maruyama method with a step size of $h$. This discretization is more accurate when $h$ is small. We now form the SDE as
\[
d\overline{W}_t = -\tilde{\eta}(t)\, b(\overline{W}_t)\, dt + \tilde{\eta}(t) \sqrt{h}\, \sigma(\overline{W}_t)\, dB_t, \tag{2}
\]
where $B_t$ is a standard Brownian motion, $\tilde{\eta}(t)$ is the continuous-time learning rate satisfying $\tilde{\eta}(mh) = \tilde{\eta}_m$, and $b$ and $\sigma$ are the drift and volatility from Assumption 3.1, rewriting the drift in continuous time as
\[
b(\overline{W}_t) = \sum_{k=1}^{N} p_k \sum_{i=0}^{E-1} \eta_i\, \nabla F_k\bigl( w_i^k(\overline{W}_t) \bigr),
\]
where $w_i^k(\overline{W}_t)$ is the $i$-th local SGD iterate on client $k$ initialized at $\overline{W}_t$ and $\eta_i$ is the client learning rate, for all clients $k$.
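The Euler-Maruyama view can be made concrete with a short simulation. In the sketch below, the drift and diffusion are illustrative stand-ins built from a two-client quadratic example, not the exact coefficients derived above:

```python
import numpy as np

def euler_maruyama(drift, diffusion, w0, h, num_steps, rng):
    """Simulate dW = drift(W, t) dt + diffusion(W, t) dB with step size h."""
    w = np.array(w0, dtype=float)
    for m in range(num_steps):
        t = m * h
        dB = np.sqrt(h) * rng.standard_normal(w.shape)  # Brownian increment over [t, t+h]
        w = w + drift(w, t) * h + diffusion(w, t) * dB
    return w

# Illustrative stand-in coefficients: two clients with 1-D quadratic losses,
# drift pulling toward the weighted average of the client minima.
eta, E = 0.1, 5
p, c = np.array([0.5, 0.5]), np.array([1.0, -1.0])
drift = lambda w, t: -eta * E * np.sum(p * (w - c))  # aggregated pseudo-gradient stand-in
diffusion = lambda w, t: eta * np.sqrt(E) * 0.3      # constant volatility stand-in
w_final = euler_maruyama(drift, diffusion, w0=[3.0], h=0.1, num_steps=500,
                         rng=np.random.default_rng(2))
print(w_final)  # fluctuates around the global minimum (here 0)
```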
4 Convergence Proofs
We use the formulation of (2) from Section 3, along with tools similar to those used in (Orvieto & Lucchi, 2019), to form convergence proofs for FedAvg in various settings. While the techniques used in convergence proofs for continuous-time SGD are similar to our approach for FedAvg, the FedAvg case is considerably more complicated.
Assumption 4.1.
The functions $F_k$ are each $L$-smooth. We require each $F_k$ to be twice differentiable across its entire domain and $-L I \preceq H_k(w)$ and $H_k(w) \preceq L I$ over the entire domain, where $H_k$ is the Hessian of $F_k$.
Assumption 4.2.
The learning rates on each client are the same, may depend on $t$, and can be written as $\eta_t^k = \eta(t)$ for all $k$.
Assumption 4.3.
We assume that the variances of the server updates are bounded for all $t$. More precisely, $\|\Sigma(\overline{W}_t)\|_2 \le \sigma_{\max}^2$, where $\|\cdot\|_2$ is the spectral norm.
Assumption 4.4.
We assume $\Sigma_k(w_t^k) = \Sigma_k$, where $\Sigma_k$ does not vary over iterations and may be different for each client $k$.
Assumption 4.4 is similar to the assumptions made in the continuous-time SGD literature (Mandt et al., 2015).
4.1 Non-convex loss functions
We prove convergence to a stationary point for general, non-convex loss functions.
Theorem 4.5.
Proof.
Proof provided in Section A.3. ∎
Theorem 4.5 provides a general expression that can be used to easily derive convergence rates for a variety of learning rates $\eta(t)$. We obtain more concrete convergence rates for specific choices of $\eta(t)$ and asymptotic rates for intervals of the decay exponent in Corollary 4.6.
Corollary 4.6.
For a random time point that follows the distribution , and with , we have
for , we have
and with , we have
Proof.
Proof provided in Section A.4. ∎
Next, we show in Corollary 4.7 that convergence to a stationary point requires a decreasing learning rate on the clients, $\eta(t)$. Interestingly, a decreasing global server learning rate, $\tilde{\eta}(t)$, by itself is likely not sufficient for convergence. This is because the bound on the expected client drift from Lemma A.1 in the appendix depends on the client-side learning rate $\eta(t)$.
Corollary 4.7.
For a random time point that follows the distribution and with constant client learning rate and decreasing global server-side learning rate , we have
Proof.
Proof provided in Section A.5. ∎
Corollary 4.7 shows that despite a decreasing server-side learning rate, FedAvg is only guaranteed to converge to a neighborhood of a stationary point whose size depends on the constant client-side learning rate $\eta$. The algorithm might still converge to a stationary point, since we provide only an upper bound.
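A small simulation (a toy construction of our own with two heterogeneous quadratic clients, not part of the original analysis) illustrates this behavior: with a constant client learning rate and a decaying server learning rate, the server weights stall at a client-drift-dependent distance from the true optimum.

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([1.0, 4.0])           # heterogeneous curvatures, F_k(w) = 0.5*a_k*(w - c_k)^2
c = np.array([2.0, -2.0])
w_star = np.sum(a * c) / np.sum(a) # global minimum of F = mean of the client losses

w, E, eta_client = 5.0, 10, 0.2    # constant client learning rate
for r in range(1, 5001):
    eta_server = 1.0 / np.sqrt(r)  # decaying server-side learning rate
    local = []
    for ak, ck in zip(a, c):
        wk = w
        for _ in range(E):         # E local SGD steps; each client drifts to its own minimum
            wk -= eta_client * (ak * (wk - ck) + 0.05 * rng.standard_normal())
        local.append(wk)
    w += eta_server * (np.mean(local) - w)

grad_norm = abs(np.mean(a * (w - c)))  # |F'(w)| for the global objective
print(w, w_star, grad_norm)            # w stalls away from w_star: client-drift bias persists
```

Shrinking `eta_client` in this sketch shrinks the final gap between `w` and `w_star`, consistent with the neighborhood in Corollary 4.7 depending on the constant client-side learning rate.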
4.2 Weakly quasi-convex loss functions
We prove convergence to the global minimum for weakly quasi-convex loss functions as defined in Assumption 4.8.
Assumption 4.8.
Function $F$ is weakly quasi-convex if for some $\tau > 0$ and global minimizer $w^*$,
\[
\tau \bigl( F(w) - F(w^*) \bigr) \le \bigl\langle \nabla F(w),\, w - w^* \bigr\rangle
\]
holds for every $w$.
This class of functions is more general than the convex functions studied in previous discrete analyses of FedAvg. In fact, weakly quasi-convex functions can have locally concave regions, and it has been shown that some non-convex LSTMs induce weakly quasi-convex loss functions (Hardt et al., 2018). To the best of our knowledge, this is the first work to provide convergence results of FedAvg for weakly quasi-convex loss functions.
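As a concrete one-dimensional illustration (our own example, not drawn from the FedAvg literature; the constant below was checked numerically, and we do not claim it is tight), consider
\[
f(x) = x^2 + 3 \sin^2 x, \qquad x^* = 0,
\]
which satisfies the weak quasi-convexity inequality
\[
\tau \bigl( f(x) - f(x^*) \bigr) \le f'(x)\,(x - x^*) = 2x^2 + 3x \sin(2x) \quad \text{for } \tau = 0.4,
\]
yet is not convex, since $f''(x) = 2 + 6 \cos(2x) < 0$ on intervals such as $x \in (1, 2)$.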
Theorem 4.9.
Proof.
Proof provided in Section A.6. ∎
Theorem 4.9 provides a general expression that can be used to easily derive convergence rates for a variety of learning rates $\eta(t)$. We obtain more concrete convergence rates for a specific choice of $\eta(t)$ in Corollary 4.10.
Corollary 4.10.
If we choose a random time point that follows the distribution , the server learning rate is a constant , and for we have
Proof.
Proof provided in Section A.7. ∎
5 On the Assumption of Normally Distributed Server Updates
In order for the SDE specified in (2) to be an Itô process, we require the aggregated server update to be normally distributed (Assumption 3.1). We show in this section that this is a reasonable assumption in many cases, such as in the regime of a very large number of clients.
Large IID Client Setting Assuming data is independent and identically distributed across clients, the client updates are identically distributed as well, and assuming bounded variance (Assumption 4.3), the conditions for the traditional central limit theorem are met. Thus, as the number of clients goes to infinity, the aggregated server update tends to a normal distribution.
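A quick numerical sketch (our own construction; it assumes SciPy is available) illustrates this convergence in distribution: the Kolmogorov-Smirnov distance between the standardized sum of client updates and a standard normal shrinks as the number of clients grows. Swapping in per-client distributions with different scales gives the analogous non-IID (Lyapunov) picture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def standardized_server_update(N, trials=20_000):
    """Standardized sum of N IID client updates (centered exponential noise)."""
    updates = rng.exponential(scale=1.0, size=(trials, N)) - 1.0  # mean 0, variance 1
    return updates.sum(axis=1) / np.sqrt(N)

for N in (2, 10, 100):
    z = standardized_server_update(N)
    # Kolmogorov-Smirnov distance to N(0, 1) shrinks as N grows.
    print(N, stats.kstest(z, "norm").statistic)
```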
Large non-IID Client Setting We now show that we can approximate as a normal random variable under certain conditions even if the data is not identically distributed across clients. The proof hinges on the Lyapunov central limit theorem which generalizes the traditional central limit theorem to cases where random variables are not identically distributed (Billingsley, 2012).
Assumption 5.1.
The covariance matrix $\Sigma_k$ is a diagonal matrix for each client $k$.
Assumption 5.2.
We require
holds for all clients and coordinates , where is the client-side learning rate on client . Values and are the -th coordinates of vectors and , respectively, where and .
Assumption 5.2 requires that, for all clients indexed by $k$, the square of the second moment of the client update be greater than the square of the difference of the second moments averaged over all clients. This essentially requires the second moments to not be too different across different clients.
Assumption 5.3.
The fourth-order mixed moments of and are bounded for all clients . More formally, we require for all clients , coordinates , and with .
We note that Assumption 5.3 is guaranteed to hold if and have bounded support, which is usually the case in practice because gradients are typically clipped.
Theorem 5.4.
Under Assumptions 5.1, 5.2, and 5.3, the server weight update converges in distribution to a normal random variable as the number of clients $N \to \infty$.
Proof.
The proof uses the Lyapunov Central Limit Theorem, which is more flexible than the standard Central Limit Theorem and allows for non-identically distributed random variables as long as the Lyapunov condition is met. The full proof is provided in Section A.8. ∎
Section 6 further demonstrates the normality assumption being met when each client’s loss landscape is a quadratic form.
6 Analysis in the Quadratic Case
Quadratic Client Loss Landscape We can show that the server updates in FedAvg are normally distributed if we assume the loss landscape of each client follows a different quadratic form.
Assumption 6.1.
Each client’s loss landscape is for some and .
With Assumption 6.1, the loss landscape on the server is $F(w) = \sum_{k=1}^{N} p_k F_k(w)$. This global loss function can be rewritten as
\[
F(w) = \frac{1}{2} (w - \overline{c})^\top \overline{A}\, (w - \overline{c}) + \text{const},
\]
where $\overline{A} = \sum_{k=1}^{N} p_k A_k$ and $\overline{c} = \overline{A}^{-1} \sum_{k=1}^{N} p_k A_k c_k$, which is clearly another quadratic form.
Theorem 6.2.
Under Assumption 6.1 and the normally distributed gradient noise model of Section 3, after any number $E$ of local SGD iterations, the client weights $w_E^k$ are normally distributed.
Proof.
Proof provided in Section A.9. ∎
Theorem 6.2 shows that no matter how many local updates are performed on a client, the resulting final update to the weights is normally distributed. Therefore, after averaging, the updates to the global weights are also normally distributed.
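A minimal sketch of why this holds, in our reconstructed notation and assuming a constant client learning rate $\eta$: on the quadratic loss of Assumption 6.1, each local SGD step is an affine map of the current weights plus independent Gaussian noise,
\[
w_{t+1}^k = (I - \eta A_k)\, w_t^k + \eta A_k c_k - \frac{\eta}{\sqrt{B}}\, z_t^k, \qquad z_t^k \sim \mathcal{N}(0, \Sigma_k),
\]
so unrolling $E$ steps gives
\[
w_E^k = (I - \eta A_k)^E\, \overline{w}_0 + \sum_{i=0}^{E-1} (I - \eta A_k)^{E-1-i} \Bigl( \eta A_k c_k - \frac{\eta}{\sqrt{B}}\, z_i^k \Bigr),
\]
an affine function of independent Gaussians, which is therefore Gaussian; averaging Gaussian client weights across clients preserves normality.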
Generalization Analysis of Single Variate Quadratic Loss Landscape
Using the evolution of local weights for quadratic loss functions provided in Theorem 6.2, assuming we have the single variate case, and using the same procedure for building the SDE as in Section 3, we obtain the SDE for the global weights as
(4)
Theorem 6.3.
The solution of (4) is normally distributed, with mean and variance given by the solutions of the corresponding mean and variance ODEs (see Section A.11).
Proof.
Proof is provided in Section A.11. ∎
We are interested in how various hyperparameters influence the expected loss and the variance of the solution distribution shown in Theorem 6.3. In the limit as the client learning rate $\eta \to 0$ and the step size $h \to 0$, the solution approaches $\overline{c}$, which is the true global minimum, with zero variance. This matches the behavior of SGD: as the learning rate is decreased, the expected loss approaches the true minimum, but the variance of the solution decreases, and thus the process may not be able to escape poor local minima (Bishop, 1995; Jastrzebski et al., 2018).
As the number of local iterations before communication, $E$, grows, the mean of the solution diverges from the true solution. Furthermore, as $E$ grows, the variance of the solution increases. This causes the expected solution to have a high bias with respect to the true solution, but an increase in $E$ may allow FedAvg to escape poorly-generalizing local minima. This demonstrates the trade-off between expected loss and the ability to escape poor local minima.
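The following simulation sketch (a toy two-client setup of our own design, not the closed-form expressions of Theorem 6.3) illustrates this trade-off empirically: the long-run bias of the server weights grows with $E$, while the long-run variance grows with the client learning rate.

```python
import numpy as np

def stationary_stats(eta, E, rounds=20_000, noise=0.3, seed=0):
    """Empirical long-run mean/variance of 1-D FedAvg server weights on two quadratics."""
    rng = np.random.default_rng(seed)
    a, c = np.array([1.0, 4.0]), np.array([1.0, -1.0])  # client curvatures and minima
    w, trace = 0.0, []
    for r in range(rounds):
        local = []
        for ak, ck in zip(a, c):
            wk = w
            for _ in range(E):                  # E noisy local SGD steps per round
                wk -= eta * (ak * (wk - ck) + noise * rng.standard_normal())
            local.append(wk)
        w = np.mean(local)                      # plain averaging, server lr = 1
        if r > rounds // 2:                     # discard burn-in before measuring
            trace.append(w)
    return np.mean(trace), np.var(trace)

w_star = (1.0 * 1.0 + 4.0 * -1.0) / 5.0         # true global minimum of the averaged loss
for eta, E in [(0.05, 1), (0.05, 10), (0.2, 1), (0.2, 10)]:
    m, v = stationary_stats(eta, E)
    print(f"eta={eta}, E={E}: bias={m - w_star:+.3f}, var={v:.4f}")
```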
This reasoning demonstrates how our proposed continuous-time limit of FedAvg can be used to explore behaviors of FL algorithms such as generalization ability. We leave further exploration of these ideas to future work, such as exploring the first exit time of the continuous-time formulation of FedAvg from a local minimum, similar to the existing work in SGD (Nguyen et al., 2019).
7 Conclusion
We have developed a continuous-time SDE that models the evolution of server weights during FedAvg. Using this formulation, we prove convergence to stationary points for general non-convex loss functions and are the first to show global convergence for weakly quasi-convex loss functions. We discuss the various assumptions required for the SDE to be an Itô process and demonstrate that these assumptions are reasonable. We demonstrate that the updates to the server weights can be approximated as a normal random variable in many cases, even if data is not identically distributed across clients. Finally, we provide a generalization analysis using our continuous-time formulation and quadratic client losses. Our SDE formulation serves as a framework for future efforts in analyzing FedAvg and other FL algorithms from the continuous-time perspective.
Impact Statement
This paper is focused on mathematically analyzing the underlying behavior of federated learning from a new perspective.
References
- Billingsley (2012) Billingsley, P. Probability and Measure. Wiley Series in Probability and Statistics. Wiley, 2012.
- Bishop (1995) Bishop, C. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
- Eriksson et al. (2004) Eriksson, K., Johnson, C., and Estep, D. Vector-Valued Functions of Several Real Variables, pp. 789–814. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
- Glasgow et al. (2022) Glasgow, M. R., Yuan, H., and Ma, T. Sharp bounds for federated averaging (local SGD) and continuous perspective. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 9050–9090. PMLR, 28–30 Mar 2022.
- Hardt et al. (2018) Hardt, M., Ma, T., and Recht, B. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(29):1–44, 2018.
- Jastrzebski et al. (2018) Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three factors influencing minima in SGD, 2018. URL https://arxiv.org/abs/1711.04623.
- Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
- Karimireddy et al. (2020) Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5132–5143. PMLR, 13–18 Jul 2020.
- Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
- Krichene & Bartlett (2017) Krichene, W. and Bartlett, P. L. Acceleration and averaging in stochastic descent dynamics. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Li et al. (2019) Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2019.
- Mandt et al. (2015) Mandt, S., Hoffman, M. D., Blei, D. M., et al. Continuous-time limit of stochastic gradient descent revisited. Advances in Neural Information Processing Systems, 2015.
- McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1273–1282. PMLR, 20–22 Apr 2017.
- Mehrjou (2021) Mehrjou, A. Federated learning as a mean-field game, 2021. URL https://arxiv.org/abs/2107.03770.
- Mertikopoulos & Staudigl (2018) Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, 2018.
- Movellan (2011) Movellan, J. R. Tutorial on stochastic differential equations. MPLab Tutorials Version, 6, 2011.
- Nguyen et al. (2019) Nguyen, T. H., Simsekli, U., Gurbuzbalaban, M., and Richard, G. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Orvieto & Lucchi (2019) Orvieto, A. and Lucchi, A. Continuous-time models for stochastic optimization algorithms. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Xue et al. (2022) Xue, Y., Klabjan, D., and Luo, Y. Aggregation delayed federated learning. In 2022 IEEE International Conference on Big Data (Big Data), pp. 85–94, 2022.
- Zhang et al. (2023) Zhang, X., Hong, M., and Elia, N. Understanding a class of decentralized and federated optimization algorithms: A multirate feedback control perspective. SIAM Journal on Optimization, 33(2):652–683, 2023.
Appendix A Proofs
For a nice review of the mathematical preliminaries of SDEs, see the starting sections of the Supplementary Materials in (Orvieto & Lucchi, 2019).
We introduce a bound on the expected drift between client weights and the server weights. Lemma A.1 is used in the proofs of Theorems 4.5 and 4.9. The proof follows similar ideas as (Li et al., 2019) but has several key differences due to the continuous-time setting.
Lemma A.1.
A.1 Proof of Lemma A.1
Proof.
We first examine the form of as
Now we form the bound
∎
A.2 Other Lemmas
We first include some helpful lemmas that have been proved in other works or are straightforward to show.
Lemma A.2.
For two symmetric, $d \times d$ matrices $A$ and $B$, we have the following hold
Proof.
Lemma A.3.
The probability density function $p(t) = \eta(t) \big/ \int_0^T \eta(s)\, ds$ defined over support $[0, T]$ is a valid probability density function.
Proof.
We define the learning rate as strictly positive, and thus the resulting probability density function is non-negative. We also show that
\[
\int_0^T p(t)\, dt = \frac{\int_0^T \eta(t)\, dt}{\int_0^T \eta(s)\, ds} = 1.
\]
Therefore, the probability density function integrated over its entire support is 1, and the necessary conditions of a probability density function are met. ∎
Lemma A.4.
The drift term, $b$, in Equation 2 is Lipschitz continuous.
Proof.
We require Assumption 4.1. The drift term is a vector-valued function, and it suffices to prove Lipschitz continuity for each component of the function to prove Lipschitz continuity of the whole vector-valued function (Eriksson et al., 2004). Furthermore, it is well known that an everywhere differentiable function is Lipschitz continuous if and only if the absolute value of its derivative is bounded by a finite value over the entire input domain. We prove this lemma by combining these two facts and proving the bounded derivative component-wise for the drift $b$.
where , , , and for all .
We examine at the component-level and index for the -th component. We use the notation . We also change to for readability when using the ′ derivative notation. We have
Now we need to show for some finite . Note, the derivative is with respect to , but the function arguments are , so chain rule needs to be applied consecutively. We have
Using a new local iteration, we have that
Combining we get
Since , we can expand as
Returning to the original expression for we have
We now bound the absolute value of this derivative as
This holds for all components; therefore, we can conclude the drift term, $b$, is Lipschitz continuous.
∎
A.3 Proof of Theorem 4.5
Proof.
The proof structure is inspired by the structure of the continuous-time proofs for standard SGD, but it has several additional complexities due to the averaging of client weights to determine server weights after a certain amount of time. The main steps are finding a suitable energy function of the stochastic process, bounding the infinitesimal diffusion generator, and using Dynkin's formula to complete the proof.
We first write out as
A good resource for reviewing the background mathematics used in this proof is in the Supplementary materials of (Orvieto & Lucchi, 2019).
Using a similar approach as in (Orvieto & Lucchi, 2019), we define an appropriate energy function and then use Dynkin's formula to complete the proof. We are able to use all of the machinery for Itô diffusions because our SDE is an Itô diffusion. In particular, we show in Lemma A.4 that the drift term is Lipschitz continuous, and the proof of Lemma 1 in the supplementary materials of (Orvieto & Lucchi, 2019) proves Lipschitz continuity of the volatility matrix $\sigma$.
We define $\nabla \mathcal{E}$ as the vector of first derivatives of the energy function $\mathcal{E}$ with respect to each component of its argument and $\nabla^2 \mathcal{E}$ as the matrix of second partial derivatives. These are just the gradient and Hessian with respect to the arguments of the energy function, but we use different notation to prevent confusion with $\nabla F$, which is the gradient of the loss function. The infinitesimal generator for an Itô diffusion of the form $dX_t = \mu(X_t, t)\, dt + \sigma(X_t, t)\, dB_t$ is defined as
\[
\mathcal{A} \mathcal{E}(x) = \bigl\langle \mu(x, t), \nabla \mathcal{E}(x) \bigr\rangle + \tfrac{1}{2} \operatorname{tr}\bigl( \sigma(x, t)\, \sigma(x, t)^\top \nabla^2 \mathcal{E}(x) \bigr).
\]
Following (Orvieto & Lucchi, 2019) and using Itô's lemma and the definition of an Itô diffusion, we obtain Dynkin's formula as
\[
\mathbb{E}\bigl[ \mathcal{E}(t, X_t) \bigr] = \mathcal{E}(0, X_0) + \mathbb{E}\Bigl[ \int_0^t \Bigl( \frac{\partial \mathcal{E}}{\partial s}(s, X_s) + \mathcal{A} \mathcal{E}(s, X_s) \Bigr)\, ds \Bigr]. \tag{5}
\]
Dynkin’s formula is vital to our convergence proofs along with defining the correct energy function .
We choose the energy function as $\mathcal{E}(\overline{W}) = F(\overline{W})$. It is important to note that we define the energy function with just the argument $\overline{W}$ rather than both $t$ and $\overline{W}$; thus the first term inside the integral in Dynkin's formula is not applicable. This follows the same approach as (Orvieto & Lucchi, 2019).
We then bound the expectation of the stochastic integral of the infinitesimal generator of the process as
We first bound (for simplicity we assume for all clients) as
We first define a function such that .
We then bound as
It follows that
We return to our bound on as
Now we bound as
Using Lemma A.2, we can bound the trace of a product of symmetric matrices by the product of spectral norms. Both and are symmetric by their construction. So we have
We return to the bound on the infinitesimal generator as
Then we use Dynkin’s Formula (Equation 5) and substitute to obtain the following expression
(6)
We follow a similar approach as (Orvieto & Lucchi, 2019) and define the normalization $\varphi(T) = \int_0^T \eta(s)\, ds$. We can then define the distribution given by the probability density function $p(t) = \eta(t) / \varphi(T)$; Lemma A.3 shows this is a valid probability density function. We define a random variable $\tau$ with probability density function $p$. We then use a similar trick to (Johnson & Zhang, 2013) and (Orvieto & Lucchi, 2019) and use the law of the unconscious statistician to obtain
(7)
where is the expectation over the random time point .
where is the expectation over the random time point and the stochastic gradients, .
We notice that , so we can safely drop this term from the inequality.
So we can rewrite Equation 8 as
∎
A.4 Proof of Corollary 4.6
Proof.
We first examine the case where .
We have
Therefore we find
Now we examine the second choice of learning rate. We have , and we find
Now we examine the asymptotic rates. This proof is the same as in (Orvieto & Lucchi, 2019). The general form of the inequality is
The first term on the RHS of the inequality is the deterministic term and has rate for and for . The stochastic term is for , for , and for .
∎
A.5 Proof of Corollary 4.7
Proof.
We examine the convergence for the choice of client learning rate, , and server learning rate, . Most of the proof follows the same steps as the proof for Theorem 4.5, but we return to the inequality of the infinitesimal generator as
We use Dynkin’s Formula (Equation 5) to obtain
(9)
Now we follow steps similar to the final steps of the proof for Theorem 4.5.
We notice that , so we can safely drop this term from the inequality.
So we can rewrite Equation 10 as
Now we substitute and get
∎
A.6 Proof of Theorem 4.9
Proof.
This proof follows similarly to the proof of Theorem 4.5 where we find a suitable energy function, bound the infinitesimal generator, then use Dynkin’s Formula to complete the bound.
For this case, we choose the energy function as $\mathcal{E}(\overline{W}) = \tfrac{1}{2} \| \overline{W} - w^* \|^2$, where $w^*$ is the global minimum.
We then bound the expectation of the stochastic integral of the infinitesimal generator of the process as
First examine as
We define such that
(11)
Now we bound as
Now we need to find a bound on . Specifically, we need to bound . We bound as
We assume a constant server learning rate $\tilde{\eta}$. We examine this term as follows
Therefore we have the following
Returning back to , we have
Returning to Equation 11, it follows that
Returning back to the original bound on , we have
Now we examine as
(12)
(13)
From Lemma 3 in the supplementary materials of (Orvieto & Lucchi, 2019) and the same approach that we use in the proof of Theorem 4.5, we have that , where $d$ is the dimensionality of our weights. Therefore we have
(14)
We return to the bound on the infinitesimal generator as
As in the proof of Theorem 4.5, we now use Dynkin’s Formula to get
We examine the term with as
and we can make this switch because . From the multivariate Itô isometry and Jensen’s inequality for concave functions (such as the square root function), we have that
where $\|\cdot\|_F$ is the Frobenius norm.
Returning to the bound from Dynkin’s Formula we have
Rearranging we obtain
Using the same trick as in the proof of Theorem 4.5, where we create random variable with probability density function , we can substitute
where .
We finally find
∎
A.7 Proof of Corollary 4.10
Proof.
We have .
We examine the case where as
∎
A.8 Proof of Theorem 5.4
Proof.
Lyapunov CLT We first describe the Lyapunov Central Limit Theorem. Assume we have a sequence of random variables $X_1, \dots, X_N$ which are independent but not necessarily identically distributed. These random variables have possibly different means $\mu_k$ and variances $\sigma_k^2$. To satisfy the Lyapunov condition, we need to show that for some $\delta > 0$
\[
\lim_{N \to \infty} \frac{1}{s_N^{2+\delta}} \sum_{k=1}^{N} \mathbb{E}\bigl[ |X_k - \mu_k|^{2+\delta} \bigr] = 0, \tag{15}
\]
where $s_N^2 = \sum_{k=1}^{N} \sigma_k^2$. If the condition holds, then $\frac{1}{s_N} \sum_{k=1}^{N} (X_k - \mu_k)$ converges in distribution to $\mathcal{N}(0, 1)$.
The problem we are studying is of , which is formulated as
(16)
where and .
For simplicity, the proof below assumes the weights are one-dimensional. However, from Assumption 5.1, we have that $\Sigma_k$ is diagonal. This means we can apply this approach to each component of the weight vector, and thus each component of the update tends to a normal distribution; the components are independent of each other because $\Sigma_k$ is diagonal. Since the components are independent and normally distributed, the update tends to a multivariate normal distribution.
To satisfy the Lyapunov condition, we need to show that for some choice of $\delta$ the following holds
We choose $\delta = 2$ and examine the Lyapunov ratio as
We have that .
We can use the following formula which states that for positive ,
We quickly prove this equation as follows
Using this formula and substituting we get
From our assumptions on the similarity of variance between clients, we have for some
We have
Since the Lyapunov ratio is non-negative and is bounded above by a quantity that tends to $0$ as $N \to \infty$, by the squeeze theorem we know that the ratio tends to $0$. Thus, the Lyapunov condition holds.
∎
A.9 Proof of Theorem 6.2
Proof.
We wish to show that after $E$ local iterations of SGD, the local weights evolve as
First observe the evolution for a few iterations as
Extend to $E$ time steps forward as
(17)
In general we have
Starting from , we have
Now for , we have
Expanding out to , we have
(18)
∎
A.10 Global loss function form for quadratic assumption
With the quadratic assumption, we can reformat the equation for the server loss function as
We must find . We have that , and . We find
Now we solve for as
Therefore, the server loss function is also represented by a quadratic form with local minimum at $\overline{c}$ and Hessian $\overline{A}$.
A.11 Proof of Theorem 6.3
For a linear SDE of the form $dX_t = (a X_t + b)\, dt + c\, dB_t$, the solution is normally distributed as $X_t \sim \mathcal{N}\bigl( m(t), v(t) \bigr)$. The mean is the solution of the ordinary differential equation (ODE) $m'(t) = a\, m(t) + b$ with initial condition $m(0) = X_0$. The variance is the solution of the ODE $v'(t) = 2a\, v(t) + c^2$ with initial condition $v(0) = 0$ (Movellan, 2011). These ODEs are straightforward to solve and we obtain the solution in Theorem 6.3.
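Under the scalar reconstruction above with $a < 0$ (stable drift), these ODEs admit the closed forms
\[
m(t) = -\frac{b}{a} + \Bigl( X_0 + \frac{b}{a} \Bigr) e^{a t},
\qquad
v(t) = \frac{c^2}{2|a|} \bigl( 1 - e^{2 a t} \bigr),
\]
so as $t \to \infty$ the solution tends to $\mathcal{N}\bigl( -b/a,\; c^2 / (2|a|) \bigr)$.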