
Tianyi Chen, [email protected], Rensselaer Polytechnic Institute
Ziye Guo, [email protected], Rensselaer Polytechnic Institute
Yuejiao Sun, [email protected], University of California, Los Angeles
Wotao Yin, [email protected], University of California, Los Angeles

CADA: Communication-Adaptive Distributed Adam

Abstract

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method, justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication uploads. The new algorithms adaptively reuse stale Adam gradients, thus saving communication, and still enjoy convergence rates comparable to the original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of reducing the total number of communication rounds.

1 Introduction

The stochastic gradient descent (SGD) method [Robbins and Monro(1951)] has been prevalent in solving large-scale machine learning problems over the last decades. Although simple to use, plain-vanilla SGD is often sensitive to the choice of hyper-parameters and sometimes suffers from slow convergence. Among various efforts to improve SGD, adaptive methods such as AdaGrad [Duchi et al.(2011)Duchi, Hazan, and Singer], Adam [Kingma and Ba(2014)] and AMSGrad [Reddi et al.(2018)Reddi, Kale, and Kumar] have well-documented empirical performance, especially in training deep neural networks.

To achieve “adaptivity,” these algorithms adaptively adjust the update direction or tune the learning rate, or a combination of both. While adaptive SGD methods have been mostly studied in the setting where data and computation are centralized in a single node, their performance in the distributed learning setting is less understood. As this setting brings new challenges to machine learning, can we add an additional dimension of adaptivity to Adam in this regime?

In this context, we aim to develop a fully adaptive SGD algorithm tailored for distributed learning. We consider a setting composed of a central server and a set of $M$ workers in ${\cal M}:=\{1,\ldots,M\}$, where each worker $m$ has local data $\xi_{m}$ drawn from a distribution $\Xi_{m}$. Workers may have different data distributions $\{\Xi_{m}\}$, and they collaboratively solve the following problem

$$\min_{\theta\in\mathbb{R}^{p}}~{\cal L}(\theta)=\frac{1}{M}\sum_{m\in{\cal M}}{\cal L}_{m}(\theta)\quad{\rm with}\quad{\cal L}_{m}(\theta):=\mathbb{E}_{\xi_{m}}\left[\ell(\theta;\xi_{m})\right],~m\in{\cal M}\tag{1}$$

where $\theta\in\mathbb{R}^{p}$ is the sought variable and $\{{\cal L}_{m},m\in{\cal M}\}$ are smooth (but not necessarily convex) functions. We focus on the setting where the local data $\xi_{m}$ at each worker $m$ cannot be uploaded to the server, so collaboration is needed through communication between the server and the workers. This setting often emerges due to data privacy concerns, e.g., federated learning [McMahan et al.(2017a)McMahan, Moore, Ramage, Hampson, and y Arcas, Kairouz et al.(2019)Kairouz, McMahan, Avent, Bellet, Bennis, Bhagoji, Bonawitz, Charles, Cormode, Cummings, et al.].

To solve (1), we can in principle apply the single-node version of adaptive SGD methods such as Adam [Kingma and Ba(2014)]: At iteration $k$, the server broadcasts $\theta^{k}$ to all workers; each worker $m$ computes $\nabla\ell(\theta^{k};\xi_{m}^{k})$ using a randomly selected sample or a minibatch of samples $\{\xi_{m}^{k}\}\sim\Xi_{m}$, and then uploads it to the server; and once receiving the stochastic gradients from all workers, the server simply uses the aggregated stochastic gradient $\bar{\bm{\nabla}}^{k}=\frac{1}{M}\sum_{m\in{\cal M}}\nabla\ell(\theta^{k};\xi_{m}^{k})$ to update the parameter via plain-vanilla single-node Adam. When $\nabla\ell(\theta^{k};\xi_{m}^{k})$ is an unbiased gradient of $\mathcal{L}_{m}(\theta)$, the convergence of this distributed implementation of Adam follows from the original analyses [Reddi et al.(2018)Reddi, Kale, and Kumar, Chen et al.(2019)Chen, Liu, Sun, and Hong]. To implement this, however, all the workers have to upload fresh $\{\nabla\ell(\theta^{k};\xi_{m}^{k})\}$ at each iteration. This prevents the efficient implementation of Adam in scenarios where the communication uplink and downlink are not symmetric, and communication, especially uploads from the workers to the server, is costly; e.g., cellular networks [Park et al.(2019)Park, Samarakoon, Bennis, and Debbah]. Therefore, our goal is to endow an additional dimension of adaptivity to Adam for solving the distributed problem (1). In short, on top of its adaptive learning rate and update direction, we want Adam to be communication-adaptive.
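For concreteness, the following is a minimal sketch of this naive distributed Adam baseline (our illustration, with bias correction omitted for brevity); the callback `worker_grad(m, theta)` is a hypothetical stand-in for worker $m$ computing $\nabla\ell(\theta;\xi_{m}^{k})$ on a fresh minibatch and uploading it.

```python
import numpy as np

def naive_distributed_adam(theta, worker_grad, M, K, alpha=1e-3,
                           beta1=0.9, beta2=0.999, eps=1e-8):
    """Baseline: every worker uploads a fresh stochastic gradient each round."""
    h = np.zeros_like(theta)  # exponentially weighted gradient (momentum)
    v = np.zeros_like(theta)  # exponentially weighted squared gradient
    for _ in range(K):
        # server broadcasts theta; all M workers upload fresh gradients
        g_bar = sum(worker_grad(m, theta) for m in range(M)) / M
        h = beta1 * h + (1 - beta1) * g_bar
        v = beta2 * v + (1 - beta2) * g_bar ** 2
        theta = theta - alpha * h / np.sqrt(eps + v)
    return theta
```

Note that every one of the $M$ workers uploads at every one of the $K$ iterations, which is exactly the communication burden CADA is designed to reduce.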

1.1 Related work

To put our work in context, we review prior contributions that we group in two categories.

1.1.1 SGD with adaptive gradients

A variety of SGD variants have been developed recently, including momentum and acceleration [Polyak(1964), Nesterov(1983), Ghadimi and Lan(2016)]. However, these methods are relatively sensitive to hyper-parameters such as stepsizes, and require significant effort to find good parameters.

Adaptive learning rate. One limitation of SGD is that it scales the gradient uniformly in all directions by a pre-determined constant or a sequence of constants (a.k.a. learning rates). This may lead to poor performance when the training data are sparse [Duchi et al.(2011)Duchi, Hazan, and Singer]. To address this issue, adaptive learning rate methods have been developed that scale the gradient in an entry-wise manner by using past gradients, which include AdaGrad [Duchi et al.(2011)Duchi, Hazan, and Singer, Ward et al.(2019)Ward, Wu, and Bottou], AdaDelta [Zeiler(2012)] and other variants [Li and Orabona(2019)]. This simple technique has improved the performance of SGD in some scenarios.
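As a concrete instance (in our notation; a common form of the update rather than any specific implementation), AdaGrad scales each coordinate $i$ of the stochastic gradient $g^{k}$ by the root of the accumulated squared past gradients:

```latex
% AdaGrad entry-wise update (a common form; initialization details vary):
\theta_i^{k+1} \;=\; \theta_i^{k} \;-\; \frac{\alpha}{\sqrt{\sum_{j=1}^{k}\big(g_i^{j}\big)^{2}}+\epsilon}\, g_i^{k},
\qquad i=1,\ldots,p.
```

Coordinates with large historical gradients thus receive small effective stepsizes, and rarely updated (sparse) coordinates receive large ones.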

Adaptive SGD. Adaptive SGD methods achieve the best of both worlds by updating the search directions and the learning rates simultaneously using past gradients. Adam [Kingma and Ba(2014)] and AMSGrad [Reddi et al.(2018)Reddi, Kale, and Kumar] are the representative ones in this category. While these methods are simple to use, analyzing their convergence is challenging [Reddi et al.(2018)Reddi, Kale, and Kumar, Wang et al.(2020a)Wang, Lu, Tu, and Zhang]. Their convergence in the nonconvex setting has been settled only recently [Chen et al.(2019)Chen, Liu, Sun, and Hong, Défossez et al.(2020)Défossez, Bottou, Bach, and Usunier]. However, most adaptive SGD methods are studied in the single-node setting where data and computation are both centralized. Very recently, adaptive SGD has been studied in the shared-memory setting [Xu et al.(2020)Xu, Sutcher-Shepard, Xu, and Chen], where data is still centralized and communication is not adaptive.

1.1.2 Communication-efficient distributed optimization

Popular communication-efficient distributed learning methods belong to two categories: c1) reduce the number of bits per communication round; and, c2) save the number of communication rounds.

For c1), methods are centered around the ideas of quantization and sparsification.
Reducing communication bits. Quantization has been successfully applied to distributed machine learning. One-bit and multi-bit quantization methods have been developed in [Seide et al.(2014)Seide, Fu, Droppo, Li, and Yu, Alistarh et al.(2017)Alistarh, Grubic, Li, Tomioka, and Vojnovic, Tang et al.(2018)Tang, Gan, Zhang, Zhang, and Liu]. More recently, signSGD with majority vote has been developed in [Bernstein et al.(2018)Bernstein, Wang, Azizzadenesheli, and Anandkumar]. Other advances in quantized gradient schemes include error compensation [Wu et al.(2018)Wu, Huang, Huang, and Zhang, Karimireddy et al.(2019)Karimireddy, Rebjock, Stich, and Jaggi], variance-reduced quantization [Zhang et al.(2017)Zhang, Li, Kara, Alistarh, Liu, and Zhang, Horváth et al.(2019)Horváth, Kovalev, Mishchenko, Stich, and Richtárik], and quantization to a ternary vector [Wen et al.(2017)Wen, Xu, Yan, Wu, Wang, Chen, and Li, Reisizadeh et al.(2019)Reisizadeh, Taheri, Mokhtari, Hassani, and Pedarsani]. Sparsification amounts to transmitting only the gradient coordinates whose magnitudes exceed a certain threshold [Strom(2015), Aji and Heafield(2017)]. To avoid losing the information carried by skipped coordinates, small gradient components are accumulated and transmitted once they become large enough [Lin et al.(2018)Lin, Han, Mao, Wang, and Dally, Stich et al.(2018)Stich, Cordonnier, and Jaggi, Alistarh et al.(2018)Alistarh, Hoefler, Johansson, Konstantinov, Khirirat, and Renggli, Wangni et al.(2018)Wangni, Wang, Liu, and Zhang, Jiang and Agrawal(2018), Tang et al.(2020)Tang, Gan, Rajbhandari, Lian, Zhang, Liu, and He]. Other compression methods include low-rank approximation [Vogels et al.(2019)Vogels, Karimireddy, and Jaggi] and sketching [Ivkin et al.(2019)Ivkin, Rothchild, Ullah, Stoica, Arora, et al.]. However, all these methods aim to resolve c1). In some cases, other latencies dominate the bandwidth-dependent transmission latency, which motivates c2).

Reducing communication rounds. One of the most popular techniques in this category is the periodic averaging, e.g., elastic averaging SGD [Zhang et al.(2015)Zhang, Choromanska, and LeCun], local SGD (a.k.a. FedAvg) [McMahan et al.(2017b)McMahan, Moore, Ramage, Hampson, and y Arcas, Lin et al.(2020)Lin, Stich, Patel, and Jaggi, Kamp et al.(2018)Kamp, Adilova, Sicking, Hüger, Schlicht, Wirtz, and Wrobel, Stich(2019), Wang and Joshi(2019), Karimireddy et al.(2020)Karimireddy, Kale, Mohri, Reddi, Stich, and Suresh, Haddadpour et al.(2019)Haddadpour, Kamani, Mahdavi, and Cadambe] or local momentum SGD [Yu et al.(2019)Yu, Jin, and Yang, Wang et al.(2020b)Wang, Tantia, Ballas, and Rabbat]. In local SGD, workers perform local model updates independently and the models are averaged periodically. Therefore, communication frequency is reduced. However, except [Kamp et al.(2018)Kamp, Adilova, Sicking, Hüger, Schlicht, Wirtz, and Wrobel, Wang and Joshi(2019), Haddadpour et al.(2019)Haddadpour, Kamani, Mahdavi, and Cadambe], most of local SGD methods follow a pre-determined communication schedule that is nonadaptive. Some of them are tailored for the homogeneous settings, where the data are independent and identically distributed over all workers. To tackle the heterogeneous case, FedProx has been developed in [Li et al.(2018)Li, Sahu, Zaheer, Sanjabi, Talwalkar, and Smith] by solving local subproblems. For learning tasks where the loss function is convex and its conjugate dual is expressible, the dual coordinate ascent-based approaches have been demonstrated to yield impressive empirical performance [Jaggi et al.(2014)Jaggi, Smith, Takác, Terhorst, Krishnan, Hofmann, and Jordan, Ma et al.(2017)Ma, Konečnỳ, Jaggi, Smith, Jordan, Richtárik, and Takáč]. Higher-order methods have also been considered [Shamir et al.(2014)Shamir, Srebro, and Zhang, Zhang and Lin(2015)]. Roughly speaking, algorithms in [Li et al.(2018)Li, Sahu, Zaheer, Sanjabi, Talwalkar, and Smith, Jaggi et al.(2014)Jaggi, Smith, Takác, Terhorst, Krishnan, Hofmann, and Jordan, Ma et al.(2017)Ma, Konečnỳ, Jaggi, Smith, Jordan, Richtárik, and Takáč, Shamir et al.(2014)Shamir, Srebro, and Zhang, Zhang and Lin(2015)] reduce communication by increasing local gradient computation.

The most related line of work to this paper is the lazily aggregated gradient (LAG) approach [Chen et al.(2018)Chen, Giannakis, Sun, and Yin, Sun et al.(2019)Sun, Chen, Giannakis, and Yang]. In contrast to periodic communication, the communication in LAG is adaptive and tailored for heterogeneous settings. Parameters in LAG are updated at the server, and workers only adaptively upload information that is determined to be informative enough. Unfortunately, while LAG performs well in deterministic settings (e.g., with full gradients), as shown in Section 2.1, its performance is significantly degraded in stochastic settings [Chen et al.(2020)Chen, Sun, and Yin]. In contrast, our approach generalizes LAG to the regime of running adaptive SGD. Very recently, FedAvg with local adaptive SGD updates has been proposed in [Reddi et al.(2020)Reddi, Charles, Zaheer, Garrett, Rush, Konečnỳ, Kumar, and McMahan], which sets a strong benchmark for communication-efficient learning. While the algorithm in [Reddi et al.(2020)Reddi, Charles, Zaheer, Garrett, Rush, Konečnỳ, Kumar, and McMahan] achieves a sweet spot between local SGD and adaptive momentum SGD, it is very different from ours, and its averaging period and selection of participating workers are nonadaptive.

1.2 Our approach

We develop a new adaptive SGD algorithm for distributed learning, called Communication-Adaptive Distributed Adam (CADA). Akin to the dynamic scaling of every gradient coordinate in Adam, the key motivation of adaptive communication is that during distributed learning, not all communication rounds between the server and workers are equally important. So a natural solution is to use a condition that decides whether the communication is important or not, and then adjust the frequency of communication between a worker and the server. If some workers are not communicating, the server uses their stale information instead of the fresh ones. We will show that this adaptive communication technique can reduce the less informative communication of distributed Adam.

Analogous to the original Adam [Kingma and Ba(2014)] and its modified version AMSGrad [Reddi et al.(2018)Reddi, Kale, and Kumar], our new CADA approach also uses the exponentially weighted stochastic gradient $h^{k+1}$ as the update direction of $\theta^{k+1}$, and leverages the weighted stochastic gradient magnitude $v^{k+1}$ to inversely scale the update direction $h^{k+1}$. Different from the direct distributed implementation of Adam that incorporates the fresh (thus unbiased) stochastic gradients $\bar{\bm{\nabla}}^{k}=\frac{1}{M}\sum_{m\in{\cal M}}\nabla\ell(\theta^{k};\xi_{m}^{k})$, CADA exponentially combines the aggregated stale stochastic gradients $\bm{\nabla}^{k}=\frac{1}{M}\sum_{m\in{\cal M}}\nabla\ell(\hat{\theta}_{m}^{k};\hat{\xi}_{m}^{k})$, where $\nabla\ell(\hat{\theta}_{m}^{k};\hat{\xi}_{m}^{k})$ is either the fresh stochastic gradient $\nabla\ell(\theta^{k};\xi_{m}^{k})$, or an old copy when $\hat{\theta}_{m}^{k}\neq\theta^{k};\hat{\xi}_{m}^{k}\neq\xi_{m}^{k}$. Informally, with $\alpha_{k}>0$ denoting the stepsize at iteration $k$, CADA has the following update

$$h^{k+1}=\beta_{1}h^{k}+(1-\beta_{1})\bm{\nabla}^{k},\quad{\rm with}\quad\bm{\nabla}^{k}=\frac{1}{M}\sum_{m\in{\cal M}}\nabla\ell(\hat{\theta}_{m}^{k};\hat{\xi}_{m}^{k})\tag{2a}$$
$$v^{k+1}=\beta_{2}\hat{v}^{k}+(1-\beta_{2})(\bm{\nabla}^{k})^{2}\tag{2b}$$
$$\theta^{k+1}=\theta^{k}-\alpha_{k}(\epsilon I+\hat{V}^{k+1})^{-\frac{1}{2}}h^{k+1}\tag{2c}$$

where $\beta_{1},\beta_{2}>0$ are the momentum weights, $\hat{V}^{k+1}:={\rm diag}(\hat{v}^{k+1})$ is a diagonal matrix whose diagonal vector is $\hat{v}^{k+1}:=\max\{v^{k+1},\hat{v}^{k}\}$, $\epsilon>0$ is a constant, and $I$ is the identity matrix. To reduce the memory requirement of storing all the stale stochastic gradients $\{\nabla\ell(\theta^{k};\xi_{m}^{k})\}$, we can obtain $\bm{\nabla}^{k}$ by refining the previous aggregated stochastic gradient $\bm{\nabla}^{k-1}$ stored at the server via

$$\bm{\nabla}^{k}=\bm{\nabla}^{k-1}+\frac{1}{M}\sum_{m\in{\cal M}^{k}}\delta_{m}^{k}\tag{3}$$

where $\delta_{m}^{k}:=\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\hat{\theta}_{m}^{k};\hat{\xi}_{m}^{k})$ is the stochastic gradient innovation, and ${\cal M}^{k}$ is the set of workers that upload stochastic gradients to the server at iteration $k$. Henceforth, $\hat{\theta}_{m}^{k}=\theta^{k};\hat{\xi}_{m}^{k}=\xi_{m}^{k},\,\forall m\in{\cal M}^{k}$ and $\hat{\theta}_{m}^{k}=\hat{\theta}_{m}^{k-1};\hat{\xi}_{m}^{k}=\hat{\xi}_{m}^{k-1},\,\forall m\notin{\cal M}^{k}$. See CADA’s implementation in Figure 1.
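As an illustration, the following sketch (ours, not the authors' released code) implements the server-side recursion (2a)-(2c) together with the lazy aggregation (3); `deltas` maps each uploading worker in ${\cal M}^{k}$ to its innovation $\delta_{m}^{k}$.

```python
import numpy as np

def cada_server_step(theta, nabla, h, v_hat, deltas, M, alpha,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One CADA iteration at the server: (3) followed by (2a)-(2c).

    nabla  : stored aggregate of (possibly stale) worker gradients
    deltas : {m: delta_m^k} innovations from the uploading workers M^k;
             silent workers contribute nothing, so their stale gradients
             remain inside nabla.
    """
    # (3): refine the stored aggregate with the received innovations
    if deltas:
        nabla = nabla + sum(deltas.values()) / M
    # (2a): momentum direction driven by the stale aggregate
    h = beta1 * h + (1 - beta1) * nabla
    # (2b): second-moment estimate, then the AMSGrad-style maximum v_hat
    v = beta2 * v_hat + (1 - beta2) * nabla ** 2
    v_hat = np.maximum(v, v_hat)
    # (2c): inverse square-root scaled parameter update
    theta = theta - alpha * h / np.sqrt(eps + v_hat)
    return theta, nabla, h, v_hat
```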

Clearly, the selection of the subset ${\cal M}^{k}$ is both critical and challenging. It is critical because it adaptively determines the number of communication rounds per iteration, $|{\cal M}^{k}|$. It is challenging because 1) the staleness introduced in the Adam update propagates not only through the momentum gradients but also through the adaptive learning rate; 2) the importance of each communication round is dynamic, so a fixed or nonadaptive condition is ineffective; and 3) the condition needs to be checked efficiently without extra overhead. To overcome these challenges, we develop two adaptive conditions to select ${\cal M}^{k}$ in CADA.

With details deferred to Section 2, the contributions of this paper are listed as follows.

c1) We introduce a novel communication-adaptive distributed Adam (CADA) approach that reuses stale stochastic gradients to reduce communication for distributed implementation of Adam.

c2) We develop a new Lyapunov function to establish convergence of CADA under both the nonconvex and Polyak-Łojasiewicz (PL) conditions even when the datasets are non-i.i.d. across workers. The convergence rate matches that of the original Adam.

c3) We confirm that our novel fully adaptive CADA algorithms achieve at least $60\%$ savings in communication uploads over some popular alternatives, using numerical tests on logistic regression and neural network training.

Figure 1: The CADA implementation.

2 CADA: Communication-Adaptive Distributed Adam

In this section, we revisit the recent LAG method [Chen et al.(2018)Chen, Giannakis, Sun, and Yin], provide insights into why it does not work well in stochastic settings, and then develop our communication-adaptive distributed Adam approach. To make our notation precise, we henceforth use $\tau_{m}^{k}\geq 0$ for the staleness or age of the information from worker $m$ used by the server at iteration $k$, e.g., $\hat{\theta}_{m}^{k}=\theta^{k-\tau_{m}^{k}}$. An age of $0$ means “fresh.”

2.1 The ineffectiveness of LAG with stochastic gradients

The LAG method [Chen et al.(2018)Chen, Giannakis, Sun, and Yin] modifies the distributed gradient descent update. Instead of communicating with all workers at every iteration, LAG selects a subset of workers ${\cal M}^{k}$ to obtain fresh full gradients and reuses stale full gradients from the others, that is

$$\theta^{k+1}=\theta^{k}-\frac{\eta_{k}}{M}\sum_{m\in{\cal M}\backslash{\cal M}^{k}}\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})-\frac{\eta_{k}}{M}\sum_{m\in{\cal M}^{k}}\nabla{\cal L}_{m}(\theta^{k})\tag{4}$$

where ${\cal M}^{k}$ is adaptively decided by comparing the gradient difference $\|\nabla{\cal L}_{m}(\theta^{k})-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\|$. Following this principle, the direct (or “naive”) stochastic version of LAG selects the subset of workers ${\cal M}^{k}$ to obtain fresh stochastic gradients $\nabla{\cal L}_{m}(\theta^{k};\xi_{m}^{k})$, $m\in{\cal M}^{k}$. The stochastic LAG also follows the distributed SGD update, but it selects ${\cal M}^{k}$ as follows: if worker $m$ finds that the innovation of the fresh stochastic gradient $\nabla\ell(\theta^{k};\xi_{m}^{k})$ is small enough to satisfy

$$\Big\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\Big\|^{2}\leq\frac{c}{d_{\rm max}}\sum_{d=1}^{d_{\rm max}}\big\|\theta^{k+1-d}-\theta^{k-d}\big\|^{2}\tag{5}$$

where $c\geq 0$ and $d_{\rm max}$ are pre-fixed constants, then worker $m$ reuses the old gradient, $m\in{\cal M}\backslash{\cal M}^{k}$, and sets the staleness to $\tau_{m}^{k+1}=\tau_{m}^{k}+1$; otherwise, worker $m$ uploads the fresh gradient and sets $\tau_{m}^{k+1}=1$.
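In code, the naive stochastic LAG decision at worker $m$ might read as follows (our sketch; `theta_diffs` holds the last $d_{\rm max}$ squared model changes appearing on the RHS of (5)):

```python
import numpy as np

def stochastic_lag_decide(grad_fresh, grad_stale, theta_diffs, c, tau):
    """Naive stochastic LAG rule (5): compare the innovation between the
    fresh gradient and the stored stale gradient (different samples!)
    against the recent progress of the iterates.
    Returns (skip_upload, new_staleness)."""
    lhs = np.linalg.norm(grad_fresh - grad_stale) ** 2
    rhs = c * sum(theta_diffs) / max(len(theta_diffs), 1)
    if lhs <= rhs:
        return True, tau + 1   # reuse the old gradient; staleness grows
    return False, 1            # upload the fresh gradient; reset staleness
```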

If the stochastic gradients were full gradients, the LAG condition (5) would compare the error induced by using stale gradients with the progress of the distributed gradient descent algorithm, which has proved effective in skipping redundant communication [Chen et al.(2018)Chen, Giannakis, Sun, and Yin]. Nevertheless, the observation here is that the two stochastic gradients in (5) are evaluated not just at two different iterates ($\theta^{k}$ and $\theta^{k-\tau_{m}^{k}}$) but also on two different samples ($\xi_{m}^{k}$ and $\xi_{m}^{k-\tau_{m}^{k}}$), and thus on two different loss functions.

This subtle difference leads to the ineffectiveness of (5). We can see this by expanding the left-hand side (LHS) of (5) as (see details in the supplemental material)

$$\begin{aligned}\mathbb{E}\left[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\|^{2}\right]&\geq\frac{1}{2}\mathbb{E}\Big[\big\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla{\cal L}_{m}(\theta^{k})\big\|^{2}\Big]&\text{(6a)}\\&\quad+\frac{1}{2}\mathbb{E}\Big[\big\|\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\big\|^{2}\Big]&\text{(6b)}\\&\quad-\mathbb{E}\left[\|\nabla{\cal L}_{m}(\theta^{k})-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\|^{2}\right].&\text{(6c)}\end{aligned}$$

Even if $\theta^{k}$ converges, e.g., $\theta^{k}\rightarrow\theta^{*}$, so that the right-hand side (RHS) of (5) satisfies $\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\rightarrow 0$, the LHS of (5) does not vanish, because the variance in (6a) and (6b) does not vanish even though the gradient difference at the same function in (6c) diminishes. Therefore, the key insight here is that the non-diminishing variance of stochastic gradients makes the LAG rule (5) ineffective eventually. This will also be verified in our simulations when we compare CADA with the stochastic LAG.

2.2 Algorithm development of CADA

In this section, we formally develop our CADA method, and present the intuition behind its design.

The key to the CADA design is to reduce the variance of the innovation measure in the adaptive condition. We introduce two CADA variants, both of which follow the update (2), but they differ in their variance-reduced communication rules.

The first variant, termed CADA1, calculates two stochastic gradient innovations: $\tilde{\delta}_{m}^{k}:=\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\tilde{\theta};\xi_{m}^{k})$ at the sample $\xi_{m}^{k}$, and $\tilde{\delta}_{m}^{k-\tau_{m}^{k}}:=\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\tilde{\theta};\xi_{m}^{k-\tau_{m}^{k}})$ at the sample $\xi_{m}^{k-\tau_{m}^{k}}$, where $\tilde{\theta}$ is a snapshot of a previous iterate $\theta$ that is updated every $D$ iterations. As we will show in (8), $\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}$ can be viewed as the difference of two variance-reduced gradients calculated at $\theta^{k}$ and $\theta^{k-\tau_{m}^{k}}$. Using $\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}$ as the error induced by using stale information, CADA1 excludes worker $m$ from ${\cal M}^{k}$ if worker $m$ finds

$$\left\|\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}\right\|^{2}\leq\frac{c}{d_{\rm max}}\sum_{d=1}^{d_{\rm max}}\left\|\theta^{k+1-d}-\theta^{k-d}\right\|^{2}.\tag{7}$$

In (7), we use the change of the parameter $\theta^{k}$ averaged over the past $d_{\rm max}$ consecutive iterations to measure the progress of the algorithm. Intuitively, if (7) is satisfied, the error induced by using stale information will not significantly affect the learning algorithm. In this case, worker $m$ does not upload, and the staleness of the information from worker $m$ increases to $\tau_{m}^{k+1}=\tau_{m}^{k}+1$; otherwise, worker $m$ belongs to ${\cal M}^{k}$, uploads the stochastic gradient innovation $\delta_{m}^{k}$, and resets $\tau_{m}^{k+1}=1$.
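A worker-side sketch of the CADA1 test (our illustration; the four gradient arguments match the quantities defined above, with the two snapshot gradients evaluated at $\tilde{\theta}$):

```python
import numpy as np

def cada1_worker_check(grad_fresh, grad_fresh_snap,
                       grad_stale, grad_stale_snap,
                       theta_diffs, c):
    """Worker-side CADA1 test of condition (7).

    grad_fresh      : grad ell(theta^k;       xi_m^k)
    grad_fresh_snap : grad ell(theta_tilde;   xi_m^k)
    grad_stale      : grad ell(theta^{k-tau}; xi_m^{k-tau})  (stored)
    grad_stale_snap : grad ell(theta_tilde;   xi_m^{k-tau})  (stored)
    theta_diffs holds the last d_max squared model changes.
    Returns True if the worker may skip the upload.
    """
    lhs = np.linalg.norm((grad_fresh - grad_fresh_snap)
                         - (grad_stale - grad_stale_snap)) ** 2
    rhs = c * sum(theta_diffs) / max(len(theta_diffs), 1)
    return lhs <= rhs
```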

The rationale of CADA1. In contrast to the non-vanishing variance in the LAG rule (see (6)), the CADA1 rule (7) reduces its inherent variance. To see this, we can decompose the LHS of (7) as the difference of two variance-reduced stochastic gradients at iterations $k$ and $k-\tau_{m}^{k}$. Using the SVRG stochastic gradient as an example [Johnson and Zhang(2013)], the innovation can be written as

$$\begin{aligned}\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}&=\big(\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\tilde{\theta};\xi_{m}^{k})+\nabla{\cal L}_{m}(\tilde{\theta})\big)\\&\quad-\left(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\tilde{\theta};\xi_{m}^{k-\tau_{m}^{k}})+\nabla{\cal L}_{m}(\tilde{\theta})\right).\end{aligned}\tag{8}$$

Define the minimizer of (1) as $\theta^{\star}$. With derivations given in the supplementary document, the expectation of the LHS of (7) can be upper bounded by

$$\mathbb{E}\left[\big\|\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}\big\|^{2}\right]={\cal O}\Big(\mathbb{E}[{\cal L}(\theta^{k})]-{\cal L}(\theta^{\star})+\mathbb{E}[{\cal L}(\theta^{k-\tau_{m}^{k}})]-{\cal L}(\theta^{\star})+\mathbb{E}[{\cal L}(\tilde{\theta})]-{\cal L}(\theta^{\star})\Big).\tag{9}$$

If $\theta^{k}$ converges, e.g., $\theta^{k},\theta^{k-\tau_{m}^{k}},\tilde{\theta}\rightarrow\theta^{*}$, the RHS of (9) diminishes, and thus the LHS of (7) diminishes. This is in contrast to the LAG rule, whose LHS in (6) is lower bounded by a non-vanishing value. Notice that while enjoying the benefit of variance reduction, our communication rule does not need to repeatedly calculate the full gradient $\nabla{\cal L}_{m}(\tilde{\theta})$, which is used for illustration purposes only.

Algorithm 1 Pseudo-code of CADA; red lines are run only by CADA1; blue lines are run only by CADA2; never both at the same time.
1: Input: delay counters $\{\tau_{m}^{0}\}$, stepsize $\alpha_{k}$, constant threshold $c$, max delay $D$.
2: for $k=0,1,\ldots,K-1$ do
3:     Server broadcasts $\theta^{k}$ to all workers.
4:     All workers set $\tilde{\theta}=\theta^{k}$ if $k \bmod D = 0$.
5:     for worker $m=1,2,\ldots,M$ do in parallel
6:         Compute $\nabla\ell(\theta^{k};\xi_{m}^{k})$ and $\nabla\ell(\tilde{\theta};\xi_{m}^{k})$.
7:         Check condition (7) with the stored $\tilde{\delta}_{m}^{k-\tau_{m}^{k}}$.
8:         Compute $\nabla\ell(\theta^{k};\xi_{m}^{k})$ and $\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})$.
9:         Check condition (10).
10:        if (7) or (10) is violated or $\tau_{m}^{k}\geq D$ then
11:            Upload $\delta_{m}^{k}$.    $\triangleright$ $\tau_{m}^{k+1}=1$
12:        else
13:            Upload nothing.    $\triangleright$ $\tau_{m}^{k+1}=\tau_{m}^{k}+1$
14:        end if
15:     end for
16:     Server updates $\{h^{k},v^{k}\}$ via (2a)-(2b).
17:     Server updates $\theta^{k}$ via (2c).
18: end for

In addition to (7), the second rule is termed CADA2. The key difference relative to CADA1 is that CADA2 uses $\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})$ to estimate the error of using stale information. CADA2 reuses the stale stochastic gradient $\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})$, i.e., excludes worker $m$ from ${\cal M}^{k}$, if worker $m$ finds

$$\big\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})\big\|^{2}\leq\frac{c}{d_{\rm max}}\sum_{d=1}^{d_{\rm max}}\big\|\theta^{k+1-d}-\theta^{k-d}\big\|^{2}.\tag{10}$$

If (10) is satisfied, then worker $m$ does not upload, and the staleness increases to $\tau_{m}^{k+1}=\tau_{m}^{k}+1$; otherwise, worker $m$ uploads the stochastic gradient innovation $\delta_{m}^{k}$ and resets the staleness to $\tau_{m}^{k+1}=1$. Notice that, different from the naive LAG rule (5), the CADA2 condition (10) is evaluated at two different iterates but on the same sample $\xi_{m}^{k}$.
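Mirroring the CADA1 sketch above, only the LHS changes; a hypothetical worker-side CADA2 test might read:

```python
import numpy as np

def cada2_worker_check(grad_fresh, grad_old_iterate, theta_diffs, c):
    """Worker-side CADA2 test of condition (10).

    grad_fresh       : grad ell(theta^k;       xi_m^k)
    grad_old_iterate : grad ell(theta^{k-tau}; xi_m^k)  -- same sample xi_m^k
    Returns True if the worker may skip the upload.
    """
    lhs = np.linalg.norm(grad_fresh - grad_old_iterate) ** 2
    rhs = c * sum(theta_diffs) / max(len(theta_diffs), 1)
    return lhs <= rhs
```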

The rationale of CADA2. Similar to CADA1, the CADA2 rule (10) also reduces the inherent variance, since the LHS of (10) can be written as the difference between a variance-reduced stochastic gradient and a deterministic gradient, that is

$$\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})=\left(\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})+\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\right)-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}}).\tag{11}$$

With derivations deferred to the supplementary document, similar to (9), we also have that $\mathbb{E}[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})\|^{2}]\rightarrow 0$ as the iterate $\theta^{k}\rightarrow\theta^{\star}$.

For either (7) or (10), worker $m$ can check the condition locally with small memory cost by recursively updating the RHS of (7) or (10). In addition, worker $m$ will upload a fresh stochastic gradient whenever the staleness satisfies $\tau_{m}^{k}\geq D$. We summarize CADA in Algorithm 1.
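For illustration, one way to maintain this running RHS with only $d_{\rm max}$ scalars is a sliding window; the helper below is our sketch with hypothetical names, not part of the paper's code.

```python
from collections import deque

class DriftTracker:
    """Sliding window of the last d_max squared model drifts
    ||theta^{k+1-d} - theta^{k-d}||^2, so the RHS of (7)/(10) can be
    refreshed in O(1) time while storing only d_max scalars."""

    def __init__(self, d_max):
        self.diffs = deque(maxlen=d_max)
        self.total = 0.0

    def push(self, sq_drift):
        # the deque evicts the oldest entry on append once full,
        # so remove its contribution from the running sum first
        if len(self.diffs) == self.diffs.maxlen:
            self.total -= self.diffs[0]
        self.diffs.append(sq_drift)
        self.total += sq_drift

    def rhs(self, c):
        # c / d_max times the sum of stored squared drifts, cf. (7) and (10)
        return c * self.total / max(len(self.diffs), 1)
```

Each worker calls `push` with $\|\theta^{k}-\theta^{k-1}\|^{2}$ once per iteration and compares the LHS of its condition against `rhs(c)`.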

Computational and memory cost of CADA. In CADA, checking (7) and (10) doubles the computational cost (gradient evaluations) per iteration. Aware of this fact, we have compared the number of iterations and gradient evaluations in simulations (see Figures 2-5), which demonstrate that CADA requires fewer iterations and also fewer gradient queries to achieve a target accuracy; thus the extra computation is small. In addition, the extra memory even for a large $d_{\rm max}$ is low: to compute the RHS of (7) or (10), each worker only stores the norms of model changes ($d_{\rm max}$ scalars).

3 Convergence Analysis of CADA

We present the convergence results of CADA. For all the results, we make some basic assumptions.

Assumption 1

The loss function $\mathcal{L}(\theta)$ is smooth with constant $L$.

Assumption 2

Samples $\xi_{m}^{1},\xi_{m}^{2},\ldots$ are independent, and the stochastic gradient $\nabla\ell(\theta;\xi_{m}^{k})$ satisfies $\mathbb{E}_{\xi_{m}^{k}}[\nabla\ell(\theta;\xi_{m}^{k})]=\nabla\mathcal{L}_{m}(\theta)$ and $\|\nabla\ell(\theta;\xi_{m}^{k})\|\leq\sigma_{m}$.

3.1 Key steps of Lyapunov analysis

The convergence results of CADA build critically on the following Lyapunov analysis. We start by analyzing the expected descent in $\mathcal{L}(\theta^{k})$ after one CADA update.

Lemma 3.1.

Under Assumptions 1 and 2, if $\alpha_{k+1}\leq\alpha_{k}$, then the iterates $\{\theta^{k}\}$ generated by CADA satisfy

$$\begin{aligned}\mathbb{E}[\mathcal{L}(\theta^{k+1})]-\mathbb{E}[\mathcal{L}(\theta^{k})]&\leq-\alpha_{k}(1-\beta_{1})\mathbb{E}\left[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bm{\nabla}^{k}\Big\rangle\right]\\&\quad-\alpha_{k}\beta_{1}\mathbb{E}\left[\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle\right]\\&\quad+\left(\frac{L}{2}+\beta_{1}L\right)\mathbb{E}\left[\|\theta^{k+1}-\theta^{k}\|^{2}\right]\\&\quad+\alpha_{k}(2-\beta_{1})\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\Big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\Big)\Big]\end{aligned}\tag{12}$$

where $p$ is the dimension of $\theta$, $\sigma$ is defined as $\sigma:=\frac{1}{M}\sum_{m\in{\cal M}}\sigma_{m}$, and $\beta_{1},\epsilon$ are the parameters in (2).

Lemma 3.1 contains four terms on the RHS of (12): the first two terms quantify the correlations between the gradient direction $\nabla\mathcal{L}(\theta^{k})$ and the stale stochastic gradient $\bm{\nabla}^{k}$ as well as the stale momentum stochastic gradient $h^{k}$; the third term captures the drift of two consecutive iterates; and the last term estimates the maximum drift of the adaptive stepsizes over $D+1$ iterations.

In view of Lemma 3.1, analyzing the progress of $\mathcal{L}(\theta^{k})$ under CADA is challenging, especially since the effects of staleness and momentum couple with each other. Because the stale momentum gradient $h^{k}$ is recursively updated by $\bm{\nabla}^{k}$, we first need the following lemma to characterize the regularity of the stale aggregated stochastic gradients $\bm{\nabla}^{k}$, which lays the theoretical foundation for incorporating properly controlled staleness into Adam’s momentum update.

Lemma 3.2.

Under Assumptions 1 and 2, if the stepsizes satisfy $\alpha_{k+1}\leq\alpha_{k}\leq 1/L$, then we have

$$\begin{aligned}-\alpha_{k}\mathbb{E}\bigg[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bm{\nabla}^{k}\Big\rangle\bigg]&\leq-\frac{\alpha_{k}}{2}\mathbb{E}\left[\left\|\nabla\mathcal{L}(\theta^{k})\right\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\right]+\frac{6DL\alpha_{k}^{2}\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}\\&\quad+\epsilon^{-\frac{1}{2}}\left(\frac{L}{12}+\frac{c}{2Ld_{\rm max}}\right)\sum_{d=1}^{D}\mathbb{E}\left[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\right].\end{aligned}\tag{13}$$

Lemma 3.2 justifies the relevance of the stale yet properly selected stochastic gradients. Intuitively, the first term on the RHS of (13) resembles the descent obtained by SGD with unbiased stochastic gradients, and the second and third terms will diminish if the stepsizes are diminishing, since $\mathbb{E}\left[\|\theta^{k}-\theta^{k-1}\|^{2}\right]={\cal O}(\alpha_{k}^{2})$. This is achieved by our designed communication rules.

In view of Lemmas 3.1 and 3.2, we introduce the following Lyapunov function:

$$\begin{aligned}{\cal V}^{k}:=&\ \mathcal{L}(\theta^{k})-\mathcal{L}(\theta^{\star})-\sum_{j=k}^{\infty}\alpha_{j}\beta_{1}^{j-k+1}\left\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\right\rangle\\&+b_{k}\sum_{d=0}^{D}\sum_{i=1}^{p}(\epsilon+\hat{v}_{i}^{k-d})^{-\frac{1}{2}}+\sum_{d=1}^{D}\rho_{d}\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\end{aligned}\tag{14}$$

where $\theta^{\star}$ is the solution of (1), and $\{b_{k}\}_{k=1}^{K}$ and $\{\rho_{d}\}_{d=1}^{D}$ are constants that will be specified in the proof.

The design of the Lyapunov function (14) is motivated by the progress of $\mathcal{L}(\theta^{k})$ in Lemmas 3.1-3.2, and is coupled with our communication rules (7) and (10) through the parameter difference term. We find that this new Lyapunov function leads to a much simpler proof for Adam and AMSGrad, which is of independent interest. The following lemma captures the progress of the Lyapunov function.

Lemma 3.3.

Under Assumptions 1-2, if $\{b_{k}\}_{k=1}^{K}$ and $\{\rho_{d}\}_{d=1}^{D}$ in (14) are chosen properly, we have

$$\mathbb{E}[{\cal V}^{k+1}]-\mathbb{E}[{\cal V}^{k}]\leq-\frac{\alpha_{k}(1-\beta_{1})}{2}\left(\epsilon+\frac{\sigma^{2}}{1-\beta_{2}}\right)^{-\frac{1}{2}}\mathbb{E}\left[\left\|\nabla\mathcal{L}(\theta^{k})\right\|^{2}\right]+\alpha_{k}^{2}C_{0}\tag{15}$$

where the constant $C_{0}$ depends on the CADA and problem parameters $c,\beta_{1},\beta_{2},\epsilon,D,L$, and $\{\sigma_{m}^{2}\}$.

The first term on the RHS of (15) is strictly negative, and the second term is positive but potentially small, since it is ${\cal O}(\alpha_{k}^{2})$ with $\alpha_{k}\rightarrow 0$. This implies that the function ${\cal V}^{k}$ will eventually converge if we choose the stepsizes appropriately. Lemma 3.3 is a generalization of SGD’s descent lemma: if we set $\beta_{1}=\beta_{2}=0$ in (2) and $b_{k}=0,\rho_{d}=0,\,\forall d,k$ in (14), then Lemma 3.3 reduces to the descent lemma of SGD in terms of $\mathcal{L}(\theta^{k})$; see e.g., [Bottou et al.(2018)Bottou, Curtis, and Nocedal, Lemma 4.4].
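Concretely, substituting $\beta_{1}=\beta_{2}=0$ and $b_{k}=\rho_{d}=0$ into (14) gives ${\cal V}^{k}=\mathcal{L}(\theta^{k})-\mathcal{L}(\theta^{\star})$, and (15) specializes to the familiar SGD-type descent inequality (a direct substitution, shown here as a sanity check):

```latex
% Special case of (15) with \beta_1=\beta_2=0 and b_k=\rho_d=0:
\mathbb{E}[\mathcal{L}(\theta^{k+1})]-\mathbb{E}[\mathcal{L}(\theta^{k})]
\le -\frac{\alpha_{k}}{2}\big(\epsilon+\sigma^{2}\big)^{-\frac{1}{2}}
\mathbb{E}\big[\|\nabla\mathcal{L}(\theta^{k})\|^{2}\big]+\alpha_{k}^{2}C_{0}.
```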

3.2 Main convergence results

Building upon our Lyapunov analysis, we first present the convergence in nonconvex case.

Theorem 3.4 (nonconvex).

Under Assumptions 1 and 2, if we choose $\alpha_{k}=\alpha={\cal O}(\frac{1}{\sqrt{K}})$ and $\beta_{1}<\sqrt{\beta_{2}}<1$, then the iterates $\{\theta^{k}\}$ generated by CADA satisfy

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|\nabla\mathcal{L}(\theta^{k})\|^{2}\right]={\cal O}\left(\frac{1}{\sqrt{K}}\right).\tag{16}$$

From Theorem 3.4, the convergence rate of CADA in terms of the average gradient norm is ${\cal O}(1/\sqrt{K})$, which matches that of plain-vanilla Adam [Reddi et al.(2018)Reddi, Kale, and Kumar, Chen et al.(2019)Chen, Liu, Sun, and Hong]. Unfortunately, due to the complicated nature of Adam-type analysis, the bound in (16) does not achieve the linear speedup established for asynchronous nonadaptive SGD, e.g., [Lian et al.(2016)Lian, Zhang, Hsieh, Huang, and Liu]. However, our analysis is tailored for adaptive SGD and does not make any assumption on the asynchrony, e.g., that the set of uploading workers is independent of the past or even independent and identically distributed.

Next we present the convergence results under a slightly stronger assumption on the loss (θ){\cal L}(\theta).

Assumption 3

The loss function $\mathcal{L}(\theta)$ satisfies the Polyak-Łojasiewicz (PL) condition with constant $\mu>0$, that is, ${\cal L}(\theta)-{\cal L}(\theta^{*})\leq\frac{1}{2\mu}\left\|\nabla{\cal L}(\theta)\right\|^{2}$.

The PL condition is weaker than strong convexity, and it does not even require convexity [Karimi et al.(2016)Karimi, Nutini, and Schmidt]. It is satisfied by a wide range of problems such as least squares for underdetermined linear systems, logistic regression, and certain types of neural networks.
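As a standard example (our illustration; assuming $b$ lies in the range of $A$), least squares satisfies the PL condition with $\mu=\lambda_{\min}^{+}(AA^{\top})$, the smallest positive eigenvalue:

```latex
% PL for least squares, assuming b \in \mathrm{range}(A):
\mathcal{L}(\theta)=\tfrac{1}{2}\|A\theta-b\|^{2},\qquad
\|\nabla\mathcal{L}(\theta)\|^{2}=\|A^{\top}(A\theta-b)\|^{2}
\ge \lambda_{\min}^{+}(AA^{\top})\,\|A\theta-b\|^{2}
=2\lambda_{\min}^{+}(AA^{\top})\,\big(\mathcal{L}(\theta)-\mathcal{L}(\theta^{\star})\big).
```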

We next establish the convergence of CADA under this condition.

Theorem 3.5 (PL-condition).

Under Assumptions 1-3, if we choose the stepsize as $\alpha_{k}=\frac{2}{\mu(k+K_{0})}$ for a given constant $K_{0}$, then $\theta^{K}$ generated by Algorithm 1 satisfies

$$\mathbb{E}\left[\mathcal{L}(\theta^{K})\right]-\mathcal{L}(\theta^{\star})={\cal O}\left(\frac{1}{K}\right).\tag{17}$$

Theorem 3.5 implies that under the PL condition on the loss function, the CADA algorithm achieves global convergence in terms of the loss function at the fast rate ${\cal O}(1/K)$. Compared with the previous analysis for LAG [Chen et al.(2018)Chen, Giannakis, Sun, and Yin], and as highlighted in Section 3.1, the analysis for CADA is more involved, since it needs to deal with not only outdated gradients but also stochastic momentum gradients and adaptive matrix learning rates. We tackle this by i) considering a new set of communication rules (7) and (10) with reduced variance; and ii) incorporating the effect of momentum gradients and the drift of adaptive learning rates in the new Lyapunov function (14).

Figure 2: Logistic regression training loss on the covtype dataset, averaged over 10 Monte Carlo runs.
Figure 3: Logistic regression training loss on the ijcnn1 dataset, averaged over 10 Monte Carlo runs.
Figure 4: Training a neural network for classification on the mnist dataset.
Figure 5: Training a neural network for classification on the cifar10 dataset.

4 Simulations

In order to verify our analysis and evaluate the empirical performance of CADA, we conduct simulations using logistic regression and neural network training. Data are distributed across $M=10$ workers in all tests. We benchmark CADA against popular methods such as Adam [Kingma and Ba(2014)], the stochastic version of LAG [Chen et al.(2018)Chen, Giannakis, Sun, and Yin], local momentum [Yu et al.(2019)Yu, Jin, and Yang] and the state-of-the-art FedAdam [Reddi et al.(2020)Reddi, Charles, Zaheer, Garrett, Rush, Konečnỳ, Kumar, and McMahan]. For local momentum and FedAdam, workers perform model updates independently, and the models are averaged over all workers every $H$ iterations. In the simulations, critical parameters are optimized for each algorithm by a grid search. Due to space limitations, please see the detailed choice of parameters and additional experiments in the supplementary document.

Logistic regression. For CADA, the maximal delay is $D=100$ and $d_{\rm max}=10$. For local momentum and FedAdam, we manually optimize the averaging period as $H=10$ for ijcnn1 and $H=20$ for covtype. Results are averaged over 10 Monte Carlo runs.

Tests on logistic regression are reported in Figures 2-3. In our tests, the two CADA variants achieve iteration complexity similar to the original Adam and outperform all other baselines in most cases. Since CADA requires two gradient evaluations per iteration, the gradient complexity (e.g., computational complexity) of CADA is higher than Adam’s, but still smaller than that of the other baselines. For the logistic regression task, CADA1 and CADA2 reduce the number of communication uploads by at least one order of magnitude.

Training neural networks. We train a neural network with two convolution-ELU-maxpooling layers followed by two fully-connected layers for 10-class classification on mnist. We use the popular ResNet20 model on the CIFAR10 dataset, which has 20 layers and roughly 0.27 million parameters. We searched for the best value of $H$ over the grid $\{1,4,6,8,16\}$ to optimize the testing accuracy versus communication rounds for each algorithm. In CADA, the maximum delay is $D=50$ and the averaging interval is $d_{\rm max}=10$. See tests under different $H$ in the supplementary material.

Tests on training neural networks are reported in Figures 4-5. On mnist, CADA1 and CADA2 reduce the number of communication uploads by roughly 60% relative to local momentum, and slightly more relative to FedAdam. On cifar10, CADA1 and CADA2 achieve competitive performance relative to the state-of-the-art algorithms FedAdam and local momentum. We found that if we further enlarge $H$, FedAdam and local momentum converge fast at the beginning, but reach worse test accuracy (e.g., by 5%-15%). It is also evident that the CADA1 and CADA2 rules achieve more communication reduction than the direct stochastic version of LAG, which verifies the intuition in Section 2.1.

5 Conclusions

While adaptive SGD methods have been widely applied in the single-node setting, their performance in the distributed learning setting is less understood. In this paper, we have developed a communication-adaptive distributed Adam method, termed CADA, which endows Adam with an additional dimension of adaptivity tailored for its distributed implementation. CADA leverages a set of adaptive communication rules to detect and then skip less informative communication rounds between the server and the workers during distributed learning. All CADA variants are simple to implement, and have convergence rates comparable to the original Adam.

References

  • [Aji and Heafield(2017)] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proc. Conf. Empirical Methods Natural Language Process., pages 440–445, Copenhagen, Denmark, Sep 2017.
  • [Alistarh et al.(2017)Alistarh, Grubic, Li, Tomioka, and Vojnovic] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. Conf. Neural Info. Process. Syst., pages 1709–1720, Long Beach, CA, Dec 2017.
  • [Alistarh et al.(2018)Alistarh, Hoefler, Johansson, Konstantinov, Khirirat, and Renggli] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Proc. Conf. Neural Info. Process. Syst., pages 5973–5983, Montreal, Canada, Dec 2018.
  • [Bernstein et al.(2018)Bernstein, Wang, Azizzadenesheli, and Anandkumar] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SignSGD: Compressed optimisation for non-convex problems. In Proc. Intl. Conf. Machine Learn., pages 559–568, Stockholm, Sweden, Jul 2018.
  • [Bottou et al.(2018)Bottou, Curtis, and Nocedal] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311, 2018.
  • [Chen et al.(2018)Chen, Giannakis, Sun, and Yin] Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Proc. Conf. Neural Info. Process. Syst., pages 5050–5060, Montreal, Canada, Dec 2018.
  • [Chen et al.(2020)Chen, Sun, and Yin] Tianyi Chen, Yuejiao Sun, and Wotao Yin. LASG: Lazily aggregated stochastic gradients for communication-efficient distributed learning. arXiv preprint:2002.11360, 2020.
  • [Chen et al.(2019)Chen, Liu, Sun, and Hong] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In Proc. Intl. Conf. Learn. Representations, New Orleans, LA, May 2019.
  • [Défossez et al.(2020)Défossez, Bottou, Bach, and Usunier] Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. On the convergence of Adam and Adagrad. arXiv preprint:2003.02395, March 2020.
  • [Duchi et al.(2011)Duchi, Hazan, and Singer] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Res., 12(Jul):2121–2159, 2011.
  • [Ghadimi and Lan(2013)] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • [Ghadimi and Lan(2016)] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
  • [Haddadpour et al.(2019)Haddadpour, Kamani, Mahdavi, and Cadambe] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. In Proc. Conf. Neural Info. Process. Syst., pages 11080–11092, Vancouver, Canada, December 2019.
  • [Horváth et al.(2019)Horváth, Kovalev, Mishchenko, Stich, and Richtárik] Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint:1904.05115, April 2019.
  • [Ivkin et al.(2019)Ivkin, Rothchild, Ullah, Stoica, Arora, et al.] Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, et al. Communication-efficient distributed SGD with sketching. In Proc. Conf. Neural Info. Process. Syst., pages 13144–13154, Vancouver, Canada, December 2019.
  • [Jaggi et al.(2014)Jaggi, Smith, Takác, Terhorst, Krishnan, Hofmann, and Jordan] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Proc. Advances in Neural Info. Process. Syst., pages 3068–3076, Montreal, Canada, December 2014.
  • [Jiang and Agrawal(2018)] Peng Jiang and Gagan Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Proc. Conf. Neural Info. Process. Syst., pages 2525–2536, Montreal, Canada, Dec 2018.
  • [Johnson and Zhang(2013)] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Neural Info. Process. Syst., pages 315–323, 2013.
  • [Kairouz et al.(2019)Kairouz, McMahan, Avent, Bellet, Bennis, Bhagoji, Bonawitz, Charles, Cormode, Cummings, et al.] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint:1912.04977, December 2019.
  • [Kamp et al.(2018)Kamp, Adilova, Sicking, Hüger, Schlicht, Wirtz, and Wrobel] Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, and Stefan Wrobel. Efficient decentralized deep learning by dynamic model averaging. In Euro. Conf. Machine Learn. Knowledge Disc. Data.,, pages 393–409, Dublin, Ireland, 2018.
  • [Karimi et al.(2016)Karimi, Nutini, and Schmidt] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Proc. Euro. Conf. Machine Learn., pages 795–811, Riva del Garda, Italy, September 2016.
  • [Karimireddy et al.(2019)Karimireddy, Rebjock, Stich, and Jaggi] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In Proc. Intl. Conf. Machine Learn., pages 3252–3261, Long Beach, CA, June 2019.
  • [Karimireddy et al.(2020)Karimireddy, Kale, Mohri, Reddi, Stich, and Suresh] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. In Proc. Intl. Conf. Machine Learn., July 2020.
  • [Kingma and Ba(2014)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint:1412.6980, December 2014.
  • [Li et al.(2018)Li, Sahu, Zaheer, Sanjabi, Talwalkar, and Smith] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
  • [Li and Orabona(2019)] Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In Proc. Intl. Conf. on Artif. Intell. and Stat., pages 983–992, Okinawa, Japan, April 2019.
  • [Lian et al.(2016)Lian, Zhang, Hsieh, Huang, and Liu] Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Proc. Conf. Neural Info. Process. Syst., pages 3054–3062, Barcelona, Spain, December 2016.
  • [Lin et al.(2020)Lin, Stich, Patel, and Jaggi] Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local SGD. In Proc. Intl. Conf. Learn. Representations, Addis Ababa, Ethiopia, April 2020.
  • [Lin et al.(2018)Lin, Han, Mao, Wang, and Dally] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In Proc. Intl. Conf. Learn. Representations, Vancouver, Canada, Apr 2018.
  • [Ma et al.(2017)Ma, Konečnỳ, Jaggi, Smith, Jordan, Richtárik, and Takáč] Chenxin Ma, Jakub Konečnỳ, Martin Jaggi, Virginia Smith, Michael I Jordan, Peter Richtárik, and Martin Takáč. Distributed optimization with arbitrary local solvers. Optimization Methods and Software, 32(4):813–848, July 2017.
  • [McMahan et al.(2017a)McMahan, Moore, Ramage, Hampson, and y Arcas] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. Intl. Conf. Artificial Intell. and Stat., pages 1273–1282, Fort Lauderdale, FL, April 2017a.
  • [McMahan et al.(2017b)McMahan, Moore, Ramage, Hampson, and y Arcas] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. Intl. Conf. on Artif. Intell. and Stat., pages 1273–1282, Fort Lauderdale, Florida, Apr 2017b.
  • [Nesterov(1983)] Yurii E Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2)o(1/k^{2}). In Doklady AN USSR, volume 269, pages 543–547, 1983.
  • [Park et al.(2019)Park, Samarakoon, Bennis, and Debbah] Jihong Park, Sumudu Samarakoon, Mehdi Bennis, and Mérouane Debbah. Wireless network intelligence at the edge. Proc. of the IEEE, 107(11):2204–2239, November 2019.
  • [Polyak(1964)] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • [Reddi et al.(2018)Reddi, Kale, and Kumar] Sashank Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In Proc. Intl. Conf. Learn. Representations, Vancouver, Canada, April 2018.
  • [Reddi et al.(2020)Reddi, Charles, Zaheer, Garrett, Rush, Konečnỳ, Kumar, and McMahan] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint:2003.00295, March 2020.
  • [Reisizadeh et al.(2019)Reisizadeh, Taheri, Mokhtari, Hassani, and Pedarsani] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. In Proc. Conf. Neural Info. Process. Syst., pages 8386–8397, Vancouver, Canada, December 2019.
  • [Robbins and Monro(1951)] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, September 1951.
  • [Seide et al.(2014)Seide, Fu, Droppo, Li, and Yu] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. Conf. Intl. Speech Comm. Assoc., Singapore, Sept 2014.
  • [Shamir et al.(2014)Shamir, Srebro, and Zhang] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In Proc. Intl. Conf. Machine Learn., pages 1000–1008, Beijing, China, Jun 2014.
  • [Stich et al.(2018)Stich, Cordonnier, and Jaggi] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proc. Conf. Neural Info. Process. Syst., pages 4447–4458, Montreal, Canada, Dec 2018.
  • [Stich(2019)] Sebastian Urban Stich. Local SGD converges fast and communicates little. In Proc. Intl. Conf. Learn. Representations, New Orleans, LA, May 2019.
  • [Strom(2015)] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proc. Conf. Intl. Speech Comm. Assoc., Dresden, Germany, September 2015.
  • [Sun et al.(2019)Sun, Chen, Giannakis, and Yang] Jun Sun, Tianyi Chen, Georgios Giannakis, and Zaiyue Yang. Communication-efficient distributed learning via lazily aggregated quantized gradients. In Proc. Conf. Neural Info. Process. Syst., page to appear, Vancouver, Canada, Dec 2019.
  • [Tang et al.(2018)Tang, Gan, Zhang, Zhang, and Liu] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Proc. Conf. Neural Info. Process. Syst., pages 7652–7662, Montreal, Canada, December 2018.
  • [Tang et al.(2020)Tang, Gan, Rajbhandari, Lian, Zhang, Liu, and He] Hanlin Tang, Shaoduo Gan, Samyam Rajbhandari, Xiangru Lian, Ce Zhang, Ji Liu, and Yuxiong He. Apmsqueeze: A communication efficient adam-preconditioned momentum sgd algorithm. arXiv preprint:2008.11343, August 2020.
  • [Vogels et al.(2019)Vogels, Karimireddy, and Jaggi] Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Proc. Conf. Neural Info. Process. Syst., pages 14236–14245, Vancouver, Canada, December 2019.
  • [Wang et al.(2020a)Wang, Lu, Tu, and Zhang] Guanghui Wang, Shiyin Lu, Weiwei Tu, and Lijun Zhang. SAdam: A variant of adam for strongly convex functions. In Proc. Intl. Conf. Learn. Representations, 2020a.
  • [Wang and Joshi(2019)] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. In ICML Workshop on Coding Theory for Large-Scale ML, Long Beach, CA, June 2019.
  • [Wang et al.(2020b)Wang, Tantia, Ballas, and Rabbat] Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. SlowMo: Improving communication-efficient distributed SGD with slow momentum. In Proc. Intl. Conf. Learn. Representations, 2020b.
  • [Wangni et al.(2018)Wangni, Wang, Liu, and Zhang] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proc. Conf. Neural Info. Process. Syst., pages 1299–1309, Montreal, Canada, Dec 2018.
  • [Ward et al.(2019)Ward, Wu, and Bottou] Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. In Proc. Intl. Conf. Machine Learn., pages 6677–6686, Long Beach, CA, June 2019.
  • [Wen et al.(2017)Wen, Xu, Yan, Wu, Wang, Chen, and Li] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Conf. Neural Info. Process. Syst., pages 1509–1519, Long Beach, CA, Dec 2017.
  • [Wu et al.(2018)Wu, Huang, Huang, and Zhang] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In Proc. Intl. Conf. Machine Learn., pages 5325–5333, Stockholm, Sweden, Jul 2018.
  • [Xu et al.(2020)Xu, Sutcher-Shepard, Xu, and Chen] Yangyang Xu, Colin Sutcher-Shepard, Yibo Xu, and Jie Chen. Asynchronous parallel adaptive stochastic gradient methods. arXiv preprint:2002.09095, February 2020.
  • [Yu et al.(2019)Yu, Jin, and Yang] Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proc. Intl. Conf. Machine Learn., pages 7184–7193, Long Beach, CA, June 2019.
  • [Zeiler(2012)] Matthew D Zeiler. AdaDelta: An adaptive learning rate method. arXiv preprint:1212.5701, December 2012.
  • [Zhang et al.(2017)Zhang, Li, Kara, Alistarh, Liu, and Zhang] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. Intl. Conf. Machine Learn., pages 4035–4043, Sydney, Australia, Aug 2017.
  • [Zhang et al.(2015)Zhang, Choromanska, and LeCun] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Proc. Conf. Neural Info. Process. Syst., pages 685–693, Montreal, Canada, Dec 2015.
  • [Zhang and Lin(2015)] Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In Proc. Intl. Conf. Machine Learn., pages 362–370, Lille, France, June 2015.

Supplementary materials for “CADA: Communication-Adaptive Distributed Adam”

In this supplementary document, we first present some basic inequalities that will be used frequently, then provide the missing derivations of several claims together with the proofs of all lemmas and theorems in the paper, and finally give details on our experiments.

6 Supporting Lemmas

Define the $\sigma$-algebra $\Theta^{k}=\{\theta^{l},1\leq l\leq k\}$. For convenience, we also initialize the parameters as $\theta^{-D},\theta^{-D+1},\ldots,\theta^{-1}=\theta^{0}$. Some basic facts used in the proofs are reviewed as follows.

Fact 1. Assume that $X_{1},X_{2},\ldots,X_{n}\in\mathbb{R}^{p}$ are independent random variables with $\mathbb{E}[X_{1}]=\cdots=\mathbb{E}[X_{n}]=0$. Then

\mathbb{E}\Big[\Big\|\sum_{i=1}^{n}X_{i}\Big\|^{2}\Big]=\sum_{i=1}^{n}\mathbb{E}\big[\|X_{i}\|^{2}\big]. (18)

Fact 2. (Young's inequality) For any $\theta_{1},\theta_{2}\in\mathbb{R}^{p}$ and $\varepsilon>0$,

\big\langle\theta_{1},\theta_{2}\big\rangle\leq\frac{\|\theta_{1}\|^{2}}{2\varepsilon}+\frac{\varepsilon\|\theta_{2}\|^{2}}{2}. (19)

As a consequence, we have

\|\theta_{1}+\theta_{2}\|^{2}\leq\Big(1+\frac{1}{\varepsilon}\Big)\|\theta_{1}\|^{2}+(1+\varepsilon)\|\theta_{2}\|^{2}. (20)

Fact 3. (Cauchy-Schwarz inequality) For any $\theta_{1},\theta_{2},\ldots,\theta_{n}\in\mathbb{R}^{p}$, we have

\Big\|\sum_{i=1}^{n}\theta_{i}\Big\|^{2}\leq n\sum_{i=1}^{n}\|\theta_{i}\|^{2}. (21)
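As a quick numerical sanity check, the following Python (numpy) sketch verifies Facts 1-3 on random data. It is purely illustrative; the dimensions n, p and the constant eps in it are arbitrary assumptions rather than quantities from the paper, and Fact 1 is checked in the Monte Carlo sense.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, eps = 5, 4, 0.3

    # Fact 1: for independent zero-mean X_i, E||sum_i X_i||^2 = sum_i E||X_i||^2,
    # checked here by Monte Carlo up to sampling error.
    X = rng.normal(size=(200000, n, p))
    lhs = np.mean(np.linalg.norm(X.sum(axis=1), axis=1) ** 2)
    rhs = np.mean(np.sum(np.linalg.norm(X, axis=2) ** 2, axis=1))
    assert abs(lhs - rhs) / rhs < 0.05

    # Fact 2 (Young): <a, b> <= ||a||^2/(2 eps) + eps ||b||^2 / 2, and (20).
    a, b = rng.normal(size=p), rng.normal(size=p)
    assert a @ b <= a @ a / (2 * eps) + eps * (b @ b) / 2
    assert np.sum((a + b) ** 2) <= (1 + 1 / eps) * (a @ a) + (1 + eps) * (b @ b)

    # Fact 3 (Cauchy-Schwarz): ||sum_i theta_i||^2 <= n sum_i ||theta_i||^2.
    thetas = rng.normal(size=(n, p))
    assert np.sum(thetas.sum(axis=0) ** 2) <= n * np.sum(thetas ** 2)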
Lemma 6.1.

For $k-\tau_{\rm max}\leq l\leq k-D$, if $\{\theta^{k}\}$ are the iterates generated by CADA, we have

\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\big)\Big\rangle\Big]\leq\frac{L\epsilon^{-\frac{1}{2}}}{12\alpha_{k}}\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+6DL\alpha_{k}\epsilon^{-\frac{1}{2}}\sigma_{m}^{2} (22)

and similarly, we have

\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\mathcal{L}_{m}(\theta^{l})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\big)\Big\rangle\Big]\leq\frac{L\epsilon^{-\frac{1}{2}}}{12\alpha_{k}}\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+3DL\alpha_{k}\epsilon^{-\frac{1}{2}}\sigma_{m}^{2}. (23)
Proof 6.2.

We first show that the following holds:

\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{l}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\big)\Big\rangle\Big]
\stackrel{(a)}{=}\mathbb{E}\Big[\mathbb{E}\big[\big\langle\nabla\mathcal{L}(\theta^{l}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\big)\big\rangle\,\big|\,\Theta^{l}\big]\Big]
\stackrel{(b)}{=}\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{l}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\mathbb{E}\big[\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\,\big|\,\Theta^{l}\big]\Big\rangle\Big]
=\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{l}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\mathcal{L}_{m}(\theta^{l})-\nabla\mathcal{L}_{m}(\theta^{l})\big)\Big\rangle\Big]=0 (24)

where (a) follows from the law of total expectation, and (b) holds because $\hat{V}^{k-D}$ is deterministic conditioned on $\Theta^{l}$ when $k-D\leq l$.

We first prove (22) by decomposing its left-hand side as

\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\big)\Big\rangle\Big]
\stackrel{(c)}{=}\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k})-\nabla\mathcal{L}(\theta^{l}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\big)\Big\rangle\Big]
\stackrel{(d)}{\leq}L\,\mathbb{E}\Big[\big\|(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{4}}\big\|\,\big\|\theta^{k}-\theta^{l}\big\|\,\big\|(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{4}}\big(\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\big)\big\|\Big]
\stackrel{(e)}{\leq}\frac{L\epsilon^{-\frac{1}{2}}}{12D\alpha_{k}}\underbrace{\mathbb{E}\big[\|\theta^{k}-\theta^{l}\|^{2}\big]}_{I_{1}}+\frac{6DL\alpha_{k}\epsilon^{-\frac{1}{2}}}{2}\underbrace{\mathbb{E}\big[\|\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\|^{2}\big]}_{I_{2}} (25)

where (c) holds due to (24), (d) uses Assumption 1, and (e) applies Young's inequality (19).

Applying the Cauchy-Schwarz inequality (21) to $I_{1}$, we have

I_{1}=\mathbb{E}\Big[\Big\|\sum_{d=1}^{k-l}(\theta^{k+1-d}-\theta^{k-d})\Big\|^{2}\Big]\leq(k-l)\sum_{d=1}^{k-l}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]\leq D\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]. (26)

Applying Assumption 2 to $I_{2}$, we have

I_{2}=\mathbb{E}\big[\|\nabla\ell(\theta^{l};\xi_{m}^{k})-\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\|^{2}\big]\leq\mathbb{E}\big[\|\nabla\ell(\theta^{l};\xi_{m}^{k})\|^{2}\big]+\mathbb{E}\big[\|\nabla\ell(\theta^{l};\xi_{m}^{k-\tau_{m}^{k}})\|^{2}\big]\leq 2\sigma_{m}^{2} (27)

where the first inequality expands the square and drops the cross term, which is nonnegative in expectation (it equals $2\mathbb{E}[\|\nabla{\cal L}_{m}(\theta^{l})\|^{2}]$ by the conditioning argument of (24)), and the last inequality uses Assumption 2. Plugging (26) and (27) into (25) leads to (22).

Likewise, following the same steps as in (25)-(27), it can be verified that (23) also holds true.

Lemma 6.3.

Under Assumption 2, the parameters $\{h^{k},\hat{v}^{k}\}$ of CADA in Algorithm 1 satisfy

\|h^{k}\|\leq\sigma,~\forall k;\qquad\hat{v}_{i}^{k}\leq\sigma^{2},~\forall k,i (28)

where $\sigma:=\frac{1}{M}\sum_{m\in{\cal M}}\sigma_{m}$.

Proof 6.4.

Using Assumption 2, it follows that

\|\bm{\nabla}^{k}\|=\Big\|\frac{1}{M}\sum_{m\in{\cal M}}\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\Big\|\leq\frac{1}{M}\sum_{m\in{\cal M}}\big\|\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\big\|\leq\frac{1}{M}\sum_{m\in{\cal M}}\sigma_{m}=\sigma. (29)

Therefore, from the update (2a), we have

\|h^{k+1}\|\leq\beta_{1}\|h^{k}\|+(1-\beta_{1})\|\bm{\nabla}^{k}\|\leq\beta_{1}\|h^{k}\|+(1-\beta_{1})\sigma.

Since $\|h^{1}\|\leq\sigma$, it follows by induction that $\|h^{k+1}\|\leq\sigma$ for all $k$.

Using Assumption 2 again, it follows that

(\nabla_{i}^{k})^{2}=\Big(\frac{1}{M}\sum_{m\in{\cal M}}\nabla_{i}\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\Big)^{2}\leq\frac{1}{M}\sum_{m\in{\cal M}}\big(\nabla_{i}\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\big)^{2}\leq\frac{1}{M}\sum_{m\in{\cal M}}\big\|\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\big\|^{2}\leq\frac{1}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}\leq\sigma^{2}. (30)

Similarly, from the update (2b), we have

\hat{v}_{i}^{k+1}\leq\max\{\hat{v}_{i}^{k},\beta_{2}\hat{v}_{i}^{k}+(1-\beta_{2})(\nabla_{i}^{k})^{2}\}\leq\max\{\hat{v}_{i}^{k},\beta_{2}\hat{v}_{i}^{k}+(1-\beta_{2})\sigma^{2}\}.

Since $v_{i}^{1}=\hat{v}_{i}^{1}\leq\sigma^{2}$, it follows by induction that $\hat{v}_{i}^{k+1}\leq\sigma^{2}$ for all $k$ and $i$.
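The induction above can also be observed numerically. The following Python sketch is a minimal illustration, not the authors' implementation: it runs updates of the form (2a)-(2b) with gradients clipped to norm at most $\sigma$ (the clipping and all constants are our assumptions) and checks the bounds (28) along the trajectory.

    import numpy as np

    rng = np.random.default_rng(1)
    p, sigma, beta1, beta2 = 4, 2.0, 0.9, 0.99
    h, v_hat = np.zeros(p), np.zeros(p)

    for k in range(1000):
        g = rng.normal(size=p)
        g *= min(1.0, sigma / np.linalg.norm(g))   # enforce ||grad|| <= sigma
        h = beta1 * h + (1 - beta1) * g                                   # (2a)
        v_hat = np.maximum(v_hat, beta2 * v_hat + (1 - beta2) * g ** 2)   # (2b)
        assert np.linalg.norm(h) <= sigma + 1e-9          # first bound in (28)
        assert np.all(v_hat <= sigma ** 2 + 1e-9)         # second bound in (28)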

Lemma 6.5.

Under Assumption 2, the iterates $\{\theta^{k}\}$ of CADA in Algorithm 1 satisfy

\|\theta^{k+1}-\theta^{k}\|^{2}\leq\alpha_{k}^{2}\,p\,(1-\beta_{2})^{-1}(1-\beta_{3})^{-1} (31)

where $p$ is the dimension of $\theta$, $\beta_{1}<\sqrt{\beta_{2}}<1$, and $\beta_{3}:=\beta_{1}^{2}/\beta_{2}$.

Proof 6.6.

Choosing $\beta_{1}<1$ and defining $\beta_{3}:=\beta_{1}^{2}/\beta_{2}$, it can be verified that

|h_{i}^{k+1}|=\big|\beta_{1}h^{k}_{i}+(1-\beta_{1})\nabla^{k}_{i}\big|\leq\beta_{1}|h_{i}^{k}|+|\nabla_{i}^{k}|\leq\beta_{1}\big(\beta_{1}|h_{i}^{k-1}|+|\nabla_{i}^{k-1}|\big)+|\nabla_{i}^{k}|\leq\sum_{l=0}^{k}\beta_{1}^{k-l}|\nabla_{i}^{l}|=\sum_{l=0}^{k}\sqrt{\beta_{3}}^{\,k-l}\sqrt{\beta_{2}}^{\,k-l}|\nabla_{i}^{l}|\stackrel{(a)}{\leq}\Big(\sum_{l=0}^{k}\beta_{3}^{k-l}\Big)^{\frac{1}{2}}\Big(\sum_{l=0}^{k}\beta_{2}^{k-l}(\nabla_{i}^{l})^{2}\Big)^{\frac{1}{2}}\leq(1-\beta_{3})^{-\frac{1}{2}}\Big(\sum_{l=0}^{k}\beta_{2}^{k-l}(\nabla_{i}^{l})^{2}\Big)^{\frac{1}{2}} (32)

where (a) follows from the Cauchy-Schwarz inequality.

For $\hat{v}_{i}^{k}$, first note that $\hat{v}_{i}^{1}\geq(1-\beta_{2})(\nabla_{i}^{1})^{2}$. Then, since

\hat{v}_{i}^{k+1}\geq\beta_{2}\hat{v}_{i}^{k}+(1-\beta_{2})(\nabla_{i}^{k})^{2}

by induction we have

\hat{v}_{i}^{k+1}\geq(1-\beta_{2})\sum_{l=0}^{k}\beta_{2}^{k-l}(\nabla_{i}^{l})^{2}. (33)

Using (32) and (33), we have

|h_{i}^{k+1}|^{2}\leq(1-\beta_{3})^{-1}\Big(\sum_{l=0}^{k}\beta_{2}^{k-l}(\nabla_{i}^{l})^{2}\Big)\leq(1-\beta_{2})^{-1}(1-\beta_{3})^{-1}\hat{v}_{i}^{k+1}.

From the update (2c), we have

\|\theta^{k+1}-\theta^{k}\|^{2}=\alpha_{k}^{2}\sum_{i=1}^{p}\big(\epsilon+\hat{v}^{k+1}_{i}\big)^{-1}|h_{i}^{k+1}|^{2}\leq\alpha_{k}^{2}\,p\,(1-\beta_{2})^{-1}(1-\beta_{3})^{-1} (34)

which completes the proof.

7 Missing Derivations in Section 2.2

The analysis in this part is analogous to that in [Ghadimi and Lan(2013)]. We define an auxiliary function as

\psi_{m}(\theta)={\cal L}_{m}(\theta)-{\cal L}_{m}(\theta^{\star})-\big\langle\nabla{\cal L}_{m}(\theta^{\star}),\theta-\theta^{\star}\big\rangle

where $\theta^{\star}$ is a minimizer of ${\cal L}$. Assuming that $\nabla\ell(\theta;\xi_{m})$ is $\bar{L}$-Lipschitz continuous for all $\xi_{m}$, we have

\|\nabla\ell(\theta;\xi_{m})-\nabla\ell(\theta^{\star};\xi_{m})\|^{2}\leq 2\bar{L}\big(\ell(\theta;\xi_{m})-\ell(\theta^{\star};\xi_{m})-\big\langle\nabla\ell(\theta^{\star};\xi_{m}),\theta-\theta^{\star}\big\rangle\big).

Taking expectation with respect to $\xi_{m}$, we obtain

\mathbb{E}_{\xi_{m}}\big[\|\nabla\ell(\theta;\xi_{m})-\nabla\ell(\theta^{\star};\xi_{m})\|^{2}\big]\leq 2\bar{L}\big({\cal L}_{m}(\theta)-{\cal L}_{m}(\theta^{\star})-\big\langle\nabla{\cal L}_{m}(\theta^{\star}),\theta-\theta^{\star}\big\rangle\big)=2\bar{L}\psi_{m}(\theta).

Note that $\nabla{\cal L}_{m}$ is also $\bar{L}$-Lipschitz continuous, and thus

\|\nabla{\cal L}_{m}(\theta)-\nabla{\cal L}_{m}(\theta^{\star})\|^{2}\leq 2\bar{L}\big({\cal L}_{m}(\theta)-{\cal L}_{m}(\theta^{\star})-\big\langle\nabla{\cal L}_{m}(\theta^{\star}),\theta-\theta^{\star}\big\rangle\big)=2\bar{L}\psi_{m}(\theta).
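The two inequalities above are standard co-coercivity bounds for convex losses with $\bar{L}$-Lipschitz gradients. As a hedged illustration, the sketch below verifies the first inequality on a convex quadratic loss, for which $\bar{L}$ equals the largest eigenvalue of the Hessian; the quadratic form and all names are assumptions made for this example only.

    import numpy as np

    rng = np.random.default_rng(3)
    p = 5
    B = rng.normal(size=(p, p))
    A = B @ B.T                                # A >= 0, so the loss is convex
    b = rng.normal(size=p)
    L_bar = np.linalg.eigvalsh(A).max()        # Lipschitz constant of the gradient

    loss = lambda th: 0.5 * th @ A @ th + b @ th
    grad = lambda th: A @ th + b
    theta_star = np.linalg.solve(A, -b)        # minimizer: grad(theta_star) = 0

    for _ in range(100):
        th = rng.normal(size=p)
        lhs = np.sum((grad(th) - grad(theta_star)) ** 2)
        rhs = 2 * L_bar * (loss(th) - loss(theta_star)
                           - grad(theta_star) @ (th - theta_star))
        assert lhs <= rhs * (1 + 1e-9) + 1e-9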

7.1 Derivations of (6)

By (21), we can derive that

\|\theta_{1}+\theta_{2}\|^{2}\leq 2\|\theta_{1}\|^{2}+2\|\theta_{2}\|^{2}

which also implies $\|\theta_{1}\|^{2}\geq\frac{1}{2}\|\theta_{1}+\theta_{2}\|^{2}-\|\theta_{2}\|^{2}$.

As a consequence, we can obtain

\mathbb{E}\big[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\|^{2}\big]
\geq\frac{1}{2}\mathbb{E}\Big[\big\|\big(\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla{\cal L}_{m}(\theta^{k})\big)+\big(\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\big)\big\|^{2}\Big]-\mathbb{E}\big[\|\nabla{\cal L}_{m}(\theta^{k})-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\|^{2}\big]
=\frac{1}{2}\mathbb{E}\big[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla{\cal L}_{m}(\theta^{k})\|^{2}\big]+\frac{1}{2}\mathbb{E}\big[\|\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\|^{2}\big]+\underbrace{\mathbb{E}\big[\big\langle\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla{\cal L}_{m}(\theta^{k}),\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\big\rangle\big]}_{I_{3}}-\mathbb{E}\big[\|\nabla{\cal L}_{m}(\theta^{k})-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\|^{2}\big]

where we used the fact that $I_{3}=0$ to obtain (6); indeed,

I_{3}=\mathbb{E}\Big[\Big\langle\mathbb{E}\big[\nabla\ell(\theta^{k};\xi_{m}^{k})\,\big|\,\Theta^{k}\big]-\nabla{\cal L}_{m}(\theta^{k}),\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})\Big\rangle\Big]=0.

7.2 Derivations of (9)

Recall that

\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}=\big(\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\tilde{\theta};\xi_{m}^{k})+\nabla{\cal L}_{m}(\tilde{\theta})\big)-\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\tilde{\theta};\xi_{m}^{k-\tau_{m}^{k}})+\nabla{\cal L}_{m}(\tilde{\theta})\big)=\underbrace{\big(\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\tilde{\theta};\xi_{m}^{k})+\nabla\psi_{m}(\tilde{\theta})\big)}_{g_{m}^{k}}-\underbrace{\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\tilde{\theta};\xi_{m}^{k-\tau_{m}^{k}})+\nabla\psi_{m}(\tilde{\theta})\big)}_{g_{m}^{k-\tau_{m}^{k}}}.

By (21), we have $\|\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}\|^{2}\leq 2\|g_{m}^{k}\|^{2}+2\|g_{m}^{k-\tau_{m}^{k}}\|^{2}$. We bound the first term as

\mathbb{E}[\|g_{m}^{k}\|^{2}]\leq 2\mathbb{E}\big[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{\star};\xi_{m}^{k})\|^{2}\big]+2\mathbb{E}\big[\|\nabla\ell(\tilde{\theta};\xi_{m}^{k})-\nabla\ell(\theta^{\star};\xi_{m}^{k})-\nabla\psi_{m}(\tilde{\theta})\|^{2}\big]
=2\mathbb{E}\big[\mathbb{E}\big[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{\star};\xi_{m}^{k})\|^{2}\,\big|\,\Theta^{k}\big]\big]+2\mathbb{E}\big[\|\nabla\ell(\tilde{\theta};\xi_{m}^{k})-\nabla\ell(\theta^{\star};\xi_{m}^{k})-\mathbb{E}[\nabla\ell(\tilde{\theta};\xi_{m}^{k})-\nabla\ell(\theta^{\star};\xi_{m}^{k})\,|\,\Theta^{k}]\|^{2}\big]
\leq 4\bar{L}\,\mathbb{E}[\psi_{m}(\theta^{k})]+2\mathbb{E}\big[\|\nabla\ell(\tilde{\theta};\xi_{m}^{k})-\nabla\ell(\theta^{\star};\xi_{m}^{k})\|^{2}\big]
=4\bar{L}\,\mathbb{E}[\psi_{m}(\theta^{k})]+2\mathbb{E}\big[\mathbb{E}\big[\|\nabla\ell(\tilde{\theta};\xi_{m}^{k})-\nabla\ell(\theta^{\star};\xi_{m}^{k})\|^{2}\,\big|\,\Theta^{k}\big]\big]
\leq 4\bar{L}\,\mathbb{E}[\psi_{m}(\theta^{k})]+4\bar{L}\,\mathbb{E}[\psi_{m}(\tilde{\theta})].

By the nonnegativity of $\psi_{m}$, we further have

\mathbb{E}[\|g_{m}^{k}\|^{2}]\leq 4\bar{L}\sum_{m\in{\cal M}}\mathbb{E}[\psi_{m}(\theta^{k})]+4\bar{L}\sum_{m\in{\cal M}}\mathbb{E}[\psi_{m}(\tilde{\theta})]=4M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k})]-{\cal L}(\theta^{\star})\big)+4M\bar{L}\big(\mathbb{E}[{\cal L}(\tilde{\theta})]-{\cal L}(\theta^{\star})\big) (35)

where the last equality uses $\sum_{m\in{\cal M}}\psi_{m}(\theta)=M({\cal L}(\theta)-{\cal L}(\theta^{\star}))$, which follows since $\nabla{\cal L}(\theta^{\star})=0$.

Similarly, we can prove

\mathbb{E}[\|g_{m}^{k-\tau_{m}^{k}}\|^{2}]\leq 4M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k-\tau_{m}^{k}})]-{\cal L}(\theta^{\star})\big)+4M\bar{L}\big(\mathbb{E}[{\cal L}(\tilde{\theta})]-{\cal L}(\theta^{\star})\big). (36)

Therefore, it follows that

\mathbb{E}[\|\tilde{\delta}_{m}^{k}-\tilde{\delta}_{m}^{k-\tau_{m}^{k}}\|^{2}]\leq 8M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k})]-{\cal L}(\theta^{\star})\big)+8M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k-\tau_{m}^{k}})]-{\cal L}(\theta^{\star})\big)+16M\bar{L}\big(\mathbb{E}[{\cal L}(\tilde{\theta})]-{\cal L}(\theta^{\star})\big).

7.3 Derivations of (11)

The LHS of (10) can be written as

\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})=\big(\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})+\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})\big)-\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})=\big(\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})+\nabla\psi_{m}(\theta^{k-\tau_{m}^{k}})\big)-\nabla\psi_{m}(\theta^{k-\tau_{m}^{k}}).

Similar to the derivation of (35), we can obtain

\mathbb{E}\big[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})+\nabla\psi_{m}(\theta^{k-\tau_{m}^{k}})\|^{2}\big]\leq 4M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k})]-{\cal L}(\theta^{\star})\big)+4M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k-\tau_{m}^{k}})]-{\cal L}(\theta^{\star})\big).

Combined with the fact that

\mathbb{E}\big[\|\nabla\psi_{m}(\theta^{k-\tau_{m}^{k}})\|^{2}\big]=\mathbb{E}\big[\|\nabla{\cal L}_{m}(\theta^{k-\tau_{m}^{k}})-\nabla{\cal L}_{m}(\theta^{\star})\|^{2}\big]\leq 2\bar{L}\,\mathbb{E}[\psi_{m}(\theta^{k-\tau_{m}^{k}})]\leq 2M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k-\tau_{m}^{k}})]-{\cal L}(\theta^{\star})\big)

we have

\mathbb{E}\big[\|\nabla\ell(\theta^{k};\xi_{m}^{k})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})\|^{2}\big]\leq 8M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k})]-{\cal L}(\theta^{\star})\big)+12M\bar{L}\big(\mathbb{E}[{\cal L}(\theta^{k-\tau_{m}^{k}})]-{\cal L}(\theta^{\star})\big).

8 Proof of Lemma 3.1

Using the smoothness of $\mathcal{L}(\theta)$ in Assumption 1, we have

\mathcal{L}(\theta^{k+1})\leq\mathcal{L}(\theta^{k})+\Big\langle\nabla\mathcal{L}(\theta^{k}),\theta^{k+1}-\theta^{k}\Big\rangle+\frac{L}{2}\big\|\theta^{k+1}-\theta^{k}\big\|^{2}=\mathcal{L}(\theta^{k})-\alpha_{k}\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k+1})^{-\frac{1}{2}}h^{k+1}\Big\rangle+\frac{L}{2}\big\|\theta^{k+1}-\theta^{k}\big\|^{2}. (37)

We can further decompose the inner product as

-\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k+1})^{-\frac{1}{2}}h^{k+1}\Big\rangle=-(1-\beta_{1})\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}\bm{\nabla}^{k}\Big\rangle\underbrace{-\beta_{1}\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle}_{I_{1}^{k}}\underbrace{-\Big\langle\nabla\mathcal{L}(\theta^{k}),\big((\epsilon I+\hat{V}^{k+1})^{-\frac{1}{2}}-(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}\big)h^{k+1}\Big\rangle}_{I_{2}^{k}} (38)

where we again decompose the first inner product as

-(1-\beta_{1})\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}\bm{\nabla}^{k}\Big\rangle=\underbrace{-(1-\beta_{1})\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bm{\nabla}^{k}\Big\rangle}_{I_{3}^{k}}\underbrace{-(1-\beta_{1})\Big\langle\nabla\mathcal{L}(\theta^{k}),\big((\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}-(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big)\bm{\nabla}^{k}\Big\rangle}_{I_{4}^{k}}. (39)

Next, we bound the terms $I_{1}^{k},I_{2}^{k},I_{3}^{k},I_{4}^{k}$ separately.

Taking expectation on $I_{1}^{k}$ conditioned on $\Theta^{k}$, we have

\mathbb{E}[I_{1}^{k}\mid\Theta^{k}]=-\mathbb{E}\Big[\beta_{1}\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle\mid\Theta^{k}\Big]=-\beta_{1}\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle-\beta_{1}\Big\langle\nabla\mathcal{L}(\theta^{k})-\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle\stackrel{(a)}{\leq}-\beta_{1}\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle+\alpha_{k-1}^{-1}\beta_{1}L\big\|\theta^{k}-\theta^{k-1}\big\|^{2}\stackrel{(b)}{\leq}\beta_{1}\big(I_{1}^{k-1}+I_{2}^{k-1}+I_{3}^{k-1}+I_{4}^{k-1}\big)+\alpha_{k-1}^{-1}\beta_{1}L\big\|\theta^{k}-\theta^{k-1}\big\|^{2} (40)

where (a) follows from the $L$-smoothness of $\mathcal{L}(\theta)$ implied by Assumption 1, and (b) uses again the decompositions (38) and (39).

Taking expectation on $I_{2}^{k}$ over all the randomness, we have

\mathbb{E}[I_{2}^{k}]=\mathbb{E}\Big[-\Big\langle\nabla\mathcal{L}(\theta^{k}),\big((\epsilon I+\hat{V}^{k+1})^{-\frac{1}{2}}-(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}\big)h^{k+1}\Big\rangle\Big]=\mathbb{E}\Big[\sum_{i=1}^{p}\nabla_{i}\mathcal{L}(\theta^{k})h_{i}^{k+1}\big((\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big]\stackrel{(d)}{\leq}\mathbb{E}\Big[\|\nabla\mathcal{L}(\theta^{k})\|\|h^{k+1}\|\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big]\stackrel{(e)}{\leq}\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big] (41)

where (d) follows from the Cauchy-Schwarz inequality, and (e) is due to Assumption 2 and Lemma 6.3.

Regarding $I_{3}^{k}$, we bound it separately in Lemma 3.2.

Taking expectation on $I_{4}^{k}$ over all the randomness, we have

\mathbb{E}[I_{4}^{k}]=\mathbb{E}\Big[-(1-\beta_{1})\Big\langle\nabla\mathcal{L}(\theta^{k}),\big((\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}-(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big)\bm{\nabla}^{k}\Big\rangle\Big]=-(1-\beta_{1})\mathbb{E}\Big[\sum_{i=1}^{p}\nabla_{i}\mathcal{L}(\theta^{k})\bm{\nabla}^{k}_{i}\big((\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}\big)\Big]\leq(1-\beta_{1})\mathbb{E}\Big[\|\nabla\mathcal{L}(\theta^{k})\|\|\bm{\nabla}^{k}\|\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}\big)\Big]\leq(1-\beta_{1})\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}\big)\Big]. (42)

Taking expectation on (37) over all the randomness, and plugging in (40), (41), and (42), we have

\mathbb{E}[\mathcal{L}(\theta^{k+1})]-\mathbb{E}[\mathcal{L}(\theta^{k})]\leq-\alpha_{k}\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k+1})^{-\frac{1}{2}}h^{k+1}\Big\rangle\Big]+\frac{L}{2}\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]=\alpha_{k}\mathbb{E}\big[I_{1}^{k}+I_{2}^{k}+I_{3}^{k}+I_{4}^{k}\big]+\frac{L}{2}\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]
\leq-\alpha_{k}(1-\beta_{1})\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bm{\nabla}^{k}\Big\rangle\Big]-\alpha_{k}\beta_{1}\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle\Big]+\alpha_{k}\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big]+\alpha_{k}(1-\beta_{1})\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}\big)\Big]+\Big(\frac{L}{2}+\alpha_{k}\alpha_{k-1}^{-1}\beta_{1}L\Big)\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]. (43)

Since $(\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}\leq(\epsilon+\hat{v}_{i}^{k-1})^{-\frac{1}{2}}$, we have

\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)+(1-\beta_{1})\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k})^{-\frac{1}{2}}\big)\Big]\leq(2-\beta_{1})\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big]. (44)

Plugging (44) into (43) leads to the statement of Lemma 3.1.

9 Proof of Lemma 3.2

We first analyze the inner product under the CADA2 rule and then under the CADA1 rule.

First recall that $\bar{\bm{\nabla}}^{k}=\frac{1}{M}\sum_{m\in{\cal M}}\nabla\ell(\theta^{k};\xi_{m}^{k})$. Using the law of total expectation, we have

\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bar{\bm{\nabla}}^{k}\Big\rangle\Big]=\mathbb{E}\Big[\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bar{\bm{\nabla}}^{k}\Big\rangle\mid\Theta^{k}\Big]\Big]=\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\mathbb{E}\big[\bar{\bm{\nabla}}^{k}\mid\Theta^{k}\big]\Big\rangle\Big]=\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]. (45)

Taking expectation on $\big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bm{\nabla}^{k}\big\rangle$ over all randomness, we have

-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bm{\nabla}^{k}\Big\rangle\Big]=-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\bar{\bm{\nabla}}^{k}\Big\rangle\Big]-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\frac{1}{M}\sum_{m\in{\cal M}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big)\Big\rangle\Big]\stackrel{(a)}{=}-\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]-\frac{1}{M}\sum_{m\in{\cal M}}\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big)\Big\rangle\Big] (46)

where (a) uses (45).

Decomposing the inner product, for the CADA2 rule (10), we have

-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big)\Big\rangle\Big]=-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})\big)\Big\rangle\Big]-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big)\Big\rangle\Big]\stackrel{(b)}{\leq}\frac{L\epsilon^{-\frac{1}{2}}}{12\alpha_{k}}\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+6DL\alpha_{k}\epsilon^{-\frac{1}{2}}\sigma_{m}^{2}-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big)\Big\rangle\Big] (47)

where (b) follows from Lemma 6.1.

Using Young's inequality and the Cauchy-Schwarz inequality, we can bound the last inner product in (47) as

-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big)\Big\rangle\Big]\stackrel{(g)}{\leq}\frac{1}{2}\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+\frac{1}{2}\mathbb{E}\Big[\big\|(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big\|\,\big\|\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big\|^{2}\Big]\stackrel{(h)}{\leq}\frac{1}{2}\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+\frac{c}{2d_{\rm max}}\mathbb{E}\Big[\big\|(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big\|\sum_{d=1}^{d_{\rm max}}\big\|\theta^{k+1-d}-\theta^{k-d}\big\|^{2}\Big]\stackrel{(i)}{\leq}\frac{1}{2}\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+\frac{c\epsilon^{-\frac{1}{2}}}{2d_{\rm max}}\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big] (48)

where (g) follows from Young's and the Cauchy-Schwarz inequalities, (h) uses the adaptive communication condition (10) of CADA2, and (i) follows since $\|(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\|\leq\epsilon^{-\frac{1}{2}}$ and each summand $\|\theta^{k+1-d}-\theta^{k-d}\|^{2}$ is nonnegative, so the sum can be extended from $d_{\rm max}$ to $D$.

Similarly, for CADA1's condition (7), we have

-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\theta^{k-\tau_{m}^{k}};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\theta^{k};\xi_{m}^{k})\big)\Big\rangle\Big]=-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\nabla\ell(\tilde{\theta};\xi_{m}^{k-\tau_{m}^{k}})-\nabla\ell(\tilde{\theta};\xi_{m}^{k})\big)\Big\rangle\Big]-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\tilde{\delta}^{k-\tau_{m}^{k}}_{m}-\tilde{\delta}^{k}_{m}\big)\Big\rangle\Big]\stackrel{(j)}{\leq}\frac{L\epsilon^{-\frac{1}{2}}}{12\alpha_{k}}\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+6DL\alpha_{k}\epsilon^{-\frac{1}{2}}\sigma_{m}^{2}-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\tilde{\delta}_{m}^{k-\tau_{m}^{k}}-\tilde{\delta}_{m}^{k}\big)\Big\rangle\Big] (49)

where (j) follows from Lemma 6.1, since $\tilde{\theta}$ is a snapshot among $\{\theta^{k},\cdots,\theta^{k-D}\}$.

And the last inner product in (49) is bounded by

-\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big(\tilde{\delta}_{m}^{k-\tau_{m}^{k}}-\tilde{\delta}_{m}^{k}\big)\Big\rangle\Big]\leq\frac{1}{2}\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+\frac{c}{2d_{\rm max}}\mathbb{E}\Big[\big\|(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}\big\|\sum_{d=1}^{d_{\rm max}}\big\|\theta^{k+1-d}-\theta^{k-d}\big\|^{2}\Big]\stackrel{(i)}{\leq}\frac{1}{2}\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+\frac{c\epsilon^{-\frac{1}{2}}}{2d_{\rm max}}\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]. (50)

Combining (46)-(50) leads to the desired statement for both CADA1 and CADA2.

10 Proof of Lemma 3.3

For notational brevity, we re-write the Lyapunov function (3.1) as

{\cal V}^{k}:=\mathcal{L}(\theta^{k})-\mathcal{L}(\theta^{\star})-c_{k}\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle+b_{k}\sum_{d=0}^{D}\sum_{i=1}^{p}(\epsilon+\hat{v}_{i}^{k-d})^{-\frac{1}{2}}+\sum_{d=1}^{D}\rho_{d}\|\theta^{k+1-d}-\theta^{k-d}\|^{2} (51)

where $\{c_{k}\}$ are some positive constants.

Therefore, taking expectation on the difference of ${\cal V}^{k+1}$ and ${\cal V}^{k}$ in (51), we have (with $\rho_{D+1}=0$)

\mathbb{E}[{\cal V}^{k+1}]-\mathbb{E}[{\cal V}^{k}]=\mathbb{E}[\mathcal{L}(\theta^{k+1})]-\mathbb{E}[\mathcal{L}(\theta^{k})]-c_{k+1}\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k}),(\epsilon I+\hat{V}^{k+1})^{-\frac{1}{2}}h^{k+1}\Big\rangle\Big]+c_{k}\mathbb{E}\Big[\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle\Big]+b_{k+1}\sum_{d=0}^{D}\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k+1-d})^{-\frac{1}{2}}\big]-b_{k}\sum_{d=0}^{D}\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k-d})^{-\frac{1}{2}}\big]+\rho_{1}\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]+\sum_{d=1}^{D}(\rho_{d+1}-\rho_{d})\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]
\stackrel{(a)}{\leq}(\alpha_{k}+c_{k+1})\mathbb{E}\big[I_{1}^{k}+I_{2}^{k}+I_{3}^{k}+I_{4}^{k}\big]-c_{k}\mathbb{E}\big[I_{1}^{k-1}+I_{2}^{k-1}+I_{3}^{k-1}+I_{4}^{k-1}\big]+b_{k+1}\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big]-b_{k}\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}\big]+\sum_{d=1}^{D}(b_{k+1}-b_{k})\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k+1-d})^{-\frac{1}{2}}\big]+\Big(\frac{L}{2}+\rho_{1}\Big)\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]+\sum_{d=1}^{D}(\rho_{d+1}-\rho_{d})\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big] (52)

where (a) uses the smoothness in Assumption 1 and the definitions of $I_{1}^{k},I_{2}^{k},I_{3}^{k},I_{4}^{k}$ in (38) and (39).

Note that we can bound $(\alpha_{k}+c_{k+1})\mathbb{E}\big[I_{1}^{k}+I_{2}^{k}+I_{3}^{k}+I_{4}^{k}\big]$ in the same way as (43) in the proof of Lemma 3.1. In addition, Lemma 3.2 implies that

\mathbb{E}[I_{3}^{k}]\leq-\frac{1-\beta_{1}}{2}\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+(1-\beta_{1})\epsilon^{-\frac{1}{2}}\Big(\frac{L}{12\alpha_{k}}+\frac{c}{2d_{\rm max}}\Big)\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+(1-\beta_{1})\frac{6DL\alpha_{k}\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}. (53)

Hence, plugging Lemma 3.1 with $\alpha_{k}$ replaced by $\alpha_{k}+c_{k+1}$ into (52), together with (53), leads to

\mathbb{E}[{\cal V}^{k+1}]-\mathbb{E}[{\cal V}^{k}]\leq-(\alpha_{k}+c_{k+1})\Big(\frac{1-\beta_{1}}{2}\Big)\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+(\alpha_{k}+c_{k+1})(1-\beta_{1})\epsilon^{-\frac{1}{2}}\Big(\frac{L}{12\alpha_{k}}+\frac{c}{2d_{\rm max}}\Big)\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+(\alpha_{k}+c_{k+1})(1-\beta_{1})\frac{6DL\alpha_{k}\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+\big((\alpha_{k}+c_{k+1})\beta_{1}-c_{k}\big)\mathbb{E}\big[I_{1}^{k-1}+I_{2}^{k-1}+I_{3}^{k-1}+I_{4}^{k-1}\big]+(\alpha_{k}+c_{k+1})(2-\beta_{1})\sigma^{2}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big]+b_{k+1}\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big]-b_{k}\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}\big]+\sum_{d=1}^{D}(b_{k+1}-b_{k})\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k+1-d})^{-\frac{1}{2}}\big]+\sum_{d=1}^{D}(\rho_{d+1}-\rho_{d})\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+\Big(\frac{L}{2}+\rho_{1}+(\alpha_{k}+c_{k+1})\alpha_{k-1}^{-1}\beta_{1}L\Big)\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]. (54)

Select $\alpha_{k}\leq\alpha_{k-1}$ and $c_{k}:=\sum_{j=k}^{\infty}\alpha_{j}\beta_{1}^{j-k+1}\leq(1-\beta_{1})^{-1}\alpha_{k}$ so that $(\alpha_{k}+c_{k+1})\beta_{1}=c_{k}$ and

(\alpha_{k}+c_{k+1})(1-\beta_{1})\leq\big(\alpha_{k}+(1-\beta_{1})^{-1}\alpha_{k+1}\big)(1-\beta_{1})\leq\alpha_{k}\big(1+(1-\beta_{1})^{-1}\big)(1-\beta_{1})=\alpha_{k}(2-\beta_{1}).

In addition, select $b_{k}$ to ensure that $b_{k+1}\leq b_{k}$. Then it follows from (54) that

\mathbb{E}[{\cal V}^{k+1}]-\mathbb{E}[{\cal V}^{k}]\leq-\frac{\alpha_{k}(1-\beta_{1})}{2}\mathbb{E}\Big[\big\|\nabla\mathcal{L}(\theta^{k})\big\|^{2}_{(\epsilon I+\hat{V}^{k-D})^{-\frac{1}{2}}}\Big]+(2-\beta_{1})\alpha_{k}^{2}\frac{6DL\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+(2-\beta_{1})\alpha_{k}\epsilon^{-\frac{1}{2}}\Big(\frac{L}{12\alpha_{k}}+\frac{c}{2d_{\rm max}}\Big)\sum_{d=1}^{D}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]+\Big(\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}\alpha_{k}\sigma^{2}-b_{k}\Big)\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big]+\Big(\frac{L}{2}+\rho_{1}+(1-\beta_{1})^{-1}L\Big)\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]+\sum_{d=1}^{D}(\rho_{d+1}-\rho_{d})\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big] (55)

where we have also used that $-(\alpha_{k}+c_{k+1})\big(\frac{1-\beta_{1}}{2}\big)\leq-\frac{\alpha_{k}(1-\beta_{1})}{2}$ since $c_{k+1}\geq 0$.

If we choose $\alpha_{k}\leq\frac{1}{L}$ for $k=1,2,\ldots,K$, then it follows from (55) that

\mathbb{E}[{\cal V}^{k+1}]-\mathbb{E}[{\cal V}^{k}]\leq-\frac{\alpha_{k}(1-\beta_{1})}{2}\Big(\epsilon+\frac{\sigma^{2}}{1-\beta_{2}}\Big)^{-\frac{1}{2}}\mathbb{E}\big[\|\nabla\mathcal{L}(\theta^{k})\|^{2}\big]+(2-\beta_{1})\frac{6\alpha_{k}^{2}DL\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+\underbrace{\Big(\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}\alpha_{k}\sigma^{2}-b_{k}\Big)}_{A^{k}}\mathbb{E}\Big[\sum_{i=1}^{p}\big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\big)\Big]+\Big(\frac{L}{2}+\rho_{1}+(1-\beta_{1})^{-1}L\Big)\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]+\sum_{d=1}^{D}\underbrace{\Big((2-\beta_{1})\epsilon^{-\frac{1}{2}}\Big(\frac{L}{12}+\frac{c\alpha_{k}}{2d_{\rm max}}\Big)+\rho_{d+1}-\rho_{d}\Big)}_{B_{d}^{k}}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big]. (56)

To ensure $A^{k}\leq 0$ and $B_{d}^{k}\leq 0$, it is sufficient to choose $\{b_{k}\}$ and $\{\rho_{d}\}$ satisfying (with $\rho_{D+1}=0$)

\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}\alpha_{k}\sigma^{2}-b_{k}\leq 0,\quad k=1,\cdots,K (57)

(2-\beta_{1})\epsilon^{-\frac{1}{2}}\Big(\frac{L}{12}+\frac{c\alpha_{k}}{2d_{\rm max}}\Big)+\rho_{d+1}-\rho_{d}\leq 0,\quad d=1,\cdots,D. (58)

Since $\alpha_{k}\leq\frac{1}{L}$, this system of linear inequalities is satisfied by the choice

b_{k}=\frac{(2-\beta_{1})^{2}}{(1-\beta_{1})L}\sigma^{2},\quad k=1,\cdots,K (59)

\rho_{d}=(2-\beta_{1})\epsilon^{-\frac{1}{2}}\Big(\frac{L}{12}+\frac{c}{2Ld_{\rm max}}\Big)(D-d+1),\quad d=1,\cdots,D (60)

plugging which into (56) leads to the conclusion of Lemma 3.3.
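To make the choice (59)-(60) concrete, the sketch below instantiates $b_{k}$ and $\rho_{d}$ for assumed values of $L,c,\sigma,\epsilon,\beta_{1},D,d_{\rm max}$ and verifies that the feasibility conditions (57)-(58) hold whenever $\alpha_{k}\leq 1/L$; every numeric value is an illustrative assumption.

    import numpy as np

    # Illustrative constants (assumptions, not values from the paper).
    L, c, sigma, eps = 2.0, 1.0, 1.5, 1e-8
    beta1, D, d_max, K = 0.9, 5, 5, 100
    alphas = np.full(K, 1.0 / L)          # any stepsizes with alpha_k <= 1/L

    b_k = (2 - beta1) ** 2 * sigma ** 2 / ((1 - beta1) * L)            # (59)
    # rho[i] stores rho_{i+1} for i = 0, ..., D, with rho_{D+1} = 0.
    rho = [(2 - beta1) * eps ** (-0.5) * (L / 12 + c / (2 * L * d_max))
           * (D - d + 1) for d in range(1, D + 2)]                     # (60)

    for a in alphas:
        assert (2 - beta1) ** 2 / (1 - beta1) * a * sigma ** 2 - b_k <= 1e-8  # (57)
        for d in range(1, D + 1):
            lhs = ((2 - beta1) * eps ** (-0.5) * (L / 12 + c * a / (2 * d_max))
                   + rho[d] - rho[d - 1])
            assert lhs <= 1e-8                                                # (58)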

11 Proof of Theorem 3.4

From the definition of ${\cal V}^{k}$, we have, for any $k$, that

\mathbb{E}[{\cal V}^{k}]\geq\mathcal{L}(\theta^{k})-\mathcal{L}(\theta^{\star})-c_{k}\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle+\sum_{d=1}^{D}\rho_{d}\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\geq-|c_{k}|\,\big\|\nabla\mathcal{L}(\theta^{k-1})\big\|\,\big\|(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\big\|\geq-(1-\beta_{1})^{-1}\alpha_{k}\sigma^{2}\epsilon^{-\frac{1}{2}} (61)

where we use Assumption 2 and Lemma 6.3.

By taking the summation of (56) over $k=0,\cdots,K-1$, it follows that

\frac{\alpha(1-\beta_{1})}{2}\Big(\epsilon+\frac{\sigma^{2}}{1-\beta_{2}}\Big)^{-\frac{1}{2}}\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla\mathcal{L}(\theta^{k})\|^{2}\big]\leq\frac{\mathbb{E}[{\cal V}^{1}]-\mathbb{E}[{\cal V}^{K+1}]}{K}+(2-\beta_{1})\frac{6\alpha^{2}DL\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}\sigma^{2}pD\epsilon^{-\frac{1}{2}}\frac{\alpha}{K}+\Big(\frac{L}{2}+\rho_{1}+(1-\beta_{1})^{-1}L\Big)\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\theta^{k+1}-\theta^{k}\|^{2}\big]
\stackrel{(a)}{\leq}\frac{\mathbb{E}[{\cal V}^{1}]}{K}+(2-\beta_{1})\frac{6\alpha^{2}DL\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+(1-\beta_{1})^{-1}\sigma^{2}\epsilon^{-\frac{1}{2}}\frac{\alpha}{K}+\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}\sigma^{2}pD\epsilon^{-\frac{1}{2}}\frac{\alpha}{K}+\Big(\frac{L}{2}+\rho_{1}+(1-\beta_{1})^{-1}L\Big)p(1-\beta_{2})^{-1}(1-\beta_{3})^{-1}\alpha^{2} (62)

where (a) follows from (61) and Lemma 6.5.

Specifically, if we choose a constant stepsize $\alpha:=\frac{\eta}{\sqrt{K}}$, where $\eta>0$ is a constant, and define

\tilde{C}_{1}:=(2-\beta_{1})6DL\epsilon^{-\frac{1}{2}} (63)

\tilde{C}_{2}:=(1-\beta_{1})^{-1}\epsilon^{-\frac{1}{2}}+\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}D\epsilon^{-\frac{1}{2}} (64)

\tilde{C}_{3}:=\Big(\frac{L}{2}+\rho_{1}+(1-\beta_{1})^{-1}L\Big)(1-\beta_{2})^{-1}(1-\beta_{3})^{-1} (65)

\tilde{C}_{4}:=\frac{1}{2}(1-\beta_{1})\Big(\epsilon+\frac{\sigma^{2}}{1-\beta_{2}}\Big)^{-\frac{1}{2}} (66)

we can obtain from (62) that

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|\nabla\mathcal{L}(\theta^{k})\|^{2}\big]\leq\frac{\frac{\mathcal{L}(\theta^{0})-\mathcal{L}(\theta^{\star})}{K}+\frac{\tilde{C}_{1}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}\alpha^{2}+\tilde{C}_{2}p\sigma^{2}\frac{\alpha}{K}+\tilde{C}_{3}p\alpha^{2}}{\alpha\tilde{C}_{4}}\leq\frac{\mathcal{L}(\theta^{0})-\mathcal{L}(\theta^{\star})}{K\alpha\tilde{C}_{4}}+\frac{\tilde{C}_{1}\alpha}{\tilde{C}_{4}M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+\frac{\tilde{C}_{2}p\sigma^{2}}{K\tilde{C}_{4}}+\frac{\tilde{C}_{3}p\alpha}{\tilde{C}_{4}}=\frac{(\mathcal{L}(\theta^{0})-\mathcal{L}(\theta^{\star}))C_{4}}{\sqrt{K}\eta}+\frac{C_{1}\eta}{\sqrt{K}M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+\frac{C_{2}p\sigma^{2}}{K}+\frac{C_{3}p\eta}{\sqrt{K}}

where we define $C_{1}:=\tilde{C}_{1}/\tilde{C}_{4}$, $C_{2}:=\tilde{C}_{2}/\tilde{C}_{4}$, $C_{3}:=\tilde{C}_{3}/\tilde{C}_{4}$, and $C_{4}:=1/\tilde{C}_{4}$.
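Since the constants (63)-(66) are explicit, the $O(1/\sqrt{K})$ bound above can be evaluated directly for a given configuration. The helper below is our own illustrative sketch (all inputs are assumed values, and the argument gap stands for $\mathcal{L}(\theta^{0})-\mathcal{L}(\theta^{\star})$):

    import numpy as np

    def cada_bound(K, eta, p, M, L, D, d_max, c, sigma_m, beta1, beta2, eps, gap):
        """Evaluate the RHS of the Theorem 3.4 bound; gap = L(theta^0) - L(theta^*)."""
        sigma = np.mean(sigma_m)               # sigma := (1/M) sum_m sigma_m
        beta3 = beta1 ** 2 / beta2
        rho1 = (2 - beta1) * eps ** (-0.5) * (L / 12 + c / (2 * L * d_max)) * D
        C1t = (2 - beta1) * 6 * D * L * eps ** (-0.5)                          # (63)
        C2t = (1 / (1 - beta1) + (2 - beta1) ** 2 / (1 - beta1) * D) * eps ** (-0.5)  # (64)
        C3t = (L / 2 + rho1 + L / (1 - beta1)) / ((1 - beta2) * (1 - beta3))   # (65)
        C4t = 0.5 * (1 - beta1) * (eps + sigma ** 2 / (1 - beta2)) ** (-0.5)   # (66)
        C1, C2, C3, C4 = C1t / C4t, C2t / C4t, C3t / C4t, 1 / C4t
        return (gap * C4 / (np.sqrt(K) * eta)
                + C1 * eta / (np.sqrt(K) * M) * np.sum(sigma_m ** 2)
                + C2 * p * sigma ** 2 / K
                + C3 * p * eta / np.sqrt(K))

    print(cada_bound(K=10_000, eta=0.1, p=100, M=10, L=2.0, D=5, d_max=5,
                     c=1.0, sigma_m=np.full(10, 1.5), beta1=0.9, beta2=0.99,
                     eps=1e-8, gap=1.0))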

12 Proof of Theorem 3.5

By the PL-condition of $\mathcal{L}(\theta)$, we have

-\frac{\alpha_{k}(1-\beta_{1})}{2}\Big(\epsilon+\frac{\sigma^{2}}{1-\beta_{2}}\Big)^{-\frac{1}{2}}\mathbb{E}\big[\|\nabla\mathcal{L}(\theta^{k})\|^{2}\big]\leq-\alpha_{k}\mu(1-\beta_{1})\Big(\epsilon+\frac{\sigma^{2}}{1-\beta_{2}}\Big)^{-\frac{1}{2}}\mathbb{E}\big[\mathcal{L}(\theta^{k})-\mathcal{L}(\theta^{\star})\big]\stackrel{(a)}{\leq}-2\alpha_{k}\mu\tilde{C}_{4}\Big(\mathbb{E}[{\cal V}^{k}]+c_{k}\Big\langle\nabla\mathcal{L}(\theta^{k-1}),(\epsilon I+\hat{V}^{k})^{-\frac{1}{2}}h^{k}\Big\rangle-b_{k}\sum_{d=0}^{D}\sum_{i=1}^{p}(\epsilon+\hat{v}_{i}^{k-d})^{-\frac{1}{2}}-\sum_{d=1}^{D}\rho_{d}\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\Big)\stackrel{(b)}{\leq}-2\alpha_{k}\mu\tilde{C}_{4}\mathbb{E}[{\cal V}^{k}]+2\alpha_{k}^{2}\mu\tilde{C}_{4}(1-\beta_{1})^{-1}\sigma^{2}\epsilon^{-\frac{1}{2}}+2\alpha_{k}\mu\tilde{C}_{4}b_{k}\sum_{d=0}^{D}\sum_{i=1}^{p}\mathbb{E}\big[(\epsilon+\hat{v}_{i}^{k-d})^{-\frac{1}{2}}\big]+2\alpha_{k}\mu\tilde{C}_{4}\sum_{d=1}^{D}\rho_{d}\mathbb{E}\big[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\big] (67)

where (a) uses the definition of $\tilde{C}_{4}$ in (66), and (b) uses Assumption 2 and Lemma 6.3.

Plugging (12) into (10), we have

\mathbb{E}[{\cal V}^{k+1}]-\mathbb{E}[{\cal V}^{k}] \leq -2\alpha_{k}\mu\tilde{C}_{4}\mathbb{E}[{\cal V}^{k}]+(2-\beta_{1})\frac{6\alpha_{k}^{2}DL\epsilon^{-\frac{1}{2}}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2} \qquad (68)

+\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}\alpha_{k}\sigma^{2}\,\mathbb{E}\Big[\sum_{i=1}^{p}\Big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\Big)\Big]

+b_{k+1}\sum_{i=1}^{p}\mathbb{E}\Big[(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\Big]-(b_{k}-2\alpha_{k}\mu\tilde{C}_{4}b_{k})\sum_{i=1}^{p}\mathbb{E}\Big[(\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}\Big]

+\sum_{d=1}^{D}(b_{k+1}-b_{k}+2\alpha_{k}\mu\tilde{C}_{4}b_{k})\sum_{i=1}^{p}\mathbb{E}\Big[(\epsilon+\hat{v}_{i}^{k+1-d})^{-\frac{1}{2}}\Big]

+\left(\frac{L}{2}+\rho_{1}+(1-\beta_{1})^{-1}L\right)p(1-\beta_{2})^{-1}(1-\beta_{3})^{-1}\alpha_{k}^{2}+2\alpha_{k}^{2}\mu\tilde{C}_{4}(1-\beta_{1})^{-1}\sigma^{2}\epsilon^{-\frac{1}{2}}

+\sum_{d=1}^{D}\left((2-\beta_{1})\epsilon^{-\frac{1}{2}}\left(\frac{L}{12}+\frac{c\alpha_{k}}{2d_{\rm max}}\right)+\rho_{d+1}-\rho_{d}+2\alpha_{k}\mu\tilde{C}_{4}\rho_{d}\right)\mathbb{E}\left[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\right].

If we choose $b_{k}$ to ensure that $b_{k+1}\leq(1-2\alpha_{k}\mu\tilde{C}_{4})b_{k}$, then we can obtain from (68) that

\mathbb{E}[{\cal V}^{k+1}]-\mathbb{E}[{\cal V}^{k}] \qquad (69)

\leq -2\alpha_{k}\mu\tilde{C}_{4}\mathbb{E}[{\cal V}^{k}]+\frac{\tilde{C}_{1}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}\alpha_{k}^{2}+\tilde{C}_{3}p\alpha_{k}^{2}+2\mu\tilde{C}_{4}(1-\beta_{1})^{-1}\sigma^{2}\epsilon^{-\frac{1}{2}}\alpha_{k}^{2}

+\left(\frac{(2-\beta_{1})^{2}}{1-\beta_{1}}\alpha_{k}\sigma^{2}-(1-2\alpha_{k}\mu\tilde{C}_{4})b_{k}\right)\mathbb{E}\Big[\sum_{i=1}^{p}\Big((\epsilon+\hat{v}_{i}^{k-D})^{-\frac{1}{2}}-(\epsilon+\hat{v}_{i}^{k+1})^{-\frac{1}{2}}\Big)\Big]

+\sum_{d=1}^{D}\left((2-\beta_{1})\epsilon^{-\frac{1}{2}}\left(\frac{L}{12}+\frac{c\alpha_{k}}{2d_{\rm max}}\right)+\rho_{d+1}-\rho_{d}+2\alpha_{k}\mu\tilde{C}_{4}\rho_{d}\right)\mathbb{E}\left[\|\theta^{k+1-d}-\theta^{k-d}\|^{2}\right].

If $\alpha_{k}\leq\frac{1}{L}$, we choose parameters $\{b_{k},\rho_{d}\}$ to guarantee that

\frac{(2-\beta_{1})^{2}}{(1-\beta_{1})L}\sigma^{2}-\Big(1-\frac{2\mu\tilde{C}_{4}}{L}\Big)b_{k}\leq 0,~~~\forall k \qquad (70)

(2-\beta_{1})\left(\frac{L}{12}+\frac{c}{2Ld_{\rm max}}\right)\epsilon^{-\frac{1}{2}}+\rho_{d+1}-\Big(1-\frac{2\mu\tilde{C}_{4}}{L}\Big)\rho_{d}\leq 0,~~~d=1,\cdots,D \qquad (71)

and choose $\beta_{1},\beta_{2},\epsilon$ to ensure that $1-\frac{2\mu\tilde{C}_{4}}{L}\geq 0$.
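Conditions (70)-(71) can be checked mechanically: (70) is a lower bound on the constant $b_{k}$, while (71) is a backward recursion in $\rho_{d}$ that can be solved starting from $\rho_{D+1}=0$. Below is a minimal feasibility check in Python; all numeric values ($L$, $\mu$, $c$, $d_{\rm max}$, $D$, $\epsilon$, $\sigma^{2}$) are assumed for illustration only.

```python
# Feasibility check of conditions (70)-(71); numeric values are assumptions.
beta1, beta2 = 0.9, 0.999
L, mu, c, d_max, D, eps, sigma2 = 1.0, 0.1, 0.3, 10, 100, 1e-2, 1.0

C4t = 0.5 * (1 - beta1) * (eps + sigma2 / (1 - beta2)) ** -0.5
kappa = 1 - 2 * mu * C4t / L
assert kappa >= 0, "need beta1, beta2, eps with 1 - 2*mu*C4t/L >= 0"

# (70): any constant b_k at least this large works.
b_min = (2 - beta1) ** 2 * sigma2 / ((1 - beta1) * L * kappa)

# (71): solve the backward recursion rho_d >= (g + rho_{d+1}) / kappa
# with rho_{D+1} = 0, taking the smallest feasible rho_d at each step.
g = (2 - beta1) * (L / 12 + c / (2 * L * d_max)) * eps ** -0.5
rho = [0.0] * (D + 2)
for d in range(D, 0, -1):
    rho[d] = (g + rho[d + 1]) / kappa

print(b_min, rho[1])
```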

Then we have

\mathbb{E}[{\cal V}^{k+1}] \leq \left(1-2\alpha_{k}\mu\tilde{C}_{4}\right)\mathbb{E}[{\cal V}^{k}]+\Big(\underbrace{\frac{\tilde{C}_{1}}{M}\sum_{m\in{\cal M}}\sigma_{m}^{2}+\tilde{C}_{3}p+2\mu\tilde{C}_{4}(1-\beta_{1})^{-1}\sigma^{2}\epsilon^{-\frac{1}{2}}}_{:=\tilde{C}_{5}}\Big)\alpha_{k}^{2}

\leq \prod_{j=0}^{k}(1-2\alpha_{j}\mu\tilde{C}_{4})\,\mathbb{E}[{\cal V}^{0}]+\tilde{C}_{5}\sum_{j=0}^{k}\alpha_{j}^{2}\prod_{i=j+1}^{k}(1-2\alpha_{i}\mu\tilde{C}_{4}). \qquad (72)

If we choose $\alpha_{k}=\frac{1}{\mu(k+K_{0})\tilde{C}_{4}}\leq\frac{1}{L}$, where $K_{0}$ is a sufficiently large constant to ensure that $\alpha_{k}$ satisfies the aforementioned conditions, then we have

\mathbb{E}[{\cal V}^{K}] \leq \mathbb{E}[{\cal V}^{0}]\prod_{k=0}^{K-1}(1-2\alpha_{k}\mu\tilde{C}_{4})+\tilde{C}_{5}\sum_{k=0}^{K-1}\alpha_{k}^{2}\prod_{j=k+1}^{K-1}(1-2\alpha_{j}\mu\tilde{C}_{4})

\leq \mathbb{E}[{\cal V}^{0}]\prod_{k=0}^{K-1}\frac{k+K_{0}-2}{k+K_{0}}+\frac{\tilde{C}_{5}}{\mu^{2}\tilde{C}_{4}^{2}}\sum_{k=0}^{K-1}\frac{1}{(k+K_{0})^{2}}\prod_{j=k+1}^{K-1}\frac{j+K_{0}-2}{j+K_{0}}

\leq \frac{(K_{0}-2)(K_{0}-1)}{(K+K_{0}-2)(K+K_{0}-1)}\mathbb{E}[{\cal V}^{0}]+\frac{\tilde{C}_{5}}{\mu^{2}\tilde{C}_{4}^{2}}\sum_{k=0}^{K-1}\frac{k+K_{0}-1}{(k+K_{0})(K+K_{0}-2)(K+K_{0}-1)}

\leq \frac{(K_{0}-1)^{2}}{(K+K_{0}-1)^{2}}\mathbb{E}[{\cal V}^{0}]+\frac{\tilde{C}_{5}K}{\mu^{2}\tilde{C}_{4}^{2}(K+K_{0}-2)^{2}}

= \frac{(K_{0}-1)^{2}}{(K+K_{0}-1)^{2}}(\mathcal{L}(\theta^{0})-\mathcal{L}(\theta^{\star}))+\frac{\tilde{C}_{5}K}{\mu^{2}\tilde{C}_{4}^{2}(K+K_{0}-2)^{2}}

where the third inequality telescopes the products, and the fourth inequality uses $\frac{K_{0}-2}{K+K_{0}-2}\leq\frac{K_{0}-1}{K+K_{0}-1}$ and $\frac{k+K_{0}-1}{k+K_{0}}\leq 1$,

which completes the proof.
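To see the resulting $\mathcal{O}(1/K)$ behavior concretely, one can iterate the recursion (72) with $\alpha_{k}=\frac{1}{\mu(k+K_{0})\tilde{C}_{4}}$ and compare it against the closed-form bound above. The snippet below is a minimal sketch; the constants $\mu$, $\tilde{C}_{4}$, $\tilde{C}_{5}$, $K_{0}$, and $\mathcal{V}^{0}$ are assumed for illustration only.

```python
# Iterate E[V^{k+1}] <= (1 - 2*alpha_k*mu*C4t) E[V^k] + C5t * alpha_k^2
# with alpha_k = 1/(mu*(k+K0)*C4t), and compare to the closed-form bound.
mu, C4t, C5t, K0, V0 = 0.1, 0.05, 2.0, 50, 1.0  # assumed values

V, K = V0, 10**4
for k in range(K):
    a = 1.0 / (mu * (k + K0) * C4t)
    V = (1 - 2 * a * mu * C4t) * V + C5t * a * a

closed = ((K0 - 1) ** 2 / (K + K0 - 1) ** 2 * V0
          + C5t * K / (mu ** 2 * C4t ** 2 * (K + K0 - 2) ** 2))
print(V, closed)  # the iterated value stays below the closed-form bound
```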

13 Additional Numerical Results

13.1 Simulation setup

To verify our analysis and demonstrate the empirical performance of CADA, we conduct experiments on logistic regression and neural network training tasks.

In logistic regression, we tested the covtype and ijcnn1 datasets in the main paper, and MNIST in the supplementary document. In training neural networks, we tested the MNIST dataset in the main paper, and CIFAR10 in the supplementary document. To benchmark CADA, we compared it with several state-of-the-art algorithms, namely ADAM [Kingma and Ba(2014)], stochastic LAG, local momentum [Yu et al.(2019)Yu, Jin, and Yang, Wang et al.(2020b)Wang, Tantia, Ballas, and Rabbat], and FedAdam [Reddi et al.(2020)Reddi, Charles, Zaheer, Garrett, Rush, Konečnỳ, Kumar, and McMahan].

All experiments are run on a workstation with an Intel i9-9960x CPU with 128GB memory and four NVIDIA RTX 2080Ti GPUs, each with 11GB memory, using Python 3.6.

13.2 Simulation details

13.2.1 Logistic regression.

Objective function. For the logistic regression task, we use either the logistic loss in the binary case or the cross-entropy loss in the multi-class case, both augmented with an $\ell_{2}$-norm regularizer with coefficient $\lambda=10^{-5}$.
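For concreteness, a minimal sketch of this regularized objective in PyTorch is given below; the linear parameterization and the exact placement of $\lambda$ are our assumptions, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

LAMBDA = 1e-5  # l2-regularization coefficient from the text

def regularized_loss(theta, X, y):
    """Multi-class logistic (cross-entropy) loss with an l2 regularizer.

    theta: (p, num_classes) weight matrix; X: (n, p) features; y: (n,) labels.
    A sketch only; the paper's exact parameterization may differ.
    """
    logits = X @ theta
    return F.cross_entropy(logits, y) + LAMBDA * theta.pow(2).sum()
```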

Data pre-processing. The ijcnn1 and covtype datasets are imported from the popular library LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) without further preprocessing. For MNIST, we normalize the data and subtract the mean. We uniformly partition the ijcnn1 dataset (91,701 samples) and the MNIST dataset (60,000 samples) across $M=10$ workers. To simulate the heterogeneous setting, we randomly partition the covtype dataset (581,012 samples) across $M=20$ workers with a different number of samples per worker.

For covtype, we fix the batch ratio to 0.001 uniformly across all workers; for ijcnn1 and MNIST, we fix it to 0.01 uniformly across all workers.
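A minimal sketch of such a data partition follows; this is our own illustration, since the authors' exact splitting script is not specified.

```python
import numpy as np

def partition(n_samples, M, heterogeneous=False, seed=0):
    """Split sample indices across M workers.

    Uniform split (ijcnn1, MNIST) or a random unequal split (covtype);
    a sketch of the setup described above, not the authors' exact script.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    if not heterogeneous:
        return np.array_split(idx, M)
    # Random cut points so workers receive different numbers of samples.
    cuts = np.sort(rng.choice(np.arange(1, n_samples), size=M - 1,
                              replace=False))
    return np.split(idx, cuts)

shards = partition(581012, M=20, heterogeneous=True)
print([len(s) for s in shards])
```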

Choice of hyperparameters. For the logistic regression task, the hyperparameters of each algorithm are hand-tuned to roughly optimize its training loss. The parameter values used in each test are listed in Tables 1-2.

Algorithm | stepsize $\alpha$ | momentum weight $\beta$ | averaging interval $H/D$
FedAdam | $\alpha_{l}=100$, $\alpha_{s}=0.02$ | $0.9$ | $H=10$
Local momentum | $0.1$ | $0.9$ | $H=10$
ADAM | $0.005$ | $\beta_{1}=0.9$, $\beta_{2}=0.999$ | /
CADA1&2 | $0.005$ | $\beta_{1}=0.9$, $\beta_{2}=0.999$ | $D=100$, $d_{\rm max}=10$
Stochastic LAG | $0.1$ | / | $d_{\rm max}=10$

Table 1: Choice of parameters in covtype.
Algorithm | stepsize $\alpha$ | momentum weight $\beta$ | averaging interval $H/D$
FedAdam | $\alpha_{l}=100$, $\alpha_{s}=0.03$ | $0.9$ | $H=10$
Local momentum | $0.1$ | $0.9$ | $H=20$
ADAM | $0.01$ | $\beta_{1}=0.9$, $\beta_{2}=0.999$ | /
CADA | $0.01$ | $\beta_{1}=0.9$, $\beta_{2}=0.999$ | $D=100$, $d_{\rm max}=10$
Stochastic LAG | $0.1$ | / | $d_{\rm max}=10$

Table 2: Choice of parameters in ijcnn1.

13.2.2 Training neural networks.

For training neural networks, we use the cross-entropy loss, but with different network models for the two datasets.

Neural network models. For the MNIST dataset, we use a convolutional neural network with two convolution-ELU-maxpooling layers (ELU is a smoothed ReLU) followed by two fully-connected layers. The first convolution layer is $5\times 5\times 20$ with padding, and the second layer is $5\times 5\times 50$ with padding. The output of the second layer is followed by two fully-connected layers, one of size $800\times 500$ and the other of size $500\times 10$. The output goes through a softmax function. For the CIFAR10 dataset, we use the popular neural network architecture ResNet20 (https://github.com/akamaster/pytorch_resnet_cifar10), which has 20 layers and roughly 0.27 million parameters. We do not use a pre-trained model.
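A sketch of this MNIST model in PyTorch is given below. Note that an $800$-dimensional input to the first fully-connected layer ($800=4\cdot 4\cdot 50$) is consistent with unpadded $5\times 5$ convolutions on $28\times 28$ inputs, so the sketch uses valid convolutions; this, and the activation between the fully-connected layers, are our assumptions.

```python
import torch.nn as nn

# Sketch of the described MNIST model. Shapes assume 28x28 grayscale input:
# 28 -> 24 (conv 5x5) -> 12 (pool) -> 8 (conv 5x5) -> 4 (pool), 4*4*50 = 800.
model = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5),   # -> 24x24x20
    nn.ELU(),
    nn.MaxPool2d(2),                   # -> 12x12x20
    nn.Conv2d(20, 50, kernel_size=5),  # -> 8x8x50
    nn.ELU(),
    nn.MaxPool2d(2),                   # -> 4x4x50
    nn.Flatten(),                      # -> 800
    nn.Linear(800, 500),
    nn.ELU(),                          # assumed activation between FC layers
    nn.Linear(500, 10),                # softmax is applied inside the loss
)
```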

Data pre-processing. We uniformly partition the MNIST and CIFAR10 datasets across $M=10$ workers. For MNIST, we use the raw data without preprocessing; the minibatch size per worker is 12. For CIFAR10, in addition to normalizing the data and subtracting the mean, we randomly flip and crop part of the original image every time it is used for training, a standard data-augmentation technique to avoid over-fitting. The minibatch size for CIFAR10 is 50 per worker.
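A typical realization of this augmentation pipeline with torchvision is sketched below; the crop padding and the normalization statistics are commonly used CIFAR10 values and are our assumptions, not taken from the paper.

```python
from torchvision import transforms

# Random crop and horizontal flip, then normalization, as described above.
cifar_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),        # assumed crop/padding sizes
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # standard CIFAR10 mean
                         (0.2470, 0.2435, 0.2616)),  # and std (assumed)
])
```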

Choice of hyperparameters. For the MNIST dataset, which is relatively easy, the hyperparameters of each algorithm are hand-tuned to optimize its performance. The parameter values used in each test are listed in Table 3.

Algorithm | stepsize $\alpha$ | momentum weight $\beta$ | averaging interval $H/D$
FedAdam | $\alpha_{l}=0.1$, $\alpha_{s}=0.001$ | $0.9$ | $H=8$
Local momentum | $0.001$ | $0.9$ | $H=8$
ADAM | $0.0005$ | $\beta_{1}=0.9$, $\beta_{2}=0.999$ | /
CADA1&2 | $0.0005$ | $\beta_{1}=0.9$, $\beta_{2}=0.999$ | $D=50$, $d_{\rm max}=10$
Stochastic LAG | $0.1$ | / | $d_{\rm max}=10$

Table 3: Choice of parameters in multi-class MNIST.

For the CIFAR10 dataset, we search for the best hyperparameter values from the following per-algorithm grids, optimizing the testing accuracy versus the number of communication rounds; a minimal code sketch of this search is given after the lists below. The chosen parameter values are listed in Table 4.

FedAdam: $\alpha_{s}\in\{0.1,0.01,0.001\}$; $\alpha_{l}\in\{1,0.5,0.1\}$; $H\in\{1,4,6,8,16\}$.

Local momentum: $\alpha\in\{0.1,0.01,0.001\}$; $H\in\{1,4,6,8,16\}$.

CADA1: $\alpha\in\{0.1,0.01,0.001\}$; $c\in\{0.05,0.1,0.3,0.6,0.9,1.2,1.5,1.8\}$.

CADA2: $\alpha\in\{0.1,0.01,0.001\}$; $c\in\{0.05,0.1,0.3,0.6,0.9,1.2,1.5,1.8\}$.

LAG: $\alpha\in\{0.1,0.01,0.001\}$; $c\in\{0.05,0.1,0.3,0.6,0.9,1.2,1.5,1.8\}$.
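The sketch below illustrates such a per-algorithm grid search; `train_and_eval` is a hypothetical routine, assumed to return the testing accuracy after a fixed budget of communication rounds, and only two of the grids are shown.

```python
from itertools import product

# Search grids for two of the algorithms above (hypothetical illustration).
grids = {
    "FedAdam": {"alpha_s": [0.1, 0.01, 0.001], "alpha_l": [1, 0.5, 0.1],
                "H": [1, 4, 6, 8, 16]},
    "CADA1":   {"alpha": [0.1, 0.01, 0.001],
                "c": [0.05, 0.1, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8]},
}

def grid_search(algo, train_and_eval):
    """Return the configuration maximizing test accuracy for a fixed
    communication budget; train_and_eval(algo, config) is hypothetical."""
    grid = grids[algo]
    keys = list(grid)
    best = max(product(*grid.values()),
               key=lambda vals: train_and_eval(algo, dict(zip(keys, vals))))
    return dict(zip(keys, best))
```

Selecting on accuracy versus communication rounds, rather than wall-clock time, matches the communication-limited setting studied here.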

Figure 6: Performance of FedAdam and local momentum on MNIST under different $H$. (Two image panels not shown.)
Figure 7: Performance of FedAdam and local momentum on CIFAR10 under different $H$. (Two image panels not shown.)
Algorithm | stepsize $\alpha$ | momentum weight $\beta$ | averaging interval $H/D$
FedAdam | $\alpha_{l}=0.1$, $\alpha_{s}=0.1$ | $0.9$ | $H=8$
Local momentum | $0.1$ | $0.9$ | $H=8$
CADA1 | $0.1$ | $\beta_{1}=0.9$, $\beta_{2}=0.99$ | $D=50$, $d_{\rm max}=2$
CADA2 | $0.1$ | $\beta_{1}=0.9$, $\beta_{2}=0.99$ | $D=50$, $d_{\rm max}=2$
Stochastic LAG | $0.1$ | / | $d_{\rm max}=2$

Table 4: Choice of parameters in CIFAR10.

Additional results. In addition to the results presented in the main paper, we report a new set of simulations on the performance of local-update-based algorithms under different averaging intervals $H$. Since the algorithms under $H=4,6$ do not perform as well as under $H=8$, we only plot $H=1,8,16$ in Figures 6 and 7 to ease the comparison. Figure 6 compares the performance of FedAdam and local momentum on the MNIST dataset under different averaging intervals $H$, and Figure 7 does the same on the CIFAR10 dataset.

In Figure 7, FedAdam and local momentum under a larger averaging interval $H$ converge faster at the initial stage, but reach slightly lower testing accuracy. This reduced test accuracy is common among local SGD-type methods and has also been studied theoretically; see, e.g., [Haddadpour et al.(2019)Haddadpour, Kamani, Mahdavi, and Cadambe].