
FeDXL: Provable Federated Learning for Deep X-Risk Optimization

Zhishuai Guo    Rong Jin    Jiebo Luo    Tianbao Yang
Abstract

In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of X-risks, to which no existing FL algorithms are applicable. In particular, the objective has the form of $\mathbb{E}_{\mathbf{z}\sim\mathcal{S}_{1}}f(\mathbb{E}_{\mathbf{z}'\sim\mathcal{S}_{2}}\ell(\mathbf{w};\mathbf{z},\mathbf{z}'))$, where two sets of data $\mathcal{S}_{1},\mathcal{S}_{2}$ are distributed over multiple machines and $\ell(\cdot;\cdot,\cdot)$ is a pairwise loss that depends only on the prediction outputs of the input data pair $(\mathbf{z},\mathbf{z}')$. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss and partial AUROC maximization with a compositional loss. The challenges in designing an FL algorithm for X-risks lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. To this end, we propose an active-passive decomposition framework that decouples the components of the gradient into two types, namely active parts and passive parts: the active parts depend on local data and are computed with the local model, while the passive parts depend on data on other machines and are communicated/computed based on historical models and samples. Under this framework, we design two FL algorithms (FeDXL) for handling linear and nonlinear $f$, respectively, based on federated averaging and merging, and develop a novel theoretical analysis to combat the latency of the passive parts and the interdependency between the local model parameters and the data involved in computing local gradient estimators. We establish both iteration and communication complexities and show that using historical samples and models for computing the passive parts does not degrade the complexities.
We conduct empirical studies of FeDXL for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.


1 Introduction

This work is motivated by solving the following optimization problem arising in many ML applications in a federated learning (FL) setting:

$$\min_{\mathbf{w}\in\mathbb{R}^{d}}\frac{1}{|\mathcal{S}_{1}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}}f\bigg(\underbrace{\frac{1}{|\mathcal{S}_{2}|}\sum_{\mathbf{z}'\in\mathcal{S}_{2}}\ell(\mathbf{w},\mathbf{z},\mathbf{z}')}_{g(\mathbf{w},\mathbf{z},\mathcal{S}_{2})}\bigg), \qquad (1)$$

where $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ denote two sets of data points that are distributed over many machines, $\mathbf{w}$ denotes the model of a prediction function $h(\mathbf{w},\cdot)\in\mathbb{R}^{d_{o}}$, $f(\cdot)$ is a deterministic function that could be linear or non-linear (possibly non-convex), and $\ell(\mathbf{w},\mathbf{z},\mathbf{z}')=\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}'))$ denotes a pairwise loss that depends only on the prediction outputs of the input data $\mathbf{z},\mathbf{z}'$. The above problem belongs to a broader family of machine learning problems called deep X-risk optimization (DXO) (Yang, 2022). We provide details of some X-risk minimization applications in Appendix B.
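To make the two-level structure of objective (1) concrete, below is a minimal NumPy sketch, assuming a toy linear predictor $h(\mathbf{w},\mathbf{z})=\mathbf{w}^{\top}\mathbf{z}$ and a squared pairwise loss; these concrete choices (and all names) are ours for illustration, not from the paper.

```python
import numpy as np

def pairwise_loss(h_z, h_zp):
    # squared pairwise margin loss; depends only on the two prediction outputs
    return (1.0 - (h_z - h_zp)) ** 2

def x_risk(w, S1, S2, f=lambda g: g):
    """Objective (1): mean over z in S1 of f( mean over z' in S2 of ell )."""
    scores1 = S1 @ w          # h(w, z) for every z in S1
    scores2 = S2 @ w          # h(w, z') for every z' in S2
    # g(w, z, S2): average pairwise loss of each z against all of S2
    g = pairwise_loss(scores1[:, None], scores2[None, :]).mean(axis=1)
    return f(g).mean()

rng = np.random.default_rng(0)
S1, S2 = rng.normal(size=(5, 3)), rng.normal(size=(8, 3))
w = rng.normal(size=3)
risk_linear = x_risk(w, S1, S2)            # linear f (identity)
risk_nonlin = x_risk(w, S1, S2, f=np.exp)  # a nonlinear f, e.g. exp
```

With `f` the identity this is a plain pairwise risk; swapping in a nonlinear `f` gives the compositional form used, e.g., in partial AUC surrogates.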

When $f$ is a linear function, the above problem is the classic pairwise loss minimization problem, which has applications in AUROC (AUC) maximization (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b; Yang & Ying, 2022), bipartite ranking (Cohen et al., 1997; Clémençon et al., 2008; Kotlowski et al., 2011; Dembczynski et al., 2012), and distance metric learning (Radenović et al., 2016; Wu et al., 2017). When $f$ is a non-linear function, the above problem is a special case of finite-sum coupled compositional optimization (Wang & Yang, 2022), which has found applications in various performance measure optimization tasks such as partial AUC maximization (Zhu et al., 2022), average precision maximization (Qi et al., 2021; Wang et al., 2022), NDCG maximization (Qiu et al., 2022), p-norm push optimization (Rudin, 2009; Wang & Yang, 2022) and contrastive loss optimization (Goldberger et al., 2004; Yuan et al., 2022).

This is in sharp contrast with most existing studies on FL algorithms (Yang, 2013; Konečnỳ et al., 2016; McMahan et al., 2017; Kairouz et al., 2021; Smith et al., 2018; Stich, 2018; Yu et al., 2019a, b; Khaled et al., 2020; Woodworth et al., 2020b, a; Karimireddy et al., 2020b; Haddadpour et al., 2019), which focus on the following empirical risk minimization (ERM) problem with the data set $\mathcal{S}$ distributed over different machines:

$$\min_{\mathbf{w}\in\mathbb{R}^{d}}\frac{1}{|\mathcal{S}|}\sum_{\mathbf{z}\in\mathcal{S}}\ell(\mathbf{w},\mathbf{z}). \qquad (2)$$

The major differences between DXO and ERM are that (i) the ERM objective is decomposable over training data, while the DXO objective is not; and (ii) the data-dependent losses in ERM are decoupled between different data points, whereas the data-dependent loss in DXO couples different training data points. These differences pose a big challenge for DXO in the FL setting, where the training data are distributed over different machines and cannot be moved to a central server. In particular, the gradient of an X-risk cannot be written as a sum of local gradients at individual machines that depend only on the local data of those machines. Instead, the gradient of DXO at each machine depends not only on local data but also on data in other machines. As a result, the design of communication-efficient FL algorithms for DXO is much more complicated than that for ERM. In addition, the presence of a non-linear function $f$ makes the algorithm design and analysis even more challenging than with a linear $f$. There are two levels of coupling in DXO with nonlinear $f$: one level at the pairwise loss $\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}'))$ and another level at the non-linear risk $f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}))$, which makes estimating the stochastic gradient trickier.
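The non-decomposability in point (i) can be seen numerically. In the toy sketch below (a linear scoring model and squared pairwise loss, chosen by us for illustration), the global X-risk over all pairs between two machines' data differs from the average of risks computed from local pairs only, because cross-machine pairs are missed entirely:

```python
import numpy as np

def pair_risk(w, A, B):
    # average pairwise squared loss over all (z, z') pairs between A and B
    sa, sb = A @ w, B @ w
    return ((1.0 - (sa[:, None] - sb[None, :])) ** 2).mean()

rng = np.random.default_rng(1)
w = rng.normal(size=4)
# the two data sets, each split across two machines
S1_parts = [rng.normal(size=(3, 4)), rng.normal(size=(4, 4))]
S2_parts = [rng.normal(size=(5, 4)), rng.normal(size=(2, 4))]

global_risk = pair_risk(w, np.vstack(S1_parts), np.vstack(S2_parts))
local_only = np.mean([pair_risk(w, A, B) for A, B in zip(S1_parts, S2_parts)])
gap = abs(global_risk - local_only)  # nonzero: cross-machine pairs are missed
```

Since the objective itself is not a sum of purely local terms, neither is its gradient, which is exactly why standard local-SGD-style FL updates do not apply.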

Although DXO can be solved by existing algorithms in a centralized learning setting (Hu et al., 2020; Wang & Yang, 2022), extending these algorithms to the FL setting is non-trivial. This is different from extending centralized algorithms for ERM problems to the FL setting. In the design and analysis of FL algorithms for ERM, the individual machines compute local gradients, update local models, and communicate periodically to average the models. The rationale of local FL algorithms for ERM is that, as long as the gap error between local models and the averaged model is kept on par with the noise in the stochastic gradients by controlling the communication frequency, convergence is not sacrificed and the algorithms enjoy the parallel speed-up of using multiple machines. However, this rationale is not sufficient for developing FL algorithms for DXO due to the challenges mentioned above.

To address these challenges, we propose two novel FL algorithms named FeDXL1 and FeDXL2 for DXO with linear and non-linear $f$, respectively. The main innovation in the algorithm design lies in an active-passive decomposition framework that decouples the gradient of the objective into two types of components: active parts and passive parts. The active parts depend on data in local machines, and the passive parts depend on data in other machines. We estimate the active parts using the local data and the local model, and we estimate the passive parts using information communicated with delay from other machines, computed at historical models in the previous round. In terms of analysis, the challenge is that the model used in the computation of the stochastic gradient estimator depends on the (historical) samples used for computing the passive parts at the current iteration, which is only exacerbated in the presence of a non-linear function $f$. We develop a novel analysis that allows us to transfer the error of the gradient estimator into the latency error of the passive parts and the gap error between local models and the global model. Hence, the rationale is that as long as the latency error of the passive parts and the gap error between local models and the global model are on par with the noise in the stochastic gradient estimator, we are able to achieve convergence and linear speed-up.

Figure 1: Illustration of the proposed Active-Passive Decomposition Framework of FeDXL, which is enabled by Federated Averaging and Merging: the merged prediction outputs from previous rounds are used for computing the passive parts of the stochastic gradient estimator, and the active parts are computed using the local model and local data.

The main contributions of this work are as follows:

  • We propose two novel communication-efficient algorithms, FeDXL1 and FeDXL2, for DXO with linear and nonlinear $f$, respectively, based on federated averaging and merging. Besides communicating local models for federated averaging, the proposed algorithms only need to communicate local prediction outputs periodically for federated merging to enable the computation of the passive parts. The diagram of the proposed FeDXL algorithms is shown in Figure 1.

  • We perform novel technical analysis to prove the convergence of both algorithms. We show that both algorithms enjoy parallel speed-up in terms of the iteration complexity, and a lower-order communication complexity.

  • We conduct empirical studies on two tasks for federated deep partial AUC optimization with a compositional loss and federated deep AUC optimization with a pairwise loss, and demonstrate the advantages of the proposed algorithms over several baselines.

2 Related Work

FL for ERM. The challenge of FL is how to utilize the distributed data to learn an ML model with light communication cost and without harming data privacy (Konečnỳ et al., 2016; McMahan et al., 2017). To reduce the communication cost, many algorithms have been proposed to skip communications (Stich, 2018; Yu et al., 2019a, b; Yang, 2013; Karimireddy et al., 2020b) or compress the communicated statistics (Stich et al., 2018; Basu et al., 2019; Jiang & Agrawal, 2018; Wangni et al., 2018; Bernstein et al., 2018). Tight analyses have been performed in various studies (Kairouz et al., 2021; Yu et al., 2019a, b; Khaled et al., 2020; Woodworth et al., 2020b, a; Karimireddy et al., 2020b; Haddadpour et al., 2019). However, most of these works target ERM.

FL for Non-ERM Problems. In (Guo et al., 2020; Yuan et al., 2021a; Deng & Mahdavi, 2021; Deng et al., 2020; Liu et al., 2020; Sharma et al., 2022), federated minimax optimization algorithms have been studied, which are not applicable to our problem when $f$ is non-convex. Gao et al. (2022) considered a much simpler federated compositional optimization problem of the form $\sum_{k}\mathbb{E}_{\zeta\sim\mathcal{D}_{f}^{k}}f_{k}(\mathbb{E}_{\xi\sim\mathcal{D}_{g}^{k}}g_{k}(\mathbf{w};\xi);\zeta)$, where $k$ denotes the machine index. Compared with the X-risk, their objective does not involve interdependence between different machines. Li et al. (2022); Huang et al. (2022) analyzed FL algorithms for bi-level problems where only the lower-level objective involves distribution over many machines. Tarzanagh et al. (2022) considered another federated bilevel problem, where both the upper- and lower-level objectives are distributed over many machines, but the lower-level objective is not coupled with the data in the upper-level objective. Xing et al. (2022) studied federated bilevel optimization in a server-clients setting, where the central server solves an objective that depends on optimal solutions of local clients. Our problem cannot be mapped into these federated bilevel optimization problems. There are also works that optimize non-ERM problems using local data or data from other machines, but they are mostly ad hoc and lack theoretical guarantees (Han et al., 2022; Zhang et al., 2020; Wu et al., 2022; Li & Huang, 2022).

Table 1: Comparison of the sample complexity on each machine for solving the DXO problem to find an $\epsilon$-stationary point, i.e., $\mathbb{E}[\|\nabla F(\mathbf{w})\|^{2}]\leq\epsilon^{2}$. $n$ is the number of finite-sum components in the outer finite-sum setting, i.e., the number of data points in the outer function. $n_{\text{in}}$ denotes the number of finite-sum components of the inner function $g$ when it has a finite-sum structure. In the federated learning setting, $n_{i}$ denotes the number of components in the outer function on machine $i$.
Method | Sample Complexity | Setting
BSGD (Hu et al., 2020) | $O(1/\epsilon^{6})$ | Inner Expectation + Outer Expectation
BSpiderBoost (Hu et al., 2020) | $O(1/\epsilon^{5})$ | Inner Expectation + Outer Expectation
Centralized: SOX (Wang & Yang, 2022) | $O(n/\epsilon^{4})$ | Inner Expectation + Outer Finite-sum
MSVR (Jiang et al., 2022) | $O(\max(1/\epsilon^{4}, n/\epsilon^{3}))$ | Inner Expectation + Outer Finite-sum
MSVR (Jiang et al., 2022) | $O(n\sqrt{n_{\text{in}}}/\epsilon^{2})$ | Inner Finite-sum + Outer Finite-sum
Federated: This Work | $O(\max_{i}n_{i}/\epsilon^{4})$ | Inner Expectation + Outer Finite-sum

Centralized Algorithms for DXO. In the centralized setting, DXO has been considered in recent works (Qi et al., 2021; Wang et al., 2022; Wang & Yang, 2022; Qiu et al., 2022). In particular, Wang & Yang (2022) proposed a stochastic algorithm named SOX for solving (1) and achieved a state-of-the-art sample complexity of $O(|\mathcal{S}_{1}|/\epsilon^{4})$ to ensure expected convergence to an $\epsilon$-stationary point. Nevertheless, it is non-trivial to extend the centralized algorithms to the FL setting due to the challenges mentioned earlier. Recently, Jiang et al. (2022) proposed an advanced variance-reduction technique named MSVR to improve the sample complexity of solving finite-sum coupled compositional optimization problems. We provide a summary of state-of-the-art sample complexities for solving DXO in both the centralized and FL settings in Table 1.

3 FeDXL for DXO

We assume $\mathcal{S}_{1},\mathcal{S}_{2}$ are split into $N$ non-overlapping subsets that are distributed over $N$ clients (we use clients and machines interchangeably), i.e., $\mathcal{S}_{1}=\mathcal{S}_{1}^{1}\cup\mathcal{S}_{1}^{2}\cup\ldots\cup\mathcal{S}_{1}^{N}$ and $\mathcal{S}_{2}=\mathcal{S}_{2}^{1}\cup\mathcal{S}_{2}^{2}\cup\ldots\cup\mathcal{S}_{2}^{N}$. We denote $\mathbb{E}_{\mathbf{z}\sim\mathcal{S}}=\frac{1}{|\mathcal{S}|}\sum_{\mathbf{z}\in\mathcal{S}}$. Denote by $\nabla_{1}\ell(\cdot,\cdot)$ and $\nabla_{2}\ell(\cdot,\cdot)$ the partial gradients with respect to the first and second arguments, respectively. Without loss of generality, we assume the dimensionality of $h(\mathbf{w},\mathbf{z})$ is 1 (i.e., $d_{o}=1$) in the following presentation. Notation used in the algorithms is summarized in Appendix A.

3.1 FeDXL1 for DXO with linear $f$

We consider the following FL objective for DXO:

$$\min_{\mathbf{w}\in\mathbb{R}^{d}}F(\mathbf{w})=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\mathbf{z}'\in\mathcal{S}_{2}^{j}}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}')). \qquad (3)$$

To highlight the challenge and motivate FeDXL, we decompose the gradient of the objective function into:

$$\nabla F(\mathbf{w})=\frac{1}{N}\sum_{i=1}^{N}\underbrace{\mathbb{E}_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\mathbf{z}'\in\mathcal{S}_{2}^{j}}\nabla_{1}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}'))\nabla h(\mathbf{w},\mathbf{z})}_{\Delta_{i1}}+\frac{1}{N}\sum_{i=1}^{N}\underbrace{\mathbb{E}_{\mathbf{z}'\in\mathcal{S}_{2}^{i}}\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\mathbf{z}\in\mathcal{S}_{1}^{j}}\nabla_{2}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}'))\nabla h(\mathbf{w},\mathbf{z}')}_{\Delta_{i2}}.$$

Let $\nabla F_{i}(\mathbf{w}):=\Delta_{i1}+\Delta_{i2}$. Then $\nabla F(\mathbf{w})=\frac{1}{N}\sum_{i=1}^{N}\nabla F_{i}(\mathbf{w})$.

With the above decomposition, we can see that the main task at local client $i$ is to estimate the gradient terms $\Delta_{i1}$ and $\Delta_{i2}$. Due to the symmetry between $\Delta_{i1}$ and $\Delta_{i2}$, below we use only $\Delta_{i1}$ as an illustration for explaining the proposed algorithm. The difficulty in computing $\Delta_{i1}$ lies in its reliance on data in other machines due to the presence of $\mathbb{E}_{\mathbf{z}'\in\mathcal{S}_{2}^{j}}$ for all $j$. To overcome this difficulty, we decouple the data-dependent factors in $\Delta_{i1}$ into two types, local and global, as labeled below:

$$\underbrace{\mathbb{E}_{\mathbf{z}\in\mathcal{S}_{1}^{i}}}_{\text{local1}}\;\underbrace{\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\mathbf{z}'\in\mathcal{S}_{2}^{j}}}_{\text{global1}}\;\nabla_{1}\ell(\underbrace{h(\mathbf{w},\mathbf{z})}_{\text{local2}},\underbrace{h(\mathbf{w},\mathbf{z}')}_{\text{global2}})\underbrace{\nabla h(\mathbf{w},\mathbf{z})}_{\text{local3}}. \qquad (4)$$

It is notable that the three local terms can be estimated or computed based on the local data. In particular, local1 can be estimated by sampling data from $\mathcal{S}_{1}^{i}$, and local2 and local3 can be computed based on the sampled data $\mathbf{z}$ and the local model parameter. The difficulty springs from estimating and computing the two global terms, which depend on data on all machines. We would like to avoid communicating $h(\mathbf{w};\mathbf{z}')$ at every iteration for estimating the global terms, as each communication would incur additional overhead. To tackle this, we propose to leverage the historical information computed in the previous round (a round is defined as a sequence of local updates between two consecutive communications). To put this into the context of optimization, we consider the update at the $k$-th iteration of the $r$-th round, where $k=0,\ldots,K-1$. Let $\mathbf{w}^{r}_{i,k}$ denote the local model on the $i$-th client at the $k$-th iteration of the $r$-th round. Let $\mathbf{z}^{r}_{i,k,1}\in\mathcal{S}_{1}^{i}$ and $\mathbf{z}^{r}_{i,k,2}\in\mathcal{S}_{2}^{i}$ denote the data sampled at the $k$-th iteration from $\mathcal{S}_{1}^{i}$ and $\mathcal{S}_{2}^{i}$, respectively. Each local machine computes $h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1})$ and $h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})$, which are used for computing the active parts. Across all iterations $k=0,\ldots,K-1$, we accumulate the computed prediction outputs over sampled data and store them in two sets $\mathcal{H}^{r}_{i,1}=\{h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}),k=0,\ldots,K-1\}$ and $\mathcal{H}^{r}_{i,2}=\{h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2}),k=0,\ldots,K-1\}$.
At the end of round $r$, each client communicates $\mathbf{w}^{r}_{i,K}$, $\mathcal{H}^{r}_{i,1}$, and $\mathcal{H}^{r}_{i,2}$ to the central server, which averages the local models to obtain a global model $\bar{\mathbf{w}}^{r}$ and also merges $\mathcal{H}^{r}_{1}=\mathcal{H}^{r}_{1,1}\cup\mathcal{H}^{r}_{2,1}\cup\ldots\cup\mathcal{H}^{r}_{N,1}$ and $\mathcal{H}^{r}_{2}=\mathcal{H}^{r}_{1,2}\cup\mathcal{H}^{r}_{2,2}\cup\ldots\cup\mathcal{H}^{r}_{N,2}$. This merged information is broadcast to each individual client. Then, at the $k$-th iteration of the $r$-th round, we estimate the global terms by sampling $h^{r-1}_{2,\xi}\in\mathcal{H}^{r-1}_{2}$ without replacement and compute an estimator of $\Delta_{i1}$ by

$$G^{r}_{i,k,1}=\nabla_{1}\ell(\underbrace{h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1})}_{\text{active}},\underbrace{h^{r-1}_{2,\xi}}_{\text{passive}})\underbrace{\nabla h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1})}_{\text{active}}, \qquad (5)$$

where $\xi=(j,t,\mathbf{z}^{r-1}_{j,t,2})$ represents a random variable that captures the randomness in the sampled client $j\in\{1,\ldots,N\}$, iteration index $t\in\{0,\ldots,K-1\}$, and data sample $\mathbf{z}^{r-1}_{j,t,2}\in\mathcal{S}_{2}^{j}$, which is used for estimating global1 in (4). We refer to the factors labeled active in $G^{r}_{i,k,1}$ as the active parts and the factor labeled passive as the passive part. Similarly, we estimate $\Delta_{i2}$ by $G^{r}_{i,k,2}$:

$$G^{r}_{i,k,2}=\nabla_{2}\ell(\underbrace{h^{r-1}_{1,\zeta}}_{\text{passive}},\underbrace{h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})}_{\text{active}})\underbrace{\nabla h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})}_{\text{active}}, \qquad (6)$$

where $h^{r-1}_{1,\zeta}\in\mathcal{H}^{r-1}_{1}$ is a prediction output randomly sampled from the previous round, with $\zeta=(j',t',\mathbf{z}^{r-1}_{j',t',1})$ a random variable comprising a sampled client $j'$, a sampled iteration $t'$, and the data sample $\mathbf{z}^{r-1}_{j',t',1}$. We then update the local model parameter $\mathbf{w}^{r}_{i,k}$ using the gradient estimator $G^{r}_{i,k,1}+G^{r}_{i,k,2}$.
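The estimators (5)-(6) can be sketched in a few lines for a toy linear model $h(\mathbf{w},\mathbf{z})=\mathbf{w}^{\top}\mathbf{z}$ (so $\nabla h(\mathbf{w},\mathbf{z})=\mathbf{z}$) and squared pairwise loss $\ell(a,b)=(1-(a-b))^{2}$. Here `h_passive_2` and `h_passive_1` play the roles of the stale predictions $h^{r-1}_{2,\xi}$ and $h^{r-1}_{1,\zeta}$ received from the server; the model and loss choices are ours for illustration only.

```python
import numpy as np

def d1_ell(a, b):   # partial derivative of ell(a, b) w.r.t. the first argument
    return -2.0 * (1.0 - (a - b))

def d2_ell(a, b):   # partial derivative of ell(a, b) w.r.t. the second argument
    return 2.0 * (1.0 - (a - b))

def fedxl1_grad_estimator(w, z1, z2, h_passive_2, h_passive_1):
    h1 = w @ z1      # active: local model on local sample from S_1^i
    h2 = w @ z2      # active: local model on local sample from S_2^i
    G1 = d1_ell(h1, h_passive_2) * z1   # estimator (5): passive second argument
    G2 = d2_ell(h_passive_1, h2) * z2   # estimator (6): passive first argument
    return G1 + G2

rng = np.random.default_rng(2)
w = rng.normal(size=3)
z1, z2 = rng.normal(size=3), rng.normal(size=3)
g = fedxl1_grad_estimator(w, z1, z2, h_passive_2=0.3, h_passive_1=-0.1)
w_new = w - 0.05 * g   # local SGD-style update with the combined estimator
```

Note that only the scalar predictions are ever needed from other machines, not their raw data or models, which is what keeps the extra communication small.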

We present the detailed steps of the proposed algorithm FeDXL1 in Algorithm 1. Several remarks follow: (i) at every round, the algorithm needs to communicate both the model parameters $\mathbf{w}^{r}_{i,K}$ and the historical prediction outputs $\mathcal{H}^{r-1}_{i,1}$ and $\mathcal{H}^{r-1}_{i,2}$, where $\mathcal{H}^{r-1}_{i,*}$ is constructed by collecting all or a subsample of the predictions computed in the $(r-1)$-th round. The bottom line for constructing $\mathcal{H}^{r-1}_{i,*}$ is to ensure that $\mathcal{H}^{r-1}_{*}$ contains at least $K$ independently sampled predictions from the previous round on all machines, such that the corresponding data samples involved in $\mathcal{H}^{r-1}_{*}$ can be used to approximate $\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\mathbf{z}\in\mathcal{S}^{i}_{*}}$ $K$ times. Hence, to keep the communication costs minimal, each client needs to sample at least $O(\lceil K/N\rceil)$ predictions from iterations $k=0,1,\ldots,K-1$ and send them to the server for constructing $\mathcal{H}^{r-1}_{*}$, which is then broadcast to all clients for computing the passive parts in round $r$. As a result, the minimal communication cost per round per client is $O(d+Kd_{o}/N)$. Nevertheless, for simplicity, in Algorithm 1 we simply put all historical predictions into $\mathcal{H}^{r-1}_{i,*}$.

Similar to other FL algorithms, FeDXL1 does not require communicating the raw input data and hence protects data privacy. However, compared with most FL algorithms for ERM, FeDXL1 incurs an additional communication overhead of at least $O(d_{o}K/N)$, which depends on the dimensionality $d_{o}$ of the prediction output. For learning a high-dimensional model (e.g., a deep neural network with $d\gg 1$) with score-based pairwise losses ($d_{o}=1$), the additional communication cost $O(K/N)$ could be marginal. For updating the buffers $\mathcal{B}_{i,1}$ and $\mathcal{B}_{i,2}$, we can simply flush the history and add the newly received $\mathcal{R}^{r-1}_{i,1}$ and $\mathcal{R}^{r-1}_{i,2}$, with random shuffling, to $\mathcal{B}_{i,1}$ and $\mathcal{B}_{i,2}$, respectively.
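The flush-and-shuffle buffer update just described can be sketched as follows; the class name and method names are our own and purely illustrative.

```python
import random

class PredictionBuffer:
    """Holds stale predictions received from the server for passive parts."""

    def __init__(self):
        self._items = []

    def refill(self, received, seed=None):
        # flush the old history, then store the newly received merged
        # predictions R^{r-1} in randomly shuffled order
        self._items = list(received)
        random.Random(seed).shuffle(self._items)

    def take_next(self):
        # consume one passive prediction per local iteration
        return self._items.pop()

buf = PredictionBuffer()
buf.refill([0.1, -0.4, 0.7, 0.2], seed=0)
h_passive = buf.take_next()   # one stale prediction for the next update
```

Shuffling before consumption is what makes each passive part behave like a without-replacement sample from the merged previous-round predictions.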

For analysis, we make the following assumptions regarding the DXO problem with linear $f$, i.e., problem (3).

Assumption 3.1.
  • $\ell(\cdot)$ is differentiable, $L_{\ell}$-smooth and $C_{\ell}$-Lipschitz.

  • $h(\cdot,\mathbf{z})$ is differentiable, $L_{h}$-smooth and $C_{h}$-Lipschitz in $\mathbf{w}$ for any $\mathbf{z}\in\mathcal{S}_{1}\cup\mathcal{S}_{2}$.

  • $\mathbb{E}_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}_{j\in[1:N]}\mathbb{E}_{\mathbf{z}'\in\mathcal{S}_{2}^{j}}\|\nabla_{1}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}'))\nabla h(\mathbf{w},\mathbf{z})+\nabla_{2}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}'))\nabla h(\mathbf{w},\mathbf{z}')-\nabla F_{i}(\mathbf{w})\|^{2}\leq\sigma^{2}$.

  • $\exists D$ such that $\|\nabla F_{i}(\mathbf{w})-\nabla F(\mathbf{w})\|^{2}\leq D^{2},\ \forall i$.

Algorithm 1 FeDXL1: FL for DXO with linear $f$
1:  On Client $i$: Require parameters $\eta, K$
2:  Initialize model $\mathbf{w}_{i,K}^{0}$ and initialize buffers $\mathcal{B}_{i,1},\mathcal{B}_{i,2}=\emptyset$
3:  Sample $K$ points from $\mathcal{S}_{1}^{i}$, compute their predictions using model $\mathbf{w}_{i,K}^{0}$, denoted by $\mathcal{H}^{0}_{i,1}$
4:  Sample $K$ points from $\mathcal{S}_{2}^{i}$, compute their predictions using model $\mathbf{w}_{i,K}^{0}$, denoted by $\mathcal{H}^{0}_{i,2}$
5:  for $r=1,\ldots,R$ do
6:     Send $\mathbf{w}^{r-1}_{i,K}$ to the server
7:     Receive $\bar{\mathbf{w}}^{r}$ from the server and set $\mathbf{w}^{r}_{i,0}=\bar{\mathbf{w}}^{r}$
8:     Send $\mathcal{H}^{r-1}_{i,1},\mathcal{H}^{r-1}_{i,2}$ to the server
9:     Receive $\mathcal{R}^{r-1}_{i,1},\mathcal{R}^{r-1}_{i,2}$ from the server
10:     Update buffers $\mathcal{B}_{i,1},\mathcal{B}_{i,2}$ using $\mathcal{R}^{r-1}_{i,1},\mathcal{R}^{r-1}_{i,2}$ with shuffling $\diamond$ see text for updating the buffer
11:     Set $\mathcal{H}^{r}_{i,1}=\emptyset$, $\mathcal{H}^{r}_{i,2}=\emptyset$
12:     for $k=0,\ldots,K-1$ do
13:        Sample $\mathbf{z}^{r}_{i,k,1}$ from $\mathcal{S}^{i}_{1}$, sample $\mathbf{z}^{r}_{i,k,2}$ from $\mathcal{S}^{i}_{2}$ $\diamond$ or sample two mini-batches of data
14:        Take the next $h^{r-1}_{2,\xi}$ from $\mathcal{B}_{i,2}$ and the next $h^{r-1}_{1,\zeta}$ from $\mathcal{B}_{i,1}$
15:        Compute $h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1})$ and $h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})$
16:        Add $h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1})$ into $\mathcal{H}^{r}_{i,1}$ and add $h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})$ into $\mathcal{H}^{r}_{i,2}$
17:        Compute $G^{r}_{i,k,1}$ and $G^{r}_{i,k,2}$ according to (5) and (6)
18:        $\mathbf{w}^{r}_{i,k+1}=\mathbf{w}^{r}_{i,k}-\eta(G^{r}_{i,k,1}+G^{r}_{i,k,2})$
19:     end for
20:  end for
21:  On Server
22:  for $r=1,\ldots,R$ do
23:     Receive $\mathbf{w}^{r-1}_{i,K}$ from clients $i\in[N]$, compute $\bar{\mathbf{w}}^{r}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}^{r-1}_{i,K}$ and broadcast it to all clients
24:     Collect $\mathcal{H}^{r-1}_{1}=\mathcal{H}^{r-1}_{1,1}\cup\mathcal{H}^{r-1}_{2,1}\cup\ldots\cup\mathcal{H}^{r-1}_{N,1}$ and $\mathcal{H}^{r-1}_{2}=\mathcal{H}^{r-1}_{1,2}\cup\mathcal{H}^{r-1}_{2,2}\cup\ldots\cup\mathcal{H}^{r-1}_{N,2}$
25:     Set $\mathcal{R}^{r-1}_{i,1}=\mathcal{H}^{r-1}_{1},\ \mathcal{R}^{r-1}_{i,2}=\mathcal{H}^{r-1}_{2}$
26:     Send $\mathcal{R}^{r-1}_{i,1},\mathcal{R}^{r-1}_{i,2}$ to client $i$ for all $i\in[N]$
27:  end for

The first three assumptions are standard in the optimization of DXO problems (Wang & Yang, 2022). The last assumption bounds the data heterogeneity, which is also common in federated learning (Yu et al., 2019a; Karimireddy et al., 2020b). Next, we present the theoretical results on the convergence of FeDXL1.

Theorem 3.2.

Under Assumption 3.1, by setting $\eta=O(\frac{N}{R^{2/3}})$ and $K=O(\frac{R^{1/3}}{N})$, Algorithm 1 ensures that

$$\mathbb{E}\bigg[\frac{1}{R}\sum_{r=1}^{R}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\bigg]\leq O\bigg(\frac{1}{R^{2/3}}\bigg). \qquad (7)$$

Remark. To get $\mathbb{E}[\frac{1}{R}\sum_{r=1}^{R}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}]\leq\epsilon^{2}$, we just need to set $R=O(\frac{1}{\epsilon^{3}})$, $\eta=N\epsilon^{2}$ and $K=\frac{1}{N\epsilon}$. The number of communications $O(\frac{1}{\epsilon^{3}})$ is much smaller than the total number of iterations, i.e., $O(\frac{1}{N\epsilon^{4}})$, as long as $N\leq O(\frac{1}{\epsilon})$. Moreover, the sample complexity on each machine is $\frac{1}{N\epsilon^{4}}$, which is reduced linearly by the number of machines $N$.
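As a sanity check on the arithmetic in the remark (constants dropped, as in the $O(\cdot)$ notation), the total iteration count $R\cdot K$ indeed equals the per-machine sample complexity $1/(N\epsilon^{4})$:

```python
def fedxl1_schedule(eps, N):
    # parameter choices from the remark, with all O(.) constants set to 1
    R = eps ** -3          # number of communication rounds
    eta = N * eps ** 2     # step size
    K = 1.0 / (N * eps)    # local iterations per round
    return R, eta, K

eps, N = 0.01, 10
R, eta, K = fedxl1_schedule(eps, N)
total_iters = R * K        # should equal 1/(N * eps^4), reduced linearly in N
```

Note also that $R\leq R\cdot K$ requires $K\geq 1$, i.e., $N\leq 1/\epsilon$, matching the condition in the remark.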

Novelty of Analysis. As the passive parts are computed on different machines in a previous round, the gradient estimators $G^{r}_{i,k,1}$ and $G^{r}_{i,k,2}$ involve dependency between the local model parameter $\mathbf{w}^{r}_{i,k}$ and the historical data contained in $\xi,\zeta$ used for computing them, which makes the analysis more involved. We need to ensure that using gradient estimators based on them can still yield good results. To this end, we borrow an analysis technique from (Yang et al., 2021b) to decouple the dependence between the current model parameter and the data used for computing the current gradient estimator; in that work, data from the previous iteration are paired with data in the current iteration to compute a gradient of the pairwise loss $\ell(h(\mathbf{w}_{t};\mathbf{z}_{t}),h(\mathbf{w}_{t};\mathbf{z}_{t-1}))$. Nevertheless, in federated DXO, controlling the error brought by the passive parts is more challenging since the delay is much longer and the passive parts are computed on different machines. In our analysis, we replace $\mathbf{w}^{r}_{i,k}$ with $\bar{\mathbf{w}}^{r-1}$ to decouple the dependence between the model parameter $\bar{\mathbf{w}}^{r-1}$ and the historical data $\xi,\zeta$; we then need to control the latency error $\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r}\|^{2}$ and the gap error between machines $\sum_{i}\sum_{k}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}$ such that the complexities are not compromised.

3.2 FeDXL2 for optimizing DXO with nonlinear $f$

With a nonlinear $f$, we consider the following FL problem of DXO minimization:

$$F(\mathbf{w})=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\mathbf{z}\in\mathcal{S}^{i}_{1}}f\bigg(\underbrace{\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\mathbf{z}'\in\mathcal{S}^{j}_{2}}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}'))}_{g(\mathbf{w},\mathbf{z},\mathcal{S}_{2})}\bigg). \qquad (8)$$

We compute the gradient and decompose it into:

F(𝐰)=1Ni=1N(Δi,1+Δi,2),\begin{split}&\nabla F(\mathbf{w})=\frac{1}{N}\sum\limits_{i=1}^{N}(\Delta_{i,1}+\Delta_{i,2}),\end{split} (9)

where

Δi,1=𝔼𝐳𝒮1i1Nj=1N𝔼𝐳𝒮2j[f(g(𝐰,𝐳,𝒮2))1(h(𝐰,𝐳),h(𝐰,𝐳))h(𝐰,𝐳)]Δi,2=𝔼𝐳𝒮2i1Nj=1N𝔼𝐳𝒮1j[f(g(𝐰,𝐳,𝒮2))2(h(𝐰,𝐳),h(𝐰,𝐳))h(𝐰,𝐳)].\begin{split}&\Delta_{i,1}=\mathbb{E}_{\mathbf{z}\in\mathcal{S}^{i}_{1}}\frac{1}{N}\sum\limits_{j=1}^{N}\mathbb{E}_{\mathbf{z}^{\prime}\in\mathcal{S}^{j}_{2}}\bigg{[}\hbox{\pagecolor{green!15}$\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}))$}\cdot\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\nabla_{1}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}^{\prime}))\nabla h(\mathbf{w},\mathbf{z})\bigg{]}\\ &\Delta_{i,2}=\mathbb{E}_{\mathbf{z}^{\prime}\in\mathcal{S}^{i}_{2}}\frac{1}{N}\sum\limits_{j=1}^{N}\mathbb{E}_{\mathbf{z}\in\mathcal{S}^{j}_{1}}\bigg{[}\hbox{\pagecolor{blue!15}$\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}))$}\cdot\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\nabla_{2}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}^{\prime}))\nabla h(\mathbf{w},\mathbf{z}^{\prime})\bigg{]}.\end{split} (10)
Algorithm 2 FeDXL2: Federated Learning for DXO with non-linear ff
  On Client ii: Require parameters η,K\eta,K
  Initialize model 𝐰i,K0\mathbf{w}_{i,K}^{0}, 𝒰i0={u0(𝐳)=0,𝐳𝒮1i}\mathcal{U}_{i}^{0}=\{u^{0}(\mathbf{z})=0,\mathbf{z}\in\mathcal{S}^{i}_{1}\}, Gi,K0=𝟎G^{0}_{i,K}=\mathbf{0}, and buffer i,1,i,2,𝒞i=\mathcal{B}_{i,1},\mathcal{B}_{i,2},\mathcal{C}_{i}=\emptyset
  Sample KK points from S1iS_{1}^{i}, compute their predictions using model 𝐰i,K0\mathbf{w}_{i,K}^{0} denoted by i,10\mathcal{H}^{0}_{i,1}
  Sample KK points from S2iS_{2}^{i}, compute their predictions using model 𝐰i,K0\mathbf{w}_{i,K}^{0} denoted by i,20\mathcal{H}^{0}_{i,2}
  for r=1,,Rr=1,...,R do
  Send 𝐰i,Kr1,Gi,Kr1\mathbf{w}^{r-1}_{i,K},G^{r-1}_{i,K} to the server
  Receive 𝐰¯r,G¯r\bar{\mathbf{w}}^{r},\bar{G}^{r} from the server and set 𝐰i,0r=𝐰¯r,Gi,0r=G¯r\mathbf{w}^{r}_{i,0}=\bar{\mathbf{w}}^{r},G^{r}_{i,0}=\bar{G}^{r}
  Send i,1r1,i,2r1,𝒰ir1\mathcal{H}^{r-1}_{i,1},\mathcal{H}^{r-1}_{i,2},\mathcal{U}_{i}^{r-1} to the server
  Receive i,1r1,i,2r1,𝒫r1\mathcal{R}^{r-1}_{i,1},\mathcal{R}^{r-1}_{i,2},\mathcal{P}^{r-1} from the server
     Update the buffer i,1,i,2,𝒞i\mathcal{B}_{i,1},\mathcal{B}_{i,2},\mathcal{C}_{i} using i,1r1,i,2r1,𝒫r1\mathcal{R}^{r-1}_{i,1},\mathcal{R}^{r-1}_{i,2},\mathcal{P}^{r-1} with shuffling, respectively
     Set i,1r=\mathcal{H}^{r}_{i,1}=\emptyset, i,2r=,𝒰ir=\mathcal{H}^{r}_{i,2}=\emptyset,\mathcal{U}_{i}^{r}=\emptyset
     for k=0,..,K1k=0,..,K-1 do
        Sample 𝐳i,k,1r\mathbf{z}^{r}_{i,k,1} from 𝒮1i\mathcal{S}^{i}_{1}, sample 𝐳i,k,2r\mathbf{z}^{r}_{i,k,2} from 𝒮2i\mathcal{S}^{i}_{2} \diamond or sample two mini-batches of data
        Take next hξr1h^{r-1}_{\xi}, hζr1h^{r-1}_{\zeta} and uζr1u^{r-1}_{\zeta} from i,1\mathcal{B}_{i,1} and i,2\mathcal{B}_{i,2} and 𝒞i\mathcal{C}_{i}, respectively
        Compute h(𝐰i,kr,𝐳i,k,1r)h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}) and h(𝐰i,kr,𝐳i,k,2r)h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})
        Compute h(𝐰i,kr,𝐳^i,k,1r)h(\mathbf{w}^{r}_{i,k},\hat{\mathbf{z}}^{r}_{i,k,1}) and h(𝐰i,kr,𝐳^i,k,2r)h(\mathbf{w}^{r}_{i,k},\hat{\mathbf{z}}^{r}_{i,k,2}) and add them to i,1r,i,2r\mathcal{H}^{r}_{i,1},\mathcal{H}^{r}_{i,2}, respectively
        Compute 𝐮i,kr(𝐳i,k,1r)\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1}) according to (11) and add 𝐮i,kr(𝐳^i,k,1r)\mathbf{u}^{r}_{i,k}(\hat{\mathbf{z}}^{r}_{i,k,1}) to 𝒰ir\mathcal{U}_{i}^{r}
        Compute Gi,k,1rG^{r}_{i,k,1} and Gi,k,2rG^{r}_{i,k,2} according to (12,13)
        Gi,kr=(1β)Gi,k1r+β(Gi,k,1r+Gi,k,2r)G^{r}_{i,k}=(1-\beta)G^{r}_{i,k-1}+\beta(G^{r}_{i,k,1}+G^{r}_{i,k,2})
        𝐰i,k+1r=𝐰i,krηGi,kr\mathbf{w}^{r}_{i,k+1}=\mathbf{w}^{r}_{i,k}-\eta G^{r}_{i,k}
     end for
  end for  
  On Server
  for r=1,,Rr=1,...,R do
  Receive 𝐰i,Kr1\mathbf{w}^{r-1}_{i,K}, Gi,Kr1G^{r-1}_{i,K} from client i[N]i\in[N], compute 𝐰¯r=1Ni=1N𝐰i,Kr1\bar{\mathbf{w}}^{r}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}^{r-1}_{i,K}, G¯r=1Ni=1NGi,Kr1\bar{G}^{r}=\frac{1}{N}\sum_{i=1}^{N}G^{r-1}_{i,K} and broadcast them to all clients.
  Collect r1=1,r12,r1N,r1\mathcal{H}^{r-1}_{*}=\mathcal{H}^{r-1}_{1,*}\cup\mathcal{H}^{r-1}_{2,*}\cup\ldots\cup\mathcal{H}^{r-1}_{N,*}, where =1,2*=1,2, and 𝒰r1=𝒰1r1𝒰2r1𝒰Nr1\mathcal{U}^{r-1}=\mathcal{U}^{r-1}_{1}\cup\mathcal{U}^{r-1}_{2}\cup\ldots\cup\mathcal{U}^{r-1}_{N}
     Set i,1r1=1r1,i,2r1=2r1,𝒫ir1=𝒰r1\mathcal{R}^{r-1}_{i,1}=\mathcal{H}^{r-1}_{1},\mathcal{R}^{r-1}_{i,2}=\mathcal{H}^{r-1}_{2},\mathcal{P}^{r-1}_{i}=\mathcal{U}^{r-1} and send them to Client ii for all i[N]i\in[N]
  end for

Let Fi(𝐰)=Δi,1+Δi,2\nabla F_{i}(\mathbf{w})=\Delta_{i,1}+\Delta_{i,2}. Then we have F(𝐰)=1Ni=1NFi(𝐰)\nabla F(\mathbf{w})=\frac{1}{N}\sum\limits_{i=1}^{N}\nabla F_{i}(\mathbf{w}).
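The identity above can be checked numerically on a toy instance. The sketch below is our illustration only, not the paper's code: we assume a scalar linear model h(𝐰,𝐳)=𝐰⊤𝐳, the pairwise loss ℓ(a,b)=(a−b)², and f(s)=log(1+s), then compare the decomposed gradient in (9)–(10) against a finite-difference gradient of F in (8).

```python
import numpy as np

# Toy numerical check (illustrative assumptions, not the paper's model):
# h(w, z) = w @ z,  l(a, b) = (a - b)^2,  f(s) = log(1 + s).
rng = np.random.default_rng(0)
N, n, d = 3, 4, 5                      # machines, samples per set, dimension
S1 = rng.normal(size=(N, n, d))        # S_1^i on each machine i
S2 = rng.normal(size=(N, n, d))        # S_2^j on each machine j
w = rng.normal(size=d)

h = lambda w, z: w @ z
l = lambda a, b: (a - b) ** 2
f = lambda s: np.log(1.0 + s)
df = lambda s: 1.0 / (1.0 + s)
dl1 = lambda a, b: 2.0 * (a - b)       # d l / d (first argument)
dl2 = lambda a, b: -2.0 * (a - b)      # d l / d (second argument)

def g(w, z):                           # inner average g(w, z, S_2) over all S_2
    return np.mean([l(h(w, z), h(w, zp)) for Sj in S2 for zp in Sj])

def F(w):                              # objective (8)
    return np.mean([f(g(w, z)) for Si in S1 for z in Si])

# sum of Delta_{i,1} (active data z in S_1^i) and Delta_{i,2} (z' in S_2^j),
# averaged as in (9)-(10)
grad = np.zeros(d)
for Si in S1:
    for z in Si:
        for Sj in S2:
            for zp in Sj:
                grad += df(g(w, z)) * dl1(h(w, z), h(w, zp)) * z
for Sj in S2:
    for zp in Sj:
        for Si in S1:
            for z in Si:
                grad += df(g(w, z)) * dl2(h(w, z), h(w, zp)) * zp
grad /= (N * n) ** 2

# compare against a central finite-difference gradient of F
eps = 1e-6
fd = np.array([(F(w + eps * e) - F(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(grad, fd, atol=1e-5))   # True
```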

Compared to (4) for DXO with linear ff, the Δi,1\Delta_{i,1} term above involves an additional factor f(g(𝐰,𝐳,𝒮2))\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2})), which cannot be computed locally as it depends on 𝒮2\mathcal{S}_{2} distributed over all machines. Similarly, the Δi,2\Delta_{i,2} term involves the same non-locally computable factor f(g(𝐰,𝐳,𝒮2))\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2})). To address the challenge of estimating g(𝐰,𝐳,𝒮2)g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}), we leverage a technique similar to that in the centralized setting (Wang & Yang, 2022), tracking it with a moving-average estimator based on random samples. In a centralized setting, one can maintain and update 𝐮(𝐳)\mathbf{u}(\mathbf{z}) for estimating g(𝐰,𝐳,𝒮2)g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}) by

𝐮(𝐳)(1γ)𝐮(𝐳)+γ(h(𝐰,𝐳),h(𝐰,𝐳)),\displaystyle\mathbf{u}(\mathbf{z})\leftarrow(1-\gamma)\mathbf{u}(\mathbf{z})+\gamma\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}^{\prime})),

where 𝐳\mathbf{z}^{\prime} is a random sample from 𝒮2\mathcal{S}_{2}. However, this is not possible in the FL setting since 𝒮2\mathcal{S}_{2} is distributed over many machines. To tackle this, we leverage the delayed communication technique from the last subsection. At the kk-th iteration of the rr-th round, we update 𝐮(𝐳i,k,1r)\mathbf{u}(\mathbf{z}^{r}_{i,k,1}) for a sampled 𝐳i,k,1r\mathbf{z}^{r}_{i,k,1} by

𝐮i,kr(𝐳i,k,1r)=(1γ)𝐮i,kr(𝐳i,k,1r)+γ(h(𝐰i,kr,𝐳i,k,1r),hξ,2r1),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})=(1-\gamma)\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})+\gamma\ell(h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}),h^{r-1}_{\xi,2}), (11)

where hξ,2r1h^{r-1}_{\xi,2} is a random sample from 2r1\mathcal{H}^{r-1}_{2}, where ξ=(j,t,𝐳^j,t,2r1)\xi=(j^{\prime},t^{\prime},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},2}) captures the randomness in the client, the iteration index, and the data sample of the last round. Then, we can use f(𝐮i,kr(𝐳i,k,1r))\nabla f(\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})) in place of f(g(𝐰i,kr,𝐳i,k,1r,𝒮2))\nabla f(g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathcal{S}_{2})) for estimating Δi,1\Delta_{i,1}. However, estimating f(g(𝐰,𝐳,𝒮2))\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2})) in Δi,2\Delta_{i,2} is more nuanced since 𝐳𝒮1j\mathbf{z}\in\mathcal{S}^{j}_{1} is not local random data. To address this, we propose to communicate 𝒰r1={𝐮i,kr1(𝐳^i,k,1r1),i[N],k{0,,K1}}\mathcal{U}^{r-1}=\{\mathbf{u}^{r-1}_{i,k}(\hat{\mathbf{z}}^{r-1}_{i,k,1}),i\in[N],k\in\{0,\ldots,K-1\}\}. Then at the kk-th iteration of the rr-th round on the ii-th client, we can estimate f(g(𝐰,𝐳,𝒮2))\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2})) with a random sample from 𝒰r1\mathcal{U}^{r-1} denoted by uζr1u^{r-1}_{\zeta}, where ζ=(j,t,𝐳^j,t,1r1)\zeta=(j^{\prime},t^{\prime},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}), i.e., by using f(𝐮ζr1)\nabla f(\mathbf{u}_{\zeta}^{r-1}). Then we estimate Δi,1\Delta_{i,1} and Δi,2\Delta_{i,2} by

Gi,k,1r:=\displaystyle G^{r}_{i,k,1}:= (12)
f(𝐮i,kr(𝐳i,k,1r))active1(h(𝐰i,kr,𝐳i,k,1r)active,h2,ξr1passive)h(𝐰i,kr,𝐳i,k,1r)active\displaystyle\!\underbrace{\hbox{\pagecolor{green!30}$\nabla f(\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1}))$}}\limits_{\text{active}}\nabla_{1}\ell(\underbrace{\hbox{\pagecolor{green!30}$h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1})$}}\limits_{\text{active}}\!,\!\underbrace{\hbox{\pagecolor{blue!30}$h^{r-1}_{2,\xi}$}}\limits_{\text{passive}}\!)\!\underbrace{\!\hbox{\pagecolor{green!30}$\nabla h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1})\!$}}\limits_{\text{active}}
Gi,k,2r=\displaystyle G^{r}_{i,k,2}= (13)
f(𝐮ζr1)passive2(h1,ζr1passive,h(𝐰i,kr,𝐳i,k,2r)active)h(𝐰i,kr,𝐳i,k,2r)active\displaystyle\underbrace{\hbox{\pagecolor{blue!30}$\nabla f(\mathbf{u}^{r-1}_{\zeta})$}}\limits_{\text{passive}}\nabla_{2}\ell(\underbrace{\hbox{\pagecolor{blue!30}$h^{r-1}_{1,\zeta}$}}\limits_{\text{passive}},\underbrace{\hbox{\pagecolor{green!30}$h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})$}}\limits_{\text{active}})\underbrace{\hbox{\pagecolor{green!30}$\nabla h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})$}}\limits_{\text{active}}

where j,ξ,j,ζj,\xi,j^{\prime},\zeta are random variables. Another difference from DXO with linear ff is that, even in the centralized setting, directly using Gi,k,1r+Gi,k,2rG^{r}_{i,k,1}+G^{r}_{i,k,2} leads to a worse complexity because the non-linear ff makes the stochastic gradient estimator biased (Wang et al., 2017). Hence, to improve the convergence, we follow existing state-of-the-art algorithms for stochastic compositional optimization (Ghadimi et al., 2020; Wang & Yang, 2022) and compute a moving-average estimator of the gradient on local machines, i.e., Step 17 in Algorithm 2. With these changes, we present the detailed steps of FeDXL2 for solving DXO with non-linear ff in Algorithm 2. The buffers i,\mathcal{B}_{i,*} and 𝒞i\mathcal{C}_{i} are updated similarly to those in FeDXL1. Different from FeDXL1, there is an additional communication cost for communicating 𝒰ir1\mathcal{U}_{i}^{r-1} and an additional buffer 𝒞i\mathcal{C}_{i} on each local machine to store the received 𝒫ir1\mathcal{P}^{r-1}_{i} from the aggregated 𝒰r1\mathcal{U}^{r-1}. Nevertheless, these additional costs are marginal compared with communicating r1\mathcal{H}^{r-1}_{*} and maintaining the buffer i,\mathcal{B}_{i,*}.
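The two moving averages that drive each local iteration, the inner estimate 𝐮(𝐳) in (11) and the gradient estimator in Step 17, can be sketched in a few lines. This is our hedged illustration, not the released code; `loss_val` and `grad_pair` are placeholders standing in for ℓ(h(𝐰,𝐳),h^{r−1}_ξ) and G_{i,k,1}+G_{i,k,2} assembled from active and passive parts.

```python
import numpy as np

def local_step(w, G, u_z, loss_val, grad_pair, eta=0.01, beta=0.1, gamma=0.1):
    """One inner iteration on client i (illustrative sketch).

    u_z:       running estimate u(z) of the inner value g(w, z, S_2)
    loss_val:  pairwise loss mixing a local (active) prediction with a
               passive, historical prediction received from the server
    grad_pair: the sum G_{i,k,1} + G_{i,k,2} from (12)-(13)
    """
    u_z = (1.0 - gamma) * u_z + gamma * loss_val   # moving average, eq. (11)
    G = (1.0 - beta) * G + beta * grad_pair        # moving average, Step 17
    w = w - eta * G                                # local SGD-style update
    return w, G, u_z

w, G, u = np.zeros(2), np.zeros(2), 0.0
w, G, u = local_step(w, G, u, loss_val=1.0, grad_pair=np.ones(2))
print(u, G)   # 0.1 [0.1 0.1]
```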

We make the following assumptions regarding problem (8).

Assumption 3.3.
  • ()\ell(\cdot) is differentiable, LL_{\ell}-smooth and CC_{\ell}-Lipschitz. |()|C0|\ell(\cdot)|\leq C_{0}.

  • f()f(\cdot) is differentiable, LfL_{f}-smooth and CfC_{f}-Lipschitz.

  • h(,𝐳)h(\cdot,\mathbf{z}) is differentiable, LhL_{h}-smooth and ChC_{h}-Lipschitz on 𝐰\mathbf{w} for any 𝐳𝒮1𝒮2\mathbf{z}\in\mathcal{S}_{1}\cup\mathcal{S}_{2}.

  • 𝔼𝐳𝒮1i𝔼j[1:N]𝔼𝐳𝒮2j\mathbb{E}_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}_{j\in[1:N]}\mathbb{E}_{\mathbf{z}^{\prime}\in\mathcal{S}_{2}^{j}}\\ \|\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}))\nabla_{1}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}^{\prime}))\nabla h(\mathbf{w},\mathbf{z})\\ ~{}+\nabla f(g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}))\nabla_{2}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}^{\prime}))\nabla h(\mathbf{w},\mathbf{z}^{\prime})-\nabla F_{i}(\mathbf{w})\|^{2}\leq\sigma^{2}.

  • D\exists D such that Fi(𝐰)F(𝐰)2D2,i\|\nabla F_{i}(\mathbf{w})-\nabla F(\mathbf{w})\|^{2}\leq D^{2},\forall i.

We present the convergence result of FeDXL2 below.

Theorem 3.4.

Under Assumption 3.3, denoting M=maxi|𝒮1i|M=\max_{i}|\mathcal{S}^{i}_{1}| as the largest number of data on a single machine, by setting γ=O(M1/3R2/3)\gamma=O(\frac{M^{1/3}}{R^{2/3}}), β=O(1M1/6R2/3)\beta=O(\frac{1}{M^{1/6}R^{2/3}}), η=O(1M2/3R2/3)\eta=O(\frac{1}{M^{2/3}R^{2/3}}) and K=O(M1/3R1/3)K=O(M^{1/3}R^{1/3}), Algorithm 2 ensures that

1Rr=1R𝔼F(𝐰¯r)2O(1R2/3).\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\leq O(\frac{1}{R^{2/3}}).

Remark. To achieve 𝔼[1Rr=1RF(𝐰¯r)2]ϵ2\mathbb{E}[\frac{1}{R}\sum_{r=1}^{R}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}]\leq\epsilon^{2}, we set R=O(M1/2ϵ3)R=O(\frac{M^{1/2}}{\epsilon^{3}}), η=O(ϵ2M)\eta=O(\frac{\epsilon^{2}}{M}), γ=O(ϵ2)\gamma=O(\epsilon^{2}), β=O(ϵ2M)\beta=O(\frac{\epsilon^{2}}{\sqrt{M}}) and K=O(M1/2ϵ)K=O(\frac{M^{1/2}}{\epsilon}). The number of communications R=O(M1/2ϵ3)R=O(\frac{M^{1/2}}{\epsilon^{3}}) is smaller than the total number of iterations O(Mϵ4)O(\frac{M}{\epsilon^{4}}) by a factor of O(M1/2/ϵ)O(M^{1/2}/\epsilon). Moreover, the sample complexity on each machine is O(Mϵ4)O(\frac{M}{\epsilon^{4}}), which is smaller than the sample complexity O(i=1N|𝒮1i|/ϵ4)O(\sum\nolimits_{i=1}^{N}|\mathcal{S}^{i}_{1}|/\epsilon^{4}) of the centralized method in (Wang & Yang, 2022). When the data are evenly distributed across machines, this yields a linear speedup. In the extreme case where all data reside on one machine, the sample complexity of FeDXL2 matches that of (Wang & Yang, 2022), as expected. Compared with FeDXL1, the analysis of FeDXL2 must handle extra difficulties. First, with non-linear ff, the coupling between the inner and outer functions compounds the interdependence between different rounds and machines. Second, we have to control the error of the passive part related to 𝐮\mathbf{u}.
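The bookkeeping in the remark can be verified with a few lines. The sketch below drops all constants; `fedxl2_schedule` is our helper name, not part of the paper.

```python
import math

# With R = M^{1/2}/eps^3 rounds and K = M^{1/2}/eps local steps per round,
# the total iteration count R * K equals M/eps^4, so the number of
# communications is smaller than the number of iterations by a factor of K.
def fedxl2_schedule(M, eps):
    R = math.sqrt(M) / eps ** 3    # communication rounds
    K = math.sqrt(M) / eps         # local steps per round (comm. interval)
    return R, K, R * K             # rounds, interval, total iterations

R, K, total = fedxl2_schedule(M=10_000, eps=0.1)
print(math.isclose(total, 10_000 / 0.1 ** 4), math.isclose(total / R, K))  # True True
```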

Our analysis of FeDXL2 with the moving-average gradient estimator differs from previous studies of local momentum methods for ERM problems (Yu et al., 2019a; Karimireddy et al., 2020a), which used a fixed momentum parameter. In contrast, in FeDXL2 the momentum parameter β\beta decreases as RR increases, which is similar to centralized algorithms for compositional problems (Ghadimi et al., 2020; Wang & Yang, 2022).

Table 2: Comparison for Federated Deep Partial AUC Maximization. All reported results are partial AUC scores on testing data.
| K=32, N=16 | FPR | Centralized (OPAUC Loss) | Local SGD (CE Loss) | CODASCA (Min-Max AUC) | Local Pair (OPAUC Loss) | FeDXL2 (OPAUC Loss) |
| --- | --- | --- | --- | --- | --- | --- |
| Cifar10 | FPR ≤ 0.3 | 0.7655±0.0039 | 0.6825±0.0047 | 0.7288±0.0035 | 0.7487±0.0059 | 0.7580±0.0034 |
| Cifar10 | FPR ≤ 0.5 | 0.8032±0.0039 | 0.7279±0.0050 | 0.7702±0.0029 | 0.7888±0.0052 | 0.7978±0.0026 |
| Cifar100 | FPR ≤ 0.3 | 0.6287±0.0037 | 0.5875±0.0016 | 0.6131±0.0054 | 0.6281±0.0032 | 0.6332±0.0024 |
| Cifar100 | FPR ≤ 0.5 | 0.6487±0.0026 | 0.6124±0.0021 | 0.6406±0.0041 | 0.6569±0.0017 | 0.6623±0.0022 |
| CheXpert | FPR ≤ 0.3 | 0.7220±0.0035 | 0.6495±0.0039 | 0.6903±0.0059 | 0.6902±0.0053 | 0.7344±0.0042 |
| CheXpert | FPR ≤ 0.5 | 0.7861±0.0040 | 0.7017±0.0042 | 0.7770±0.0071 | 0.7483±0.0033 | 0.7918±0.0037 |
| ChestMNIST | FPR ≤ 0.3 | 0.6344±0.0053 | 0.5904±0.0012 | 0.6071±0.0040 | 0.5802±0.0039 | 0.6228±0.0048 |
| ChestMNIST | FPR ≤ 0.5 | 0.6622±0.0029 | 0.6072±0.0034 | 0.6272±0.0038 | 0.6026±0.0025 | 0.6490±0.0039 |
Table 3: Comparison for Federated Deep AUC maximization under corrupted labels. All reported results are AUC scores on testing data.
| K=32, N=16 | Centralized (PSM Loss) | Local SGD (CE Loss) | CODASCA (Min-Max AUC) | Local Pair (PSM Loss) | FeDXL1 (PSM Loss) |
| --- | --- | --- | --- | --- | --- |
| Cifar10 | 0.7352±0.0043 | 0.6501±0.0024 | 0.6407±0.0044 | 0.7287±0.0027 | 0.7344±0.0038 |
| Cifar100 | 0.6114±0.0038 | 0.5700±0.0031 | 0.5950±0.0039 | 0.6175±0.0045 | 0.6208±0.0041 |
| CheXpert | 0.8149±0.0031 | 0.6782±0.0032 | 0.7062±0.0085 | 0.7924±0.0043 | 0.8431±0.0027 |
| ChestMNIST | 0.7227±0.0026 | 0.5642±0.0041 | 0.6509±0.0033 | 0.6766±0.0019 | 0.6925±0.0030 |

4 Experiments

To verify our theories, we experiment on two tasks: federated deep partial AUC maximization and federated deep AUC maximization with a pairwise surrogate loss, which correspond to (1) with non-linear and linear ff, respectively. Code is released at https://github.com/Optimization-AI/ICML2023_FeDXL.

Datasets and Neural Networks. We use four datasets: Cifar10, Cifar100 (Krizhevsky et al., 2009), CheXpert (Irvin et al., 2019), and ChestMNIST (Yang et al., 2021a), where the latter two datasets are large-scale medical image data. For Cifar10 and Cifar100, we sample 20% of the training data as validation set, and construct imbalanced binary versions with positive:negative = 1:5 in the training set similar to (Yuan et al., 2021b). For CheXpert, we consider the task of predicting Consolidation and use the last 1000 images in the training set as the validation set and use the original validation set as the testing set. For ChestMNIST, we consider the task of Mass prediction and use the provided train/valid/test split. We distribute training data to N=16N=16 machines unless specified otherwise. To increase the heterogeneity of data on different machines, we add random Gaussian noise of 𝒩(μ,0.04)\mathcal{N}(\mu,0.04) to all training images, where μ{0.08:0.01:0.08}\mu\in\{-0.08:0.01:0.08\} that varies on different machines, i.e., for the ii-th machine out of the N=16N=16 machines, its μ=0.08+i0.01\mu=-0.08+i*0.01. We train ResNet18 from scratch for CIFAR-10 and CIFAR-100 data, and initialize DenseNet121 by an ImageNet pretrained model for CheXpert and ChestMNIST. We use the PyTorch framework (Paszke et al., 2019).
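The heterogeneity protocol above can be sketched as follows. This is our reading of the stated recipe, not the authors' released preprocessing code; 0-based client indexing and a noise standard deviation of 0.2 (variance 0.04) are our interpretation.

```python
import numpy as np

def add_client_noise(images, client_id, n_clients=16, seed=None):
    """Perturb one client's images with client-specific Gaussian noise.

    Client i draws noise from N(mu_i, 0.04) with mu_i = -0.08 + i * 0.01,
    so the noise mean varies across the N = 16 machines, increasing the
    heterogeneity of the local data distributions.
    """
    assert 0 <= client_id < n_clients
    rng = np.random.default_rng(seed)
    mu = -0.08 + client_id * 0.01
    return images + rng.normal(loc=mu, scale=0.2, size=images.shape)

# client 3 gets mean-shift mu = -0.05
noisy = add_client_noise(np.zeros((4, 32, 32)), client_id=3, seed=0)
```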

Baselines. We compare our algorithms with three local baselines: 1) Local SGD, which optimizes a cross-entropy loss using the classical local SGD algorithm; 2) CODASCA, a state-of-the-art FL algorithm for optimizing a min-max formulated AUC loss (Yuan et al., 2021a); and 3) Local Pair, which optimizes the X-risk using only local pairs. As a reference, we also compare with Centralized methods, i.e., mini-batch SGD for DXO with linear ff and SOX for DXO with non-linear ff. We tune the initial step size in [1e3,1][1e^{-3},1] using grid search and decay it by a factor of 0.1 every 5K iterations. All algorithms are run for 20K iterations. The mini-batch sizes B1,B2B_{1},B_{2} (as in Step 11 of FeDXL1 and FeDXL2) are set to 32. The β\beta parameter of FeDXL2 (and the corresponding Local Pair and Centralized methods) is set to 0.10.1. For the Centralized method, we tune the batch sizes B1B_{1} and B2B_{2} over {32,64,128,256,512}\{32,64,128,256,512\} to benchmark its best performance. For CODASCA and Local SGD, which do not use pairwise losses, we set the batch size to 64 for a fair comparison with FeDXL. For all non-centralized algorithms, we set the communication interval K=32K=32 unless specified otherwise. In every run, we use the validation set to select the best-performing model, which is then evaluated on the testing set. For each algorithm, we repeat 3 times with different random seeds and report the averaged performance.

FeDXL2 for Federated Deep Partial AUC Maximization.
We consider the task of one way partial AUC maximization, which refers to the area under the ROC curve with false positive rate (FPR) restricted to be less than a threshold. We consider the KL-OPAUC loss function proposed in (Zhu et al., 2022),

min𝐰d1Ni=1N𝔼𝐳𝒮1iλlog(1Nj=1N𝔼𝐳𝒮2j(𝐰,𝐳,𝐳)),\min_{\mathbf{w}\in\mathbb{R}^{d}}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\mathbf{z}\in\mathcal{S}^{i}_{1}}\lambda\log\bigg{(}\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\mathbf{z}^{\prime}\in\mathcal{S}^{j}_{2}}\ell(\mathbf{w},\mathbf{z},\mathbf{z}^{\prime})\bigg{)}, (14)

where 𝒮1i\mathcal{S}_{1}^{i} denotes the set of positive data, 𝒮2j\mathcal{S}_{2}^{j} denotes the set of negative data, and (𝐰,𝐳,𝐳)=exp((1+h(𝐰,𝐳)h(𝐰,𝐳))+2/λ)\ell(\mathbf{w},\mathbf{z},\mathbf{z}^{\prime})=\exp((1+h(\mathbf{w},\mathbf{z}^{\prime})-h(\mathbf{w},\mathbf{z}))_{+}^{2}/\lambda), where λ\lambda is a parameter tuned in [1:5][1:5]. The experimental results are reported in Table 2. We can see: (i) FeDXL2 outperforms all local methods (i.e., Local SGD, Local Pair and CODASCA), and achieves performance competitive with the Centralized method, which indicates that our algorithm can effectively utilize data on all machines; its better performance on CIFAR100 and CheXpert is probably because the Centralized method overfits the training data. (ii) FeDXL2 outperforms the Local Pair method, which implies that using data pairs from all machines is helpful for partial AUC maximization. (iii) FeDXL2 outperforms CODASCA, which is not surprising since CODASCA is designed to optimize the AUC loss, while FeDXL2 optimizes the partial AUC loss.
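For concreteness, here is a small numeric sketch of the objective (14) on raw scores. It is our illustration only: we assume the squared-hinge margin form (1+h(𝐰,𝐳′)−h(𝐰,𝐳))₊² inside the exponent, with 𝐳 positive and 𝐳′ negative, and λ as the temperature tuned in [1, 5].

```python
import numpy as np

def kl_opauc_loss(pos_scores, neg_scores, lam=2.0):
    """KL-OPAUC-style objective (14) on raw scores (illustrative sketch)."""
    # pairwise hinge margins: rows index positives z, columns negatives z'
    margins = np.maximum(1.0 + neg_scores[None, :] - pos_scores[:, None], 0.0)
    # inner average g(w, z, S_2), one entry per positive example
    inner = np.mean(np.exp(margins ** 2 / lam), axis=1)
    # outer f = lam * log, averaged over positives
    return float(np.mean(lam * np.log(inner)))

print(kl_opauc_loss(np.array([2.0, 1.5]), np.array([0.0, 0.4])))  # 0.0: all pairs separated by the margin
print(kl_opauc_loss(np.array([2.0, 1.5]), np.array([0.7, 0.4])))  # small positive: one pair violates the margin
```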

FeDXL1 for Federated Deep AUC Maximization with Corrupted Labels. Second, we consider the task of federated deep AUC maximization. Since deep AUC maximization by solving a min-max loss (an equivalent form of the pairwise square loss) has been developed in previous works (Yuan et al., 2021a), we aim to justify the benefit of using the general pairwise loss formulation. According to (Charoenphakdee et al., 2019), a symmetric loss, i.e., one for which (z)+(z)\ell(z)+\ell(-z) is a constant, can be more robust to data with corrupted labels for AUC maximization. Since the square loss is not symmetric, we conjecture that the min-max federated deep AUC maximization algorithm CODASCA is not robust to label noise. In contrast, our algorithm FeDXL1 can optimize a symmetric pairwise loss; hence we expect FeDXL1 to outperform CODASCA in the presence of corrupted labels. To verify this hypothesis, we generate corrupted data by flipping the labels of 20% of both the positive and negative training data. We use FeDXL1/Local Pair to optimize the symmetric pairwise sigmoid (PSM) loss (Calders & Jaroszewicz, 2007), which corresponds to (1) with linear f(s)=sf(s)=s and (a,b)=(1+exp(ab))1\ell(a,b)=(1+\exp(a-b))^{-1}, where aa is the score of a positive example and bb is the score of a negative example. Specifically,

min𝐰d1Ni=1N𝔼𝐳𝒮1i1Nj=1N𝔼𝐳𝒮2j(h(𝐰,𝐳),h(𝐰,𝐳)),\min_{\mathbf{w}\in\mathbb{R}^{d}}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\mathbf{z}\in\mathcal{S}^{i}_{1}}\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\mathbf{z}^{\prime}\in\mathcal{S}^{j}_{2}}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}^{\prime})),

where 𝒮1i\mathcal{S}_{1}^{i} denotes the set of positive data, 𝒮2j\mathcal{S}_{2}^{j} denotes the set of negative data, and (h(𝐰,𝐳),h(𝐰,𝐳))=(1+exp(h(𝐰,𝐳)h(𝐰,𝐳)))1\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w},\mathbf{z}^{\prime}))=(1+\exp(h(\mathbf{w},\mathbf{z})-h(\mathbf{w},\mathbf{z}^{\prime})))^{-1}. The results are reported in Table 3. We observe that FeDXL1 is more robust to label noise than the other local methods, including Local SGD, Local Pair, and CODASCA, which optimizes a min-max AUC loss. As before, FeDXL1 achieves performance competitive with the Centralized method.
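The symmetry property that motivates the PSM loss is easy to verify numerically. The sketch below is our illustration: writing the loss on the score difference t = h(𝐰,𝐳) − h(𝐰,𝐳′), it checks that ℓ(t) + ℓ(−t) is the constant 1, which is the robustness criterion from (Charoenphakdee et al., 2019) cited above.

```python
import numpy as np

def psm_loss(pos_score, neg_score):
    """Pairwise sigmoid (PSM) loss: l(a, b) = 1 / (1 + exp(a - b))."""
    return 1.0 / (1.0 + np.exp(pos_score - neg_score))

# check the symmetry property l(t) + l(-t) = const over a grid of differences
t = np.linspace(-5, 5, 11)
sym = psm_loss(t, 0.0) + psm_loss(-t, 0.0)
print(np.allclose(sym, 1.0))   # True: the PSM loss is symmetric
```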

The running time comparison, statistics of data, and ablation studies are in Appendix C.

5 Conclusion

We have considered federated learning (FL) for deep X-risk optimization. We have developed communication-efficient FL algorithms to alleviate the interdependence between different machines. Novel convergence analysis is performed to address the technical challenges and to improve both iteration and communication complexities of proposed algorithms. We have conducted empirical studies of the proposed FL algorithms for solving deep partial AUC maximization and deep AUC maximization and achieved promising results compared with several baselines.

6 Limitations and Potential Negative Societal Impacts

While the current communication complexity is O(1/ϵ3)O(1/\epsilon^{3}), there may still be room for improvement to further reduce the communication cost because the state-of-the-art communication complexity for federated ERM problems is O(1/ϵ2)O(1/\epsilon^{2}). Our experimental results indicate that FeDXL may offer better generalization performance than centralized algorithms. However, a more rigorous analysis is necessary to better understand this phenomenon and leverage it effectively. While this work has verified the performance of FeDXL on partial AUC maximization and AUC maximization problems, more studies are needed to test FeDXL on other federated DXO problems and beyond. We do not see any potential negative societal impact.

Acknowledgements

We appreciate the feedback provided by the anonymous reviewers. This work has been partially supported by NSF Career Award 2246753, NSF Grant 2246757 and NSF Grant 2246756.

References

  • Basu et al. (2019) Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019.
  • Bernstein et al. (2018) Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signsgd: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018.
  • Boyd et al. (2013) Boyd, K., Eng, K. H., and Page, C. D. Area under the precision-recall curve: point estimates and confidence intervals. In Joint European conference on machine learning and knowledge discovery in databases, pp.  451–466. Springer, 2013.
  • Calders & Jaroszewicz (2007) Calders, T. and Jaroszewicz, S. Efficient AUC optimization for classification. In Knowledge Discovery in Databases: PKDD 2007, 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, September 17-21, 2007, Proceedings, volume 4702 of Lecture Notes in Computer Science, pp.  42–53. Springer, 2007.
  • Charoenphakdee et al. (2019) Charoenphakdee, N., Lee, J., and Sugiyama, M. On symmetric losses for learning from corrupted labels. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  961–970. PMLR, 2019.
  • Clémençon et al. (2008) Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and empirical minimization of u-statistics. The Annals of Statistics, 36(2):844–874, 2008.
  • Cohen et al. (1997) Cohen, W. W., Schapire, R. E., and Singer, Y. Learning to order things. Advances in neural information processing systems, 10, 1997.
  • Dembczynski et al. (2012) Dembczynski, K., Kotlowski, W., and Hüllermeier, E. Consistent multilabel ranking through univariate losses. arXiv preprint arXiv:1206.6401, 2012.
  • Deng & Mahdavi (2021) Deng, Y. and Mahdavi, M. Local stochastic gradient descent ascent: Convergence analysis and communication efficiency. In International Conference on Artificial Intelligence and Statistics, pp.  1387–1395. PMLR, 2021.
  • Deng et al. (2020) Deng, Y., Kamani, M. M., and Mahdavi, M. Distributionally robust federated averaging. Advances in Neural Information Processing Systems, 33:15111–15122, 2020.
  • Gao et al. (2022) Gao, H., Li, J., and Huang, H. On the convergence of local stochastic compositional gradient descent with momentum. In International Conference on Machine Learning, pp. 7017–7035. PMLR, 2022.
  • Gao & Zhou (2015) Gao, W. and Zhou, Z. On the consistency of AUC pairwise optimization. In Yang, Q. and Wooldridge, M. J. (eds.), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp.  939–945. AAAI Press, 2015.
  • Gao et al. (2013) Gao, W., Jin, R., Zhu, S., and Zhou, Z. One-pass AUC optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pp.  906–914. JMLR.org, 2013.
  • Ghadimi et al. (2020) Ghadimi, S., Ruszczynski, A., and Wang, M. A single timescale stochastic approximation method for nested stochastic optimization. SIAM J. Optim., 30(1):960–979, 2020.
  • Goldberger et al. (2004) Goldberger, J., Hinton, G. E., Roweis, S., and Salakhutdinov, R. R. Neighbourhood components analysis. Advances in neural information processing systems, 17, 2004.
  • Guo et al. (2020) Guo, Z., Liu, M., Yuan, Z., Shen, L., Liu, W., and Yang, T. Communication-efficient distributed stochastic auc maximization with deep neural networks. In International Conference on Machine Learning, pp. 3864–3874. PMLR, 2020.
  • Haddadpour et al. (2019) Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. Advances in Neural Information Processing Systems, 32, 2019.
  • Han et al. (2022) Han, S., Park, S., Wu, F., Kim, S., Wu, C., Xie, X., and Cha, M. Fedx: Unsupervised federated learning with cross knowledge distillation, 2022. URL https://arxiv.org/abs/2207.09158.
  • Hanley & McNeil (1982) Hanley, J. A. and McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.
  • Hu et al. (2020) Hu, Y., Zhang, S., Chen, X., and He, N. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Huang et al. (2022) Huang, Y., Lin, Q., Street, N., and Baek, S. Federated learning on adaptively weighted nodes by bilevel optimization. arXiv preprint arXiv:2207.10751, 2022.
  • Irvin et al. (2019) Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R. L., Shpanskaya, K. S., Seekins, J., Mong, D. A., Halabi, S. S., Sandberg, J. K., Jones, R., Larson, D. B., Langlotz, C. P., Patel, B. N., Lungren, M. P., and Ng, A. Y. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp.  590–597. AAAI Press, 2019.
  • Jiang & Agrawal (2018) Jiang, P. and Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. Advances in Neural Information Processing Systems, 31, 2018.
  • Jiang et al. (2022) Jiang, W., Li, G., Wang, Y., Zhang, L., and Yang, T. Multi-block-single-probe variance reduced estimator for coupled compositional optimization. In NeurIPS, 2022.
  • Kairouz et al. (2021) Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Karimireddy et al. (2020a) Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
  • Karimireddy et al. (2020b) Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020b.
  • Khaled et al. (2020) Khaled, A., Mishchenko, K., and Richtárik, P. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp.  4519–4529. PMLR, 2020.
  • Konečnỳ et al. (2016) Konečnỳ, J., McMahan, H. B., Ramage, D., and Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
  • Kotlowski et al. (2011) Kotlowski, W., Dembczynski, K., and Hüllermeier, E. Bipartite ranking through minimization of univariate loss. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp.  1113–1120. Omnipress, 2011.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images, 2009.
  • Li & Huang (2022) Li, J. and Huang, H. Fedgrec: Federated graph recommender system with lazy update of latent embeddings. arXiv preprint arXiv:2210.13686, 2022.
  • Li et al. (2022) Li, J., Pei, J., and Huang, H. Communication-efficient robust federated learning with noisy labels. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  914–924, 2022.
  • Liu et al. (2020) Liu, M., Zhang, W., Mroueh, Y., Cui, X., Ross, J., Yang, T., and Das, P. A decentralized parallel algorithm for training generative adversarial nets. Advances in Neural Information Processing Systems, 33:11056–11070, 2020.
  • McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.  1273–1282. PMLR, 2017.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Qi et al. (2021) Qi, Q., Luo, Y., Xu, Z., Ji, S., and Yang, T. Stochastic optimization of areas under precision-recall curves with provable convergence. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp.  1752–1765, 2021.
  • Qiu et al. (2022) Qiu, Z., Hu, Q., Zhong, Y., Zhang, L., and Yang, T. Large-scale stochastic optimization of NDCG surrogates for deep learning with provable convergence. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  18122–18152. PMLR, 2022.
  • Radenović et al. (2016) Radenović, F., Tolias, G., and Chum, O. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In European conference on computer vision, pp.  3–20. Springer, 2016.
  • Rudin (2009) Rudin, C. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. J. Mach. Learn. Res., 10:2233–2271, 2009.
  • Sharma et al. (2022) Sharma, P., Panda, R., Joshi, G., and Varshney, P. Federated minimax optimization: Improved convergence analyses and algorithms. In International Conference on Machine Learning, pp. 19683–19730. PMLR, 2022.
  • Smith et al. (2018) Smith, V., Forte, S., Chenxin, M., Takáč, M., Jordan, M. I., and Jaggi, M. Cocoa: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
  • Stich (2018) Stich, S. U. Local sgd converges fast and communicates little. In International Conference on Learning Representations, 2018.
  • Stich et al. (2018) Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified sgd with memory. Advances in Neural Information Processing Systems, 31, 2018.
  • Tarzanagh et al. (2022) Tarzanagh, D. A., Li, M., Thrampoulidis, C., and Oymak, S. FedNest: Federated bilevel, minimax, and compositional optimization. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  21146–21179. PMLR, 17–23 Jul 2022.
  • Wang & Yang (2022) Wang, B. and Yang, T. Finite-sum coupled compositional stochastic optimization: Theory and applications. In International Conference on Machine Learning, pp. 23292–23317. PMLR, 2022.
  • Wang et al. (2022) Wang, G., Yang, M., Zhang, L., and Yang, T. Momentum accelerates the convergence of stochastic AUPRC maximization. In International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pp.  3753–3771. PMLR, 2022.
  • Wang et al. (2017) Wang, M., Fang, E. X., and Liu, H. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program., 161(1-2):419–449, 2017.
  • Wangni et al. (2018) Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems, 31, 2018.
  • Woodworth et al. (2020a) Woodworth, B., Patel, K. K., Stich, S., Dai, Z., Bullins, B., Mcmahan, B., Shamir, O., and Srebro, N. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pp. 10334–10343. PMLR, 2020a.
  • Woodworth et al. (2020b) Woodworth, B. E., Patel, K. K., and Srebro, N. Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281–6292, 2020b.
  • Wu et al. (2017) Wu, C.-Y., Manmatha, R., Smola, A. J., and Krahenbuhl, P. Sampling matters in deep embedding learning. In Proceedings of the IEEE international conference on computer vision, pp.  2840–2848, 2017.
  • Wu et al. (2022) Wu, Y., Wang, Z., Zeng, D., Li, M., Shi, Y., and Hu, J. Federated contrastive representation learning with feature fusion and neighborhood matching, 2022. URL https://openreview.net/forum?id=6LNPEcJAGWe.
  • Xing et al. (2022) Xing, P., Lu, S., Wu, L., and Yu, H. Big-fed: Bilevel optimization enhanced graph-aided federated learning. IEEE Transactions on Big Data, pp.  1–12, 2022. doi: 10.1109/TBDATA.2022.3191439.
  • Yang et al. (2021a) Yang, J., Shi, R., Wei, D., Liu, Z., Zhao, L., Ke, B., Pfister, H., and Ni, B. Medmnist v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. arXiv preprint arXiv:2110.14795, 2021a.
  • Yang (2013) Yang, T. Trading computation for communication: Distributed stochastic dual coordinate ascent. Advances in Neural Information Processing Systems, 26, 2013.
  • Yang (2022) Yang, T. Algorithmic foundation of deep x-risk optimization. CoRR, abs/2206.00439, 2022. doi: 10.48550/arXiv.2206.00439. URL https://doi.org/10.48550/arXiv.2206.00439.
  • Yang & Ying (2022) Yang, T. and Ying, Y. Auc maximization in the era of big data and ai: A survey. ACM Comput. Surv., aug 2022. ISSN 0360-0300. doi: 10.1145/3554729. URL https://doi.org/10.1145/3554729. Just Accepted.
  • Yang et al. (2021b) Yang, Z., Lei, Y., Wang, P., Yang, T., and Ying, Y. Simple stochastic and online gradient descent algorithms for pairwise learning. Advances in Neural Information Processing Systems, 34:20160–20171, 2021b.
  • Yu et al. (2019a) Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In International Conference on Machine Learning, pp. 7184–7193. PMLR, 2019a.
  • Yu et al. (2019b) Yu, H., Yang, S., and Zhu, S. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  5693–5700, 2019b.
  • Yuan et al. (2021a) Yuan, Z., Guo, Z., Xu, Y., Ying, Y., and Yang, T. Federated deep auc maximization for hetergeneous data with a constant communication complexity. In International Conference on Machine Learning, pp. 12219–12229. PMLR, 2021a.
  • Yuan et al. (2021b) Yuan, Z., Yan, Y., Sonka, M., and Yang, T. Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp.  3020–3029. IEEE, 2021b.
  • Yuan et al. (2022) Yuan, Z., Wu, Y., Qiu, Z., Du, X., Zhang, L., Zhou, D., and Yang, T. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  25760–25782. PMLR, 2022. URL https://proceedings.mlr.press/v162/yuan22b.html.
  • Zhang et al. (2020) Zhang, F., Kuang, K., You, Z., Shen, T., Xiao, J., Zhang, Y., Wu, C., Zhuang, Y., and Li, X. Federated unsupervised representation learning. CoRR, abs/2010.08982, 2020. URL https://arxiv.org/abs/2010.08982.
  • Zhao et al. (2011) Zhao, P., Hoi, S. C. H., Jin, R., and Yang, T. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp.  233–240, 2011.
  • Zhu et al. (2022) Zhu, D., Li, G., Wang, B., Wu, X., and Yang, T. When AUC meets DRO: optimizing partial AUC for deep learning with non-convex convergence guarantee. CoRR, abs/2203.00176, 2022.

Appendix A Notations

Table 4: Notations
\mathbf{w} Model parameters of the neural network, the variables to be trained
\mathbf{w}^{r}_{i,k} Model parameters of machine i at round r, iteration k
\mathbf{z} A data point
\mathbf{z}_{i} A data point from machine i
\mathbf{z}^{r}_{i,k} A data point sampled on machine i at round r, iteration k
\mathbf{z}^{r}_{i,k,1},\mathbf{z}^{r}_{i,k,2} Two independent data points sampled on machine i at round r, iteration k
h(\mathbf{w},\mathbf{z}) The prediction score of data \mathbf{z} by network \mathbf{w}
G^{r}_{i,k,1},G^{r}_{i,k,2} Stochastic estimators of components of the gradient
\mathcal{H}^{r}_{i,1},\mathcal{H}^{r}_{i,2} Collected historical prediction scores on machine i at round r
\mathbf{u}(\mathbf{z}) Moving average estimator of the inner function g(\mathbf{w},\mathbf{z},\mathcal{S}_{2})
\mathbf{u}^{r}_{i,k}(\mathbf{z}) Moving average estimator of the inner function g(\mathbf{w},\mathbf{z},\mathcal{S}_{2}) on machine i at round r, iteration k
\mathcal{U}^{r}_{i} Collected historical \mathbf{u} on machine i at round r
h^{r-1}_{\epsilon},h^{r-1}_{\zeta} Prediction scores sampled from the collected scores of round r-1
u^{r-1}_{\zeta} Moving average estimator sampled from the collected moving average estimators of round r-1

Appendix B Applications of DXO Problems

We now present some concrete applications of DXO problems, including AUROC maximization, partial AUROC maximization, and AUPRC maximization. A more comprehensive list of DXO problems is discussed in the Introduction and can also be found in a recent survey (Yang, 2022).

AUROC Maximization The area under ROC curve (AUROC) is defined (Hanley & McNeil, 1982) as

\text{AUROC}(\mathbf{w})=\mathbb{E}[\mathbb{I}(h(\mathbf{w},\mathbf{z})\geq h(\mathbf{w},\mathbf{z}^{\prime}))|y=+1,y^{\prime}=-1], (15)

where \mathbf{z},\mathbf{z}^{\prime} are a pair of data features and y,y^{\prime} are the corresponding labels. To maximize the AUROC, a number of surrogate losses \ell(\cdot), e.g., \ell(\mathbf{w};\mathbf{z},\mathbf{z}^{\prime})=(1-h(\mathbf{w},\mathbf{z})+h(\mathbf{w},\mathbf{z}^{\prime}))^{2}, have been proposed in the literature (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b), which formulate the problem as

\begin{split}\min\limits_{\mathbf{w}}\frac{1}{|\mathcal{S}_{1}|}\sum\limits_{\mathbf{z}_{i}\in\mathcal{S}_{1}}\frac{1}{|\mathcal{S}_{2}|}\sum\limits_{\mathbf{z}_{j}\in\mathcal{S}_{2}}\ell(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j}),\end{split} (16)

where \mathcal{S}_{1} is the set of data with positive labels and \mathcal{S}_{2} is the set of data with negative labels. This is a DXO problem of (1) with f(x)=x.
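As a concrete illustration, objective (16) with the squared surrogate loss can be evaluated directly from the two sets of prediction scores. The sketch below is illustrative: the arrays `pos_scores` and `neg_scores` stand in for the scores h(\mathbf{w},\mathbf{z}) over \mathcal{S}_{1} and \mathcal{S}_{2}.

```python
import numpy as np

def pairwise_auroc_loss(pos_scores, neg_scores):
    # Empirical risk (16) with the squared surrogate
    # l(w; z, z') = (1 - h(w,z) + h(w,z'))^2,
    # averaged over all |S1| x |S2| positive-negative score pairs.
    diff = 1.0 - pos_scores[:, None] + neg_scores[None, :]
    return float(np.mean(diff ** 2))
```

Note that the loss is zero exactly when every positive score exceeds every negative score by the margin 1.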

Partial AUROC Maximization In medical diagnosis, a high false positive rate (FPR) or a low true positive rate (TPR) may incur a large cost. To alleviate this, we also consider optimizing the partial AUC (pAUC), i.e., maximizing the area under the ROC curve subject to the restriction that the false positive rate is below a certain level. In (Zhu et al., 2022), it has been shown that the partial AUROC maximization problem can be solved via

\begin{split}\min_{\mathbf{w}}\frac{1}{|\mathcal{S}_{1}|}\sum\limits_{\mathbf{z}_{i}\in\mathcal{S}_{1}}\lambda\log\left(\frac{1}{|\mathcal{S}_{2}|}\sum\limits_{\mathbf{z}_{j}\in\mathcal{S}_{2}}\exp\left(\frac{\tilde{\ell}(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j})}{\lambda}\right)\right),\end{split} (17)

where \mathcal{S}_{1} is the set of positive data, \mathcal{S}_{2} is the set of negative data, \tilde{\ell}(\cdot) is a surrogate loss, and \lambda is associated with the tolerance level of the false positive rate. This is a DXO problem of (1) with f(x)=\lambda\log(x) and \ell(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j})=\exp(\frac{\tilde{\ell}(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j})}{\lambda}).
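A minimal sketch of objective (17), using the squared surrogate from the AUROC example as an illustrative choice of \tilde{\ell}; the softened-max structure (a small \lambda concentrates the inner average on the hardest negatives) is visible in the code:

```python
import numpy as np

def pauc_surrogate(pos_scores, neg_scores, lam=1.0):
    # Objective (17): for each positive z_i, aggregate the losses over
    # negatives via lam * log( (1/|S2|) sum_j exp(l~(z_i, z_j)/lam) ),
    # then average over positives. The squared surrogate l~ below is
    # an illustrative choice; any surrogate loss works.
    ell = (1.0 - pos_scores[:, None] + neg_scores[None, :]) ** 2
    inner = np.mean(np.exp(ell / lam), axis=1)   # inner average over S2
    return float(np.mean(lam * np.log(inner)))   # outer average over S1
```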

AUPRC Maximization According to (Boyd et al., 2013), the area under the precision-recall curve (AUPRC) can be approximated by

\begin{split}\frac{1}{|\mathcal{S}|}\sum\limits_{(\mathbf{z}_{i},y_{i})\in\mathcal{S}}\mathbb{I}(y_{i}=1)\frac{\sum\limits_{(\mathbf{z}_{j},y_{j})\in\mathcal{S}}\mathbb{I}(y_{j}=1)\mathbb{I}(h(\mathbf{w},\mathbf{z}_{i})\geq h(\mathbf{w},\mathbf{z}_{j}))}{\sum\limits_{(\mathbf{z}_{j},y_{j})\in\mathcal{S}}\mathbb{I}(h(\mathbf{w},\mathbf{z}_{i})\geq h(\mathbf{w},\mathbf{z}_{j}))}.\end{split} (18)

Then using a surrogate loss, the AUPRC maximization problem becomes

\begin{split}\min\limits_{\mathbf{w}}-\frac{1}{|\mathcal{S}|}\sum\limits_{(\mathbf{z}_{i},y_{i})\in\mathcal{S}}\mathbb{I}(y_{i}=1)\frac{\sum\limits_{(\mathbf{z}_{j},y_{j})\in\mathcal{S}}\mathbb{I}(y_{j}=1)\tilde{\ell}(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j})}{\sum\limits_{(\mathbf{z}_{j},y_{j})\in\mathcal{S}}\tilde{\ell}(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j})},\end{split} (19)

which is a DXO problem of (1) with \ell(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j})=[\mathbb{I}(y_{j}=1)\tilde{\ell}(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j}),\tilde{\ell}(\mathbf{w},\mathbf{z}_{i},\mathbf{z}_{j})] and f(x_{1},x_{2})=\frac{x_{1}}{x_{2}} (Qi et al., 2021).
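Objective (19) can likewise be sketched directly; the squared surrogate for \tilde{\ell} and the array names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def auprc_surrogate(scores, labels):
    # Objective (19): for each positive anchor z_i, take the ratio of
    # the surrogate loss summed over positive z_j to the loss summed
    # over all z_j, then average (negated) over the whole set S.
    # l~ below is an illustrative squared surrogate.
    pos = labels == 1
    ell = (1.0 - scores[pos][:, None] + scores[None, :]) ** 2  # |pos| x |S|
    num = ell[:, pos].sum(axis=1)   # pairs with y_j = 1
    den = ell.sum(axis=1)           # all pairs (den >= 1: self-pair gives 1)
    return float(-(num / den).sum() / len(scores))
```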

Appendix C Experiments

C.1 Statistics of Data

Statistics of the datasets used are summarized in Table 5.

Table 5: Statistics of the Datasets
# of Training Data # of Validation Data # of Testing Data
Cifar10 24000 10000 10000
Cifar100 24000 10000 10000
CheXpert 190027 1000 202
ChestMNIST 78468 11219 22433

C.2 Running Time Comparison

Running time is reported in Table 6. Each algorithm was run on 16 client machines connected by InfiniBand, where each machine uses an NVIDIA A100 GPU.

Table 6: Running time comparison of federated algorithms on the partial AUC maximization task in Section 4. We report the average number of communication rounds and runtime (in seconds) for each algorithm to converge to a region where, with FPR \leq 0.5, the training pAUC is within 0.01 of its best training pAUC.
Local SGD (CE Loss) CODASCA (Min-Max AUC) Local Pair (OPAUC Loss) FeDXL2 (OPAUC Loss)
Cifar10 157 (664s) 147 (955s) 168 (740s) 160 (819s)
Cifar100 160 (644s) 163 (974s) 162 (688s) 159 (758s)
CheXpert 162 (2465s) 151 (3501s) 175 (2838s) 182 (3246s)
ChestMNIST 172 (1537s) 165 (3176s) 164 (1484s) 171 (1763s)

C.3 Ablation Study.

We present an ablation study to further verify our theory. In particular, we show the benefit of using multiple machines and the lower communication complexity from using K>1 local updates between two communications. To verify the first effect, we fix K and vary N; for the latter, we fix N and vary K. We conduct experiments on the CIFAR-10 data for optimizing the X-risk corresponding to the partial AUC loss, and the results are plotted in Figure 2. The left two figures demonstrate that our algorithm can tolerate a certain value of K for skipping communications without harming the performance, and the right two figures demonstrate the advantage of FL by using FeDXL2, i.e., using data from more sources can dramatically improve the performance.

Figure 2: Ablation study. Left two: fix N and vary K; right two: fix K and vary N.

Appendix D Analysis of FeDXL1 for solving DXO with Linear ff

In this section, we present the analysis of the FeDXL1 algorithm. For \mathbf{z}\in\mathcal{S}_{1}^{i} and \mathbf{z}^{\prime}\in\mathcal{S}_{2}^{j}, we define

\begin{split}&G_{1}(\mathbf{w},\mathbf{z},\mathbf{w}^{\prime},\mathbf{z}^{\prime})=\nabla_{1}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w}^{\prime},\mathbf{z}^{\prime}))^{\top}\nabla h(\mathbf{w},\mathbf{z})\\ &G_{2}(\mathbf{w},\mathbf{z},\mathbf{w}^{\prime},\mathbf{z}^{\prime})=\nabla_{2}\ell(h(\mathbf{w},\mathbf{z}),h(\mathbf{w}^{\prime},\mathbf{z}^{\prime}))^{\top}\nabla h(\mathbf{w}^{\prime},\mathbf{z}^{\prime}).\end{split} (20)

Therefore, the estimator

\displaystyle G^{r}_{i,k,1}=\nabla_{1}\ell(h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}),h^{r-1}_{2,\xi})\nabla h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}),

defined in (4) is equivalent to G_{1}(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\mathbf{z}^{r-1}_{j,t,2}), where h^{r-1}_{2,\xi}=h(\mathbf{w}^{r-1}_{j,t},\mathbf{z}^{r-1}_{j,t,2}) is the score of a randomly sampled data point that was computed in round r-1 on machine j at iteration t. Technically, j and t are associated with i and k, but we omit this dependence when the context is clear to simplify notation.

Similarly, the estimator

\displaystyle G^{r}_{i,k,2}=\nabla_{2}\ell(h^{r-1}_{1,\zeta},h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2}))\nabla h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2}),

defined in (6) is equivalent to G_{2}(\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2}).
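To make the active-passive split concrete, the following sketch evaluates G^{r}_{i,k,1}+G^{r}_{i,k,2} for a toy linear model h(\mathbf{w},\mathbf{z})=\mathbf{w}^{\top}\mathbf{z} and the squared surrogate \ell(a,b)=(1-a+b)^{2}; the passive scores are stale values received from other machines. The model, loss, and all names are illustrative choices, not the paper's implementation.

```python
import numpy as np

def active_passive_estimator(w, z1, z2, h_passive2, h_passive1):
    # Active parts: local samples z1, z2 scored with the current model w.
    # Passive parts: historical scores h_passive1, h_passive2 computed on
    # other machines with older models. For l(a,b) = (1-a+b)^2 we have
    # d_1 l = -2(1-a+b) and d_2 l = 2(1-a+b); grad_w h(w,z) = z.
    a = w @ z1
    G1 = -2.0 * (1.0 - a + h_passive2) * z1   # active grad wrt first arg
    b = w @ z2
    G2 = 2.0 * (1.0 - h_passive1 + b) * z2    # active grad wrt second arg
    return G1 + G2
```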

Proof.

Under Assumption 3.1, it follows that F(\cdot) is L_{F}-smooth with L_{F}:=2(L_{\ell}C_{h}+C_{\ell}L_{h}). Similarly, G_{1},G_{2} are also Lipschitz in \mathbf{w} and \mathbf{w}^{\prime} with some constant L_{1} that depends on C_{h},C_{\ell},L_{\ell},L_{h}. Let \tilde{L}:=\max\{L_{F},L_{1}\}.

Denote \tilde{\eta}=\eta K and suppose \tilde{\eta}\tilde{L}\leq O(1) by a proper setting of \eta and K. Using the \tilde{L}-smoothness of F(\mathbf{w}), we have

\begin{split}&F(\bar{\mathbf{w}}^{r+1})-F(\bar{\mathbf{w}}^{r})\leq\nabla F(\bar{\mathbf{w}}^{r})^{\top}(\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r})+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &=-\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r})^{\top}\left(\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\right)+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &=-\tilde{\eta}(\nabla F(\bar{\mathbf{w}}^{r})-\nabla F(\bar{\mathbf{w}}^{r-1})+\nabla F(\bar{\mathbf{w}}^{r-1}))^{\top}\left(\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\right)+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq\frac{1}{2\tilde{L}}\|\nabla F(\bar{\mathbf{w}}^{r})-\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}+2\tilde{\eta}^{2}\tilde{L}\|\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\|^{2}\\ &~{}~{}~{}-\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r-1})^{\top}\left(\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\right)+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+2\tilde{\eta}^{2}\tilde{L}\|\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\|^{2}-\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r-1})^{\top}\left(\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\right)\\ &~{}~{}~{}+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2},\end{split} (21)

where

\begin{split}&-\mathbb{E}\left[\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r-1})^{\top}\left(\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\right)\right]\\ &=-\mathbb{E}\bigg{[}\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r-1})^{\top}\bigg{(}\frac{1}{NK}\sum_{i}\sum_{k}(G_{1}(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})\\ &~{}~{}~{}-G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j,t,2})-G_{2}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2})\\ &~{}~{}~{}+G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))\bigg{)}\bigg{]}\\ &\leq 4\tilde{\eta}\tilde{L}^{2}\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}\mathbb{E}(\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r-1}\|^{2}+\|\mathbf{w}^{r-1}_{j,t}-\bar{\mathbf{w}}^{r-1}\|^{2}+\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}+\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r-1}\|^{2})\\ &~{}~{}+\frac{\tilde{\eta}}{4}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}-\mathbb{E}\bigg{[}\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r-1})^{\top}\bigg{(}\frac{1}{NK}\sum_{i}\sum_{k}\nabla F_{i}(\bar{\mathbf{w}}^{r-1})\bigg{)}\bigg{]}\\ &\leq 16\tilde{\eta}\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{\eta}\tilde{L}^{2}\frac{1}{NK}\sum_{i}\sum_{k}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+8\tilde{\eta}\tilde{L}^{2}\frac{1}{NK}\sum_{i}\sum_{k}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{i,k}\|^{2}-\frac{\tilde{\eta}}{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2},\end{split} (22)

where the first inequality uses Young's inequality, the Lipschitz property of G_{1},G_{2}, and the fact that the data samples \mathbf{z}^{r}_{i,k,1},\mathbf{z}^{r-1}_{j,t,2},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\mathbf{z}^{r}_{i,k,2} are independent samples given \bar{\mathbf{w}}^{r-1}; therefore

\begin{split}\mathbb{E}[G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2})-\nabla F_{i}(\bar{\mathbf{w}}^{r-1})]=\mathbf{0}.\end{split} (23)

To bound the update of \bar{\mathbf{w}}^{r} after one round, we have

\begin{split}&\mathbb{E}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}=\tilde{\eta}^{2}\mathbb{E}\|\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k,1}+G^{r}_{i,k,2})\|^{2}\\ &=\tilde{\eta}^{2}\mathbb{E}\|\frac{1}{NK}\sum_{i}\sum_{k}(G_{1}(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2}))\|^{2}\\ &\leq 3\tilde{\eta}^{2}\mathbb{E}\bigg{\|}\frac{1}{NK}\sum_{i}\sum_{k}[G_{1}(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})]\\ &~{}~{}~{}-\frac{1}{NK}\sum_{i}\sum_{k}[G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2})]\bigg{\|}^{2}\\ &~{}~{}+3\tilde{\eta}^{2}\mathbb{E}\bigg{\|}\frac{1}{NK}\sum_{i}\sum_{k}[G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2})-\nabla F_{i}(\bar{\mathbf{w}}^{r-1})]\bigg{\|}^{2}\\ &~{}~{}+3\tilde{\eta}^{2}\mathbb{E}\left\|\nabla F(\bar{\mathbf{w}}^{r-1})\right\|^{2}\end{split} (24)

Using the Lipschitz property of G_{1},G_{2}, we continue this inequality as

\begin{split}&\mathbb{E}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq 6\tilde{\eta}^{2}\frac{\tilde{L}^{2}}{NK}\sum_{i}\sum_{k}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+6\tilde{\eta}^{2}\frac{\tilde{L}^{2}}{NK}\sum_{i}\sum_{k}\mathbb{E}\|\mathbf{w}^{r-1}_{i,k}-\bar{\mathbf{w}}^{r-1}\|^{2}+6\tilde{\eta}^{2}\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &~{}~{}+3\tilde{\eta}^{2}\frac{1}{NK}\mathbb{E}\bigg{\|}[G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j,t,2})+G_{2}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2})-\nabla F_{i}(\bar{\mathbf{w}}^{r-1})]\bigg{\|}^{2}+3\tilde{\eta}^{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\\ &\leq 6\tilde{\eta}^{2}\frac{\tilde{L}^{2}}{NK}\sum_{i}\sum_{k}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+6\tilde{\eta}^{2}\frac{\tilde{L}^{2}}{NK}\sum_{i}\sum_{k}\mathbb{E}\|\mathbf{w}^{r-1}_{i,k}-\bar{\mathbf{w}}^{r-1}\|^{2}+6\tilde{\eta}^{2}\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &~{}~{}+3\tilde{\eta}^{2}\frac{\sigma^{2}}{NK}+3\tilde{\eta}^{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}.\end{split}

Thus,

\begin{split}&\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq\frac{1}{R}\sum_{r=1}^{R}\bigg{[}10\tilde{\eta}^{2}\tilde{L}^{2}\frac{1}{NK}\sum_{i}\sum_{k}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+6\tilde{\eta}^{2}\frac{\sigma^{2}}{NK}+6\tilde{\eta}^{2}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\bigg{]}.\end{split} (25)

Using Assumption 3.1, we know that \|G_{1}\|^{2},\|G_{2}\|^{2} are both bounded by C_{\ell}^{2}C_{h}^{2}. Then, the update within one round on one machine can be bounded as

\begin{split}&\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}\leq 2\tilde{\eta}^{2}C_{\ell}^{2}C_{h}^{2}.\end{split} (26)

Recalling (21) and (22), we obtain

\begin{split}\frac{1}{R}\sum\limits_{r=1}^{R}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\leq O\left(\frac{2(F(\bar{\mathbf{w}}^{1})-F_{*})}{\tilde{\eta}R}+\tilde{\eta}^{2}\tilde{L}^{2}C_{\ell}^{2}C_{h}^{2}+\tilde{\eta}\frac{\sigma^{2}}{NK}\right).\end{split} (27)

By setting the parameters as in the theorem, we conclude the proof. Besides, if we set \eta=O(N\epsilon^{2}) and K=O(1/(N\epsilon)), so that \tilde{\eta}=O(\epsilon), then to ensure \frac{1}{R}\sum\limits_{r=1}^{R}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\leq\epsilon^{2}, it takes R=O(\frac{1}{\epsilon^{3}}) communication rounds and a sample complexity of O(\frac{1}{N\epsilon^{4}}) on each machine. ∎
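To spell out the final step, substituting \tilde{\eta}=O(\epsilon) and NK=O(1/\epsilon) into (27) term by term gives

```latex
% With \eta = O(N\epsilon^2) and K = O(1/(N\epsilon)), so \tilde{\eta} = \eta K = O(\epsilon)
% and NK = O(1/\epsilon):
\frac{2(F(\bar{\mathbf{w}}^{1})-F_{*})}{\tilde{\eta}R} = O\Big(\frac{1}{\epsilon R}\Big),
\qquad
\tilde{\eta}^{2}\tilde{L}^{2}C_{\ell}^{2}C_{h}^{2} = O(\epsilon^{2}),
\qquad
\tilde{\eta}\,\frac{\sigma^{2}}{NK} = O(\epsilon)\cdot O(\epsilon)\,\sigma^{2} = O(\epsilon^{2}).
```

The first term is O(\epsilon^{2}) once R=O(1/\epsilon^{3}), and the per-machine sample complexity follows as KR=O(1/(N\epsilon))\cdot O(1/\epsilon^{3})=O(1/(N\epsilon^{4})).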

Appendix E FeDXL2 for Solving DXO with Non-Linear ff

In this section, we define the following notations:

\begin{split}&G_{i,1}(\mathbf{w}_{1},\mathbf{z}_{1},\mathbf{u},\mathbf{w}_{2},\mathbf{z}_{2})=\nabla f(\mathbf{u})\nabla_{1}\ell(h(\mathbf{w}_{1},\mathbf{z}_{1}),h(\mathbf{w}_{2},\mathbf{z}_{2}))\nabla h(\mathbf{w}_{1},\mathbf{z}_{1}),\\ &G_{i,2}(\mathbf{w}_{1},\mathbf{z}_{1},\mathbf{u},\mathbf{w}_{2},\mathbf{z}_{2})=\nabla f(\mathbf{u})\nabla_{2}\ell(h(\mathbf{w}_{1},\mathbf{z}_{1}),h(\mathbf{w}_{2},\mathbf{z}_{2}))\nabla h(\mathbf{w}_{2},\mathbf{z}_{2}).\end{split} (28)

Based on Assumption 3.3, it follows that G_{i,1},G_{i,2} are Lipschitz with some constant modulus L_{1}, \|G_{i,1}\|^{2},\|G_{i,2}\|^{2} are bounded by C_{f}^{2}C_{\ell}^{2}C_{h}^{2}, and F is L_{F}-smooth, where L_{1},L_{F} are proper constants that depend on Assumption 3.3. We denote \tilde{L}=\max\{L_{1},L_{F}\} to simplify notation.

For \mathbf{z}_{1}\in\mathcal{S}_{1}^{i},\mathbf{z}_{2}\in\mathcal{S}_{2}^{j}, define g(\mathbf{w}_{1},\mathbf{z}_{1},\mathbf{w}_{2},\mathbf{z}_{2})=\ell(h(\mathbf{w}_{1};\mathbf{z}_{1}),h(\mathbf{w}_{2},\mathbf{z}_{2})), and for \mathbf{z}_{1}\in\mathcal{S}_{1}^{i}, we define

\begin{split}g(\mathbf{w}_{1},\mathbf{z}_{1},\mathbf{w}_{2},\mathcal{S}_{2})=\frac{1}{N}\sum\limits_{j=1}^{N}\mathbb{E}_{\mathbf{z}^{\prime}\sim\mathcal{S}_{2}^{j}}\ell(h(\mathbf{w}_{1};\mathbf{z}_{1}),h(\mathbf{w}_{2},\mathbf{z}^{\prime})).\end{split} (29)

It follows that g is also \tilde{L}-Lipschitz in \mathbf{w}_{1} and \mathbf{w}_{2}.

E.1 Analysis of the Moving Average Estimator \mathbf{u}

Lemma E.1.

Under Assumption 3.3, the moving average estimator \mathbf{u} satisfies

\begin{split}&\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq(1-\frac{\gamma}{16|\mathcal{S}_{1}^{i}|})\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}[\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k-1},\mathbf{z},\bar{\mathbf{w}}^{r}_{k-1},\mathcal{S}_{2})\|^{2}\\ &~{}~{}+\frac{20|\mathcal{S}_{1}^{i}|}{\gamma}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}]+8\frac{\gamma^{2}}{|\mathcal{S}_{1}^{i}|}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}}{|\mathcal{S}_{1}^{i}|}\\ &~{}~{}+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &~{}~{}+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{i,k}\|^{2}.\end{split}
Proof.

By the update rule of $\mathbf{u}$, we have

𝐮i,kr(𝐳)={𝐮i,k1r(𝐳)γ(𝐮i,k1r(𝐳)(h(𝐰i,kr,𝐳i,k,1r),h(𝐰j,tr1,𝐳^j,t,2r1)))𝐳=𝐳i,k,1r𝐮i,k1r(𝐳)𝐳𝐳i,k,1r.\begin{split}\mathbf{u}_{i,k}^{r}(\mathbf{z})=\left\{\begin{array}[]{cc}\mathbf{u}_{i,k-1}^{r}(\mathbf{z})-\gamma(\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-\ell(h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}),h(\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})))&\mathbf{z}=\mathbf{z}^{r}_{i,k,1}\\ \mathbf{u}_{i,k-1}^{r}(\mathbf{z})&\mathbf{z}\neq\mathbf{z}^{r}_{i,k,1}.\end{array}\right.\end{split} (30)

Or equivalently,

𝐮i,kr(𝐳)={𝐮i,k1r(𝐳)γ(𝐮i,k1r(𝐳)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1))𝐳=𝐳i,k,1r𝐮i,k1r(𝐳)𝐳𝐳i,k,1r\begin{split}\mathbf{u}_{i,k}^{r}(\mathbf{z})=\left\{\begin{array}[]{cc}\mathbf{u}_{i,k-1}^{r}(\mathbf{z})-\gamma(\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}))&\mathbf{z}=\mathbf{z}^{r}_{i,k,1}\\ \mathbf{u}_{i,k-1}^{r}(\mathbf{z})&\mathbf{z}\neq\mathbf{z}^{r}_{i,k,1}\end{array}\right.\end{split} (31)
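The per-sample nature of this update can be sketched in a few lines of Python (a minimal illustration with hypothetical names such as `update_u` and `pairwise_loss_value`, not the authors' implementation): only the entry of $\mathbf{u}$ indexed by the sampled point $\mathbf{z}^{r}_{i,k,1}$ is moved toward the freshly computed pairwise loss, while all other entries are carried over unchanged.

```python
def update_u(u, z_sampled, pairwise_loss_value, gamma):
    """Sketch of Eqs. (30)-(31): u maps each local sample z to its running
    estimate of the inner average; pairwise_loss_value stands in for
    ell(h(w_{i,k}; z), h(w_{j,t}^{r-1}, z_hat)) received for the sampled z."""
    u = dict(u)  # keep u_{i,k-1} intact; return u_{i,k}
    u[z_sampled] = u[z_sampled] - gamma * (u[z_sampled] - pairwise_loss_value)
    return u

u = {"z0": 0.0, "z1": 0.0}
u = update_u(u, "z0", 1.0, gamma=0.2)
# Only the sampled entry moves; the other entry persists unchanged.
```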

Define 𝐮¯kr=(𝐮1,kr,𝐮2,kr,,𝐮N,kr)\bar{\mathbf{u}}^{r}_{k}=(\mathbf{u}^{r}_{1,k},\mathbf{u}^{r}_{2,k},...,\mathbf{u}^{r}_{N,k}), 𝐰¯kr=1Ni=1N𝐰i,kr\bar{\mathbf{w}}^{r}_{k}=\frac{1}{N}\sum\limits_{i=1}^{N}\mathbf{w}^{r}_{i,k}. Then it follows that

12Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2=1Ni1|𝒮1i|𝐳|𝒮1i|𝔼[12𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2),𝐮i,kr(𝐳)𝐮i,k1r(𝐳)+12𝐮i,kr(𝐳)𝐮i,k1r(𝐳)2],\begin{split}&\frac{1}{2N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &=\frac{1}{N}\sum_{i}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\bigg{[}\frac{1}{2}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}~{}~{}~{}+\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z})\rangle+\frac{1}{2}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z})\|^{2}\bigg{]},\end{split} (32)

which can be rewritten as

12Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2=12Ni1|𝒮i|𝐳𝒮1i𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+1Ni1|𝒮1i|𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)+1Ni12|𝒮1i|𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2=12Ni1|𝒮i|𝐳𝒮1i𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+1Ni1|𝒮1i|𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)+1Ni1|𝒮1i|𝔼g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)+1Ni12|𝒮i|𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2,\begin{split}&\frac{1}{2N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &=\frac{1}{2N}\sum_{i}\frac{1}{|\mathcal{S}_{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}+\frac{1}{N}\sum_{i}\frac{1}{|\mathcal{S}_{1}^{i}|}\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &~{}~{}~{}+\frac{1}{N}\sum_{i}\frac{1}{2|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &=\frac{1}{2N}\sum_{i}\frac{1}{|\mathcal{S}_{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ 
&~{}~{}~{}+\frac{1}{N}\sum_{i}\frac{1}{|\mathcal{S}_{1}^{i}|}\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &~{}~{}~{}+\frac{1}{N}\sum_{i}\frac{1}{|\mathcal{S}_{1}^{i}|}\mathbb{E}\langle g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &~{}~{}~{}+\frac{1}{N}\sum_{i}\frac{1}{2|\mathcal{S}_{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2},\end{split} (33)

where

𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)=𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)=𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+1γ𝐮i,k1r(𝐳i,k,1r)𝐮i,kr(𝐳i,k,1r),𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)=𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+12γ(𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2)\begin{split}&\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &=\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &~{}~{}~{}+\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\rangle\\ &=\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ 
&~{}~{}~{}+\frac{1}{\gamma}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\rangle\\ &=\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &~{}~{}~{}+\frac{1}{2\gamma}(\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2})\end{split} (34)

If γ15\gamma\leq\frac{1}{5}, we have

12(1γ1γ+14γ)𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2+𝔼g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)14γ𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2+γ𝔼g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+14γ𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2γ𝔼g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)24γ𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)2+4γL~2𝔼𝐰¯r𝐰¯r12+4γL~2𝔼𝐰i,kr𝐰¯r2+4γL~2𝔼𝐰j,tr1𝐰¯r124γσ2+4γL~2𝔼𝐰¯r𝐰¯r12+4γL~2𝔼𝐰i,kr𝐰¯r2+4γL~2𝔼𝐰j,tr1𝐰¯r12.\begin{split}&-\frac{1}{2}\left(\frac{1}{\gamma}-1-\frac{\gamma+1}{4\gamma}\right)\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &~{}~{}~{}+\mathbb{E}\langle g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq-\frac{1}{4\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}+\gamma\mathbb{E}\|g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}+\frac{1}{4\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq\gamma\mathbb{E}\|g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq 
4\gamma\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &~{}~{}~{}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j,t}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &\leq 4\gamma\sigma^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j,t}-\bar{\mathbf{w}}^{r-1}\|^{2}.\end{split} (35)

Then, we have

12Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)212Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+1Ni1|𝒮1i|[12γ𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)212γ𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2γ+18γ𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2+4γσ2+4γL~2𝔼𝐰¯r𝐰¯r12+4γL~2𝔼𝐰i,kr𝐰¯r2+4γL~2𝔼𝐰j,tr1𝐰¯r12+𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)].\begin{split}&\frac{1}{2N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{1}{2N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &+\frac{1}{N}\sum_{i}\frac{1}{|\mathcal{S}_{1}^{i}|}\Bigg{[}\frac{1}{2\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &-\frac{1}{2\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\!-\!\frac{\gamma+1}{8\gamma}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\!+\!4\gamma\sigma^{2}\\ &+4\gamma\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j,t}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ 
&+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\Bigg{]}.\end{split} (36)

Note that $\sum_{\mathbf{z}\neq\mathbf{z}^{r}_{i,k,1}}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}=\sum_{\mathbf{z}\neq\mathbf{z}^{r}_{i,k,1}}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}$, since entries of $\mathbf{u}$ other than the sampled one are unchanged at step $k$, which implies

12γ(𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2)=12γ𝐳𝒮1i(𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2).\begin{split}&\frac{1}{2\gamma}\left(\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\right)\\ &=\frac{1}{2\gamma}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\left(\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\right).\end{split} (37)

Since $\ell(\cdot)\leq C_{0}$, we have that $\|g(\cdot)\|^{2}\leq C_{0}^{2}$, $\|\mathbf{u}^{r}_{i,k}(\mathbf{z})\|^{2}\leq C_{0}^{2}$, and

$\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,0}(\mathbf{z})\|^{2}\leq\beta^{2}K^{2}C_{0}^{2}.$

Besides, we have

𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)=𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)+𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)+2L~2𝔼𝐰¯r𝐰¯r12+2L~2𝔼𝐰¯r𝐰i,kr2+L~2𝔼𝐰¯r1𝐰j,tr12+14𝔼g(𝐰¯r,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)22γC02+1γ𝐰¯kr𝐰¯r12+𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)+2L~2𝔼𝐰¯r𝐰¯r12+2L~2𝔼𝐰¯r𝐰i,kr2+L~2𝔼𝐰¯r1𝐰j,tr12+14𝔼g(𝐰¯r,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)2,\begin{split}&\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &=\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+\mathbb{E}\langle g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ 
&\leq\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})\rangle\\ &+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{j,t}\|^{2}\\ &+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq 2\gamma C_{0}^{2}+\frac{1}{\gamma}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{j,t}\|^{2}\\ &+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2},\end{split} (38)

where

𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)=𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)+𝐮i,0r1(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)+𝐮i,0r1(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)+𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r),𝐮i,0r1(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)+𝔼𝐮i,0r1(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)+𝔼𝐮i,0r1(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),𝐮i,0r1(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)4𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)2+14𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2+14𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2+4𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)24𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)212𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2+8β2K2C02.\begin{split}&\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &=\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})+\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})+\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\rangle\\ 
&+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1}),\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+\mathbb{E}\langle\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+\mathbb{E}\langle\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq 4\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}+4\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq 4\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}-\frac{1}{2}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}+8\beta^{2}K^{2}C_{0}^{2}.\end{split} (39)

Noting that

𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2=𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)+𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)2=𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)2𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)2+2𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r),𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)12𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)2+8𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)212𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)2+8β2K2C0214𝔼g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)2+12L~2𝐰¯r1𝐰¯kr2+8β2K2C02\begin{split}&-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &=-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})+\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &=-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}-\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &~{}~{}~{}+2\mathbb{E}\langle g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1}),\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq-\frac{1}{2}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}+8\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq-\frac{1}{2}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}+8\beta^{2}K^{2}C_{0}^{2}\\ 
&\leq-\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}+\frac{1}{2}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+8\beta^{2}K^{2}C_{0}^{2}\end{split} (40)

Then, multiplying every term by $\gamma$ and rearranging terms under the setting $\gamma\leq O(1)$, we obtain

γ+121Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2γ(118|𝒮1i|)+121Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+4γ2|𝒮1i|(σ2+C02)+8γβ2K2C02|𝒮1i|+4L~2𝔼𝐰¯r𝐰¯r12+4L~2𝔼𝐰¯r𝐰¯kr2+4(γ2+γ|𝒮1i|)L~21Ni𝔼𝐰¯r𝐰i,kr2+(γ2+γ|𝒮1i|)L~21NKi=1Nk=1K𝔼𝐰¯r1𝐰i,kr12.\begin{split}&\frac{\gamma+1}{2}\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{\gamma(1-\frac{1}{8|\mathcal{S}_{1}^{i}|})+1}{2}\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &+\frac{4\gamma^{2}}{|\mathcal{S}_{1}^{i}|}(\sigma^{2}+C_{0}^{2})+\frac{8\gamma\beta^{2}K^{2}C_{0}^{2}}{|\mathcal{S}_{1}^{i}|}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &+4(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{i,k}\|^{2}.\end{split} (41)

Dividing both sides by $\frac{\gamma+1}{2}$ gives

1Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2γ(118|𝒮1i|)+1γ+11Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+8γ2|𝒮1i|(σ2+C02)+16γβ2K2C02|𝒮1i|+8L~2𝐰¯r𝐰¯r12+8L~2𝐰¯r𝐰¯kr2+8(γ2+γ|𝒮1i|)L~21Ni𝐰¯r𝐰i,kr2+2(γ2+γ|𝒮1i|)L~21NKi=1Nk=1K𝔼𝐰¯r1𝐰¯i,kr12.\begin{split}&\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{\gamma(1-\frac{1}{8|\mathcal{S}_{1}^{i}|})+1}{\gamma+1}\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &+8\frac{\gamma^{2}}{|\mathcal{S}_{1}^{i}|}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}}{|\mathcal{S}_{1}^{i}|}+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}.\end{split} (42)

Using Young’s inequality, we obtain

1Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2(1γ8|𝒮1i|)1Ni=1N1|𝒮1i|𝐳|𝒮1i|[(1+γ16|𝒮1i|)𝔼𝐮i,k1r(𝐳)g(𝐰¯k1r,𝐳,𝐰¯k1r,𝒮2)2+(1+16|𝒮1i|γ)L~2𝐰¯k1r𝐰¯kr2]+8γ2|𝒮1i|(σ2+C02)+16γβ2K2C02|𝒮1i|+8L~2𝐰¯r𝐰¯r12+8L~2𝐰¯r𝐰¯kr2+8(γ2+γ|𝒮1i|)L~21Ni𝐰¯r𝐰i,kr2+2(γ2+γ|𝒮1i|)L~21NKi=1Nk=1K𝔼𝐰¯r1𝐰¯i,kr12(1γ16|𝒮1i|)1Ni=1N1|𝒮1i|𝐳|𝒮1i|[𝔼𝐮i,k1r(𝐳)g(𝐰¯k1r,𝐳,𝐰¯k1r,𝒮2)2+20|𝒮1i|γL~2𝐰¯k1r𝐰¯kr2]+8γ2|𝒮1i|(σ2+C02)+16γβ2K2C02|𝒮1i|+8L~2𝐰¯r𝐰¯r12+8L~2𝐰¯r𝐰¯kr2+8(γ2+γ|𝒮1i|)L~21Ni𝐰¯r𝐰i,kr2+2(γ2+γ|𝒮1i|)L~21NKi=1Nk=1K𝔼𝐰¯r1𝐰¯i,kr12.\begin{split}&\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq(1-\frac{\gamma}{8|\mathcal{S}_{1}^{i}|})\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\bigg{[}(1+\frac{\gamma}{16|\mathcal{S}_{1}^{i}|})\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k-1},\mathbf{z},\bar{\mathbf{w}}^{r}_{k-1},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}+(1+\frac{16|\mathcal{S}_{1}^{i}|}{\gamma})\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\bigg{]}\\ &+8\frac{\gamma^{2}}{|\mathcal{S}_{1}^{i}|}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}}{|\mathcal{S}_{1}^{i}|}+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}\\ 
&\leq(1-\frac{\gamma}{16|\mathcal{S}_{1}^{i}|})\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}[\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k-1},\mathbf{z},\bar{\mathbf{w}}^{r}_{k-1},\mathcal{S}_{2})\|^{2}\\ &+\frac{20|\mathcal{S}_{1}^{i}|}{\gamma}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}]+8\frac{\gamma^{2}}{|\mathcal{S}_{1}^{i}|}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}}{|\mathcal{S}_{1}^{i}|}\\ &+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}.\end{split}
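As a numerical sanity check (separate from the proof, with `s` standing in for $|\mathcal{S}_1^i|$), one can verify that the contraction factor $\frac{\gamma(1-\frac{1}{8|\mathcal{S}_1^i|})+1}{\gamma+1}$ obtained after dividing by $\frac{\gamma+1}{2}$ is indeed dominated by the factor $1-\frac{\gamma}{16|\mathcal{S}_1^i|}$ in the final bound whenever $\gamma\leq 1$, which covers the setting $\gamma\leq\frac{1}{5}$ used above:

```python
# Check (gamma*(1 - 1/(8s)) + 1)/(gamma + 1) <= 1 - gamma/(16s) for gamma <= 1.
# Equality holds at gamma = 1; the inequality is strict for gamma < 1.
for s in [1, 2, 8, 64, 1024]:                     # stand-ins for |S_1^i|
    for g_ in [x / 100 for x in range(1, 101)]:   # gamma in (0, 1]
        lhs = (g_ * (1 - 1 / (8 * s)) + 1) / (g_ + 1)
        rhs = 1 - g_ / (16 * s)
        assert lhs <= rhs + 1e-15
```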

E.2 Analysis of the estimator of gradient

With the update $G^{r}_{i,k}=(1-\beta)G^{r}_{i,k-1}+\beta(G^{r}_{i,k,1}+G^{r}_{i,k,2})$, we define $\bar{G}^{r}_{k}:=\frac{1}{N}\sum\limits_{i=1}^{N}G^{r}_{i,k}$ and $\Delta^{r}_{k}:=\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}$. It then follows that $\bar{G}^{r}_{k}=(1-\beta)\bar{G}^{r}_{k-1}+\beta\frac{1}{N}\sum_{i}(G^{r}_{i,k,1}+G^{r}_{i,k,2})$.
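A minimal sketch (with assumed list-based vectors and the hypothetical helpers `update_G` and `average`; not the authors' implementation) illustrates that averaging over machines commutes with this linear recursion, which is why $\bar{G}^{r}_{k}$ obeys the same moving-average form as the per-machine estimators:

```python
def update_G(G_prev, G1, G2, beta):
    """Per-machine recursion G_{i,k} = (1-beta) G_{i,k-1} + beta (G_{i,k,1} + G_{i,k,2})."""
    return [(1 - beta) * g0 + beta * (a + b) for g0, a, b in zip(G_prev, G1, G2)]

def average(vectors):
    """Coordinate-wise average over machines."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

beta = 0.1
# Two machines (N = 2), 3-dimensional gradients, all-zero previous estimates.
G_prev = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
G1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
G2 = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
G_new = [update_G(p, a, b, beta) for p, a, b in zip(G_prev, G1, G2)]
G_bar = average(G_new)
# Averaging commutes with the linear update, so G_bar equals
# (1-beta) * average(G_prev) + beta * average of (G1 + G2) across machines.
```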

Lemma E.2.

Under Assumption 3.3, with the setting $\eta=O(\beta)$, Algorithm 2 ensures that

Δkr(13β4)G¯k1rF(𝐰¯k1r)2+η16F(𝐰¯r1)2+2β2σ2N+36η2βL~2𝐰¯kr𝐰¯r12+12β(1Ni4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1Ni4L~2𝔼𝐰j,tr1𝐰¯r12)+12β1Ni(L~2𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+L~2𝔼𝐮j,tr1(𝐳^j,t,1r1)g(𝐰¯tr1,𝐳^j,t,1r1,𝐰¯tr1,𝒮2)2).\begin{split}&\Delta^{r}_{k}\leq(1-\frac{3\beta}{4})\|\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1})\|^{2}+\frac{\eta}{16}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}+\frac{2\beta^{2}\sigma^{2}}{N}+36\frac{\eta^{2}}{\beta}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &+12\beta\left(\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+12\beta\frac{1}{N}\sum_{i}\bigg{(}\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}+\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2})\|^{2}\bigg{)}.\end{split}
Proof.
Δkr=G¯krF(𝐰¯kr)2=(1β)G¯k1r+β1Ni(Gi,k,1r+Gi,k,2r)F(𝐰¯kr)2=(1β)(G¯k1rF(𝐰¯k1r))+(1β)(F(𝐰¯k1r)F(𝐰¯kr))+β(1Ni(G1(𝐰i,kr,𝐳i,k,1r,𝐮i,kr(𝐳i,k,1r),𝐰j,tr1,𝐳^j,t,2r1)+G2(𝐰j,tr1,𝐳^j,t,1r1,𝐮j,tr1(𝐳^j,t,1r1),𝐰i,kr,𝐳i,k,2r))1Ni(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r)))+β(1Ni(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯kr))2.\begin{split}&\Delta^{r}_{k}=\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\\ &=\|(1-\beta)\bar{G}^{r}_{k-1}+\beta\frac{1}{N}\sum_{i}(G^{r}_{i,k,1}+G^{r}_{i,k,2})-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\\ &=\bigg{\|}(1-\beta)(\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1}))+(1-\beta)(\nabla F(\bar{\mathbf{w}}^{r}_{k-1})-\nabla F(\bar{\mathbf{w}}^{r}_{k}))\\ &~{}~{}~{}+\beta\bigg{(}\frac{1}{N}\sum_{i}(G_{1}(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1}),\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})+G_{2}(\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}),\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2}))\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\frac{1}{N}\sum_{i}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))\bigg{)}\\ 
&+\beta\bigg{(}\frac{1}{N}\sum_{i}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r}_{k})\bigg{)}\bigg{\|}^{2}.\end{split} (43)

Using Young’s inequality and the L~\tilde{L}-Lipschitzness of G1,G2G_{1},G_{2}, we can then derive

Δkr(1+β)(1β)(G¯k1rF(𝐰¯k1r))+β(1Ni(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯r1))2+(1+10β)β2(1Ni4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1Ni4L~2𝔼𝐰j,tr1𝐰¯r12)+(1+10β)𝐰¯k1r𝐰¯kr2+(1+10β)β21Ni(L~2𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+L~2𝔼𝐮j,tr1(𝐳^j,t,1r1)g(𝐰¯tr1,𝐳^j,t,1r1,𝐰¯tr1,𝒮2)2).\begin{split}&\Delta^{r}_{k}\leq(1+\beta)\Bigg{\|}(1-\beta)(\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1}))\\ &+\beta\bigg{(}\frac{1}{N}\sum_{i}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},{\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r-1})\bigg{)}\Bigg{\|}^{2}\\ &+(1+\frac{10}{\beta})\beta^{2}\left(\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+(1+\frac{10}{\beta})\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+(1+\frac{10}{\beta})\beta^{2}\frac{1}{N}\sum_{i}\bigg{(}\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ 
&~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2})\|^{2}\bigg{)}.\end{split} (44)

By the fact that

𝔼[1Ni(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯r1)]=0,\begin{split}&\mathbb{E}[\frac{1}{N}\sum_{i}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},{\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r-1})]=0,\end{split} (45)
𝔼1Ni(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯r1)2σ2N,\begin{split}&\mathbb{E}\|\frac{1}{N}\sum_{i}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},{\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\leq\frac{\sigma^{2}}{N},\end{split} (46)

and

𝐰¯k1r𝐰¯kr2=η2G¯kr23η2G¯krF(𝐰¯kr)2+3η2F(𝐰¯kr)F(𝐰¯r1)2+3η2F(𝐰¯r1)2,\begin{split}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}=\eta^{2}\|\bar{G}^{r}_{k}\|^{2}\leq 3\eta^{2}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}+3\eta^{2}\|\nabla F(\bar{\mathbf{w}}^{r}_{k})-\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}+3\eta^{2}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2},\end{split} (47)

we obtain

Δkr(13β4)G¯k1rF(𝐰¯k1r)2+η16F(𝐰¯r1)2+36η2βL~2𝐰¯kr𝐰¯r12+2β2σ2N+12β(1Ni4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1Ni4L~2𝔼𝐰j,tr1𝐰¯r12)+12β1Ni(L~2𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+L~2𝔼𝐮j,tr1(𝐳^j,t,1r1)g(𝐰¯tr1,𝐳^j,t,1r1,𝐰¯tr1,𝒮2)2).\begin{split}&\Delta^{r}_{k}\leq(1-\frac{3\beta}{4})\|\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1})\|^{2}+\frac{\eta}{16}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}+36\frac{\eta^{2}}{\beta}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &+\frac{2\beta^{2}\sigma^{2}}{N}+12\beta\left(\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+12\beta\frac{1}{N}\sum_{i}\bigg{(}\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}+\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2})\|^{2}\bigg{)}.\end{split}

E.3 Analysis of Theorem 3.4

Proof.

By the update rules,

𝐰¯r𝐰i,kr2η2K2Cf2C2Cg2,\begin{split}&\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}\leq\eta^{2}K^{2}C_{f}^{2}C_{\ell}^{2}C_{g}^{2},\end{split} (48)

and

𝐰¯kr𝐰¯r2=η~21NKi=1Nm=1kG¯mr2η~21Km=1KG¯mrF(𝐰¯mr)+F(𝐰¯mr)2.\begin{split}&\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r}\|^{2}=\tilde{\eta}^{2}\|\frac{1}{NK}\sum_{i=1}^{N}\sum\limits_{m=1}^{k}\bar{G}^{r}_{m}\|^{2}\leq\tilde{\eta}^{2}\frac{1}{K}\sum_{m=1}^{K}\|\bar{G}^{r}_{m}-\nabla F(\bar{\mathbf{w}}^{r}_{m})+\nabla F(\bar{\mathbf{w}}^{r}_{m})\|^{2}.\end{split} (49)

Similarly, we also have

𝐰¯r1𝐰¯r2=η~21NKi=1Nk=1KG¯kr12η~21Kk=1KG¯kr1F(𝐰¯kr1)+F(𝐰¯kr1)2.\begin{split}&\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r}\|^{2}=\tilde{\eta}^{2}\|\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}\bar{G}^{r-1}_{k}\|^{2}\leq\tilde{\eta}^{2}\frac{1}{K}\sum_{k=1}^{K}\|\bar{G}^{r-1}_{k}-\nabla F(\bar{\mathbf{w}}^{r-1}_{k})+\nabla F(\bar{\mathbf{w}}^{r-1}_{k})\|^{2}.\end{split} (50)

Lemma E.2 gives that

1RKr,k𝔼G¯krF(𝐰¯kr)2Δ00βRK+18RrF(𝐰¯r1)2+1RKr,k50η2βL~2𝐰¯kr𝐰¯r12+2βσ2N+18(1Ni4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1Ni4L~2𝔼𝐰j,tr1𝐰¯r12)+181Rr1NKi,k1|𝒮1i|𝐳𝒮1i𝔼𝐮i,kr(𝐳)g(𝐰¯r,𝐳,𝐰¯r,𝒮2)2+181Rr1NKj,t1|𝒮1i|𝐳𝒮1i𝐮j,tr1(𝐳)g(𝐰¯tr1,𝐳,𝐰¯tr1,𝒮2))2,\begin{split}&\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\leq\frac{\Delta^{0}_{0}}{\beta RK}+\frac{1}{8R}\sum_{r}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}+\frac{1}{RK}\sum_{r,k}50\frac{\eta^{2}}{\beta}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{2\beta\sigma^{2}}{N}\\ &+18\left(\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+18\frac{1}{R}\sum_{r}\frac{1}{NK}\sum_{i,k}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r},\mathbf{z},\bar{\mathbf{w}}^{r},\mathcal{S}_{2})\|^{2}\\ &~{}~{}+18\frac{1}{R}\sum_{r}\frac{1}{NK}\sum_{j^{\prime},t^{\prime}}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\mathbf{z})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathbf{z},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2}))\|^{2},\end{split} (51)

which, by the settings of η\eta and β\beta, leads to

1RKr,k𝔼G¯krF(𝐰¯kr)22Δ00βRK+8βσ2N+14RrF(𝐰¯r1)2+10η~2C2Cg2+161Rr1NKi,k1|𝒮1i|𝐳𝒮1i𝔼𝐮i,kr(𝐳)g(𝐰¯r;𝐳,𝒮2)2+321Rr1NKj,t1|𝒮1i|𝐳𝒮1i𝐮j,tr1(𝐳^j,t,1r1)g(𝐰¯r1;𝐳^j,t,1r1,𝒮2))2+32Cg21Rr1Kt𝐰¯r1𝐰¯tr12.\begin{split}&\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\leq\frac{2\Delta^{0}_{0}}{\beta RK}+\frac{8\beta\sigma^{2}}{N}+\frac{1}{4R}\sum_{r}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}+10\tilde{\eta}^{2}C_{\ell}^{2}C_{g}^{2}\\ &+16\frac{1}{R}\sum_{r}\frac{1}{NK}\sum_{i,k}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r};\mathbf{z},\mathcal{S}_{2})\|^{2}\\ &+32\frac{1}{R}\sum_{r}\frac{1}{NK}\sum_{j^{\prime},t^{\prime}}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1};\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\mathcal{S}_{2}))\|^{2}\\ &+32C_{g}^{2}\frac{1}{R}\sum_{r}\frac{1}{K}\sum_{t^{\prime}}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{t^{\prime}}\|^{2}.\end{split}

Using Lemma E.1 yields

1Rr1NKi=1Nk=1K1|𝒮1i|𝐳𝒮1i𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)216Mγ1R1NKi=1N1|𝒮1i|𝐳𝒮1i𝔼𝐮i,00(𝐳)g(𝐰¯00,𝐳,𝐰¯00,𝒮2)2+400M2γ21RKr,kL~2𝐰¯k1r𝐰¯kr2+150γ(σ2+C02)+256β2K2C02+128L~2|𝒮1i|γ(𝐰¯r𝐰¯r12+𝐰¯r𝐰¯r12)+150(γ|𝒮1i|+1)L~21Ni𝐰¯r𝐰i,kr2+32(γ|𝒮1i|+1)L~21NKi=1Nk=1K𝔼𝐰¯r1𝐰¯i,kr12.\begin{split}&\frac{1}{R}\sum_{r}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{16M}{\gamma}\frac{1}{R}\frac{1}{NK}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{0}_{i,0}(\mathbf{z})-g(\bar{\mathbf{w}}^{0}_{0},\mathbf{z},\bar{\mathbf{w}}^{0}_{0},\mathcal{S}_{2})\|^{2}\\ &+\frac{400M^{2}}{\gamma^{2}}\frac{1}{RK}\sum_{r,k}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+150\gamma(\sigma^{2}+C_{0}^{2})+256\beta^{2}K^{2}C_{0}^{2}\\ &+128\tilde{L}^{2}\frac{|\mathcal{S}_{1}^{i}|}{\gamma}(\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2})\\ &+150(\gamma|\mathcal{S}_{1}^{i}|+1)\tilde{L}^{2}\frac{1}{N}\sum_{i}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+32(\gamma|\mathcal{S}_{1}^{i}|+1)\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}.\end{split}

Combining this with the previous five inequalities and noting the parameter settings, we obtain

1Rr1NKi=1Nk=1K1|𝒮1i|𝐳𝒮1i𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2O(MγRK+η2M2γ21RKr,k𝔼G¯krF(𝐰¯kr)2+γ+β2K2+Mγη~2(1βRK+βN)+γMη2K2+14RrF(𝐰¯r1)2)\begin{split}&\frac{1}{R}\sum_{r}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq O\bigg{(}\frac{M}{\gamma RK}+\eta^{2}\frac{M^{2}}{\gamma^{2}}\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}+\gamma+\beta^{2}K^{2}+\frac{M}{\gamma}\tilde{\eta}^{2}(\frac{1}{\beta RK}+\frac{\beta}{N})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}+\gamma M\eta^{2}K^{2}+\frac{1}{4R}\sum_{r}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\bigg{)}\end{split}

and

1RKr,k𝔼G¯krF(𝐰¯kr)2O(MγRK+γ+β2K2+Mγη~2(1βRK+βN)+γMη2K2+14RrF(𝐰¯r1)2).\begin{split}&\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\\ &\leq O\left(\frac{M}{\gamma RK}+\gamma+\beta^{2}K^{2}+\frac{M}{\gamma}\tilde{\eta}^{2}(\frac{1}{\beta RK}+\frac{\beta}{N})+\gamma M\eta^{2}K^{2}+\frac{1}{4R}\sum_{r}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\right).\end{split} (52)

Then, using the standard analysis for smooth functions, we derive

F(𝐰¯r+1)F(𝐰¯r)F(𝐰¯r)(𝐰¯r+1𝐰¯r)+L~2𝐰¯r+1𝐰¯r2=η~F(𝐰¯r)(1NKikGi,krF(𝐰¯r)+F(𝐰¯r))+L~2𝐰¯r+1𝐰¯r2=η~F(𝐰¯r)2+η~2F(𝐰¯r)2+η~21NKikGi,krF(𝐰¯r)2+L~2𝐰¯r+1𝐰¯r2η~2F(𝐰¯r)2+η~1NKik(Gi,krF(𝐰¯kr))2+η~1Kk(F(𝐰¯kr)F(𝐰¯r))2+L~2𝐰¯r+1𝐰¯r2η~2F(𝐰¯r)2+η~1Kk1Ni(Gi,krF(𝐰¯kr))2+η~L~2Kk𝐰¯kr𝐰¯r2+L~2𝐰¯r+1𝐰¯r2.\begin{split}&F(\bar{\mathbf{w}}^{r+1})-F(\bar{\mathbf{w}}^{r})\leq\nabla F(\bar{\mathbf{w}}^{r})^{\top}(\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r})+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &=-\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r})^{\top}\left(\frac{1}{NK}\sum_{i}\sum_{k}G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r})+\nabla F(\bar{\mathbf{w}}^{r})\right)+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &=-\tilde{\eta}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\frac{\tilde{\eta}}{2}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\frac{\tilde{\eta}}{2}\|\frac{1}{NK}\sum_{i}\sum_{k}G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\\ &~{}~{}~{}+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq-\frac{\tilde{\eta}}{2}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\tilde{\eta}\|\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r}_{k}))\|^{2}\\ &~{}~{}~{}+\tilde{\eta}\|\frac{1}{K}\sum_{k}(\nabla F(\bar{\mathbf{w}}^{r}_{k})-\nabla F(\bar{\mathbf{w}}^{r}))\|^{2}+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq-\frac{\tilde{\eta}}{2}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\tilde{\eta}\frac{1}{K}\sum_{k}\|\frac{1}{N}\sum_{i}(G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r}_{k}))\|^{2}\\ &~{}~{}~{}+\tilde{\eta}\frac{\tilde{L}^{2}}{K}\sum_{k}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r}\|^{2}+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}.\end{split} (53)

Combining with (52), (48), (49), and (50), we derive

1Rr𝔼F(𝐰¯r)2O(MγRK+γ+β2K2+Mγη~2(1βRK+βN)+γMη2K2).\begin{split}&\frac{1}{R}\sum_{r}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\leq O\left(\frac{M}{\gamma RK}+\gamma+\beta^{2}K^{2}+\frac{M}{\gamma}\tilde{\eta}^{2}(\frac{1}{\beta RK}+\frac{\beta}{N})+\gamma M\eta^{2}K^{2}\right).\end{split}

By setting the parameters as in the theorem, we conclude the proof. Further, to obtain 1Rr𝔼F(𝐰¯r)2ϵ2\frac{1}{R}\sum_{r}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\leq\epsilon^{2}, it suffices to set γ=O(ϵ2)\gamma=O(\epsilon^{2}), β=O(ϵ2M)\beta=O(\frac{\epsilon^{2}}{\sqrt{M}}), K=O(Mϵ)K=O(\frac{\sqrt{M}}{\epsilon}), η=O(ϵ2M)\eta=O(\frac{\epsilon^{2}}{M}), R=O(Mϵ3)R=O(\frac{\sqrt{M}}{\epsilon^{3}}). ∎
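As a sanity check on this final rate, one can plug the stated parameter choices into the five terms of the bound above and verify numerically that each is O(ϵ2)O(\epsilon^{2}). The short sketch below does this in Python; the instantiation η~=ηK\tilde{\eta}=\eta K is our assumption for illustration (not stated in this excerpt), and absolute constants are dropped.

```python
# Numerical sanity check: with gamma = eps^2, beta = eps^2/sqrt(M),
# K = sqrt(M)/eps, eta = eps^2/M, R = sqrt(M)/eps^3, and the assumed
# effective step size eta_tilde = eta * K, every term of the bound
#   M/(gamma*R*K) + gamma + beta^2*K^2
#   + (M/gamma)*eta_tilde^2*(1/(beta*R*K) + beta/N) + gamma*M*eta^2*K^2
# is O(eps^2) up to absolute constants.
import math

def bound_terms(eps: float, M: float, N: float):
    gamma = eps**2
    beta = eps**2 / math.sqrt(M)
    K = math.sqrt(M) / eps
    eta = eps**2 / M
    R = math.sqrt(M) / eps**3
    eta_t = eta * K  # assumption: eta_tilde = eta * K
    return [
        M / (gamma * R * K),
        gamma,
        beta**2 * K**2,
        (M / gamma) * eta_t**2 * (1 / (beta * R * K) + beta / N),
        gamma * M * eta**2 * K**2,
    ]

if __name__ == "__main__":
    eps, M, N = 1e-2, 64.0, 8.0
    # Each term should be at most a small constant times eps^2.
    for t in bound_terms(eps, M, N):
        assert t <= 10 * eps**2, t
```

The first three terms evaluate exactly to ϵ2\epsilon^{2} under these choices, which is why the stated RR and KK cannot be improved without changing the other parameters.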

Appendix F FeDXL with Partial Client Participation

Considering that not all client machines are available at each round, in this section we provide an algorithm that allows partial client participation in every round. The algorithm is given in Algorithm 3. We adopt Assumption 3.3, and the convergence result is presented in Theorem F.3.

Algorithm 3 FeDXL2: Federated Learning for DXO with non-linear ff
1:  On Client ii: Require parameters η,K\eta,K
2:  Initialize model 𝐰i,K0\mathbf{w}_{i,K}^{0}, 𝒰i0={u0(𝐳)=0,𝐳𝒮1i}\mathcal{U}_{i}^{0}=\{u^{0}(\mathbf{z})=0,\mathbf{z}\in\mathcal{S}^{i}_{1}\}, Gi,K0=0G^{0}_{i,K}=0, and buffer i,1,i,2,𝒞i=\mathcal{B}_{i,1},\mathcal{B}_{i,2},\mathcal{C}_{i}=\emptyset
3:  Sample KK points from S1iS_1^{i}, compute their predictions using model 𝐰i,00\mathbf{w}_{i,0}^{0} denoted by i,10\mathcal{H}^{0}_{i,1}
4:  Sample KK points from S2iS_2^{i}, compute their predictions using model 𝐰i,00\mathbf{w}_{i,0}^{0} denoted by i,20\mathcal{H}^{0}_{i,2}
5:  Send i,10,i,20,𝒰i0\mathcal{H}^{0}_{i,1},\mathcal{H}^{0}_{i,2},\mathcal{U}^{0}_{i} to the server
6:  for r=1,,Rr=1,...,R do
7:     if iPri\not\in P^{r} then skip this round, otherwise do the following
8:     Receive 𝐰¯r,G¯r\bar{\mathbf{w}}^{r},\bar{G}^{r} from the server and set 𝐰i,0r=𝐰¯r,Gi,0r=G¯r\mathbf{w}^{r}_{i,0}=\bar{\mathbf{w}}^{r},G^{r}_{i,0}=\bar{G}^{r}
9:     Receive i,1r1,i,2r1,𝒫r1\mathcal{R}^{r-1}_{i,1},\mathcal{R}^{r-1}_{i,2},\mathcal{P}^{r-1} from the server
10:     Update the buffer i,1,i,2,𝒞i\mathcal{B}_{i,1},\mathcal{B}_{i,2},\mathcal{C}_{i} using i,1r1,i,2r1,𝒫r1\mathcal{R}^{r-1}_{i,1},\mathcal{R}^{r-1}_{i,2},\mathcal{P}^{r-1} with shuffling, respectively
11:     Set i,1r=\mathcal{H}^{r}_{i,1}=\emptyset, i,2r=,𝒰ir=\mathcal{H}^{r}_{i,2}=\emptyset,\mathcal{U}_{i}^{r}=\emptyset
12:     for k=0,..,K1k=0,..,K-1 do
13:        Sample 𝐳i,k,1r\mathbf{z}^{r}_{i,k,1} from 𝒮1i\mathcal{S}^{i}_{1}, sample 𝐳i,k,2r\mathbf{z}^{r}_{i,k,2} from 𝒮2i\mathcal{S}^{i}_{2} \diamond or sample two mini-batches of data
14:        Take next hξr1h^{r-1}_{\xi}, hζr1h^{r-1}_{\zeta} and uζr1u^{r-1}_{\zeta} from i,1\mathcal{B}_{i,1} and i,2\mathcal{B}_{i,2} and 𝒞i\mathcal{C}_{i}, respectively
15:        Compute h(𝐰i,kr,𝐳i,k,1r)h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}) and h(𝐰i,kr,𝐳i,k,2r)h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2})
16:        Compute h(𝐰i,kr,𝐳^i,k,1r)h(\mathbf{w}^{r}_{i,k},\hat{\mathbf{z}}^{r}_{i,k,1}) and h(𝐰i,kr,𝐳^i,k,2r)h(\mathbf{w}^{r}_{i,k},\hat{\mathbf{z}}^{r}_{i,k,2}) and add them to i,1r,i,2r\mathcal{H}^{r}_{i,1},\mathcal{H}^{r}_{i,2}, respectively
17:        Compute 𝐮i,kr(𝐳i,k,1r)\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1}) according to (11) and add it to 𝒰ir\mathcal{U}_{i}^{r}
18:        Compute Gi,k,1rG^{r}_{i,k,1} and Gi,k,2rG^{r}_{i,k,2} according to (12), (13)
19:        Gi,kr=(1β)Gi,k1r+β(Gi,k,1r+Gi,k,2r)G^{r}_{i,k}=(1-\beta)G^{r}_{i,k-1}+\beta(G^{r}_{i,k,1}+G^{r}_{i,k,2})
20:        𝐰i,k+1r=𝐰i,krηGi,kr\mathbf{w}^{r}_{i,k+1}=\mathbf{w}^{r}_{i,k}-\eta G^{r}_{i,k}
21:     end for
22:     Send 𝐰i,Kr,Gi,Kr\mathbf{w}^{r}_{i,K},G^{r}_{i,K} to the server
23:     Send i,1r,i,2r,𝒰ir\mathcal{H}^{r}_{i,1},\mathcal{H}^{r}_{i,2},\mathcal{U}_{i}^{r} to the server
24:  end for  
25:  On Server
26:  Collects 0=1,02,0N,0\mathcal{H}^{0}_{*}=\mathcal{H}^{0}_{1,*}\cup\mathcal{H}^{0}_{2,*}\ldots\cup\mathcal{H}^{0}_{N,*} and 𝒰0=𝒰10𝒰20𝒰N0\mathcal{U}^{0}=\mathcal{U}^{0}_{1}\cup\mathcal{U}^{0}_{2}\ldots\cup\mathcal{U}^{0}_{N}, where =1,2*=1,2
27:  for r=1,,Rr=1,...,R do
28:     Sample a set PrP^{r} of clients to participate in this round
29:     Receive 𝐰i,Kr1\mathbf{w}^{r-1}_{i,K},Gi,Kr1G^{r-1}_{i,K} from client iPr1i\in P^{r-1}, compute 𝐰¯r=1|Pr1|iPr1𝐰i,Kr1\bar{\mathbf{w}}^{r}=\frac{1}{|P^{r-1}|}\sum_{i\in P^{r-1}}\mathbf{w}^{r-1}_{i,K}, Gr=1|Pr1|iPr1Gi,Kr1G^{r}=\frac{1}{|P^{r-1}|}\sum_{i\in P^{r-1}}G^{r-1}_{i,K}.
30:     Broadcast 𝐰¯r\bar{\mathbf{w}}^{r} and GrG^{r} to clients in PrP^{r}
31:     Set i,1r1=1r1,i,2r1=2r1,𝒫ir1=𝒰r1\mathcal{R}^{r-1}_{i,1}=\mathcal{H}^{r-1}_{1},\mathcal{R}^{r-1}_{i,2}=\mathcal{H}^{r-1}_{2},\mathcal{P}^{r-1}_{i}=\mathcal{U}^{r-1} and send them to Client ii for all iPri\in P^{r}
32:     Collects r=i,r,iPr\mathcal{H}^{r}_{*}=\cup\mathcal{H}^{r}_{i,*},\forall i\in P^{r} and 𝒰r=𝒰ir,iPr\mathcal{U}^{r}=\cup\mathcal{U}^{r}_{i},\forall i\in P^{r}, where =1,2*=1,2
33:  end for
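To make the control flow of Algorithm 3 concrete, here is a minimal Python sketch of the server/client interaction under partial participation. All names (`server_rounds`, `Client`, `local_steps`) are illustrative, the local gradient estimator is a placeholder, and the prediction/estimator buffers ℋ, 𝒰 are simplified to flat lists; this is a sketch of the bookkeeping, not the paper’s implementation.

```python
# Sketch of Algorithm 3's bookkeeping: the server averages the models and
# gradient trackers of LAST round's participants, samples a fresh P^r, and
# swaps stale buffers for fresh ones. Names and gradients are placeholders.
import random
import numpy as np

class Client:
    def __init__(self, dim, K=4, eta=0.1, beta=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=dim)   # local model w_{i,K}
        self.G = np.zeros(dim)          # local tracker G_{i,K}
        self.K, self.eta, self.beta = K, eta, beta

    def local_steps(self, w_bar, G_bar, H1, H2, U):
        # Steps 8, 19-20: start from the averages, run K momentum steps.
        self.w, self.G = w_bar.copy(), G_bar.copy()
        for _ in range(self.K):
            grad = self.w  # placeholder for G_{i,k,1} + G_{i,k,2}
            self.G = (1 - self.beta) * self.G + self.beta * grad
            self.w = self.w - self.eta * self.G
        # Steps 22-23: fresh predictions/estimators (scalars here).
        return [float(self.w[0])], [float(self.w[-1])], [0.0]

def server_rounds(clients, R, num_participants, seed=0):
    rng = random.Random(seed)
    H1, H2, U = [], [], []               # pooled buffers H^r_1, H^r_2, U^r
    prev_P = list(range(len(clients)))   # round 0: every client contributed
    for r in range(1, R + 1):
        # Step 29: average over last round's participants P^{r-1}.
        w_bar = np.mean([clients[i].w for i in prev_P], axis=0)
        G_bar = np.mean([clients[i].G for i in prev_P], axis=0)
        # Step 28: sample this round's participant set P^r.
        P = rng.sample(range(len(clients)), num_participants)
        for i in P:
            # Steps 30-32: broadcast averages and stale buffers,
            # collect fresh ones.
            h1, h2, u = clients[i].local_steps(w_bar, G_bar, H1, H2, U)
            H1, H2, U = H1 + h1, H2 + h2, U + u
        prev_P = P
    return w_bar
```

Note the one-round lag: the averages broadcast in round rr are built from round r1r-1 participants, which is exactly the staleness the analysis of Lemma F.1 has to control.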

F.1 Analysis of the moving average estimator 𝐮\mathbf{u}

Lemma F.1.

Under Assumption 3.3, the moving average estimator 𝐮\mathbf{u} satisfies

1Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2(1γ|Pr|16|𝒮1i|N)1Ni=1N1|𝒮1i|𝐳|𝒮1i|[𝔼𝐮i,k1r(𝐳)g(𝐰¯k1r,𝐳,𝐰¯k1r,𝒮2)2+20|𝒮1i|Nγ|Pr|L~2𝐰¯k1r𝐰¯kr2]+8γ2|𝒮1i||Pr|N(σ2+C02)+16γβ2K2C02|Pr||𝒮1i|N+8|Pr|NL~2𝐰¯r𝐰¯r12+8L~2|Pr|N𝐰¯r𝐰¯kr2+8(γ2+γ|𝒮1i|)L~21NiPr𝐰¯r𝐰i,kr2+2(γ2+γ|𝒮1i|)L~21NKiPrk=1K𝔼𝐰¯r1𝐰¯i,kr12.\begin{split}&\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq(1-\frac{\gamma|P^{r}|}{16|\mathcal{S}_{1}^{i}|N})\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}[\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k-1},\mathbf{z},\bar{\mathbf{w}}^{r}_{k-1},\mathcal{S}_{2})\|^{2}\\ &+\frac{20|\mathcal{S}_{1}^{i}|N}{\gamma|P^{r}|}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}]+8\frac{\gamma^{2}}{|\mathcal{S}_{1}^{i}|}\frac{|P^{r}|}{N}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}\\ &+8\frac{|P^{r}|}{N}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{L}^{2}\frac{|P^{r}|}{N}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i\in P^{r}}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i\in P^{r}}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}.\end{split}
Proof.

Let PrP^{r} denote the set of clients sampled to participate in the rr-th round. By the update rule of 𝐮\mathbf{u}, we have

𝐮i,kr(𝐳)={𝐮i,k1r(𝐳)γ(𝐮i,k1r(𝐳)(h(𝐰i,kr,𝐳i,k,1r),h(𝐰j,tr1,𝐳^j,t,2r1))),iPrand𝐳=𝐳i,k,1r𝐮i,k1r(𝐳),otherwise.\begin{split}\mathbf{u}_{i,k}^{r}(\mathbf{z})=\left\{\begin{array}[]{cc}\mathbf{u}_{i,k-1}^{r}(\mathbf{z})-\gamma(\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-\ell(h(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1}),h(\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}))),&i\in P^{r}~{}and~{}\mathbf{z}=\mathbf{z}^{r}_{i,k,1}\\ \mathbf{u}_{i,k-1}^{r}(\mathbf{z}),&otherwise.\end{array}\right.\end{split} (54)

Or equivalently,

𝐮i,kr(𝐳)={𝐮i,k1r(𝐳)γ(𝐮i,k1r(𝐳)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)),iPrand𝐳=𝐳i,k,1r𝐮i,k1r(𝐳),otherwise.\begin{split}\mathbf{u}_{i,k}^{r}(\mathbf{z})=\left\{\begin{array}[]{cc}\mathbf{u}_{i,k-1}^{r}(\mathbf{z})-\gamma(\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})),&i\in P^{r}~{}and~{}\mathbf{z}=\mathbf{z}^{r}_{i,k,1}\\ \mathbf{u}_{i,k-1}^{r}(\mathbf{z}),&otherwise.\end{array}\right.\end{split} (55)
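In code, rule (55) amounts to updating a per-sample table only for clients in PrP^{r} and only at the sampled key 𝐳i,k,1r\mathbf{z}^{r}_{i,k,1}; all other entries are carried over unchanged. Below is a minimal sketch, where `g_val` is a hypothetical black-box oracle standing in for g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}) and all names are illustrative.

```python
# Sketch of the moving-average update (55): only client i in P^r touches
# its table, and only at the sampled key z^r_{i,k,1}. Everything else keeps
# u_{i,k}^r(z) = u_{i,k-1}^r(z). `g_val` is a hypothetical oracle for g.
def update_u(u_tables, P_r, sampled, g_val, gamma):
    """u_tables: list of dicts (one per client, key -> estimate).
    P_r: set of participating client ids; sampled: client id -> key z."""
    for i, table in enumerate(u_tables):
        if i not in P_r:
            continue  # non-participant: table unchanged this step
        z = sampled[i]
        # u <- u - gamma * (u - g), i.e. a convex combination of u and g.
        table[z] = table[z] - gamma * (table[z] - g_val(i, z))
    return u_tables
```

For example, with γ=0.5\gamma=0.5, a participating client moves its entry halfway toward the oracle value, while a non-participating client's entry is untouched.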

Define 𝐮¯kr=(𝐮1,kr,𝐮2,kr,,𝐮N,kr)\bar{\mathbf{u}}^{r}_{k}=(\mathbf{u}^{r}_{1,k},\mathbf{u}^{r}_{2,k},...,\mathbf{u}^{r}_{N,k}), 𝐰¯kr=1|Pr|iPr𝐰i,kr\bar{\mathbf{w}}^{r}_{k}=\frac{1}{|P^{r}|}\sum\limits_{i\in P^{r}}\mathbf{w}^{r}_{i,k}. Then it follows that

12Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2=1Ni1|𝒮1i|𝐳|𝒮1i|𝔼[12𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2),𝐮i,kr(𝐳)𝐮i,k1r(𝐳)+12𝐮i,kr(𝐳)𝐮i,k1r(𝐳)2]=12Ni1|𝒮i|𝐳𝒮1i𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+𝔼1NiPr1|𝒮1i|𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)+1Ni12|𝒮1i|𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2=12Ni1|𝒮i|𝐳𝒮1i𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+𝔼[1NiPr1|𝒮1i|𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)]+𝔼[1NiPr1|𝒮1i|g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)]+𝔼[1NiPr12|𝒮i|𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2],\begin{split}&\frac{1}{2N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &=\frac{1}{N}\sum_{i}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\bigg{[}\frac{1}{2}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}~{}~{}~{}+\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z})\rangle+\frac{1}{2}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z})\|^{2}\bigg{]}\\ &=\frac{1}{2N}\sum_{i}\frac{1}{|\mathcal{S}_{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}+\mathbb{E}\frac{1}{N}\sum_{i\in P^{r}}\frac{1}{|\mathcal{S}_{1}^{i}|}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ 
&~{}~{}~{}+\frac{1}{N}\sum_{i}\frac{1}{2|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &=\frac{1}{2N}\sum_{i}\frac{1}{|\mathcal{S}_{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}+\mathbb{E}[\frac{1}{N}\sum_{i\in P^{r}}\frac{1}{|\mathcal{S}_{1}^{i}|}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle]\\ &~{}~{}~{}+\mathbb{E}[\frac{1}{N}\sum_{i\in P^{r}}\frac{1}{|\mathcal{S}_{1}^{i}|}\langle g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle]\\ &~{}~{}~{}+\mathbb{E}[\frac{1}{N}\sum_{i\in P^{r}}\frac{1}{2|\mathcal{S}_{i}|}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}],\end{split} (56)

where for iPri\in P^{r} it holds that

𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)=𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)=𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+1γ𝐮i,k1r(𝐳i,k,1r)𝐮i,kr(𝐳i,k,1r),𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)=𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+12γ(𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2)\begin{split}&\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &=\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &~{}~{}~{}+\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\rangle\\ &=\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ 
&~{}~{}~{}+\frac{1}{\gamma}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\rangle\\ &=\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &~{}~{}~{}+\frac{1}{2\gamma}(\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2})\end{split} (57)

If γ15\gamma\leq\frac{1}{5}, we have for iPri\in P^{r}

12(1γ1γ+14γ)𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2+𝔼g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2),𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)14γ𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2+γ𝔼g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+14γ𝔼𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2γ𝔼g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)24γ𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)2+4γL~2𝔼𝐰¯r𝐰¯r12+4γL~2𝔼𝐰i,kr𝐰¯r2+4γL~2𝔼𝐰j,tr1𝐰¯r124γσ2+4γL~2𝔼𝐰¯r𝐰¯r12+4γL~2𝔼𝐰i,kr𝐰¯r2+4γL~2𝔼𝐰j,tr1𝐰¯r12.\begin{split}&-\frac{1}{2}\left(\frac{1}{\gamma}-1-\frac{\gamma+1}{4\gamma}\right)\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &~{}~{}~{}+\mathbb{E}\langle g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2}),\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq-\frac{1}{4\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &~{}~{}~{}+\gamma\mathbb{E}\|g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}+\frac{1}{4\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq\gamma\mathbb{E}\|g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq 
4\gamma\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &~{}~{}~{}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j,t}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &\leq 4\gamma\sigma^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j,t}-\bar{\mathbf{w}}^{r-1}\|^{2}.\end{split} (58)

Then, we have

12Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)212Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+1NiPr1|𝒮1i|[12γ𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)212γ𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2γ+18γ𝐮i,kr(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)2+4γσ2+4γL~2𝔼𝐰¯r𝐰¯r12+4γL~2𝔼𝐰i,kr𝐰¯r2+4γL~2𝔼𝐰j,tr1𝐰¯r12+𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)].\begin{split}&\frac{1}{2N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{1}{2N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &+\frac{1}{N}\sum_{i\in P^{r}}\frac{1}{|\mathcal{S}_{1}^{i}|}\Bigg{[}\frac{1}{2\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &-\frac{1}{2\gamma}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\!-\!\frac{\gamma+1}{8\gamma}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\!+\!4\gamma\sigma^{2}\\ &+4\gamma\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\gamma\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j,t}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ 
&+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\Bigg{]}.\end{split} (59)

Note that for iPri\in P^{r}, 𝐳𝐳i,k,1r𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2=𝐳𝐳i,k,1r𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2\sum_{\mathbf{z}\neq\mathbf{z}^{r}_{i,k,1}}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}=\sum_{\mathbf{z}\neq\mathbf{z}^{r}_{i,k,1}}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}, which implies that for iPri\in P^{r}

12γ(𝐮i,k1r(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2)=12γ𝐳𝒮1i(𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2).\begin{split}&\frac{1}{2\gamma}\left(\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\right)\\ &=\frac{1}{2\gamma}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\left(\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}-\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\right).\end{split} (60)

Since ()C0\ell(\cdot)\leq C_{0}, we have that g()2C02\|g(\cdot)\|^{2}\leq C_{0}^{2}, 𝐮i,kr(𝐳)2C02\|\mathbf{u}^{r}_{i,k}(\mathbf{z})\|^{2}\leq C_{0}^{2} and

𝐮i,kr(𝐳)𝐮i,0r(𝐳)2β2K2C02.\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,0}(\mathbf{z})\|^{2}\leq\beta^{2}K^{2}C_{0}^{2}.
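This drift bound is a consequence of the moving-average form of the estimator update. As a short sketch (assuming each local step replaces 𝐮\mathbf{u} by a convex combination with weight β\beta of a quantity lying in the same range [0,C0][0,C_{0}] as the bounded loss), each step satisfies

\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z})\|=\beta\|g(\cdot)-\mathbf{u}^{r}_{i,k-1}(\mathbf{z})\|\leq\beta C_{0},

so telescoping over at most KK local steps gives 𝐮i,kr(𝐳)𝐮i,0r(𝐳)βKC0\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-\mathbf{u}^{r}_{i,0}(\mathbf{z})\|\leq\beta KC_{0}, and squaring yields the stated bound.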

Besides, we have for iPri\in P^{r} that

𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)=𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)+𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1)g(𝐰i,kr,𝐳i,k,1r,𝐰j,tr1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)+𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)+2L~2𝔼𝐰¯r𝐰¯r12+2L~2𝔼𝐰¯r𝐰i,kr2+L~2𝔼𝐰¯r1𝐰j,tr12+14𝔼g(𝐰¯r,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)22γC02+1γ𝐰¯kr𝐰¯r12+𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)+2L~2𝔼𝐰¯r𝐰¯r12+2L~2𝔼𝐰¯r𝐰i,kr2+L~2𝔼𝐰¯r1𝐰j,tr12+14𝔼g(𝐰¯r,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)2,\begin{split}&\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &=\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+\mathbb{E}\langle g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})-g(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ 
&\leq\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})\rangle\\ &+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{j,t}\|^{2}\\ &+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq 2\gamma C_{0}^{2}+\frac{1}{\gamma}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+2\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{j,t}\|^{2}\\ &+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2},\end{split} (61)

where

𝔼𝐮i,k1r(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)=𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)+𝐮i,0r1(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)+𝐮i,0r1(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)+𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r),𝐮i,0r1(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)+𝔼𝐮i,0r1(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)+𝔼𝐮i,0r1(𝐳i,k,1r)g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝐳^j,t,2r1),𝐮i,0r1(𝐳i,k,1r)𝐮i,k1r(𝐳i,k,1r)4𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)2+14𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2+14𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2+4𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)24𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)212𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2+8β2K2C02.\begin{split}&\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &=\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})+\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})+\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\rangle\\ 
&+\mathbb{E}\langle\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1}),\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+\mathbb{E}\langle\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &+\mathbb{E}\langle\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2}),\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq 4\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &+\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}+4\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq 4\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}-\frac{1}{2}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}+8\beta^{2}K^{2}C_{0}^{2}.\end{split} (62)

Noting that for iPri\in P^{r},

𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,0r1(𝐳i,k,1r)2=𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)+𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)2=𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)2𝔼𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)2+2𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r),𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)12𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)2+8𝐮i,k1r(𝐳i,k,1r)𝐮i,0r1(𝐳i,k,1r)212𝔼g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2)𝐮i,k1r(𝐳i,k,1r)2+8β2K2C0214𝔼g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)2+12L~2𝐰¯r1𝐰¯kr2+8β2K2C02.\begin{split}&-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &=-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})+\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &=-\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}-\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &~{}~{}~{}+2\mathbb{E}\langle g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1}),\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\rangle\\ &\leq-\frac{1}{2}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}+8\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})-\mathbf{u}^{r-1}_{i,0}(\mathbf{z}^{r}_{i,k,1})\|^{2}\\ &\leq-\frac{1}{2}\mathbb{E}\|g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}+8\beta^{2}K^{2}C_{0}^{2}\\ 
&\leq-\frac{1}{4}\mathbb{E}\|g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}+\frac{1}{2}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+8\beta^{2}K^{2}C_{0}^{2}.\end{split} (63)

Taking expectation over the client sampling and the data sampling, we observe that

𝔼[1NiPr1|𝒮1i|g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)2]=1N|Pr|Ni=1N𝔼𝐳i,k,1r𝒮1i[1|𝒮1i|g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)𝐮i,k1r(𝐳i,k,1r)2].\begin{split}&-\mathbb{E}\left[\frac{1}{N}\sum_{i\in P^{r}}\frac{1}{|\mathcal{S}^{i}_{1}|}\|g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\right]\\ &=-\frac{1}{N}\frac{|P^{r}|}{N}\sum_{i=1}^{N}\mathbb{E}_{\mathbf{z}^{r}_{i,k,1}\in\mathcal{S}^{i}_{1}}\left[\frac{1}{|\mathcal{S}^{i}_{1}|}\|g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})-\mathbf{u}^{r}_{i,k-1}(\mathbf{z}^{r}_{i,k,1})\|^{2}\right].\end{split} (64)

Then, multiplying every term by γ\gamma and rearranging terms using the setting γO(1)\gamma\leq O(1), we obtain

γ+121Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2γ(1|Pr|8|𝒮1i|N)+121Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+4γ2|Pr||𝒮1i|N(σ2+C02)+8γβ2K2C02|Pr||𝒮1i|N+4L~2|Pr|N𝔼𝐰¯r𝐰¯r12+4L~2|Pr|N𝔼𝐰¯r𝐰¯kr2+4(γ2+γ|𝒮1i|)L~21NiPr𝔼𝐰¯r𝐰i,kr2+(γ2+γ|𝒮1i|)L~21NKiPrk=1K𝔼𝐰¯r1𝐰i,kr12.\begin{split}&\frac{\gamma+1}{2}\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{\gamma(1-\frac{|P^{r}|}{8|\mathcal{S}_{1}^{i}|N})+1}{2}\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &+\frac{4\gamma^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}(\sigma^{2}+C_{0}^{2})+\frac{8\gamma\beta^{2}K^{2}C_{0}^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}+4\tilde{L}^{2}\frac{|P^{r}|}{N}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+4\tilde{L}^{2}\frac{|P^{r}|}{N}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &+4(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i\in P^{r}}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i\in P^{r}}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\mathbf{w}^{r-1}_{i,k}\|^{2}.\end{split} (65)

Dividing both sides by γ+12\frac{\gamma+1}{2} gives

1Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2γ(1|Pr|8|𝒮1i|N)+1γ+11Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,k1r(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2+8γ2|Pr||𝒮1i|N(σ2+C02)+16γβ2K2C02|Pr||𝒮1i|N+8L~2|Pr|N𝐰¯r𝐰¯r12+8L~2|Pr|N𝐰¯r𝐰¯kr2+8(γ2+γ|𝒮1i|)L~21NiPr𝐰¯r𝐰i,kr2+2(γ2+γ|𝒮1i|)L~21NKiPrk=1K𝔼𝐰¯r1𝐰¯i,kr12.\begin{split}&\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{\gamma(1-\frac{|P^{r}|}{8|\mathcal{S}_{1}^{i}|N})+1}{\gamma+1}\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &+8\frac{\gamma^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}+8\tilde{L}^{2}\frac{|P^{r}|}{N}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &+8\tilde{L}^{2}\frac{|P^{r}|}{N}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i\in P^{r}}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}\\ &+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i\in P^{r}}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}.\end{split} (66)

Using Young’s inequality,

1Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2(1γ|Pr|8|𝒮1i|N)1Ni=1N1|𝒮1i|𝐳|𝒮1i|[(1+γ|Pr|16|𝒮1i|N)𝔼𝐮i,k1r(𝐳)g(𝐰¯k1r,𝐳,𝐰¯k1r,𝒮2)2+(1+16|𝒮1i|Nγ|Pr|)L~2𝐰¯k1r𝐰¯kr2]+8γ2|Pr||𝒮1i|N(σ2+C02)+16γβ2K2C02|Pr||𝒮1i|N+8L~2|Pr|N𝐰¯r𝐰¯r12+8L~2|Pr|N𝐰¯r𝐰¯kr2+8(γ2+γ|𝒮1i|)L~21NiPr𝐰¯r𝐰i,kr2+2(γ2+γ|𝒮1i|)L~21NKiPrk=1K𝔼𝐰¯r1𝐰¯i,kr12,\begin{split}&\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq(1-\frac{\gamma|P^{r}|}{8|\mathcal{S}_{1}^{i}|N})\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\bigg{[}(1+\frac{\gamma|P^{r}|}{16|\mathcal{S}_{1}^{i}|N})\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k-1},\mathbf{z},\bar{\mathbf{w}}^{r}_{k-1},\mathcal{S}_{2})\|^{2}\\ &~{}~{}~{}+(1+\frac{16|\mathcal{S}_{1}^{i}|N}{\gamma|P^{r}|})\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\bigg{]}\\ &+8\frac{\gamma^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}\\ &+8\tilde{L}^{2}\frac{|P^{r}|}{N}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{L}^{2}\frac{|P^{r}|}{N}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i\in P^{r}}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}\\ &+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i\in P^{r}}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2},\end{split}

which yields

1Ni=1N1|𝒮1i|𝐳|𝒮1i|𝔼𝐮i,kr(𝐳)g(𝐰¯kr,𝐳,𝐰¯kr,𝒮2)2(1γ|Pr|16|𝒮1i|N)1Ni=1N1|𝒮1i|𝐳|𝒮1i|[𝔼𝐮i,k1r(𝐳)g(𝐰¯k1r,𝐳,𝐰¯k1r,𝒮2)2+20|𝒮1i|Nγ|Pr|L~2𝐰¯k1r𝐰¯kr2]+8γ2|𝒮1i||Pr|N(σ2+C02)+16γβ2K2C02|Pr||𝒮1i|N+8|Pr|NL~2𝐰¯r𝐰¯r12+8L~2|Pr|N𝐰¯r𝐰¯kr2+8(γ2+γ|𝒮1i|)L~21NiPr𝐰¯r𝐰i,kr2+2(γ2+γ|𝒮1i|)L~21NKiPrk=1K𝔼𝐰¯r1𝐰¯i,kr12.\begin{split}&\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq(1-\frac{\gamma|P^{r}|}{16|\mathcal{S}_{1}^{i}|N})\frac{1}{N}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in|\mathcal{S}_{1}^{i}|}[\mathbb{E}\|\mathbf{u}^{r}_{i,k-1}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k-1},\mathbf{z},\bar{\mathbf{w}}^{r}_{k-1},\mathcal{S}_{2})\|^{2}\\ &+\frac{20|\mathcal{S}_{1}^{i}|N}{\gamma|P^{r}|}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}]+8\frac{\gamma^{2}}{|\mathcal{S}_{1}^{i}|}\frac{|P^{r}|}{N}(\sigma^{2}+C_{0}^{2})+\frac{16\gamma\beta^{2}K^{2}C_{0}^{2}|P^{r}|}{|\mathcal{S}_{1}^{i}|N}\\ &+8\frac{|P^{r}|}{N}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+8\tilde{L}^{2}\frac{|P^{r}|}{N}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r}_{k}\|^{2}\\ &+8(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{N}\sum_{i\in P^{r}}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}\\ &+2(\gamma^{2}+\frac{\gamma}{|\mathcal{S}_{1}^{i}|})\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i\in P^{r}}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}.\end{split}

F.2 Analysis of the Gradient Estimator

With the update Gi,kr=(1β)Gi,k1r+β(Gi,k,1r+Gi,k,2r)G^{r}_{i,k}=(1-\beta)G^{r}_{i,k-1}+\beta(G^{r}_{i,k,1}+G^{r}_{i,k,2}), we define G¯kr:=1|Pr|iPrGi,kr\bar{G}^{r}_{k}:=\frac{1}{|P^{r}|}\sum\limits_{i\in P^{r}}G^{r}_{i,k}, and Δkr:=G¯krF(𝐰¯kr)2\Delta^{r}_{k}:=\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}. Then it follows that G¯kr=(1β)G¯k1r+β1|Pr|iPr(Gi,k,1r+Gi,k,2r)\bar{G}^{r}_{k}=(1-\beta)\bar{G}^{r}_{k-1}+\beta\frac{1}{|P^{r}|}\sum\limits_{i\in P^{r}}(G^{r}_{i,k,1}+G^{r}_{i,k,2}).
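As a concrete illustration of this recursion, the following minimal sketch (not the paper's implementation; `G_prev`, `G1`, `G2` are hypothetical dictionaries mapping each client iPri\in P^{r} to its local gradient estimate, stored as a list of floats) performs the local moving-average update followed by the averaging that defines G¯kr\bar{G}^{r}_{k}:

```python
def update_gradient_estimators(G_prev, G1, G2, beta):
    # Local moving-average update: G_{i,k} = (1 - beta) * G_{i,k-1} + beta * (G_{i,k,1} + G_{i,k,2})
    G_new = {
        i: [(1.0 - beta) * g + beta * (a + b)
            for g, a, b in zip(G_prev[i], G1[i], G2[i])]
        for i in G_prev
    }
    # Average over the participating clients P^r: bar{G}_k = (1/|P^r|) * sum_i G_{i,k}
    num_clients = len(G_new)
    dim = len(next(iter(G_new.values())))
    G_bar = [sum(G_new[i][d] for i in G_new) / num_clients for d in range(dim)]
    return G_new, G_bar
```

By linearity of the average, G¯kr\bar{G}^{r}_{k} obeys the same recursion as the per-client estimators, which is exactly the identity stated above.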

Lemma F.2.

Under Assumption 3.3, Algorithm 3 ensures that

Δkr(1β)G¯k1rF(𝐰¯k1r)2+η16F(𝐰¯r1)2+2β2σ2|Pr|+36η2βL~2𝐰¯kr𝐰¯r12+12β(1Ni4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1Ni4L~2𝔼𝐰j,tr1𝐰¯r12)+12β1Ni(L~2𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+L~2𝔼𝐮j,tr1(𝐳^j,t,1r1)g(𝐰¯tr1,𝐳^j,t,1r1,𝐰¯tr1,𝒮2)2).\begin{split}&\Delta^{r}_{k}\leq(1-\beta)\|\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1})\|^{2}+\frac{\eta}{16}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\\ &+\frac{2\beta^{2}\sigma^{2}}{|P^{r}|}+36\frac{\eta^{2}}{\beta}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r-1}\|^{2}\\ &+12\beta\left(\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+12\beta\frac{1}{N}\sum_{i}\bigg{(}\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}+\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2})\|^{2}\bigg{)}.\end{split}
Proof.
Δkr=G¯krF(𝐰¯kr)2=(1β)G¯k1r+β1|Pr|iPr(Gi,k,1r+Gi,k,2r)F(𝐰¯kr)2=(1β)(G¯k1rF(𝐰¯k1r))+(1β)(F(𝐰¯k1r)F(𝐰¯kr))+β(1|Pr|iPr(G1(𝐰i,kr,𝐳i,k,1r,𝐮i,kr(𝐳i,k,1r),𝐰j,tr1,𝐳^j,t,2r1)+G2(𝐰j,tr1,𝐳^j,t,1r1,𝐮j,tr1(𝐳^j,t,1r1),𝐰i,kr,𝐳i,k,2r))1PriPr(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r)))+β(1|Pr|iPr(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯kr))2.\begin{split}&\Delta^{r}_{k}=\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\\ &=\|(1-\beta)\bar{G}^{r}_{k-1}+\beta\frac{1}{|P^{r}|}\sum_{i\in P^{r}}(G^{r}_{i,k,1}+G^{r}_{i,k,2})-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\\ &=\bigg{\|}(1-\beta)(\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1}))+(1-\beta)(\nabla F(\bar{\mathbf{w}}^{r}_{k-1})-\nabla F(\bar{\mathbf{w}}^{r}_{k}))\\ &~{}~{}~{}+\beta\bigg{(}\frac{1}{|P^{r}|}\sum_{i\in P^{r}}(G_{1}(\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,1},\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1}),\mathbf{w}^{r-1}_{j,t},\hat{\mathbf{z}}^{r-1}_{j,t,2})+G_{2}(\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}),\mathbf{w}^{r}_{i,k},\mathbf{z}^{r}_{i,k,2}))\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\frac{1}{P^{r}}\sum_{i\in P^{r}}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))\bigg{)}\\ &+\beta\bigg{(}\frac{1}{|P^{r}|}\sum_{i\in 
P^{r}}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r}_{k})\bigg{)}\bigg{\|}^{2}.\end{split} (67)

Using Young’s inequality and the L~\tilde{L}-Lipschitzness of G1,G2G_{1},G_{2}, we can then derive

Δkr(1+β)(1β)(G¯k1rF(𝐰¯k1r))+β(1|Pr|iPr(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯r1))2+(1+10β)β2(1|Pr|iPr4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1Ni4L~2𝔼𝐰j,tr1𝐰¯r12)+(1+10β)𝐰¯k1r𝐰¯kr2+(1+10β)β21|Pr|iPr(L~2𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+L~2𝔼𝐮j,tr1(𝐳^j,t,1r1)g(𝐰¯tr1,𝐳^j,t,1r1,𝐰¯tr1,𝒮2)2).\begin{split}&\Delta^{r}_{k}\leq(1+\beta)\Bigg{\|}(1-\beta)(\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1}))\\ &+\beta\bigg{(}\frac{1}{|P^{r}|}\sum_{i\in P^{r}}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},{\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r-1})\bigg{)}\Bigg{\|}^{2}\\ &+(1+\frac{10}{\beta})\beta^{2}\left(\frac{1}{|P^{r}|}\sum_{i\in P^{r}}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{N}\sum_{i}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+(1+\frac{10}{\beta})\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+(1+\frac{10}{\beta})\beta^{2}\frac{1}{|P^{r}|}\sum_{i\in P^{r}}\bigg{(}\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ 
&~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2})\|^{2}\bigg{)}.\end{split} (68)

By the fact that

𝔼[1|Pr|iPr(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯r1)]=0,\begin{split}&\mathbb{E}[\frac{1}{|P^{r}|}\sum_{i\in P^{r}}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},{\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r-1})]=0,\end{split} (69)
𝔼1|Pr|iPr(G1(𝐰¯r1,𝐳i,k,1r,g(𝐰¯r1,𝐳i,k,1r,𝐰¯r1,𝒮2),𝐰¯r1,𝐳^j,t,2r1)+G2(𝐰¯r1,𝐳^j,t,1r1,g(𝐰¯r1,𝐳^j,t,1r1,𝐰¯r1,𝒮2),𝐰¯r1,𝐳i,k,2r))F(𝐰¯r1)2σ2|Pr|\begin{split}&\mathbb{E}\|\frac{1}{|P^{r}|}\sum_{i\in P^{r}}(G_{1}(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},g(\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j,t,2})\\ &~{}~{}~{}~{}~{}~{}~{}+G_{2}(\bar{\mathbf{w}}^{r-1},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},g(\bar{\mathbf{w}}^{r-1},{\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1}},\bar{\mathbf{w}}^{r-1},\mathcal{S}_{2}),\bar{\mathbf{w}}^{r-1},\mathbf{z}^{r}_{i,k,2}))-\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\leq\frac{\sigma^{2}}{|P^{r}|}\end{split} (70)

and

𝐰¯k1r𝐰¯kr2=η2G¯kr23η2G¯krF(𝐰¯kr)2+3η2F(𝐰¯kr)F(𝐰¯r1)2+3η2F(𝐰¯r1)2,\begin{split}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}=\eta^{2}\|\bar{G}^{r}_{k}\|^{2}\leq 3\eta^{2}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}+3\eta^{2}\|\nabla F(\bar{\mathbf{w}}^{r}_{k})-\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}+3\eta^{2}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2},\end{split} (71)

we obtain

Δkr(13β4)G¯k1rF(𝐰¯k1r)2+2β2σ2|Pr|+36η2βL~2𝐰¯kr𝐰¯r12+η16F(𝐰¯r1)2+12β(1|Pr|iPr4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1|Pr|iPr4L~2𝔼𝐰j,tr1𝐰¯r12)+12β1|Pr|iPr(L~2𝔼𝐮i,kr(𝐳i,k,1r)g(𝐰¯kr,𝐳i,k,1r,𝐰¯kr,𝒮2)2+L~2𝔼𝐮j,tr1(𝐳^j,t,1r1)g(𝐰¯tr1,𝐳^j,t,1r1,𝐰¯tr1,𝒮2)2).\begin{split}&\Delta^{r}_{k}\leq(1-\frac{3\beta}{4})\|\bar{G}^{r}_{k-1}-\nabla F(\bar{\mathbf{w}}^{r}_{k-1})\|^{2}+\frac{2\beta^{2}\sigma^{2}}{|P^{r}|}+36\frac{\eta^{2}}{\beta}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{\eta}{16}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\\ &+12\beta\left(\frac{1}{|P^{r}|}\sum_{i\in P^{r}}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{|P^{r}|}\sum_{i\in P^{r}}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+12\beta\frac{1}{|P^{r}|}\sum_{i\in P^{r}}\bigg{(}\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z}^{r}_{i,k,1})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z}^{r}_{i,k,1},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}+\tilde{L}^{2}\mathbb{E}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2})\|^{2}\bigg{)}.\end{split}

F.3 Convergence Result

Theorem F.3.

Suppose Assumption 3.3 holds, and assume that at least |P||P| machines participate in each round. Denote by M=maxi|𝒮1i|M=\max_{i}|\mathcal{S}_{1}^{i}| the largest number of data points on a single machine. By setting γ=O(M1/3R2/3)\gamma=O(\frac{M^{1/3}}{R^{2/3}}), β=O(1M1/6R2/3)\beta=O(\frac{1}{M^{1/6}R^{2/3}}), η=O(|P|NM2/3R2/3)\eta=O(\frac{|P|}{NM^{2/3}R^{2/3}}) and K=O(NM1/3R1/3|P|)K=O(\frac{NM^{1/3}R^{1/3}}{|P|}), Algorithm 1 ensures that 𝔼[1Rr=1RF(𝐰¯r)2]O(1R2/3)\mathbb{E}\left[\frac{1}{R}\sum_{r=1}^{R}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\right]\leq O(\frac{1}{R^{2/3}}).
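To make the parameter choices concrete, the scalings can be instantiated as in the following purely illustrative sketch, with every O()O(\cdot) constant set to 1 (the theorem only fixes these parameters up to constant factors; `fedxl_hyperparams` is a hypothetical helper name):

```python
def fedxl_hyperparams(M, R, N, P):
    # Theorem F.3 parameter scalings, all O(.) constants set to 1 for illustration.
    gamma = M ** (1.0 / 3.0) / R ** (2.0 / 3.0)          # gamma = O(M^{1/3} / R^{2/3})
    beta = 1.0 / (M ** (1.0 / 6.0) * R ** (2.0 / 3.0))   # beta  = O(1 / (M^{1/6} R^{2/3}))
    eta = P / (N * M ** (2.0 / 3.0) * R ** (2.0 / 3.0))  # eta   = O(|P| / (N M^{2/3} R^{2/3}))
    K = N * M ** (1.0 / 3.0) * R ** (1.0 / 3.0) / P      # K     = O(N M^{1/3} R^{1/3} / |P|)
    return gamma, beta, eta, K
```

Note that under these choices the step sizes shrink and the number of local steps KK grows as RR increases, so that the averaged stationarity measure decays at the rate O(R2/3)O(R^{-2/3}).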

Proof.

By the update rules, we have that for iPri\in P^{r},

𝐰¯r𝐰i,kr2η2K2Cf2C2Cg2,\begin{split}&\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}\leq\eta^{2}K^{2}C_{f}^{2}C_{\ell}^{2}C_{g}^{2},\end{split} (72)
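This is the standard local-drift bound; as a sketch, assuming each local step takes the form 𝐰i,kr=𝐰i,k1rηGi,kr\mathbf{w}^{r}_{i,k}=\mathbf{w}^{r}_{i,k-1}-\eta G^{r}_{i,k} starting from 𝐰i,0r=𝐰¯r\mathbf{w}^{r}_{i,0}=\bar{\mathbf{w}}^{r}, with the gradient estimator bounded in norm by CfCCgC_{f}C_{\ell}C_{g},

\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|\leq\eta\sum_{m=1}^{k}\|G^{r}_{i,m}\|\leq\eta KC_{f}C_{\ell}C_{g},

and squaring gives (72).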

and

𝐰¯kr𝐰¯r2=η~21|Pr|KiPrm=1kG¯mr2η~21Km=1KG¯mrF(𝐰¯mr)+F(𝐰¯mr)2.\begin{split}&\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r}\|^{2}=\tilde{\eta}^{2}\|\frac{1}{|P^{r}|K}\sum_{i\in P^{r}}\sum\limits_{m=1}^{k}\bar{G}^{r}_{m}\|^{2}\leq\tilde{\eta}^{2}\frac{1}{K}\sum_{m=1}^{K}\|\bar{G}^{r}_{m}-\nabla F(\bar{\mathbf{w}}^{r}_{m})+\nabla F(\bar{\mathbf{w}}^{r}_{m})\|^{2}.\end{split} (73)

Similarly, we also have

𝐰¯r1𝐰¯r2=η~21|Pr|KiPrk=1KG¯kr12η~21Kk=1KG¯kr1F(𝐰¯kr1)+F(𝐰¯kr1)2.\begin{split}&\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r}\|^{2}=\tilde{\eta}^{2}\|\frac{1}{|P^{r}|K}\sum_{i\in P^{r}}\sum_{k=1}^{K}\bar{G}^{r-1}_{k}\|^{2}\leq\tilde{\eta}^{2}\frac{1}{K}\sum_{k=1}^{K}\|\bar{G}^{r-1}_{k}-\nabla F(\bar{\mathbf{w}}^{r-1}_{k})+\nabla F(\bar{\mathbf{w}}^{r-1}_{k})\|^{2}.\end{split} (74)

Lemma F.2 yields that

1RKr,k𝔼G¯krF(𝐰¯kr)2Δ00βRK+βσ2|Pr|+2(1|Pr|iPi4L~2𝔼𝐰i,kr𝐰¯r2+4L~2𝔼𝐰¯r𝐰¯r12+1|Pr|iPr4L~2𝔼𝐰j,tr1𝐰¯r12)+2𝔼[1Rr1|Pr|KiPr,k1|𝒮1i|𝐳𝒮1i𝐮i,kr(𝐳)g(𝐰¯r,𝐳,𝐰¯r,𝒮2)2]+2𝔼[1Rr1|Pr|Kj,t1|𝒮1i|𝐳𝒮1i𝐮j,tr1(𝐳)g(𝐰¯tr1,𝐳,𝐰¯tr1,𝒮2))2],\begin{split}&\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\leq\frac{\Delta^{0}_{0}}{\beta RK}+\frac{\beta\sigma^{2}}{|P^{r}|}\\ &+2\left(\frac{1}{|P^{r}|}\sum_{i\in P^{i}}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r}_{i,k}-\bar{\mathbf{w}}^{r}\|^{2}+4\tilde{L}^{2}\mathbb{E}\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\frac{1}{|P^{r}|}\sum_{i\in P^{r}}4\tilde{L}^{2}\mathbb{E}\|\mathbf{w}^{r-1}_{j^{\prime},t^{\prime}}-\bar{\mathbf{w}}^{r-1}\|^{2}\right)\\ &+2\mathbb{E}\left[\frac{1}{R}\sum_{r}\frac{1}{|P^{r}|K}\sum_{i\in P^{r},k}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r},\mathbf{z},\bar{\mathbf{w}}^{r},\mathcal{S}_{2})\|^{2}\right]\\ &~{}~{}+2\mathbb{E}\left[\frac{1}{R}\sum_{r}\frac{1}{|P^{r}|K}\sum_{j^{\prime},t^{\prime}}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\mathbf{z})-g(\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathbf{z},\bar{\mathbf{w}}^{r-1}_{t^{\prime}},\mathcal{S}_{2}))\|^{2}\right],\end{split} (75)

which, by the settings of η\eta and β\beta, leads to

\begin{split}&\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\leq\frac{2\Delta^{0}_{0}}{\beta RK}+\frac{4\beta\sigma^{2}}{|P|}+10\beta\tilde{\eta}^{2}C_{\ell}^{2}C_{g}^{2}+\frac{1}{4R}\sum_{r}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\\ &+\frac{16}{R}\sum_{r}\frac{1}{NK}\sum_{i,k}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r};\mathbf{z},\mathcal{S}_{2})\|^{2}\\ &+\frac{32}{R}\sum_{r}\frac{1}{NK}\sum_{j^{\prime},t^{\prime}}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r-1}_{j^{\prime},t^{\prime}}(\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1})-g(\bar{\mathbf{w}}^{r-1};\hat{\mathbf{z}}^{r-1}_{j^{\prime},t^{\prime},1},\mathcal{S}_{2})\|^{2}\\ &+32C_{g}^{2}\frac{1}{R}\sum_{r}\frac{1}{K}\sum_{t^{\prime}}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{t^{\prime}}\|^{2}.\end{split}

Using Lemma F.1 yields

\begin{split}&\frac{1}{R}\sum_{r}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq\frac{16MN}{\gamma|P^{r}|}\frac{1}{R}\frac{1}{NK}\sum\limits_{i=1}^{N}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{0}_{i,0}(\mathbf{z})-g(\bar{\mathbf{w}}^{0}_{0},\mathbf{z},\bar{\mathbf{w}}^{0}_{0},\mathcal{S}_{2})\|^{2}\\ &+\frac{400M^{2}N^{2}}{\gamma^{2}|P^{r}|^{2}}\frac{1}{RK}\sum_{r,k}\tilde{L}^{2}\|\bar{\mathbf{w}}^{r}_{k-1}-\bar{\mathbf{w}}^{r}_{k}\|^{2}+150\gamma(\sigma^{2}+C_{0}^{2})+256\beta^{2}K^{2}C_{0}^{2}\\ &+128\tilde{L}^{2}\frac{|\mathcal{S}_{1}^{i}|}{\gamma}(\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2}+\|\bar{\mathbf{w}}^{r}-\bar{\mathbf{w}}^{r-1}\|^{2})\\ &+150(\gamma|\mathcal{S}_{1}^{i}|+1)\tilde{L}^{2}\frac{1}{N}\sum_{i}\|\bar{\mathbf{w}}^{r}-\mathbf{w}^{r}_{i,k}\|^{2}+32(\gamma|\mathcal{S}_{1}^{i}|+1)\tilde{L}^{2}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\mathbb{E}\|\bar{\mathbf{w}}^{r-1}-\bar{\mathbf{w}}^{r-1}_{i,k}\|^{2}.\end{split}

Combining this with the previous five inequalities and noting the parameter settings, we obtain

\begin{split}&\frac{1}{R}\sum_{r}\frac{1}{NK}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}\frac{1}{|\mathcal{S}_{1}^{i}|}\sum\limits_{\mathbf{z}\in\mathcal{S}_{1}^{i}}\mathbb{E}\|\mathbf{u}^{r}_{i,k}(\mathbf{z})-g(\bar{\mathbf{w}}^{r}_{k},\mathbf{z},\bar{\mathbf{w}}^{r}_{k},\mathcal{S}_{2})\|^{2}\\ &\leq O\bigg{(}\frac{MN}{\gamma RK|P|}+\eta^{2}\frac{M^{2}N^{2}}{\gamma^{2}|P|^{2}}\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}+\gamma+\beta^{2}K^{2}+\frac{M}{\gamma}\tilde{\eta}^{2}(\frac{1}{\beta RK}+\frac{\beta}{|P|})\\ &\quad+\gamma M\eta^{2}K^{2}+\frac{1}{R}\sum_{r}\tilde{\eta}^{2}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\bigg{)}\end{split}

and

\begin{split}&\frac{1}{RK}\sum_{r,k}\mathbb{E}\|\bar{G}^{r}_{k}-\nabla F(\bar{\mathbf{w}}^{r}_{k})\|^{2}\\ &\leq O\left(\frac{MN}{\gamma RK|P|}+\gamma+\beta^{2}K^{2}+\frac{M}{\gamma}\tilde{\eta}^{2}(\frac{1}{\beta RK}+\frac{\beta}{|P|})+\gamma M\eta^{2}K^{2}+\frac{1}{R}\sum_{r}\tilde{\eta}^{2}\|\nabla F(\bar{\mathbf{w}}^{r-1})\|^{2}\right).\end{split} (76)

Then, using the standard analysis for smooth functions, we derive

\begin{split}&F(\bar{\mathbf{w}}^{r+1})-F(\bar{\mathbf{w}}^{r})\leq\nabla F(\bar{\mathbf{w}}^{r})^{\top}(\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r})+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &=-\tilde{\eta}\nabla F(\bar{\mathbf{w}}^{r})^{\top}\left(\frac{1}{NK}\sum_{i}\sum_{k}G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r})+\nabla F(\bar{\mathbf{w}}^{r})\right)+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &=-\tilde{\eta}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\frac{\tilde{\eta}}{2}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\frac{\tilde{\eta}}{2}\|\frac{1}{NK}\sum_{i}\sum_{k}G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\\ &\quad+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq-\frac{\tilde{\eta}}{2}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\tilde{\eta}\|\frac{1}{NK}\sum_{i}\sum_{k}(G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r}_{k}))\|^{2}\\ &\quad+\tilde{\eta}\|\frac{1}{K}\sum_{k}(\nabla F(\bar{\mathbf{w}}^{r}_{k})-\nabla F(\bar{\mathbf{w}}^{r}))\|^{2}+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}\\ &\leq-\frac{\tilde{\eta}}{2}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}+\tilde{\eta}\frac{1}{K}\sum_{k}\|\frac{1}{N}\sum_{i}(G^{r}_{i,k}-\nabla F(\bar{\mathbf{w}}^{r}_{k}))\|^{2}\\ &\quad+\tilde{\eta}\frac{\tilde{L}^{2}}{K}\sum_{k}\|\bar{\mathbf{w}}^{r}_{k}-\bar{\mathbf{w}}^{r}\|^{2}+\frac{\tilde{L}}{2}\|\bar{\mathbf{w}}^{r+1}-\bar{\mathbf{w}}^{r}\|^{2}.\end{split} (77)
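The elementary inequalities behind (77) are the descent lemma for $\tilde{L}$-smooth $F$, Young's inequality $a^{\top}b\leq\frac{1}{2}\|a\|^{2}+\frac{1}{2}\|b\|^{2}$ (used to split the inner product with the gradient error), and $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$ (used to split the error terms). A quick numeric check of the latter two on random vectors, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    a, b = rng.normal(size=(2, 6))
    # Young's inequality: <a, b> <= ||a||^2/2 + ||b||^2/2
    assert a @ b <= 0.5 * (a @ a) + 0.5 * (b @ b) + 1e-12
    # ||a + b||^2 <= 2||a||^2 + 2||b||^2
    assert (a + b) @ (a + b) <= 2 * (a @ a) + 2 * (b @ b) + 1e-12
ok = True
```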

Combining this with (76), (72), (73), and (74), we derive

\begin{split}&\frac{1}{R}\sum_{r}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\leq O\left(\frac{MN}{\gamma RK|P|}+\gamma+\beta^{2}K^{2}+\frac{M}{\gamma}\tilde{\eta}^{2}(\frac{1}{\beta RK}+\frac{\beta}{|P|})+\gamma M\eta^{2}K^{2}\right).\end{split}

By setting the parameters as in the theorem, we conclude the proof. Further, to ensure $\frac{1}{R}\sum_{r}\mathbb{E}\|\nabla F(\bar{\mathbf{w}}^{r})\|^{2}\leq\epsilon^{2}$, it suffices to set $\gamma=O(\epsilon^{2})$, $\beta=O(\frac{\epsilon^{2}}{\sqrt{M}})$, $K=O(\frac{N\sqrt{M}}{|P|\epsilon})$, $\eta=O(\frac{|P|\epsilon^{2}}{NM})$, and $R=O(\frac{\sqrt{M}}{\epsilon^{3}})$. ∎
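As a sanity check on the final rate, one can plug the stated parameter settings into the right-hand side of the last displayed bound and verify that every term scales as $O(\epsilon^{2})$ when $N$ and $|P|$ are treated as constants. The sketch below assumes $\tilde{\eta}=\eta K$ and sets all absolute constants to 1; both are illustrative assumptions, not part of the proof:

```python
import math

def bound_terms(eps, M=4.0, N=8.0, P=2.0):
    """Terms of the final bound under the stated parameter settings."""
    gamma = eps ** 2
    beta = eps ** 2 / math.sqrt(M)
    K = N * math.sqrt(M) / (P * eps)
    eta = P * eps ** 2 / (N * M)
    R = math.sqrt(M) / eps ** 3
    eta_t = eta * K  # assumed relation: tilde_eta = eta * K
    return [
        M * N / (gamma * R * K * P),                                 # MN / (gamma R K |P|)
        gamma,                                                       # gamma
        beta ** 2 * K ** 2,                                          # beta^2 K^2
        (M / gamma) * eta_t ** 2 * (1 / (beta * R * K) + beta / P),  # (M/gamma) eta_t^2 (...)
        gamma * M * eta ** 2 * K ** 2,                               # gamma M eta^2 K^2
    ]

# every term is O(eps^2): the ratios term / eps^2 stay bounded as eps shrinks
for eps in (1e-1, 1e-2, 1e-3):
    assert max(t / eps ** 2 for t in bound_terms(eps)) <= 20.0
```

With these toy values of $M$, $N$, and $|P|$, the largest ratio comes from the $\beta^{2}K^{2}$ term, which is constant in $\epsilon$ as expected.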