This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Distributed Networked Real-time Learning

Alfredo Garcia, Luochao Wang, Jeff Huang, and Lingzhou Hong This work was supported by NSF ECCS-1933878 Award and Grant AFOSR-15RT0767. {alfredo.garcia, wangluochao, jeffhuang,hlz}@tamu.edu
(Texas A&M University)
Abstract

Many machine learning algorithms have been developed under the assumption that data sets are already available in batch form. Yet in many application domains data is only available sequentially overtime via compute nodes in different geographic locations. In this paper, we consider the problem of learning a model when streaming data cannot be transferred to a single location in a timely fashion. In such cases, a distributed architecture for learning relying on a network of interconnected “local” nodes is required. We propose a distributed scheme in which every local node implements stochastic gradient updates based upon a local data stream. To ensure robust estimation, a network regularization penalty is used to maintain a measure of cohesion in the ensemble of models. We show the ensemble average approximates a stationary point and characterize the degree to which individual models differ from the ensemble average. We compare the results with federated learning to conclude the proposed approach is more robust to heterogeneity in data streams (data rates and estimation quality). We illustrate the results with an application to image classification with a deep learning model based upon convolutional neural networks.

keywords: asynchronous computing, distributed computing, networks, non-convex optimization, real-time machine learning.

1 Introduction

Streaming data sets are pervasive in certain application domains often involving a network of compute nodes located in different geographic locations. However, most machine learning algorithms have been developed under the assumption that data sets are already available in batch form. When the data is obtained through a network of heterogeneous compute nodes, assembling a diverse batch of data points in a central processing location to update a model may imply significant latency. Recently, an architecture referred to as federated learning (FL, see e.g. [1, 2]) with a central server in proximity to local nodes has been proposed. In FL, each node implements updates to a machine learning model that is kept in the central server. This allows collaborative learning while keeping all the training data on nodes rather than in the cloud. In general, schemes that avoid the need to rely on the cloud for data storage and/or computation are referred to as “edge computing”.

With high data payloads, such architecture for real-time learning is subject to an accuracy vs speed trade-off due to asymmetries in data quality vs. data rates as we explain in what follows.

Consider nodes i{1,,N}i\in\{1,\dots,N\} generating data points (xi,n,yi,n),n+(x_{i,n},y_{i,n}),~{}n\in\mathbb{N}^{+} at different rates μi>0\mu_{i}>0 which are used for the instantaneous computation of model updates θk,k+\theta_{k},~{}k\in\mathbb{N}^{+} (striving to minimize loss \ell). This setting could correspond for example with supervised deep learning in real-time wherein gradient estimates (with noise variance σi2>0\sigma^{2}_{i}>0) are computed via back-propagation in a relatively fast fashion. Without complete information on σi2>0\sigma^{2}_{i}>0, updating the model parameters based upon every incoming data point yields high speed but possibly at the expense of low accuracy. For example, if the nodes producing noisier estimates are also faster at producing data, it is highly unlikely that an accurate model will be identified at all.

To illustrate this scenario, in Figure 1, we depict the performance of FL for deep convolutional neural networks with the MNIST dataset. In these simulations, each one of N=5N=5 nodes sends data according to independent Poisson processes with μ0=8\mu_{0}=8 and μi=1\mu_{i}=1, i{1,,4}i\in\{1,\dots,4\}. The fastest node computes gradient estimates based upon a single image whereas the slower nodes compute gradient estimates based upon a batch of 6464 images.

Refer to caption
Figure 1: Performance comparison for Deep Learning on MNIST with learning rate γ=0.01\gamma=0.01. The 95%95\% percentile is depicted with green lines.

This trade-off between speed and precision is mitigated in a distributed approach to real-time learning subject to a network regularization penalty. In such an approach, each one of N>1N>1 local nodes independently produce parameter updates based upon a single (locally obtained) data point which speeds up computation. Evidently, with increased noise, such a scheme may fail to enable the identification of a reasonably accurate model. However, by adding a network regularization penalty (which is computed locally) a form of coordination between multiple local nodes is induced so that the ensemble average solution is robust to noise.111Similar network regularization methods have been used in multi-task learning to account for inherent network structure in data sets (see e.g.[3, 4, 5, 6, 7, 8]). See section D. Specifically, we show that the ensemble average solution approximates a stationary point and that the approximation quality is 𝒪(i=1Nσi2N2)\mathcal{O}(\frac{\sum_{i=1}^{N}\sigma^{2}_{i}}{N^{2}}), which compares quite favorably with FL, which is highly sensitive to fast and inaccurate data streams. We illustrate the results with an application to deep learning with convolutional neural networks.

The structure of the paper is as follows. In section 2 we introduce the distributed scheme that combines stochastic gradient descent with network regularization (NR). In section 3 we analyze the scheme and show that it converges (in a certain sense) to a stationary point, we also compare its performance with FL. Finally, in section 4 we report the results from a testbed on deep learning application to image processing, and in section 5 we offer conclusions.

2 A Network Regularized Approach to Real-time Learning

2.1 Setup

We consider a setting in which data is made available sequentially overtime via nodes i{1,,N}i\in\{1,\dots,N\} in different geographic locations. We denote the ii-th stream by {(𝒙i,n,yi,n):n+}\{(\bm{x}_{i,n},y_{i,n}):n\in\mathbb{N}^{+}\} and assume these data points are independent samples from a joint distribution 𝒫i\mathcal{P}_{i}.

We also assume the data streams are independent but heterogeneous, i.e. 𝒫i𝒫j,ij\mathcal{P}_{i}\neq\mathcal{P}_{j},i\neq j. Each node strives to find a parameter specification θΘp\theta\in\Theta\subset\mathbb{R}^{p} that minimizes the performance criteria 𝔼𝒫i[(𝒙i,yi;θ)]\mathbb{E}_{\mathcal{P}_{i}}[\mathcal{L}(\bm{x}_{i},y_{i};\theta)], where the loss function ()0\mathcal{L}(\cdot)\geq 0 is continuously differentiable with respect to θ\theta. Though data is distributed and heterogeneous, we consider a setting in which nodes agree on a common learning task. This is formalized in the first standing assumption. Let gi(θ)θ(𝒙i,yi;θ)g_{i}(\theta)\triangleq\nabla_{\theta}\mathcal{L}(\bm{x}_{i},y_{i};\theta) denote the gradient evaluated at (𝒙i,yi)𝒫i(\bm{x}_{i},y_{i})\sim\mathcal{P}_{i}, and assume gi(θ)g_{i}(\theta) is uniformly integrable.

Assumption 0: For all θΘ\theta\in\Theta, and i{1,,N}i\in\{1,\dots,N\}:

𝔼𝒫i[gi(θ)]=𝔼𝒫j[gj(θ)].\mathbb{E}_{\mathcal{P}_{i}}[g_{i}(\theta)]=\mathbb{E}_{\mathcal{P}_{j}}[g_{j}(\theta)].

Let (θ)\ell(\theta) denote the (ensemble) average expected loss:

(θ)1Ni=1N𝔼𝒫i[(𝒙i,yi;θ)].\ell(\theta)\triangleq\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\mathcal{P}_{i}}[\mathcal{L}(\bm{x}_{i},y_{i};\theta)].

By uniform integrability, θ𝔼𝒫i(𝒙i,yi;θ)=𝔼𝒫igi(θ)\nabla_{\theta}\mathbb{E}_{\mathcal{P}_{i}}\mathcal{L}(\bm{x}_{i},y_{i};\theta)=\mathbb{E}_{\mathcal{P}_{i}}g_{i}(\theta). Assumption 0 thus implies that 𝔼𝒫i[gi(θ)]=(θ)\mathbb{E}_{\mathcal{P}_{i}}[g_{i}(\theta)]=\triangledown\ell(\theta) for all ii and θ\theta.

Let εi(θ)gi(θ)(θ)\varepsilon_{i}(\theta)\triangleq g_{i}(\theta)-\triangledown\ell(\theta), then it holds that 𝔼[εi(θ)]=0\mathbb{E}[\varepsilon_{i}(\theta)]=0. We further assume:

Assumption 1: For all θiΘ\theta_{i}\in\Theta, the random variables {εi(θi):i{1,,N}}\{\varepsilon_{i}(\theta_{i}):i\in\{1,\dots,N\}\} are independent and

𝔼[εi(θi)2]σi2.\mathbb{E}[\left\lVert\varepsilon_{i}(\theta_{i})\right\rVert^{2}]\leq\sigma_{i}^{2}.

Define σ2=i=1Nσi2\sigma^{2}=\sum_{i=1}^{N}\sigma_{i}^{2}. By independence of data streams:

𝔼[εi(θ)εj(θ)]=𝔼[εi(θ)]𝔼[εj(θ)]=0,\mathbb{E}[\varepsilon_{i}(\theta)^{\intercal}\varepsilon_{j}(\theta)]=\mathbb{E}[\varepsilon_{i}(\theta)]^{\intercal}\mathbb{E}[\varepsilon_{j}(\theta)]=0,

for all θΘ,j{1,,N}\{i}\theta\in\Theta,j\in\left.{\{1,\dots,N\}}\right\backslash{\{i}\}.

Streams generate data over time according to independent Poisson processes Di(t)D_{i}(t) with rate μi>0\mu_{i}>0 and Di(0)=1D_{i}(0)=1. We assume the time required to compute gradient estimates and/or exchange parameters locally among neighbors or with the central server are negligible compared to the time in between model updates. In what follows we make use of a virtual clock that produces ticks according to an aggregate counting process D(t)=i=1NDi(t)D(t)=\sum_{i=1}^{N}D_{i}(t) with rate μ=i=1Nμi\mu=\sum_{i=1}^{N}\mu_{i}. Let k+k\in\mathbb{N}^{+} denote the index set of ticks associated with the aggregate process. Since we assume the parameter is updated once a data point arrives, the kk-th iteration is completed at the kk-th tick. Index kk denotes the kk-th step in the schemes described below.

2.2 Federated Real-time Learning

In FL, gradient estimates are communicated to a central server where a model is updated as follows:

θk+1=θkγi=1N𝟏i,kgi(θk),\theta_{k+1}=\theta_{k}-\gamma\sum_{i=1}^{N}\mathbf{1}_{i,k}g_{i}(\theta_{k}), (1)

where γ\gamma is the learning rate, 𝟏i,k\mathbf{1}_{i,k} is a indicator of whether node ii performs the an update: 𝟏i,k=1\bm{1}_{i,k}=1 if the next gradient estimate comes from the ii-th stream and 𝟏i,k=0\mathbf{1}_{i,k}=0 otherwise.

The algorithmic scheme described in (1) was first analyzed in [9] for data in batch form and has been used in the recent literature on asynchronous parallel optimization algorithms (see for example [10], [11] and [12]). As Figure 1 suggests, with heterogeneous data streams, the scheme in (1) trades off speed in producing parameter updates at the expense of heterogeneous noise in gradient estimates. In what follows we introduce a distributed approach that relies on a network regularization penalty to ensure the ensemble average approximates a stationary point (i.e. a choice of parameters with null gradient). We will show that in such a networked approach the trade-off between precision and speed is mitigated.

2.3 A Distributed Approach with Network Regularization

In the NR scheme, we consider a network of local compute nodes which we model as a graph 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), where 𝒩={1,,N}\mathcal{N}=\{1,\dots,N\} stands for the set of nodes and 𝒩×𝒩\mathcal{E}\subseteq\mathcal{N}\times\mathcal{N} is the set of links connecting nodes. Let A=[αij]N×NA=[\alpha_{ij}]\in\mathbb{R}^{N\times N} be the adjacency matrix of 𝒢\mathcal{G}, where αij{0,1}\alpha_{ij}\in\{0,1\} indicates whether node ii communicates with node jj: αij=1{\alpha}_{ij}=1 if two nodes can exchange local information and αij=0{\alpha}_{ij}=0 otherwise.

In this scheme, each local node ii performs model updates according to a linear combination of local gradient estimate and the gradient of a consensus potential:

(𝜽)=14ijiαijθiθj2,\mathcal{F}(\bm{\theta})=\frac{1}{4}\sum_{i}\sum_{j\neq i}\alpha_{ij}\|\theta_{i}-\theta_{j}\|^{2},

where 𝜽t=(θ1,t,,θN,t)p×N\bm{\theta}_{t}^{\intercal}=(\theta_{1,t}^{\intercal},\dots,\theta_{N,t}^{\intercal})\in\mathbb{R}^{p\times N}. The consensus potential is a measure of similarity across local models.222This consensus potential has been used in the literature of opinion dynamics (see e.g. [13]). The update performed by node ii is of the form:

θi,k+1=θi,kγ𝟏i,k[gi(θi,k)+ai,k],\theta_{i,k+1}=\theta_{i,k}-\gamma\mathbf{1}_{i,k}[g_{i}(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}], (2)

where a>0a>0 is a regularization parameter, and

i,kθi(𝜽k)=jiαij(θi,kθj,k).\triangledown\mathcal{F}_{i,k}\triangleq\nabla_{\theta_{i}}\mathcal{F}(\bm{\theta}_{k})=\sum_{j\neq i}{\alpha}_{ij}(\theta_{i,k}-\theta_{j,k}).

Note that the basic iterate (2) can be interpreted as a stochastic gradient approach to solve the local problem:

minθi[𝔼𝒫i(θi)+a(𝜽)],\min_{\theta_{i}}[\mathbb{E}_{\mathcal{P}_{i}}(\theta_{i})+a\mathcal{F}(\bm{\theta})],

in which the objective function is a linear combination of loss and consensus potential.333This interpretation is not novel (see e.g. [14], [15] for its use in swarm (flocking) optimization and in multi-task learning [4, 5, 6, 7, 8]). When a=0a=0, each local node ignores the neighboring models. For large values of a>0a>0, the resulting dynamics reflect the countervailing effects of seeking to minimize consensus potential and improving model fit. With highly dissimilar initial models, each local node largely ignores its own data and opts for updates that lead to a model that is similar to the local average. Once approximate consensus is achieved, local gradient estimates begin to dictate the dynamics of model updates.

In what follows it will be convenient to rewrite (2) as follows:

θi,k+1=θi,kγ𝟏i,k[(θi,k)+ai,k+εi,k].\theta_{i,k+1}=\theta_{i,k}-\gamma\mathbf{1}_{i,k}\left[\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}+\varepsilon_{i,k}\right]. (3)

Given that local nodes independently update and maintain their own parameters, the network regularized scheme is not subject to the possibility of biased gradient estimates stemming from update delays in FL (see [16]).

2.4 Literature Review

The scheme proposed in (2) has already been considered in the machine learning literature. In a series of papers (see [4, 5, 6, 7, 8])), the authors consider an approach to multi-task learning based upon a network regularization penalty as in (2). This paper focuses on distributed single-task learning. In contrast to the papers referred above, we consider a non-convex setting with heterogeneous nodes asynchronously updating their respective models at different rates over time.

The scheme proposed in (2) is also related to the literature on consensus optimization (see e.g. [17], [11], [18]). However, the proposed approach can not be interpreted as being based upon averaging over local models as in consensus-based optimization. In that literature, the basic iteration is of the form:

θi,k+1=jWi,j,kθj,kγg(θi,k),\theta_{i,k+1}=\sum_{j}W_{i,j,k}\theta_{j,k}-\gamma g(\theta_{i,k}),

where 𝐖kN×N{\bf W}_{k}\in\mathbb{R}^{N\times N} is doubly stochastic and g(θi,k)g(\theta_{i,k}) is a noisy gradient estimate. Indeed one can rewrite (2) as:

θi,k+1=jWi,jθj,kγ𝟏i,kgi(θi,k),\theta_{i,k+1}=\sum_{j}W_{i,j}\theta_{j,k}-\gamma\mathbf{1}_{i,k}g_{i}(\theta_{i,k}),

with Wi,i=1γajαi,jW_{i,i}=1-\gamma a\sum_{j}\alpha_{i,j} and Wi,j=γajαi,jW_{i,j}=\gamma a\sum_{j}\alpha_{i,j}. However, the resulting matrix 𝐖{\bf W} is not doubly stochastic in general since we only require a>0a>0. Thus, the approach to consensus in (2) can not be interpreted as being based upon averaging over local models as in consensus-optimization.

The algorithms proposed in [11] and [18] are designed for batch data while our approach deals with streaming data. For example, in [11], each node uses the same mini-batch size for estimating gradients while in our approach gradient estimation noise is heterogeneous. In addition, the algorithms proposed in [11] and [18], every node is equally likely to be selected at each iteration to update its local model. In contrast, in our approach data streams are heterogeneous so that certain nodes are more likely to update their models at any given time. Finally, in [11] the objective function (loss) is defined with respect to a distribution that is biased towards the nodes that update more often. This is in contrast to the objective function defined in this paper (i.e. (θ)\ell(\theta)), where every node contributes to the global distribution with the same weight regardless of their updating frequency.

3 Analysis

In this section, we show the NR scheme converges (in a certain sense) to a stationary point. To that end we study stochastic processes {θi,k:k>0}\{\theta_{i,k}:k>0\} associated with each one of the N>1N>1 nodes in the network regularized approach. The proofs are given in the appendix. We make the following standing assumptions:

Assumption 2: The graph 𝒢\mathcal{G} corresponding to the network of nodes is undirected (A=AA=A^{\intercal}) and connected, i.e., there is a path between every pair of vertices.

Assumption 3 (Lipschitz) (θ)(θ)Lθθ\left\lVert\triangledown\ell(\theta)-\triangledown\ell(\theta^{\prime})\right\rVert\leq L\left\lVert\theta-\theta^{\prime}\right\rVert for some L>0L>0 and for all θ,θ\theta,\theta^{\prime}.

3.1 Preliminaries

The ensemble average θ¯k1Ni=1Nθi,k\bar{\theta}_{k}\triangleq\frac{1}{N}\sum\nolimits_{i=1}^{N}\theta_{i,k} plays an important role in characterizing the performance of the network regularized scheme. To this end, we analyze the process {V¯k:k>0}\{\overline{V}_{k}:k>0\} defined as

V¯k1Ni=1Nθi,kθ¯k2.\overline{V}_{k}\triangleq\frac{1}{N}\sum_{i=1}^{N}\|\theta_{i,k}-\bar{\theta}_{k}\|^{2}.

Let ei,kθi,kθ¯ke_{i,k}\triangleq\theta_{i,k}-\bar{\theta}_{k} and Vi,kei,k2V_{i,k}\triangleq\|e_{i,k}\|^{2}, then V¯k=1Ni=1NVi,k\overline{V}_{k}=\frac{1}{N}\sum_{i=1}^{N}V_{i,k}. We now introduce some additional notations. Let deg(i)\deg(i) denote the degree of vertex ii in graph 𝒢\mathcal{G} and d¯:=maxideg(i)\overline{d}:=\max_{i}\deg(i). Let 𝔼[V¯k+1|𝜽k]\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}] denote the conditional expectation of V¯k+1\overline{V}_{k+1} given θk\mathbf{\theta}_{k}. We define μmax=max{μi:1iN}\mu_{\max}=\max\{\mu_{i}:1\leq i\leq N\} and μmin=min{μi:1iN}\mu_{\min}=\min\{\mu_{i}:1\leq i\leq N\}. We first prove two intermediate results.

Lemma 1 Suppose Assumptions 0, 1 and 2 hold. It holds that

V¯k+1\displaystyle\overline{V}_{k+1} =V¯k2Ni=1Nγei,k𝟏i,k[(θi,k)+ai,k]\displaystyle=\overline{V}_{k}-\frac{2}{N}\sum_{i=1}^{N}\gamma e_{i,k}^{\intercal}\mathbf{1}_{i,k}\left[\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}\right]
2Ni=1Nγei,kεi,k𝟏i,k+1Ni=1Nγ2δi,k2,\displaystyle\qquad-\frac{2}{N}\sum_{i=1}^{N}\gamma e_{i,k}^{\intercal}\varepsilon_{i,k}\mathbf{1}_{i,k}+\frac{1}{N}\sum_{i=1}^{N}\gamma^{2}\|\delta_{i,k}\|^{2},

where δi,k=δi,kf+δi,kg+δi,kn\delta_{i,k}=\delta_{i,k}^{f}+\delta_{i,k}^{g}+\delta_{i,k}^{n}, and

δi,kf(θi,k)𝟏i,k¯k,\displaystyle\delta_{i,k}^{f}\triangleq\triangledown\ell(\theta_{i,k})\mathbf{1}_{i,k}-\triangledown\bar{\ell}_{k}, ¯k1Nj=1N(θj,k)𝟏j,k,\displaystyle~{}~{}~{}\triangledown\bar{\ell}_{k}\triangleq\frac{1}{N}\sum_{j=1}^{N}\triangledown\ell(\theta_{j,k})\mathbf{1}_{j,k},
δi,kga(i,k𝟏i,k¯k),\displaystyle\delta_{i,k}^{g}\triangleq a(\triangledown\mathcal{F}_{i,k}\mathbf{1}_{i,k}-\triangledown\bar{\mathcal{F}}_{k}), δi,knεi,k𝟏i,k1Nj=1Nεj,k𝟏j,k,\displaystyle~{}~{}~{}\delta_{i,k}^{n}\triangleq\varepsilon_{i,k}\mathbf{1}_{i,k}-\frac{1}{N}\sum_{j=1}^{N}\varepsilon_{j,k}\mathbf{1}_{j,k},
¯k\displaystyle~{}\triangledown\bar{\mathcal{F}}_{k} 1Nj=1Nj,k𝟏j,k.\displaystyle\triangleq\frac{1}{N}\sum_{j=1}^{N}\triangledown\mathcal{F}_{j,k}\mathbf{1}_{j,k}.

Lemma 2 Suppose Assumptions 0, 1, 2 and 3 hold. Let ξ=μmax/μmin\xi={\mu_{\max}}/{\mu_{\min}}, then:

𝔼[V¯k+1|𝜽k](1+κγN)V¯k+4γ2ξN(θ¯k)2+γ2ξσ2N2,\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}]\leq(1+\frac{\kappa\gamma}{N})\overline{V}_{k}+\frac{4\gamma^{2}\xi}{N}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{\gamma^{2}\xi\sigma^{2}}{N^{2}},

where λ2\lambda_{2} denotes the second-smallest of the Laplacian associated with graph 𝒢\mathcal{G} and

κ=2(Lμminaλ2μmax)+4γξN(L2+2a2d¯2).\kappa=2(L\mu_{\min}-a\lambda_{2}\mu_{\max})+\frac{4\gamma\xi}{N}(L^{2}+2a^{2}\overline{d}^{2}).

3.2 Convergence

We are now ready to state and prove the main theorem.

As in [19], convergence is described in terms of the expected value of the average squared norm of the gradient in the first KK-updates. The ensuing corollary goes into further detail by describing the same result in terms of real-time elapsed and not just on a total number of iterations.

Theorem 1: Suppose Assumptions 0, 1, 2 and 3 hold. Choose γ<\gamma<min{γ¯1,γ¯2}\min\{\bar{\gamma}_{1},\bar{\gamma}_{2}\}, where

γ¯1=N2aλ2μmaxL(2μmin+ξ/2)6ξ(L2+2a2d¯2) and\displaystyle\bar{\gamma}_{1}=N\frac{2a\lambda_{2}\mu_{\max}-L(2\mu_{\min}+{\xi}/{2})}{6\xi(L^{2}+2a^{2}\overline{d}^{2})}\text{ and } γ¯2=14L(2N+1)\displaystyle\bar{\gamma}_{2}=\frac{1}{4L(2N+1)}

are positive by choosing a>4μminL+ξL4λ2μmaxa>\frac{4\mu_{\min}L+\xi L}{4{\lambda}_{2}\mu_{\max}}. With scheme (2), it holds that

𝔼[1Kk=0K1𝔼[(θ¯k)2]]\displaystyle\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\triangledown\ell(\bar{\theta}_{k})\|^{2}]\right]
\displaystyle\leq 1ηK[(θ¯0)+LV¯0+KLγ2ξσ2N2(1+12N)],\displaystyle\frac{1}{\eta K}\left[\ell(\bar{\theta}_{0})+L\overline{V}_{0}+\frac{KL\gamma^{2}\xi\sigma^{2}}{N^{2}}(1+\frac{1}{2N})\right],

where η=γξN(122γL(2+1N))\eta=\frac{\gamma\xi}{N}\Big{(}\frac{1}{2}-2{\gamma}L(2+\frac{1}{N})\Big{)}.

The regularization penalty parameter aa must be high enough to ensure cohesion between local models. This condition is weaker with a higher degree of connectivity (i.e. higher values of λ2\lambda_{2}).

Note also that for fixed N>0N>0, when aa\to\infty, then γ1/a\gamma\propto 1/a. So convergence, as characterized by Theorem 1, may be slower. This is not necessarily the case since the conditions in Theorem 1 identify a wide range of choices for aa and γ\gamma. For example, simulations indicate that for fixed γ\gamma higher values of aa may speed up convergence (see Figure 3 (c)).

3.3 Real-time Performance

The analysis in Theorem 1 takes place in the time scale indexed by k>0k>0 and associated with the clicks associated with a Poisson process with rate μ>0\mu>0. To embed the result in Theorem 1 in real-time, recall that {D(t):t0}\{D(t):t\geq 0\} is the counting process governing the aggregation of all data streams. Given our assumption on computation times being negligible, the total number of updates completed in [0,t)[0,t) is also D(t)D(t). Let us define the conditional average squared gradient norm ¯t2\|\bar{\triangledown}\ell_{t}\|^{2} in the interval [0,t)[0,t) as follows:

𝔼[¯t2|D(t)]1D(t)k=1D(t)(θ¯k)2.\mathbb{E}[\left\lVert\bar{\triangledown}\ell_{t}\right\rVert^{2}|D(t)]\triangleq\frac{1}{D(t)}\sum_{k=1}^{D(t)}\|\triangledown\ell(\bar{\theta}_{k})\|^{2}. (4)

Hence, the result in Theorem 1 can be reinterpreted by taking expectation of (4) over D(t)D(t) as:

𝔼[¯t2]\displaystyle\mathbb{E}[\|\bar{\triangledown}\ell_{t}\|^{2}] =𝔼[𝔼[¯t2|D(t)]]\displaystyle=\mathbb{E}\Big{[}\mathbb{E}[\|\bar{\triangledown}\ell_{t}\|^{2}|D(t)]\Big{]}
𝔼[1ηD(t)((θ¯0)+LV¯0)+Lγ2ξσ2ηN2(1+12N)]\displaystyle\leq\mathbb{E}\Big{[}\frac{1}{\eta D(t)}\big{(}\ell(\bar{\theta}_{0})+L\overline{V}_{0}\big{)}+\frac{L\gamma^{2}\xi\sigma^{2}}{\eta N^{2}}\big{(}1+\frac{1}{2N}\big{)}\Big{]}
=((θ¯0)+LV¯0)(1eμt)ημt+Lγ2ξσ2ηN2(1+12N).\displaystyle=\frac{(\ell(\bar{\theta}_{0})+L\overline{V}_{0})(1-e^{-\mu t})}{\eta\mu t}+\frac{L{\gamma}^{2}\xi\sigma^{2}}{\eta N^{2}}(1+\frac{1}{2N}).

According to Theorem 1, and using γ1N\gamma\sim\frac{1}{N}, the coupling of solutions via the network regularization penalty implies the ensemble average approximates a stationary point in the sense that:

limsupt𝔼[¯t2]=𝒪(σ2N2).\lim\sup_{t\rightarrow\infty}\mathbb{E}[\|\bar{\triangledown}\ell_{t}\|^{2}]=\mathcal{O}(\frac{\sigma^{2}}{N^{2}}).

The approximation quality is monotonically increasing in the number of nodes. The convergence properties outlined above are related to the ensemble average. It is, therefore, necessary to examine the degree to which individual models differ from the ensemble average. This is the gist of the next result.

Corollary 1: With the same assumptions and definitions in Theorem 1, it holds that

𝔼[1Kk=0K1V¯k]\displaystyle\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\overline{V}_{k}\right]\leq 1K|κ|[(Nγ+4Lγξη)V¯0+4γξηl(θ¯0)]+4Lγ3ξ2σ2η|κ|N2(1+12N)+γξσ2|κ|N.\displaystyle\frac{1}{K|\kappa|}\Big{[}\big{(}\frac{N}{\gamma}+\frac{4L\gamma\xi}{\eta}\big{)}\overline{V}_{0}+\frac{4\gamma\xi}{\eta}l(\bar{\theta}_{0})\Big{]}+\frac{4L\gamma^{3}\xi^{2}\sigma^{2}}{\eta|\kappa|N^{2}}(1+\frac{1}{2N})+\frac{\gamma\xi\sigma^{2}}{|\kappa|N}.

We embed the result in Corollary 1 in real-time. Define the conditional average of V¯k\bar{V}_{k} in the interval [0,t)[0,t) as

𝔼[V¯t|D(t)]1D(t)k=1D(t)V¯k.\mathbb{E}[\bar{V}_{t}|D(t)]\triangleq\frac{1}{D(t)}\sum_{k=1}^{D(t)}\bar{V}_{k}.

The random process {V¯t:t>0}\{\bar{V}_{t}:t>0\} tracks the average distance of individual models to the ensemble average. Similar to the discussion of Theorem 1, the real-time result of Corollary 1 is as follows:

𝔼[V¯t]=\displaystyle\mathbb{E}[\bar{V}_{t}]= 𝔼[𝔼[V¯t|D(t)]]\displaystyle\mathbb{E}\Big{[}\mathbb{E}[\bar{V}_{t}|D(t)]\Big{]}
\displaystyle\leq 1D(t)|κ|[(Nγ+4Lγξη)V¯0+4γξηl(θ¯0)]+4Lγ3ξ2σ2η|κ|N2(1+12N)+γξσ2|κ|N\displaystyle\frac{1}{D(t)|\kappa|}\Big{[}\big{(}\frac{N}{\gamma}+\frac{4L\gamma\xi}{\eta}\big{)}\overline{V}_{0}+\frac{4\gamma\xi}{\eta}l(\bar{\theta}_{0})\Big{]}+\frac{4L\gamma^{3}\xi^{2}\sigma^{2}}{\eta|\kappa|N^{2}}(1+\frac{1}{2N})+\frac{\gamma\xi\sigma^{2}}{|\kappa|N}
=\displaystyle= 1eμtμt|κ|[(Nγ+4Lγξη)V¯0+4γξηl(θ¯0)]++4Lγ3ξ2σ2η|κ|N2(1+12N)+γξσ2|κ|N.\displaystyle\frac{1-e^{-\mu t}}{\mu t|\kappa|}\Big{[}\big{(}\frac{N}{\gamma}+\frac{4L\gamma\xi}{\eta}\big{)}\overline{V}_{0}+\frac{4\gamma\xi}{\eta}l(\bar{\theta}_{0})\Big{]}++\frac{4L\gamma^{3}\xi^{2}\sigma^{2}}{\eta|\kappa|N^{2}}(1+\frac{1}{2N})+\frac{\gamma\xi\sigma^{2}}{|\kappa|N}.

This implies the asymptotic difference between individual models and the ensemble average satisfies:

limsupt𝔼[V¯t]=𝒪(σ2N).\lim\sup_{t\to\infty}\mathbb{E}[\bar{V}_{t}]=\mathcal{O}(\frac{\sigma^{2}}{N}).

The network regularization parameter a>0a>0 plays an important role in controlling the upper bound of in Corollary 1. For fixed N>0N>0, when aa\to\infty, then γ,η1/a\gamma,\eta\propto 1/a and |κ|a2|\kappa|\propto a^{2}, it follows that 𝔼[1Kk=0K1V¯k]1/a\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\overline{V}_{k}\right]\propto 1/a. Hence, the upper bound in Corollary 1 can be made arbitrarily small by choosing large enough aa.

3.4 Comparison to Federated Learning

We now present the counterpart convergence result regarding to FL.

Proposition 1: Suppose Assumptions 0, 1, 2 and 3 hold. For scheme 1, with a choice γ(0,2L)\gamma\in(0,\frac{2}{L}), it holds that:

𝔼[1Kk=0K1(θk)2](θ0)η~K+Lγ22η~i=1Nμiμσi2,\displaystyle\mathbb{E}\Big{[}\frac{1}{K}\sum_{k=0}^{K-1}\|\triangledown\ell(\theta_{k})\|^{2}\Big{]}\leq\frac{\ell(\theta_{0})}{\tilde{\eta}K}+\frac{L\gamma^{2}}{2\tilde{\eta}}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}\sigma_{i}^{2},

with η~=\tilde{\eta}=γ(1Lγ2)\gamma(1-\frac{L\gamma}{2}).

To embed the process in Proposition 1 in real-time, let us define the average squared gradient norm ~t2\|\triangledown\tilde{\ell}_{t}\|^{2} in the interval [0,t)[0,t) as follows:

𝔼[~t2|D(t)]1D(t)k=1D(t)(θk)2.\begin{split}\mathbb{E}[\|\triangledown\tilde{\ell}_{t}\|^{2}|D(t)]\triangleq\frac{1}{D(t)}\sum_{k=1}^{D(t)}\|\triangledown\ell(\theta_{k})\|^{2}.\end{split} (5)

Hence, the result in Proposition 1 can be reinterpreted by taking expectation of (5) over D(t)D(t) as:

𝔼[~t2]\displaystyle\mathbb{E}[\|\triangledown\tilde{\ell}_{t}\|^{2}] =𝔼[𝔼[~t2|D(t)]]𝔼[(θ0)η~D(t)+Lγ22η~i=1Nμiμσi2]=(θ0)(1eμt)η~μt+Lγ22η~i=1Nμiμσi2.\displaystyle=\mathbb{E}\Big{[}\mathbb{E}[\|\triangledown\tilde{\ell}_{t}\|^{2}|D(t)]\Big{]}\leq\mathbb{E}\Big{[}\frac{\ell(\theta_{0})}{\tilde{\eta}D(t)}+\frac{L{\gamma}^{2}}{2\tilde{\eta}}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}\sigma_{i}^{2}\Big{]}=\frac{\ell(\theta_{0})(1-e^{-{\mu}t})}{\tilde{\eta}{\mu}t}+\frac{L{\gamma}^{2}}{2\tilde{\eta}}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}\sigma_{i}^{2}.

To compare FL with NR, we also make γ1N\gamma\sim\frac{1}{N}. The asymptotic approximation quality is given by:

limsupt𝔼[~t2]=O(1Niμiμσi2),\lim\sup_{t\rightarrow\infty}\mathbb{E}[\|\triangledown\tilde{\ell}_{t}\|^{2}]=O(\frac{1}{N}\sum_{i}\frac{\mu_{i}}{\mu}\sigma^{2}_{i}),

which suggests that the approximation quality is determined by the faster data streams. This leads to unsatisfactory performance whenever μiσi2\mu_{i}\propto\sigma^{2}_{i} (i.e. faster data streams are also less accurate). Evidently, the opposite holds true when faster nodes are also more accurate, i.e. μi1/σi2\mu_{i}\propto 1/\sigma^{2}_{i}. However, in many real-time machine learning applications, this is not likely to be the case. Obtaining higher precision gradient estimates requires larger batches and/or increased computation. Thus nodes with higher precision are less likely to be the faster ones.

4 Testbed: Realtime Deep Learning

In this section, we report the results of NR (scheme (2)) to distributed real-time learning from three aspects: the comparison with FL (scheme (1)), the effects of the regularization parameter aa, and the effects of the network connectivity.

The specific learning task is to classify handwritten digits between 0 and 99 digits as given in the MNIST data set [20]. The dataset is composed of 1000010000 testing items and 60,00060,000 training items. Each item in the dataset is a black-and-white (single-channel) image of 28 ×\times 28 pixels of a handwritten digit between 0 and 99.

In the first two experiments, we implement schemes in a heterogeneous setting with 55 nodes, and the third experiment with 2020 nodes in a homogeneous setting. In the test-bed MNIST streams according to independent Poisson processes. Gradient estimates are obtained with different mini-batch sizes. Evidently, a smaller mini-batch size implies noisier gradient estimates. The detailed experimental settings are summarized in Table 1. In the heterogeneous setting, “node 0” is the fastest and noisiest in producing gradient estimates.

Setting Stream ID # Nodes Mini-batch Size μi\mu_{i}
D0D_{0} 1 1 8
Heterogeneous D1D4D_{1}-D_{4} 4 64 1
Homogeneous All streams 20 4 1

Table 1.The experiment hyperparameters of the two settings, including the data stream ID (Stream ID), number of nodes involved (# Nodes), the number of images arrived as a mini-batch (Mini-batch Size), and the Poisson rate of the corresponding stream is (μi\mu_{i}).

We use the Ray platform (see [21]) which is a popular library with shared memory supported, allowing information exchange between local nodes without copying as well as avoiding a central bottleneck. For low-level computation, Google TensorFlow is used. We use a Convolutional Neural Network (CNN) with two 2D Convolutions each with kernel size 5×55\times 5, stride 1 and 32, 64 filters. Each convolution layer is followed by a Max-pooling with a 2×22\times 2 filter and stride of 2. These layers are then followed by a Dense Layer with 256 neurons with 0.5 dropout and sigmoid activation followed by 10 output neurons and Softmax operation. Cross entropy is used as a performance measure (i.e. loss).444 With max-pooling the loss function is not differentiable in a set of measure zero. If in the course of execution a non-differentiable point is encountered, Tensorflow assumes a zero derivative. Details on the implementation are available at: https://github.com/wangluochao902/Network-Regularized-Approach.

We present the experimental results in mean plots with stand error bar. The means are computed across 1010 trials under the same hyper-parameters (namely, γ\gamma and aa).

4.1 Comparison to Federated Learning

In this experiment, we compare NR with FL in the heterogeneous setting. In Figure 2, we plot the means of the ensemble average of NR and FL with different learning rates.

Refer to caption
(a)
Refer to caption
(b)

Figure 2. The mean plot of ensemble average computed under the schemes of NR and FL in heterogeneous setting. The parameter aa is set to 1010 and the network is fully connected. The learning rate γ\gamma is set to 0.0020.002 in (a) and 0.0040.004 in (b).

We can observe from Figure (a) that when the learning rate is moderate, both FL and NR can converge, but the empirical standard deviation of FL is much larger than that of NR. With increased γ\gamma, FL fails to converge while NR still performs relatively well, as shown in Figure (b). We can see that NR is more robust with respect to the learning rate.

4.2 The Effects of Regularization Parameter

In this experiment, we look at the effects of changing the regularization parameter aa. In Figure 3, we present the means of each node as well as the ensemble average.

Refer to caption
(c)
Refer to caption
(d)
Refer to caption
(e)

Figure 3. The mean plot of each node computed under the scheme of NR in heterogeneous setting. The parameter γ\gamma is set to 0.010.01 and the network is fully connected. The regularization parameter aa is set to 11 in (a) and 1010 in (b). The mean plot of the ensemble average under two choices of aa is presented in (c).

As we increase aa from 11 to 1010, we can observe from Figure 3 (a) and (b) that the consensus among nodes increases and the empirical mean standard deviation of the “node 0” decreases. As presented in Corollary 1, the regularization parameter aa influences the degree of similarity between individual models and the ensemble average. Note that we only identify a range of values for aa (lower bound) and γ\gamma (upper bound) for which convergence is guaranteed so that a higher value of aa does not necessarily imply slower convergence, as shown in Figure 3 (c).

4.3 The Effects of Network Connectivity

In the third experiment, we check the effect of increased connectivity in the homogeneous setting by using a Watts-Strogatz “small world” topology (see [22]), in which each node is connected with 2 (or 8) nearest neighbors.

Refer to caption
(f)

Figure 4. The mean plot of ensemble average computed under the scheme of NR in the homogeneous setting. The parameter aa is set to 1010 and the learning rate γ\gamma is set to 0.0010.001.


We can see from Figure 4 that increasing the connectivity of the topology only improves the performance slightly, meaning that only a limited connectivity is needed for the network regularized approach to enjoy a satisfactory rate of convergence.

5 Conclusions

In many application domains, data streams through a network of heterogeneous nodes in different geographic locations. When there is high data payload (e.g. high-resolution video), assembling a diverse batch of data points in a central processing location in order to update a model entails significant latency. In such cases, a distributed architecture for learning relying on a network of interconnected “local” nodes may prove advantageous. We have analyzed a distributed scheme in which every local node implements stochastic gradient updates every time a data point is obtained. To ensure robust estimation, a local regularization penalty is used to maintain a measure of cohesion in the ensemble of models. We show the ensemble average approximates a stationary point. The approximation quality is superior to that of FL, especially when there is heterogeneity in gradient estimation quality. We also show that our approach is robust against changes in the learning rate and network connectivity. We illustrate the results with an application to deep learning with convolutional neural networks.

In future work we plan to study different localized model averaging schemes. A careful selection of weights for computing local average model ensures a reduction of estimation variance. This is motivated by the literature on the optimal combination of forecasts (see [23]). For example, weights minimizing the sample mean square prediction error are of the form σ^i2j=1Nσ^j2\frac{\hat{\sigma}^{-2}_{i}}{\sum_{j=1}^{N}\hat{\sigma}^{-2}_{j}} where σ^i2\hat{\sigma}^{2}_{i} is the estimated mean squared prediction error of the ii-th model.

6 Appendix

Proof of Lemma 1

Note that

θ¯k+1\displaystyle\bar{\theta}_{k+1} =1Ni=1N[θi,kγ𝟏i,k[(θi,k)+ai,k+εi,k]]\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left[\theta_{i,k}-\gamma\mathbf{1}_{i,k}[\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}+\varepsilon_{i,k}]\right]
=θ¯kγNi=1N(θi,k)𝟏i,kaγNi=1Ni,k𝟏i,kγNi=1Nεi,k𝟏i,k.\displaystyle=\bar{\theta}_{k}-\frac{\gamma}{N}\sum_{i=1}^{N}\triangledown\ell(\theta_{i,k})\mathbf{1}_{i,k}-\frac{a\gamma}{N}\sum_{i=1}^{N}\triangledown\mathcal{F}_{i,k}\mathbf{1}_{i,k}-\frac{\gamma}{N}\sum_{i=1}^{N}\varepsilon_{i,k}\mathbf{1}_{i,k}.

Hence, ei,k+1=θi,k+1θ¯k+1=ei,kγδi,ke_{i,k+1}=\theta_{i,k+1}-\bar{\theta}_{k+1}=e_{i,k}-\gamma\delta_{i,k}. Then

Vi,k+1\displaystyle V_{i,k+1} =(ei,kγδi,k)(ei,kγδi,k)\displaystyle=(e_{i,k}-\gamma\delta_{i,k})^{\intercal}(e_{i,k}-\gamma\delta_{i,k})
=ei,kei,k2γei,kδi,k+γ2δi,k2\displaystyle=e_{i,k}^{\intercal}e_{i,k}-2\gamma e_{i,k}^{\intercal}\delta_{i,k}+\gamma^{2}\left\lVert\delta_{i,k}\right\rVert^{2}
=Vi,k2γei,k(δi,kf+δi,kg+δi,kn)+γ2δi,k2,\displaystyle=V_{i,k}-2\gamma e_{i,k}^{\intercal}(\delta_{i,k}^{f}+\delta_{i,k}^{g}+\delta_{i,k}^{n})+\gamma^{2}\left\lVert\delta_{i,k}\right\rVert^{2},

and

V¯k+1=V¯k2γNi=1Nei,k(δi,kf+δi,kg+δi,kn)+γ2Ni=1Nδi,k2.\overline{V}_{k+1}=\overline{V}_{k}-\frac{2\gamma}{N}\sum_{i=1}^{N}e_{i,k}^{\intercal}(\delta_{i,k}^{f}+\delta_{i,k}^{g}+\delta_{i,k}^{n})+\frac{\gamma^{2}}{N}\sum_{i=1}^{N}\|\delta_{i,k}\|^{2}.

Finally, note that

i=1Nei,kδi,kf\displaystyle\sum_{i=1}^{N}e_{i,k}^{\intercal}\delta_{i,k}^{f} =i=1Nei,k[(θi,k)𝟏i,k¯k]=i=1Nei,k(θi,k)𝟏i,k,\displaystyle=\sum_{i=1}^{N}e_{i,k}^{\intercal}\left[\triangledown\ell(\theta_{i,k})\mathbf{1}_{i,k}-\triangledown\bar{\ell}_{k}\right]=\sum_{i=1}^{N}e_{i,k}^{\intercal}\triangledown\ell(\theta_{i,k})\mathbf{1}_{i,k},
i=1Nei,kδi,kg\displaystyle\sum_{i=1}^{N}e_{i,k}^{\intercal}\delta_{i,k}^{g} =ai=1Nei,k(i,k𝟏i,k¯k)=ai=1Nei,ki,k𝟏i,k,\displaystyle=a\sum_{i=1}^{N}e_{i,k}^{\intercal}(\triangledown\mathcal{F}_{i,k}\mathbf{1}_{i,k}-\triangledown\bar{\mathcal{F}}_{k})=a\sum_{i=1}^{N}e_{i,k}^{\intercal}\triangledown\mathcal{F}_{i,k}\mathbf{1}_{i,k},
i=1Nei,kδi,kn\displaystyle\sum_{i=1}^{N}e_{i,k}^{\intercal}\delta_{i,k}^{n} =i=1Nei,k(εi,k𝟏i,k1Nj=1Nεj,k𝟏j,k)=i=1Nei,kεi,k𝟏i,k.\displaystyle=\sum_{i=1}^{N}e_{i,k}^{\intercal}(\varepsilon_{i,k}\mathbf{1}_{i,k}-\frac{1}{N}\sum_{j=1}^{N}\varepsilon_{j,k}\mathbf{1}_{j,k})=\sum_{i=1}^{N}e_{i,k}^{\intercal}\varepsilon_{i,k}\mathbf{1}_{i,k}.

So the result follows. \blacksquare

Proof of Lemma 2

In light of Lemma 1 we have:

𝔼[V¯k+1|𝜽k]\displaystyle\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}] =\displaystyle= V¯k2γNi=1Nμiμei,k[(θi,k)+ai,k]+γ2Ni=1Nδi,k2.\displaystyle\overline{V}_{k}-\frac{2\gamma}{N}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}e_{i,k}^{\intercal}\left[\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}\right]+\frac{\gamma^{2}}{N}\sum_{i=1}^{N}\|\delta_{i,k}\|^{2}.

Let 𝐞k=[e1,k,e2,k,,eN,k]\mathbf{e}_{k}=\left[\begin{array}[]{cccc}e_{1,k}^{\intercal},e_{2,k}^{\intercal},\cdots,e_{N,k}^{\intercal}\end{array}\right]^{\intercal} and =[lij]\mathcal{L}=[l_{ij}] be the Laplacian matrix associated with the adjacency matrix AA, where lii=jaijl_{ii}=\sum_{j}a_{ij} and lij=aijl_{ij}=-a_{ij} when iji\neq j. For an undirected graph, the Laplacian matrix is symmetric positive semi-definite. It follows that

i=1Nei,ki,k\displaystyle\sum_{i=1}^{N}e_{i,k}^{\intercal}\triangledown\mathcal{F}_{i,k} =i=1Nj=1,jiNαijei,k(ei,kej,k)=i=1NjiNlijei,k(ei,kej,k)\displaystyle=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\alpha_{ij}e_{i,k}^{\intercal}(e_{i,k}-e_{j,k})=-\sum_{i=1}^{N}\sum_{j\neq i}^{N}l_{ij}e_{i,k}^{\intercal}(e_{i,k}-e_{j,k})
=i=1NjiNlijei,kej,k=𝐞k(Ip)𝐞kλ2i=1Nei,k2,\displaystyle=\sum_{i=1}^{N}\sum_{j\neq i}^{N}l_{ij}e_{i,k}^{\intercal}e_{j,k}=\mathbf{e}_{k}^{\intercal}(\mathcal{L}\otimes I_{p})\mathbf{e}_{k}\geq\lambda_{2}\sum_{i=1}^{N}\left\|e_{i,k}\right\|^{2},

where λ2:=λ2()\lambda_{2}:=\lambda_{2}(\mathcal{L}) is the second-smallest eigenvalue of \mathcal{L}, also called the algebraic connectivity of 𝒢\mathcal{G} [24]. Thus,

𝔼[V¯k+1|𝜽k]V¯k2γNi=1Nμiμei,k(θi,k)2aλ2γNi=1Nμiμei,k2+γ2Ni=1N𝔼[δi,k2|𝜽k]=V¯k2γNi=1Nμiμ((θi,k)(θ¯k))ei,k2aλ2γNi=1Nμiμei,k2+γ2Ni=1N𝔼[δi,k2|𝜽k].\begin{split}\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}]\leq&\overline{V}_{k}-\frac{2\gamma}{N}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}e_{i,k}^{\intercal}\triangledown\ell(\theta_{i,k})-\frac{2a\lambda_{2}\gamma}{N}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}\left\|e_{i,k}\right\|^{2}+\frac{\gamma^{2}}{N}\sum_{i=1}^{N}\mathbb{E}[\|\delta_{i,k}\|^{2}|\bm{\theta}_{k}]\\ =&\overline{V}_{k}-\frac{2\gamma}{N}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}(\triangledown\ell(\theta_{i,k})-\triangledown\ell(\bar{\theta}_{k}))^{\intercal}e_{i,k}-\frac{2a\lambda_{2}\gamma}{N}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}\left\|e_{i,k}\right\|^{2}+\frac{\gamma^{2}}{N}\sum_{i=1}^{N}\mathbb{E}[\|\delta_{i,k}\|^{2}|\bm{\theta}_{k}].\end{split}

By Cauchy–Schwarz inequality and Assumption 33, we can obtain that

((θi,k)(θ¯k))ei,k\displaystyle-(\triangledown\ell(\theta_{i,k})-\triangledown\ell(\bar{\theta}_{k}))^{\intercal}e_{i,k} (θi,k)(θ¯k)ei,kLei,k2.\displaystyle\leq\left\lVert\triangledown\ell(\theta_{i,k})-\triangledown\ell(\bar{\theta}_{k})\right\rVert\left\lVert e_{i,k}\right\rVert\leq L\left\|e_{i,k}\right\|^{2}.

Define μ¯=μ/N\bar{\mu}=\mu/N, and by the inequalities μminNμ¯μiμμmaxNμ¯\frac{\mu_{\min}}{N\bar{\mu}}\leq\frac{\mu_{i}}{\mu}\leq\frac{\mu_{\max}}{N\bar{\mu}}, we can obtain

𝔼[V¯k+1|𝜽k](1+2γNμ¯(Lμmaxaλ2μmin))V¯k+γ2Ni=1N𝔼[δi,k2|𝜽k].\begin{split}\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}]\leq&(1+\frac{2\gamma}{N\bar{\mu}}(L\mu_{\max}-a\lambda_{2}\mu_{min}))\overline{V}_{k}+\frac{\gamma^{2}}{N}\sum_{i=1}^{N}\mathbb{E}[\left\lVert\delta_{i,k}\right\rVert^{2}|\bm{\theta}_{k}].\end{split} (6)

We now simplify the last term in the right hand side of (6). First we note that:

𝔼[δi,k2|𝜽k]=𝔼[δi,kf+δi,kg2|𝜽k]+𝔼[δi,kn2|𝜽k].\mathbb{E}[\|\delta_{i,k}\|^{2}|\bm{\theta}_{k}]=\mathbb{E}[\|\delta_{i,k}^{f}+\delta_{i,k}^{g}\|^{2}|\bm{\theta}_{k}]+\mathbb{E}[\|\delta_{i,k}^{n}\|^{2}|\bm{\theta}_{k}]. (7)

The first term in the right hand side of (7) can be further described as follows:

γ2𝔼[δi,kf+δi,kg2|𝜽k]\displaystyle\gamma^{2}\mathbb{E}[\|\delta_{i,k}^{f}+\delta_{i,k}^{g}\|^{2}|\bm{\theta}_{k}]
=\displaystyle= γ2𝔼[(θi,k)𝟏i,k¯k+a(i,k𝟏i,k¯k)2|𝜽k]\displaystyle\gamma^{2}\mathbb{E}\Big{[}\left.\left\|\triangledown\ell(\theta_{i,k})\mathbf{1}_{i,k}-\triangledown\bar{\ell}_{k}+a(\triangledown\mathcal{F}_{i,k}\mathbf{1}_{i,k}-\triangledown\bar{\mathcal{F}}_{k})\right\|^{2}\right|\bm{\theta}_{k}\Big{]}
=\displaystyle= γ2𝔼[(11N)[(θi,k)+ai,k]𝟏i,k+1NjiN[(θj,k)+aj,k]𝟏j,k2|𝜽k]\displaystyle\gamma^{2}\mathbb{E}\Big{[}\|(1-\frac{1}{N})[\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}]\mathbf{1}_{i,k}+\frac{1}{N}\sum_{j\neq i}^{N}[\triangledown\ell(\theta_{j,k})+a\triangledown\mathcal{F}_{j,k}]\mathbf{1}_{j,k}\|^{2}|\bm{\theta}_{k}\Big{]}
=\displaystyle= γ2N[(11N)2μiμ¯(θi,k)+ai,k2+1N2jiNμjμ¯(θj,k)+aj,k2].\displaystyle\frac{\gamma^{2}}{N}\Big{[}(1-\frac{1}{N})^{2}\frac{\mu_{i}}{\bar{\mu}}\left\|\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}\right\|^{2}+\frac{1}{N^{2}}\sum_{j\neq i}^{N}\frac{\mu_{j}}{\bar{\mu}}\left\|\triangledown\ell(\theta_{j,k})\mathbf{+}a\triangledown\mathcal{F}_{j,k}\right\|^{2}\Big{]}.

This leads to:

i=1Nγ2𝔼[δi,kf+δi,kg2|𝜽k]\displaystyle\sum_{i=1}^{N}\gamma^{2}\mathbb{E}[\|\delta_{i,k}^{f}+\delta_{i,k}^{g}\|^{2}|\bm{\theta}_{k}] γ2ξN[(11N)2i=1N(θi,k)+ai,k2+1N2i=1NjiN(θj,k)+aj,k2]\displaystyle\leq\frac{\gamma^{2}\xi}{N}\Big{[}(1-\frac{1}{N})^{2}\sum_{i=1}^{N}\left\|\triangledown\ell(\theta_{i,k})\mathbf{+}a\triangledown\mathcal{F}_{i,k}\right\|^{2}+\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j\neq i}^{N}\left\|\triangledown\ell(\theta_{j,k})\mathbf{+}a\triangledown\mathcal{F}_{j,k}\right\|^{2}\Big{]}
γ2ξNi=1N(θi,k)+ai,k2.\displaystyle\leq\frac{\gamma^{2}\xi}{N}\sum_{i=1}^{N}\left\|\triangledown\ell(\theta_{i,k})\mathbf{+}a\triangledown\mathcal{F}_{i,k}\right\|^{2}. (8)

Finally,

γ2i=1N𝔼[δi,kn2|𝜽k]\displaystyle\gamma^{2}\sum_{i=1}^{N}\mathbb{E}[\|\delta_{i,k}^{n}\|^{2}|\bm{\theta}_{k}] =γ2i=1N𝔼[(11N)εi,k𝟏i,k1NjiNεj,k𝟏j,k2|𝜽k]\displaystyle=\gamma^{2}\sum_{i=1}^{N}\mathbb{E}\Big{[}\|(1-\frac{1}{N})\varepsilon_{i,k}\mathbf{1}_{i,k}-\frac{1}{N}\sum_{j\neq i}^{N}\varepsilon_{j,k}\mathbf{1}_{j,k}\|^{2}|\bm{\theta}_{k}\Big{]}
=γ2N[(11N)2i=1Nμiμ¯𝔼[εi,k2|𝜽k]+1N2i=1NjiNμiμ¯𝔼[εj,k2|𝜽k]]\displaystyle=\frac{\gamma^{2}}{N}\Big{[}(1-\frac{1}{N})^{2}\sum_{i=1}^{N}\frac{\mu_{i}}{\bar{\mu}}\mathbb{E}[\|\varepsilon_{i,k}\|^{2}|\bm{\theta}_{k}]+\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j\neq i}^{N}\frac{\mu_{i}}{\bar{\mu}}\mathbb{E}[\|\varepsilon_{j,k}\|^{2}|\bm{\theta}_{k}]\Big{]}
γ2N(μmaxμmin)i=1N𝔼[εi,k2|𝜽k]γ2ξσ2N.\displaystyle\leq\frac{\gamma^{2}}{N}\big{(}\frac{\mu_{\max}}{\mu_{\min}}\big{)}\sum_{i=1}^{N}\mathbb{E}[\|\varepsilon_{i,k}\|^{2}|\bm{\theta}_{k}]\leq\frac{\gamma^{2}\xi\sigma^{2}}{N}. (9)

We use inequalities (8) and (9) with (7) to obtain an upper bound of (6) as follows:

𝔼[V¯k+1|𝜽k](1+2γN(Lμminaλ2μmax))V¯k+γ2ξN2i=1N(θi,k)+ai,k2+γ2ξσ2N2.\begin{split}\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}]\leq&(1+\frac{2\gamma}{N}(L\mu_{\min}-a\lambda_{2}\mu_{\max}))\overline{V}_{k}+\frac{\gamma^{2}\xi}{N^{2}}\sum_{i=1}^{N}\|\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}\|^{2}+\frac{\gamma^{2}\xi\sigma^{2}}{N^{2}}.\end{split} (10)

Finally, we analyze the third term on the right hand side of (10). By Parallellogram law

(θi,k)+ai,k2\displaystyle\|\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}\|^{2} =2(θi,k)2+2ai,k2(θi,k)ai,k2\displaystyle=2\|\triangledown\ell(\theta_{i,k})\|^{2}+2\|a\triangledown\mathcal{F}_{i,k}\|^{2}-\|\triangledown\ell(\theta_{i,k})-a\triangledown\mathcal{F}_{i,k}\|^{2}
2(θi,k)2+2ai,k2\displaystyle\leq 2\|\triangledown\ell(\theta_{i,k})\|^{2}+2\|a\triangledown\mathcal{F}_{i,k}\|^{2}

In addition,

i,k2\displaystyle\|\triangledown\mathcal{F}_{i,k}\|^{2} =deg(i)2j=1,jiNαij(θi,kθj,t)deg(i)2\displaystyle=\deg(i)^{2}\left\|\sum_{j=1,j\neq i}^{N}\frac{\alpha_{ij}(\theta_{i,k}-\theta_{j,t})}{\deg(i)}\right\|^{2}
deg(i)j=1,jiNαijθi,kθj,k2d¯j=1,jiNαijθi,kθj,k2.\displaystyle\leq\deg(i)\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left\|\theta_{i,k}-\theta_{j,k}\right\|^{2}\leq\bar{d}\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left\|\theta_{i,k}-\theta_{j,k}\right\|^{2}.

which implies

i=1Ni,k2\displaystyle\sum_{i=1}^{N}\|\triangledown\mathcal{F}_{i,k}\|^{2} d¯i=1Nj=1,jiNαijθi,kθj,k2=d¯i=1NjiNαijei,kej,k2\displaystyle\leq\bar{d}\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left\|\theta_{i,k}-\theta_{j,k}\right\|^{2}=\bar{d}\sum_{i=1}^{N}\sum_{j\neq i}^{N}\alpha_{ij}\left\|e_{i,k}-e_{j,k}\right\|^{2}
2d¯i=1NjiNαij(ei,k2+ej,k2)4d¯2i=1Nei,k2=4Nd¯2V¯k.\displaystyle\leq 2\bar{d}\sum_{i=1}^{N}\sum_{j\neq i}^{N}\alpha_{ij}(\left\|e_{i,k}\right\|^{2}+\left\|e_{j,k}\right\|^{2})\leq 4\bar{d}^{2}\sum_{i=1}^{N}\left\|e_{i,k}\right\|^{2}=4N\bar{d}^{2}\overline{V}_{k}.

Thus,

i=1N(θi,k)+ai,k22i=1N(θi,k)2+8a2Nd¯2V¯k=4N(θ¯k)2+4i=1N(θi,k)(θ¯t)2+8a2Nd¯2V¯k4N(θ¯k)2+4N(L2+2a2d¯2)V¯k.\begin{split}\sum_{i=1}^{N}\|\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}\|^{2}&\leq 2\sum_{i=1}^{N}\|\triangledown\ell(\theta_{i,k})\|^{2}+8a^{2}N\overline{d}^{2}\overline{V}_{k}\\ &=4N\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+4\sum_{i=1}^{N}\left\|\triangledown\ell(\theta_{i,k})-\triangledown\ell(\bar{\theta}_{t})\right\|^{2}+8a^{2}N\overline{d}^{2}\overline{V}_{k}\\ &\leq 4N\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+4N(L^{2}+2a^{2}\overline{d}^{2})\overline{V}_{k}.\end{split} (11)

The result follows by using the previous inequality to obtain an upper bound for the right hand side of (10). \blacksquare

Proof of Theorem 1

By Taylor expansion and Lipschitz assumption:

(θ¯k+1)\displaystyle\ell(\bar{\theta}_{k+1})\leq (θ¯k)+(θ¯k)(θ¯k+1θ¯k)+L2θ¯k+1θ¯k2\displaystyle\ell(\bar{\theta}_{k})+\triangledown\ell(\bar{\theta}_{k})^{\intercal}(\bar{\theta}_{k+1}-\bar{\theta}_{k})+\frac{L}{2}\left\|\bar{\theta}_{k+1}-\bar{\theta}_{k}\right\|^{2}
=\displaystyle= (θ¯k)γNi=1N(θ¯k)(θi,k)𝟏i,kaγNi=1N(θ¯k)Ti,k𝟏i,k\displaystyle\ell(\bar{\theta}_{k})-\frac{\gamma}{N}\sum_{i=1}^{N}\triangledown\ell(\bar{\theta}_{k})^{\intercal}\triangledown\ell(\theta_{i,k})\mathbf{1}_{i,k}-\frac{a\gamma}{N}\sum_{i=1}^{N}\triangledown\ell(\bar{\theta}_{k})^{T}\triangledown\mathcal{F}_{i,k}\mathbf{1}_{i,k}
γNi=1N(θ¯k)Tεi,k𝟏i,k+L2θ¯k+1θ¯k2.\displaystyle-\frac{\gamma}{N}\sum_{i=1}^{N}\triangledown\ell(\bar{\theta}_{k})^{T}\varepsilon_{i,k}\mathbf{1}_{i,k}+\frac{L}{2}\left\|\bar{\theta}_{k+1}-\bar{\theta}_{k}\right\|^{2}.

Since i=1Ni,k=0\sum_{i=1}^{N}\triangledown\mathcal{F}_{i,k}=0, it follows that

E[(θ¯k+1)|𝜽k]\displaystyle E[\left.\ell(\bar{\theta}_{k+1})\right|\bm{\theta}_{k}]\leq (θ¯k)γξN2i=1N(θ¯k)T(θi,k)+L2E[θ¯k+1θ¯k2|𝜽k]\displaystyle\ell(\bar{\theta}_{k})-\frac{\gamma\xi}{N^{2}}\sum_{i=1}^{N}\triangledown\ell(\bar{\theta}_{k})^{T}\triangledown\ell(\theta_{i,k})+\frac{L}{2}E[\left\|\bar{\theta}_{k+1}-\bar{\theta}_{k}\right\|^{2}|\bm{\theta}_{k}] (12)
\displaystyle\leq (θ¯k)γξN2i=1N(θ¯k)T[(θi,k)(θ¯k)]γξN(θ¯k)2+L2E[θ¯k+1θ¯k2|𝜽k].\displaystyle\ell(\bar{\theta}_{k})-\frac{\gamma\xi}{N^{2}}\sum_{i=1}^{N}\triangledown\ell(\bar{\theta}_{k})^{T}[\triangledown\ell(\theta_{i,k})-\triangledown\ell(\bar{\theta}_{k})]-\frac{\gamma\xi}{N}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{L}{2}E[\left\|\bar{\theta}_{k+1}-\bar{\theta}_{k}\right\|^{2}|\bm{\theta}_{k}].

Using (11) from the proof of Lemma 2 we obtain

E[θ¯k+1θ¯k2|𝜽k]=\displaystyle E[\left.\left\|\bar{\theta}_{k+1}-\bar{\theta}_{k}\right\|^{2}\right|\bm{\theta}_{k}]= γ2N3i=1Nμiμ¯(θi,k)+ai,k2+γ2N3i=1Nμiμ¯𝔼[εi,k2|𝜽k]\displaystyle\frac{\gamma^{2}}{N^{3}}\sum_{i=1}^{N}\frac{\mu_{i}}{\bar{\mu}}\left\|\triangledown\ell(\theta_{i,k})+a\triangledown\mathcal{F}_{i,k}\right\|^{2}+\frac{\gamma^{2}}{N^{3}}\sum_{i=1}^{N}\frac{\mu_{i}}{\bar{\mu}}\mathbb{E}[\|\varepsilon_{i,k}\|^{2}|\bm{\theta}_{k}]
\displaystyle\leq 4γ2ξN2[(θ¯k)2+(L2+2a2d¯2)V¯k]+γ2ξσ2N3.\displaystyle\frac{4\gamma^{2}\xi}{N^{2}}\Big{[}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+(L^{2}+2a^{2}\overline{d}^{2})\overline{V}_{k}\Big{]}+\frac{\gamma^{2}\xi\sigma^{2}}{N^{3}}. (13)

Also,

(θ¯k)T[(θi,k)(θ¯k)]=\displaystyle-\triangledown\ell(\bar{\theta}_{k})^{T}[\triangledown\ell(\theta_{i,k})-\triangledown\ell(\bar{\theta}_{k})]= 12(θ¯k)2+12(θi,k)(θ¯k)2(θi,k)2\displaystyle\frac{1}{2}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{1}{2}\left\|\triangledown\ell(\theta_{i,k})-\triangledown\ell(\bar{\theta}_{k})\right\|^{2}-\left\|\triangledown\ell(\theta_{i,k})\right\|^{2}
\displaystyle\leq 12(θ¯k)2+L22θi,kθ¯k2.\displaystyle\frac{1}{2}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{L^{2}}{2}\left\|\theta_{i,k}-\bar{\theta}_{k}\right\|^{2}. (14)

Substituting (14) and (13) into (12) we obtain:

E[(θ¯k+1)|𝜽k]\displaystyle E[\left.\ell(\bar{\theta}_{k+1})\right|\bm{\theta}_{k}]\leq (θ¯k)γξ2N(θ¯k)2+L2γξ2NV¯k\displaystyle\ell(\bar{\theta}_{k})-\frac{\gamma\xi}{2N}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{L^{2}\gamma\xi}{2N}\bar{V}_{k}
+2Lγ2ξN2[(θ¯k)2+(L2+2a2d¯2)V¯k]+Lγ2ξσ22N3.\displaystyle+\frac{2L\gamma^{2}\xi}{N^{2}}\Big{[}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+(L^{2}+2a^{2}\overline{d}^{2})\overline{V}_{k}\Big{]}+\frac{L\gamma^{2}\xi\sigma^{2}}{2N^{3}}. (15)

Consider the function (θ¯k)+LV¯k\ell(\bar{\theta}_{k})+L\overline{V}_{k}. From the inequalities in (15) and Lemma 2 we obtain:

𝔼[V¯k+1|𝜽k](1+κγN)V¯k+4γ2ξN(θ¯k)2+γ2ξσ2N2,\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}]\leq(1+\frac{\kappa\gamma}{N})\overline{V}_{k}+\frac{4\gamma^{2}\xi}{N}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{\gamma^{2}\xi\sigma^{2}}{N^{2}},
𝔼[(θ¯k+1)+LV¯k+1|𝜽k]\displaystyle\mathbb{E}[\ell(\bar{\theta}_{k+1})+L\overline{V}_{k+1}|\bm{\theta}_{k}]\leq ((θ¯k)+LV¯k)γξN(122γL(2+1N))(θ¯k)2\displaystyle(\ell(\bar{\theta}_{k})+L\overline{V}_{k})-\frac{\gamma\xi}{N}\left(\frac{1}{2}-2\gamma L(2+\frac{1}{N})\right)\|\triangledown\ell(\bar{\theta}_{k})\|^{2}
+[κ+Lξ2+2γξN(L2+2a2d¯2)]LγNV¯k+Lγ2ξσ2N2(1+12N).\displaystyle+\Big{[}\kappa+\frac{L\xi}{2}+\frac{2\gamma\xi}{N}(L^{2}+2a^{2}\overline{d}^{2})\Big{]}\frac{L\gamma}{N}\overline{V}_{k}+\frac{L\gamma^{2}\xi\sigma^{2}}{N^{2}}(1+\frac{1}{2N}).

By choosing a>4μminL+ξL4λ2μmaxa>\frac{4\mu_{\min}L+\xi L}{4{\lambda}_{2}\mu_{\max}}, γ¯1>0\bar{\gamma}_{1}>0. Given the choice γ<γ¯1\gamma<\bar{\gamma}_{1} in the statement of Theorem 1, we have

κ+Lξ2+2γξN(L2+2a2d¯2)=\displaystyle\kappa+\frac{L\xi}{2}+\frac{2\gamma\xi}{N}(L^{2}+2a^{2}\overline{d}^{2})= 2aλ2μmax+L(2μmin+ξ2)+6γξN(L2+2a2d¯2)0\displaystyle-2a\lambda_{2}\mu_{\max}+L(2\mu_{\min}+\frac{\xi}{2})+\frac{6\gamma\xi}{N}(L^{2}+2a^{2}\overline{d}^{2})\leq 0

It follows that

γξN(122γL(2+1N))(θ¯k)2(θ¯k)+LV¯k𝔼[(θ¯k+1)+LV¯k+1|𝜽k]+Lγ2ξσ2N2(1+12N).\displaystyle\frac{\gamma\xi}{N}\Big{(}\frac{1}{2}-2{\gamma}L(2+\frac{1}{N})\Big{)}\|\triangledown\ell(\bar{\theta}_{k})\|^{2}\leq\ell(\bar{\theta}_{k})+L\overline{V}_{k}-\mathbb{E}\left[\ell(\bar{\theta}_{k+1})+L\overline{V}_{k+1}|\bm{\theta}_{k}\right]+\frac{L{\gamma}^{2}\xi\sigma^{2}}{N^{2}}(1+\frac{1}{2N}).

Let η=γξN(122γL(2+1N))\eta=\frac{\gamma\xi}{N}\Big{(}\frac{1}{2}-2{\gamma}L(2+\frac{1}{N})\Big{)}. By definition γ<γ¯2{\gamma}<\bar{\gamma}_{2}, we have η>0\eta>0. Since the loss function is nonnegative, l()0l(\cdot)\geq 0 and V¯k0\bar{V}_{k}\geq 0 for all kk. Taking full expectation and summing from k=0k=0 to k=K1k=K-1 on both sides of the above inequality, we obtain

𝔼[ηk=0K1(θ¯k)2](θ¯0)+LV¯0+KLγ2ξσ2N2(1+12N).\mathbb{E}[\eta\sum_{k=0}^{K-1}\|\triangledown\ell(\bar{\theta}_{k})\|^{2}]\leq\ell(\bar{\theta}_{0})+L\overline{V}_{0}+\frac{KL{\gamma}^{2}\xi\sigma^{2}}{N^{2}}(1+\frac{1}{2N}).

We conclude that

𝔼[1Kk=0K1𝔼[(θ¯k)2]]1ηK[(θ¯0)+LV¯0+KLγ2ξσ2N2(1+12N)].\displaystyle\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\triangledown\ell(\bar{\theta}_{k})\|^{2}]\right]\leq\frac{1}{\eta K}\left[\ell(\bar{\theta}_{0})+L\overline{V}_{0}+\frac{KL\gamma^{2}\xi\sigma^{2}}{N^{2}}(1+\frac{1}{2N})\right].\blacksquare

Proof of Corollary 1

Since γ<γ¯1{\gamma}<\bar{\gamma}_{1}, it follows that

κ<\displaystyle\kappa< 2Lμmin2aλ2μmax+23(2aλ2μmaxL(2μmin+ξ2))\displaystyle 2L\mu_{\min}-2a{\lambda}_{2}\mu_{\max}+\frac{2}{3}\Big{(}2a{\lambda}_{2}\mu_{\max}-L(2\mu_{\min}+\frac{\xi}{2})\Big{)}
=23Lμmin23aλ2μmaxξL3<0,\displaystyle=\frac{2}{3}L\mu_{\min}-\frac{2}{3}a{\lambda}_{2}\mu_{\max}-\frac{\xi L}{3}<0,

and from Lemma 2:

|κ|γNV¯kV¯k𝔼[V¯k+1|𝜽k]+4γ2ξN(θ¯k)2+γ2ξσ2N2.\frac{\left|\kappa\right|\gamma}{N}\overline{V}_{k}\leq\overline{V}_{k}-\mathbb{E}[\overline{V}_{k+1}|\bm{\theta}_{k}]+\frac{4\gamma^{2}\xi}{N}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{\gamma^{2}\xi\sigma^{2}}{N^{2}}.

Taking full expectation and summing from k=0k=0 to k=K1k=K-1 on both sides of the above inequality:

𝔼[1Kk=0K1V¯k]NK|κ|γV¯0+4γξ|κ|[1Kk=0K1(θ¯k)2+σ24N],\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\overline{V}_{k}\right]\leq\frac{N}{K|\kappa|\gamma}\overline{V}_{0}+\frac{4\gamma\xi}{|\kappa|}\Big{[}\frac{1}{K}\sum_{k=0}^{K-1}\left\|\triangledown\ell(\bar{\theta}_{k})\right\|^{2}+\frac{\sigma^{2}}{4N}\Big{]},

and using Theorem 1 we obtain the result. \blacksquare

Proof of Proposition 1

By Assumption 3 and Taylor expansion,

(θk+1)\displaystyle\ell(\theta_{k+1}) (θk)+(θk)(θk+1θk)+L2θk+1θk2\displaystyle\leq\ell(\theta_{k})+\triangledown\ell(\theta_{k})^{\intercal}(\theta_{k+1}-\theta_{k})+\frac{L}{2}\left\|\theta_{k+1}-\theta_{k}\right\|^{2}
=(θk)γ(θk)i=1N𝟏i,kgi(θk)+L2θk+1θk2.\displaystyle=\ell(\theta_{k})-\gamma\triangledown\ell(\theta_{k})^{\intercal}\sum_{i=1}^{N}\bm{1}_{i,k}g_{i}(\theta_{k})+\frac{L}{2}\left\|\theta_{k+1}-\theta_{k}\right\|^{2}.

Taking conditional expectation on both sides,

𝔼[(θk+1)|θk]\displaystyle\mathbb{E}[\left.\ell(\theta_{k+1})\right|{\theta}_{k}]\leq (θk)γ(θk)i=1Nμiμ𝔼[(θk)+εi]+L2𝔼[θk+1θk2|θk]\displaystyle\ell(\theta_{k})-\gamma\triangledown\ell(\theta_{k})^{\intercal}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}\mathbb{E}[\triangledown\ell(\theta_{k})+\varepsilon_{i}]+\frac{L}{2}\mathbb{E}[\left\|\theta_{k+1}-\theta_{k}\right\|^{2}|\theta_{k}]
=\displaystyle= (θk)γ(θk)2+L2𝔼[θk+1θk2|θk].\displaystyle\ell(\theta_{k})-\gamma\left\lVert\triangledown\ell(\theta_{k})\right\rVert^{2}+\frac{L}{2}\mathbb{E}[\left\|\theta_{k+1}-\theta_{k}\right\|^{2}|\theta_{k}].

Note that

𝔼[θk+1θk2|θk]=𝔼[γi=1N𝟏i,kgi(θk)2|θk]\displaystyle\mathbb{E}[\left\|\theta_{k+1}-\theta_{k}\right\|^{2}|\theta_{k}]=\mathbb{E}[\|\gamma\sum_{i=1}^{N}\bm{1}_{i,k}g_{i}(\theta_{k})\|^{2}|\theta_{k}]
=\displaystyle= γ2i=1Nμiμ(θk)+εi(θk)2γ2((θk)2+1μi=1Nμiσi2),\displaystyle\gamma^{2}\sum_{i=1}^{N}\frac{\mu_{i}}{\mu}\left\lVert\triangledown\ell(\theta_{k})+\varepsilon_{i}(\theta_{k})\right\rVert^{2}\leq\gamma^{2}\big{(}\left\lVert\triangledown\ell(\theta_{k})\right\rVert^{2}+\frac{1}{\mu}\sum_{i=1}^{N}\mu_{i}\sigma_{i}^{2}\big{)},

it follows that

γ(1Lγ2)(θk)2(θk)𝔼[(θk+1)|θk]+Lγ22μi=1Nμiσi2.\displaystyle\gamma(1-\frac{L\gamma}{2})\|\triangledown\ell(\theta_{k})\|^{2}\leq\ell(\theta_{k})-\mathbb{E}[\ell(\theta_{k+1})|\theta_{k}]+\frac{L\gamma^{2}}{2\mu}\sum_{i=1}^{N}\mu_{i}\sigma_{i}^{2}.

The results follows by taking full expectation and summing from k=0k=0 to k=K1k=K-1 on both sides of the above inequality. \blacksquare


References

  • [1] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–19, 2019.
  • [2] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” arXiv preprint arXiv:1908.07873, 2019.
  • [3] D. Hallac, J. Leskovec, and S. Boyd, “Network lasso: Clustering and optimization in large graphs,” Proceedings SIGKDD, pp. 387–396, 2015.
  • [4] J. Chen, C. Richard, and A. Sayed, “Multitask diffusion adaptation over networks,” IEEE Transactions on Signal Processing, vol. 62, pp. 4129–4144, 2014.
  • [5] R. Nassif, C. Richard, A. Ferrari, and A. Sayed, “Multitask diffusion adaptation over asynchronous networks,” IEEE Transactions on Signal Processing, vol. 64, pp. 2835–2850, 2016.
  • [6] ——, “Multitask diffusion adaptation over asynchronous networks,” IEEE Transactions on Signal Processing, vol. 64, pp. 2835–2850, 2016.
  • [7] R. Nassif, S. Vlaski, and A. Sayed, “Learning over multitask graphs (part i: Stability analysis),” https://arxiv.org/abs/1805.08535, 2019.
  • [8] ——, “Learning over multitask graphs (part ii: Performance analysis),” https://arxiv.org/abs/1805.08547, 2019.
  • [9] F. Niu, B. Recht, C. Re, and S. Wright, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” Advances in Neural Information Processing Systems, pp. 693–701, 2011.
  • [10] J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Srikrishna, “An asynchronous parallel stochastic coordinate descent algorithm,” Journal of Machine Intelligence Research, vol. 16, pp. 285–322, 2015.
  • [11] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Advances in Neural Information Processing Systems 28 (NIPS), 2015.
  • [12] J. Duchi, S. Chaturapruek, and C. Ré, “Asynchronous stochastic convex optimization,” in Advances in Neural Information Processing Systems 28 (NIPS), 2015.
  • [13] N. Friedkin and E. Johnsen, “Social influence and opinions,” Journal of Mathematical Sociology, vol. 15, pp. 193–205, 1990.
  • [14] V. Gazi and K. Passino, Swarm Stability and Optimization.   Springer, 2011.
  • [15] S. Pu and A. Garcia, “A flocking-based approach for distributed stochastic optimization,” Operations Research, vol. 6, pp. 267–281, 2017.
  • [16] R. Leblond, F. Pedregosa, and S. Lacoste-Julien, “Improved asynchronous parallel optimization analysis for stochastic incremental methods,” Journal of Machine Intelligence Research, vol. 19, pp. 1–68, 2018.
  • [17] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [18] W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first- order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, pp. 944–966, 2015.
  • [19] S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
  • [20] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [21] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., “Ray: A distributed framework for emerging AI applications,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI) 18), 2018, pp. 561–577.
  • [22] D. Watts and S. Strogatz, “Collective dynamics of “small-world” networks,” Nature, vol. 393, pp. 440–442, 1998.
  • [23] J. M. Bates and C. M. W. Granger, “The combination of forecasts,” Operations Research Quaterly, vol. 20, pp. 451–468, 1969.
  • [24] C. Godsil and G. F. Royle, Algebraic graph theory.   Springer Science & Business Media, 2013.