
Communication-efficient Byzantine-robust distributed learning with statistical guarantee

Xingcai Zhou, Le Chang, Pengfei Xu and Shaogao Lv
School of Statistics and Mathematics
Nanjing Audit University, Nanjing, 211085, Jiangsu, China
Abstract

Communication efficiency and robustness are two major issues in modern distributed learning frameworks. This is due to practical situations where some computing nodes may have limited communication power or may behave adversarially. To address the two issues simultaneously, this paper develops two communication-efficient and robust distributed learning algorithms for convex problems. Our approach builds on the surrogate likelihood framework together with the median and trimmed mean operations. In particular, the proposed algorithms are provably robust against Byzantine failures, and also achieve optimal statistical rates for strongly convex losses and convex (non-smooth) penalties. For typical statistical models such as generalized linear models, our results show that statistical errors dominate optimization errors after finitely many iterations. Simulated and real data experiments are conducted to demonstrate the numerical performance of our algorithms.

Key Words and Phrases: Distributed statistical learning; Byzantine failure; Communication efficiency; Surrogate likelihood; Proximal algorithm

1 Introduction

In many real-world applications, such as computer vision, natural language processing and recommendation systems, the exceedingly large size of data has made it impossible to store all of it on a single machine. Increasingly, data are stored locally on individual agents’ or users’ devices. Statistical analysis in the modern era therefore has to deal with distributed data storage, which poses tremendous challenges in statistical methodology, computation and communication.

In several practical situations, smart phones or remote devices with limited communication power may serve as local nodes, or communication may be constrained by network bandwidth (Konečnỳ et al., 2016). To address the communication issue, communication-efficient algorithms for distributed optimization have been the focus of many works in the past several years, for example, Zhang et al. (2013), Shamir et al. (2014), Wang et al. (2017a) and Jordan et al. (2019), among others. This literature has focused on the data-parallel mode, in which the overall dataset is partitioned and stored on $m$ worker machines that are processed independently. Among existing distributed approaches, divide and conquer may be the simplest strategy, with a single communication round, where a master machine is responsible for aggregating all local results computed independently at each worker.

Although the divide and conquer strategy has been proved to achieve optimal estimation rates for parametric models and kernel methods (Zhang, Duchi and Wainwright, 2013, 2015), the global estimator based on naive averaging may not inherit useful structures of the model, such as sparsity. Moreover, a lower bound on the sample size $m$ assigned to each local node is required to attain the optimal statistical rates, that is, $m=\Omega(\sqrt{N})$, where $N$ is the total sample size. This deviates from practical scenarios in which a dataset of size $\sqrt{N}$ is still too large to store on a single node/machine. In addition, existing numerical analyses (Jordan, Lee and Yang, 2019) have shown that naive averaging often performs poorly for nonlinear models, and its generalization performance is usually unreliable when the local sample sizes differ significantly across workers (Fan et al., 2019).

In the distributed learning literature on communication efficiency, most existing works fall into two categories: 1) how to design communication-efficient algorithms that reduce the number of communication rounds among workers (Jordan, Lee and Yang, 2019; Konečnỳ, McMahan, Yu, Richtárik, Suresh and Bacon, 2016; Lee, Lin, Ma and Yang, 2017; Shamir, Srebro and Zhang, 2014); 2) how to choose a suitable (lossy) compression for broadcasting parameters (Wang, Wang and Srebro, 2017b). Notably, Jordan et al. (2019) and Wang et al. (2017a) independently proposed a Communication-efficient Surrogate Likelihood (CSL) framework for solving regular M-estimation problems, which also works for high-dimensional penalized regression and Bayesian statistics. Under the master-worker architecture, CSL makes full use of the total information of the data on the master machine, while only merging the first-order gradients from all the workers. Specifically, a quasi-Newton optimization at the master yields the final estimator, instead of merely aggregating all the local estimators as one-shot methods do. It has been shown in (Jordan, Lee and Yang, 2019; Wang, Kolar, Srebro and Zhang, 2017a) that CSL-based distributed learning can preserve sparsity structure and achieve optimal statistical estimation rates for convex problems within finitely many iterations.

Despite the generality and elegance of the CSL framework, it is unwise to apply it directly to Byzantine learning. Since the CSL aggregation rule depends heavily on the local gradients, learning performance will degrade significantly if the gradients received from local workers are highly noisy. In fact, Byzantine failures are frequently encountered in distributed or federated learning (Yin et al., 2018). In a decentralized environment, some computing units may exhibit abnormal behavior due to crashes, stalled computation or unreliable communication channels. This is typically modeled as Byzantine failure, meaning that some worker machines may exhibit arbitrary, potentially adversarial behavior, which can mislead the learning process (Vempaty et al., 2013; Yang et al., 2019; Wu et al., 2019). Robustifying learning against Byzantine failures has attracted a great deal of attention in recent years.

To cope with Byzantine failures in distributed statistical learning, most resilient approaches in recent works combine stochastic gradient descent (SGD) with various robust aggregation rules, such as the geometric median (Minsker, 2015; Chen et al., 2017), median (Xie et al., 2018; Yin et al., 2018), trimmed mean (Yin et al., 2018), iterative filtering (Su and Xu, 2018) and Krum (Blanchard et al., 2017). These learning algorithms can tolerate a small number of devices attacked by Byzantine adversaries. Yin et al. (2018) developed two distributed learning algorithms that are provably robust against Byzantine failures and achieve optimal statistical error rates for strongly convex losses. Yet, these works did not consider the communication cost, and an inappropriate robust technique can increase the number of communications.

In this paper, we develop two efficient distributed learning algorithms that are both communication-efficient and Byzantine-robust, in pursuit of accurate statistical estimators. The proposed algorithms integrate the CSL framework for effective communication with two robust techniques, described in Section 3. At each round, the 1st non-Byzantine machine solves a regularized M-estimation problem on its local data. The other workers only need to compute the gradients on their individual data and send these local gradients to the 1st non-Byzantine machine. Once it receives the gradient values from the workers, the 1st non-Byzantine machine aggregates them by the coordinate-wise median or the coordinate-wise trimmed mean, so as to form a robust proxy of the global gradient.

In our communication-efficient and Byzantine-robust framework, the estimation error indicates that there exist several trade-offs between statistical efficiency, computational efficiency and robustness. In particular, our algorithms guard against Byzantine failures without sacrificing the quality of learning. Theoretically, we show that the first algorithm achieves the following statistical error rates

\mathcal{\tilde{O}}_{p}\left(\frac{1}{\sqrt{n}}\left[\alpha+\frac{\sqrt{p}}{\sqrt{m}}\right]+\frac{1}{n}\right)\mathrm{~for~Option~I~~and}~~\mathcal{\tilde{O}}_{p}\left(\frac{p}{\sqrt{n}}\left[\alpha+\frac{1}{\sqrt{m}}\right]\right)\mathrm{~for~Option~II,}

where $\frac{1}{\sqrt{n}}$ is the effective standard deviation for each machine with $n$ data points, $\alpha$ is the bias effect (price) of the Byzantine machines, $\frac{1}{\sqrt{m}}$ is the averaging effect of the $m$ normal machines, and $\frac{1}{n}$ is due to the dependence of the median on the skewness of the gradients. For strongly convex problems, Yin et al. (2018) proved that no algorithm can achieve an error lower than $\tilde{\Omega}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$ under regular conditions. Hence, this shows the optimality of our methods in a certain sense. As a natural extension of our first algorithm, our second algorithm embeds the proximal algorithm (Parikh and Boyd, 2014) into the distributed procedure. It still performs well under extremely mild conditions, and is particularly suitable for solving very large scale or high-dimensional problems. In addition, algorithmic convergence can be proved under milder conditions, without requiring a good initialization or a large sample size on each worker machine.

The remainder of this paper is organized as follows. In Section 2, we introduce the problem setup and the communication-efficient surrogate likelihood framework for distributed learning. Section 3 proposes a Byzantine-robust CSL distributed learning algorithm and gives statistical guarantees under general conditions. Section 4 presents another Byzantine-robust CSL-proximal distributed learning algorithm and analyzes its theoretical properties. Section 5 provides simulated and real data examples that illustrate the numerical performance of our algorithms and thus validate the theoretical results.

Notations. For any positive integer $n$, we denote the set $\{1,2,\cdots,n\}$ by $[n]$ for brevity. For a vector, the standard $\ell_{2}$-norm and the $\ell_{\infty}$-norm are written as $\|\cdot\|_{2}$ and $\|\cdot\|_{\infty}$, respectively. For a matrix, the operator norm and the Frobenius norm are written as $\|\cdot\|_{2}$ and $\|\cdot\|_{F}$, respectively. For a differentiable function $r:\mathbb{R}^{p}\rightarrow\mathbb{R}$, denote its partial derivative (or sub-differential set) with respect to the $k$-th argument by $\partial_{k}r$. Given a Euclidean space $\mathbb{R}^{p}$, ${\mbox{\boldmath${\theta}$}}_{0},{\mbox{\boldmath${\theta}$}}_{1}\in\mathbb{R}^{p}$ and $r>0$, define $B({\mbox{\boldmath${\theta}$}}_{0},r)=\{{\mbox{\boldmath${\theta}$}}_{1}\in\mathbb{R}^{p}:\|{\mbox{\boldmath${\theta}$}}_{0}-{\mbox{\boldmath${\theta}$}}_{1}\|_{2}\leq r\}$ to be the closed ball with center ${\mbox{\boldmath${\theta}$}}_{0}$, and let $\theta_{1j}$ refer to the $j$th component of ${\mbox{\boldmath${\theta}$}}_{1}$. We assume that ${\mbox{\boldmath${\theta}$}}\in{\mbox{\boldmath${\Theta}$}}\subset\mathbb{R}^{p}$ and ${\mbox{\boldmath${\Theta}$}}$ is a convex and compact set with diameter $D$. Let $\mathfrak{B}$ be the set of Byzantine machines, $\mathfrak{B}\subset\mathds{B}=\{2,\cdots,m+1\}$. Without loss of generality, we assume that the 1st worker machine is normal and the other worker machines may be Byzantine. For matrices $A$ and $B$, $A\succ B$ means $A-B$ is positive definite. Given two sequences $\{a_{n}\}$ and $\{b_{n}\}$, we denote $a_{n}=\mathcal{O}(b_{n})$ if $a_{n}\leq C_{1}b_{n}$ for some absolute positive constant $C_{1}$, and $a_{n}=\Omega(b_{n})$ if $a_{n}\geq C_{2}b_{n}$ for some absolute positive constant $C_{2}$. Furthermore, we use the notations $\mathcal{\tilde{O}}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to hide logarithmic factors in $\mathcal{O}(\cdot)$ and $\Omega(\cdot)$, respectively.

2 Problem Formulation

In this section, we formally describe the problem setup. We focus on a standard statistical learning problem of (regularized) empirical risk minimization (ERM). In the distributed setting, suppose that we have access to one master and $m$ worker machines, and each worker machine independently communicates with the master; each machine contains $n$ data points; and $\alpha m$ of the $m$ worker machines are Byzantine for some proportion $\alpha<1/2$, while the remaining $1-\alpha$ fraction of worker machines are normal. Byzantine workers can send arbitrary values to the master machine. In addition, Byzantine workers may have complete knowledge of the learning algorithm and are allowed to collude with each other (Yin et al., 2018). In this setting, the total number of data points is $N:=(m+1)n$.

Suppose that the observed data are sampled independently from an unknown probability distribution $\mathcal{D}$ over some metric space $\mathcal{Z}$. Let $\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}})$ be a loss function of a parameter ${\mbox{\boldmath${\theta}$}}\in{\mbox{\boldmath${\Theta}$}}\subset\mathbb{R}^{p}$ associated with the data point ${\mbox{\boldmath${z}$}}$. To measure the population loss at ${\mbox{\boldmath${\theta}$}}$, we define the expected risk $F({\mbox{\boldmath${\theta}$}})=\mathbb{E}_{{\mbox{\boldmath${z}$}}\sim\mathcal{D}}\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}})$. The true data-generating parameter we care about is a global minimizer of the population risk,

𝜽=argmin𝜽ΘF(𝜽).{\mbox{\boldmath${\theta}$}}^{*}=\mathrm{argmin}_{{\mbox{\boldmath${\theta}$}}\in\Theta}F({\mbox{\boldmath${\theta}$}}).

It is well known that negative log-likelihood functions serve as typical examples of the loss function $\ell(\cdot)$; for example, the Gaussian distribution corresponds to the least squares loss, while the Bernoulli distribution corresponds to the logistic loss. Given that all the available samples $\{{\mbox{\boldmath${z}$}}_{i}:i=1,\cdots,N\}$ are stored on $m+1$ machines, the empirical loss of the $k$th machine is given as $f_{k}({\mbox{\boldmath${\theta}$}})=\frac{1}{|\mathcal{I}_{k}|}\sum_{i\in\mathcal{I}_{k}}\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}}_{i})$, where $\mathcal{I}_{k}$ is the index set of samples on the $k$th machine with $|\mathcal{I}_{k}|=n=N/(m+1)$ for all $k\in[m+1]$, and $\mathcal{I}_{j}\cap\mathcal{I}_{k}=\emptyset$ for any $j\neq k$. In this paper, we are mainly concerned with learning ${\mbox{\boldmath${\theta}$}}^{*}$ by minimizing the regularized empirical risk

\hat{{\mbox{\boldmath${\theta}$}}}=\mathrm{argmin}_{{\mbox{\boldmath${\theta}$}}\in\Theta}\{f({\mbox{\boldmath${\theta}$}})+g({\mbox{\boldmath${\theta}$}})\}, (2.1)

where $f({\mbox{\boldmath${\theta}$}})=\frac{1}{m+1}\sum_{k=1}^{m+1}f_{k}({\mbox{\boldmath${\theta}$}})$, and $g({\mbox{\boldmath${\theta}$}})$ is a deterministic penalty function independent of the sample points, such as the squared $\ell_{2}$-norm in ridge estimation or the $\ell_{1}$-norm in the Lasso penalty (Tibshirani, 1996).

In the ideal non-Byzantine situation, one of the core goals in the distributed framework is to develop efficient distributed algorithms that approximate $\hat{{\mbox{\boldmath${\theta}$}}}$ well. As a leading work in the literature, Jordan et al. (2019) and Wang et al. (2017a) independently proposed an efficient distributed approach via quasi-likelihood estimation. We now introduce the formulation of this method. Without loss of generality, we take the first machine as the master. An initial estimator $\tilde{{\mbox{\boldmath${\theta}$}}}_{0}$ on the 1st machine is broadcast to all other machines, which compute their individual gradients at $\tilde{{\mbox{\boldmath${\theta}$}}}_{0}$. Then each gradient vector $\nabla f_{k}(\tilde{{\mbox{\boldmath${\theta}$}}}_{0})$ is communicated back to the 1st machine. This constitutes one round of communication with a communication cost of $\mathcal{O}(mp)$. At the $(t+1)$-th iteration, the 1st machine minimizes the following regularized surrogate loss

\tilde{{\mbox{\boldmath${\theta}$}}}_{t+1}=\mathrm{argmin}_{\mbox{\boldmath${\theta}$}}~{}f_{1}({\mbox{\boldmath${\theta}$}})-\Big{\langle}\nabla f_{1}(\tilde{{\mbox{\boldmath${\theta}$}}}_{t})-\frac{1}{m+1}\sum_{k\in[m+1]}\nabla f_{k}(\tilde{{\mbox{\boldmath${\theta}$}}}_{t}),{\mbox{\boldmath${\theta}$}}\Big{\rangle}+g({\mbox{\boldmath${\theta}$}}). (2.2)

Next, the (approximate) minimizer $\tilde{{\mbox{\boldmath${\theta}$}}}_{t+1}$, without any aggregation operation, is communicated to all the local machines, which use it to compute their local gradients; the iteration (2.2) then repeats until convergence.

Different from purely first-order distributed optimization, the refined objective (2.2) leverages both global first-order information and local higher-order information (Wang et al., 2017a). The idea of using such an adaptively enhanced function has also been developed in Shamir et al. (2014) and Fan et al. (2019).
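To make the iteration concrete, the following is a minimal sketch of the CSL update (2.2) for least squares with $g=0$; in that case the surrogate minimization has a closed form, a Newton-type step combining the local Hessian of machine 1 with the averaged global gradient. The simulated data and all names (`X_parts`, `csl_round`, etc.) are our own illustration, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, m_plus_1 = 3, 200, 5                     # dimension, local size, machines
theta_star = rng.standard_normal(p)            # true parameter (simulated)
X_parts = [rng.standard_normal((n, p)) for _ in range(m_plus_1)]
y_parts = [X @ theta_star + 0.1 * rng.standard_normal(n) for X in X_parts]

def local_grad(X, y, theta):
    # gradient of the local squared loss f_k(theta) = ||X theta - y||^2 / (2n)
    return X.T @ (X @ theta - y) / len(y)

def csl_round(theta_t):
    # one round of (2.2): workers send gradients at theta_t, machine 1 averages
    g_bar = np.mean([local_grad(X, y, theta_t)
                     for X, y in zip(X_parts, y_parts)], axis=0)
    # for the squared loss, the surrogate minimizer reduces to a Newton-type
    # step with the *local* Hessian of machine 1 and the *global* gradient
    H1 = X_parts[0].T @ X_parts[0] / n
    return theta_t - np.linalg.solve(H1, g_bar)

theta = np.zeros(p)
for _ in range(5):
    theta = csl_round(theta)
print(np.linalg.norm(theta - theta_star))      # small after a few rounds
```

After a handful of rounds the iterate essentially reaches the full-sample least squares solution, although each round communicates only gradients.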

Throughout the paper, we assume that $\ell(\cdot)$ and $g(\cdot)$ are convex in ${\mbox{\boldmath${\theta}$}}$, and that $\ell(\cdot)$ is twice continuously differentiable in ${\mbox{\boldmath${\theta}$}}$. We allow $g(\cdot)$ to be non-smooth, for example, the $\ell_{1}$-penalty ($\lambda\|{\mbox{\boldmath${\theta}$}}\|_{1}$).

From (2.2), we observe that the update of the parameters of interest depends strongly on the local gradients at every iteration. Hence, a standard learning algorithm based only on average aggregation of the workers’ messages can be arbitrarily skewed if some of the local workers are Byzantine-faulty machines. To address this robustness problem, we develop two Byzantine-robust distributed learning algorithms in the next two sections.

3 Byzantine-robust CSL distributed learning

In this section, we introduce our first communication-efficient Byzantine-robust distributed learning algorithm based on the CSL framework, and in particular introduce two robust operations to handle Byzantine failures. After stating some technical assumptions, we present the optimization error and statistical analysis of the multi-step estimators. At the end of this section, we further illustrate our results with a concrete example of generalized linear models (GLMs).

3.1 Byzantine-robust CSL distributed algorithm

When Byzantine failures occur, the aggregation rule (2.2) is sensitive to bad gradient values. More precisely, although the master machine communicates with the worker machines via a predefined protocol, the Byzantine machines need not obey this protocol and may send arbitrary messages to the master machine. Thus the gradients $\{\nabla f_{k}(\cdot):k=2,\cdots,m+1\}$ received by the master machine are not always reliable, since the information from a Byzantine machine may be completely unrelated to its local data. To state it clearly, we assume that the Byzantine workers can send arbitrary values, denoted by the symbol “*”, to the master machine. In this situation, robust operations should be implemented to replace the simple average of local gradients in (2.2).

Inspired by the robust techniques developed recently in Yin et al. (2018), we apply the coordinate-wise median and the coordinate-wise trimmed mean to formulate our Byzantine-robust CSL distributed learning algorithm.

Definition 3.1

(Coordinate-wise median) For vectors ${\mbox{\boldmath${x}$}}^{i}\in\mathbb{R}^{p}$, $i\in[m]$, the coordinate-wise median ${\mbox{\boldmath${g}$}}=\mathbf{med}\{{\mbox{\boldmath${x}$}}^{i}:i\in[m]\}$ is a vector whose $k$-th coordinate is $g_{k}=\mathbf{med}\{x_{k}^{i}:i\in[m]\}$ for each $k\in[p]$, where $\mathbf{med}$ is the usual (one-dimensional) median.

Definition 3.2

(Coordinate-wise trimmed mean) For $\beta\in[0,\frac{1}{2})$ and vectors ${\mbox{\boldmath${x}$}}^{i}\in\mathbb{R}^{p}$, $i\in[m]$, the coordinate-wise $\beta$-trimmed mean ${\mbox{\boldmath${g}$}}=\mathbf{trmean}_{\beta}\{{\mbox{\boldmath${x}$}}^{i}:i\in[m]\}$ is a vector whose $k$-th coordinate is $g_{k}=\frac{1}{(1-2\beta)m}\sum_{x\in U_{k}}x$ for each $k\in[p]$. Here $U_{k}$ is a subset of $\{x_{k}^{1},\cdots,x_{k}^{m}\}$ obtained by removing the largest and smallest $\beta$ fractions of its elements.
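Both aggregators can be sketched in a few lines of NumPy; the array `grads` (rows are workers' gradient vectors) and the function names are our own illustration.

```python
import numpy as np

def coord_median(grads):
    # Definition 3.1: apply the one-dimensional median to each coordinate
    return np.median(grads, axis=0)

def coord_trimmed_mean(grads, beta):
    # Definition 3.2: per coordinate, drop the largest and smallest beta
    # fractions, then average the remaining (1 - 2*beta)*m values
    m = grads.shape[0]
    b = int(beta * m)
    s = np.sort(grads, axis=0)        # sort each coordinate independently
    return s[b:m - b].mean(axis=0) if b > 0 else s.mean(axis=0)

# one adversarial row barely moves either aggregate
grads = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [100.0, -100.0]])
print(coord_median(grads))            # -> [2.5 3. ]
print(coord_trimmed_mean(grads, 0.25))
```

Note that each coordinate is sorted independently, so the values averaged for different coordinates may come from different workers; this is exactly what "coordinate-wise" means in the definitions above.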

See Algorithm 1 below for details; we call it Algorithm BCSL. In each parallel iteration of Algorithm 1, the 1st machine (the normal master machine) broadcasts the current model parameter to all worker machines. The normal worker machines compute the gradients of their loss functions on their local data and send them to the 1st machine. Since the Byzantine machines may send arbitrary messages due to their abnormal or adversarial behavior, the 1st machine applies the coordinate-wise median or trimmed mean operation to the received gradients. The aggregation rule (3.1) is then used to update the global parameter.

Input: initial estimator ${\mbox{\boldmath${\theta}$}}_{0}$, algorithm parameter $\beta$ (for Option II) and number of iterations $T$
for $t=0,1,2,\cdots,T-1$ do
       The 1st machine: send ${\mbox{\boldmath${\theta}$}}_{t}$ to the other worker machines.
       for all $k\in\{2,3,\cdots,m+1\}$ in parallel do
            Worker machine $k$: evaluate the local gradient
            
{\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t})\leftarrow\left\{\begin{array}[]{cl}\nabla f_{k}({\mbox{\boldmath${\theta}$}}_{t})&\mathrm{normal~{}worker~{}machines,}\\ *&\mathrm{Byzantine~{}machines},\end{array}\right.
       send ${\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t})$ to the 1st machine.
       end for
      The 1st machine: evaluate the aggregated gradient
      
{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t})\leftarrow\left\{\begin{array}[]{ll}\mathbf{med}\{{\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t}):k\in[m+1]\}&\mathrm{Option~{}I},\\ \mathbf{trmean}_{\beta}\{{\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t}):k\in[m+1]\}&\mathrm{Option~{}II},\end{array}\right.
computes
      
{\mbox{\boldmath${\theta}$}}_{t+1}=\mathrm{argmin}_{\mbox{\boldmath${\theta}$}}\left\{f_{1}({\mbox{\boldmath${\theta}$}})-\langle\nabla f_{1}({\mbox{\boldmath${\theta}$}}_{t})-{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t}),{\mbox{\boldmath${\theta}$}}\rangle+g({\mbox{\boldmath${\theta}$}})\right\}, (3.1)
and then broadcasts it to local machines.
end for
Output: ${\mbox{\boldmath${\theta}$}}_{T}$.
Algorithm 1 Byzantine-Robust CSL distributed learning (BCSL)
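As a sanity check, here is a toy simulation of one run of Algorithm 1 (Option I) for least squares with $g=0$, in which case the surrogate (3.1) is solved in closed form. The attack model (Byzantine machines sending large Gaussian noise) and all names are illustrative assumptions of ours, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, m_plus_1 = 3, 200, 11
byz = {7, 9}                                   # which machines are Byzantine
theta_star = rng.standard_normal(p)
X = [rng.standard_normal((n, p)) for _ in range(m_plus_1)]
y = [Xk @ theta_star + 0.1 * rng.standard_normal(n) for Xk in X]

def grad(k, theta):
    # local gradient of the squared loss on machine k (machine 0 = 1st machine)
    return X[k].T @ (X[k] @ theta - y[k]) / n

def bcsl_round(theta_t):
    # workers report gradients; Byzantine machines send arbitrary values
    msgs = [grad(k, theta_t) if k not in byz else 50.0 * rng.standard_normal(p)
            for k in range(m_plus_1)]
    h = np.median(np.stack(msgs), axis=0)      # Option I: coordinate-wise median
    # surrogate (3.1) with g = 0: stationarity gives
    # grad f_1(theta) = grad f_1(theta_t) - h, so for the squared loss the
    # update is a Newton-type step with the local Hessian of machine 1
    H1 = X[0].T @ X[0] / n
    return theta_t - np.linalg.solve(H1, h)

theta = np.zeros(p)
for _ in range(8):
    theta = bcsl_round(theta)
print(np.linalg.norm(theta - theta_star))      # stays small despite the attack
```

Replacing `np.median` with the trimmed mean of Definition 3.2 (with $\beta\geq 2/11$ here) yields the Option II variant of the same loop.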

In order to characterize the optimization error and statistical error of Algorithm 1, we introduce some basic conditions for our theoretical analysis.

Assumption 3.1

(Lipschitz Conditions and Smoothness). For any ${\mbox{\boldmath${z}$}}\in\mathcal{Z}$ and $k\in[p]$, the partial derivative $\partial_{k}\ell(\cdot;{\mbox{\boldmath${z}$}})$ with respect to the first argument is $L_{k}$-Lipschitz. The loss function $\ell(\cdot;{\mbox{\boldmath${z}$}})$ itself is $L$-smooth in the sense that its gradient vector is $L$-Lipschitz continuous under the $\ell_{2}$-norm. Let $\tilde{L}=\sqrt{\sum_{k=1}^{p}L_{k}^{2}}$. Further assume that the population loss function $F(\cdot)$ is $L_{F}$-smooth.

For Option I in Algorithm 1 (the median-based algorithm), some moment conditions on the gradient of $\ell(\cdot)$ are introduced to control its stochastic behavior.

Assumption 3.2

There exist two constants $V$ and $S$ such that, for any ${\mbox{\boldmath${\theta}$}}\in{\mbox{\boldmath${\Theta}$}}$ and all ${\mbox{\boldmath${z}$}}\in\mathcal{Z}$:

(i) (Bounded variance of gradient). $V^{2}{\bf I}\succ Var(\nabla\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}}))$.

(ii) (Bounded skewness of gradient). $\|\gamma(\nabla\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}}))\|_{\infty}\leq S$, where $\gamma(\cdot)$ refers to the coordinate-wise skewness of a vector-valued random variable.

Assumption 3.2 is standard in the literature and is satisfied in many learning problems. See Proposition 1 in Yin et al. (2018) for a specific linear regression problem.

Assumption 3.3

(Strong convexity). $f+g$ has a unique minimizer $\hat{{\mbox{\boldmath${\theta}$}}}\in\mathbb{R}^{p}$ and is $\rho$-strongly convex in $B(\hat{{\mbox{\boldmath${\theta}$}}},R)$ for some $R>0$ and $\rho>0$.

Assumption 3.4

(Homogeneity). $\|\nabla^{2}f_{1}({\mbox{\boldmath${\theta}$}})-\nabla^{2}F({\mbox{\boldmath${\theta}$}})\|_{2}\leq\delta$ for some $\delta\geq 0$ and all ${\mbox{\boldmath${\theta}$}}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)$.

In most existing studies, it is usually assumed that the population risk $F$ is smooth and strongly convex. The empirical risk $f$ also enjoys such good properties, as long as $\{{\mbox{\boldmath${z}$}}_{i}:i=1,\cdots,N\}$ are i.i.d. and the total sample size $N$ is sufficiently large relative to $p$ (Fan et al., 2019). From (3.1) in Algorithm 1, the local data on the 1st machine are used for the optimization, so we need to control the gap between $\nabla^{2}f_{1}({\mbox{\boldmath${\theta}$}})$ and $\nabla^{2}F({\mbox{\boldmath${\theta}$}})$ to obtain the contraction rate of the algorithm. The similarity between $f_{1}$ and $F$ is captured by Assumption 3.4. Indeed, the empirical risk $f_{1}$ should not be too far from the population risk $F$ as long as the sample size $n$ of the 1st machine is not too small. Mei et al. (2018) showed that this holds with reasonably small $\delta$ and large $R$ with high probability under general conditions. In particular, a large $n$ implies a small homogeneity index $\delta$. Obviously, Assumption 3.4 always holds when taking $\delta=\sup_{{\mbox{\boldmath${\theta}$}}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)}\left(\|\nabla^{2}f_{1}({\mbox{\boldmath${\theta}$}})\|_{2}+\|\nabla^{2}F({\mbox{\boldmath${\theta}$}})\|_{2}\right)$.

The following theorem establishes the global convergence of the proposed algorithm BCSL, involving a trade-off between the optimization error and the statistical error.

Theorem 3.1

Assume that Assumptions 3.1-3.4 hold, the iterates $\{{\mbox{\boldmath${\theta}$}}_{t}\}_{t=0}^{\infty}$ are produced by Option I in Algorithm 1 with ${\mbox{\boldmath${\theta}$}}_{0}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)$, $\rho>\delta+2\Delta_{nm\alpha}/R$, and the fraction $\alpha$ of Byzantine machines satisfies

\alpha+\sqrt{\frac{p\log(1+n(m+1)\tilde{L}D)}{m(1-\alpha)}}+0.4748\frac{S}{\sqrt{n}}\leq\frac{1}{2}-\varepsilon (3.2)

for some $\varepsilon>0$. Then, with probability at least $1-\frac{4p}{(1+n(m+1)\tilde{L}D)^{p}}$, we have, for all $t\geq 0$,

\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{\delta}{\rho}\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho}\Delta_{nm\alpha},

where

\Delta_{nm\alpha}=\frac{2\sqrt{2}}{(m+1)n}+\frac{\sqrt{2}C_{\varepsilon}V}{\sqrt{n}}\left(\alpha+\sqrt{\frac{p\log(1+n(m+1)\tilde{L}D)}{m(1-\alpha)}}+0.4748\frac{S}{\sqrt{n}}\right).

Here $\tilde{L}=\sqrt{\sum_{k=1}^{p}L_{k}^{2}}$ and $C_{\varepsilon}=\sqrt{2\pi}\exp\left(\frac{1}{2}(\Phi^{-1}(1-\varepsilon))^{2}\right)$, where $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function $\Phi(\cdot)$ of the standard Gaussian distribution.

Theorem 3.1 shows the linear convergence of ${\mbox{\boldmath${\theta}$}}_{t}$, which depends explicitly on the homogeneity index $\delta$, the strong convexity index $\rho$, and the fraction $\alpha$ of Byzantine machines. The result can be viewed as an extension of those in Jordan et al. (2019) and Fan et al. (2019). A significant difference from theirs is that we allow the initial estimator to be inaccurate, and with high probability we obtain more explicit convergence rates for the optimization error under Byzantine failures.

In particular, the factor $C_{\varepsilon}$ in Theorem 3.1 is a function of $\varepsilon$; for example, $C_{\varepsilon}\approx 4$ when $\varepsilon=1/6$. After running Algorithm 1 for $T$ iterations, with high probability, we have

\|{\mbox{\boldmath${\theta}$}}_{T}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\left(\frac{\delta}{\rho}\right)^{T}\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho-\delta}\Delta_{nm\alpha}. (3.3)

Notice that $\log(1-x)\leq-x$. Theorem 3.1 guarantees that after $T\geq\frac{\rho}{\rho-\delta}\log\frac{(\rho-\delta)\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}{2\Delta_{nm\alpha}}$ parallel iterations, with high probability we obtain a solution $\tilde{{\mbox{\boldmath${\theta}$}}}={\mbox{\boldmath${\theta}$}}_{T}$ with error

\|\tilde{{\mbox{\boldmath${\theta}$}}}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{4}{\rho-\delta}\Delta_{nm\alpha}.

In this case, the rate of the statistical error between $\tilde{{\mbox{\boldmath${\theta}$}}}$ and the centralized empirical risk minimizer $\hat{{\mbox{\boldmath${\theta}$}}}$ is of the order $\mathcal{O}\left(\frac{\alpha}{\sqrt{n}}+\sqrt{\frac{p\log(nm)}{nm}}+\frac{1}{n}\right)$ up to constants, or equivalently $\tilde{\mathcal{O}}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}+\frac{1}{n}\right)$ up to logarithmic factors. Note that

\tilde{\mathcal{O}}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}+\frac{1}{n}\right)=\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{n}}\Big{(}\alpha+\frac{1}{\sqrt{m}}\Big{)}+\frac{1}{n}\right).

Intuitively, the above error rate is nearly the optimal rate one should target: $\frac{1}{\sqrt{n}}$ is the effective standard deviation for each machine with $n$ data points, $\alpha$ is the bias effect of the Byzantine machines, $\frac{1}{\sqrt{m}}$ is the averaging effect of the $m$ normal machines, and $\frac{1}{n}$ reflects the dependence of the median on the skewness of the gradients. If $n\gtrsim m$, then $\tilde{\mathcal{O}}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}+\frac{1}{n}\right)=\tilde{\mathcal{O}}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$ is the order-optimal rate (Yin et al., 2018). When $\alpha=0$ (no Byzantine machines), one sees the usual scaling $\frac{1}{\sqrt{nm}}$ with the global sample size; when $\alpha\neq 0$ (some machines are Byzantine), their influence remains bounded and is proportional to $\alpha$. So we do not sacrifice the quality of learning to guard against Byzantine failures, provided that the Byzantine failure proportion satisfies $\alpha=\mathcal{O}(1/\sqrt{m})$.

Remark also that our results for convex problems hold for any finite radius $R$ of the initialization ball.
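Two quick numeric checks of the quantities above, using made-up values of $\rho$, $\delta$ and $\Delta_{nm\alpha}$ for illustration only: iterating the error recursion of Theorem 3.1 drives the bound to the fixed point $\frac{2}{\rho-\delta}\Delta_{nm\alpha}$ appearing in (3.3), and the constant $C_{\varepsilon}$ indeed evaluates to about 4 at $\varepsilon=1/6$.

```python
from math import sqrt, pi, exp
from statistics import NormalDist

# (1) Iterate e_{t+1} = (delta/rho) e_t + (2/rho) Delta from Theorem 3.1;
#     it converges to the fixed point 2*Delta/(rho - delta), matching the
#     second term of (3.3).  rho, delta, Delta are arbitrary sample values.
rho, delta, Delta = 1.0, 0.5, 0.01
e = 10.0                                # initial error bound ||theta_0 - theta_hat||
for _ in range(60):
    e = (delta / rho) * e + (2.0 / rho) * Delta
print(e, 2.0 * Delta / (rho - delta))   # both ~ 0.04

# (2) C_eps = sqrt(2*pi) * exp((Phi^{-1}(1 - eps))^2 / 2) at eps = 1/6
C_eps = sqrt(2 * pi) * exp(0.5 * NormalDist().inv_cdf(1 - 1 / 6) ** 2)
print(C_eps)                            # ~ 4, as remarked after Theorem 3.1
```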

We next turn to the analysis of Option II in Algorithm 1: robust distributed learning based on the coordinate-wise trimmed mean. Compared to Option I, a stronger assumption on the tail behavior of the partial derivatives of the loss function is needed, as follows.

Assumption 3.5

(Sub-exponential gradients). For all $k\in[p]$ and ${\mbox{\boldmath${\theta}$}}\in{\mbox{\boldmath${\Theta}$}}$, the partial derivative of $\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}})$ with respect to the $k$-th coordinate of ${\mbox{\boldmath${\theta}$}}$, $\partial_{k}\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}})$, is $\upsilon$-sub-exponential.

The sub-exponential assumption implies that all moments of the derivatives are bounded; hence this condition is slightly stronger than the bounded absolute skewness (Assumption 3.2(ii)). Fortunately, Assumption 3.5 is satisfied in a number of learning problems; see Proposition 2 in Yin et al. (2018).
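As a concrete illustration of the two aggregation rules (a sketch with invented data, not the paper's code), the coordinate-wise median and $\beta$-trimmed mean can be written in a few lines of NumPy; `byz` below plays the role of a single Byzantine gradient:

```python
import numpy as np

def coord_median(grads):
    # Option I: coordinate-wise median of the local gradients.
    return np.median(np.stack(grads), axis=0)

def coord_trimmed_mean(grads, beta):
    # Option II: for each coordinate, remove the largest and smallest
    # beta-fraction of the reported values, then average the rest.
    G = np.sort(np.stack(grads), axis=0)
    k = int(beta * G.shape[0])
    return G[k:G.shape[0] - k].mean(axis=0)

rng = np.random.default_rng(0)
# Five honest machines report gradients near (1, -2) ...
honest = [np.array([1.0, -2.0]) + 0.01 * rng.standard_normal(2) for _ in range(5)]
# ... and one Byzantine machine reports an arbitrary corrupted vector.
byz = np.array([1e6, -1e6])
grads = honest + [byz]

med = coord_median(grads)                  # stays near (1, -2)
trm = coord_trimmed_mean(grads, beta=0.2)  # stays near (1, -2)
naive = np.mean(np.stack(grads), axis=0)   # dragged far away
```

Both robust aggregates remain close to the honest gradient, whereas the naive mean is pulled arbitrarily far by a single Byzantine report.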

Theorem 3.2

Suppose that Assumptions 3.1 and 3.3-3.5 hold, and let the iterates $\{{\mbox{\boldmath${\theta}$}}_{t}\}_{t=0}^{\infty}$ be produced by Option II in Algorithm 1 with ${\mbox{\boldmath${\theta}$}}_{0}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)$, $\rho>\delta+2\Delta_{nm\beta}/R$, and $\alpha\leq\beta\leq\frac{1}{2}-\varepsilon$ for some $\varepsilon>0$. Then for all $t\geq 0$, with probability at least $1-\frac{2p(m+2)}{(1+n(m+1)\tilde{L}D)^{p}}$, we have

\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{\delta}{\rho}\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho}\Delta_{nm\beta},

where

\Delta_{nm\beta}=\left(\frac{\upsilon p}{\varepsilon}\left[\frac{3\sqrt{2}\beta}{\sqrt{n}}+\frac{2}{\sqrt{nm}}\right]\sqrt{\log(1+n(m+1)\tilde{L}D)+\frac{\log(1+m)}{p}}+\mathcal{\tilde{O}}\left(\frac{\beta}{n}+\frac{1}{nm}\right)\right).

Theorem 3.2 also shows the linear convergence of ${\mbox{\boldmath${\theta}$}}_{t}$, which depends explicitly on the homogeneity index $\delta$, the strong convexity index $\rho$, and the trimming index $\beta$, chosen to satisfy $\beta\geq\alpha$, where $\alpha$ is the fraction of Byzantine machines. Note that the hyperparameter $\upsilon$ in Assumption 3.5 only affects the statistical error appearing in $\Delta_{nm\beta}$.

Similar to (3.3), we also have

\|{\mbox{\boldmath${\theta}$}}_{T}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\left(\frac{\delta}{\rho}\right)^{T}\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho-\delta}\Delta_{nm\beta}

after $T$ iterations. By running $T\geq\frac{\rho}{\rho-\delta}\log\frac{(\rho-\delta)\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}{2\Delta_{nm\beta}}$ parallel iterations, we obtain a solution $\tilde{\tilde{{\mbox{\boldmath${\theta}$}}}}={\mbox{\boldmath${\theta}$}}_{T}$ satisfying $\|\tilde{\tilde{{\mbox{\boldmath${\theta}$}}}}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}=\mathcal{\tilde{O}}\left(\frac{\beta}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$, since the term $\Delta_{nm\beta}$ reduces to $\Delta_{nm\beta}=\mathcal{O}\left(\frac{\upsilon p}{\varepsilon}\left[\frac{\beta}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right]\sqrt{\log(nm)}\right)$.

It should be pointed out that the trimming index $\beta$ is strictly controlled by the fraction $\alpha$ of Byzantine machines, namely $\frac{1}{2}-\varepsilon\geq\beta\geq\alpha$. By choosing $\beta=c\alpha$ with $c\geq 1$, we still achieve the optimization error rate $\mathcal{\tilde{O}}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$, which is order-optimal in the statistical literature.

We now compare the two Byzantine-robust CSL distributed learning algorithms of Algorithm 1 (Options I and II). The trimmed-mean-based algorithm (Option II) has an order-optimal optimization error rate $\mathcal{\tilde{O}}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$. By contrast, the median-based algorithm (Option I) has the rate $\tilde{\mathcal{O}}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}+\frac{1}{n}\right)$, which involves an additional $\frac{1}{n}$ term, so optimality is achieved only when $n\gtrsim m$. On the other hand, the Option I algorithm needs milder moment conditions on the tails of the loss derivatives (bounded skewness) than the Option II algorithm (sub-exponentiality). This reveals an underlying trade-off between the tail decay of the loss derivatives and the number $m$ of local machines.

Moreover, Algorithm 1 with Option II has an additional parameter $\beta$ with $1/2>\beta\geq\alpha$, which requires the Byzantine machines to be strictly dominated by the normal ones to guarantee robustness. In contrast, Algorithm 1 with Option I places a weaker restriction on $\alpha$.

3.2 Specific Example: Generalized linear models

We now unpack Theorems 3.1 and 3.2 for generalized linear models, taking into account the effect of the iterations on the proposed estimator, the role of the initial estimator, and the Byzantine failures. We derive an explicit rate of convergence for $\delta$ in Assumption 3.4 in the setting of generalized linear models. Theorems 3.1 and 3.2 guarantee that after running $T$ parallel iterations, with high probability the optimization error decays at a linear rate; moreover, within finitely many steps, the optimization errors become negligible compared with the statistical errors. We give a specific analysis below.

For GLMs, the loss function is the negative partial log-likelihood of an exponential-family response given an input feature ${\mbox{\boldmath${x}$}}$. Suppose that the i.i.d. pairs $\{{\mbox{\boldmath${z}$}}_{i}=({\mbox{\boldmath${x}$}}_{i}^{T},y_{i})^{T}\}_{i=1}^{N}$ are drawn from a generalized linear model. Recall that the conditional density of $y_{i}$ given ${\mbox{\boldmath${x}$}}_{i}$ takes the form

h(y_{i};{\mbox{\boldmath${x}$}}_{i},{\mbox{\boldmath${\theta}$}}^{*})=c({\mbox{\boldmath${x}$}}_{i},y_{i})\exp\left(y_{i}{\mbox{\boldmath${x}$}}_{i}^{T}{\mbox{\boldmath${\theta}$}}^{*}-b({\mbox{\boldmath${x}$}}_{i}^{T}{\mbox{\boldmath${\theta}$}}^{*})\right),

where $b(\cdot)$ is a known convex function and $c(\cdot)$ is a known function such that $h(\cdot)$ is a valid probability density function. The loss function corresponding to the negative log-likelihood of the whole data is given by $f({\mbox{\boldmath${\theta}$}})=\frac{1}{m+1}\sum_{k=1}^{m+1}f_{k}({\mbox{\boldmath${\theta}$}})$ with

f_{k}({\mbox{\boldmath${\theta}$}})=\frac{1}{n}\sum_{i\in\mathcal{I}_{k}}\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}}_{i})~{}~{}\mathrm{and}~{}~{}\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}}_{i})=b({\mbox{\boldmath${x}$}}_{i}^{T}{\mbox{\boldmath${\theta}$}})-y_{i}{\mbox{\boldmath${x}$}}_{i}^{T}{\mbox{\boldmath${\theta}$}}. (3.4)
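To make the loss (3.4) concrete, here is a minimal sketch for logistic regression, the GLM with $b(t)=\log(1+e^{t})$; the simulated data and all variable names are illustrative assumptions, not part of the paper:

```python
import numpy as np

def local_loss(theta, X, y):
    # f_k(theta) = (1/n) * sum_i [ b(x_i^T theta) - y_i * x_i^T theta ],
    # with b(t) = log(1 + e^t) for the logistic model.
    t = X @ theta
    return np.mean(np.logaddexp(0.0, t) - y * t)

def local_gradient(theta, X, y):
    # grad f_k(theta) = (1/n) * sum_i [ b'(x_i^T theta) - y_i ] * x_i,
    # where b'(t) is the sigmoid function.
    t = X @ theta
    return X.T @ (1.0 / (1.0 + np.exp(-t)) - y) / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
theta_star = np.array([0.5, -1.0, 0.25])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)

g = local_gradient(theta_star, X, y)  # the gradient a worker would report
```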

We further impose the following standard regularity conditions.

Assumption 3.6

(i) There exist universal positive constants $A_{i}\,(i=1,2,3)$ such that $A_{1}\leq\|{\mbox{\boldmath${\Sigma}$}}\|_{2}\leq A_{2}p^{A_{3}}$, where ${\mbox{\boldmath${\Sigma}$}}=E({\mbox{\boldmath${x}$}}_{i}{\mbox{\boldmath${x}$}}_{i}^{T})$.

(ii) $\{{\mbox{\boldmath${\Sigma}$}}^{-1/2}{\mbox{\boldmath${x}$}}_{i}\}_{i=1}^{N}$ are i.i.d. sub-Gaussian random vectors with bounded $\|{\mbox{\boldmath${\Sigma}$}}^{-1/2}{\mbox{\boldmath${x}$}}_{i}\|_{\psi_{2}}$.

(iii) $\{{\mbox{\boldmath${\Sigma}$}}^{-1/2}{\mbox{\boldmath${x}$}}_{i}\}_{i=1}^{N}$ are i.i.d. random vectors, and each component is $v$-sub-exponential.

(iv) For all $t\in\mathbb{R}$, $|b^{\prime\prime}(t)|$ and $|b^{\prime\prime\prime}(t)|$ are both bounded.

(v) $\|{\mbox{\boldmath${\theta}$}}^{*}\|_{2}$ is bounded.

Assumption 3.7

$F+g$ is $\rho$-strongly convex in $B({\mbox{\boldmath${\theta}$}}^{*},2R)$, where $R<A_{4}p^{A_{5}}$ for some universal constants $A_{4}$, $A_{5}$, and $\rho>0$.

Assumptions 3.6(i)(ii)(iv)(v) and 3.7 were proposed by Fan et al. (2019) to establish the rate of optimization errors for distributed statistical inference. In our paper, these assumptions are mainly used to obtain asymptotic properties of ${\mbox{\boldmath${\theta}$}}_{t}$ for the Option I algorithm. To establish the rate of optimization errors of the Option II algorithm, we need Assumption 3.6(iii) (sub-exponentiality) instead of Assumption 3.6(ii) (sub-Gaussianity), similarly to Theorem 3.2.

From (3.4), we easily obtain

\nabla f_{1}({\mbox{\boldmath${\theta}$}})=\frac{1}{n}\sum_{i\in\mathcal{I}_{1}}[b^{\prime}({\mbox{\boldmath${x}$}}_{i}^{T}{\mbox{\boldmath${\theta}$}})-y_{i}]{\mbox{\boldmath${x}$}}_{i}~{}~{}\mathrm{and}~{}~{}\nabla^{2}f_{1}({\mbox{\boldmath${\theta}$}})=\frac{1}{n}\sum_{i\in\mathcal{I}_{1}}b^{\prime\prime}({\mbox{\boldmath${x}$}}_{i}^{T}{\mbox{\boldmath${\theta}$}}){\mbox{\boldmath${x}$}}_{i}{\mbox{\boldmath${x}$}}_{i}^{T}.

By Lemma A.5 in Fan et al. (2019), we immediately get

\max_{{\mbox{\boldmath${\theta}$}}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)}\|\nabla^{2}f_{1}({\mbox{\boldmath${\theta}$}})-\nabla^{2}F({\mbox{\boldmath${\theta}$}})\|_{2}=\mathcal{O}_{p}\left(\|{\mbox{\boldmath${\Sigma}$}}\|_{2}\sqrt{\frac{p(\log p+\log n)}{n}}\right),

as long as $n>cp$ for a given positive constant $c$. Hence $\delta=O\left(\|{\mbox{\boldmath${\Sigma}$}}\|_{2}\sqrt{p(\log p+\log n)/n}\right)$ can be chosen with high probability. From Theorems 3.1 and 3.2, the contraction factor is $O\left(\kappa\sqrt{p(\log p+\log n)/n}\right)$ with $\kappa=\|{\mbox{\boldmath${\Sigma}$}}\|_{2}/\rho$. The explicit parameter $\kappa$ is comparable to the condition number in Jordan et al. (2019), where finite $p$ and $\kappa$ were assumed.
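The homogeneity index $\delta$ can also be visualized numerically. In this sketch (invented logistic data, not the paper's experiment), the local Hessian $\nabla^{2}f_{1}$ is compared in spectral norm with a large-sample proxy for $\nabla^{2}F$; the discrepancy shrinks as the local sample size $n$ grows, in line with the $\sqrt{p(\log p+\log n)/n}$ rate:

```python
import numpy as np

def glm_hessian(theta, X):
    # Hessian of the local logistic loss: (1/n) sum_i b''(x_i^T theta) x_i x_i^T,
    # with b''(t) = sigmoid(t) * (1 - sigmoid(t)).
    s = 1.0 / (1.0 + np.exp(-(X @ theta)))
    w = s * (1.0 - s)
    return (X * w[:, None]).T @ X / len(X)

rng = np.random.default_rng(2)
p = 3
theta = np.array([0.5, -1.0, 0.25])
H_pop = glm_hessian(theta, rng.standard_normal((200_000, p)))  # population proxy

# delta-like discrepancy ||H_local - H_pop||_2 for two local sample sizes
errs = [np.linalg.norm(glm_hessian(theta, rng.standard_normal((n, p))) - H_pop, 2)
        for n in (100, 10_000)]
# errs[1] (n = 10,000) is markedly smaller than errs[0] (n = 100)
```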

Equipped with the above facts, we have the following results.

Theorem 3.3

Suppose that Assumptions 3.6(i)(ii)(iv)(v) and 3.7 hold, and that with probability tending to one ${\mbox{\boldmath${\theta}$}}_{0}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)$ for some $R>\|\hat{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}$. Let the iterates $\{{\mbox{\boldmath${\theta}$}}_{t}\}_{t=0}^{\infty}$ be produced by Option I in Algorithm 1 with $\rho>\eta^{1/2}+2\Delta_{nm\alpha}/R$ and $\alpha$ satisfying (3.2). Then, after $t$ parallel iterations, we have

\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}=\mathcal{O}_{p}\left(\eta^{t/2}\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho-\eta^{1/2}}\Delta_{nm\alpha}\right),\quad\forall\,t\geq 0,

where $\eta=\kappa^{2}p(\log p+\log n)/n$ and $\Delta_{nm\alpha}$ is defined in Theorem 3.1.

Theorem 3.4

Suppose that Assumptions 3.6(i)(iii)(iv)(v) and 3.7 hold, that $\ell({\mbox{\boldmath${\theta}$}};{\mbox{\boldmath${z}$}})$ satisfies Assumption 3.5, and that with probability tending to one ${\mbox{\boldmath${\theta}$}}_{0}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)$ for some $R>\|\hat{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}$. Let the iterates $\{{\mbox{\boldmath${\theta}$}}_{t}\}_{t=0}^{\infty}$ be produced by Option II in Algorithm 1 with $\rho>\eta^{1/2}+2\Delta_{nm\beta}/R$ and $\alpha\leq\beta\leq\frac{1}{2}-\varepsilon$ for some $\varepsilon>0$. Then, after $t$ parallel iterations, we have

\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}=\mathcal{O}_{p}\left(\eta^{t/2}\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho-\eta^{1/2}}\Delta_{nm\beta}\right),\quad\forall\,t\geq 0,

where $\eta=\kappa^{2}p(\log p+\log n)/n$ and $\Delta_{nm\beta}$ is defined in Theorem 3.2.

Theorems 3.3 and 3.4 clearly show how Algorithm 1 depends on the structural parameters of the problem. When $\kappa$ is bounded, $\eta=\mathcal{O}(p(\log p+\log n)/n)$, so within finitely many steps $\eta^{t/2}$ becomes much smaller than $\Delta_{nm\alpha}$ (resp. $\Delta_{nm\beta}$). Thus,

\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}=E_{n}:=\left\{\begin{array}[]{cl}\mathcal{O}_{p}\left(\frac{\alpha}{\sqrt{n}}+\sqrt{\frac{p\log(nm)}{nm}}+\frac{1}{n}\right),&\mathrm{Option~{}I;}\\ \mathcal{O}_{p}\left(p\left[\frac{\beta}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right]\sqrt{\log(nm)}\right),&\mathrm{Option~{}II.}\end{array}\right. (3.5)

By contrast with the results in Jordan et al. (2019), we allow an inaccurate initial value ${\mbox{\boldmath${\theta}$}}_{0}$ and give more explicit rates of optimization error even when $p$ diverges.

We know that the statistical error of the estimator ${\mbox{\boldmath${\theta}$}}_{t}$ can be controlled by the optimization error of ${\mbox{\boldmath${\theta}$}}_{t}$ and the statistical error of $\hat{{\mbox{\boldmath${\theta}$}}}$, that is,

\|{\mbox{\boldmath${\theta}$}}_{t}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}\leq\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\|\hat{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}. (3.6)

Note that the first term is not a deterministic optimization error; it holds in probability, so it is an optimization error in a statistical sense, which we call the statistical optimization error. The second term is of order $\mathcal{O}_{p}(\sqrt{p/(nm)})$ under mild conditions, which has been well studied in statistics. Thus, the statistical error of ${\mbox{\boldmath${\theta}$}}_{t}$ is controlled by the magnitude of the first term. If we adopt a two-step iteration, and $\|{\mbox{\boldmath${\theta}$}}_{0}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}=\mathcal{O}_{p}(\sqrt{p/n})$ when ${\mbox{\boldmath${\theta}$}}_{0}$ is obtained on the 1st machine, one gets

\|{\mbox{\boldmath${\theta}$}}_{t}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}=E_{n}

by (3.5)-(3.6), where $E_{n}$ is defined in (3.5), provided that for Option I

n\gg\left[\alpha^{-1}p^{3/2}+p^{2/3}N^{1/3}\log^{-1}N+p^{3}\right]\log N=p^{3}(\alpha^{-1}p^{1/2}+1)\log N+(p^{2}N)^{1/3},

or for Option II

n\gg\left[\beta^{-1}\sqrt{p\log N}+(Np\log N)^{1/3}\right].

This implies that Algorithm 1 achieves the order-optimal statistical error rate, where Option I additionally requires $n\gtrsim m$.

4 Byzantine-robust CSL-proximal distributed learning

For Algorithm 1, we established the contraction rates of optimization errors and the statistical analysis under sufficiently strong convexity of $f_{1}+g$ and a small discrepancy between $\nabla^{2}f_{1}$ and $\nabla^{2}F$. This requires the sample size on each machine to be large enough; indeed, the required local sample size $n$ depends on structural parameters, which may not be realistic in practice. The coordinate-wise median and coordinate-wise trimmed mean proposed in Algorithm 1 are robust to Byzantine failures, but the optimization on the 1st worker machine can be unstable even for moderate $n$. In this section, we propose another Byzantine-robust CSL algorithm by embedding the proximal algorithm; see Rockafellar (1976) and Parikh and Boyd (2014) for background on proximal algorithms.

4.1 Byzantine-robust CSL-proximal distributed algorithm

First, recall that the proximal operator $\mathrm{prox}_{h}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ is defined by

\mathrm{prox}_{h}({\mbox{\boldmath${v}$}})=\mathrm{argmin}_{{\mbox{\boldmath${x}$}}}\left(h({\mbox{\boldmath${x}$}})+\frac{1}{2}\|{\mbox{\boldmath${x}$}}-{\mbox{\boldmath${v}$}}\|_{2}^{2}\right).

Using the proximal operator of the function $\lambda^{-1}h$ with $\lambda>0$, the proximal algorithm for minimizing $h$ iteratively computes

{\mbox{\boldmath${x}$}}_{t+1}=\mathrm{prox}_{\lambda^{-1}h}({\mbox{\boldmath${x}$}}_{t})=\mathrm{argmin}_{{\mbox{\boldmath${x}$}}\in\mathbb{R}^{p}}\left(h({\mbox{\boldmath${x}$}})+\frac{\lambda}{2}\|{\mbox{\boldmath${x}$}}-{\mbox{\boldmath${x}$}}_{t}\|_{2}^{2}\right)

starting from some initial value ${\mbox{\boldmath${x}$}}_{0}$. Rockafellar (1976) showed that $\{{\mbox{\boldmath${x}$}}_{t}\}_{t=0}^{\infty}$ converges linearly to some $\hat{{\mbox{\boldmath${x}$}}}\in\mathrm{argmin}_{{\mbox{\boldmath${x}$}}\in\mathbb{R}^{p}}h({\mbox{\boldmath${x}$}})$.
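A minimal sketch of this proximal point iteration (a toy quadratic chosen purely for illustration, not the paper's objective) exhibits the linear convergence: for $h({\mbox{\boldmath${x}$}})=\frac{1}{2}{\mbox{\boldmath${x}$}}^{T}A{\mbox{\boldmath${x}$}}$ with $A\succ 0$, the update has the closed form ${\mbox{\boldmath${x}$}}_{t+1}=\lambda(A+\lambda I)^{-1}{\mbox{\boldmath${x}$}}_{t}$.

```python
import numpy as np

# Proximal point iteration x_{t+1} = argmin_x { h(x) + (lam/2) ||x - x_t||^2 }
# for h(x) = 0.5 * x^T A x with A positive definite (minimizer: x = 0).
# Setting the gradient A x + lam (x - x_t) to zero gives the closed form below.
A = np.diag([1.0, 4.0])      # strong convexity constant mu = 1
lam = 2.0
M = lam * np.linalg.inv(A + lam * np.eye(2))

x = np.array([5.0, -3.0])
errs = [np.linalg.norm(x)]
for _ in range(20):
    x = M @ x
    errs.append(np.linalg.norm(x))
# Each step contracts the error by at least lam / (mu + lam) = 2/3,
# so errs decays geometrically toward the minimizer 0.
```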

For our problem (2.1), the proximal iteration algorithm is

{\mbox{\boldmath${\theta}$}}_{t+1}=\mathrm{prox}_{\lambda^{-1}(f+g)}({\mbox{\boldmath${\theta}$}}_{t})=\mathrm{argmin}_{{\mbox{\boldmath${\theta}$}}\in\mathbb{R}^{p}}\left(f({\mbox{\boldmath${\theta}$}})+g({\mbox{\boldmath${\theta}$}})+\frac{\lambda}{2}\|{\mbox{\boldmath${\theta}$}}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}^{2}\right).

In our setting, we adopt distributed learning, and the optimization is carried out mainly on the 1st worker machine. Here $g(\cdot)$ is the penalty function used in the optimization step, and $f_{1}(\cdot)$ involves only the local data of the 1st worker machine. The penalty function in our proximal algorithm thus becomes $g({\mbox{\boldmath${\theta}$}})+\frac{\lambda}{2}\|{\mbox{\boldmath${\theta}$}}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}^{2}$. Accordingly, the optimization (3.1) in Algorithm 1 is replaced by

{\mbox{\boldmath${\theta}$}}_{t+1}=\mathrm{argmin}_{{\mbox{\boldmath${\theta}$}}}\left\{f_{1}({\mbox{\boldmath${\theta}$}})-\langle\nabla f_{1}({\mbox{\boldmath${\theta}$}}_{t})-{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t}),{\mbox{\boldmath${\theta}$}}\rangle+g({\mbox{\boldmath${\theta}$}})+\frac{\lambda}{2}\|{\mbox{\boldmath${\theta}$}}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}^{2}\right\}, (4.1)

where ${\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t})$ is the gradient information at ${\mbox{\boldmath${\theta}$}}_{t}$ from the other worker machines. The optimization (4.1) makes the Byzantine-robust CSL-proximal distributed learning converge rapidly; see Algorithm 2 below, which we call Algorithm BCSLp.

Input: initial estimator ${\mbox{\boldmath${\theta}$}}_{0}$, algorithm parameter $\beta$ (for Option II), and the number of iterations $T$
for t=0,1,2,,T1t=0,1,2,\cdots,T-1 do
       The 1st machine: send ${\mbox{\boldmath${\theta}$}}_{t}$ to the other worker machines.
       for $\mathbf{all}~{}k\in\{2,3,\cdots,m+1\}~{}\mathbf{parallel}$ do
            Worker machine $k$: evaluate local gradient
            
{\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t})\leftarrow\left\{\begin{array}[]{cl}\nabla f_{k}({\mbox{\boldmath${\theta}$}}_{t})&\mathrm{normal~{}worker~{}machines,}\\ \ast&\mathrm{Byzantine~{}machines},\end{array}\right.
             send ${\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t})$ to the 1st machine.
       end for
      The 1st machine: evaluates aggregate gradients
      
{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t})\leftarrow\left\{\begin{array}[]{ll}\mathbf{med}\{{\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t}):k\in[m+1]\}&\mathrm{Option~{}I},\\ \mathbf{trmean}_{\beta}\{{\mbox{\boldmath${h}$}}_{k}({\mbox{\boldmath${\theta}$}}_{t}):k\in[m+1]\}&\mathrm{Option~{}II},\end{array}\right. (4.2)
computes
      
{\mbox{\boldmath${\theta}$}}_{t+1}=\mathrm{argmin}_{{\mbox{\boldmath${\theta}$}}}\left\{f_{1}({\mbox{\boldmath${\theta}$}})-\langle\nabla f_{1}({\mbox{\boldmath${\theta}$}}_{t})-{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t}),{\mbox{\boldmath${\theta}$}}\rangle+g({\mbox{\boldmath${\theta}$}})+\frac{\lambda}{2}\|{\mbox{\boldmath${\theta}$}}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}^{2}\right\},
and then broadcasts ${\mbox{\boldmath${\theta}$}}_{t+1}$ to the other machines.
end for
Output: ${\mbox{\boldmath${\theta}$}}_{T}$.
Algorithm 2 Byzantine-robust CSL-proximal distributed learning (BCSLp)

Algorithm 2 above is a communication-efficient, Byzantine-robust, accurate statistical learning algorithm: it adopts the coordinate-wise median and coordinate-wise trimmed mean to cope with Byzantine failures and uses the proximal algorithm as its backbone. Each iteration involves one round of communication and one optimization step, similarly to Algorithm 1. It is a regularized version of Algorithm 1, obtained by adding a strictly convex quadratic term to the objective function. This technique has been used in distributed stochastic optimization, e.g., to accelerate first-order algorithms (Lee et al., 2017) and to regularize the sizes of updates (Wang et al., 2017b), and in communication-efficient accurate distributed statistical estimation (Fan et al., 2019).
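The following toy simulation (all sizes, the Byzantine behavior, and $\lambda$ are invented for illustration) runs Algorithm 2 with Option I and $g=0$ on distributed least squares, where the inner minimization has a closed form; one of the six machines sends arbitrary gradients, yet the iterates still approach the truth:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, machines, byz_idx = 3, 200, 6, 5          # machine 6 is Byzantine
theta_star = np.array([1.0, -2.0, 0.5])
data = []
for _ in range(machines):
    X = rng.standard_normal((n, p))
    data.append((X, X @ theta_star + 0.1 * rng.standard_normal(n)))

def grad(theta, X, y):
    # Local least-squares gradient (1/n) X^T (X theta - y).
    return X.T @ (X @ theta - y) / n

X1, _ = data[0]
S1 = X1.T @ X1 / n                               # local Hessian on machine 1
lam = 0.5
A = S1 + lam * np.eye(p)

theta = np.zeros(p)
for _ in range(30):
    hs = [grad(theta, X, y) for (X, y) in data]
    hs[byz_idx] = np.full(p, 1e3)                # Byzantine machine sends garbage
    h = np.median(np.stack(hs), axis=0)          # Option I aggregation
    # Proximal CSL step with g = 0; for least squares it reduces to
    # (S1 + lam I) theta_new = (S1 + lam I) theta - h(theta).
    theta = np.linalg.solve(A, A @ theta - h)
# theta ends up close to theta_star despite the Byzantine machine
```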

Now, we give contraction guarantees for Algorithm 2.

Theorem 4.1

Suppose that Assumptions 3.1-3.4 hold, and let the iterates $\{{\mbox{\boldmath${\theta}$}}_{t}\}_{t=0}^{\infty}$ be produced by Option I in Algorithm 2 with ${\mbox{\boldmath${\theta}$}}_{0}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R/2)$, $\left(\frac{\delta+2R^{-1}\Delta_{nm\alpha}}{\rho+\lambda}\right)^{2}<\frac{\rho}{\rho+2\lambda}$, and the fraction $\alpha$ of Byzantine machines satisfying (3.2). Then for all $t\geq 0$, with probability at least $1-\frac{4p}{(1+n(m+1)\tilde{L}D)^{p}}$, we have

\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{\frac{\delta}{\rho+\lambda}\sqrt{\rho^{2}+2\lambda\rho}+\lambda}{\rho+\lambda}\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho+\lambda}\Delta_{nm\alpha}, (4.3)

where $\Delta_{nm\alpha}$ is defined in Theorem 3.1.

Theorem 4.2

Suppose that Assumptions 3.1 and 3.3-3.5 hold, and let the iterates $\{{\mbox{\boldmath${\theta}$}}_{t}\}_{t=0}^{\infty}$ be produced by Option II in Algorithm 2 with ${\mbox{\boldmath${\theta}$}}_{0}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)$, $\left(\frac{\delta+2R^{-1}\Delta_{nm\beta}}{\rho+\lambda}\right)^{2}<\frac{\rho}{\rho+2\lambda}$, and $\alpha\leq\beta\leq\frac{1}{2}-\varepsilon$ for some $\varepsilon>0$. Then for all $t\geq 0$, with probability at least $1-\frac{2p(m+2)}{(1+n(m+1)\tilde{L}D)^{p}}$, we have

\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{\frac{\delta}{\rho+\lambda}\sqrt{\rho^{2}+2\lambda\rho}+\lambda}{\rho+\lambda}\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho+\lambda}\Delta_{nm\beta},

where $\Delta_{nm\beta}$ is defined in Theorem 3.2.

Theorems 4.1 and 4.2 establish the linear convergence of Algorithm 2 under Options I and II, respectively. The contraction factor consists of two parts: $\frac{\delta\sqrt{\rho^{2}+2\lambda\rho}}{(\rho+\lambda)^{2}}$, which comes from the error of the inexact proximal update $\|{\mbox{\boldmath${\theta}$}}_{t+1}-\mathrm{prox}_{\lambda^{-1}(f+g)}({\mbox{\boldmath${\theta}$}}_{t})\|_{2}$, and $\frac{\lambda}{\rho+\lambda}$, which comes from the residual of the proximal point $\|\mathrm{prox}_{\lambda^{-1}(f+g)}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}$. As in Theorems 3.1 and 3.2, within a finite number $T$ of steps we have $\|{\mbox{\boldmath${\theta}$}}_{T}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}=\tilde{\mathcal{O}}_{p}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}+\frac{1}{n}\right)$ for Option I, which is order-optimal if $n\gtrsim m$, and $\|{\mbox{\boldmath${\theta}$}}_{T}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}=\mathcal{\tilde{O}}_{p}\left(\frac{\beta}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$ for Option II, which is also order-optimal. These results only require the $\{f_{k}\}_{k=1}^{m+1}$ to be convex and smooth, while the penalty $g$ is allowed to be non-smooth, for example the $\ell_{1}$ norm. By contrast, most distributed statistical learning algorithms are designed only for smooth problems and do not consider Byzantine failures, for instance Shamir et al. (2014), Wang et al. (2017a) and Jordan et al. (2019).

In Theorems 3.1 and 3.2, we require the homogeneity assumption $\rho>\delta+2\Delta_{nm\alpha}/R$ between $f_{1}(\cdot)$ and $F(\cdot)$; that is, they must be sufficiently similar, which by the law of large numbers forces the sample size of the 1st worker machine to be large. From Theorems 4.1 and 4.2, such a condition is no longer needed, as long as $\left(\frac{\delta+2R^{-1}\Delta_{nm\beta}}{\rho+\lambda}\right)^{2}<\frac{\rho}{\rho+2\lambda}$ holds, and this condition is guaranteed by choosing a sufficiently large regularization parameter $\lambda$. Therefore, Algorithm 2 needs a weaker homogeneity assumption than Algorithm 1. After running a finite number $T$ of parallel iterations, with high probability we obtain the error $\|{\mbox{\boldmath${\theta}$}}_{T}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{4}{\rho-\delta\sqrt{\rho^{2}+2\lambda\rho}/(\rho+\lambda)}\Delta_{nm\alpha}$ (Option I) or $\frac{4}{\rho-\delta\sqrt{\rho^{2}+2\lambda\rho}/(\rho+\lambda)}\Delta_{nm\beta}$ (Option II). Furthermore, choosing a large $\lambda$ accelerates the contraction, and then $\delta$ has little effect on the contraction factor. This is an important contribution of Algorithm 2.

The following corollary gives the choice of λ\lambda that makes Algorithm 2 converge.

Corollary 4.1

Under the assumptions of Theorem 4.1 or 4.2,

(i) if $\lambda>4\delta^{2}/\rho$, then with high probability,

\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\left(1-\frac{\rho}{10(\lambda+\rho)}\right)\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho+\lambda}\Lambda_{nm};

(ii) if $\lambda\leq C\delta^{2}/\rho$ for some constant $C$ and $\delta/\rho$ is sufficiently small, then with high probability,

\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq O\left(\frac{\delta}{\rho}\right)\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho+\lambda}\Lambda_{nm};

where $\Lambda_{nm}=\Delta_{nm\alpha}$ for Option I and $\Lambda_{nm}=\Delta_{nm\beta}$ for Option II.

From Corollary 4.1, we can choose

\lambda\asymp\delta^{2}/\rho

as a default choice for Algorithm 2, under which the algorithm converges naturally and achieves the order-optimal error rate after finitely many parallel iterations: $\tilde{\mathcal{O}}_{p}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}+\frac{1}{n}\right)$ for Option I, which is order-optimal if $n\gtrsim m$, and $\mathcal{\tilde{O}}_{p}\left(\frac{\beta}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$ for Option II. From Corollary 4.1(ii), with a regularizer $\lambda$ up to $O(\delta^{2}/\rho)$, the contraction factor $O\left(\frac{\delta}{\rho}\right)$ is essentially the same as in the unregularized problem ($\lambda=0$). This also tells us how large $\lambda$ can be chosen so that the contraction factor has the same order as in Algorithm 1; see also Theorems 3.1 and 3.2.

4.2 Statistical analysis for general models

In this subsection, we consider generalized linear models as in Subsection 3.2. In Algorithm 2, $\lambda$ is a regularization parameter that is crucial for adapting to different regimes of $n/p$: by specifying the correct order of $\lambda$, Algorithm 2 resolves the dilemma of a small local sample size $n$ while retaining all the properties of Algorithm 1 in the large-$n$ regime.

Theorem 4.3

Suppose the assumptions of Theorem 3.3 or 3.4 hold, except that with high probability ${\mbox{\boldmath${\theta}$}}_{0}\in B(\hat{{\mbox{\boldmath${\theta}$}}},R/2)$ for some $R>\|\hat{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}$. Let $\eta=\kappa^{2}p(\log p+\log n)/n$ and $\kappa=\|{\mbox{\boldmath${\Sigma}$}}\|_{2}/\rho$. For any $c_{1},c_{2}>0$, there exists $c_{3}>0$ such that, after $t$ parallel iterations:

(i) if $n>c_{1}p$ and $\lambda>c_{3}\rho\eta$, then the algorithms converge linearly:

\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\left(1-\frac{\rho}{10(\lambda+\rho)}\right)^{t}\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho(1-\eta^{1/2})}\Lambda_{nm},\quad\forall t\geq 0;

(ii) if $\eta$ is sufficiently small and $\lambda\leq c_{2}\rho\eta$, then the algorithms satisfy

\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\eta^{t/2}\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho(1-\eta^{1/2})}\Lambda_{nm},\quad\forall t\geq 0;

where $\Lambda_{nm}=\Delta_{nm\alpha}$ for Option I and $\Lambda_{nm}=\Delta_{nm\beta}$ for Option II.

For Theorem 4.3(i), we assume $n\geq c_{1}p$, which is reasonable in many big-data situations. Theorem 4.3(ii) shows that by taking $\lambda=O(\rho\eta)$, Algorithm 2 inherits all the advantages of Algorithm 1 in the large-$n$ regime, one of which is fast linear contraction at the rate $O(\kappa\sqrt{p(\log p+\log n)/n})$. In practice, it is difficult to determine whether the sample size is sufficiently large, but Algorithm 2 always guarantees convergence via a proper choice of $\lambda$. Moreover, Theorem 4.3(ii) guarantees that after running $T\geq\frac{2\log\left(2^{-1}\Lambda_{nm}^{-1}\rho(1-\eta^{1/2})\|{\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\right)}{\log(1/\eta)}$ parallel iterations, with high probability we obtain a solution $\tilde{{\mbox{\boldmath${\theta}$}}}={\mbox{\boldmath${\theta}$}}_{T}$ with error $\|\tilde{{\mbox{\boldmath${\theta}$}}}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{4}{\rho(1-\eta^{1/2})}\Lambda_{nm}$. Similarly to the discussion of Theorems 3.3 and 3.4, we have

\|\tilde{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}=\mathcal{O}_{p}(\Omega),

where $\Omega=E_{n}$ is defined in (3.5). This means that $\|\tilde{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}=\tilde{\mathcal{O}}_{p}\left(\frac{\alpha}{\sqrt{n}}+\frac{1}{\sqrt{nm}}+\frac{1}{n}\right)$ for Option I, which is order-optimal if $n\gtrsim m$, and $\|\tilde{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}^{*}\|_{2}=\mathcal{\tilde{O}}_{p}\left(\frac{\beta}{\sqrt{n}}+\frac{1}{\sqrt{nm}}\right)$ for Option II, which is also order-optimal.

As mentioned, the distributed contraction rates depend strongly on the condition number $\kappa$, even for generalized linear models. Here we consider another specific case, distributed linear regression with the $L_{2}$ loss, and obtain stronger results under some specific conditions. We define the $L_{2}$ loss on the $k$th worker machine as

f_{k}({\mbox{\boldmath${\theta}$}})=\frac{1}{2n}\sum_{i\in\mathcal{I}_{k}}(y_{i}-{\mbox{\boldmath${x}$}}_{i}^{T}{\mbox{\boldmath${\theta}$}})^{2}=\frac{1}{2}{\mbox{\boldmath${\theta}$}}^{T}\hat{{\mbox{\boldmath${\Sigma}$}}}_{k}{\mbox{\boldmath${\theta}$}}-\hat{{\mbox{\boldmath${v}$}}}_{k}^{T}{\mbox{\boldmath${\theta}$}}+\frac{1}{2n}\sum_{i\in\mathcal{I}_{k}}y_{i}^{2},

where $\hat{{\mbox{\boldmath${\Sigma}$}}}_{k}=\frac{1}{n}\sum_{i\in\mathcal{I}_{k}}{\mbox{\boldmath${x}$}}_{i}{\mbox{\boldmath${x}$}}_{i}^{T}$ and $\hat{{\mbox{\boldmath${v}$}}}_{k}=\frac{1}{n}\sum_{i\in\mathcal{I}_{k}}{\mbox{\boldmath${x}$}}_{i}y_{i}$; and

f(𝜽)=1m+1k=1m+1fk(𝜽)=12𝜽T𝚺^𝜽𝒗^T𝜽+12Ni=1Nyi2,f({\mbox{\boldmath${\theta}$}})=\frac{1}{m+1}\sum_{k=1}^{m+1}f_{k}({\mbox{\boldmath${\theta}$}})=\frac{1}{2}{\mbox{\boldmath${\theta}$}}^{T}\hat{{\mbox{\boldmath${\Sigma}$}}}{\mbox{\boldmath${\theta}$}}-\hat{{\mbox{\boldmath${v}$}}}^{T}{\mbox{\boldmath${\theta}$}}+\frac{1}{2N}\sum_{i=1}^{N}y_{i}^{2},

where 𝚺^=1Ni=1N𝒙i𝒙iT\hat{{\mbox{\boldmath${\Sigma}$}}}=\frac{1}{N}\sum_{i=1}^{N}{\mbox{\boldmath${x}$}}_{i}{\mbox{\boldmath${x}$}}_{i}^{T} and 𝒗^=1Ni=1N𝒙iyi\hat{{\mbox{\boldmath${v}$}}}=\frac{1}{N}\sum_{i=1}^{N}{\mbox{\boldmath${x}$}}_{i}y_{i}.

For Algorithm 2 with the choice g(\boldsymbol{\theta})=0, we have the closed form

𝜽t+1=(𝚺^1+λ𝑰)1[(𝚺^1+λ𝑰)𝜽t𝒉(𝜽t)],{\mbox{\boldmath${\theta}$}}_{t+1}=(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\left[(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}}){\mbox{\boldmath${\theta}$}}_{t}-{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t})\right], (4.4)

where 𝒉(𝜽t){\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t}) is defined by (4.2).
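In this linear-regression case, update (4.4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the robust gradient aggregate `h_t` (the paper's \boldsymbol{h}(\boldsymbol{\theta}_{t}) from (4.2)) is assumed to be computed beforehand, and the variable names are ours.

```python
import numpy as np

def proximal_ls_step(theta_t, Sigma1_hat, h_t, lam):
    """One iterate of update (4.4) for the distributed L2 loss.

    Sigma1_hat : empirical Gram matrix on the first worker machine,
    h_t        : robust aggregate of the local gradients at theta_t,
    lam        : proximal regularization parameter (lambda in the paper).
    """
    A = Sigma1_hat + lam * np.eye(len(theta_t))
    # theta_{t+1} = A^{-1}[A theta_t - h(theta_t)]; solve instead of inverting.
    return np.linalg.solve(A, A @ theta_t - h_t)
```

With `lam = 0` and a well-conditioned local Gram matrix this reduces to a Newton-type step; choosing `lam` of order p/n matches the universal choice discussed after Theorem 4.4.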

We present the following assumptions.

Assumption 4.1

(i) E𝐱i=0E{\mbox{\boldmath${x}$}}_{i}=0 and E(𝐱i𝐱iT)=𝚺0E({\mbox{\boldmath${x}$}}_{i}{\mbox{\boldmath${x}$}}_{i}^{T})={\mbox{\boldmath${\Sigma}$}}\succ 0. The minimum eigenvalue σmin(𝚺)\sigma_{\min}({\mbox{\boldmath${\Sigma}$}}) is bounded away from zero.

(ii) N/Tr(𝚺)C>0N/\mathrm{Tr}({\mbox{\boldmath${\Sigma}$}})\geq C>0 and n/logmc>0n/\log m\geq c>0 where CC and cc are constants.

(iii) {𝚺1/2𝐱i}i=1N\{{\mbox{\boldmath${\Sigma}$}}^{-1/2}{\mbox{\boldmath${x}$}}_{i}\}_{i=1}^{N} are i.i.d. sub-Gaussian random vectors with bounded 𝚺1/2𝐱iψ2\|{\mbox{\boldmath${\Sigma}$}}^{-1/2}{\mbox{\boldmath${x}$}}_{i}\|_{\psi_{2}}.

(iv) {𝚺1/2𝐱i}i=1N\{{\mbox{\boldmath${\Sigma}$}}^{-1/2}{\mbox{\boldmath${x}$}}_{i}\}_{i=1}^{N} are i.i.d. random vectors, and each component is vv-sub-exponential.

Theorem 4.4

Assume that there exists a positive constant C_{1} such that either (1) n>C_{1}p and \lambda\geq 0, or (2) \lambda\geq C_{1}\mathrm{Tr}(\boldsymbol{\Sigma})/n and n/p is bounded away from zero. In addition,

(a) if Assumption 4.1(i)-(iii) holds and the fraction α\alpha of Byzantine machines satisfies (3.2), then with high probability,

(𝜽t𝜽^)23κ(11min{1/2,C2p/n}1+C3λ)t(𝜽0𝜽^)2+𝒪p(N1/2)+𝒪p(Δnmα);\left\|({\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}\leq\sqrt{3\kappa}\left(1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}\right)^{t}\left\|({\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}\\ +\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha});

(b) if Assumptions 4.1(i), (ii) and (iv) hold, and the fraction \alpha and the trimming parameter \beta satisfy \alpha\leq\beta\leq\frac{1}{2}-\varepsilon for some \varepsilon>0, then with high probability,

(𝜽t𝜽^)23κ(11min{1/2,C2p/n}1+C3λ)t(𝜽0𝜽^)2+𝒪p(N1/2)+𝒪p(Δnmβ),\left\|({\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}\leq\sqrt{3\kappa}\left(1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}\right)^{t}\left\|({\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}\\ +\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\beta}),

where κ=λmax(𝚺)/λmin(𝚺)\kappa=\lambda_{\max}({\mbox{\boldmath${\Sigma}$}})/\lambda_{\min}({\mbox{\boldmath${\Sigma}$}}), and Δnmα\Delta_{nm\alpha} and Δnmβ\Delta_{nm\beta} are defined in Theorem 3.1 and 3.2, respectively.

Note that if n/pn/p is large enough, then

11min{1/2,C2p/n}1+C3λ=C3λ+C2p/n1+C3λ=O(p/n)1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}=\frac{C_{3}\lambda+C_{2}p/n}{1+C_{3}\lambda}=O(p/n)

by choosing \lambda\asymp p/n based on the weak requirement on regularization (\lambda\geq 0; see Condition (1) in Theorem 4.4). Most distributed learning algorithms do not work when n/p is not very large, even without Byzantine failures. In that regime we still guarantee linear convergence at the rate

11min{1/2,C2p/n}1+C3λ=112(1+C3C1Tr(𝚺)/n)<11-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}=1-\frac{1}{2(1+C_{3}C_{1}\mathrm{Tr}({\mbox{\boldmath${\Sigma}$}})/n)}<1

by choosing λ=C1Tr(𝚺)/n\lambda=C_{1}\mathrm{Tr}({\mbox{\boldmath${\Sigma}$}})/n (see Condition (2) in Theorem 4.4). Further, Tr(𝚺)=O(p)\mathrm{Tr}({\mbox{\boldmath${\Sigma}$}})=O(p) in most scenarios, which implies λp/n\lambda\asymp p/n. Therefore, we choose

λp/n\lambda\asymp p/n

as a universal and adaptive choice for Algorithm 2, regardless of the size of n/p. Thus, Theorem 4.4 shows that whatever the size of n/p, proper regularization \lambda always yields linear convergence. Hence, we can handle distributed learning problems in which the amount of data per worker machine is not large; this situation is difficult for some existing algorithms (Zhang et al. (2013); Jordan et al. (2019)). We still achieve an order-optimal error rate up to logarithmic factors for both options of Algorithm 2, in the large-sample regime as well as in the general regime.

Another benefit of Algorithm 2 is that the contraction factor in Theorem 4.4 does not depend on the condition number \kappa at all, and \kappa has hardly any effect on the optimal statistical rate of Algorithm 2. It therefore helps relax the commonly used boundedness assumption on \kappa in Zhang et al. (2013), Jordan et al. (2019) and others; see also the remark on Theorem 3.3 of Fan et al. (2019). Their algorithms, however, cannot handle distributed learning with Byzantine failures.

5 Numerical experiments

5.1 Simulation experiments

In this subsection, we present several simulated examples to illustrate the performance of our algorithms BCLS and BCLSp, developed in Sections 3 and 4.

First, we evaluate our algorithms on distributed logistic regression. In the logistic regression model, all observations \{\boldsymbol{x}_{ij},y_{ij}\}_{i=1}^{n} for j\in[m] are generated independently from the model

yijBer(Pij),withlogPij1Pij=𝒙ij,𝜽,y_{ij}\sim\mathrm{Ber}(P_{ij}),~{}\mathrm{with}~{}\log\frac{P_{ij}}{1-P_{ij}}=\langle{\mbox{\boldmath${x}$}}_{ij},{\mbox{\boldmath${\theta}$}}^{*}\rangle, (5.1)

where \boldsymbol{x}_{ij}=(1,\boldsymbol{u}_{ij}^{T})^{T} with \boldsymbol{u}_{ij}\in\mathbb{R}^{p}. In our simulation, we keep the total sample size N=18000 and the dimension p=100 fixed; the covariate vector \boldsymbol{u}_{ij} is independently generated from N(\boldsymbol{0}_{p},\boldsymbol{\Sigma}) with \boldsymbol{\Sigma}=\mathrm{diag}(8,4,4,2,1,\cdots,1)\in\mathbb{R}^{p\times p}; for each replicate of the simulation, \boldsymbol{\theta}^{*}\in\mathbb{R}^{p+1} is a random vector with \|\boldsymbol{\theta}^{*}\|_{2}=3 whose direction is chosen uniformly at random from the sphere.

In the distributed learning, we split the whole dataset across the worker machines. According to the regimes of “large n”, “moderate n” and “small n”, we set the local sample size and the number of machines to (n,m)=(900,20), (450,40) and (300,60), respectively. For our BCLS and BCLSp algorithms, we need to simulate Byzantine failures: \alpha m worker machines are randomly chosen to be Byzantine, and one of the remaining machines serves as the first worker machine. In the experiments, we set \alpha=20\%; in the coordinate-wise trimmed mean, we set \beta=20\%. To evaluate the effect of initialization on convergence, we take the initial value to be either \boldsymbol{\theta}_{0}=\boldsymbol{0} or \bar{\boldsymbol{\theta}}, the local estimator based on the data of the first machine; these are referred to as “zero initialization” and “good initialization”, respectively.
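For concreteness, the simulation design above can be sketched as follows. This is our own hypothetical sketch, not the authors' code: the random seed and variable names are illustrative, and only the “large n” configuration (n,m)=(900,20) is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, m = 18000, 100, 20            # total sample, dimension, machines
n = N // m                          # local sample size: the "large n" regime

# Covariates u_ij ~ N(0, diag(8, 4, 4, 2, 1, ..., 1)).
sd = np.sqrt(np.concatenate([[8.0, 4.0, 4.0, 2.0], np.ones(p - 4)]))
u = rng.standard_normal((N, p)) * sd
x = np.hstack([np.ones((N, 1)), u])  # x_ij = (1, u_ij^T)^T

# theta* with ||theta*||_2 = 3 and direction uniform on the sphere.
theta_star = rng.standard_normal(p + 1)
theta_star *= 3.0 / np.linalg.norm(theta_star)

# Responses from the logistic model (5.1).
prob = 1.0 / (1.0 + np.exp(-x @ theta_star))
y = rng.binomial(1, prob)

workers = np.split(np.arange(N), m)                    # even split across machines
byz = rng.choice(m, size=int(0.2 * m), replace=False)  # alpha = 20% Byzantine
```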

We implement Algorithms BCLS and BCLSp with the median, trimmed mean and mean for aggregating the local gradients; the resulting variants are called BCLS-md, BCLS-tr, BCLS-me, BCLSp-md, BCLSp-tr and BCLSp-me, respectively. In the first experiment, we choose g(\cdot)=0. The local optimizations are carried out by mini-batch stochastic gradient descent. The estimation error \|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}\|_{2}, averaged over 50 simulation replications, is used to measure the performance of the different algorithms.
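The three aggregation rules differ only in how the m local gradient vectors are combined coordinate-wise; a minimal sketch, in our own notation, is:

```python
import numpy as np

def aggregate(grads, rule="median", beta=0.2):
    """Coordinate-wise aggregation of local gradients.

    grads : (m, p) array of gradients, one row per worker machine;
            Byzantine workers may report arbitrary rows.
    """
    if rule == "median":                  # BCLS-md / BCLSp-md
        return np.median(grads, axis=0)
    if rule == "trimmed":                 # BCLS-tr / BCLSp-tr
        m = grads.shape[0]
        k = int(beta * m)                 # trim k smallest and k largest per coordinate
        s = np.sort(grads, axis=0)
        return s[k:m - k].mean(axis=0)
    return grads.mean(axis=0)             # BCLS-me / BCLSp-me: not robust
```

A single adversarial worker already corrupts the mean while leaving the median and the trimmed mean essentially unchanged, which mirrors the divergence of the mean-based variants observed in the experiments.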

Figure 1 shows how the estimation errors evolve with the iterations for fixed p=100. Our algorithms (BCLS-tr, BCLS-md, BCLSp-tr and BCLSp-md) converge rapidly in all scenarios, whereas the mean methods (BCLS-me and BCLSp-me) do not converge at all. Our algorithms essentially converge after 2 iterations and do not require good initialization, in line with our theoretical results. This implies that our algorithms are robust against Byzantine failures, while the mean methods cannot tolerate such failures. In addition, the proposed Algorithm BCLSp is more robust and stable than Algorithm BCLS, especially for large m; note that a large m implies more Byzantine worker machines. Embedding the proximal technique in Algorithm BCLSp adds strictly convex quadratic regularization, which leads to better performance.

Refer to caption
Figure 1: Impacts of local sample size and initialization on the convergence of the algorithms for fixed p=100. The x-axis is the number of iterations or the rounds of communications, and the y-axis is the estimation error \|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}\|_{2}. The “Best-line” shows the error of the minimizer of the overall loss function. The top panel uses \boldsymbol{\bar{\theta}} as the initial value (“good initialization”) and the bottom panel uses \boldsymbol{0} (“zero initialization”).

Next, we use the distributed logistic regression model (5.1) with sparsity. This experiment validates the efficiency of our algorithms in the presence of a nonsmooth penalty. In the simulation, we still set the total sample size N=18000, but the dimensionality of \boldsymbol{\theta}^{*} is fixed at p=1000; the covariate vectors \boldsymbol{u}_{ij} are i.i.d. N(\boldsymbol{0},\boldsymbol{I}_{p}) and \boldsymbol{\theta}^{*}=(\boldsymbol{v}_{10}^{T},\boldsymbol{0}_{991}^{T})^{T}\in\mathbb{R}^{1001} with \boldsymbol{v}_{10}\sim N(\boldsymbol{0}_{10},\boldsymbol{I}_{10}). The \ell_{2}-norm of \boldsymbol{\theta}^{*} is constrained to 3. We choose the penalty function g(\boldsymbol{\theta})=\gamma\|\boldsymbol{\theta}\|_{1} with \gamma=0.2\sqrt{\frac{\log p}{N}} so that the nonzeros of \boldsymbol{\theta}^{*} can be recovered accurately by regularized maximum likelihood estimation over the whole dataset. As in the first experiment, we set (n,m)=(900,20), (450,40) and (300,60), \alpha=20\%, \beta=20\%, and use \boldsymbol{\theta}_{0}=\boldsymbol{0} (“zero initialization”) or \bar{\boldsymbol{\theta}} (“good initialization”), where \bar{\boldsymbol{\theta}} is the local estimator based on the dataset of the first machine. In Algorithm BCLSp, the penalty parameters are selected appropriately for the “good initialization” and “zero initialization” cases, respectively. The optimizations are carried out by mini-batch stochastic sub-gradient descent. All results are averages over 20 independent runs.
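With g(\boldsymbol{\theta})=\gamma\|\boldsymbol{\theta}\|_{1}, the proximal map of the penalty is the coordinate-wise soft-thresholding operator, which is what sets small coordinates exactly to zero and recovers sparsity. A brief sketch, assuming the penalty level from the experiment (the threshold passed to the operator at each step is illustrative):

```python
import numpy as np

def soft_threshold(theta, tau):
    """Proximal operator of tau * ||.||_1: shrink each coordinate toward zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)

# Penalty level from the experiment: gamma = 0.2 * sqrt(log(p) / N).
N, p = 18000, 1000
gamma = 0.2 * np.sqrt(np.log(p) / N)
```

Applying `soft_threshold` after each (sub)gradient step keeps the iterates sparse; coordinates whose magnitude falls below the threshold are zeroed out exactly.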

Figure 2 presents the performance of our algorithms and of the mean aggregation methods (BCLS-me and BCLSp-me). With proper regularization, our algorithms still work well whether the initial value is “good” or “bad”, while the mean aggregation methods fail to converge. For this nonsmooth problem, Algorithms BCLSp-tr and BCLSp-md are more robust than Algorithms BCLS-tr and BCLS-md, and begin to converge after just 2 rounds of communication, especially in the case of “good initialization”.

Refer to caption
Figure 2: Performance of the nonsmooth regularized algorithms for distributed sparse logistic regression. The x-axis is the number of iterations or the rounds of communications, and the y-axis is the estimation error \|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}\|_{2}. The “Best-line” shows the error of the minimizer of the overall loss function. The top panel uses \boldsymbol{\bar{\theta}} as the initial value (“good initialization”) and the bottom panel uses \boldsymbol{0} (“zero initialization”).

Here, we summarize the above simulations to highlight several outstanding advantages of our algorithms.

(1) The proposed Byzantine-robust CSL distributed learning algorithms (BCLS-md and BCLS-tr) and Byzantine-robust CSL-proximal distributed learning algorithms (BCLSp-md and BCLSp-tr) can indeed defend against Byzantine failures.

(2) The proposed algorithms converge rapidly, usually within several rounds of communication, and do not require good initialization; both findings are consistent with our statistical theory.

(3) Algorithms BCLSp-md and BCLSp-tr are more robust than Algorithms BCLS-md and BCLS-tr, because embedding the proximal technique in Algorithm BCLSp adds strictly convex quadratic regularization.

5.2 Real data

In this subsection, we further assess the performance of the proposed algorithms on a real data example. We choose the Spambase dataset from the UC Irvine Machine Learning Repository (Dua and Graff, 2017). The spam e-mails in the Spambase dataset came from the collectors’ postmaster and from individuals who had filed spam, and the non-spam e-mails came from filed work and personal e-mails. The number of instances (total sample size) is 4600, and the number of attributes (features) is 57, based on word frequencies and other characteristics. The goal is to use distributed logistic regression to construct a personalized spam filter that distinguishes spam e-mails from normal ones. In the experiment, we randomly select 1000 instances as the testing set and use the remaining 3600 instances as the training set; we split the training set across the worker machines according to (n,m)=(180,20) (“small m”), (120,30) (“moderate m”) and (90,40) (“large m”), respectively; and we set \alpha=20\% for the fraction of Byzantine worker machines and \beta=20\% for the BCLS-tr and BCLSp-tr algorithms. We use the classification error on the test set as the evaluation criterion.

Figure 3 shows the average performance of the six algorithms mentioned in Subsection 5.1. The testing errors of our algorithms are very low, which implies that they can accurately separate spam from non-spam, even with more Byzantine worker machines (“large m”). The filters based on Algorithms BCLS-me and BCLSp-me fail. These results are consistent with those of the simulation experiments. Overall, the experiments on the real data also support our theoretical findings.

Refer to caption
Figure 3: Experiments on the Spambase dataset. The x-axis is the number of iterations or the rounds of communications, and the y-axis is the testing error. The “Best-line” shows the error of the filter based on all of the training dataset.

Acknowledgments

This work is partially supported by Chinese National Social Science Fund (No. 19BTJ034).

References

  • Blanchard et al. (2017) Blanchard P, El Mhamdi EM, Guerraoui R, Stainer J. Machine learning with adversaries: Byzantine tolerant gradient descent. Proceedings of NIPS 2017;:118–28.
  • Chen et al. (2017) Chen Y, Su L, Xu J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc ACM Meas Anal Comput Syst 2017;1:1–25.
  • Dua and Graff (2017) Dua D, Graff C. Uci machine learning repository 2017;URL: https://archive.ics.uci.edu/ml/datasets/Spambase.
  • Fan et al. (2019) Fan J, Guo Y, Wang K. Communication-efficient accurate statistical estimation. arXiv preprint 2019;arXiv:1906.04870.
  • Jordan et al. (2019) Jordan MI, Lee JD, Yang Y. Communication-efficient distributed statistical inference. Journal of the American Statistical Association 2019;114(526):668–81.
  • Konečnỳ et al. (2016) Konečnỳ J, McMahan HB, Yu FX, Richtárik P, Suresh AT, Bacon D. Federated learning: Strategies for improving communication efficiency. arXiv preprint 2016;arXiv:1610.05492.
  • Lee et al. (2017) Lee JD, Lin Q, Ma T, Yang T. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research 2017;18:4404–46.
  • Mei et al. (2018) Mei S, Bai Y, Montanari A. The landscape of empirical risk for nonconvex losses. The Annals of Statistics 2018;46:2747–74.
  • Minsker (2015) Minsker S. Geometric median and robust estimation in banach spaces. Bernoulli 2015;21(4):2308–35.
  • Nesterov (2004) Nesterov Y. Introductory Lectures on Convex Optimization: A Basic Course. New York: Springer Science and Business Media, 2004.
  • Parikh and Boyd (2014) Parikh N, Boyd S. Proximal algorithms. Foundations and Trends® in Optimization 2014;1:127–239.
  • Rockafellar (1976) Rockafellar RT. Monotone operators and the proximal point algorithm. SIAM Journal on control and optimization 1976;14:877–98.
  • Shamir et al. (2014) Shamir O, Srebro N, Zhang T. Communication efficient distributed optimization using an approximate newton-type method. Proceedings of the 31st International Conference on Machine Learning 2014;:1000–8.
  • Su and Xu (2018) Su L, Xu J. Securing distributed machine learning in high dimensions. arxiv Preprint 2018;arXiv:1804.10140.
  • Tibshirani (1996) Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological) 1996;58(1):267–88.
  • Vempaty et al. (2013) Vempaty A, Tong L, Varshney PK. Distributed inference with byzantine data: State-of-the-art review on data falsification attacks. IEEE Signal Processing Magazine 2013;30(5):65–75.
  • Wang et al. (2017a) Wang J, Kolar M, Srebro N, Zhang T. Efficient distributed learning with sparsity. Proceedings of Machine Learning Research, PMLR 2017a;70:3636–45.
  • Wang et al. (2017b) Wang J, Wang W, Srebro N. Memory and communication efficient distributed stochastic optimization with minibatch prox. Proceedings of the 2017 Conference on Learning Theory, PMLR 2017b;65:1882–919.
  • Wu et al. (2019) Wu Z, Ling Q, Chen T, Giannakis GB. Federated variance-reduced stochastic gradient descent with robustness to byzantine attacks. arXiv Preprint 2019;arXiv:1912.12716v1.
  • Xie et al. (2018) Xie C, Koyejo O, Gupta I. Generalized byzantine-tolerant sgd. arXiv Preprint 2018;arXiv:1802.10116.
  • Yang et al. (2019) Yang Z, Gang A, Bajwa WU. Adversary-resilient inference and machine learning: From distributed to decentralized. arXiv Preprint 2019;arXiv:1908.08649.
  • Yin et al. (2018) Yin D, Chen Y, Ramchandran K, Bartlett P. Byzantine-robust distributed learning: towards optimal statistical rates. Proceedings of the 35th International Conference on Machine Learning, PMLR 2018;80:5650–9.
  • Zhang et al. (2015) Zhang Y, Duchi J, Wainwright M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research 2015;16:3299–340.
  • Zhang et al. (2013) Zhang Y, Duchi JC, Wainwright MJ. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research 2013;14:3321–63.

Appendix A

A.1 Proof of Theorems 3.1 and 3.2

Proof of Theorem 3.1: Define

φ(𝝃)=argmin𝜽{f1(𝜽)f1(𝝃)𝒉(𝝃),𝜽+g(𝜽)}.\varphi({\mbox{\boldmath${\xi}$}})=\mathrm{argmin}_{\mbox{\boldmath${\theta}$}}\left\{f_{1}({\mbox{\boldmath${\theta}$}})-\langle{\mbox{\boldmath${\nabla}$}}f_{1}({\mbox{\boldmath${\xi}$}})-{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\xi}$}}),{\mbox{\boldmath${\theta}$}}\rangle+g({\mbox{\boldmath${\theta}$}})\right\}.

Obviously, 𝜽t+1=φ(𝜽t){\mbox{\boldmath${\theta}$}}_{t+1}=\varphi({\mbox{\boldmath${\theta}$}}_{t}). By Theorem 8 in Yin et al. (2018) and the law of large numbers, for 𝝃𝚯{\mbox{\boldmath${\xi}$}}\in{\mbox{\boldmath${\Theta}$}}, we have

f_{1}(\boldsymbol{\theta})-\langle\nabla f_{1}(\boldsymbol{\xi})-\boldsymbol{h}(\boldsymbol{\xi}),\boldsymbol{\theta}\rangle+g(\boldsymbol{\theta}) (A.1)
=\left\{f_{1}(\boldsymbol{\theta})-\langle\nabla f_{1}(\boldsymbol{\xi})-\nabla f(\boldsymbol{\xi}),\boldsymbol{\theta}\rangle+g(\boldsymbol{\theta})\right\}+\langle\boldsymbol{h}(\boldsymbol{\xi})-\nabla F(\boldsymbol{\xi}),\boldsymbol{\theta}\rangle+\langle\nabla F(\boldsymbol{\xi})-\nabla f(\boldsymbol{\xi}),\boldsymbol{\theta}\rangle
=\left\{f_{1}(\boldsymbol{\theta})-\langle\nabla f_{1}(\boldsymbol{\xi})-\nabla f(\boldsymbol{\xi}),\boldsymbol{\theta}\rangle+g(\boldsymbol{\theta})\right\}+o_{p}(1).

Therefore, \varphi(\hat{\boldsymbol{\theta}})=\hat{\boldsymbol{\theta}}+o_{p}(1); that is, \hat{\boldsymbol{\theta}} is an asymptotic fixed point of \varphi(\cdot). For fixed \boldsymbol{\theta}\in B(\hat{\boldsymbol{\theta}},R), by the first-order condition of \varphi(\boldsymbol{\theta}), we have

f1(𝜽)𝒉(𝜽){f1(φ(𝜽))+g(φ(𝜽))}.\nabla f_{1}({\mbox{\boldmath${\theta}$}})-{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}})\in\partial\left\{f_{1}(\varphi({\mbox{\boldmath${\theta}$}}))+g(\varphi({\mbox{\boldmath${\theta}$}}))\right\}. (A.2)

Further,

f1(𝜽^)𝒉(𝜽^){f1(𝜽^)+g(𝜽^)},\nabla f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})-{\mbox{\boldmath${h}$}}(\hat{{\mbox{\boldmath${\theta}$}}})\in\partial\left\{f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})+g(\hat{{\mbox{\boldmath${\theta}$}}})\right\}, (A.3)

by using the fact φ(𝜽^)=𝜽^\varphi(\hat{{\mbox{\boldmath${\theta}$}}})=\hat{{\mbox{\boldmath${\theta}$}}}.

By the Taylor expansion, Assumption 3.4 and Theorem 8 in Yin et al. (2018), with probability at least 14p(1+n(m+1)L~D)p1-\frac{4p}{(1+n(m+1)\tilde{L}D)^{p}}, we have

[f1(𝜽)h(𝜽)][f1(𝜽^)h(𝜽^)]2\displaystyle\|[\nabla f_{1}({\mbox{\boldmath${\theta}$}})-h({\mbox{\boldmath${\theta}$}})]-[\nabla f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})-h(\hat{{\mbox{\boldmath${\theta}$}}})]\|_{2} (A.4)
\displaystyle\leq [f1(𝜽)F(𝜽)][f1(𝜽^)F(𝜽^)]2+[F(𝜽)h(𝜽)][F(𝜽^)h(𝜽^)]2\displaystyle\|[\nabla f_{1}({\mbox{\boldmath${\theta}$}})-\nabla F({\mbox{\boldmath${\theta}$}})]-[\nabla f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})-\nabla F(\hat{{\mbox{\boldmath${\theta}$}}})]\|_{2}+\|[\nabla F({\mbox{\boldmath${\theta}$}})-h({\mbox{\boldmath${\theta}$}})]-[\nabla F(\hat{{\mbox{\boldmath${\theta}$}}})-h(\hat{{\mbox{\boldmath${\theta}$}}})]\|_{2}
\displaystyle\leq [f1(𝜽)f1(𝜽^)][F(𝜽)F(𝜽^)]2+2Δnmα\displaystyle\|[\nabla f_{1}({\mbox{\boldmath${\theta}$}})-\nabla f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})]-[\nabla F({\mbox{\boldmath${\theta}$}})-\nabla F(\hat{{\mbox{\boldmath${\theta}$}}})]\|_{2}+2\Delta_{nm\alpha}
=\displaystyle= 01(2f1[(1t)𝜽^+t𝜽]2F[(1t)𝜽^+t𝜽])(𝜽𝜽^)𝑑t2+2Δnmα\displaystyle\left\|\int_{0}^{1}\left(\nabla^{2}f_{1}[(1-t)\hat{{\mbox{\boldmath${\theta}$}}}+t{\mbox{\boldmath${\theta}$}}]-\nabla^{2}F[(1-t)\hat{{\mbox{\boldmath${\theta}$}}}+t{\mbox{\boldmath${\theta}$}}]\right)({\mbox{\boldmath${\theta}$}}-\hat{{\mbox{\boldmath${\theta}$}}})dt\right\|_{2}+2\Delta_{nm\alpha}
\displaystyle\leq supζB(𝜽^,R)2f1(ζ)2F(ζ)2𝜽𝜽^2+2Δnmα\displaystyle\sup_{\zeta\in B(\hat{{\mbox{\boldmath${\theta}$}}},R)}\|\nabla^{2}f_{1}(\zeta)-\nabla^{2}F(\zeta)\|_{2}\|{\mbox{\boldmath${\theta}$}}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+2\Delta_{nm\alpha}
\displaystyle\leq δ𝜽𝜽^2+2ΔnmαδR+2Δnmα=(δ+2Δnmα/R)R<ρR.\displaystyle\delta\|{\mbox{\boldmath${\theta}$}}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+2\Delta_{nm\alpha}\leq\delta R+2\Delta_{nm\alpha}=(\delta+2\Delta_{nm\alpha}/R)R<\rho R.

Next, we will show that under (A.2)-(A.4), we have

φ(𝜽)𝜽^2[f1(𝜽)h(𝜽)][f1(𝜽^)h(𝜽^)]2/ρ.\|\varphi({\mbox{\boldmath${\theta}$}})-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\left\|[\nabla f_{1}({\mbox{\boldmath${\theta}$}})-h({\mbox{\boldmath${\theta}$}})]-[\nabla f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})-h(\hat{{\mbox{\boldmath${\theta}$}}})]\right\|_{2}/\rho. (A.5)

If we know φ(𝜽)𝜽^2R\|\varphi({\mbox{\boldmath${\theta}$}})-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq R in advance, then

ρφ(𝜽)𝜽^22\displaystyle\rho\|\varphi({\mbox{\boldmath${\theta}$}})-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2} \displaystyle\leq [f1(𝜽)h(𝜽)][f1(𝜽^)h(𝜽^)],φ(θ)𝜽^\displaystyle\left\langle[\nabla f_{1}({\mbox{\boldmath${\theta}$}})-h({\mbox{\boldmath${\theta}$}})]-[\nabla f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})-h(\hat{{\mbox{\boldmath${\theta}$}}})],\varphi(\theta)-\hat{{\mbox{\boldmath${\theta}$}}}\right\rangle
\displaystyle\leq [f1(𝜽)h(𝜽)][f1(𝜽^)h(𝜽^)]2φ(θ)𝜽^2.\displaystyle\|[\nabla f_{1}({\mbox{\boldmath${\theta}$}})-h({\mbox{\boldmath${\theta}$}})]-[\nabla f_{1}(\hat{{\mbox{\boldmath${\theta}$}}})-h(\hat{{\mbox{\boldmath${\theta}$}}})]\|_{2}\|\varphi(\theta)-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}.

Here the first inequality follows from the \rho-strong convexity of f_{1}+g in B(\hat{\boldsymbol{\theta}},R) together with (A.2)-(A.3); the second step uses the Cauchy-Schwarz inequality. Then we obtain the desired result (A.5). It remains to rule out the other case, for which we argue by contradiction. Suppose that \|\varphi(\boldsymbol{\theta})-\hat{\boldsymbol{\theta}}\|_{2}>R, and define \varphi(\bar{\boldsymbol{\theta}})=\hat{\boldsymbol{\theta}}+R(\varphi(\boldsymbol{\theta})-\hat{\boldsymbol{\theta}})/\|\varphi(\boldsymbol{\theta})-\hat{\boldsymbol{\theta}}\|_{2}. Obviously, \|\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\|_{2}=R. By the strong convexity of f_{1}+g in B(\hat{\boldsymbol{\theta}},R) and (A.2)-(A.3) again,

\left\langle[\nabla f_{1}(\bar{\boldsymbol{\theta}})-h(\bar{\boldsymbol{\theta}})]-[\nabla f_{1}(\hat{\boldsymbol{\theta}})-h(\hat{\boldsymbol{\theta}})],\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\right\rangle\geq\rho\|\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\|_{2}^{2}; (A.6)

Notice that

φ(𝜽¯)𝜽^=Rφ(𝜽)𝜽^2R(φ(𝜽)φ(𝜽¯)).\varphi(\bar{{\mbox{\boldmath${\theta}$}}})-\hat{{\mbox{\boldmath${\theta}$}}}=\frac{R}{\|\varphi({\mbox{\boldmath${\theta}$}})-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}-R}(\varphi({\mbox{\boldmath${\theta}$}})-\varphi(\bar{{\mbox{\boldmath${\theta}$}}})).

Thus, by the convexity of f1+gf_{1}+g and (A.2), we always have

\left\langle[\nabla f_{1}(\boldsymbol{\theta})-h(\boldsymbol{\theta})]-[\nabla f_{1}(\bar{\boldsymbol{\theta}})-h(\bar{\boldsymbol{\theta}})],\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\right\rangle (A.7)
=\frac{R}{\|\varphi(\boldsymbol{\theta})-\hat{\boldsymbol{\theta}}\|_{2}-R}\left\langle[\nabla f_{1}(\boldsymbol{\theta})-h(\boldsymbol{\theta})]-[\nabla f_{1}(\bar{\boldsymbol{\theta}})-h(\bar{\boldsymbol{\theta}})],\varphi(\boldsymbol{\theta})-\varphi(\bar{\boldsymbol{\theta}})\right\rangle\geq 0,

for any \bar{\boldsymbol{\theta}}\in B(\hat{\boldsymbol{\theta}},R). Adding (A.6) and (A.7), and then applying the Cauchy-Schwarz inequality, one gets

\rho\|\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\|_{2}^{2} \leq \left\langle[\nabla f_{1}(\boldsymbol{\theta})-h(\boldsymbol{\theta})]-[\nabla f_{1}(\hat{\boldsymbol{\theta}})-h(\hat{\boldsymbol{\theta}})],\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\right\rangle
\leq \|[\nabla f_{1}(\boldsymbol{\theta})-h(\boldsymbol{\theta})]-[\nabla f_{1}(\hat{\boldsymbol{\theta}})-h(\hat{\boldsymbol{\theta}})]\|_{2}\|\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\|_{2}.

Thus, \|[\nabla f_{1}(\boldsymbol{\theta})-h(\boldsymbol{\theta})]-[\nabla f_{1}(\hat{\boldsymbol{\theta}})-h(\hat{\boldsymbol{\theta}})]\|_{2}\geq\rho\|\varphi(\bar{\boldsymbol{\theta}})-\hat{\boldsymbol{\theta}}\|_{2}=\rho R, which contradicts (A.4). Therefore, we must have \|\varphi(\boldsymbol{\theta})-\hat{\boldsymbol{\theta}}\|_{2}\leq R, and hence (A.5) holds. Together with (A.4), with probability at least 1-\frac{4p}{(1+n(m+1)\tilde{L}D)^{p}}, we have

φ(𝜽)𝜽^2δρ𝜽𝜽^2+2ρΔnmα.\|\varphi({\mbox{\boldmath${\theta}$}})-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\frac{\delta}{\rho}\|{\mbox{\boldmath${\theta}$}}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}+\frac{2}{\rho}\Delta_{nm\alpha}.

Take 𝜽=𝜽t{\mbox{\boldmath${\theta}$}}={\mbox{\boldmath${\theta}$}}_{t}, then φ(𝜽t)=𝜽t+1\varphi({\mbox{\boldmath${\theta}$}}_{t})={\mbox{\boldmath${\theta}$}}_{t+1}. We complete the proof of Theorem 3.1.

Proof of Theorem 3.2: The proof is essentially the same as that of Theorem 3.1, except that the analysis of the coordinate-wise median of the local gradients is replaced by that of the coordinate-wise trimmed mean of means estimator; both analyses can be found in Yin et al. (2018).

A.2 Proof of Theorems 3.3 and 3.4

Theorems 3.3 and 3.4 are special cases of Theorem 4.3(ii) obtained by taking \lambda=0; see subsection A. for the proof of Theorem 4.3.

A.3 Proofs of Theorems 4.1, 4.2 and Corollary 4.1

Proof of Theorem 4.1: Under the conditions of Theorem 4.1, if 0<\|\boldsymbol{\theta}_{t}-\hat{\boldsymbol{\theta}}\|_{2}<R/2, then (4.3) holds; Theorem 4.1 then follows directly from this result by induction. Below we prove the result.

Recall that 𝜽^=argmin𝜽(f(𝜽)+g(𝜽))\hat{{\mbox{\boldmath${\theta}$}}}=\mathrm{argmin}_{\mbox{\boldmath${\theta}$}}(f({\mbox{\boldmath${\theta}$}})+g({\mbox{\boldmath${\theta}$}})). Denote 𝜽t+=proxλ1(f+g)(𝜽t){\mbox{\boldmath${\theta}$}}_{t}^{+}=\mathrm{prox}_{\lambda^{-1}(f+g)}({\mbox{\boldmath${\theta}$}}_{t}). By the triangle inequality,

𝜽t+1𝜽^2𝜽t+1𝜽t+2+𝜽t+𝜽^2.\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\|{\mbox{\boldmath${\theta}$}}_{t+1}-{\mbox{\boldmath${\theta}$}}_{t}^{+}\|_{2}+\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}. (A.8)

For the first term on the right-hand side of (A.8), \|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t}^{+}\|_{2}, we can obtain contracting optimization errors from Theorem 3.1 by taking g(\boldsymbol{\theta}) in Algorithm 1 to be \tilde{g}(\boldsymbol{\theta})=g(\boldsymbol{\theta})+\frac{\lambda}{2}\|\boldsymbol{\theta}-\boldsymbol{\theta}_{t}\|_{2}^{2}. Thus, \boldsymbol{\theta}_{t}^{+}=\mathrm{argmin}_{\boldsymbol{\theta}}\left\{f(\boldsymbol{\theta})+\tilde{g}(\boldsymbol{\theta})\right\}. Together with (4.2), \boldsymbol{\theta}_{t+1} can be regarded as the first iterate of Algorithm 1 initialized at \boldsymbol{\theta}_{t} for computing \boldsymbol{\theta}_{t}^{+}; here \boldsymbol{\theta}_{t}^{+} plays the role of \hat{\boldsymbol{\theta}} in Algorithm 1. To match the assumptions of Theorem 3.1, we still need the following assertions:

(i) 𝜽tB(𝜽t+,R/2){\mbox{\boldmath${\theta}$}}_{t}\in B({\mbox{\boldmath${\theta}$}}_{t}^{+},R/2);

(ii) f+g~f+\tilde{g} is (ρ+λ)(\rho+\lambda)-strongly convex in B(𝜽t+,R/2)B({\mbox{\boldmath${\theta}$}}_{t}^{+},R/2);

(iii) 2f1(𝜽)2F(𝜽)2δ\|\nabla^{2}f_{1}({\mbox{\boldmath${\theta}$}})-\nabla^{2}F({\mbox{\boldmath${\theta}$}})\|_{2}\leq\delta for all 𝜽B(𝜽t+,R/2){\mbox{\boldmath${\theta}$}}\in B({\mbox{\boldmath${\theta}$}}_{t}^{+},R/2).

Let f~=f+g\tilde{f}=f+g. Notice that

𝜽^=proxλ1f~(𝜽^).\hat{{\mbox{\boldmath${\theta}$}}}=\mathrm{prox}_{\lambda^{-1}\tilde{f}}(\hat{{\mbox{\boldmath${\theta}$}}}). (A.9)

By the well-known “firm non-expansiveness” property of the proximal operator (Parikh and Boyd, 2014), we have

proxλ1f~(𝜽t)proxλ1f~(𝜽^)22𝜽t𝜽^,proxλ1f~(𝜽t)proxλ1f~(𝜽^).\left\|\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\mathrm{prox}_{\lambda^{-1}\tilde{f}}(\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}^{2}\leq\left\langle{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}},\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\mathrm{prox}_{\lambda^{-1}\tilde{f}}(\hat{{\mbox{\boldmath${\theta}$}}})\right\rangle. (A.10)

From (A.9)-(A.10), we have 𝜽t+𝜽^2=proxλ1f~(𝜽t)𝜽^2𝜽t𝜽^2\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}=\|\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}. So, the condition 𝜽t𝜽^2<R/2\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}<R/2 implies B(𝜽t+,R/2)B(𝜽^,R)B({\mbox{\boldmath${\theta}$}}_{t}^{+},R/2)\subset B(\hat{{\mbox{\boldmath${\theta}$}}},R). Assumptions 3.3 and 3.4 imply that (ii) and (iii) hold, respectively. On the other hand,

proxλ1f~(𝜽t)𝜽t22=[proxλ1f~(𝜽t)𝜽^][𝜽t𝜽^]22\displaystyle\left\|\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-{\mbox{\boldmath${\theta}$}}_{t}\right\|_{2}^{2}=\left\|[\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}}]-[{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}]\right\|_{2}^{2}
=\displaystyle= proxλ1f~(𝜽t)𝜽^22+𝜽t𝜽^222proxλ1f~(𝜽t)𝜽^,𝜽t𝜽^\displaystyle\left\|\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}}\right\|_{2}^{2}+\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}-2\left\langle\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}},{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\right\rangle
\displaystyle\leq proxλ1f~(𝜽t)𝜽^22+𝜽t𝜽^222proxλ1f~(𝜽t)𝜽^22\displaystyle\left\|\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}}\right\|_{2}^{2}+\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}-2\left\|\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}}\right\|_{2}^{2}
=\displaystyle= 𝜽t𝜽^22proxλ1f~(𝜽t)𝜽^22,\displaystyle\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}-\left\|\mathrm{prox}_{\lambda^{-1}\tilde{f}}({\mbox{\boldmath${\theta}$}}_{t})-\hat{{\mbox{\boldmath${\theta}$}}}\right\|_{2}^{2},

that is,

𝜽t+𝜽t22𝜽t𝜽^22𝜽t+𝜽^22.\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}^{2}\leq\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}-\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}.

Further,

𝜽t+𝜽t2𝜽t𝜽^21𝜽t+𝜽^22/𝜽t𝜽^22,\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}\leq\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\sqrt{1-\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}/\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}}, (A.11)

which leads to 𝜽t+𝜽t2𝜽t𝜽^2<R/2\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}\leq\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}<R/2. Therefore, (i) holds.
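As an illustrative numerical sanity check (not part of the proof), the firm non-expansiveness inequality (A.10) and its consequence (A.11) can be verified for a proximal operator available in closed form; the soft-thresholding operator of the ℓ1 norm is used below as a hypothetical stand-in, with minimizer 0 as a fixed point of the prox:

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
px, py = prox_l1(x, 0.7), prox_l1(y, 0.7)

# Firm non-expansiveness (A.10): ||prox(x)-prox(y)||^2 <= <x-y, prox(x)-prox(y)>.
assert np.dot(px - py, px - py) <= np.dot(x - y, px - py) + 1e-12

# The (A.11)-type consequence with theta_hat = 0 (a fixed point of prox_l1):
# ||prox(x) - x||^2 <= ||x - theta_hat||^2 - ||prox(x) - theta_hat||^2.
assert np.linalg.norm(px - x) ** 2 <= np.dot(x, x) - np.dot(px, px) + 1e-12
print("firm non-expansiveness checks passed")
```

Both inequalities hold for any inputs, matching the derivation above.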

Based on the above assertions and the fact that (δ+2R1Δnmαρ+λ)2<ρρ+2λ\left(\frac{\delta+2R^{-1}\Delta_{nm\alpha}}{\rho+\lambda}\right)^{2}<\frac{\rho}{\rho+2\lambda} implies ρ+λ>δ+2Δnmα/R\rho+\lambda>\delta+2\Delta_{nm\alpha}/R, Theorem 3.1 gives

𝜽t+1𝜽t+2δρ+λ𝜽t𝜽t+2+2ρ+λΔnmα.\|{\mbox{\boldmath${\theta}$}}_{t+1}-{\mbox{\boldmath${\theta}$}}_{t}^{+}\|_{2}\leq\frac{\delta}{\rho+\lambda}\|{\mbox{\boldmath${\theta}$}}_{t}-{\mbox{\boldmath${\theta}$}}_{t}^{+}\|_{2}+\frac{2}{\rho+\lambda}\Delta_{nm\alpha}. (A.12)

From (A.8), (A.11) and (A.12), we have

𝜽t+1𝜽^2\displaystyle\|{\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2} \displaystyle\leq δρ+λ𝜽t𝜽t+2+2ρ+λΔnmα+𝜽t+𝜽^2\displaystyle\frac{\delta}{\rho+\lambda}\|{\mbox{\boldmath${\theta}$}}_{t}-{\mbox{\boldmath${\theta}$}}_{t}^{+}\|_{2}+\frac{2}{\rho+\lambda}\Delta_{nm\alpha}+\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2} (A.13)
\displaystyle\leq 𝜽t𝜽^2(δρ+λ1𝜽t+𝜽^22𝜽t𝜽^22+𝜽t+𝜽^2𝜽t𝜽^2)+2ρ+λΔnmα\displaystyle\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\left(\frac{\delta}{\rho+\lambda}\sqrt{1-\frac{\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}}{\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}}}+\frac{\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}{\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}\right)+\frac{2}{\rho+\lambda}\Delta_{nm\alpha}
=\displaystyle= 𝜽t𝜽^2κ(𝜽t+𝜽^2𝜽t𝜽^2)+2ρ+λΔnmα,\displaystyle\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\kappa\left(\frac{\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}{\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}\right)+\frac{2}{\rho+\lambda}\Delta_{nm\alpha},

where κ(u)=δρ+λ1u2+u\kappa(u)=\frac{\delta}{\rho+\lambda}\sqrt{1-u^{2}}+u.

Next, we prove

𝜽t+𝜽^2𝜽t𝜽^2λρ+λ<1,\frac{\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}{\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}\leq\frac{\lambda}{\rho+\lambda}<1, (A.14)

which makes (A.13) valid. Indeed, on the one hand, 𝜽t+𝜽^2𝜽t𝜽^2\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}\leq\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}; on the other hand, 𝜽^=argmin𝜽f~(𝜽)\hat{{\mbox{\boldmath${\theta}$}}}=\mathrm{argmin}_{\mbox{\boldmath${\theta}$}}\tilde{f}({\mbox{\boldmath${\theta}$}}) and 𝜽t+=argmin𝜽(f~(𝜽)+λ/2𝜽𝜽t22){\mbox{\boldmath${\theta}$}}_{t}^{+}=\mathrm{argmin}_{\mbox{\boldmath${\theta}$}}\left(\tilde{f}({\mbox{\boldmath${\theta}$}})+\lambda/2\|{\mbox{\boldmath${\theta}$}}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}^{2}\right) imply that 𝟎f~(𝜽^){\mbox{\boldmath${0}$}}\in\partial\tilde{f}(\hat{{\mbox{\boldmath${\theta}$}}}) and λ(𝜽t+𝜽t)f~(𝜽t+)-\lambda({\mbox{\boldmath${\theta}$}}_{t}^{+}-{\mbox{\boldmath${\theta}$}}_{t})\in\partial\tilde{f}({\mbox{\boldmath${\theta}$}}_{t}^{+}). Since f~\tilde{f} is ρ\rho-strongly convex in B(𝜽^,R/2)B(\hat{{\mbox{\boldmath${\theta}$}}},R/2), the basic properties of strongly convex functions (Nesterov, 2004) give

ρ𝜽t+𝜽^22\displaystyle\rho\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2} \displaystyle\leq λ(𝜽t+𝜽t)𝟎,𝜽t+𝜽^\displaystyle\langle-\lambda({\mbox{\boldmath${\theta}$}}_{t}^{+}-{\mbox{\boldmath${\theta}$}}_{t})-{\mbox{\boldmath${0}$}},{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\rangle
=\displaystyle= λ𝜽t+𝜽^22λ𝜽^𝜽t,𝜽t+𝜽^\displaystyle-\lambda\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}-\lambda\langle\hat{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}_{t},{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\rangle
\displaystyle\leq λ𝜽t+𝜽^22+λ𝜽^𝜽t2𝜽t+𝜽^2.\displaystyle-\lambda\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}^{2}+\lambda\|\hat{{\mbox{\boldmath${\theta}$}}}-{\mbox{\boldmath${\theta}$}}_{t}\|_{2}\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}.

Thus, we get (A.14).
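As a minimal numerical sketch of (A.14) (assuming, for illustration only, the quadratic f̃(θ) = (ρ/2)‖θ‖², for which θ̂ = 0 and the proximal point has a closed form), the ratio equals λ/(ρ+λ) exactly:

```python
import numpy as np

rho, lam = 1.3, 2.0                # strong-convexity and proximal parameters (arbitrary)
theta_t = np.array([0.5, -1.2, 0.8])

# For f_tilde(theta) = (rho/2)||theta||^2, theta_hat = 0 and
# theta_plus = argmin f_tilde(theta) + (lam/2)||theta - theta_t||^2
# solves rho*theta + lam*(theta - theta_t) = 0, i.e.
theta_plus = lam / (rho + lam) * theta_t

ratio = np.linalg.norm(theta_plus) / np.linalg.norm(theta_t)
# The bound (A.14) holds, with equality in this quadratic example.
assert abs(ratio - lam / (rho + lam)) < 1e-12
assert ratio < 1
```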

For κ(u)\kappa(u) in (A.13), the argument satisfies 0u<10\leq u<1 by (A.14), and κ(u)\kappa(u) is an increasing function on [0,1/1+[δ/(ρ+λ)]2][0,1/\sqrt{1+[\delta/(\rho+\lambda)]^{2}}]. Notice that (δ+2R1Δnmαρ+λ)2<ρρ+2λ\left(\frac{\delta+2R^{-1}\Delta_{nm\alpha}}{\rho+\lambda}\right)^{2}<\frac{\rho}{\rho+2\lambda} implies (δρ+λ)2<ρρ+2λ\left(\frac{\delta}{\rho+\lambda}\right)^{2}<\frac{\rho}{\rho+2\lambda}. Therefore,

11+[δ/(ρ+λ)]2>λ+ρ/2ρ+λλ+ρ/2ρ+λλρ+λ.\frac{1}{\sqrt{1+[\delta/(\rho+\lambda)]^{2}}}>\frac{\sqrt{\lambda+\rho/2}}{\sqrt{\rho+\lambda}}\geq\frac{\lambda+\rho/2}{\rho+\lambda}\geq\frac{\lambda}{\rho+\lambda}.

Then, by (A.14) and (δρ+λ)2<ρρ+2λ\left(\frac{\delta}{\rho+\lambda}\right)^{2}<\frac{\rho}{\rho+2\lambda},

κ(𝜽t+𝜽^2𝜽t𝜽^2)\displaystyle\kappa\left(\frac{\|{\mbox{\boldmath${\theta}$}}_{t}^{+}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}{\|{\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}}\|_{2}}\right) \displaystyle\leq κ(λρ+λ)=δρ+λ1(λρ+λ)2+λρ+λ\displaystyle\kappa\left(\frac{\lambda}{\rho+\lambda}\right)=\frac{\delta}{\rho+\lambda}\sqrt{1-\left(\frac{\lambda}{\rho+\lambda}\right)^{2}}+\frac{\lambda}{\rho+\lambda}
=\displaystyle= δρ+λρ2+2ρλ+λρ+λ\displaystyle\frac{\frac{\delta}{\rho+\lambda}\sqrt{\rho^{2}+2\rho\lambda}+\lambda}{\rho+\lambda}
=\displaystyle= [(δρ+λ)2ρ(ρ+2λ)+λ]/(ρ+λ)<1.\displaystyle\left[\sqrt{\left(\frac{\delta}{\rho+\lambda}\right)^{2}\rho(\rho+2\lambda)}+\lambda\right]/(\rho+\lambda)<1.

Combining with (A.13), we complete the proof of Theorem 4.1.
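As a quick numerical illustration (not part of the proof), the contraction factor κ(λ/(ρ+λ)) computed above stays strictly below 1 whenever the theorem's condition on δ holds; the sampling scheme below is hypothetical, chosen only to satisfy that condition:

```python
import numpy as np

def kappa(u, delta, rho, lam):
    """kappa(u) = delta/(rho+lam) * sqrt(1 - u^2) + u, as defined after (A.13)."""
    return delta / (rho + lam) * np.sqrt(1.0 - u ** 2) + u

rng = np.random.default_rng(1)
for _ in range(1000):
    rho, lam = rng.uniform(0.1, 5.0, size=2)
    # delta chosen to satisfy (delta/(rho+lam))^2 < rho/(rho+2*lam)
    delta = 0.99 * (rho + lam) * np.sqrt(rho / (rho + 2 * lam))
    u_star = lam / (rho + lam)
    # kappa is increasing up to 1/sqrt(1+(delta/(rho+lam))^2), which exceeds u_star,
    # and the resulting contraction factor kappa(u_star) is below 1.
    assert u_star < 1.0 / np.sqrt(1.0 + (delta / (rho + lam)) ** 2)
    assert kappa(u_star, delta, rho, lam) < 1.0
print("contraction factor < 1 in all trials")
```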

Proof of Theorem 4.2: The proof is essentially the same as that of Theorem 4.1, but invokes Theorem 3.3 instead of Theorem 3.1 to bound 𝜽t+1𝜽t+2\|{\mbox{\boldmath${\theta}$}}_{t+1}-{\mbox{\boldmath${\theta}$}}_{t}^{+}\|_{2} in (A.8).

Proof of Corollary 4.1: The proof is similar to those of Corollaries 3.1 and 3.2 in Fan et al. (2019); only a few additional algebraic manipulations are needed, so we omit the details.

A.4 Proofs of Theorems 4.3 and 4.4

Proof of Theorem 4.3: The result follows by combining the proof of Corollary 4.1 with Lemma A.5 in Fan et al. (2019), which gives the order of the Hessian difference on the first worker machine in the GLM. This also yields the contraction rate and the choice of λ\lambda.

Proof of Theorem 4.4: We only prove (a). The proof of (b) is similar, applying Bernstein’s inequality for sub-exponential random variables. Let 𝚵=(𝚺^+λ𝑰)1/2(𝚺^1𝚺^)(𝚺^+λ𝑰)1/2{\mbox{\boldmath${\Xi}$}}=(\hat{{\mbox{\boldmath${\Sigma}$}}}+\lambda{\mbox{\boldmath${I}$}})^{-1/2}\left(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}-\hat{{\mbox{\boldmath${\Sigma}$}}}\right)\left(\hat{{\mbox{\boldmath${\Sigma}$}}}+\lambda{\mbox{\boldmath${I}$}}\right)^{-1/2}.

First, we have the following basic facts:

(i) P{𝚵1/2}12en/CP\{{\mbox{\boldmath${\Xi}$}}\leq 1/2\}\geq 1-2e^{-n/C} for some constant CC determined by 𝒙iψ2\|{\mbox{\boldmath${x}$}}_{i}\|_{\psi_{2}}, under either condition (1) n>C1pn>C_{1}p and λ0\lambda\geq 0, or (2) λC1Tr(𝚺)/n\lambda\geq C_{1}\mathrm{Tr}({\mbox{\boldmath${\Sigma}$}})/n.

(ii) P{𝚺1/2(𝚺^𝚺)𝚺1/221/2}12eN/CP\left\{\left\|{\mbox{\boldmath${\Sigma}$}}^{-1/2}(\hat{{\mbox{\boldmath${\Sigma}$}}}-{\mbox{\boldmath${\Sigma}$}}){\mbox{\boldmath${\Sigma}$}}^{-1/2}\right\|_{2}\leq 1/2\right\}\geq 1-2e^{-N/C} for the constant CC in (i).

(iii) P{𝚵C2p/n}12eC1pP\{{\mbox{\boldmath${\Xi}$}}\leq C_{2}\sqrt{p/n}\}\geq 1-2e^{-C_{1}p} for some constants C1C_{1} and C2C_{2}.

By fact (i), we can choose λ\lambda appropriately such that, with high probability, 𝚵1/2{\mbox{\boldmath${\Xi}$}}\leq 1/2. Then

(iv) 𝑰𝚺^1/2(𝚺^1+λ𝑰)1𝚺^1/222𝚵2+λ/λmin(𝚺^)1+λ/λmin(𝚺^)\left\|{\mbox{\boldmath${I}$}}-\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}\right\|_{2}\leq\frac{2{\mbox{\boldmath${\Xi}$}}^{2}+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}{1+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}, for any λ0\lambda\geq 0.
These facts can be found in Lemmas A.6-A.8 of Fan et al. (2019).

Now we give the main proof. From (4.4), the law of large numbers and Theorem 8 in Yin et al. (2018), we have

𝜽t+1\displaystyle{\mbox{\boldmath${\theta}$}}_{t+1} =\displaystyle= (𝚺^1+λ𝑰)1[(𝚺^1+λ𝑰)𝜽tf(𝜽t)]\displaystyle(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\left[(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}}){\mbox{\boldmath${\theta}$}}_{t}-\nabla f({\mbox{\boldmath${\theta}$}}_{t})\right]
+(𝚺^1+λ𝑰)1[f(𝜽t)F(𝜽t)]+(𝚺^1+λ𝑰)1[F(𝜽t)𝒉(𝜽t)]\displaystyle+(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\left[\nabla f({\mbox{\boldmath${\theta}$}}_{t})-\nabla F({\mbox{\boldmath${\theta}$}}_{t})\right]+(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\left[\nabla F({\mbox{\boldmath${\theta}$}}_{t})-{\mbox{\boldmath${h}$}}({\mbox{\boldmath${\theta}$}}_{t})\right]
=\displaystyle= [𝑰(𝚺^1+λ𝑰)1𝚺^]𝜽t+(𝚺^1+λ𝑰)1𝒗^+𝒪p(N1/2)+𝒪p(Δnmα).\displaystyle[{\mbox{\boldmath${I}$}}-(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\hat{{\mbox{\boldmath${\Sigma}$}}}]{\mbox{\boldmath${{\mbox{\boldmath${\theta}$}}}$}}_{t}+(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\hat{{\mbox{\boldmath${v}$}}}+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha}).

Further, one gets

𝚺^1/2(𝜽t+1𝜽^)=𝚺^1/2(𝜽t+1𝚺^1𝒗^)\displaystyle\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}})=\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\Sigma}$}}}^{-1}\hat{{\mbox{\boldmath${v}$}}})
=\displaystyle= 𝚺^1/2[𝑰(𝚺^1+λ𝑰)1𝚺^]𝜽t+𝚺^1/2(𝚺^1+λ𝑰)1𝒗^𝚺^1/2𝒗^\displaystyle\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}[{\mbox{\boldmath${I}$}}-(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\hat{{\mbox{\boldmath${\Sigma}$}}}]{\mbox{\boldmath${{\mbox{\boldmath${\theta}$}}}$}}_{t}+\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\hat{{\mbox{\boldmath${v}$}}}-\hat{{\mbox{\boldmath${\Sigma}$}}}^{-1/2}\hat{{\mbox{\boldmath${v}$}}}
+𝒪p(N1/2)+𝒪p(Δnmα)\displaystyle+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha})
=\displaystyle= [𝑰𝚺^1/2(𝚺^1+λ𝑰)1𝚺^1/2]𝚺^1/2(𝜽t𝜽^)+𝒪p(N1/2)+𝒪p(Δnmα).\displaystyle[{\mbox{\boldmath${I}$}}-\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}(\hat{{\mbox{\boldmath${\Sigma}$}}}_{1}+\lambda{\mbox{\boldmath${I}$}})^{-1}\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}]\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}})+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha}).

Together with (iv), we have

𝚺^1/2(𝜽t+1𝜽^)22𝚵2+λ/λmin(𝚺^)1+λ/λmin(𝚺^)𝚺^1/2(𝜽t𝜽^)2+𝒪p(N1/2)+𝒪p(Δnmα).\left\|\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}\leq\frac{2{\mbox{\boldmath${\Xi}$}}^{2}+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}{1+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}\left\|\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha}). (A.15)

From (i)-(iii), we have 𝚵1/2{\mbox{\boldmath${\Xi}$}}\leq 1/2, 𝚵C2p/n{\mbox{\boldmath${\Xi}$}}\leq C_{2}\sqrt{p/n} and λmin(𝚺^)C31>0\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})\geq C_{3}^{-1}>0 simultaneously with high probability, where C2C_{2} and C3C_{3} are some positive constants. So, with high probability, we have

2𝚵2+λ/λmin(𝚺^)1+λ/λmin(𝚺^)=112𝚵21+λ/λmin(𝚺^)11min{1/2,C2p/n}1+C3λ.\frac{2{\mbox{\boldmath${\Xi}$}}^{2}+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}{1+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}=1-\frac{1-2{\mbox{\boldmath${\Xi}$}}^{2}}{1+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}\leq 1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}. (A.16)
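The rearrangement in (A.16) is the elementary identity (2a+b)/(1+b) = 1 − (1−2a)/(1+b), with a playing the role of Ξ² and b the role of λ/λmin(Σ̂); a quick check with arbitrary values:

```python
# Identity behind (A.16): (2a + b)/(1 + b) = 1 - (1 - 2a)/(1 + b).
a, b = 0.07, 1.9   # arbitrary values with a <= 1/2 and b >= 0
lhs = (2 * a + b) / (1 + b)
rhs = 1 - (1 - 2 * a) / (1 + b)
assert abs(lhs - rhs) < 1e-12
```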

From (A.15)-(A.16), we have

𝚺^1/2(𝜽t+1𝜽^)2\displaystyle\left\|\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t+1}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2} \displaystyle\leq 2𝚵2+λ/λmin(𝚺^)1+λ/λmin(𝚺^)𝚺^1/2(𝜽t𝜽^)2+𝒪p(N1/2)+𝒪p(Δnmα)\displaystyle\frac{2{\mbox{\boldmath${\Xi}$}}^{2}+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}{1+\lambda/\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}\left\|\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha}) (A.17)
\displaystyle\leq (11min{1/2,C2p/n}1+C3λ)𝚺^1/2(𝜽t𝜽^)2\displaystyle\left(1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}\right)\left\|\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}
+𝒪p(N1/2)+𝒪p(Δnmα)\displaystyle+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha})
\displaystyle\leq (11min{1/2,C2p/n}1+C3λ)t+1𝚺^1/2(𝜽0𝜽^)2\displaystyle\left(1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}\right)^{t+1}\left\|\hat{{\mbox{\boldmath${\Sigma}$}}}^{1/2}({\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}
+𝒪p(N1/2)+𝒪p(Δnmα).\displaystyle+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha}).
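The last step of (A.17) unrolls a linear recursion: if e_{t+1} ≤ γe_t + c with γ < 1, then e_t ≤ γ^t e_0 + c/(1−γ), which is how the contraction factor and the 𝒪p terms combine over iterations. A small numerical sketch with arbitrary γ and c:

```python
# Unrolling a contraction: e_{t+1} <= gamma * e_t + c with gamma < 1
# implies e_t <= gamma**t * e_0 + c / (1 - gamma).
gamma, c, e0 = 0.8, 0.01, 5.0
e = e0
for t in range(1, 51):
    e = gamma * e + c
    assert e <= gamma ** t * e0 + c / (1 - gamma) + 1e-12
```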

In addition, from (ii), we have 12𝚺𝚺^𝚺12𝚺-\frac{1}{2}{\mbox{\boldmath${\Sigma}$}}\preceq\hat{{\mbox{\boldmath${\Sigma}$}}}-{\mbox{\boldmath${\Sigma}$}}\preceq\frac{1}{2}{\mbox{\boldmath${\Sigma}$}}, and then 12𝚺𝚺^32𝚺\frac{1}{2}{\mbox{\boldmath${\Sigma}$}}\preceq\hat{{\mbox{\boldmath${\Sigma}$}}}\preceq\frac{3}{2}{\mbox{\boldmath${\Sigma}$}}. Therefore,

12λmin(𝚺)λmin(𝚺^)λmax(𝚺^)32λmax(𝚺).\frac{1}{2}\lambda_{\min}({\mbox{\boldmath${\Sigma}$}})\leq\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})\leq\lambda_{\max}(\hat{{\mbox{\boldmath${\Sigma}$}}})\leq\frac{3}{2}\lambda_{\max}({\mbox{\boldmath${\Sigma}$}}).

Thus,

λmax(𝚺^)λmin(𝚺^)3λmax(𝚺)λmin(𝚺)=3κ.\frac{\lambda_{\max}(\hat{{\mbox{\boldmath${\Sigma}$}}})}{\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}\leq\frac{3\lambda_{\max}({\mbox{\boldmath${\Sigma}$}})}{\lambda_{\min}({\mbox{\boldmath${\Sigma}$}})}=3\kappa. (A.18)
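The sandwich ½Σ ⪯ Σ̂ ⪯ (3/2)Σ and the resulting condition-number bound (A.18) can be illustrated numerically; the construction of Σ̂ below is hypothetical, chosen only so that the sandwich holds:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)          # a generic positive definite Sigma
S_half = np.linalg.cholesky(Sigma)       # Sigma = S_half @ S_half.T

# Build Sigma_hat with (1/2) Sigma <= Sigma_hat <= (3/2) Sigma via a
# symmetric perturbation E with spectral norm at most 1/2.
E = rng.normal(size=(p, p)); E = (E + E.T) / 2
E *= 0.5 / np.linalg.norm(E, 2)
Sigma_hat = S_half @ (np.eye(p) + E) @ S_half.T

eig_S = np.linalg.eigvalsh(Sigma)        # ascending eigenvalues
eig_Sh = np.linalg.eigvalsh(Sigma_hat)
kappa_S = eig_S[-1] / eig_S[0]

# (A.18): the condition number of Sigma_hat is at most 3 * kappa(Sigma).
assert eig_Sh[-1] / eig_Sh[0] <= 3 * kappa_S + 1e-9
```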

From (A.17)-(A.18), we have

(𝜽t𝜽^)2\displaystyle\left\|({\mbox{\boldmath${\theta}$}}_{t}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2} \displaystyle\leq (11min{1/2,C2p/n}1+C3λ)t(λmax(𝚺^)λmin(𝚺^))1/2(𝜽0𝜽^)2\displaystyle\left(1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}\right)^{t}\left(\frac{\lambda_{\max}(\hat{{\mbox{\boldmath${\Sigma}$}}})}{\lambda_{\min}(\hat{{\mbox{\boldmath${\Sigma}$}}})}\right)^{1/2}\left\|({\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}
+𝒪p(N1/2)+𝒪p(Δnmα)\displaystyle+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha})
\displaystyle\leq 3κ(11min{1/2,C2p/n}1+C3λ)t(𝜽0𝜽^)2\displaystyle\sqrt{3\kappa}\left(1-\frac{1-\min\{1/2,C_{2}p/n\}}{1+C_{3}\lambda}\right)^{t}\left\|({\mbox{\boldmath${\theta}$}}_{0}-\hat{{\mbox{\boldmath${\theta}$}}})\right\|_{2}
+𝒪p(N1/2)+𝒪p(Δnmα).\displaystyle+\mathcal{O}_{p}(N^{-1/2})+\mathcal{O}_{p}(\Delta_{nm\alpha}).

This completes the proof of (a).