This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

BFGS-ADMM for Large-Scale Distributed Optimization

Yichuan Li1, Yonghai Gong2, Nikolaos M. Freris2, Petros Voulgaris3 and Dušan Stipanović1 *This work was supported by the Ministry of Science and Technology of China under grant 2019YFB2102200, the Anhui Dept. of Science and Technology under grant 201903a05020049, the Tencent Holdings Ltd. under grant FR202003. 1Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL 61820, USA. [email protected], [email protected]. 2 School of Computer Science, University of Science and Technology of China, Hefei, Anhui, 230027, China. [email protected], [email protected].3Department of Mechanical Engineering, University of Nevada, Reno, NV 89557, USA. [email protected].
Abstract

We consider a class of distributed optimization problem where the objective function consists of a sum of strongly convex and smooth functions and a (possibly nonsmooth) convex regularizer. A multi-agent network is assumed, where each agent holds a private cost function and cooperates with its neighbors to compute the optimum of the aggregate objective. We propose a quasi-Newton Alternating Direction Method of Multipliers (ADMM) where the primal update is solved inexactly with approximated curvature information. By introducing an intermediate consensus variable, we achieve a block diagonal Hessian which eliminates the need for inner communication loops within the network when computing the update direction. We establish global linear convergence to the optimal primal-dual solution without the need for backtracking line search, under the assumption that component cost functions are strongly convex with Lipschitz continuous gradients. Numerical simulations on real datasets demonstrate the advantages of the proposed method over state of the art.

I INTRODUCTION

Distributed optimization acts as the computational engine for a wide range of applications in modern technology, including distributed control [1], power grid management [2], distributed machine learning [3], and resource allocation in sensor networks [4]. We consider a classical problem of minimizing a sum of cost functions and a convex regularizer, aiming at promoting sparsity in the solution structure. In specific:

minimizex^d{i=1mfi(x^)+g(x^)},\displaystyle\underset{\hat{x}\in\mathbb{R}^{d}}{\mathrm{minimize}}\left\{\sum_{i=1}^{m}f^{i}(\hat{x})+g(\hat{x})\right\}, (1)

where fi:df^{i}:\mathbb{R}^{d}\to\mathbb{R} captures the ii-th agent’s objective of interest and g:dg:\mathbb{R}^{d}\to\mathbb{R} is a possibly nonsmooth regularizer, such as the weighted 1\ell_{1}-norm. In a network of agents, each agent aims to minimize its local cost function fi()f^{i}(\cdot) while communicating with its neighbors to cooperatively find the solution of the global problem. Distributed solutions are often pursued to fully utilize the computational power of the agents and reduce the amount of message passing in the network. Efficient communication is deemed especially desirable in scenarios where the number of participants mm, and the dimension of the decision variable dd is large, such as in emerging applications of Cyber-Physical systems (CPS) [5]. An archetypal framework of distributed optimization introduces local decision variables at each agent, which are updated locally using private data and exchange variables with neighboring agents. A network-wide solution is achieved by means of asymptotic elimination of the consensus error, i.e., the difference between the values of neighboring agents converges to zero. Such a framework is termed distributed consensus optimization and offers significant flexibility for each agent to select appropriate updating schemes that suit its local hardware environment.

First order methods [6]-[9] constitute popular choices for solving (1) in a distributed fashion. The presence of the nonsmooth regularizer prohibits gradient descent from being implemented directly as g()g(\cdot) is not differentiable. Subgradient methods [6] invoke the notion of subdifferential set to compute descent directions while letting agents exchange information over a time-varying topology. Proximal gradient [8] accommodates the nonsmoothness by splitting the objective function and evaluating the proximal mapping associated with the nondifferentiable part. Various acceleration schemes [7], [9] exist for first order methods, that involve storing past gradient and iterate information so as to form a momentum correction term. However, as the size of data grow the condition number of problem (1) tends to get large which causes first order methods to exhibit extremely slow convergence rate. A natural remedy for the aforementioned issues is to consider second order methods [10]-[12]. Through use of function curvature information, second order methods compute descent directions in the objective function level sets that are effective at accelerating the convergence over first order methods. Proximal Newton [12] can be considered as the second order counterpart of the proximal gradient but the associated proximal mapping increases computational burden drastically and renders distributed implementation infeasible as global information is required to compute the Newton step. Moreover, second order methods require solving a linear system to compute the Newton step which induces a computation cost of 𝒪(d3)\mathcal{O}(d^{3}). Quasi-Newton methods [13] circumvent this procedure by using a finite difference of gradients to approximate the Hessian. The reduced computation costs along with competitive performance have rendered quasi-Newton methods a desirable alternative to second order methods, especially for Large-scale optimization problems.

Primal-dual methods [14]-[15] provide a different perspective for problem (1) within the framework of constrained optimization, where the consensus constraint is explicitly enforced. The Alternating Direction Method of Multipliers (ADMM) [15] falls into this catergory where the smooth and nonsmooth parts of the objective are considered separately and iterates are computed sequentially. However, at each iteration of the ADMM, a sub-optimization problem must be solved, which typically induces an expensive computational burden.

Contributions: (i) We propose BFGS-ADMM for convex composite optimization where the primal update (typically computational expensive) is replaced with a quasi-Newton step that does not require solving a linear system of equations. Moreover, by introducing an intermediate consensus variable, we eliminate the need for inner loops within the network when computing the update direction. (ii) We establish global linear convergence rate for the proposed algorithm without backtracking, under the assumption that private cost functions are strongly convex with Lipschitz continuous gradients. (iii) The advantage of BFGS-ADMM over first order methods is analytically established and experimentally demonstrated with numerical simulations on real datasets.

Notation: We denote column vectors xdx\in\mathbb{R}^{d} with lower case letters and matrices An×mA\in\mathbb{R}^{n\times m} with capital letters. Superscripts are used to denote partitioned vector components and subscripts are used to denote iteration steps, e.g., xtix^{i}_{t} denotes the value of subvector xix^{i} at step tt. Matrix entries are denoted as [A]ij[A]^{ij}. Unless specified otherwise, x\norm{x} and A\norm{A} denote the Euclidean norm of a vector and the corresponding induced norm of a matrix, respectively. We define the norm of a vector associated with a positive definite matrix P0P\succ 0 as xP:=xPx\norm{x}_{P}:=\sqrt{x^{\top}Px}, and the set {1,,m}\{1,\dots,m\} is abbreviated as [m][m]. The proximal mapping associated with the function g():dg(\cdot):\mathbb{R}^{d}\to\mathbb{R} is defined as proxμg(v)=argmin𝜃{g(θ)+12μθv2}\textbf{prox}_{\mu g}(v)=\underset{\theta}{\mathrm{argmin}}\left\{g(\theta)+\tfrac{1}{2\mu}\norm{\theta-v}^{2}\right\}.

II PRELIMINARIES

II-A Problem formulation

Consider a connected network of mm agents captured by a mm-th order undirected graph: 𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}), where 𝒱={1,,m}\mathcal{V}=\{1,\dots,m\} is the vertex set, \mathcal{E} contains pairs (i,j)(i,j) if and only if agent ii communicates with agent jj, and ||=n\absolutevalue{\mathcal{E}}=n captures the number of communication links in the network. We denote the neighborhood of agent ii as 𝒩i={j:(i,j)}\mathcal{N}_{i}=\{j:(i,j)\in\mathcal{E}\}. We further define the source and the destination matrix (arbitrary ordering suffices since both are given the same value) A^s,A^dn×m\hat{A}_{s},\hat{A}_{d}\in\mathbb{R}^{n\times m} respectively as follows. Each row of A^s\hat{A}_{s} and A^d\hat{A}_{d} corresponds to a communication link k=(i,j)\mathcal{E}_{k}=(i,j), for some k[n]k\in[n], and [A^s]ki=[A^d]kj=1[\hat{A}_{s}]^{ki}=[\hat{A}_{d}]^{kj}=1 with all other entries equal to 0. We reformulate (1) to the consensus framework by introducing local decision variables xix^{i} held by the corresponding agent ii, along with a set of intermediate consensus variables {zij}\{z^{ij}\}. Specifically,

minimizexi,θ,zd{i=1mfi(xi)+g(θ)}\displaystyle\underset{x^{i},\,\theta,\,z\,\in\mathbb{R}^{d}}{\mathrm{minimize}}\left\{\sum_{i=1}^{m}f^{i}(x^{i})+g(\theta)\right\}
s.t.xi=zij=xjiandj𝒩i,\displaystyle\text{s.t.}\quad x^{i}=z^{ij}=x^{j}\quad\forall\,i\,\,\text{and}\,\,j\in\mathcal{N}_{i},
xl=θfor some l[m],\displaystyle x^{l}=\theta\quad\text{for some }l\in[m], (2)

where the ll-th agent is chosen arbitrarily to enforce the additional constraint for the nonsmooth regularizer decision variable. The constraints enforce consensus among the network if the underlying graph is connected and, therefore, (2) is equivalent to (1) in the sense that they have the same set of optimal solutions, i.e., x^=x1==xm=θ=z\hat{x}_{\star}=x^{1}_{\star}=\dots=x^{m}_{\star}=\theta_{\star}=z_{\star}. Consensus can be achieved by directly enforcing xi=xjx^{i}=x^{j} but having a intermediate consensus variable zijz^{ij} is crucial in terms of decoupling the primal variable xix^{i} so that computing quasi-Newton steps does not require additional communication within the network. We provide further discussion on this design choice in Section III Remark 1. We may compactly express (2) using the aforementioned source and destination matrices as follows:

minimizexmd,θd,znd{F(x)+g(θ)}\displaystyle\underset{x\in\mathbb{R}^{md},\theta\in\mathbb{R}^{d},z\in\mathbb{R}^{nd}}{\mathrm{minimize}}\left\{F(x)+g(\theta)\right\}
s.t.Ax=[A^sIdA^dId]x=[IndInd]z=Dz,\displaystyle\text{s.t.}\quad Ax=\begin{bmatrix}\hat{A}_{s}\otimes I_{d}\\ \hat{A}_{d}\otimes I_{d}\end{bmatrix}x=\begin{bmatrix}I_{nd}\\ I_{nd}\end{bmatrix}z=Dz,
Cx=θ,\displaystyle Cx=\theta, (3)

where \otimes denotes the Kronecker product and II denotes the identity matrix of the corresponding dimension. We stack local decision variables as xmd:=[(x1),,(xm)]x\in\mathbb{R}^{md}:=[(x^{1})^{\top},\dots,(x^{m})^{\top}]^{\top} and similarly for the consensus variable znd:=[(z1),,(zn)]z\in\mathbb{R}^{nd}:=[(z^{1})^{\top},\dots,(z^{n})^{\top}]^{\top}, and matrix AA and DD is obtained by stacking matrices as shown in (3). The matrix Cd×md:=(cl)IdC\in\mathbb{R}^{d\times md}:=(c^{l})^{\top}\otimes I_{d} enforces the last line of constraint of (2) using the coordinate selection vector clmc^{l}\in\mathbb{R}^{m}, whose entries are zero except the ll-th one being 1. We present some equalities that link our constraint matrices to the topology of the graph in the following.

E^s=A^sA^dE^u=A^s+A^d\displaystyle\hat{E}_{s}=\hat{A}_{s}-\hat{A}_{d}\quad\hat{E}_{u}=\hat{A}_{s}+\hat{A}_{d} (4)
E^sE^s=L^sE^uE^u=L^u\displaystyle\hat{E}_{s}^{\top}\hat{E}_{s}=\hat{L}_{s}\quad\hat{E}_{u}^{\top}\hat{E}_{u}=\hat{L}_{u} (5)
Δ^=12(L^s+L^u)=A^sA^s+A^dA^d\displaystyle\hat{\Delta}=\tfrac{1}{2}(\hat{L}_{s}+\hat{L}_{u})=\hat{A}_{s}^{\top}\hat{A}_{s}+\hat{A}_{d}^{\top}\hat{A}_{d} (6)

Matrices E^s,E^un×m\hat{E}_{s},\hat{E}_{u}\in\mathbb{R}^{n\times m} stand for the signed and unsigned incidence matrix of the graph, respectively, while L^s,L^um×m\hat{L}_{s},\hat{L}_{u}\in\mathbb{R}^{m\times m} denotes the signed and unsigned Laplacian matrix of the graph, respectively. The diagonal degree matrix of the graph is denoted as Δ^\hat{\Delta}, whose entries are Δ^ii=|𝒩i|\hat{\Delta}^{ii}=|\mathcal{N}_{i}|.

Assumption 1. Local cost functions fi()f^{i}(\cdot) are twice differentiable with uniformly bounded Hessian as follows:

mfI2fi()MfI,i[m],\displaystyle m_{f}I\preceq\nabla^{2}f^{i}(\cdot)\preceq M_{f}I,\,\,\forall\,\,i\in[m], (7)

where 0<mfMf<0<m_{f}\leq M_{f}<\infty. Since F(x)=i=1mfi(xi)F(x)=\sum_{i=1}^{m}f^{i}(x^{i}), 2F(x)\nabla^{2}F(x) is block diagonal with the ii-th block being 2fi(xi)\nabla^{2}f^{i}(x^{i}). Therefore, the same bounds apply to the Hessian of F(x)F(x) as well, i.e., mfI2F(x)MfIm_{f}I\preceq\nabla^{2}F(x)\preceq M_{f}I.

II-B Introduction to BFGS

Quasi-Newton methods [16]-[18] constitute a popular class of methods for accelerating the convergence of optimization methods without directly invoking the Hessian. Quasi-Newton methods seek to approximate the curvature information using consecutive gradient evaluations and iterates differences. More specifically, we define the descent direction hth_{t} as

ht:=Bt1ft,\displaystyle h_{t}:=-B_{t}^{-1}\nabla f_{t}, (8)

where BtB_{t} is a positive definite matrix that approximates the Hessian 2ft\nabla^{2}f_{t}. Various schemes exist for designing algorithms that iteratively update BtB_{t} including those by Davidon, Fletcher, and Powell (DFP) [16] and Broyden, Fletcher, Goldfarb, and Shannon (BFGS) [17].
In this paper, we focus on the BFGS scheme to select Bt1B_{t}^{-1} as it is observed to work best in the practice both in terms of convergence speed and self-correcting capabilities [18]. The BFGS approximates the Hessian using finite differences of consecutive iterates and gradient evaluations. In specific, define the iterates difference and the gradient difference as follows:

st=xt+1xt,dt=f(xt+1)f(xt).\displaystyle s_{t}=x_{t+1}-x_{t},\quad d_{t}=\nabla f(x_{t+1})-\nabla f(x_{t}).

Quasi-Newton methods select Bt+1B_{t+1} so that the secant condition is satisfied, i.e., Bt+1st=dtB_{t+1}s_{t}=d_{t}, which is motivated by the fact that the real Hessian satisfies this condition as xt+1x_{t+1} tends to xtx_{t}. To ensure Bt+1>0B_{t+1}>0, it must hold that stdt>0s_{t}^{\top}d_{t}>0 as can seen by premultiplying the secant equation with sts_{t}^{\top}. For convex objectives, this condition is satisfied automatically. Note that, however, the secant condition alone is not enough to specify Bt+1B_{t+1}. To uniquely determine Bt+1B_{t+1}, we further require its inverse to be close to the previous value Bt1B_{t}^{-1} in the following sense:

minimizeB1B1Bt1𝐖\displaystyle\underset{B^{-1}}{\mathrm{minimize}}\quad\norm{B^{-1}-B_{t}^{-1}}_{\mathbf{W}}
s.t.B1=(B1),B1dt=st,\displaystyle\text{s.t.}\quad B^{-1}=(B^{-1})^{\top},\quad B^{-1}d_{t}=s_{t}, (9)

where A𝐖:=W12AW12𝐅\norm{A}_{\mathbf{W}}:=\norm{W^{\tfrac{1}{2}}AW^{\tfrac{1}{2}}}_{\mathbf{F}} stands for the weighted Frobenius norm associated with matrix WW whose inverse is the average Hessian [18]. A closed form solution for (9) can be obtained as:

Bt+11=(Istdtdtst)Bt1(Idtstdtst)+ststdtst.\displaystyle B_{t+1}^{-1}=\left(I-\tfrac{s_{t}d_{t}^{\top}}{d_{t}^{\top}s_{t}}\right)B_{t}^{-1}\left(I-\tfrac{d_{t}s_{t}^{\top}}{d_{t}^{\top}s_{t}}\right)+\tfrac{s_{t}s_{t}^{\top}}{d_{t}^{\top}s_{t}}. (10)

II-C Alternating Direction Method of Multipliers

We proceed to introduce the Augmented Lagrangian associated with problem (2) as follows:

(x,z,θ;y,λ)=F(x)+g(θ)+y(AxDz)\displaystyle\mathcal{L}(x,z,\theta;y,\lambda)=F(x)+g(\theta)+y^{\top}(Ax-Dz)
+λ(Cxθ)+12μ1AxDz2+12μ2Cxθ2,\displaystyle+\lambda^{\top}(Cx-\theta)+\tfrac{1}{2\mu_{1}}\norm{Ax-Dz}^{2}+\tfrac{1}{2\mu_{2}}\norm{Cx-\theta}^{2},

where y2nd,λdy\in\mathbb{R}^{2nd},\lambda\in\mathbb{R}^{d} are dual variables corresponding to the two linear constraints in (3). The Alternating Direction Method of Multipliers (ADMM) solves (2) by sequentially minimizing the Augmented Lagrangian over primal/dual variables as:

xt+1\displaystyle x_{t+1} =argmin𝑥(x,zt,θt;yt),\displaystyle=\underset{x}{\mathrm{argmin}}\,\,\mathcal{L}(x,z_{t},\theta_{t};y_{t}), (11a)
zt+1\displaystyle z_{t+1} =argmin𝑧(xt+1,z,θt+1;yt),\displaystyle=\underset{z}{\mathrm{argmin}}\,\,\mathcal{L}(x_{t+1},z,\theta_{t+1};y_{t}), (11b)
θt+1\displaystyle\theta_{t+1} =argmin𝜃(xt+1,zt+1,θ;yt),\displaystyle=\underset{\theta}{\mathrm{argmin}}\,\,\mathcal{L}(x_{t+1},z_{t+1},\theta;y_{t}), (11c)
yt+1\displaystyle y_{t+1} =yt+1μ1(Axt+1Dzt+1)\displaystyle=y_{t}+\tfrac{1}{\mu_{1}}(Ax_{t+1}-Dz_{t+1}) (11d)
λt+1\displaystyle\lambda_{t+1} =λt+1μ2(Cxt+1θt+1)\displaystyle=\lambda_{t}+\tfrac{1}{\mu_{2}}(Cx_{t+1}-\theta_{t+1}) (11e)

Note that (11) is a 3-block ADMM which, in general, is not guaranteed to converge for arbitrary values of μ1,μ2>0\mu_{1},\mu_{2}>0 [19]. We separate the dual variables into yy and λ\lambda for two reasons: (i) Since dual variables accumulate the consensus error within the network as seen in (11d) and (11e), and from that the fact yy corresponds to 2n2n constraints while λ\lambda only involves a single constraint, it is beneficial to separate the two and design different stepsizes 1μ1\tfrac{1}{\mu_{1}} and 1μ2\tfrac{1}{\mu_{2}}, which allows for extra freedom for hyperparameters and results in better performance in practice. (ii) In section III, we further show that with appropriate initialization, the storage requirement for yy can be reduced by half.
In each iteration of ADMM, a sub-optimization problem has to be solved in (11a) to obtain xt+1x_{t+1}. This may be computationally expensive and place heavy burden on the system. We therefore propose to perform inexact optimization with the aid of an approximated Hessian using BFGS, aiming to reduce computational costs while maintaining fast convergence speed.

III Algorithmic description and implementation

We build a local model of the Augmented Lagrangian at step tt with respect to the primal variable xx as:

^t(x)=t+xt(xxt)+12xxtHt2,\displaystyle\widehat{\mathcal{L}}_{t}(x)=\mathcal{L}_{t}+\nabla_{x}\mathcal{L}_{t}^{\top}(x-x_{t})+\tfrac{1}{2}\norm{x-x_{t}}^{2}_{H_{t}}, (12)

where t:=(xt,zt,θt;yt,λt)\mathcal{L}_{t}:=\mathcal{L}(x_{t},z_{t},\theta_{t};y_{t},\lambda_{t}). Various algorithms can be designed by choosing different HtH_{t}. In this paper, we opt to use regularized BFGS approximation of the Augmented Lagrangian Hessian as follows:

Ht=Bt+1μ1Δ+1μ2CC+ϵImd,\displaystyle H_{t}=B_{t}+\tfrac{1}{\mu_{1}}\Delta+\tfrac{1}{\mu_{2}}C^{\top}C+\epsilon I_{md}, (13)

where ϵ>0\epsilon>0 and Δ=AA\Delta=A^{\top}A is the Kronecker product of the graph degree matrix Δ^\hat{\Delta} and IdI_{d}. Note that by setting Bt=2F(xt)B_{t}=\nabla^{2}F(x_{t}), we recover the Hessian of the Augmented Lagrangian with respect to xx, i.e., Ht=xx2H_{t}=\nabla_{xx}^{2}\mathcal{L} in such case. The quadratic form of (12) admits a closed form solution when minimizing over xx, i.e.,

xt+1=xtHt1xt.\displaystyle x_{t+1}=x_{t}-H_{t}^{-1}\nabla_{x}\mathcal{L}_{t}. (14)

Step (11b) reduces to solving the following linear system of equations:

Dyt+1μ1D(Axt+1Dzt+1)=0.\displaystyle D^{\top}y_{t}+\tfrac{1}{\mu_{1}}D^{\top}(Ax_{t+1}-Dz_{t+1})=0. (15)

Moreover, by completion of squares, step (11c) is equivalent to evaluating the following proximal mapping associated with g()g(\cdot) with parameter μ2\mu_{2}:

θt+1=proxμ2g(Cxt+1+μ2λt).\displaystyle\theta_{t+1}=\textbf{prox}_{\mu_{2}g}(Cx_{t+1}+\mu_{2}\lambda_{t}). (16)

Dual variables are updated in verbatim as in (11d) and (11e). We proceed to present Lemma 1 which states that with an appropriate initialization, the intermediate variable {zij}\{z^{ij}\} evolves on the manifold defined by the column space of EuE_{u}. Matrices without the hat denotes the Kronecker product of the corresponding matrix with the identity matrix, e.g., Eu=E^uIdE_{u}=\hat{E}_{u}\otimes I_{d} where E^u\hat{E}_{u} is defined in (4).

Lemma 1. Recall the signed and unsigned incidence and Laplacian matrices in (4) and (5), respectively. For zero initialization of (xt,yt)(x_{t},y_{t}), we can decompose yt=[αtαt]y_{t}^{\top}=\begin{bmatrix}\alpha_{t}^{\top}&-\alpha_{t}^{\top}\end{bmatrix} for all t0t\geq 0. Moreover, the update rules (14)-(16), (11d), and (11e) can be simplified as:

xt+1\displaystyle x_{t+1} =xtHt1{F(xt)+Esαt+Cλt+12μ1Lsxt\displaystyle=x_{t}-H_{t}^{-1}\big{\{}\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+C^{\top}\lambda_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}
+1μ2C(xtlθt)},\displaystyle+\tfrac{1}{\mu_{2}}C^{\top}(x_{t}^{l}-\theta_{t})\big{\}}, (17a)
θt+1\displaystyle\theta_{t+1} =proxμ2g(xt+1l+μ2λt),\displaystyle=\textbf{prox}_{\mu_{2}g}(x_{t+1}^{l}+\mu_{2}\lambda_{t}), (17b)
αt+1\displaystyle\alpha_{t+1} =αt+12μ1Esxt+1,\displaystyle=\alpha_{t}+\tfrac{1}{2\mu_{1}}E_{s}x_{t+1}, (17c)
λt+1\displaystyle\lambda_{t+1} =λt+1μ2(xt+1lθt+1).\displaystyle=\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t+1}). (17d)

Proof : See Appendix A.
Once θt+1\theta_{t+1} is obtained, updates (17c) and (17d) can be performed in parallel and we have effectively transformed a 3-block ADMM in (11) to a 2-block ADMM, which converges under more general settings [20].
Remark 1: The reason for introducing the consensus variable zz in (2) is to decouple xix^{i}’s so that the Hessian of the Augmented Lagrangian has a block diagonal structure, cf.(13), which is instrumental for computing the quasi-Newton step without additional communication within the network once the gradient of the Augmented Lagrangian is obtained. Note that there is no need to invert any matrix or solve a linear system, and the inverse in (17), is only notational. Instead, we directly approximate each block (Bt+11)ii(B_{t+1}^{-1})^{ii}, corresponding to the agent ii and form (Ht+11)ii(H_{t+1}^{-1})^{ii} as explained in the following. Since Bt+11B_{t+1}^{-1} is block diagonal, we approximate (Bt+11)ii(B_{t+1}^{-1})^{ii} using (10) with dt=fi(xt+1i)fi(xti)d_{t}=\nabla f^{i}(x^{i}_{t+1})-\nabla f^{i}(x^{i}_{t}) and st=xt+1ixtis_{t}=x_{t+1}^{i}-x_{t}^{i}. Moreover, Ht+1Bt+1H_{t+1}-B_{t+1} is diagonal and constant (time invariant) from (13), which allows for eigendecomposition with components only depending on local information. Specifically, using the Sherman-Morrison formula and defining C1i=(Bt+11)iiC^{i}_{1}=(B_{t+1}^{-1})^{ii}, we obtain an iterative formula for (Ht+11)ii=Cd+1i(H_{t+1}^{-1})^{ii}=C^{i}_{d+1} as follows:

Ck+1i=CkiCkivki(vki)Cki1+(vk)Ckivk,\displaystyle C^{i}_{k+1}=C^{i}_{k}-\tfrac{C^{i}_{k}v_{k}^{i}(v^{i}_{k})^{\top}C^{i}_{k}}{1+(v^{k})^{\top}C^{i}_{k}v^{k}}, (18)

with constant vector vkiv^{i}_{k} specified as:

vki=(|𝒩i|μ1+𝟏ilμ2+ϵ)ek,\displaystyle v^{i}_{k}=\left(\tfrac{\absolutevalue{\mathcal{N}_{i}}}{\mu_{1}}+\tfrac{\mathbf{1}_{il}}{\mu_{2}}+\epsilon\right)e_{k}, (19)

where 𝟏il=1\mathbf{1}_{il}=1 if i=li=l (i.e., if it is the ll-th agent who performs the proximal mapping associated with the nonsmooth regularizer g()g(\cdot)) and is zero otherwise, and ekde_{k}\in\mathbb{R}^{d} is a constant vector with all entries equal to 0 except for the kk-th one. If consensus is enforced directly as xi=xjx^{i}=x^{j}, the resulting Hessian of the Augmented Lagrangian would involve the graph Laplacian matrix as in [11] which couples xix^{i} with its neighbors. In other words, computing quasi-Newton steps would induce multiple inner communication rounds within the network at each iteration, to approximate the descent direction to a desired accuracy.
We now present a first order variant of BFGS-ADMM, which only uses first order information, so as to demonstrate the merits of using the BFGS approximation, both theoretically and experimentally. Note that the only part of HtH_{t} in (13) that depends on the iterate counter is BtB_{t}, which is an approximation of 2F(xt)\nabla^{2}F(x_{t}). Therefore, a first order approximation can be derived by setting Bt=0B_{t}=0 in which case HtH_{t} in (13) is a constant diagonal matrix. In other words, agent ii selects a step size equal to:

si=(|𝒩i|μ1+𝟏ilμ2+ϵ)1.\displaystyle s_{i}=\left(\tfrac{\absolutevalue{\mathcal{N}_{i}}}{\mu_{1}}+\tfrac{\mathbf{1}_{il}}{\mu_{2}}+\epsilon\right)^{-1}. (20)

Therefore the first order variant updates the primal vector xt+1x_{t+1} as:

xt+1=xtdiag[si]{F(xt)+Esαt+Cλt+12μ1Lsxt\displaystyle x_{t+1}=x_{t}-\textbf{diag}[s_{i}]\big{\{}\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+C^{\top}\lambda_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}
+1μ2C(xtlθt)},\displaystyle+\tfrac{1}{\mu_{2}}C^{\top}(x_{t}^{l}-\theta_{t})\big{\}}, (21)
Algorithm 1 BFGS-ADMM update

Zero initialization for all variables. Hyperparameters μ1,μ2\mu_{1},\mu_{2}, and ϵ\epsilon. Initialization for B01=aImdB_{0}^{-1}=aI_{md} for some a>0a>0.

1:for t=0,1,2,t=0,1,2,\ldots do
2:     for i[m]i\in[m] do
3:         Compute hti=fi(xti)+12μ1j𝒩i(xtixtj)+ϕtih_{t}^{i}=\nabla f^{i}(x_{t}^{i})+\tfrac{1}{2\mu_{1}}\sum_{j\in\mathcal{N}^{i}}(x_{t}^{i}-x_{t}^{j})+\phi_{t}^{i}.
4:         if i=li=l then
5:              Update htihti+1μ2(xtiθt)+λth_{t}^{i}\leftarrow h_{t}^{i}+\tfrac{1}{\mu_{2}}(x_{t}^{i}-\theta_{t})+\lambda_{t}.
6:         end if
7:         Compute uti=(Ht1)iihtiu^{i}_{t}=(H_{t}^{-1})^{ii}h_{t}^{i}.
8:         Primal update: xt+1i=xtiutix_{t+1}^{i}=x_{t}^{i}-u_{t}^{i}.
9:         Dual update: ϕt+1i=ϕti+12μ1j𝒩i(xt+1ixt+1j)\phi_{t+1}^{i}=\phi_{t}^{i}+\tfrac{1}{2\mu_{1}}\sum_{j\in\mathcal{N}^{i}}(x_{t+1}^{i}-x_{t+1}^{j}).
10:         if i=li=l then
11:              Compute θt+1=proxμ2g(xt+1i+μ2λt)\theta_{t+1}=\textbf{prox}_{\mu_{2}g}(x^{i}_{t+1}+\mu_{2}\lambda_{t}).
12:              Compute λt+1=λt+1μ2(xt+1iθt+1)\lambda_{t+1}=\lambda_{t}+\tfrac{1}{\mu_{2}}(x^{i}_{t+1}-\theta_{t+1}).
13:         end if
14:         Set st=utis_{t}=-u_{t}^{i} and dt=fi(xt+1i)f(xti)d_{t}=\nabla f^{i}(x^{i}_{t+1})-\nabla f(x^{i}_{t}).
15:         Update (Bt+11)ii(B_{t+1}^{-1})^{ii} using (10).
16:         Update (Ht+11)ii(H_{t+1}^{-1})^{ii} using (18).
17:     end for
18:end for

where diag[si]md\textbf{diag}[s_{i}]\in\mathbb{R}^{md} is the diagonal matrix formed by siImds_{i}I_{md}, and the dual variables (θt+1,αt+1,λt+1)(\theta_{t+1},\alpha_{t+1},\lambda_{t+1}) are updated in the exact same way as in the BFGS-ADMM using (17b)-(17c). Note that our first order variant encapsulates existing first order methods EXTRA [21] as a special case by setting si=1ϵs_{i}=\tfrac{1}{\epsilon}. Indeed, when g()=0g(\cdot)=0, the proximal mapping and the associated dual variables in (17) are not needed and we obtain: xt+1=xt1ϵ[F(xt)+Esαt+12μ1Lsxt].x_{t+1}=x_{t}-\tfrac{1}{\epsilon}\left[\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}\right]. Taking the difference between the xt+1x_{t+1} update and the xtx_{t} update defined this way, and substituting the dual update for αt\alpha_{t} using (17c), we obtain:

xt+1=2(I12μ1Ls)xt(I12μ1Ls)xt1\displaystyle x_{t+1}=2(I-\tfrac{1}{2\mu_{1}}L_{s})x_{t}-(I-\tfrac{1}{2\mu_{1}}L_{s})x_{t-1}
1ϵ(F(xt)F(xt1)).\displaystyle-\tfrac{1}{\epsilon}(\nabla F(x_{t})-\nabla F(x_{t-1})). (22)

With appropriate choices of μ1,μ2,ϵ\mu_{1},\mu_{2},\epsilon, the update defined in (22) is equivalent to EXTRA. We present the comparison between BFGS-ADMM and its first order variant in terms of approximating the exact ADMM in the next section.
We proceed to present the distributed implementation of BFGS-ADMM in Algorithm 1. Note that in the primal update (17a), dual variables are invoked in the form of EsαtE_{s}^{\top}\alpha_{t}. By defining ϕt=Esαtmd\phi_{t}=E_{s}^{\top}\alpha_{t}\in\mathbb{R}^{md} and pre-multiplying both sides of (17c) with EsE_{s}^{\top}, we obtain an equivalent algorithm that allows for efficient distributed implementation. We let each agent hold the corresponding pair (xi,ϕi)(x^{i},\phi^{i}) while the ll-th agent additionally holds (θ,λ)(\theta,\lambda). At each iteration, each agent ii begins with calculating htih_{t}^{i} using xtjx_{t}^{j} from its neighbors as in step 3, where the ll-th agent performs an additional update in step 5. Note that step 7 does not require solving a linear system since (Ht1)ii(H_{t}^{-1})^{ii} is directly approximated using (18). Primal and dual updates are performed at each agent as in step 8 and step 9, respectively. The ll-th agent evaluates an additional proximal mapping in step 11 to update θt+1\theta_{t+1}, and updates λt+1\lambda_{t+1} in step 12. Finally, each agent ii updates its (Bt+11)ii(B_{t+1}^{-1})^{ii} and (Ht+11)ii(H_{t+1}^{-1})^{ii} using (10) and (18), respectively.

IV CONVERGENCE ANALYSIS

Linear convergence rate for ADMM is well established [15]. Since our goal here is to reduce the computational cost of ADMM, the best one can hope for is to reduce the gap between the approximation and the standard ADMM, while maintaining the linear convergence rate. In Lemma 4, we demonstrate the advantage of using BFGS approximation over first order methods by showing that the aforementioned gap decreases faster for BFGS-ADMM. We first state an additional assumption on the nonsmooth regularizer g()g(\cdot) and the KKT conditions associated with (3) expressed in the primal optimal pair (x,θ)(x_{\star},\theta_{\star}) and the dual optimal pair (α,λ)(\alpha_{\star},\lambda_{\star}). Note that we have eliminated zz_{\star}, as it is shown in Lemma 1 that it is a function of xx_{\star}.

Assumption 2: The function g():dg(\cdot):\mathbb{R}^{d}\to\mathbb{R} is proper, closed, and convex. Equivalently, x,yd\forall\,x,y\in\mathbb{R}^{d}, the following inequality holds:

(xy)(g(x)g(y))0,\displaystyle(x-y)^{\top}(\partial g(x)-\partial g(y))\geq 0, (23)

where g()\partial g(\cdot) denotes the subdifferential set. We proceed to state a non-restrictive assumption that can be easily achieved by regularization.

Assumption 3: There exists positive constant 0<ν<0<\nu<\infty such that the eigenvalues of BtB_{t} are upper bounded, i.e,

Bt<νI.\displaystyle B_{t}<\nu I.

Remark 2: The upper bound described above can be achieved by setting Bt1=B^t1+1νIB_{t}^{-1}=\hat{B}_{t}^{-1}+\tfrac{1}{\nu}I, where B^t\hat{B}_{t} comes from the BFGS update formula. Since B^t1>0\hat{B}_{t}^{-1}>0 with positive definite initialization, we have Bt1>1νIB_{t}^{-1}>\tfrac{1}{\nu}I, i.e., Bt<νIB_{t}<\nu I.
KKT conditions:

F(x)+Esα+Cλ\displaystyle\nabla F(x_{\star})+E_{s}^{\top}\alpha_{\star}+C^{\top}\lambda_{\star} =0\displaystyle=0 (KKTa)
g(θ)λ\displaystyle\partial g(\theta_{\star})-\lambda_{\star} 0\displaystyle\ni 0 (KKTb)
Esx\displaystyle E_{s}x_{\star} =0\displaystyle=0 (KKTc)
xlθ\displaystyle x_{\star}^{l}-\theta_{\star} =0\displaystyle=0 (KKTd)

We proceed to state a technical lemma that describes an inclusion relationship between the subdifferential set of g(θt+1)\partial g(\theta_{t+1}) and the dual variable λt+1\lambda_{t+1}.

Lemma 2. Consider the update θt+1=proxμ2g(xt+1l+μ2λt)\theta_{t+1}=\textbf{prox}_{\mu_{2}g}(x_{t+1}^{l}+\mu_{2}\lambda_{t}) in (17b). It holds that:

λt+1g(θt+1).\displaystyle\lambda_{t+1}\in\partial g(\theta_{t+1}). (25)

Proof : See Appendix B.

Recall the first order variant of the BFGS-ADMM where the primal variable is updated as in (21) while BFGS-ADMM updates xt+1x_{t+1} as in (17a), and both update dual variables as in (17b)-(17c). We proceed to present results which capture their similarities and differences in the following.

Lemma 3. If Assumptions 1-2 hold, then the iterates generated by BFGS-ADMM and its first order variant both satisfy the following equation:

et+F(xt+1)F(x)+12μ1(Lu+ϵI)(xt+1xt)\displaystyle e_{t}+\nabla F(x_{t+1})-\nabla F(x_{\star})+\tfrac{1}{2\mu_{1}}(L_{u}+\epsilon I)(x_{t+1}-x_{t})
+C{λt+1λ+1μ2(θt+1θt)}+Es(αt+1α)=0,\displaystyle+C^{\top}\big{\{}\lambda_{t+1}-\lambda_{\star}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})\big{\}}+E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})=0, (26)

where for BFGS-ADMM,

et=F(xt)+Bt(xt+1xt)F(xt+1).\displaystyle e_{t}=\nabla F(x_{t})+B_{t}(x_{t+1}-x_{t})-\nabla F(x_{t+1}). (27)

for its first order variant,

et=F(xt)F(xt+1).\displaystyle e_{t}=\nabla F(x_{t})-\nabla F(x_{t+1}). (28)

Proof : See Appendix C.
By comparing (27) and (28), we see that the difference between using BFGS vs. first order approximation lies in how F(xt+1)\nabla F(x_{t+1}) is approximated at step tt. Moreover, if the sub-optimization problem is solved exactly as in (11a), by optimality condition, equation (26) holds with et=0e_{t}=0.
Lemma 4. Consider the bound for 2F(x)\nabla^{2}F(x) in (7) and the ete_{t} introduced in Lemma 3. If Assumption 3 holds, an upper bound can be established as follows.
For BFGS-ADMM,

et\displaystyle\norm{e_{t}} Bt+1Btxt+1xt\displaystyle\leq\norm{B_{t+1}-B_{t}}\norm{x_{t+1}-x_{t}} (29)

For its first order variant as in (21),

etMfxt+1xt.\displaystyle\norm{e_{t}}\leq M_{f}\norm{x_{t+1}-x_{t}}. (30)

Proof : See Appendix D.
Remark 3: Lemma 4 is a standalone result whose proof does not require the convergence of the algorithm. In Theorem 1, we establish that BFGS-ADMM converges linearly which means limtBt+1Bt=0\lim_{t\to\infty}\norm{B_{t+1}-B_{t}}=0, since both dt,std_{t},s_{t} approach 0 as tt\to\infty in (10). Therefore, et=o(xt+1xt)\norm{e_{t}}=o(\norm{x_{t+1}-x_{t}}) for the BFGS-ADMM while et=O(xt+1xt)\norm{e_{t}}=O(\norm{x_{t+1}-x_{t}}) for its first order variant. The variable ete_{t} captures the approximation error vs. exact solution of the primal problem (et=0e_{t}=0 for this case), whence Lemma 4 unravels that the proposed has superior tracking of the exact ADMM, which is the key attribute to yield superior linear convergence rate over its first-order variant.
Lemma 5. Consider the update in (17c) and (17d). With zero initialization, the stacked vector [αtλt](n+1)d[\alpha^{\top}_{t}\,\,\lambda_{t}^{\top}]^{\top}\in\mathbb{R}^{(n+1)d} lies in the column space of X=[EsC](n+1)d×mdX^{\top}=[E_{s}^{\top}\,\,C^{\top}]^{\top}\in\mathbb{R}^{(n+1)d\times md}. Moreover, there exists a unique optimal [αλ][\alpha_{\star}^{\top}\,\,\lambda_{\star}^{\top}]^{\top} in the column space of XX^{\top} and, denoting σmin+\sigma^{+}_{\mathrm{min}} as the smallest positive eigenvalue of XXX^{\top}X, the following inequality holds:

σmin+(αt+1α2+λt+1λ2)\displaystyle\sigma^{+}_{\mathrm{min}}\left(\norm{\alpha_{t+1}-\alpha_{\star}}^{2}+\norm{\lambda_{t+1}-\lambda_{\star}}^{2}\right)\leq
Es(αt+1α)+C(λt+1λ)2.\displaystyle\norm{E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})}^{2}. (31)

Proof : See Appendix E.
We introduce the following scaling matrix defined in terms of the hyperparameters ϵ,μ1,μ2\epsilon,\mu_{1},\mu_{2} and the graph topology (as captured by LuL_{u}), =[12μ1Lu+ϵI00001μ200002μ10000μ2].\mathcal{H}=\begin{bmatrix}\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I&0&0&0\\ 0&\tfrac{1}{\mu_{2}}&0&0\\ 0&0&2\mu_{1}&0\\ 0&0&0&\mu_{2}\end{bmatrix}. Theorem 1. Consider the update in (17). Define u=[x,θ,α,λ]u^{\top}=[x^{\top},\theta^{\top},\alpha^{\top},\lambda^{\top}] and uu_{\star} the corresponding optimum. Denote the smallest positive eigenvalue of [EsC][EsC]\begin{bmatrix}E_{s}\\ C\end{bmatrix}\begin{bmatrix}E_{s}^{\top}C^{\top}\end{bmatrix} as σmin+\sigma^{+}_{\mathrm{min}}, the smallest and largest eigenvalue of LuL_{u} as σminG\sigma^{G}_{\mathrm{min}} and σmaxG\sigma^{G}_{\mathrm{max}}, respectively. Denoting κ=Mfmf\kappa=\tfrac{M_{f}}{m_{f}}, for arbitrary constants β,γ,ϕ,ρ>1\beta,\gamma,\phi,\rho>1, and ζ(mf+Mf2mfMf,ϵ4Mf2)\zeta\in(\tfrac{m_{f}+M_{f}}{2m_{f}M_{f}},\tfrac{\epsilon}{4M_{f}^{2}}), ϵ>2(mf+Mf)κ\epsilon>2(m_{f}+M_{f})\kappa. the iterates converge linearly as follows:

ut+1u211+δutu2,\displaystyle\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}\leq\tfrac{1}{1+\delta}\norm{u_{t}-u_{\star}}^{2}_{\mathcal{H}}, (32)

where

δ=min{(2mfMfmf+Mf1ζ)(2μ1μ2μ2(σmaxG+2μ1ϵ)+ρμ1),ρ1ρ,\displaystyle\delta=\min\bigg{\{}\left(\tfrac{2m_{f}M_{f}}{m_{f}+M_{f}}-\tfrac{1}{\zeta}\right)\left(\tfrac{2\mu_{1}\mu_{2}}{\mu_{2}(\sigma_{\max}^{G}+2\mu_{1}\epsilon)+\rho\mu_{1}}\right),\tfrac{\rho-1}{\rho},
(β1)σmin+μβ(σminG2μ1+ϵζτ2σmaxG2μ1+ϵ+γτ2(β1)σmin+(γ1)),\displaystyle\tfrac{(\beta-1)\sigma^{+}_{\mathrm{min}}}{\mu\beta}\left(\tfrac{\tfrac{\sigma_{\min}^{G}}{2\mu_{1}}+\epsilon-\zeta\tau^{2}}{\tfrac{\sigma_{\max}^{G}}{2\mu_{1}}+\epsilon+\tfrac{\gamma\tau^{2}(\beta-1)}{\sigma^{+}_{\mathrm{min}}(\gamma-1)}}\right),
2σmin+(mf+Mf)μβψ(1γ),σmin+μ2(ψ1)μβψ(1γ)}.\displaystyle\tfrac{2\sigma^{+}_{\mathrm{min}}}{(m_{f}+M_{f})\mu\beta\psi}\left(\tfrac{1}{\gamma}\right),\tfrac{\sigma^{+}_{\mathrm{min}}\mu_{2}(\psi-1)}{\mu\beta\psi}\left(\tfrac{1}{\gamma}\right)\bigg{\}}. (33)

Proof: Since F(x)F(x) is strongly convex with parameter mfm_{f} and the gradient F(x)\nabla F(x) is Lipschitz continuous with parameter MfM_{f},

mfMfmf+Mfxt+1x2+1mf+MfF(xt+1)F(x)2\displaystyle\tfrac{m_{f}M_{f}}{m_{f}+M_{f}}\norm{x_{t+1}-x_{\star}}^{2}+\tfrac{1}{m_{f}+M_{f}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}\leq
(xt+1x)(F(xt+1)F(x)).\displaystyle(x_{t+1}-x_{\star})^{\top}(\nabla F(x_{t+1})-\nabla F(x_{\star})). (34)

Using Lemma 3, we rewrite the RHS (right hand side) of (34) as

RHS=(xt+1x)12μ1(Lu+ϵI)(xt+1xt)\displaystyle\textbf{RHS}=-(x_{t+1}-x_{\star})^{\top}\tfrac{1}{2\mu_{1}}(L_{u}+\epsilon I)(x_{t+1}-x_{t})
(xt+1lxl)[λt+1λ+1μ2(θt+1θt)]\displaystyle-(x_{t+1}^{l}-x_{\star}^{l})^{\top}\left[\lambda_{t+1}-\lambda_{\star}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})\right]
(xt+1x)Es(αt+1α)(xt+1x)et,\displaystyle-(x_{t+1}-x_{\star})^{\top}E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})-(x_{t+1}-x_{\star})^{\top}e_{t}, (35)

where we have used the fact that (xt+1x)C=(xt+1lxl)(x_{t+1}-x_{\star})^{\top}C^{\top}=(x_{t+1}^{l}-x_{\star}^{l}). From the dual update (17c) and (KKTc), it holds that

(xt+1x)Es=2μ1(αt+1αt).\displaystyle(x_{t+1}-x_{\star})^{\top}E_{s}^{\top}=2\mu_{1}(\alpha_{t+1}-\alpha_{t})^{\top}. (36)

Moreover, using (17d) and (KKTd), we obtain

(xt+1lxl)=μ2(λt+1λt)+(θt+1θ).\displaystyle(x_{t+1}^{l}-x_{\star}^{l})^{\top}=\mu_{2}(\lambda_{t+1}-\lambda_{t})^{\top}+(\theta_{t+1}-\theta_{\star})^{\top}. (37)

After substituting (36) and (37) into (35), we obtain

RHS=(xt+1x)12μ1(Lu+ϵI)(xt+1xt)\displaystyle\textbf{RHS}=-(x_{t+1}-x_{\star})^{\top}\tfrac{1}{2\mu_{1}}(L_{u}+\epsilon I)(x_{t+1}-x_{t})
μ2(λt+1λt)(λt+1λ)(λt+1λt)(θt+1θt)\displaystyle-\mu_{2}(\lambda_{t+1}-\lambda_{t})^{\top}(\lambda_{t+1}-\lambda_{\star})-(\lambda_{t+1}-\lambda_{t})^{\top}(\theta_{t+1}-\theta_{t})
(θt+1θ)(λt+1λ)1μ2(θt+1θ)(θt+1θt)\displaystyle-(\theta_{t+1}-\theta_{\star})^{\top}(\lambda_{t+1}-\lambda_{\star})-\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{\star})^{\top}(\theta_{t+1}-\theta_{t})
2μ1(αt+1αt)(αt+1α)(xt+1x)et.\displaystyle-2\mu_{1}(\alpha_{t+1}-\alpha_{t})^{\top}(\alpha_{t+1}-\alpha_{\star})-(x_{t+1}-x_{\star})^{\top}e_{t}. (38)

Using Lemma 2, we have (λt+1λt)(θt+1θt)(g(θt+1)g(θt))(θt+1θt)0(\lambda_{t+1}-\lambda_{t})^{\top}(\theta_{t+1}-\theta_{t})\in(\partial g(\theta_{t+1})-\partial g(\theta_{t}))^{\top}(\theta_{t+1}-\theta_{t})\geq 0, where the inequality follows from Lemma 2 and Assumption 1. Similarly, (θt+1θ)(λt+1λ)0(\theta_{t+1}-\theta_{\star})^{\top}(\lambda_{t+1}-\lambda_{\star})\geq 0. Therefore, we obtain

RHS(xt+1x)12μ1(Lu+ϵI)(xt+1xt)\displaystyle\textbf{RHS}\leq-(x_{t+1}-x_{\star})^{\top}\tfrac{1}{2\mu_{1}}(L_{u}+\epsilon I)(x_{t+1}-x_{t})
μ2(λt+1λt)(λt+1λ)(θt+1θ)1μ2(θt+1θt)\displaystyle-\mu_{2}(\lambda_{t+1}-\lambda_{t})^{\top}(\lambda_{t+1}-\lambda_{\star})-(\theta_{t+1}-\theta_{\star})^{\top}\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})
2μ1(αt+1αt)(αt+1α)(xt+1x)et.\displaystyle-2\mu_{1}(\alpha_{t+1}-\alpha_{t})^{\top}(\alpha_{t+1}-\alpha_{\star})-(x_{t+1}-x_{\star})^{\top}e_{t}. (39)

Note that 2(ab)(ac)=bc2ab2ac2-2(a-b)^{\top}(a-c)=\norm{b-c}^{2}-\norm{a-b}^{2}-\norm{a-c}^{2} holds true for any (a,b,c)(a,b,c). Multiplying both sides of (39) by 22 and using the aforementioned identity for the inner product terms, and considering the concatenation u=[x,θ,α,λ]u^{\top}=[x^{\top},\theta^{\top},\alpha^{\top},\lambda^{\top}] we obtain:

2mMm+Mxt+1x2+2m+MF(xt+1)F(x)2\displaystyle\tfrac{2mM}{m+M}\norm{x_{t+1}-x_{\star}}^{2}+\tfrac{2}{m+M}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}
+xt+1xtL~u2+μ2λt+1λt2+1μ2θt+1θt2\displaystyle+\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{t}}^{2}+\tfrac{1}{\mu_{2}}\norm{\theta_{t+1}-\theta_{t}}^{2}
+2μ1αt+1αt2+2(xt+1x)et\displaystyle+2\mu_{1}\norm{\alpha_{t+1}-\alpha_{t}}^{2}+2(x_{t+1}-x_{\star})^{\top}e_{t}\leq
utu2ut+1u2\displaystyle\norm{u_{t}-u_{\star}}^{2}_{\mathcal{H}}-\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}} (40)

where L~u=12μ1(Lu+ϵI)\tilde{L}_{u}=\tfrac{1}{2\mu_{1}}(L_{u}+\epsilon I). To establish linear convergence as in (32), it suffices to show that for some δ>0\delta>0, the following holds: δut+1u2utu2ut+1u2\delta\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}\leq\norm{u_{t}-u_{\star}}^{2}_{\mathcal{H}}-\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}. Moreover, for any ζ>0\zeta>0, the inequality holds: 2(xt+1x)et1ζxt+1x2ζet22(x_{t+1}-x_{\star})^{\top}e_{t}\geq-\tfrac{1}{\zeta}\norm{x_{t+1}-x_{\star}}^{2}-\zeta\norm{e_{t}}^{2}. Therefore, it is sufficient to show

2mfMfmf+Mfxt+1x2+2mf+MfF(xt+1)F(x)2\displaystyle\tfrac{2m_{f}M_{f}}{m_{f}+M_{f}}\norm{x_{t+1}-x_{\star}}^{2}+\tfrac{2}{m_{f}+M_{f}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}
+xt+1xtL~u2+μ2λt+1λt2+1μ2θt+1θt2\displaystyle+\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{t}}^{2}+\tfrac{1}{\mu_{2}}\norm{\theta_{t+1}-\theta_{t}}^{2}
+2μ1αt+1αt21ζxt+1x2ζet2\displaystyle+2\mu_{1}\norm{\alpha_{t+1}-\alpha_{t}}^{2}-\tfrac{1}{\zeta}\norm{x_{t+1}-x_{\star}}^{2}-\zeta\norm{e_{t}}^{2}\geq
δut+1u2.\displaystyle\delta\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}. (41)

We proceed to establish this bound. From Lemma 3, it holds:

Es(αt+1α)+C(λt+1λ)=\displaystyle E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})=
{F(xt+1)F(x)+1μ1(Lu+ϵI)(xt+1xt)\displaystyle-\big{\{}\nabla F(x_{t+1})-\nabla F(x_{\star})+\tfrac{1}{\mu_{1}}(L_{u}+\epsilon I)(x_{t+1}-x_{t})
+1μ2C(θt+1θt)+et}.\displaystyle+\tfrac{1}{\mu_{2}}C^{\top}(\theta_{t+1}-\theta_{t})+e_{t}\big{\}}. (42)

Note that a2ββ1b2+βc2\norm{a}^{2}\leq\tfrac{\beta}{\beta-1}\norm{b}^{2}+\beta\norm{c}^{2} holds true for any β>1\beta>1. After applying this inequality three times with arbitrary constant β,γ,ψ>1\beta,\gamma,\psi>1 for (42), we obtain

Es(αt+1α)+C(λt+1λ)2\displaystyle\norm{E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})}^{2}\leq
ββ1xt+1xtL~u22+βγγ1et2+βγψ(μ2)2(ψ1)θt+1θt2\displaystyle\tfrac{\beta}{\beta-1}\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}^{2}}+\tfrac{\beta\gamma}{\gamma-1}\norm{e_{t}}^{2}+\tfrac{\beta\gamma\psi}{(\mu_{2})^{2}(\psi-1)}\norm{\theta_{t+1}-\theta_{t}}^{2}
+βγψF(xt+1)F(x)2.\displaystyle+\beta\gamma\psi\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}. (43)

Consider the error term in Lemma 4, et=Bt+1Bt=BtststBtstBtst+dtdtdtst.e_{t}=B_{t+1}-B_{t}=-\tfrac{B_{t}s_{t}s_{t}^{\top}B_{t}}{s_{t}^{\top}B_{t}s_{t}}+\tfrac{d_{t}d_{t}^{\top}}{d_{t}^{\top}s_{t}}. Therefore, Bt+1Btdt2dtst+Btst2stBtst\norm{B_{t+1}-B_{t}}\leq\tfrac{\norm{d_{t}}^{2}}{d_{t}^{\top}s_{t}}+\tfrac{\norm{B_{t}s_{t}}^{2}}{s_{t}^{\top}B_{t}s_{t}}. Since F(x)F(x) is strongly convex with mfm_{f} and the gradient is Lipschitz continuous with parameter MfM_{f}, the following inequality holds:

dtstmfst2,anddtst1Mfdt2.\displaystyle d_{t}^{\top}s_{t}\geq m_{f}\norm{s_{t}}^{2},\quad\text{and}\quad d_{t}^{\top}s_{t}\geq\tfrac{1}{M_{f}}\norm{d_{t}}^{2}. (44)

By setting qt=Bt12stq_{t}=B_{t}^{\tfrac{1}{2}}s_{t}, we obtain Btst2stBtst=qtBtqtqtqtλmax(Bt)<ν.\tfrac{\norm{B_{t}s_{t}}^{2}}{s_{t}^{\top}B_{t}s_{t}}=\tfrac{q_{t}^{\top}B_{t}q_{t}}{q_{t}^{\top}q_{t}}\leq\lambda_{\mathrm{max}}(B_{t})<\nu. Therefore, etτxt+1xt,\norm{e_{t}}\leq\tau\norm{x_{t+1}-x_{t}}, where τ=min{M,dt2st2}+ν\tau=\min\left\{M,\tfrac{\norm{d_{t}}^{2}}{\norm{s_{t}}^{2}}\right\}+\nu. From Lemma 5, we have

σmin+(αt+1α2+λt+1λ2)\displaystyle\sigma^{+}_{\mathrm{min}}\left(\norm{\alpha_{t+1}-\alpha_{\star}}^{2}+\norm{\lambda_{t+1}-\lambda_{\star}}^{2}\right)\leq
Es(αt+1α)+C(λt+1λ)2.\displaystyle\norm{E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})}^{2}. (45)

Denoting μ=max{2μ1,μ2}\mu=\max\{2\mu_{1},\mu_{2}\} and combining (43)-(45), we have:

2μ1αt+1α2+μ2λt+1λ2\displaystyle 2\mu_{1}\norm{\alpha_{t+1}-\alpha_{\star}}^{2}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{\star}}^{2}\leq
μβσmin+(β1)xt+1xtL~u22+μβγτ2σmin+(γ1)xt+1xt2\displaystyle\tfrac{\mu\beta}{\sigma^{+}_{\mathrm{min}}(\beta-1)}\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}^{2}}+\tfrac{\mu\beta\gamma\tau^{2}}{\sigma^{+}_{\mathrm{min}}(\gamma-1)}\norm{x_{t+1}-x_{t}}^{2}
+μβγψ(μ2)2σmin+(ψ1)θt+1θt2\displaystyle+\tfrac{\mu\beta\gamma\psi}{(\mu_{2})^{2}\sigma^{+}_{\mathrm{min}}(\psi-1)}\norm{\theta_{t+1}-\theta_{t}}^{2}
+μβγψσmin+F(xt+1)F(x)2.\displaystyle+\tfrac{\mu\beta\gamma\psi}{\sigma^{+}_{\mathrm{min}}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}. (46)

Furthermore, from (17d) and (KKTd), it holds that θt+1θ=μ2(λt+1λt)+xt+1lxl\theta_{t+1}-\theta_{\star}=-\mu_{2}(\lambda_{t+1}-\lambda_{t})+x_{t+1}^{l}-x_{\star}^{l}. Using the same technique as in deriving (43), we obtain for arbitrary ρ>1\rho>1:

θt+1θρxt+1lxl2+(μ2)2ρρ1λt+1λt2.\displaystyle\norm{\theta_{t+1}-\theta_{\star}}\leq\rho\norm{x_{t+1}^{l}-x_{\star}^{l}}^{2}+\tfrac{(\mu_{2})^{2}\rho}{\rho-1}\norm{\lambda_{t+1}-\lambda_{t}}^{2}. (47)

Using (46) and (47), it suffices to show that the following inequality holds true for some δ>0\delta>0,

δut+1u22mfMfmf+Mfxt+1x2\displaystyle\delta\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}\leq\tfrac{2m_{f}M_{f}}{m_{f}+M_{f}}\norm{x_{t+1}-x_{\star}}^{2}
+2mf+MfF(xt+1)F(x)2\displaystyle+\tfrac{2}{m_{f}+M_{f}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}
+xt+1xtL~u2+μ2λt+1λt2+1μ2θt+1θt2\displaystyle+\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{t}}^{2}+\tfrac{1}{\mu_{2}}\norm{\theta_{t+1}-\theta_{t}}^{2}
+2μ1αt+1αt21ζxt+1x2ζτ2xt+1xt2,\displaystyle+2\mu_{1}\norm{\alpha_{t+1}-\alpha_{t}}^{2}-\tfrac{1}{\zeta}\norm{x_{t+1}-x_{\star}}^{2}-\zeta\tau^{2}\norm{x_{t+1}-x_{t}}^{2},

which is satisfied if δ\delta is chosen as in (33). \blacksquare

V EXPERIMENTS

Refer to caption
(a) m=20,d=3m=20,\,d=3.
Refer to caption
(b) m=40,d=3m=40,\,d=3.
Figure 1: skin_noskin dataset.
Refer to caption
(c) m=20,d=22m=20,\,d=22.
Refer to caption
(d) m=40,d=22m=40,\,d=22.
Figure 2: ijcnn1 dataset.

In this section we present some numerical experiments of the proposed method compared to distributed first order methods: P2D2 [22] and PG-EXTRA [7], for solving the following distributed logistic regression problem:

minimizexdF(x)={1mi=1mfi(x)+γx1},\displaystyle\underset{x\in\mathbb{R}^{d}}{\mathrm{minimize}}\,\,F(x)=\left\{\frac{1}{m}\sum_{i=1}^{m}f^{i}(x)+\gamma\norm{x}_{1}\right\},

where fi(x)=1mij=1mi[ln(1+ewjx)+(1yj)wjx],f^{i}(x)=\tfrac{1}{m_{i}}\sum_{j=1}^{m_{i}}\left[\ln(1+e^{w_{j}^{\top}x})+(1-y_{j})w_{j}^{\top}x\right], and mim_{i} is the number of data points held at each agent and (wj,yj)d×{0,1}(w_{j},y_{j})\in\mathbb{R}^{d}\times\{0,1\} are training samples of dimension dd and binary labels, respectively. We consider two real datasets from LIBSVM111https://www.csie.ntu.edu.tw/ cjlin/libsvm/: the skin_nonskin dataset and the ijcnn1 dataset. We take 5,000 data points from each dataset with dimension d=3d=3 and dimension d=22d=22, respectively. A random binomial graph with mm agents is drawn (each edge is drawn independently from a Bernoulli(pp) distribution; pp=0.2 was used in all cases). The mixing matrix for P2D2 is generated using the Metropolis rule while the mixing matrix of PG-EXTRA is generated using the Laplacian based constant edge weight matrix. In all cases, the 1\ell_{1}-norm weight γ=2×106\gamma=2\times 10^{-6}. We plot the average relative cost error defined as 1mi=1mF(xti)F(x)1mi=1mF(x0i)F(x)\tfrac{\tfrac{1}{m}\sum_{i=1}^{m}F(x_{t}^{i})-F(x_{\star})}{\tfrac{1}{m}\sum_{i=1}^{m}F(x_{0}^{i})-F(x_{\star})} for different network sizes. As it can be seen in Figure 1 and 2, the BFGS-ADMM method demonstrates significant speed up in both datasets compared to other methods. Note that our first order variant also shows advantages compared to other first order methods, due to the fact that in our first order approximation, step size selection also takes into account the number of neighbors each agent has as in (20).

References

  • [1] Y. Li, D. Stipanović, P. Voulgaris, and Z. Gu, “Decentralized model predictive control of urbandrainage systems,” WSEAS Transactions on Systems and Control, vol. 14, pp. 247–256, 2019.
  • [2] T. Huang, N. M. Freris, P. R. Kumar, and L. Xie, “A synchrophasor data-driven method for forced oscillation localization under resonance conditions,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 3927–3939, 2020.
  • [3] R. Bekkerman, M. Bilenko, and J. Langford, “Scaling up machine learning: Parallel and distributed approaches,” in Proceedings of the 17th ACM SIGKDD International Conference Tutorials, 2011.
  • [4] N. M. Freris, H. Kowshik, and P. R. Kumar, “Fundamentals of large sensor networks: Connectivity, capacity, clocks, and computation,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1828–1846, 2010.
  • [5] K. Kim and P. R. Kumar, “Cyber–physical systems: A perspective at the centennial,” Proceedings of the IEEE, vol. 100, pp. 1287–1308, 2012.
  • [6] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [7] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
  • [8] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, p. 127–239, 2014.
  • [9] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, “DLM: Decentralized linearized alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 63, no. 15, pp. 4051–4064, 2015.
  • [10] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
  • [11] Y. Li, N. M. Freris, P. Voulgaris, and D. Stipanović, “D-SOP: Distributed second order proximal method for convex composite optimization,” in Proceedings of the 2020 American Control Conference, 2020, pp. 2844–2849.
  • [12] J. D. Lee, Y. Sun, and M. A. Saunders, “Proximal Newton-type methods for minimizing composite functions,” SIAM Journal on Optimization, vol. 24, no. 3, pp. 1420–1443, 2014.
  • [13] J. E. Dennis, Jr. and J. J. Moré, “Quasi-newton methods, motivation and theory,” SIAM Review, vol. 19, no. 1, pp. 46–89, 1977.
  • [14] P. Latafat, N. M. Freris, and P. Patrinos, “A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 4050–4065, 2019.
  • [15] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
  • [16] W. C. Davidon, “Variable metric method for minimization,” SIAM Journal on Optimization, vol. 1, no. 1, pp. 1–17, 1991.
  • [17] D. Goldfarb, “A family of variable-metric methods derived by variational means,” Mathematics of Computation, pp. 23–26, 1970.
  • [18] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed.   Springer, 2006.
  • [19] C. Chen, B. He, Y. Ye, and X. Yuan, “The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent,” vol. 155, no. 1–2, 2016.
  • [20] T. Lin, S. Ma, and S. Zhang, “Global convergence of unmodified 3-block ADMM for a class of convex minimization problems,” Journal of Scientific Computing, vol. 76, no. 1, pp. 69–88, 2018.
  • [21] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [22] S. Alghunaim, K. Yuan, and A. H. Sayed, “A linearly convergent proximal gradient algorithm for decentralized optimization,” in Advances in Neural Information Processing Systems, vol. 32, 2019.

Proof of Lemma 1
From the construction of C=(cl)Idd×mdC=(c^{l})^{\top}\otimes I_{d}\in\mathbb{R}^{d\times md}, where clc^{l} selects the ll-th coordinate, we stress that the update (11e) only involves the subvector xt+1lx_{t+1}^{l} as it can be seen by explicitly writing (11e) as:

λt+1=λt+1μ2(xt+1lθt+1).\displaystyle\lambda_{t+1}=\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t+1}).

We decompose the dual variable as y=[αβ]y^{\top}=\begin{bmatrix}\alpha^{\top}&\beta^{\top}\end{bmatrix}, where α,βnd\alpha,\beta\in\mathbb{R}^{nd}. Recalling D=[IndInd]D^{\top}=\begin{bmatrix}I_{nd}&I_{nd}\end{bmatrix} and pre-multiplying the dual update (11d) with DD^{\top}, we obtain

Dyt+1=Dyt+1μ1D(Axt+1Dzt+1).\displaystyle D^{\top}y_{t+1}=D^{\top}y_{t}+\tfrac{1}{\mu_{1}}D^{\top}(Ax_{t+1}-Dz_{t+1}).

Using (15), we conclude that Dyt+1=0D^{\top}y_{t+1}=0 for t0t\geq 0, i.e., βt+1=αt+1\beta_{t+1}=-\alpha_{t+1}. With zero initialization, we can therefore express yty_{t} as yt=[αtαt]y_{t}=\begin{bmatrix}\alpha_{t}\\ -\alpha_{t}\end{bmatrix}, for all t0t\geq 0. This shows that it is redundant to maintain both α\alpha and β\beta since they sum to 0 for all t0t\geq 0. Therefore, we can rewrite (11d) as:

αt+1=αt+1μ1(Asxt+1zt+1),\displaystyle\alpha_{t+1}=\alpha_{t}+\tfrac{1}{\mu_{1}}(A_{s}x_{t+1}-z_{t+1}), (48)
αt+1=αt+1μ1(Adxt+1zt+1).\displaystyle-\alpha_{t+1}=-\alpha_{t}+\tfrac{1}{\mu_{1}}(A_{d}x_{t+1}-z_{t+1}). (49)

After taking the difference of the above two equations and dividing by 2, we obtain:

αt+1=αt+12μ1Esxt+1,\displaystyle\alpha_{t+1}=\alpha_{t}+\tfrac{1}{2\mu_{1}}E_{s}x_{t+1}, (50)

where we have used the fact that Es=AsAdE_{s}=A_{s}-A_{d}. Similarly, by summing (48) and (49), we obtain zt+1=12Euxt+1z_{t+1}=\tfrac{1}{2}E_{u}x_{t+1}. With zero initialization for xtx_{t}, it is redundant to keep ztz_{t} since we can compute it as:

zt=12Euxtt0.\displaystyle z_{t}=\tfrac{1}{2}E_{u}x_{t}\,\,\forall\,\,t\geq 0. (51)

Recalling the approximated Hessian in (13), we express the primal update (14) as

xt+1=xtHt1{F(xt)+Ayt+Cλt\displaystyle x_{t+1}=x_{t}-H_{t}^{-1}\big{\{}\nabla F(x_{t})+A^{\top}y_{t}+C^{\top}\lambda_{t}
+1μ1A(AxtDzt)+1μ2C(Cxtθt)}.\displaystyle+\tfrac{1}{\mu_{1}}A^{\top}(Ax_{t}-Dz_{t})+\tfrac{1}{\mu_{2}}C^{\top}(Cx_{t}-\theta_{t})\big{\}}. (52)

Since Es=AsAdE_{s}^{\top}=A_{s}^{\top}-A_{d}^{\top}, Eu=As+AdE_{u}^{\top}=A_{s}^{\top}+A_{d}^{\top}, and Lu=EuEuL_{u}=E_{u}^{\top}E_{u} as in (4) and (5), along with the decomposition of yty_{t} and the equality in (51), we have

Ayt=Esαt,\displaystyle A^{\top}y_{t}=E_{s}^{\top}\alpha_{t}, (53)
1μ1ADzt=12μ1Luxt.\displaystyle\tfrac{1}{\mu_{1}}A^{\top}Dz_{t}=\tfrac{1}{2\mu_{1}}L_{u}x_{t}. (54)

Using (6), we can rewrite

1μ1AAxt12μ1Luxt=12μ1(2ΔLu)xt=12μ1Lsxt.\displaystyle\tfrac{1}{\mu_{1}}A^{\top}Ax_{t}-\tfrac{1}{2\mu_{1}}L_{u}x_{t}=\tfrac{1}{2\mu_{1}}(2\Delta-L_{u})x_{t}=\tfrac{1}{2\mu_{1}}L_{s}x_{t}. (55)

Substituting (53)-(55) into (52), we obtain the primal updates as in (17a). \blacksquare
Proof of Lemma 2
The update (17b) is equivalent to

θt+1=argmin𝜃{g(θ)+12μ2xt+1l+μ2λtθ2}.\displaystyle\theta_{t+1}=\underset{\theta}{\mathrm{argmin}}\left\{g(\theta)+\tfrac{1}{2\mu_{2}}\norm{x_{t+1}^{l}+\mu_{2}\lambda_{t}-\theta}^{2}\right\}.

From the optimality condition, it holds that

0g(θt+1)1μ2(xt+1+μ2λtθt+1).\displaystyle 0\in g(\theta_{t+1})-\tfrac{1}{\mu_{2}}(x_{t+1}+\mu_{2}\lambda_{t}-\theta_{t+1}). (56)

Substituting the dual update λt+1=λt+1μ2(xt+1lθt+1)\lambda_{t+1}=\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t+1}) into (56), we obtain

0g(θt+1)λt+1.\displaystyle 0\in\partial g(\theta_{t+1})-\lambda_{t+1}.

After re-arranging, we obtain the desired. \blacksquare
Proof of Lemma 3
Note that the difference between BFGS-ADMM and it’s first order variant lies in HtH_{t} (13). For the first order variant, Ht=1μ1Δ+1μ2CC+ϵImd,H_{t}=\tfrac{1}{\mu_{1}}\Delta+\tfrac{1}{\mu_{2}}C^{\top}C+\epsilon I_{md}, i.e., BtB_{t} is identically zero. By rearranging (17a), we obtain the following equation:

F(xt)+Esαt+Cλt+12μ1Lsxt+1μ2C(Cxtθt)\displaystyle\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+C^{\top}\lambda_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}+\tfrac{1}{\mu_{2}}C^{\top}(Cx_{t}-\theta_{t})
+(Bt+1μ1Δ+1μ2CC+ϵImd)(xt+1xt)=0.\displaystyle+(B_{t}+\tfrac{1}{\mu_{1}}\Delta+\tfrac{1}{\mu_{2}}C^{\top}C+\epsilon I_{md})(x_{t+1}-x_{t})=0. (57)

Using the dual update (17c), we have

Esαt+12μ1Lsxt=Esαt+112μ1Ls(xt+1xt).\displaystyle E_{s}^{\top}\alpha_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}=E_{s}^{\top}\alpha_{t+1}-\tfrac{1}{2\mu_{1}}L_{s}(x_{t+1}-x_{t}). (58)

Since 2Δ=Ls+Lu2\Delta=L_{s}+L_{u}, it holds that

1μ1Δ(xt+1xt)12μ1Ls(xt+1xt)=12μ1Lu(xt+1xt).\displaystyle\tfrac{1}{\mu_{1}}\Delta(x_{t+1}-x_{t})-\tfrac{1}{2\mu_{1}}L_{s}(x_{t+1}-x_{t})=\tfrac{1}{2\mu_{1}}L_{u}(x_{t+1}-x_{t}). (59)

Substituting (58) and (59) into (57), we obtain

F(xt)+Bt(xt+1xt)+12μ1(Lu+ϵImd)(xt+1xt)\displaystyle\nabla F(x_{t})+B_{t}(x_{t+1}-x_{t})+\tfrac{1}{2\mu_{1}}(L_{u}+\epsilon I_{md})(x_{t+1}-x_{t})
+C{λt+1μ2(xt+1lθt)}+Esαt+1=0.\displaystyle+C^{\top}\big{\{}\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t})\big{\}}+E_{s}^{\top}\alpha_{t+1}=0. (60)

Using the dual update (17d), we have

λt+1μ2(xt+1lθt)=λt+1+1μ2(θt+1θt).\displaystyle\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t})=\lambda_{t+1}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t}). (61)

Substituting (61) into (60) and using the definition of et=F(xt)+Bt(xt+1xt)F(xt+1)e_{t}=\nabla F(x_{t})+B_{t}(x_{t+1}-x_{t})-\nabla F(x_{t+1}), we obtain

F(xt+1)+12μ1(Lu+ϵI)(xt+1xt)\displaystyle\nabla F(x_{t+1})+\tfrac{1}{2\mu_{1}}(L_{u}+\epsilon I)(x_{t+1}-x_{t})
+C{λt+1+1μ2(θt+1θt)}+Esαt+1+et=0.\displaystyle+C^{\top}\big{\{}\lambda_{t+1}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})\big{\}}+E_{s}^{\top}\alpha_{t+1}+e_{t}=0. (62)

After subtracting (KKTa) from (62), we obtain the desired where Bt=0B_{t}=0 for the first order variant. \blacksquare
Proof of Lemma 4
For BFGS-ADMM, since Bt+1B_{t+1} satisfies the secant equation: Bt+1(xt+1xt)=F(xt+1)F(xt)B_{t+1}(x_{t+1}-x_{t})=\nabla F(x_{t+1})-\nabla F(x_{t}), we have

et=(BtBt+1)(xt+1xt),\displaystyle e_{t}=(B_{t}-B_{t+1})(x_{t+1}-x_{t}),

and (29) follows by applying the Cauchy-Schwartz inequality. For the first order variant, the Lipschitz continuity of F(x)\nabla F(x) implies that x,ymd\forall\,x,y\in\mathbb{R}^{md}, F(x)F(y)Mfxy\norm{\nabla F(x)-\nabla F(y)}\leq M_{f}\norm{x-y}. The claim in (30) follows. \blacksquare
Proof of Lemma 5
Consider the update (17c) and (17d). It can be written compactly as: [αt+1λt+1]=[αtλt]+[12μ1001μ2][EsC]xt+1[0θt+1],\begin{bmatrix}\alpha_{t+1}\\ \lambda_{t+1}\end{bmatrix}=\begin{bmatrix}\alpha_{t}\\ \lambda_{t}\end{bmatrix}+\begin{bmatrix}\tfrac{1}{2\mu_{1}}&0\\ 0&\tfrac{1}{\mu_{2}}\end{bmatrix}\begin{bmatrix}E_{s}\\ C\end{bmatrix}x_{t+1}-\begin{bmatrix}\textbf{0}\\ \theta_{t+1}\end{bmatrix}, where 0 is a zero vector of dimension ndnd. Therefore, it suffices to prove the vector [0θt+1][\textbf{0}^{\top}\,\,\theta_{t+1}^{\top}]^{\top} lies in the column space of XX^{\top} . Note that since the graph is connected, Esr=0E_{s}r=0 if rmdr\in\mathbb{R}^{md} is in consensus, i.e., r1=r2==rmdr^{1}=r^{2}=\dots=r^{m}\in\mathbb{R}^{d}. By letting ri=θt+1r^{i}=\theta_{t+1} for all ii, we have: [EsC]r=[0θt+1]\begin{bmatrix}E_{s}\\ C\end{bmatrix}r=\begin{bmatrix}\textbf{0}\\ \theta_{t+1}\end{bmatrix}, which shows that [0θt+1][\textbf{0}^{\top}\,\,\theta_{t+1}^{\top}]^{\top} lies in the column space of XX^{\top}. To simplify the notation we denote [αt,λt]=wt[\alpha_{t}^{\top},\lambda_{t}^{\top}]^{\top}=w_{t}. To prove the existence of optimal ww_{\star} in the column space of XX^{\top}, consider the (KKTa) satisfied with any optimal ww,

F(x)+Xw=0.\displaystyle\nabla F(x_{\star})+Xw=0.

The projection of ww into the column space of XX^{\top}, denoted as ww_{\star}, also satisfies (KKTa) since their differences lie in the kernel of XX, i.e., X(ww)=0X(w-w_{\star})=0. The uniqueness of ww_{\star} can be established by contradiction. Let w1=Xr1w_{1}=X^{\top}r_{1} and w2=Xr2w_{2}=X^{\top}r_{2} be two optimal stacked dual vector that lie in the column space of XX^{\top}, r1r2r_{1}\neq r_{2}. Since F()F(\cdot) is strongly convex, from (KKTa), it holds that F(x)+XXr1=0,F(x)+XXr2=0.\nabla F(x_{\star})+XX^{\top}r_{1}=0,\\ \nabla F(x_{\star})+XX^{\top}r_{2}=0. After taking the difference, we have XX(r1r2)=0XX^{\top}(r_{1}-r_{2})=0. But XX=EsEs+CC=Ls+CC>0XX^{\top}=E_{s}^{\top}E_{s}+C^{\top}C=L_{s}+C^{\top}C>0. Therefore r1r2=0r_{1}-r_{2}=0, contradiction. Inequality (31) follows from the fact that wt+1ww_{t+1}-w_{\star} is orthogonal to the kernel of XX. \blacksquare