BFGS-ADMM for Large-Scale Distributed Optimization

Yichuan Li¹, Yonghai Gong², Nikolaos M. Freris², Petros Voulgaris³ and Dušan Stipanović¹

*This work was supported by the Ministry of Science and Technology of China under grant 2019YFB2102200, the Anhui Dept. of Science and Technology under grant 201903a05020049, and Tencent Holdings Ltd. under grant FR202003.

¹Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL 61820, USA. [email protected], [email protected].

²School of Computer Science, University of Science and Technology of China, Hefei, Anhui, 230027, China. [email protected], [email protected].

³Department of Mechanical Engineering, University of Nevada, Reno, NV 89557, USA. [email protected].

Abstract

We consider a class of distributed optimization problems where the objective function consists of a sum of strongly convex and smooth functions and a (possibly nonsmooth) convex regularizer. A multi-agent network is assumed, where each agent holds a private cost function and cooperates with its neighbors to compute the optimum of the aggregate objective. We propose a quasi-Newton Alternating Direction Method of Multipliers (ADMM) where the primal update is solved inexactly with approximated curvature information. By introducing an intermediate consensus variable, we achieve a block diagonal Hessian which eliminates the need for inner communication loops within the network when computing the update direction. We establish global linear convergence to the optimal primal-dual solution without the need for backtracking line search, under the assumption that component cost functions are strongly convex with Lipschitz continuous gradients. Numerical simulations on real datasets demonstrate the advantages of the proposed method over the state of the art.

I INTRODUCTION

Distributed optimization acts as the computational engine for a wide range of applications in modern technology, including distributed control [1], power grid management [2], distributed machine learning [3], and resource allocation in sensor networks [4]. We consider the classical problem of minimizing a sum of cost functions plus a convex regularizer, where the regularizer is typically chosen to promote sparsity in the solution. Specifically:

\displaystyle\underset{\hat{x}\in\mathbb{R}^{d}}{\mathrm{minimize}}\left\{\sum_{i=1}^{m}f^{i}(\hat{x})+g(\hat{x})\right\},

(1)

where $f^{i}:\mathbb{R}^{d}\to\mathbb{R}$ captures the $i$ -th agent’s objective of interest and $g:\mathbb{R}^{d}\to\mathbb{R}$ is a possibly nonsmooth regularizer, such as the weighted $\ell_{1}$ -norm. In a network of agents, each agent aims to minimize its local cost function $f^{i}(\cdot)$ while communicating with its neighbors to cooperatively find the solution of the global problem. Distributed solutions are often pursued to fully utilize the computational power of the agents and reduce the amount of message passing in the network. Efficient communication is especially desirable in scenarios where the number of participants $m$ and the dimension of the decision variable $d$ are large, such as in emerging applications of Cyber-Physical Systems (CPS) [5]. An archetypal framework of distributed optimization introduces local decision variables at each agent, which are updated locally using private data and exchanged with neighboring agents. A network-wide solution is achieved by means of asymptotic elimination of the consensus error, i.e., the difference between the values of neighboring agents converges to zero. Such a framework is termed distributed consensus optimization and offers significant flexibility for each agent to select appropriate updating schemes that suit its local hardware environment.

First order methods [6]-[9] constitute popular choices for solving (1) in a distributed fashion. The presence of the nonsmooth regularizer prohibits gradient descent from being implemented directly, as $g(\cdot)$ is not differentiable. Subgradient methods [6] invoke the notion of the subdifferential set to compute descent directions while letting agents exchange information over a time-varying topology. Proximal gradient [8] accommodates the nonsmoothness by splitting the objective function and evaluating the proximal mapping associated with the nondifferentiable part. Various acceleration schemes [7], [9] exist for first order methods; these store past gradient and iterate information so as to form a momentum correction term. However, as the size of the data grows, the condition number of problem (1) tends to become large, which causes first order methods to converge extremely slowly. A natural remedy is to consider second order methods [10]-[12]. By exploiting curvature information, second order methods compute descent directions that accelerate convergence relative to first order methods. Proximal Newton [12] can be considered the second order counterpart of proximal gradient, but the associated proximal mapping increases the computational burden drastically and renders distributed implementation infeasible, as global information is required to compute the Newton step. Moreover, second order methods require solving a linear system to compute the Newton step, which induces a computation cost of $\mathcal{O}(d^{3})$ . Quasi-Newton methods [13] circumvent this procedure by using a finite difference of gradients to approximate the Hessian. The reduced computation costs along with competitive performance have rendered quasi-Newton methods a desirable alternative to second order methods, especially for large-scale optimization problems.

Primal-dual methods [14]-[15] provide a different perspective on problem (1) within the framework of constrained optimization, where the consensus constraint is explicitly enforced. The Alternating Direction Method of Multipliers (ADMM) [15] falls into this category: the smooth and nonsmooth parts of the objective are considered separately and iterates are computed sequentially. However, at each iteration of the ADMM, a sub-optimization problem must be solved, which typically induces an expensive computational burden.

Contributions: (i) We propose BFGS-ADMM for convex composite optimization where the primal update (typically computationally expensive) is replaced with a quasi-Newton step that does not require solving a linear system of equations. Moreover, by introducing an intermediate consensus variable, we eliminate the need for inner loops within the network when computing the update direction. (ii) We establish a global linear convergence rate for the proposed algorithm without backtracking, under the assumption that private cost functions are strongly convex with Lipschitz continuous gradients. (iii) The advantage of BFGS-ADMM over first order methods is analytically established and experimentally demonstrated with numerical simulations on real datasets.

Notation: We denote column vectors $x\in\mathbb{R}^{d}$ with lower case letters and matrices $A\in\mathbb{R}^{n\times m}$ with capital letters. Superscripts are used to denote partitioned vector components and subscripts are used to denote iteration steps, e.g., $x^{i}_{t}$ denotes the value of subvector $x^{i}$ at step $t$ . Matrix entries are denoted as $[A]^{ij}$ . Unless specified otherwise, $\norm{x}$ and $\norm{A}$ denote the Euclidean norm of a vector and the corresponding induced norm of a matrix, respectively. We define the norm of a vector associated with a positive definite matrix $P\succ 0$ as $\norm{x}_{P}:=\sqrt{x^{\top}Px}$ , and the set $\{1,\dots,m\}$ is abbreviated as $[m]$ . The proximal mapping associated with the function $g(\cdot):\mathbb{R}^{d}\to\mathbb{R}$ is defined as $\textbf{prox}_{\mu g}(v)=\underset{\theta}{\mathrm{argmin}}\left\{g(\theta)+\tfrac{1}{2\mu}\norm{\theta-v}^{2}\right\}$ .
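As an illustration of the proximal mapping defined above, the following minimal sketch evaluates it for the weighted $\ell_{1}$ -norm, for which the argmin reduces to coordinate-wise soft-thresholding. The function names `prox_l1` and `prox_objective`, the scalar `weight`, and the numerical values are assumptions made for this example only.

```python
import numpy as np

def prox_l1(v, mu, weight=1.0):
    """prox_{mu*g}(v) for g(theta) = weight * ||theta||_1 (soft-thresholding)."""
    v = np.asarray(v, dtype=float)
    thresh = mu * weight
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def prox_objective(theta, v, mu, weight=1.0):
    """Objective minimized by the proximal mapping: g(theta) + ||theta - v||^2 / (2 mu)."""
    return weight * np.abs(theta).sum() + np.sum((theta - v) ** 2) / (2.0 * mu)

v = np.array([3.0, -0.2, 0.5, -2.0])
theta = prox_l1(v, mu=0.5, weight=1.0)  # threshold is mu * weight = 0.5
```

Entries of `v` whose magnitude does not exceed the threshold are set to zero, which is the mechanism by which this regularizer promotes sparsity.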

II PRELIMINARIES

II-A Problem formulation

Consider a connected network of $m$ agents captured by an $m$ -th order undirected graph: $\mathcal{G}=(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}=\{1,\dots,m\}$ is the vertex set, $\mathcal{E}$ contains pairs $(i,j)$ if and only if agent $i$ communicates with agent $j$ , and $\absolutevalue{\mathcal{E}}=n$ captures the number of communication links in the network. We denote the neighborhood of agent $i$ as $\mathcal{N}_{i}=\{j:(i,j)\in\mathcal{E}\}$ . We further define the source and the destination matrix (arbitrary ordering suffices since both are given the same value) $\hat{A}_{s},\hat{A}_{d}\in\mathbb{R}^{n\times m}$ respectively as follows. Each row of $\hat{A}_{s}$ and $\hat{A}_{d}$ corresponds to a communication link $\mathcal{E}_{k}=(i,j)$ , for some $k\in[n]$ , and $[\hat{A}_{s}]^{ki}=[\hat{A}_{d}]^{kj}=1$ with all other entries equal to $0$ . We reformulate (1) in the consensus framework by introducing local decision variables $x^{i}$ held by the corresponding agent $i$ , along with a set of intermediate consensus variables $\{z^{ij}\}$ . Specifically,

	$\displaystyle\underset{x^{i},\,\theta,\,z\,\in\mathbb{R}^{d}}{\mathrm{minimize}}\left\{\sum_{i=1}^{m}f^{i}(x^{i})+g(\theta)\right\}$
	$\displaystyle\text{s.t.}\quad x^{i}=z^{ij}=x^{j}\quad\forall\,i\,\,\text{and}\,\,j\in\mathcal{N}_{i},$
	$\displaystyle x^{l}=\theta\quad\text{for some }l\in[m],$		(2)

where the $l$ -th agent is chosen arbitrarily to enforce the additional constraint for the nonsmooth regularizer decision variable. The constraints enforce consensus among the network if the underlying graph is connected and, therefore, (2) is equivalent to (1) in the sense that they have the same set of optimal solutions, i.e., $\hat{x}_{\star}=x^{1}_{\star}=\dots=x^{m}_{\star}=\theta_{\star}=z_{\star}$ . Consensus can be achieved by directly enforcing $x^{i}=x^{j}$ , but having an intermediate consensus variable $z^{ij}$ is crucial for decoupling the primal variables $x^{i}$ so that computing quasi-Newton steps does not require additional communication within the network. We provide further discussion on this design choice in Section III, Remark 1. We may compactly express (2) using the aforementioned source and destination matrices as follows:

	$\displaystyle\underset{x\in\mathbb{R}^{md},\theta\in\mathbb{R}^{d},z\in\mathbb{R}^{nd}}{\mathrm{minimize}}\left\{F(x)+g(\theta)\right\}$
	$\displaystyle\text{s.t.}\quad Ax=\begin{bmatrix}\hat{A}_{s}\otimes I_{d}\\ \hat{A}_{d}\otimes I_{d}\end{bmatrix}x=\begin{bmatrix}I_{nd}\\ I_{nd}\end{bmatrix}z=Dz,$
	$\displaystyle Cx=\theta,$		(3)

where $\otimes$ denotes the Kronecker product and $I$ denotes the identity matrix of the corresponding dimension. We stack local decision variables as $x\in\mathbb{R}^{md}:=[(x^{1})^{\top},\dots,(x^{m})^{\top}]^{\top}$ and similarly for the consensus variable $z\in\mathbb{R}^{nd}:=[(z^{1})^{\top},\dots,(z^{n})^{\top}]^{\top}$ , and the matrices $A$ and $D$ are obtained by stacking matrices as shown in (3). The matrix $C\in\mathbb{R}^{d\times md}:=(c^{l})^{\top}\otimes I_{d}$ enforces the last constraint of (2) using the coordinate selection vector $c^{l}\in\mathbb{R}^{m}$ , whose entries are zero except the $l$ -th one being 1. We present some equalities that link our constraint matrices to the topology of the graph in the following.

	$\displaystyle\hat{E}_{s}=\hat{A}_{s}-\hat{A}_{d}\quad\hat{E}_{u}=\hat{A}_{s}+\hat{A}_{d}$		(4)
	$\displaystyle\hat{E}_{s}^{\top}\hat{E}_{s}=\hat{L}_{s}\quad\hat{E}_{u}^{\top}\hat{E}_{u}=\hat{L}_{u}$		(5)
	$\displaystyle\hat{\Delta}=\tfrac{1}{2}(\hat{L}_{s}+\hat{L}_{u})=\hat{A}_{s}^{\top}\hat{A}_{s}+\hat{A}_{d}^{\top}\hat{A}_{d}$		(6)

Matrices $\hat{E}_{s},\hat{E}_{u}\in\mathbb{R}^{n\times m}$ stand for the signed and unsigned incidence matrix of the graph, respectively, while $\hat{L}_{s},\hat{L}_{u}\in\mathbb{R}^{m\times m}$ denote the signed and unsigned Laplacian matrix of the graph, respectively. The diagonal degree matrix of the graph is denoted as $\hat{\Delta}$ , whose entries are $\hat{\Delta}^{ii}=|\mathcal{N}_{i}|$ .
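The identities (4)-(6) can be checked numerically on a small graph. The sketch below builds the source and destination matrices from an edge list and forms the incidence, Laplacian, and degree matrices; the helper name `graph_matrices`, the 0-based agent labels, and the example topology are assumptions of this illustration.

```python
import numpy as np

def graph_matrices(edges, m):
    """Build the hatted matrices of (4)-(6) from a list of undirected links (i, j)."""
    n = len(edges)
    A_s = np.zeros((n, m))  # source matrix: row k marks agent i of link k
    A_d = np.zeros((n, m))  # destination matrix: row k marks agent j of link k
    for k, (i, j) in enumerate(edges):
        A_s[k, i] = 1.0
        A_d[k, j] = 1.0
    E_s, E_u = A_s - A_d, A_s + A_d        # signed / unsigned incidence, (4)
    L_s, L_u = E_s.T @ E_s, E_u.T @ E_u    # signed / unsigned Laplacian, (5)
    Delta = 0.5 * (L_s + L_u)              # degree matrix, (6)
    return A_s, A_d, E_s, E_u, L_s, L_u, Delta

# Each communication link is listed once; agents are labeled 0..3 here.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
A_s, A_d, E_s, E_u, L_s, L_u, Delta = graph_matrices(edges, m=4)
degrees = np.diag(Delta)
```

For this topology the diagonal of `Delta` recovers the neighborhood sizes $|\mathcal{N}_{i}|$ , and the rows of `L_s` sum to zero, as expected of a graph Laplacian.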

Assumption 1. Local cost functions $f^{i}(\cdot)$ are twice differentiable with uniformly bounded Hessian as follows:

\displaystyle m_{f}I\preceq\nabla^{2}f^{i}(\cdot)\preceq M_{f}I,\,\,\forall\,\,i\in[m],

(7)

where $0<m_{f}\leq M_{f}<\infty$ . Since $F(x)=\sum_{i=1}^{m}f^{i}(x^{i})$ , $\nabla^{2}F(x)$ is block diagonal with the $i$ -th block being $\nabla^{2}f^{i}(x^{i})$ . Therefore, the same bounds apply to the Hessian of $F(x)$ as well, i.e., $m_{f}I\preceq\nabla^{2}F(x)\preceq M_{f}I$ .

II-B Introduction to BFGS

Quasi-Newton methods [16]-[18] constitute a popular class of methods for accelerating the convergence of optimization methods without directly invoking the Hessian. Quasi-Newton methods seek to approximate the curvature information using differences of consecutive iterates and gradient evaluations. More specifically, we define the descent direction $h_{t}$ as

\displaystyle h_{t}:=-B_{t}^{-1}\nabla f_{t},

(8)

where $B_{t}$ is a positive definite matrix that approximates the Hessian $\nabla^{2}f_{t}$ . Various schemes exist for designing algorithms that iteratively update $B_{t}$ including those by Davidon, Fletcher, and Powell (DFP) [16] and Broyden, Fletcher, Goldfarb, and Shanno (BFGS) [17].
In this paper, we focus on the BFGS scheme to select $B_{t}^{-1}$ as it is observed to work best in practice, both in terms of convergence speed and self-correcting capabilities [18]. The BFGS scheme approximates the Hessian using finite differences of consecutive iterates and gradient evaluations. Specifically, define the iterate difference and the gradient difference as follows:

\displaystyle s_{t}=x_{t+1}-x_{t},\quad d_{t}=\nabla f(x_{t+1})-\nabla f(x_{t}).

Quasi-Newton methods select $B_{t+1}$ so that the secant condition is satisfied, i.e., $B_{t+1}s_{t}=d_{t}$ , which is motivated by the fact that the true Hessian satisfies this condition as $x_{t+1}$ tends to $x_{t}$ . To ensure $B_{t+1}\succ 0$ , it must hold that $s_{t}^{\top}d_{t}>0$ , as can be seen by premultiplying the secant equation with $s_{t}^{\top}$ . For strongly convex objectives, as in Assumption 1, this condition is satisfied automatically whenever $s_{t}\neq 0$ . Note, however, that the secant condition alone is not enough to specify $B_{t+1}$ . To uniquely determine $B_{t+1}$ , we further require its inverse to be close to the previous value $B_{t}^{-1}$ in the following sense:

	$\displaystyle\underset{B^{-1}}{\mathrm{minimize}}\quad\norm{B^{-1}-B_{t}^{-1}}_{\mathbf{W}}$
	$\displaystyle\text{s.t.}\quad B^{-1}=(B^{-1})^{\top},\quad B^{-1}d_{t}=s_{t},$		(9)

where $\norm{A}_{\mathbf{W}}:=\norm{W^{\tfrac{1}{2}}AW^{\tfrac{1}{2}}}_{\mathbf{F}}$ stands for the weighted Frobenius norm associated with matrix $W$ whose inverse is the average Hessian [18]. A closed form solution for (9) can be obtained as:

\displaystyle B_{t+1}^{-1}=\left(I-\tfrac{s_{t}d_{t}^{\top}}{d_{t}^{\top}s_{t}}\right)B_{t}^{-1}\left(I-\tfrac{d_{t}s_{t}^{\top}}{d_{t}^{\top}s_{t}}\right)+\tfrac{s_{t}s_{t}^{\top}}{d_{t}^{\top}s_{t}}.

(10)
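A minimal sketch of the inverse update (10) is given below, applied to a single step on a quadratic whose gradient difference is $d_{t}=Qs_{t}$ . The function name `bfgs_inverse_update`, the matrix `Q`, and the identity initialization are assumptions chosen for illustration.

```python
import numpy as np

def bfgs_inverse_update(B_inv, s, d):
    """One BFGS update of the inverse Hessian approximation, following (10)."""
    sd = float(d @ s)
    if sd <= 0.0:
        raise ValueError("curvature condition s^T d > 0 is violated")
    identity = np.eye(len(s))
    V = identity - np.outer(s, d) / sd           # I - s d^T / (d^T s)
    return V @ B_inv @ V.T + np.outer(s, s) / sd

# Quadratic example: gradient differences are Q times iterate differences.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
s = np.array([1.0, -1.0])
d = Q @ s
B_inv_next = bfgs_inverse_update(np.eye(2), s, d)
```

The updated matrix satisfies the inverse secant condition $B_{t+1}^{-1}d_{t}=s_{t}$ by construction, and it remains symmetric positive definite whenever $s_{t}^{\top}d_{t}>0$ .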

II-C Alternating Direction Method of Multipliers

We proceed to introduce the Augmented Lagrangian associated with problem (3) as follows:

	$\displaystyle\mathcal{L}(x,z,\theta;y,\lambda)=F(x)+g(\theta)+y^{\top}(Ax-Dz)$
	$\displaystyle+\lambda^{\top}(Cx-\theta)+\tfrac{1}{2\mu_{1}}\norm{Ax-Dz}^{2}+\tfrac{1}{2\mu_{2}}\norm{Cx-\theta}^{2},$

where $y\in\mathbb{R}^{2nd},\lambda\in\mathbb{R}^{d}$ are dual variables corresponding to the two linear constraints in (3). The Alternating Direction Method of Multipliers (ADMM) solves (3) by sequentially minimizing the Augmented Lagrangian over the primal variables, followed by a gradient ascent step on the dual variables, as:


$\displaystyle x_{t+1}$	$\displaystyle=\underset{x}{\mathrm{argmin}}\,\,\mathcal{L}(x,z_{t},\theta_{t};y_{t},\lambda_{t}),$	(11a)
$\displaystyle z_{t+1}$	$\displaystyle=\underset{z}{\mathrm{argmin}}\,\,\mathcal{L}(x_{t+1},z,\theta_{t};y_{t},\lambda_{t}),$	(11b)
$\displaystyle\theta_{t+1}$	$\displaystyle=\underset{\theta}{\mathrm{argmin}}\,\,\mathcal{L}(x_{t+1},z_{t+1},\theta;y_{t},\lambda_{t}),$	(11c)
$\displaystyle y_{t+1}$	$\displaystyle=y_{t}+\tfrac{1}{\mu_{1}}(Ax_{t+1}-Dz_{t+1}),$	(11d)
$\displaystyle\lambda_{t+1}$	$\displaystyle=\lambda_{t}+\tfrac{1}{\mu_{2}}(Cx_{t+1}-\theta_{t+1}).$	(11e)

Note that (11) is a 3-block ADMM which, in general, is not guaranteed to converge for arbitrary values of $\mu_{1},\mu_{2}>0$ [19]. We separate the dual variables into $y$ and $\lambda$ for two reasons: (i) Dual variables accumulate the consensus error within the network, as seen in (11d) and (11e). Since $y$ corresponds to $2n$ constraints while $\lambda$ only involves a single constraint, it is beneficial to separate the two and design different stepsizes $\tfrac{1}{\mu_{1}}$ and $\tfrac{1}{\mu_{2}}$ , which allows extra freedom in choosing hyperparameters and results in better performance in practice. (ii) In Section III, we further show that with appropriate initialization, the storage requirement for $y$ can be reduced by half.
In each iteration of ADMM, a sub-optimization problem has to be solved in (11a) to obtain $x_{t+1}$ . This may be computationally expensive and place a heavy burden on the system. We therefore propose to perform inexact optimization with the aid of an approximated Hessian using BFGS, aiming to reduce computational costs while maintaining fast convergence speed.

III ALGORITHMIC DESCRIPTION AND IMPLEMENTATION

We build a local model of the Augmented Lagrangian at step $t$ with respect to the primal variable $x$ as:

\displaystyle\widehat{\mathcal{L}}_{t}(x)=\mathcal{L}_{t}+\nabla_{x}\mathcal{L}_{t}^{\top}(x-x_{t})+\tfrac{1}{2}\norm{x-x_{t}}^{2}_{H_{t}},

(12)

where $\mathcal{L}_{t}:=\mathcal{L}(x_{t},z_{t},\theta_{t};y_{t},\lambda_{t})$ . Various algorithms can be designed by choosing different $H_{t}$ . In this paper, we opt to use regularized BFGS approximation of the Augmented Lagrangian Hessian as follows:

\displaystyle H_{t}=B_{t}+\tfrac{1}{\mu_{1}}\Delta+\tfrac{1}{\mu_{2}}C^{\top}C+\epsilon I_{md},

(13)

where $\epsilon>0$ and $\Delta=A^{\top}A$ is the Kronecker product of the graph degree matrix $\hat{\Delta}$ and $I_{d}$ . Note that by setting $B_{t}=\nabla^{2}F(x_{t})$ , we recover the Hessian of the Augmented Lagrangian with respect to $x$ , i.e., $H_{t}=\nabla_{xx}^{2}\mathcal{L}$ in such case. The quadratic form of (12) admits a closed form solution when minimizing over $x$ , i.e.,

\displaystyle x_{t+1}=x_{t}-H_{t}^{-1}\nabla_{x}\mathcal{L}_{t}.

(14)

Step (11b) reduces to solving the following linear system of equations:

\displaystyle D^{\top}y_{t}+\tfrac{1}{\mu_{1}}D^{\top}(Ax_{t+1}-Dz_{t+1})=0.

(15)

Moreover, by completion of squares, step (11c) is equivalent to evaluating the following proximal mapping associated with $g(\cdot)$ with parameter $\mu_{2}$ :

\displaystyle\theta_{t+1}=\textbf{prox}_{\mu_{2}g}(Cx_{t+1}+\mu_{2}\lambda_{t}).

(16)

Dual variables are updated verbatim as in (11d) and (11e). We proceed to present Lemma 1, which states that with an appropriate initialization, the intermediate variable $\{z^{ij}\}$ evolves on the manifold defined by the column space of $E_{u}$ . Matrices without the hat denote the Kronecker product of the corresponding matrix with the identity matrix, e.g., $E_{u}=\hat{E}_{u}\otimes I_{d}$ where $\hat{E}_{u}$ is defined in (4).

Lemma 1. Recall the signed and unsigned incidence and Laplacian matrices in (4) and (5), respectively. For zero initialization of $(x_{t},y_{t})$ , we can decompose $y_{t}^{\top}=\begin{bmatrix}\alpha_{t}^{\top}&-\alpha_{t}^{\top}\end{bmatrix}$ for all $t\geq 0$ . Moreover, the update rules (14)-(16), (11d), and (11e) can be simplified as:


$\displaystyle x_{t+1}$	$\displaystyle=x_{t}-H_{t}^{-1}\big{\{}\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+C^{\top}\lambda_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}$
	$\displaystyle+\tfrac{1}{\mu_{2}}C^{\top}(x_{t}^{l}-\theta_{t})\big{\}},$	(17a)
$\displaystyle\theta_{t+1}$	$\displaystyle=\textbf{prox}_{\mu_{2}g}(x_{t+1}^{l}+\mu_{2}\lambda_{t}),$	(17b)
$\displaystyle\alpha_{t+1}$	$\displaystyle=\alpha_{t}+\tfrac{1}{2\mu_{1}}E_{s}x_{t+1},$	(17c)
$\displaystyle\lambda_{t+1}$	$\displaystyle=\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t+1}).$	(17d)

Proof : See Appendix A.
Once $\theta_{t+1}$ is obtained, updates (17c) and (17d) can be performed in parallel and we have effectively transformed a 3-block ADMM in (11) to a 2-block ADMM, which converges under more general settings [20].
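The structure asserted in Lemma 1 can be observed numerically. The sketch below runs the $z$ -update (15) and the dual update (11d) from $y_{0}=0$ and compares the result with the reduced recursion (17c); since the identity holds for any primal sequence, random vectors stand in for $x_{t+1}$ . The triangle graph, the dimensions, and the value of $\mu_{1}$ are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (0, 2)]   # hypothetical 3-agent ring, 0-based labels
m, n, d, mu1 = 3, 3, 2, 0.7

A_s_hat = np.zeros((n, m))
A_d_hat = np.zeros((n, m))
for k, (i, j) in enumerate(edges):
    A_s_hat[k, i] = 1.0
    A_d_hat[k, j] = 1.0
A_s, A_d = np.kron(A_s_hat, np.eye(d)), np.kron(A_d_hat, np.eye(d))
A = np.vstack([A_s, A_d])
D = np.vstack([np.eye(n * d), np.eye(n * d)])
E_s, E_u = A_s - A_d, A_s + A_d

y = np.zeros(2 * n * d)      # zero initialization of the dual variable
alpha = np.zeros(n * d)
max_y_gap, max_z_gap = 0.0, 0.0
for t in range(5):
    x_next = rng.standard_normal(m * d)  # stand-in for any primal iterate
    # (15): D^T D z = mu1 D^T y + D^T A x, with D^T D = 2 I
    z_next = 0.5 * (mu1 * D.T @ y + D.T @ A @ x_next)
    y = y + (A @ x_next - D @ z_next) / mu1          # (11d)
    alpha = alpha + E_s @ x_next / (2.0 * mu1)       # (17c)
    max_y_gap = max(max_y_gap, np.abs(y - np.concatenate([alpha, -alpha])).max())
    max_z_gap = max(max_z_gap, np.abs(z_next - 0.5 * E_u @ x_next).max())
```

Both gaps stay at the level of rounding error, i.e., $y_{t}$ retains the form $[\alpha_{t}^{\top}\;-\alpha_{t}^{\top}]^{\top}$ and $z_{t+1}=\tfrac{1}{2}E_{u}x_{t+1}$ , so only $\alpha_{t}$ needs to be stored.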
Remark 1: The reason for introducing the consensus variable $z$ in (2) is to decouple the $x^{i}$ ’s so that the Hessian of the Augmented Lagrangian has a block diagonal structure, cf. (13), which is instrumental for computing the quasi-Newton step without additional communication within the network once the gradient of the Augmented Lagrangian is obtained. Note that there is no need to invert any matrix or solve a linear system, and the inverse in (17) is only notational. Instead, we directly approximate each block $(B_{t+1}^{-1})^{ii}$ , corresponding to agent $i$ , and form $(H_{t+1}^{-1})^{ii}$ as explained in the following. Since $B_{t+1}^{-1}$ is block diagonal, we approximate $(B_{t+1}^{-1})^{ii}$ using (10) with $d_{t}=\nabla f^{i}(x^{i}_{t+1})-\nabla f^{i}(x^{i}_{t})$ and $s_{t}=x_{t+1}^{i}-x_{t}^{i}$ . Moreover, $H_{t+1}-B_{t+1}$ is diagonal and constant (time invariant) from (13), which allows for eigendecomposition with components only depending on local information. Specifically, using the Sherman-Morrison formula and defining $C^{i}_{1}=(B_{t+1}^{-1})^{ii}$ , we obtain an iterative formula for $(H_{t+1}^{-1})^{ii}=C^{i}_{d+1}$ as follows:

\displaystyle C^{i}_{k+1}=C^{i}_{k}-\tfrac{C^{i}_{k}v_{k}^{i}(v^{i}_{k})^{\top}C^{i}_{k}}{1+(v^{i}_{k})^{\top}C^{i}_{k}v^{i}_{k}},

(18)

with constant vector $v^{i}_{k}$ specified as:

\displaystyle v^{i}_{k}=\left(\tfrac{\absolutevalue{\mathcal{N}_{i}}}{\mu_{1}}+\tfrac{\mathbf{1}_{il}}{\mu_{2}}+\epsilon\right)^{\tfrac{1}{2}}e_{k},

(19)

where $\mathbf{1}_{il}=1$ if $i=l$ (i.e., if it is the $l$ -th agent who performs the proximal mapping associated with the nonsmooth regularizer $g(\cdot)$ ) and is zero otherwise, and $e_{k}\in\mathbb{R}^{d}$ is the $k$ -th standard basis vector, i.e., all of its entries equal $0$ except for the $k$ -th one, which equals $1$ . If consensus is enforced directly as $x^{i}=x^{j}$ , the resulting Hessian of the Augmented Lagrangian would involve the graph Laplacian matrix as in [11], which couples $x^{i}$ with its neighbors. In other words, computing quasi-Newton steps would induce multiple inner communication rounds within the network at each iteration, to approximate the descent direction to a desired accuracy.
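The recursion (18) can be sketched as follows for a single agent. Writing $c_{i}=\tfrac{|\mathcal{N}_{i}|}{\mu_{1}}+\tfrac{\mathbf{1}_{il}}{\mu_{2}}+\epsilon$ , the sketch takes $v_{k}^{i}=\sqrt{c_{i}}\,e_{k}$ , which is the scaling for which the $d$ rank-one terms $v_{k}^{i}(v_{k}^{i})^{\top}$ sum to $c_{i}I_{d}$ . The function name `regularized_inverse` and the numerical values of the block and of $c_{i}$ are assumptions of this illustration.

```python
import numpy as np

def regularized_inverse(B_inv, c):
    """Return (B + c I)^{-1} from B^{-1} via d rank-one Sherman-Morrison steps, cf. (18)."""
    C = np.array(B_inv, dtype=float)
    dim = C.shape[0]
    for k in range(dim):
        v = np.zeros(dim)
        v[k] = np.sqrt(c)          # v_k = sqrt(c) e_k, so that v_k v_k^T = c e_k e_k^T
        Cv = C @ v
        # C is symmetric, hence C v v^T C = (C v)(C v)^T
        C = C - np.outer(Cv, Cv) / (1.0 + v @ Cv)
    return C

# Hypothetical local block: two neighbors, mu1 = 0.5, eps = 0.1, agent is not l.
B = np.array([[2.0, 0.3], [0.3, 1.0]])
c_i = 2 / 0.5 + 0.1
H_inv = regularized_inverse(np.linalg.inv(B), c_i)
H_inv_direct = np.linalg.inv(B + c_i * np.eye(2))  # reference, used only for comparison
```

The direct inverse is computed here purely as a reference; within the algorithm only the recursion is used, and each agent needs nothing beyond its own degree and the hyperparameters.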
We now present a first order variant of BFGS-ADMM, which only uses first order information, so as to demonstrate the merits of using the BFGS approximation, both theoretically and experimentally. Note that the only part of $H_{t}$ in (13) that depends on the iterate counter is $B_{t}$ , which is an approximation of $\nabla^{2}F(x_{t})$ . Therefore, a first order approximation can be derived by setting $B_{t}=0$ in which case $H_{t}$ in (13) is a constant diagonal matrix. In other words, agent $i$ selects a step size equal to:

\displaystyle s_{i}=\left(\tfrac{\absolutevalue{\mathcal{N}_{i}}}{\mu_{1}}+\tfrac{\mathbf{1}_{il}}{\mu_{2}}+\epsilon\right)^{-1}.

(20)

Therefore the first order variant updates the primal vector $x_{t+1}$ as:

	$\displaystyle x_{t+1}=x_{t}-\textbf{diag}[s_{i}]\big{\{}\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+C^{\top}\lambda_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}$
	$\displaystyle+\tfrac{1}{\mu_{2}}C^{\top}(x_{t}^{l}-\theta_{t})\big{\}},$		(21)

Algorithm 1 BFGS-ADMM update

Require: Zero initialization for all variables. Hyperparameters $\mu_{1},\mu_{2}$ , and $\epsilon$ . Initialization $B_{0}^{-1}=aI_{md}$ for some $a>0$ .

1: for $t=0,1,2,\ldots$ do
2:   for $i\in[m]$ do
3:     Compute $h_{t}^{i}=\nabla f^{i}(x_{t}^{i})+\tfrac{1}{2\mu_{1}}\sum_{j\in\mathcal{N}_{i}}(x_{t}^{i}-x_{t}^{j})+\phi_{t}^{i}$
4:     if $i=l$ then
5:       Update $h_{t}^{i}\leftarrow h_{t}^{i}+\tfrac{1}{\mu_{2}}(x_{t}^{i}-\theta_{t})+\lambda_{t}$
6:     end if
7:     Compute $u^{i}_{t}=(H_{t}^{-1})^{ii}h_{t}^{i}$
8:     Primal update: $x_{t+1}^{i}=x_{t}^{i}-u_{t}^{i}$
9:     Dual update: $\phi_{t+1}^{i}=\phi_{t}^{i}+\tfrac{1}{2\mu_{1}}\sum_{j\in\mathcal{N}_{i}}(x_{t+1}^{i}-x_{t+1}^{j})$
10:     if $i=l$ then
11:       Compute $\theta_{t+1}=\textbf{prox}_{\mu_{2}g}(x^{i}_{t+1}+\mu_{2}\lambda_{t})$
12:       Compute $\lambda_{t+1}=\lambda_{t}+\tfrac{1}{\mu_{2}}(x^{i}_{t+1}-\theta_{t+1})$
13:     end if
14:     Set $s_{t}=-u_{t}^{i}$ and $d_{t}=\nabla f^{i}(x^{i}_{t+1})-\nabla f^{i}(x^{i}_{t})$
15:     Update $(B_{t+1}^{-1})^{ii}$ using (10).
16:     Update $(H_{t+1}^{-1})^{ii}$ using (18).
17:   end for
18: end for

where $\textbf{diag}[s_{i}]\in\mathbb{R}^{md\times md}$ is the block diagonal matrix whose $i$ -th diagonal block is $s_{i}I_{d}$ , and the remaining variables $(\theta_{t+1},\alpha_{t+1},\lambda_{t+1})$ are updated in exactly the same way as in BFGS-ADMM using (17b)-(17d). Note that our first order variant encapsulates the existing first order method EXTRA [21] as a special case by setting $s_{i}=\tfrac{1}{\epsilon}$ . Indeed, when $g(\cdot)=0$ , the proximal mapping and the associated dual variable in (17) are not needed and we obtain: $x_{t+1}=x_{t}-\tfrac{1}{\epsilon}\left[\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}\right].$ Taking the difference between the $x_{t+1}$ update and the $x_{t}$ update defined this way, and substituting the dual update for $\alpha_{t}$ using (17c), we obtain:

	$\displaystyle x_{t+1}=2(I-\tfrac{1}{2\epsilon\mu_{1}}L_{s})x_{t}-(I-\tfrac{1}{2\epsilon\mu_{1}}L_{s})x_{t-1}$
	$\displaystyle-\tfrac{1}{\epsilon}(\nabla F(x_{t})-\nabla F(x_{t-1})).$		(22)

With appropriate choices of $\mu_{1},\mu_{2},\epsilon$ , the update defined in (22) is equivalent to EXTRA. We present the comparison between BFGS-ADMM and its first order variant in terms of approximating the exact ADMM in the next section.
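The elimination of the dual variable described above can be verified numerically. The sketch below runs the first order variant with $g(\cdot)=0$ and $s_{i}=\tfrac{1}{\epsilon}$ on scalar quadratic costs and checks that consecutive iterates satisfy a two-step recursion with mixing matrix $W=I-\tfrac{1}{2\epsilon\mu_{1}}L_{s}$ , carrying the factor $\tfrac{1}{\epsilon}$ through the Laplacian terms explicitly. The ring graph, the costs $f^{i}(x)=\tfrac{a_{i}}{2}(x-b_{i})^{2}$ , and the hyperparameter values are assumptions of this example.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]   # hypothetical 4-agent ring, d = 1
m, mu1, eps = 4, 1.0, 5.0
E_s = np.zeros((len(edges), m))
for k, (i, j) in enumerate(edges):
    E_s[k, i], E_s[k, j] = 1.0, -1.0
L_s = E_s.T @ E_s

a = np.array([1.0, 2.0, 0.5, 1.5])         # local curvatures
b = np.array([1.0, -1.0, 2.0, 0.0])        # local minimizers

def grad_F(x):
    """Stacked gradient of F(x) = sum_i a_i (x^i - b_i)^2 / 2."""
    return a * (x - b)

def primal_dual_step(x, alpha):
    """First order primal step with s_i = 1/eps, followed by the dual step (17c)."""
    x_next = x - (grad_F(x) + E_s.T @ alpha + L_s @ x / (2.0 * mu1)) / eps
    alpha_next = alpha + E_s @ x_next / (2.0 * mu1)
    return x_next, alpha_next

xs = [np.zeros(m)]
alpha = np.zeros(len(edges))
for _ in range(30):
    x_next, alpha = primal_dual_step(xs[-1], alpha)
    xs.append(x_next)

# Two-step recursion obtained by eliminating alpha, valid for t >= 1.
W = np.eye(m) - L_s / (2.0 * eps * mu1)
max_residual = max(
    np.abs(xs[t + 1] - (2.0 * W @ xs[t] - W @ xs[t - 1]
                        - (grad_F(xs[t]) - grad_F(xs[t - 1])) / eps)).max()
    for t in range(1, 30)
)
```

The residual is zero up to rounding, which confirms that the primal-dual iteration and the two-step primal recursion generate the same sequence from the second step onward.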
We proceed to present the distributed implementation of BFGS-ADMM in Algorithm 1. Note that in the primal update (17a), dual variables are invoked in the form of $E_{s}^{\top}\alpha_{t}$ . By defining $\phi_{t}=E_{s}^{\top}\alpha_{t}\in\mathbb{R}^{md}$ and pre-multiplying both sides of (17c) with $E_{s}^{\top}$ , we obtain an equivalent algorithm that allows for efficient distributed implementation. We let each agent hold the corresponding pair $(x^{i},\phi^{i})$ while the $l$ -th agent additionally holds $(\theta,\lambda)$ . At each iteration, each agent $i$ begins with calculating $h_{t}^{i}$ using $x_{t}^{j}$ from its neighbors as in step 3, where the $l$ -th agent performs an additional update in step 5. Note that step 7 does not require solving a linear system since $(H_{t}^{-1})^{ii}$ is directly approximated using (18). Primal and dual updates are performed at each agent as in step 8 and step 9, respectively. The $l$ -th agent evaluates an additional proximal mapping in step 11 to update $\theta_{t+1}$ , and updates $\lambda_{t+1}$ in step 12. Finally, each agent $i$ updates its $(B_{t+1}^{-1})^{ii}$ and $(H_{t+1}^{-1})^{ii}$ using (10) and (18), respectively.

IV CONVERGENCE ANALYSIS

Linear convergence rate for ADMM is well established [15]. Since our goal here is to reduce the computational cost of ADMM, the best one can hope for is to reduce the gap between the approximation and the standard ADMM, while maintaining the linear convergence rate. In Lemma 4, we demonstrate the advantage of using BFGS approximation over first order methods by showing that the aforementioned gap decreases faster for BFGS-ADMM. We first state an additional assumption on the nonsmooth regularizer $g(\cdot)$ and the KKT conditions associated with (3) expressed in the primal optimal pair $(x_{\star},\theta_{\star})$ and the dual optimal pair $(\alpha_{\star},\lambda_{\star})$ . Note that we have eliminated $z_{\star}$ , as it is shown in Lemma 1 that it is a function of $x_{\star}$ .

Assumption 2: The function $g(\cdot):\mathbb{R}^{d}\to\mathbb{R}$ is proper, closed, and convex. Consequently, its subdifferential is monotone, i.e., $\forall\,x,y\in\mathbb{R}^{d}$ , the following inequality holds:

\displaystyle(x-y)^{\top}(\partial g(x)-\partial g(y))\geq 0,

(23)

where $\partial g(\cdot)$ denotes the subdifferential set. We proceed to state a non-restrictive assumption that can be easily achieved by regularization.

Assumption 3: There exists a positive constant $0<\nu<\infty$ such that the eigenvalues of $B_{t}$ are upper bounded, i.e.,

\displaystyle B_{t}\prec\nu I.

Remark 2: The upper bound described above can be achieved by setting $B_{t}^{-1}=\hat{B}_{t}^{-1}+\tfrac{1}{\nu}I$ , where $\hat{B}_{t}$ comes from the BFGS update formula. Since $\hat{B}_{t}^{-1}\succ 0$ with positive definite initialization, we have $B_{t}^{-1}\succ\tfrac{1}{\nu}I$ , i.e., $B_{t}\prec\nu I$ .
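The effect of the regularization in Remark 2 can be seen on a small example. In the sketch below, an eigenvalue $\lambda$ of $\hat{B}_{t}$ is mapped to $\tfrac{\lambda\nu}{\lambda+\nu}<\nu$ ; the function name `cap_curvature` and the matrix `B_hat` are hypothetical.

```python
import numpy as np

def cap_curvature(B_hat_inv, nu):
    """Regularize the inverse approximation: B^{-1} = B_hat^{-1} + I / nu."""
    return B_hat_inv + np.eye(B_hat_inv.shape[0]) / nu

B_hat = np.array([[50.0, 3.0], [3.0, 0.5]])   # positive definite, largest eigenvalue above nu
nu = 10.0
B_inv = cap_curvature(np.linalg.inv(B_hat), nu)
B = np.linalg.inv(B_inv)                      # formed only to inspect its spectrum

lam_hat = np.linalg.eigvalsh(B_hat).max()
largest_eig = np.linalg.eigvalsh(B).max()
```

Although `B_hat` violates the bound, the regularized matrix satisfies it, and only the inverse `B_inv` is ever needed by the algorithm.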
KKT conditions:


$\displaystyle\nabla F(x_{\star})+E_{s}^{\top}\alpha_{\star}+C^{\top}\lambda_{\star}$	$\displaystyle=0$	(KKTa)
$\displaystyle\partial g(\theta_{\star})-\lambda_{\star}$	$\displaystyle\ni 0$	(KKTb)
$\displaystyle E_{s}x_{\star}$	$\displaystyle=0$	(KKTc)
$\displaystyle x_{\star}^{l}-\theta_{\star}$	$\displaystyle=0$	(KKTd)

We proceed to state a technical lemma that describes an inclusion relationship between the subdifferential set $\partial g(\theta_{t+1})$ and the dual variable $\lambda_{t+1}$ .

Lemma 2. Consider the update $\theta_{t+1}=\textbf{prox}_{\mu_{2}g}(x_{t+1}^{l}+\mu_{2}\lambda_{t})$ in (17b). It holds that:

\displaystyle\lambda_{t+1}\in\partial g(\theta_{t+1}).

(25)

Proof : See Appendix B.
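For a concrete instance of Lemma 2, take $g(\theta)=w\norm{\theta}_{1}$ with a scalar weight $w>0$ , whose subdifferential is $w\,\mathrm{sign}(\theta_{k})$ on nonzero coordinates and the interval $[-w,w]$ on zero coordinates. The sketch below applies (17b) and (17d) to arbitrary inputs $x^{l}_{t+1}$ and tests the inclusion (25) at every step; the helper names, the weight, and the random inputs are assumptions of this example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def in_l1_subdifferential(lam, theta, w, tol=1e-12):
    """Check lam in the subdifferential of w * ||.||_1 at theta, coordinate-wise."""
    nonzero = theta != 0
    ok_nonzero = np.all(np.abs(lam[nonzero] - w * np.sign(theta[nonzero])) <= tol)
    ok_zero = np.all(np.abs(lam[~nonzero]) <= w + tol)
    return bool(ok_nonzero and ok_zero)

rng = np.random.default_rng(1)
w, mu2 = 0.3, 0.8
lam = np.zeros(6)
checks = []
for _ in range(10):
    x_l = rng.standard_normal(6)                          # stand-in for x^l_{t+1}
    theta = soft_threshold(x_l + mu2 * lam, mu2 * w)      # (17b)
    lam = lam + (x_l - theta) / mu2                       # (17d)
    checks.append(in_l1_subdifferential(lam, theta, w))
```

Every entry of `checks` is true regardless of how $x^{l}_{t+1}$ is generated, reflecting that the inclusion follows from the optimality condition of the proximal step alone.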

Recall the first order variant of the BFGS-ADMM where the primal variable is updated as in (21) while BFGS-ADMM updates $x_{t+1}$ as in (17a), and both update the remaining variables as in (17b)-(17d). We proceed to present results which capture their similarities and differences in the following.

Lemma 3. If Assumptions 1-2 hold, then the iterates generated by BFGS-ADMM and its first order variant both satisfy the following equation:

	$\displaystyle e_{t}+\nabla F(x_{t+1})-\nabla F(x_{\star})+(\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I)(x_{t+1}-x_{t})$
	$\displaystyle+C^{\top}\big{\{}\lambda_{t+1}-\lambda_{\star}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})\big{\}}+E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})=0,$		(26)

where, for BFGS-ADMM,

\displaystyle e_{t}=\nabla F(x_{t})+B_{t}(x_{t+1}-x_{t})-\nabla F(x_{t+1}),

(27)

and, for its first order variant,

\displaystyle e_{t}=\nabla F(x_{t})-\nabla F(x_{t+1}).

(28)

Proof : See Appendix C.
By comparing (27) and (28), we see that the difference between using BFGS vs. first order approximation lies in how $\nabla F(x_{t+1})$ is approximated at step $t$ . Moreover, if the sub-optimization problem is solved exactly as in (11a), by optimality condition, equation (26) holds with $e_{t}=0$ .
Lemma 4. Consider the bound for $\nabla^{2}F(x)$ in (7) and the $e_{t}$ introduced in Lemma 3. If Assumption 3 holds, $\norm{e_{t}}$ can be upper bounded as follows.
For BFGS-ADMM,

\displaystyle\norm{e_{t}}\leq\norm{B_{t+1}-B_{t}}\norm{x_{t+1}-x_{t}}.

(29)

For its first order variant as in (21),

\displaystyle\norm{e_{t}}\leq M_{f}\norm{x_{t+1}-x_{t}}.

(30)

Proof : See Appendix D.
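The two bounds can be compared on a single step of a quadratic objective, where $d_{t}=Qs_{t}$ . The sketch below uses the standard direct form of the BFGS update, $B_{t+1}=B_{t}-\tfrac{B_{t}s_{t}s_{t}^{\top}B_{t}}{s_{t}^{\top}B_{t}s_{t}}+\tfrac{d_{t}d_{t}^{\top}}{d_{t}^{\top}s_{t}}$ , which is the counterpart of the inverse form (10); the secant condition then gives $e_{t}=(B_{t}-B_{t+1})s_{t}$ for (27). The matrix `Q`, the two iterates, and the choice $B_{t}=I$ are assumptions of this example.

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 2.0]])     # Hessian of a quadratic F
x_t = np.array([1.0, -2.0])
x_next = np.array([0.4, -1.1])
s = x_next - x_t
d = Q @ s                                  # gradient difference of the quadratic

B_t = np.eye(2)                            # hypothetical current approximation
Bs = B_t @ s
B_next = B_t - np.outer(Bs, Bs) / (s @ Bs) + np.outer(d, d) / (d @ s)

e_bfgs = B_t @ s - d                       # (27): grad F(x_t) + B_t s - grad F(x_{t+1})
e_first = -d                               # (28): grad F(x_t) - grad F(x_{t+1})

bound_bfgs = np.linalg.norm(B_next - B_t, 2) * np.linalg.norm(s)   # (29)
M_f = np.linalg.eigvalsh(Q).max()
bound_first = M_f * np.linalg.norm(s)                              # (30)
```

The factor multiplying $\norm{x_{t+1}-x_{t}}$ in (29) depends on how much the approximation changes between steps, whereas the factor in (30) is the fixed constant $M_{f}$ .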
Remark 3: Lemma 4 is a standalone result whose proof does not require the convergence of the algorithm. In Theorem 1, we establish that BFGS-ADMM converges linearly, which means $\lim_{t\to\infty}\norm{B_{t+1}-B_{t}}=0$ , since both $d_{t},s_{t}$ approach $0$ as $t\to\infty$ in (10). Therefore, $\norm{e_{t}}=o(\norm{x_{t+1}-x_{t}})$ for BFGS-ADMM while $\norm{e_{t}}=O(\norm{x_{t+1}-x_{t}})$ for its first order variant. The variable $e_{t}$ captures the approximation error relative to the exact solution of the primal problem ( $e_{t}=0$ in that case). Lemma 4 therefore reveals that the proposed method tracks the exact ADMM more closely, which is the key attribute that yields a faster linear convergence rate than its first order variant.
Lemma 5. Consider the update in (17c) and (17d). With zero initialization, the stacked vector $[\alpha^{\top}_{t}\,\,\lambda_{t}^{\top}]^{\top}\in\mathbb{R}^{(n+1)d}$ lies in the column space of $X^{\top}=[E_{s}^{\top}\,\,C^{\top}]^{\top}\in\mathbb{R}^{(n+1)d\times md}$ . Moreover, there exists a unique optimal $[\alpha_{\star}^{\top}\,\,\lambda_{\star}^{\top}]^{\top}$ in the column space of $X^{\top}$ and, denoting $\sigma^{+}_{\mathrm{min}}$ as the smallest positive eigenvalue of $X^{\top}X$ , the following inequality holds:

	$\displaystyle\sigma^{+}_{\mathrm{min}}\left(\norm{\alpha_{t+1}-\alpha_{\star}}^{2}+\norm{\lambda_{t+1}-\lambda_{\star}}^{2}\right)\leq$
	$\displaystyle\norm{E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})}^{2}.$		(31)

Proof : See Appendix E.
We introduce the following scaling matrix, defined in terms of the hyperparameters $\epsilon,\mu_{1},\mu_{2}$ and the graph topology (as captured by $L_{u}$): $\mathcal{H}=\begin{bmatrix}\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I&0&0&0\\ 0&\tfrac{1}{\mu_{2}}&0&0\\ 0&0&2\mu_{1}&0\\ 0&0&0&\mu_{2}\end{bmatrix}.$
Theorem 1. Consider the update in (17). Define $u^{\top}=[x^{\top},\theta^{\top},\alpha^{\top},\lambda^{\top}]$ and $u_{\star}$ the corresponding optimum. Denote the smallest positive eigenvalue of $\begin{bmatrix}E_{s}\\ C\end{bmatrix}\begin{bmatrix}E_{s}^{\top}&C^{\top}\end{bmatrix}$ as $\sigma^{+}_{\mathrm{min}}$, and the smallest and largest eigenvalues of $L_{u}$ as $\sigma^{G}_{\mathrm{min}}$ and $\sigma^{G}_{\mathrm{max}}$, respectively. Denoting $\kappa=\tfrac{M_{f}}{m_{f}}$, for arbitrary constants $\beta,\gamma,\psi,\rho>1$, $\zeta\in(\tfrac{m_{f}+M_{f}}{2m_{f}M_{f}},\tfrac{\epsilon}{4M_{f}^{2}})$, and $\epsilon>2(m_{f}+M_{f})\kappa$, the iterates converge linearly as follows:

\displaystyle\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}\leq\tfrac{1}{1+\delta}\norm{u_{t}-u_{\star}}^{2}_{\mathcal{H}},

(32)

where

	$\displaystyle\delta=\min\bigg{\{}\left(\tfrac{2m_{f}M_{f}}{m_{f}+M_{f}}-\tfrac{1}{\zeta}\right)\left(\tfrac{2\mu_{1}\mu_{2}}{\mu_{2}(\sigma_{\max}^{G}+2\mu_{1}\epsilon)+\rho\mu_{1}}\right),\tfrac{\rho-1}{\rho},$
	$\displaystyle\tfrac{(\beta-1)\sigma^{+}_{\mathrm{min}}}{\mu\beta}\left(\tfrac{\tfrac{\sigma_{\min}^{G}}{2\mu_{1}}+\epsilon-\zeta\tau^{2}}{\tfrac{\sigma_{\max}^{G}}{2\mu_{1}}+\epsilon+\tfrac{\gamma\tau^{2}(\beta-1)}{\sigma^{+}_{\mathrm{min}}(\gamma-1)}}\right),$
	$\displaystyle\tfrac{2\sigma^{+}_{\mathrm{min}}}{(m_{f}+M_{f})\mu\beta\psi}\left(\tfrac{1}{\gamma}\right),\tfrac{\sigma^{+}_{\mathrm{min}}\mu_{2}(\psi-1)}{\mu\beta\psi}\left(\tfrac{1}{\gamma}\right)\bigg{\}}.$		(33)

Proof: Since $F(x)$ is strongly convex with parameter $m_{f}$ and the gradient $\nabla F(x)$ is Lipschitz continuous with parameter $M_{f}$ ,

	$\displaystyle\tfrac{m_{f}M_{f}}{m_{f}+M_{f}}\norm{x_{t+1}-x_{\star}}^{2}+\tfrac{1}{m_{f}+M_{f}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}\leq$
	$\displaystyle(x_{t+1}-x_{\star})^{\top}(\nabla F(x_{t+1})-\nabla F(x_{\star})).$		(34)

Using Lemma 3, we rewrite the RHS (right hand side) of (34) as

	$\displaystyle\textbf{RHS}=-(x_{t+1}-x_{\star})^{\top}\left(\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I\right)(x_{t+1}-x_{t})$
	$\displaystyle-(x_{t+1}^{l}-x_{\star}^{l})^{\top}\left[\lambda_{t+1}-\lambda_{\star}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})\right]$
	$\displaystyle-(x_{t+1}-x_{\star})^{\top}E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})-(x_{t+1}-x_{\star})^{\top}e_{t},$		(35)

where we have used the fact that $(x_{t+1}-x_{\star})^{\top}C^{\top}=(x_{t+1}^{l}-x_{\star}^{l})^{\top}$. From the dual update (17c) and (KKTc), it holds that

\displaystyle(x_{t+1}-x_{\star})^{\top}E_{s}^{\top}=2\mu_{1}(\alpha_{t+1}-\alpha_{t})^{\top}.

(36)

Moreover, using (17d) and (KKTd), we obtain

\displaystyle(x_{t+1}^{l}-x_{\star}^{l})^{\top}=\mu_{2}(\lambda_{t+1}-\lambda_{t})^{\top}+(\theta_{t+1}-\theta_{\star})^{\top}.

(37)

After substituting (36) and (37) into (35), we obtain

	$\displaystyle\textbf{RHS}=-(x_{t+1}-x_{\star})^{\top}\left(\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I\right)(x_{t+1}-x_{t})$
	$\displaystyle-\mu_{2}(\lambda_{t+1}-\lambda_{t})^{\top}(\lambda_{t+1}-\lambda_{\star})-(\lambda_{t+1}-\lambda_{t})^{\top}(\theta_{t+1}-\theta_{t})$
	$\displaystyle-(\theta_{t+1}-\theta_{\star})^{\top}(\lambda_{t+1}-\lambda_{\star})-\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{\star})^{\top}(\theta_{t+1}-\theta_{t})$
	$\displaystyle-2\mu_{1}(\alpha_{t+1}-\alpha_{t})^{\top}(\alpha_{t+1}-\alpha_{\star})-(x_{t+1}-x_{\star})^{\top}e_{t}.$		(38)

We have $(\lambda_{t+1}-\lambda_{t})^{\top}(\theta_{t+1}-\theta_{t})\in(\partial g(\theta_{t+1})-\partial g(\theta_{t}))^{\top}(\theta_{t+1}-\theta_{t})\geq 0$, where the inclusion follows from Lemma 2 and the inequality from Assumption 1. Similarly, $(\theta_{t+1}-\theta_{\star})^{\top}(\lambda_{t+1}-\lambda_{\star})\geq 0$. Therefore, we obtain

	$\displaystyle\textbf{RHS}\leq-(x_{t+1}-x_{\star})^{\top}\left(\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I\right)(x_{t+1}-x_{t})$
	$\displaystyle-\mu_{2}(\lambda_{t+1}-\lambda_{t})^{\top}(\lambda_{t+1}-\lambda_{\star})-(\theta_{t+1}-\theta_{\star})^{\top}\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})$
	$\displaystyle-2\mu_{1}(\alpha_{t+1}-\alpha_{t})^{\top}(\alpha_{t+1}-\alpha_{\star})-(x_{t+1}-x_{\star})^{\top}e_{t}.$		(39)

Note that $-2(a-b)^{\top}(a-c)=\norm{b-c}^{2}-\norm{a-b}^{2}-\norm{a-c}^{2}$ holds true for any $(a,b,c)$ . Multiplying both sides of (39) by $2$ and using the aforementioned identity for the inner product terms, and considering the concatenation $u^{\top}=[x^{\top},\theta^{\top},\alpha^{\top},\lambda^{\top}]$ we obtain:

	$\displaystyle\tfrac{2m_{f}M_{f}}{m_{f}+M_{f}}\norm{x_{t+1}-x_{\star}}^{2}+\tfrac{2}{m_{f}+M_{f}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}$
	$\displaystyle+\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{t}}^{2}+\tfrac{1}{\mu_{2}}\norm{\theta_{t+1}-\theta_{t}}^{2}$
	$\displaystyle+2\mu_{1}\norm{\alpha_{t+1}-\alpha_{t}}^{2}+2(x_{t+1}-x_{\star})^{\top}e_{t}\leq$
	$\displaystyle\norm{u_{t}-u_{\star}}^{2}_{\mathcal{H}}-\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}$		(40)

where $\tilde{L}_{u}=\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I$ is the first diagonal block of $\mathcal{H}$. To establish linear convergence as in (32), it suffices to show that for some $\delta>0$, the following holds: $\delta\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}\leq\norm{u_{t}-u_{\star}}^{2}_{\mathcal{H}}-\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}$. Moreover, for any $\zeta>0$, the following inequality holds: $2(x_{t+1}-x_{\star})^{\top}e_{t}\geq-\tfrac{1}{\zeta}\norm{x_{t+1}-x_{\star}}^{2}-\zeta\norm{e_{t}}^{2}$. Therefore, it is sufficient to show

	$\displaystyle\tfrac{2m_{f}M_{f}}{m_{f}+M_{f}}\norm{x_{t+1}-x_{\star}}^{2}+\tfrac{2}{m_{f}+M_{f}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}$
	$\displaystyle+\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{t}}^{2}+\tfrac{1}{\mu_{2}}\norm{\theta_{t+1}-\theta_{t}}^{2}$
	$\displaystyle+2\mu_{1}\norm{\alpha_{t+1}-\alpha_{t}}^{2}-\tfrac{1}{\zeta}\norm{x_{t+1}-x_{\star}}^{2}-\zeta\norm{e_{t}}^{2}\geq$
	$\displaystyle\delta\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}.$		(41)

We proceed to establish this bound. From Lemma 3, it holds:

	$\displaystyle E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})=$
	$\displaystyle-\big{\{}\nabla F(x_{t+1})-\nabla F(x_{\star})+\left(\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I\right)(x_{t+1}-x_{t})$
	$\displaystyle+\tfrac{1}{\mu_{2}}C^{\top}(\theta_{t+1}-\theta_{t})+e_{t}\big{\}}.$		(42)

Note that $\norm{b+c}^{2}\leq\tfrac{\beta}{\beta-1}\norm{b}^{2}+\beta\norm{c}^{2}$ holds true for any vectors $b,c$ and any $\beta>1$. After applying this inequality three times to (42), with arbitrary constants $\beta,\gamma,\psi>1$, we obtain

	$\displaystyle\norm{E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})}^{2}\leq$
	$\displaystyle\tfrac{\beta}{\beta-1}\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}^{2}}+\tfrac{\beta\gamma}{\gamma-1}\norm{e_{t}}^{2}+\tfrac{\beta\gamma\psi}{(\mu_{2})^{2}(\psi-1)}\norm{\theta_{t+1}-\theta_{t}}^{2}$
	$\displaystyle+\beta\gamma\psi\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}.$		(43)

Consider the error bound in Lemma 4. From the BFGS update, $B_{t+1}-B_{t}=-\tfrac{B_{t}s_{t}s_{t}^{\top}B_{t}}{s_{t}^{\top}B_{t}s_{t}}+\tfrac{d_{t}d_{t}^{\top}}{d_{t}^{\top}s_{t}}.$ Therefore, by the triangle inequality, $\norm{B_{t+1}-B_{t}}\leq\tfrac{\norm{d_{t}}^{2}}{d_{t}^{\top}s_{t}}+\tfrac{\norm{B_{t}s_{t}}^{2}}{s_{t}^{\top}B_{t}s_{t}}$. Since $F(x)$ is strongly convex with parameter $m_{f}$ and its gradient is Lipschitz continuous with parameter $M_{f}$, the following inequalities hold:

\displaystyle d_{t}^{\top}s_{t}\geq m_{f}\norm{s_{t}}^{2},\quad\text{and}\quad d_{t}^{\top}s_{t}\geq\tfrac{1}{M_{f}}\norm{d_{t}}^{2}.

(44)

By setting $q_{t}=B_{t}^{\tfrac{1}{2}}s_{t}$, we obtain $\tfrac{\norm{B_{t}s_{t}}^{2}}{s_{t}^{\top}B_{t}s_{t}}=\tfrac{q_{t}^{\top}B_{t}q_{t}}{q_{t}^{\top}q_{t}}\leq\lambda_{\mathrm{max}}(B_{t})<\nu.$ Therefore, $\norm{e_{t}}\leq\tau\norm{x_{t+1}-x_{t}},$ where $\tau=\min\left\{M_{f},\tfrac{\norm{d_{t}}^{2}}{\norm{s_{t}}^{2}}\right\}+\nu$. From Lemma 5, we have

	$\displaystyle\sigma^{+}_{\mathrm{min}}\left(\norm{\alpha_{t+1}-\alpha_{\star}}^{2}+\norm{\lambda_{t+1}-\lambda_{\star}}^{2}\right)\leq$
	$\displaystyle\norm{E_{s}^{\top}(\alpha_{t+1}-\alpha_{\star})+C^{\top}(\lambda_{t+1}-\lambda_{\star})}^{2}.$		(45)

Denoting $\mu=\max\{2\mu_{1},\mu_{2}\}$ and combining (43)-(45), we have:

	$\displaystyle 2\mu_{1}\norm{\alpha_{t+1}-\alpha_{\star}}^{2}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{\star}}^{2}\leq$
	$\displaystyle\tfrac{\mu\beta}{\sigma^{+}_{\mathrm{min}}(\beta-1)}\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}^{2}}+\tfrac{\mu\beta\gamma\tau^{2}}{\sigma^{+}_{\mathrm{min}}(\gamma-1)}\norm{x_{t+1}-x_{t}}^{2}$
	$\displaystyle+\tfrac{\mu\beta\gamma\psi}{(\mu_{2})^{2}\sigma^{+}_{\mathrm{min}}(\psi-1)}\norm{\theta_{t+1}-\theta_{t}}^{2}$
	$\displaystyle+\tfrac{\mu\beta\gamma\psi}{\sigma^{+}_{\mathrm{min}}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}.$		(46)

Furthermore, from (17d) and (KKTd), it holds that $\theta_{t+1}-\theta_{\star}=-\mu_{2}(\lambda_{t+1}-\lambda_{t})+x_{t+1}^{l}-x_{\star}^{l}$ . Using the same technique as in deriving (43), we obtain for arbitrary $\rho>1$ :

\displaystyle\norm{\theta_{t+1}-\theta_{\star}}^{2}\leq\rho\norm{x_{t+1}^{l}-x_{\star}^{l}}^{2}+\tfrac{(\mu_{2})^{2}\rho}{\rho-1}\norm{\lambda_{t+1}-\lambda_{t}}^{2}.

(47)

Using (46) and (47), it suffices to show that the following inequality holds true for some $\delta>0$ ,

	$\displaystyle\delta\norm{u_{t+1}-u_{\star}}^{2}_{\mathcal{H}}\leq\tfrac{2m_{f}M_{f}}{m_{f}+M_{f}}\norm{x_{t+1}-x_{\star}}^{2}$
	$\displaystyle+\tfrac{2}{m_{f}+M_{f}}\norm{\nabla F(x_{t+1})-\nabla F(x_{\star})}^{2}$
	$\displaystyle+\norm{x_{t+1}-x_{t}}^{2}_{\tilde{L}_{u}}+\mu_{2}\norm{\lambda_{t+1}-\lambda_{t}}^{2}+\tfrac{1}{\mu_{2}}\norm{\theta_{t+1}-\theta_{t}}^{2}$
	$\displaystyle+2\mu_{1}\norm{\alpha_{t+1}-\alpha_{t}}^{2}-\tfrac{1}{\zeta}\norm{x_{t+1}-x_{\star}}^{2}-\zeta\tau^{2}\norm{x_{t+1}-x_{t}}^{2},$

which is satisfied if $\delta$ is chosen as in (33). $\blacksquare$

V EXPERIMENTS

In this section, we present numerical experiments comparing the proposed method against two distributed first order methods, P2D2 [22] and PG-EXTRA [7], on the following distributed logistic regression problem:

\displaystyle\underset{x\in\mathbb{R}^{d}}{\mathrm{minimize}}\,\,F(x)=\left\{\frac{1}{m}\sum_{i=1}^{m}f^{i}(x)+\gamma\norm{x}_{1}\right\},

where $f^{i}(x)=\tfrac{1}{m_{i}}\sum_{j=1}^{m_{i}}\left[\ln(1+e^{w_{j}^{\top}x})+(1-y_{j})w_{j}^{\top}x\right]$, $m_{i}$ is the number of data points held by agent $i$, and $(w_{j},y_{j})\in\mathbb{R}^{d}\times\{0,1\}$ are the training samples of dimension $d$ and their binary labels, respectively. We consider two real datasets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/): the skin_nonskin dataset and the ijcnn1 dataset. We take 5,000 data points from each dataset, with dimension $d=3$ and $d=22$, respectively. A random binomial graph with $m$ agents is drawn (each edge is drawn independently from a Bernoulli($p$) distribution; $p=0.2$ was used in all cases). The mixing matrix for P2D2 is generated using the Metropolis rule, while the mixing matrix for PG-EXTRA is generated using the Laplacian-based constant edge weight matrix. In all cases, the $\ell_{1}$-norm weight is $\gamma=2\times 10^{-6}$. We plot the average relative cost error, defined as $\tfrac{\tfrac{1}{m}\sum_{i=1}^{m}F(x_{t}^{i})-F(x_{\star})}{\tfrac{1}{m}\sum_{i=1}^{m}F(x_{0}^{i})-F(x_{\star})}$, for different network sizes. As can be seen in Figures 1 and 2, BFGS-ADMM achieves a significant speedup on both datasets compared to the other methods. Note that our first order variant also shows advantages over the other first order methods, because in our first order approximation the step size selection also takes into account the number of neighbors of each agent, as in (20).
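The setup described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names (`random_binomial_graph`, `local_cost`, `objective`, `relative_cost_error`), the toy samples, and the placeholder value used for $F(x_{\star})$ are all assumptions introduced here. The local cost is transcribed literally from the formula in the text.

```python
import math
import random

def random_binomial_graph(m, p, rng):
    """Draw each of the m(m-1)/2 possible edges independently with probability p."""
    return [(i, j) for i in range(m) for j in range(i + 1, m) if rng.random() < p]

def local_cost(x, samples):
    """f^i(x) as written in the text: mean of ln(1 + e^{w^T x}) + (1 - y) w^T x."""
    total = 0.0
    for w, y in samples:
        z = sum(wk * xk for wk, xk in zip(w, x))
        total += math.log1p(math.exp(z)) + (1 - y) * z
    return total / len(samples)

def objective(x, agent_samples, gamma):
    """F(x) = (1/m) sum_i f^i(x) + gamma * ||x||_1."""
    smooth = sum(local_cost(x, s) for s in agent_samples) / len(agent_samples)
    return smooth + gamma * sum(abs(xk) for xk in x)

def relative_cost_error(iterates, initial_iterates, f_star, agent_samples, gamma):
    """Average cost gap over agents at step t, normalized by the gap at step 0."""
    def avg_cost(points):
        return sum(objective(x, agent_samples, gamma) for x in points) / len(points)
    return (avg_cost(iterates) - f_star) / (avg_cost(initial_iterates) - f_star)

# Toy data (hypothetical): two agents, two samples each, dimension d = 2.
agent_samples = [
    [([1.0, 2.0], 1), ([-1.0, 0.5], 0)],
    [([0.5, -1.5], 0), ([2.0, 1.0], 1)],
]
gamma = 2e-6                      # l1 weight used in the text
rng = random.Random(0)            # fixed seed, only for reproducibility of the sketch
edges = random_binomial_graph(10, 0.2, rng)

x0 = [[0.0, 0.0], [0.0, 0.0]]     # zero initialization at both agents
f_star = 0.5                      # placeholder; in practice obtained from a centralized solver
err0 = relative_cost_error(x0, x0, f_star, agent_samples, gamma)
```

By construction the metric equals 1 at $t=0$ and decreases toward 0 as the agents' iterates approach the optimum, which is why it is convenient for comparing methods across network sizes.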

References

[1] Y. Li, D. Stipanović, P. Voulgaris, and Z. Gu, “Decentralized model predictive control of urban drainage systems,” WSEAS Transactions on Systems and Control, vol. 14, pp. 247–256, 2019.
[2] T. Huang, N. M. Freris, P. R. Kumar, and L. Xie, “A synchrophasor data-driven method for forced oscillation localization under resonance conditions,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 3927–3939, 2020.
[3] R. Bekkerman, M. Bilenko, and J. Langford, “Scaling up machine learning: Parallel and distributed approaches,” in Proceedings of the 17th ACM SIGKDD International Conference Tutorials, 2011.
[4] N. M. Freris, H. Kowshik, and P. R. Kumar, “Fundamentals of large sensor networks: Connectivity, capacity, clocks, and computation,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1828–1846, 2010.
[5] K. Kim and P. R. Kumar, “Cyber–physical systems: A perspective at the centennial,” Proceedings of the IEEE, vol. 100, pp. 1287–1308, 2012.
[6] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[7] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
[8] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
[9] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, “DLM: Decentralized linearized alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 63, no. 15, pp. 4051–4064, 2015.
[10] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
[11] Y. Li, N. M. Freris, P. Voulgaris, and D. Stipanović, “D-SOP: Distributed second order proximal method for convex composite optimization,” in Proceedings of the 2020 American Control Conference, 2020, pp. 2844–2849.
[12] J. D. Lee, Y. Sun, and M. A. Saunders, “Proximal Newton-type methods for minimizing composite functions,” SIAM Journal on Optimization, vol. 24, no. 3, pp. 1420–1443, 2014.
[13] J. E. Dennis, Jr. and J. J. Moré, “Quasi-Newton methods, motivation and theory,” SIAM Review, vol. 19, no. 1, pp. 46–89, 1977.
[14] P. Latafat, N. M. Freris, and P. Patrinos, “A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 4050–4065, 2019.
[15] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
[16] W. C. Davidon, “Variable metric method for minimization,” SIAM Journal on Optimization, vol. 1, no. 1, pp. 1–17, 1991.
[17] D. Goldfarb, “A family of variable-metric methods derived by variational means,” Mathematics of Computation, pp. 23–26, 1970.
[18] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. Springer, 2006.
[19] C. Chen, B. He, Y. Ye, and X. Yuan, “The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent,” Mathematical Programming, vol. 155, no. 1–2, 2016.
[20] T. Lin, S. Ma, and S. Zhang, “Global convergence of unmodified 3-block ADMM for a class of convex minimization problems,” Journal of Scientific Computing, vol. 76, no. 1, pp. 69–88, 2018.
[21] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[22] S. Alghunaim, K. Yuan, and A. H. Sayed, “A linearly convergent proximal gradient algorithm for decentralized optimization,” in Advances in Neural Information Processing Systems, vol. 32, 2019.

Proof of Lemma 1
From the construction of $C=(c^{l})^{\top}\otimes I_{d}\in\mathbb{R}^{d\times md}$, where $c^{l}$ selects the $l$-th coordinate, we stress that the update (11e) only involves the subvector $x_{t+1}^{l}$, as can be seen by explicitly writing (11e) as:

\displaystyle\lambda_{t+1}=\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t+1}).

We decompose the dual variable as $y^{\top}=\begin{bmatrix}\alpha^{\top}&\beta^{\top}\end{bmatrix}$ , where $\alpha,\beta\in\mathbb{R}^{nd}$ . Recalling $D^{\top}=\begin{bmatrix}I_{nd}&I_{nd}\end{bmatrix}$ and pre-multiplying the dual update (11d) with $D^{\top}$ , we obtain

\displaystyle D^{\top}y_{t+1}=D^{\top}y_{t}+\tfrac{1}{\mu_{1}}D^{\top}(Ax_{t+1}-Dz_{t+1}).

Using (15), we conclude that $D^{\top}y_{t+1}=0$ for $t\geq 0$ , i.e., $\beta_{t+1}=-\alpha_{t+1}$ . With zero initialization, we can therefore express $y_{t}$ as $y_{t}=\begin{bmatrix}\alpha_{t}\\ -\alpha_{t}\end{bmatrix}$ , for all $t\geq 0$ . This shows that it is redundant to maintain both $\alpha$ and $\beta$ since they sum to $0$ for all $t\geq 0$ . Therefore, we can rewrite (11d) as:

	$\displaystyle\alpha_{t+1}=\alpha_{t}+\tfrac{1}{\mu_{1}}(A_{s}x_{t+1}-z_{t+1}),$		(48)
	$\displaystyle-\alpha_{t+1}=-\alpha_{t}+\tfrac{1}{\mu_{1}}(A_{d}x_{t+1}-z_{t+1}).$		(49)

After taking the difference of the above two equations and dividing by 2, we obtain:

\displaystyle\alpha_{t+1}=\alpha_{t}+\tfrac{1}{2\mu_{1}}E_{s}x_{t+1},

(50)

where we have used the fact that $E_{s}=A_{s}-A_{d}$ . Similarly, by summing (48) and (49), we obtain $z_{t+1}=\tfrac{1}{2}E_{u}x_{t+1}$ . With zero initialization for $x_{t}$ , it is redundant to keep $z_{t}$ since we can compute it as:

\displaystyle z_{t}=\tfrac{1}{2}E_{u}x_{t}\,\,\forall\,\,t\geq 0.

(51)

Recalling the approximated Hessian in (13), we express the primal update (14) as

	$\displaystyle x_{t+1}=x_{t}-H_{t}^{-1}\big{\{}\nabla F(x_{t})+A^{\top}y_{t}+C^{\top}\lambda_{t}$
	$\displaystyle+\tfrac{1}{\mu_{1}}A^{\top}(Ax_{t}-Dz_{t})+\tfrac{1}{\mu_{2}}C^{\top}(Cx_{t}-\theta_{t})\big{\}}.$		(52)

Since $E_{s}^{\top}=A_{s}^{\top}-A_{d}^{\top}$ , $E_{u}^{\top}=A_{s}^{\top}+A_{d}^{\top}$ , and $L_{u}=E_{u}^{\top}E_{u}$ as in (4) and (5), along with the decomposition of $y_{t}$ and the equality in (51), we have

	$\displaystyle A^{\top}y_{t}=E_{s}^{\top}\alpha_{t},$		(53)
	$\displaystyle\tfrac{1}{\mu_{1}}A^{\top}Dz_{t}=\tfrac{1}{2\mu_{1}}L_{u}x_{t}.$		(54)

Using (6), we can rewrite

\displaystyle\tfrac{1}{\mu_{1}}A^{\top}Ax_{t}-\tfrac{1}{2\mu_{1}}L_{u}x_{t}=\tfrac{1}{2\mu_{1}}(2\Delta-L_{u})x_{t}=\tfrac{1}{2\mu_{1}}L_{s}x_{t}.

(55)

Substituting (53)-(55) into (52), we obtain the primal updates as in (17a). $\blacksquare$
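The matrix identities used in this proof can be checked numerically on a small graph. The sketch below assumes that $A_{s}$ and $A_{d}$ are the edge-source and edge-destination incidence matrices (their definitions are not restated in this section), takes $d=1$ for readability, and uses a hypothetical 4-node edge list. Helper names such as `incidence` are introduced here for illustration only.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B, sign=1.0):
    """Entrywise A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def incidence(edges, m):
    """Assumed construction: row k of A_s (A_d) marks the source (destination) of edge k."""
    A_s = [[1.0 if v == i else 0.0 for v in range(m)] for (i, j) in edges]
    A_d = [[1.0 if v == j else 0.0 for v in range(m)] for (i, j) in edges]
    return A_s, A_d

# Hypothetical connected graph with m = 4 agents.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
m = 4
A_s, A_d = incidence(edges, m)

E_s = add(A_s, A_d, -1.0)          # signed incidence,   E_s = A_s - A_d
E_u = add(A_s, A_d)                # unsigned incidence, E_u = A_s + A_d
L_s = matmul(transpose(E_s), E_s)  # signed Laplacian
L_u = matmul(transpose(E_u), E_u)  # unsigned Laplacian
Delta = add(matmul(transpose(A_s), A_s), matmul(transpose(A_d), A_d))  # A^T A

# Identity behind (55): 2 * Delta = L_s + L_u.
two_delta = [[2.0 * v for v in row] for row in Delta]
lap_sum = add(L_s, L_u)

# A consensus vector lies in the kernel of E_s (used again in Lemma 5).
ones = [[1.0] for _ in range(m)]
E_s_ones = matmul(E_s, ones)
```

Here `Delta` comes out as the diagonal degree matrix, which is why the approximated Hessian built from it is block diagonal and can be inverted locally at each agent.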
Proof of Lemma 2
The update (17b) is equivalent to

\displaystyle\theta_{t+1}=\underset{\theta}{\mathrm{argmin}}\left\{g(\theta)+\tfrac{1}{2\mu_{2}}\norm{x_{t+1}^{l}+\mu_{2}\lambda_{t}-\theta}^{2}\right\}.

From the optimality condition, it holds that

\displaystyle 0\in\partial g(\theta_{t+1})-\tfrac{1}{\mu_{2}}(x_{t+1}^{l}+\mu_{2}\lambda_{t}-\theta_{t+1}).

(56)

Substituting the dual update $\lambda_{t+1}=\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t+1})$ into (56), we obtain

\displaystyle 0\in\partial g(\theta_{t+1})-\lambda_{t+1}.

After re-arranging, we obtain the desired result. $\blacksquare$
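For the $\ell_{1}$ regularizer used in the experiments, $g(\theta)=\gamma\norm{\theta}_{1}$, the argmin above has the well-known closed form of entrywise soft-thresholding with threshold $\mu_{2}\gamma$. The following sketch uses that closed form and then checks the conclusion of Lemma 2, $\lambda_{t+1}\in\partial g(\theta_{t+1})$, on made-up numbers. All values and function names are illustrative assumptions.

```python
def soft_threshold(v, thresh):
    """Entrywise prox of thresh * ||.||_1: shrink each entry toward zero by thresh."""
    return [max(abs(vi) - thresh, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def in_l1_subdifferential(lam, theta, gamma, tol=1e-9):
    """Check lam in the subdifferential of gamma * ||.||_1 at theta, entry by entry."""
    for l, th in zip(lam, theta):
        if th > 0 and abs(l - gamma) > tol:
            return False
        if th < 0 and abs(l + gamma) > tol:
            return False
        if th == 0 and abs(l) > gamma + tol:
            return False
    return True

# Illustrative values (not from the paper).
gamma, mu2 = 0.5, 2.0
x_l = [3.0, -0.2, 0.4, -2.5]       # subvector x_{t+1}^l
lam_t = [0.1, 0.3, -0.2, -0.4]     # current dual variable lambda_t

# theta update (17b): prox evaluated at x_{t+1}^l + mu2 * lambda_t.
v = [x + mu2 * l for x, l in zip(x_l, lam_t)]
theta_next = soft_threshold(v, mu2 * gamma)

# Dual update (17d): lambda_{t+1} = lambda_t + (x_{t+1}^l - theta_{t+1}) / mu2.
lam_next = [l + (x - th) / mu2 for l, x, th in zip(lam_t, x_l, theta_next)]

lemma2_holds = in_l1_subdifferential(lam_next, theta_next, gamma)
```

Entries of `theta_next` that are nonzero yield dual entries equal to $\pm\gamma$, while zeroed entries yield dual entries of magnitude at most $\gamma$, which is exactly the subdifferential of the scaled $\ell_{1}$ norm.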
Proof of Lemma 3
Note that the difference between BFGS-ADMM and its first order variant lies in $H_{t}$ (13). For the first order variant, $H_{t}=\tfrac{1}{\mu_{1}}\Delta+\tfrac{1}{\mu_{2}}C^{\top}C+\epsilon I_{md},$ i.e., $B_{t}$ is identically zero. By rearranging (17a), we obtain the following equation:

	$\displaystyle\nabla F(x_{t})+E_{s}^{\top}\alpha_{t}+C^{\top}\lambda_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}+\tfrac{1}{\mu_{2}}C^{\top}(Cx_{t}-\theta_{t})$
	$\displaystyle+(B_{t}+\tfrac{1}{\mu_{1}}\Delta+\tfrac{1}{\mu_{2}}C^{\top}C+\epsilon I_{md})(x_{t+1}-x_{t})=0.$		(57)

Using the dual update (17c), we have

\displaystyle E_{s}^{\top}\alpha_{t}+\tfrac{1}{2\mu_{1}}L_{s}x_{t}=E_{s}^{\top}\alpha_{t+1}-\tfrac{1}{2\mu_{1}}L_{s}(x_{t+1}-x_{t}).

(58)

Since $2\Delta=L_{s}+L_{u}$ , it holds that

\displaystyle\tfrac{1}{\mu_{1}}\Delta(x_{t+1}-x_{t})-\tfrac{1}{2\mu_{1}}L_{s}(x_{t+1}-x_{t})=\tfrac{1}{2\mu_{1}}L_{u}(x_{t+1}-x_{t}).

(59)

Substituting (58) and (59) into (57), we obtain

	$\displaystyle\nabla F(x_{t})+B_{t}(x_{t+1}-x_{t})+\left(\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I_{md}\right)(x_{t+1}-x_{t})$
	$\displaystyle+C^{\top}\big{\{}\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t})\big{\}}+E_{s}^{\top}\alpha_{t+1}=0.$		(60)

Using the dual update (17d), we have

\displaystyle\lambda_{t}+\tfrac{1}{\mu_{2}}(x_{t+1}^{l}-\theta_{t})=\lambda_{t+1}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t}).

(61)

Substituting (61) into (60) and using the definition of $e_{t}=\nabla F(x_{t})+B_{t}(x_{t+1}-x_{t})-\nabla F(x_{t+1})$ , we obtain

	$\displaystyle\nabla F(x_{t+1})+\left(\tfrac{1}{2\mu_{1}}L_{u}+\epsilon I\right)(x_{t+1}-x_{t})$
	$\displaystyle+C^{\top}\big{\{}\lambda_{t+1}+\tfrac{1}{\mu_{2}}(\theta_{t+1}-\theta_{t})\big{\}}+E_{s}^{\top}\alpha_{t+1}+e_{t}=0.$		(62)

After subtracting (KKTa) from (62), we obtain the desired result, where $B_{t}=0$ for the first order variant. $\blacksquare$
Proof of Lemma 4
For BFGS-ADMM, since $B_{t+1}$ satisfies the secant equation: $B_{t+1}(x_{t+1}-x_{t})=\nabla F(x_{t+1})-\nabla F(x_{t})$ , we have

\displaystyle e_{t}=(B_{t}-B_{t+1})(x_{t+1}-x_{t}),

and (29) follows by applying the Cauchy-Schwarz inequality. For the first order variant, the Lipschitz continuity of $\nabla F(x)$ implies that $\forall\,x,y\in\mathbb{R}^{md}$, $\norm{\nabla F(x)-\nabla F(y)}\leq M_{f}\norm{x-y}$. The claim in (30) follows. $\blacksquare$
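The two facts used in this proof, the secant equation and the resulting expression for $e_{t}$, can be checked numerically. The sketch below applies one standard BFGS update to a toy quadratic $F(x)=\tfrac{1}{2}x^{\top}Qx$. The matrix `Q`, the initial approximation `B0`, and the points `x0`, `x1` are illustrative choices, not values from the paper.

```python
def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def bfgs_update(B, s, d):
    """Standard BFGS update: B - (B s s^T B) / (s^T B s) + (d d^T) / (d^T s)."""
    Bs = matvec(B, s)          # B is symmetric, so (B s)(B s)^T = B s s^T B
    sBs = dot(s, Bs)
    ds = dot(d, s)
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + d[i] * d[j] / ds for j in range(n)]
            for i in range(n)]

Q = [[3.0, 1.0], [1.0, 2.0]]   # positive definite Hessian of the toy quadratic

def grad(x):
    return matvec(Q, x)

B0 = [[1.0, 0.0], [0.0, 1.0]]  # current curvature estimate B_t
x0 = [1.0, -1.0]               # x_t
x1 = [0.5, 0.25]               # x_{t+1}

s = sub(x1, x0)                # s_t = x_{t+1} - x_t
d = sub(grad(x1), grad(x0))    # d_t = grad F(x_{t+1}) - grad F(x_t)
B1 = bfgs_update(B0, s, d)

# Secant equation: B_{t+1} s_t = d_t.
secant_residual = sub(matvec(B1, s), d)

# e_t from its definition versus the identity e_t = (B_t - B_{t+1}) s_t.
e_def = sub([g + b for g, b in zip(grad(x0), matvec(B0, s))], grad(x1))
e_id = sub(matvec(B0, s), matvec(B1, s))
```

Because the error is expressed through the difference $B_{t}-B_{t+1}$, any bound on how much the curvature estimate changes between iterations translates directly into a bound on $\norm{e_{t}}$, which is the content of (29).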
Proof of Lemma 5
Consider the updates (17c) and (17d), which can be written compactly as $\begin{bmatrix}\alpha_{t+1}\\ \lambda_{t+1}\end{bmatrix}=\begin{bmatrix}\alpha_{t}\\ \lambda_{t}\end{bmatrix}+\begin{bmatrix}\tfrac{1}{2\mu_{1}}&0\\ 0&\tfrac{1}{\mu_{2}}\end{bmatrix}\left(\begin{bmatrix}E_{s}\\ C\end{bmatrix}x_{t+1}-\begin{bmatrix}\textbf{0}\\ \theta_{t+1}\end{bmatrix}\right),$ where $\textbf{0}$ is the zero vector of dimension $nd$. Therefore, it suffices to prove that the vector $[\textbf{0}^{\top}\,\,\theta_{t+1}^{\top}]^{\top}$ lies in the column space of $X^{\top}$. Note that since the graph is connected, $E_{s}r=0$ if $r\in\mathbb{R}^{md}$ is in consensus, i.e., $r^{1}=r^{2}=\dots=r^{m}\in\mathbb{R}^{d}$. By letting $r^{i}=\theta_{t+1}$ for all $i$, we have $\begin{bmatrix}E_{s}\\ C\end{bmatrix}r=\begin{bmatrix}\textbf{0}\\ \theta_{t+1}\end{bmatrix}$, which shows that $[\textbf{0}^{\top}\,\,\theta_{t+1}^{\top}]^{\top}$ lies in the column space of $X^{\top}$. To simplify the notation, we denote $[\alpha_{t}^{\top},\lambda_{t}^{\top}]^{\top}=w_{t}$. To prove the existence of an optimal $w_{\star}$ in the column space of $X^{\top}$, consider (KKTa), which is satisfied by any optimal $w$:

\displaystyle\nabla F(x_{\star})+Xw=0.

The projection of $w$ onto the column space of $X^{\top}$, denoted as $w_{\star}$, also satisfies (KKTa), since the difference lies in the kernel of $X$, i.e., $X(w-w_{\star})=0$. The uniqueness of $w_{\star}$ can be established by contradiction. Let $w_{1}=X^{\top}r_{1}$ and $w_{2}=X^{\top}r_{2}$ be two optimal stacked dual vectors that lie in the column space of $X^{\top}$, with $r_{1}\neq r_{2}$. Since $F(\cdot)$ is strongly convex, from (KKTa), it holds that $\nabla F(x_{\star})+XX^{\top}r_{1}=0$ and $\nabla F(x_{\star})+XX^{\top}r_{2}=0$. After taking the difference, we have $XX^{\top}(r_{1}-r_{2})=0$. But $XX^{\top}=E_{s}^{\top}E_{s}+C^{\top}C=L_{s}+C^{\top}C\succ 0$. Therefore $r_{1}-r_{2}=0$, a contradiction. Inequality (31) follows from the fact that $w_{t+1}-w_{\star}$ is orthogonal to the kernel of $X$. $\blacksquare$