BFGS-ADMM for Large-Scale Distributed Optimization
Abstract
We consider a class of distributed optimization problems in which the objective function consists of a sum of strongly convex and smooth functions and a (possibly nonsmooth) convex regularizer. A multi-agent network is assumed, where each agent holds a private cost function and cooperates with its neighbors to compute the optimum of the aggregate objective. We propose a quasi-Newton Alternating Direction Method of Multipliers (ADMM) in which the primal update is solved inexactly using approximate curvature information. By introducing an intermediate consensus variable, we obtain a block-diagonal Hessian, which eliminates the need for inner communication loops within the network when computing the update direction. We establish global linear convergence to the optimal primal-dual solution without backtracking line search, under the assumption that the component cost functions are strongly convex with Lipschitz continuous gradients. Numerical simulations on real datasets demonstrate the advantages of the proposed method over the state of the art.
I INTRODUCTION
Distributed optimization acts as the computational engine for a wide range of applications in modern technology, including distributed control [1], power grid management [2], distributed machine learning [3], and resource allocation in sensor networks [4]. We consider the classical problem of minimizing a sum of cost functions plus a convex regularizer that promotes sparsity in the solution. Specifically:
$$\min_{x \in \mathbb{R}^p} \ \sum_{i=1}^{n} f_i(x) + g(x), \qquad (1)$$
where $f_i : \mathbb{R}^p \to \mathbb{R}$ captures the $i$-th agent's objective of interest and $g$ is a possibly nonsmooth regularizer, such as the weighted $\ell_1$ norm. In a network of $n$ agents, each agent aims to minimize its local cost function while communicating with its neighbors to cooperatively find the solution of the global problem. Distributed solutions are often pursued to fully utilize the computational power of the agents and to reduce the amount of message passing in the network. Efficient communication is deemed especially desirable in scenarios where the number of participants $n$ and the dimension $p$ of the decision variable are large, such as in emerging applications of cyber-physical systems (CPS) [5]. An archetypal framework of distributed optimization introduces local decision variables at each agent, which are updated locally using private data and exchanged with neighboring agents. A network-wide solution is achieved by the asymptotic elimination of the consensus error, i.e., the difference between the values of neighboring agents converges to zero. Such a framework is termed distributed consensus optimization and offers significant flexibility for each agent to select an updating scheme that suits its local hardware environment.
First order methods [6]-[9] constitute popular choices for solving (1) in a distributed fashion. The presence of the nonsmooth regularizer prevents gradient descent from being applied directly, as $g$ is not differentiable. Subgradient methods [6] invoke the notion of the subdifferential set to compute descent directions while letting agents exchange information over a time-varying topology. Proximal gradient methods [8] accommodate the nonsmoothness by splitting the objective function and evaluating the proximal mapping associated with the nondifferentiable part. Various acceleration schemes [7], [9] exist for first order methods, which store past gradient and iterate information so as to form a momentum correction term. However, as the size of the data grows, the condition number of problem (1) tends to become large, which causes first order methods to exhibit extremely slow convergence. A natural remedy for the aforementioned issues is to consider second order methods [10]-[12]. By using function curvature information, second order methods compute descent directions in the objective function level sets that are effective at accelerating convergence over first order methods. Proximal Newton [12] can be considered the second order counterpart of proximal gradient, but the associated proximal mapping increases the computational burden drastically and renders distributed implementation infeasible, as global information is required to compute the Newton step. Moreover, second order methods require solving a linear system to compute the Newton step, which induces a computation cost cubic in the problem dimension. Quasi-Newton methods [13] circumvent this procedure by using finite differences of gradients to approximate the Hessian. The reduced computation cost along with competitive performance has rendered quasi-Newton methods a desirable alternative to second order methods, especially for large-scale optimization problems.
Primal-dual methods [14]-[15] provide a different perspective on problem (1) within the framework of constrained optimization, where the consensus constraint is explicitly enforced. The Alternating Direction Method of Multipliers (ADMM) [15] falls into this category, where the smooth and nonsmooth parts of the objective are treated separately and iterates are computed sequentially. However, at each iteration of the ADMM, a sub-optimization problem must be solved, which typically induces an expensive computational burden.
Contributions: (i) We propose BFGS-ADMM for convex composite optimization, where the primal update (typically computationally expensive) is replaced with a quasi-Newton step that does not require solving a linear system of equations. Moreover, by introducing an intermediate consensus variable, we eliminate the need for inner loops within the network when computing the update direction. (ii) We establish a global linear convergence rate for the proposed algorithm without backtracking, under the assumption that the private cost functions are strongly convex with Lipschitz continuous gradients. (iii) The advantage of BFGS-ADMM over first order methods is analytically established and experimentally demonstrated with numerical simulations on real datasets.
Notation: We denote column vectors with lowercase letters and matrices with capital letters. Superscripts are used to denote partitioned vector components and subscripts are used to denote iteration steps, e.g., $x_k^i$ denotes the value of subvector $x^i$ at step $k$. Matrix entries are denoted as $[A]_{ij}$. Unless specified otherwise, $\|\cdot\|$ denotes the Euclidean norm of a vector and the corresponding induced norm of a matrix, respectively. We define the norm of a vector associated with a positive definite matrix $P$ as $\|x\|_P = \sqrt{x^\top P x}$, and the set $\{1,\dots,n\}$ is abbreviated as $[n]$. The proximal mapping associated with the function $g$ and parameter $\alpha > 0$ is defined as $\operatorname{prox}_{g}^{\alpha}(v) = \arg\min_{x} \{ g(x) + \tfrac{1}{2\alpha}\|x - v\|^2 \}$.
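As a concrete instance of the proximal mapping defined above: for the (unweighted, for simplicity) $\ell_1$ norm used as a regularizer later in the paper, the mapping reduces to elementwise soft-thresholding. A minimal sketch in Python with NumPy (the function name is ours):

```python
import numpy as np

def prox_l1(v, alpha):
    """Proximal mapping of g(x) = ||x||_1 with parameter alpha > 0:
    argmin_x { ||x||_1 + (1 / (2 * alpha)) * ||x - v||^2 },
    which separates coordinatewise into soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)
```

For a weighted $\ell_1$ norm, the scalar `alpha` is simply replaced by a per-coordinate weight vector.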
II PRELIMINARIES
II-A Problem formulation
Consider a connected network of $n$ agents captured by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the vertex set, $\mathcal{E}$ contains the pair $(i,j)$ if and only if agent $i$ communicates with agent $j$, and $E = |\mathcal{E}|$ captures the number of communication links in the network. We denote the neighborhood of agent $i$ as $\mathcal{N}_i$. We further define the source and destination matrices (an arbitrary edge ordering suffices, since both are assigned the same value) as follows. Each row of the source and destination matrices corresponds to a communication link $(i,j) \in \mathcal{E}$, with the entry in the column of the source (respectively, destination) agent equal to $1$ and all other entries equal to $0$. We reformulate (1) in the consensus framework by introducing local decision variables held by the corresponding agents, along with a set of intermediate consensus variables. Specifically,
(2) |
where the designated agent is chosen arbitrarily to enforce the additional constraint for the nonsmooth-regularizer decision variable. The constraints enforce consensus across the network when the underlying graph is connected and, therefore, (2) is equivalent to (1) in the sense that they have the same set of optimal solutions. Consensus could be achieved by directly enforcing equality between neighboring local variables, but introducing an intermediate consensus variable is crucial for decoupling the primal variables, so that computing quasi-Newton steps does not require additional communication within the network. We provide further discussion of this design choice in Remark 1 of Section III. We may compactly express (2) using the aforementioned source and destination matrices as follows:
(3) |
where $\otimes$ denotes the Kronecker product and $I$ denotes the identity matrix of the corresponding dimension. We stack the local decision variables into a single vector, and similarly for the consensus variables; the constraint matrices are obtained by stacking the source and destination matrices as shown in (3). The last constraint of (2) is enforced using a coordinate selection vector whose entries are all zero except for one entry equal to $1$. We present some equalities that link our constraint matrices to the topology of the graph in the following.
(4)
(5)
(6)
The matrices above stand for the signed and unsigned incidence matrices of the graph, respectively, while the corresponding Gram matrices denote the signed and unsigned Laplacian matrices of the graph, respectively. The diagonal degree matrix of the graph is denoted accordingly, whose entries are the agents' degrees.
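The identities in (4)-(6) tie the edge matrices to the graph topology. Under one common edge-orientation convention (our illustrative choice; the paper's exact definitions may differ in scaling), the signed/unsigned incidence matrices and the degree matrix can be checked numerically:

```python
import numpy as np

def edge_matrices(n, edges):
    """Source and destination matrices: one row per (arbitrarily
    oriented) edge, with a single 1 in the source / destination column."""
    A_s = np.zeros((len(edges), n))
    A_d = np.zeros((len(edges), n))
    for e, (i, j) in enumerate(edges):
        A_s[e, i] = 1.0
        A_d[e, j] = 1.0
    return A_s, A_d

# Triangle graph on 3 agents.
edges = [(0, 1), (1, 2), (0, 2)]
A_s, A_d = edge_matrices(3, edges)
E_s = A_s - A_d          # signed incidence
E_u = A_s + A_d          # unsigned incidence
L_s = E_s.T @ E_s        # signed Laplacian (= degree - adjacency)
L_u = E_u.T @ E_u        # unsigned Laplacian (= degree + adjacency)
D = (L_s + L_u) / 2      # diagonal degree matrix
```

The signed Laplacian annihilates consensus vectors, which is the property exploited throughout the convergence analysis.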
Assumption 1. The local cost functions $f_i$ are twice differentiable with uniformly bounded Hessians as follows:
$$m I \preceq \nabla^2 f_i(x) \preceq M I, \quad \forall\, x, \ i \in [n], \qquad (7)$$
where $0 < m \le M < \infty$. Since the aggregate smooth cost is a sum of local costs in separate blocks of variables, its Hessian is block diagonal with the $i$-th block equal to $\nabla^2 f_i(x^i)$. Therefore, the same bounds apply to the Hessian of the aggregate cost as well, i.e., it satisfies (7) with the same constants.
II-B Introduction to BFGS
Quasi-Newton methods [16]-[18] constitute a popular class of methods for accelerating the convergence of optimization algorithms without directly invoking the Hessian. Quasi-Newton methods seek to approximate the curvature information using differences of consecutive gradient evaluations and iterates. More specifically, we define the descent direction as
$$d_k = -B_k^{-1} \nabla f(x_k), \qquad (8)$$
where $B_k$ is a positive definite matrix that approximates the Hessian $\nabla^2 f(x_k)$. Various schemes exist for designing algorithms that iteratively update $B_k$, including those by Davidon, Fletcher, and Powell (DFP) [16] and Broyden, Fletcher, Goldfarb, and Shanno (BFGS) [17].
In this paper, we focus on the BFGS scheme for selecting $B_k$, as it is observed to work best in practice both in terms of convergence speed and self-correcting capabilities [18]. BFGS approximates the Hessian using finite differences of consecutive iterates and gradient evaluations. Specifically, define the iterate difference $s_k = x_{k+1} - x_k$ and the gradient difference $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
Quasi-Newton methods select $B_{k+1}$ so that the secant condition is satisfied, i.e., $B_{k+1} s_k = y_k$, which is motivated by the fact that the true Hessian satisfies this condition asymptotically as $\|s_k\| \to 0$. To ensure $B_{k+1} \succ 0$, it must hold that $s_k^\top y_k > 0$, as can be seen by premultiplying the secant equation with $s_k^\top$. For strongly convex objectives, this condition is satisfied automatically. Note, however, that the secant condition alone is not enough to specify $B_{k+1}$. To uniquely determine $B_{k+1}$, we further require its inverse $H_{k+1} = B_{k+1}^{-1}$ to be close to the previous value $H_k$ in the following sense:
$$H_{k+1} = \arg\min_{H} \ \|H - H_k\|_W \quad \text{s.t.} \quad H = H^\top, \ \ H y_k = s_k, \qquad (9)$$
where $\|\cdot\|_W$ stands for the weighted Frobenius norm associated with a matrix $W$ whose inverse is the average Hessian [18]. A closed form solution for (9) can be obtained, yielding the update:
$$B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}. \qquad (10)$$
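The update (10) can be exercised on a toy quadratic, for which the gradient difference is exactly $y_k = Q s_k$; a minimal sketch (helper name ours):

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B from the iterate
    difference s and gradient difference y; requires the curvature
    condition s @ y > 0 to preserve positive definiteness."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```

By construction, the returned matrix satisfies the secant condition exactly, and positive definiteness is preserved whenever the curvature condition holds.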
II-C Alternating Direction Method of Multipliers
We proceed to introduce the Augmented Lagrangian associated with problem (2) as follows:
where are dual variables corresponding to the two linear constraints in (3). The Alternating Direction Method of Multipliers (ADMM) solves (2) by sequentially minimizing the Augmented Lagrangian over primal/dual variables as:
(11a)
(11b)
(11c)
(11d)
(11e)
Note that (11) is a 3-block ADMM which, in general, is not guaranteed to converge for arbitrary parameter values [19]. We separate the dual variables for two reasons: (i) Since the dual variables accumulate the consensus error within the network, as seen in (11d) and (11e), and since one dual variable corresponds to the edge constraints while the other involves only a single constraint, it is beneficial to separate the two and assign different stepsizes, which allows extra freedom in the hyperparameters and results in better performance in practice. (ii) In Section III, we further show that, with appropriate initialization, the storage requirement for the dual variables can be reduced by half.
In each iteration of ADMM, a sub-optimization problem must be solved in (11a). This may be computationally expensive and place a heavy burden on the system. We therefore propose to perform inexact minimization with the aid of an approximate Hessian obtained via BFGS, aiming to reduce computational costs while maintaining fast convergence.
III Algorithmic description and implementation
We build a local model of the Augmented Lagrangian at step with respect to the primal variable as:
(12) |
where . Various algorithms can be designed by choosing different . In this paper, we opt to use regularized BFGS approximation of the Augmented Lagrangian Hessian as follows:
(13) |
where and is the Kronecker product of the graph degree matrix and . Note that by setting , we recover the Hessian of the Augmented Lagrangian with respect to , i.e., in such case. The quadratic form of (12) admits a closed form solution when minimizing over , i.e.,
(14) |
Step (11b) reduces to solving the following linear system of equations:
(15) |
Moreover, by completing the square, step (11c) is equivalent to evaluating the proximal mapping associated with $g$ for an appropriate parameter:
(16) |
The dual variables are updated verbatim as in (11d) and (11e). We proceed to present Lemma 1, which states that, with appropriate initialization, the intermediate variable evolves on the manifold defined by the column space of the corresponding incidence matrix. Matrices without the hat denote the Kronecker product of the corresponding matrix with the identity matrix, as defined in (4).
Lemma 1. Recall the signed and unsigned incidence and Laplacian matrices in (4) and (5), respectively. For zero initialization of , we can decompose for all . Moreover, the update rules (14)-(16), (11d), and (11e) can be simplified as:
(17a)
(17b)
(17c)
(17d)
Proof: See Appendix A.
Once the primal update is obtained, updates (17c) and (17d) can be performed in parallel, and we have effectively transformed the 3-block ADMM in (11) into a 2-block ADMM, which converges under more general settings [20].
Remark 1: The reason for introducing the consensus variable in (2) is to decouple the local variables so that the Hessian of the Augmented Lagrangian has a block diagonal structure, cf. (13), which is instrumental for computing the quasi-Newton step without additional communication within the network once the gradient of the Augmented Lagrangian is obtained. Note that there is no need to invert any matrix or solve a linear system; the inverse in (17) is only notational. Instead, we directly approximate each block, corresponding to one agent, and form the overall approximation as explained in the following. Since the approximation is block diagonal, we approximate each block using (10) with the local iterate and gradient differences. Moreover, the remaining term in (13) is diagonal and constant (time invariant), which allows for an eigendecomposition with components depending only on local information. Specifically, using the Sherman-Morrison formula, we obtain an iterative formula for the inverse as follows:
(18) |
with constant vector specified as:
(19) |
where the indicator equals one if it is the designated agent who performs the proximal mapping associated with the nonsmooth regularizer, and is zero otherwise, and the constant vector has all entries equal except for the designated coordinate. If consensus were enforced directly, the resulting Hessian of the Augmented Lagrangian would involve the graph Laplacian matrix, as in [11], which couples each agent with its neighbors. In other words, computing quasi-Newton steps would require multiple inner communication rounds within the network at each iteration to approximate the descent direction to a desired accuracy.
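The Sherman-Morrison step underlying (18) can be illustrated in isolation: given $A^{-1}$, the inverse of a rank-one perturbation $A + uv^\top$ is available in closed form, with no linear system solved. A generic sketch (the paper applies this to the rank-one terms of the BFGS update; names are ours):

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Inverse of (A + u v^T), given A^{-1}, via the Sherman-Morrison
    formula; valid whenever 1 + v^T A^{-1} u != 0."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```

Applying this twice per iteration (once per rank-one term of (10)) keeps the per-agent cost quadratic, rather than cubic, in the local dimension.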
We now present a first order variant of BFGS-ADMM, which uses only first order information, so as to demonstrate the merits of the BFGS approximation both theoretically and experimentally. Note that the only part of (13) that depends on the iteration counter is the BFGS term, which approximates the Hessian of the smooth cost. A first order variant is therefore obtained by replacing this term with a constant diagonal matrix, in which case each agent selects a step size equal to:
(20) |
Therefore the first order variant updates the primal vector as:
(21) |
where the preconditioner is the diagonal matrix formed by the above step sizes, and the dual variables are updated in exactly the same way as in BFGS-ADMM using (17b)-(17c). Note that our first order variant encapsulates the existing first order method EXTRA [21] as a special case, obtained by setting the regularizer to zero. Indeed, in that case the proximal mapping and the associated dual variables in (17) are not needed. Taking the difference between two consecutive primal updates defined this way, and substituting the dual update using (17c), we obtain:
(22) |
With appropriate choices of the hyperparameters, the update defined in (22) is equivalent to EXTRA. In the next section, we compare BFGS-ADMM and its first order variant in terms of how well each approximates the exact ADMM.
We proceed to present the distributed implementation of BFGS-ADMM in Algorithm 1. Note that in the primal update (17a), the dual variables are invoked only through their aggregate. By defining this aggregate as a new variable and pre-multiplying both sides of (17c) accordingly, we obtain an equivalent algorithm that allows for efficient distributed implementation. Each agent holds the corresponding local primal-dual pair, while the designated agent additionally holds the regularizer variable. At each iteration, each agent begins by calculating the local gradient using values from its neighbors as in step 3, where the designated agent performs an additional update in step 5. Note that step 7 does not require solving a linear system, since the inverse approximation is computed directly using (18). Primal and dual updates are performed at each agent as in steps 8 and 9, respectively. The designated agent evaluates an additional proximal mapping in step 11 and performs a further dual update in step 12. Finally, each agent updates its curvature information using (10) and (18).
IV CONVERGENCE ANALYSIS
The linear convergence rate of ADMM is well established [15]. Since our goal here is to reduce the computational cost of ADMM, the best one can hope for is to reduce the gap between the approximation and the standard ADMM while maintaining the linear convergence rate. In Lemma 4, we demonstrate the advantage of using the BFGS approximation over first order methods by showing that the aforementioned gap decreases faster for BFGS-ADMM. We first state an additional assumption on the nonsmooth regularizer, along with the KKT conditions associated with (3), expressed in terms of the primal and dual optimal pairs. Note that one dual variable has been eliminated, as Lemma 1 shows it is a function of the other.
Assumption 2: The function $g$ is proper, closed, and convex. Equivalently, the following inequality holds for all points in its domain:
(23) |
where $\partial g$ denotes the subdifferential set. We proceed to state a non-restrictive assumption that can easily be enforced via regularization.
Assumption 3: There exists a positive constant such that the eigenvalues of the BFGS approximation are uniformly upper bounded, i.e.,
Remark 2: The upper bound described above can be achieved by a simple modification of the BFGS update formula. With a positive definite initialization, the trace of the approximation remains bounded under this modification, which yields the stated eigenvalue bound.
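One way to see why a trace-based cap is natural here (a sketch under Assumption 1, with $s_k$, $y_k$ the iterate and gradient differences and $m$, $M$ the lower and upper Hessian bounds): taking the trace of the standard BFGS update (10) gives

```latex
\operatorname{tr}(B_{k+1})
  = \operatorname{tr}(B_k)
  - \frac{\|B_k s_k\|^2}{s_k^\top B_k s_k}
  + \frac{\|y_k\|^2}{s_k^\top y_k}
  \;\le\; \operatorname{tr}(B_k) + \frac{M^2}{m},
```

since $\|y_k\| \le M \|s_k\|$ and $s_k^\top y_k \ge m \|s_k\|^2$. Thus each update can raise the trace by at most a constant, and a uniform eigenvalue bound as in Assumption 3 can be maintained by a mild regularization of the update.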
KKT conditions:
(KKTa)
(KKTb)
(KKTc)
(KKTd)
We proceed to state a technical lemma that describes an inclusion relationship between the subdifferential set of and the dual variable .
Recall that the first order variant of BFGS-ADMM updates the primal variable as in (21), while BFGS-ADMM updates it as in (17a); both update the dual variables as in (17b)-(17c). We proceed to present results which capture their similarities and differences.
Lemma 3. If Assumptions 1-2 hold, then the iterates generated by BFGS-ADMM and its first order variant both satisfy the following equation:
(26) |
where for BFGS-ADMM,
(27) |
for its first order variant,
(28) |
Proof: See Appendix C.
By comparing (27) and (28), we see that the difference between the BFGS and first order approximations lies in how the curvature is approximated at each step. Moreover, if the sub-optimization problem is solved exactly as in (11a), then by the optimality condition, equation (26) holds with the error term equal to zero.
Lemma 4. Consider the bound in (7) and the error term introduced in Lemma 3. If Assumption 3 holds, the following upper bounds can be established.
For BFGS-ADMM,
(29) |
For its first order variant as in (21),
(30) |
Proof: See Appendix D.
Remark 3: Lemma 4 is a standalone result whose proof does not require the convergence of the algorithm. In Theorem 1, we establish that BFGS-ADMM converges linearly, which implies that consecutive iterates approach each other, since both differences vanish as in (10). Consequently, the error bound for BFGS-ADMM decreases faster than that of its first order variant. The error term captures the approximation error relative to the exact solution of the primal problem (for which it is zero), whence Lemma 4 reveals that the proposed method tracks the exact ADMM more closely, which is the key attribute yielding its superior linear convergence rate over the first order variant.
Lemma 5. Consider the update in (17c) and (17d). With zero initialization, the stacked vector lies in the column space of . Moreover, there exists a unique optimal in the column space of and, denoting as the smallest positive eigenvalue of , the following inequality holds:
(31) |
Proof: See Appendix E.
We introduce the following scaling matrix defined in terms of the hyperparameters and the graph topology (as captured by ),
Theorem 1. Consider the update in (17). Define the stacked primal-dual iterate and the corresponding optimum. Denote the smallest positive eigenvalue of the relevant constraint matrix, along with the smallest and largest eigenvalues of the scaling matrix, accordingly. Then, for arbitrary constants in the admissible ranges, the iterates converge linearly as follows:
(32) |
where
(33) |
Proof: Since the aggregate cost is strongly convex with parameter $m$ and its gradient is Lipschitz continuous with parameter $M$,
(34) |
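A standard inequality combining $m$-strong convexity with $M$-Lipschitz continuity of the gradient, which typically underlies bounds of the form (34) (we state the textbook co-coercivity form): for all $x, y$,

```latex
\langle \nabla F(x) - \nabla F(y),\; x - y \rangle
  \;\ge\; \frac{mM}{m+M}\,\|x - y\|^2
  \;+\; \frac{1}{m+M}\,\|\nabla F(x) - \nabla F(y)\|^2 .
```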
Using Lemma 3, we rewrite the RHS (right hand side) of (34) as
(35) |
where we have used the fact that . From the dual update (17c) and (KKTc), it holds that
(36) |
Moreover, using (17d) and (KKTd), we obtain
(37) |
After substituting (36) and (37) into (35), we obtain
(38) |
Using Lemma 2 and Assumption 1, we can bound the corresponding inner product term; a similar bound holds for the other term. Therefore, we obtain
(39) |
Note that holds true for any . Multiplying both sides of (39) by and using the aforementioned identity for the inner product terms, and considering the concatenation we obtain:
(40) |
where . To establish linear convergence as in (32), it suffices to show that for some , the following holds: . Moreover, for any , the inequality holds: . Therefore, it is sufficient to show
(41) |
We proceed to establish this bound. From Lemma 3, it holds:
(42) |
Note that holds true for any . After applying this inequality three times with arbitrary constant for (42), we obtain
(43) |
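The elementary inequality applied above is the following consequence of Young's inequality: for any vectors $a, b$ and any $\theta > 0$,

```latex
\|a + b\|^2 \;\le\; (1 + \theta)\,\|a\|^2 + \left(1 + \tfrac{1}{\theta}\right)\|b\|^2 ,
```

obtained from $2\langle a, b\rangle \le \theta \|a\|^2 + \theta^{-1}\|b\|^2$.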
Consider the error term in Lemma 4 and the resulting bound. Since the cost is strongly convex with parameter $m$ and its gradient is Lipschitz continuous with parameter $M$, the following inequality holds:
(44) |
By setting the constant appropriately, we obtain the stated bound. From Lemma 5, we have
(45) |
Denoting and combining (43)-(45), we have:
(46) |
Furthermore, from (17d) and (KKTd), it holds that . Using the same technique as in deriving (43), we obtain for arbitrary :
(47) |
Using (46) and (47), it suffices to show that the following inequality holds true for some ,
which is satisfied if is chosen as in (33).
V EXPERIMENTS
In this section, we present numerical experiments comparing the proposed method to the distributed first order methods P2D2 [22] and PG-EXTRA [7] on the following distributed logistic regression problem:
where the training samples and their binary labels are split across the agents, with an equal number of data points held at each agent. We consider two real datasets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/): the skin_nonskin dataset and the ijcnn1 dataset. We take 5,000 data points from each dataset. A random binomial graph is drawn (each edge is included independently with probability 0.2 in all cases). The mixing matrix for P2D2 is generated using the Metropolis rule, while the mixing matrix of PG-EXTRA is generated using the Laplacian-based constant edge weight matrix. In all cases, the $\ell_1$ regularization weight is kept fixed. We plot the average relative cost error for different network sizes. As can be seen in Figures 1 and 2, BFGS-ADMM demonstrates a significant speed-up on both datasets compared to the other methods. Note that our first order variant also shows advantages over the other first order methods, owing to the fact that its step size selection also accounts for the number of neighbors of each agent, as in (20).
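The per-agent pieces of the objective above can be sketched as follows (a minimal Python/NumPy version; the function names, the averaging over the local sample count, and the label convention $b_j \in \{-1, +1\}$ are our assumptions):

```python
import numpy as np

def local_logistic_loss(x, A, b):
    """Average logistic loss over one agent's samples:
    (1/m_i) * sum_j log(1 + exp(-b_j * a_j^T x)),
    with rows of A the samples a_j and b the +/-1 labels."""
    margins = b * (A @ x)
    return np.mean(np.log1p(np.exp(-margins)))

def local_logistic_grad(x, A, b):
    """Gradient of local_logistic_loss, used in each agent's primal step."""
    margins = b * (A @ x)
    coeff = -b / (1.0 + np.exp(margins))
    return A.T @ coeff / A.shape[0]
```

The nonsmooth $\ell_1$ term is handled separately through its proximal mapping, as in (17), rather than being differentiated.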
References
- [1] Y. Li, D. Stipanović, P. Voulgaris, and Z. Gu, “Decentralized model predictive control of urban drainage systems,” WSEAS Transactions on Systems and Control, vol. 14, pp. 247–256, 2019.
- [2] T. Huang, N. M. Freris, P. R. Kumar, and L. Xie, “A synchrophasor data-driven method for forced oscillation localization under resonance conditions,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 3927–3939, 2020.
- [3] R. Bekkerman, M. Bilenko, and J. Langford, “Scaling up machine learning: Parallel and distributed approaches,” in Proceedings of the 17th ACM SIGKDD International Conference Tutorials, 2011.
- [4] N. M. Freris, H. Kowshik, and P. R. Kumar, “Fundamentals of large sensor networks: Connectivity, capacity, clocks, and computation,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1828–1846, 2010.
- [5] K. Kim and P. R. Kumar, “Cyber–physical systems: A perspective at the centennial,” Proceedings of the IEEE, vol. 100, pp. 1287–1308, 2012.
- [6] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [7] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
- [8] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, p. 127–239, 2014.
- [9] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, “DLM: Decentralized linearized alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 63, no. 15, pp. 4051–4064, 2015.
- [10] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
- [11] Y. Li, N. M. Freris, P. Voulgaris, and D. Stipanović, “D-SOP: Distributed second order proximal method for convex composite optimization,” in Proceedings of the 2020 American Control Conference, 2020, pp. 2844–2849.
- [12] J. D. Lee, Y. Sun, and M. A. Saunders, “Proximal Newton-type methods for minimizing composite functions,” SIAM Journal on Optimization, vol. 24, no. 3, pp. 1420–1443, 2014.
- [13] J. E. Dennis, Jr. and J. J. Moré, “Quasi-Newton methods, motivation and theory,” SIAM Review, vol. 19, no. 1, pp. 46–89, 1977.
- [14] P. Latafat, N. M. Freris, and P. Patrinos, “A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 4050–4065, 2019.
- [15] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
- [16] W. C. Davidon, “Variable metric method for minimization,” SIAM Journal on Optimization, vol. 1, no. 1, pp. 1–17, 1991.
- [17] D. Goldfarb, “A family of variable-metric methods derived by variational means,” Mathematics of Computation, pp. 23–26, 1970.
- [18] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. Springer, 2006.
- [19] C. Chen, B. He, Y. Ye, and X. Yuan, “The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent,” Mathematical Programming, vol. 155, no. 1–2, 2016.
- [20] T. Lin, S. Ma, and S. Zhang, “Global convergence of unmodified 3-block ADMM for a class of convex minimization problems,” Journal of Scientific Computing, vol. 76, no. 1, pp. 69–88, 2018.
- [21] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [22] S. Alghunaim, K. Yuan, and A. H. Sayed, “A linearly convergent proximal gradient algorithm for decentralized optimization,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
Proof of Lemma 1
From the construction of the selection matrix, which selects a single coordinate, we stress that the update (11e) only involves the corresponding subvector, as can be seen by explicitly writing (11e) as:
We decompose the dual variable as , where . Recalling and pre-multiplying the dual update (11d) with , we obtain
Using (15), we conclude that for , i.e., . With zero initialization, we can therefore express as , for all . This shows that it is redundant to maintain both and since they sum to for all . Therefore, we can rewrite (11d) as:
(48)
(49)
After taking the difference of the above two equations and dividing by 2, we obtain:
(50) |
where we have used the fact that . Similarly, by summing (48) and (49), we obtain . With zero initialization for , it is redundant to keep since we can compute it as:
(51) |
Recalling the approximated Hessian in (13), we express the primal update (14) as
(52) |
Since , , and as in (4) and (5), along with the decomposition of and the equality in (51), we have
(53)
(54)
Using (6), we can rewrite
(55) |
Substituting (53)-(55) into (52), we obtain the primal updates as in (17a).
Proof of Lemma 2
The update (17b) is equivalent to
From the optimality condition, it holds that
(56) |
Substituting the dual update into (56), we obtain
After re-arranging, we obtain the desired.
Proof of Lemma 3
Note that the difference between BFGS-ADMM and its first order variant lies in (13). For the first order variant,
i.e., is identically zero. By rearranging (17a), we obtain the following equation:
(57) |
Using the dual update (17c), we have
(58) |
Since , it holds that
(59) |
Substituting (58) and (59) into (57), we obtain
(60) |
Using the dual update (17d), we have
(61) |
Substituting (61) into (60) and using the definition of , we obtain
(62) |
After subtracting (KKTa) from (62), we obtain the desired result, with the error term as given for the first order variant.
Proof of Lemma 4
For BFGS-ADMM, since the approximation satisfies the secant equation, we have
and (29) follows by applying the Cauchy–Schwarz inequality. For the first order variant, the Lipschitz continuity of the gradient implies the corresponding bound, and the claim in (30) follows.
Proof of Lemma 5
Consider the updates (17c) and (17d). They can be written compactly as:
where 0 is a zero vector of the corresponding dimension. Therefore, it suffices to prove that the vector lies in the column space of the constraint matrix. Note that, since the graph is connected, the product vanishes precisely when the argument is in consensus. By letting all components be equal, we have:
,
which shows that the vector lies in the column space of the constraint matrix. To simplify the notation, we denote the relevant product accordingly. To prove the existence of an optimal dual vector in the column space, consider (KKTa) satisfied by any optimal dual vector,
The projection of this dual vector onto the column space also satisfies (KKTa), since their difference lies in the kernel of the constraint matrix. Uniqueness can be established by contradiction. Let two optimal stacked dual vectors lie in the column space. Since the cost is strongly convex, (KKTa) holds for both; after taking the difference, the difference of the two dual vectors lies in the kernel of the constraint matrix. But it also lies in the column space, and is therefore zero, a contradiction. Inequality (31) follows from the fact that the vector is orthogonal to the kernel of the constraint matrix.