
Distributed Optimization Method Based On Optimal Control

Ziyuan Guo, Yue Sun, Yeming Xu, Liping Zhang, and Huanshui Zhang, Senior Member, IEEE. This work was supported by the Original Exploratory Program Project of National Natural Science Foundation of China (62450004), the Joint Funds of the National Natural Science Foundation of China (U23A20325), and the Major Basic Research of Natural Science Foundation of Shandong Province (ZR2021ZD14). (Corresponding author: Huanshui Zhang.) Ziyuan Guo, Yeming Xu and Liping Zhang are with the College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China (e-mail: [email protected]; [email protected]; [email protected]). Yue Sun is with the School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China (e-mail: [email protected]). Huanshui Zhang is with the College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China, and also with the School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China (e-mail: [email protected]).
Abstract

In this paper, a novel distributed optimization framework is proposed. The key idea is to convert optimization problems into optimal control problems in which the objective of each agent is to design the current control input that minimizes its own original objective function together with the update size at future time instants. In contrast to existing distributed optimization methods for minimizing a sum of convex objective functions over multiple agents, we present a distributed optimization algorithm for multi-agent systems based on results from the maximum principle. Moreover, the convergence and the superlinear convergence rate are rigorously analyzed.

I INTRODUCTION

Optimization problems, which involve maximizing or minimizing an objective function, are ubiquitous in science and engineering [1]. Traditional centralized optimization requires a central node to collect and process the information received from all other agents. With the increasing penetration of networked control systems, the division of labor and cooperation within multi-agent systems play an increasingly important role in daily life. Distributed optimization problems arise accordingly, in which several autonomous agents collectively try to achieve a global objective. Compared with centralized optimization, distributed algorithms also have the potential to respect the privacy of data, measurements, cost functions, and constraints, which is increasingly important in practical applications such as intelligent traffic systems [2], electric power systems [3], formation control [4], resource allocation [5], and so on.

Distributed optimization has gained growing renewed interest, and various distributed algorithms have been proposed in the literature. Dual decomposition, an early class of distributed optimization algorithms presented in [6], typically relies on a carefully chosen step size to ensure convergence. Due to the inexpensive computation of (sub)gradients, gradient descent algorithms are also a popular tool for distributed optimization. [7] considered an unconstrained distributed computation model for minimizing a sum of convex objective functions over multiple agents, where each agent performs a consensus step and then a descent step along the local (sub)gradient direction of its own convex objective function. For constrained distributed optimization, [8] focused on cooperative control problems where the values of the agents are constrained to lie in closed convex sets and developed corresponding distributed algorithms. To accelerate the convergence of (sub)gradient methods, the Newton method, which enjoys a superlinear convergence rate, has also been considered [9, 10, 11]. In particular, [12] proposed a distributed adaptive Newton algorithm with a global quadratic convergence rate, in which each node of a peer-to-peer network minimizes a finite sum of objective functions by communicating with its neighboring nodes. The procedure proposed in [13] lets agents distributedly compute and sequentially update an approximate Newton-Raphson direction by means of suitable average consensus ratios. However, as pointed out in [14], although subgradient and Newton methods are powerful and widely used in distributed optimization, the choice of step size and the singularity of the Hessian matrix are crucial to these algorithms.

Motivated by [14], a novel distributed optimization algorithm is proposed from the viewpoint of optimal control. To be specific, the control input of each agent is the update size of each iteration. The objective of each agent is to design the current control input that minimizes its own objective function together with the update size at future time instants. Through a simple transformation, the sum of the original objective functions of the agents is recovered. Based on the maximum principle, a distributed optimization method is then derived for minimizing a sum of convex objective functions. To show the superiority of the algorithm, its superlinear convergence rate is proved.

The outline of this paper is given below. The distributed optimization problem is reformulated as an optimal control problem in Section II. The distributed optimization algorithm based on the maximum principle is proposed in Section III. The rate of convergence is analysed in Section IV. The conclusion is given in Section V.

Notation: Throughout the paper, $A^{T}$ stands for the transpose of matrix $A$. $\mathbb{R}^{n}$ denotes the $n$-dimensional real vector space. $I$ is the identity matrix with appropriate dimensions. For a symmetric matrix $M$, $M>0$ ($M\geq 0$) means that $M$ is a positive definite (positive semi-definite) matrix. $\rho(M)$ represents the spectral radius of $M$. $\|\cdot\|$ denotes the 2-norm of vectors and $\|\cdot\|_{m}$ is the induced norm of matrices. $[a,b]$ represents all integers from integer $a$ to integer $b$.

II PROBLEM FORMULATION

II-A Graph theory

The communication graph is denoted by a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, where $\mathcal{V}=\{1,2,\ldots,n\}$ is the set of vertices (nodes), $n$ is the number of agents satisfying $n\geq 2$, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of edges, and the weighted matrix $\mathcal{A}=(a_{ij})_{n\times n}$ is a non-negative matrix of adjacency weights, with $a_{ij}\neq 0$ if and only if $(i,j)\in\mathcal{E}$. The graph $\mathcal{G}$ is said to be balanced if, for every agent, the incoming and outgoing interaction weights are equal, i.e., $\sum_{j=1}^{n}a_{ij}=\sum_{j=1}^{n}a_{ji}$, $\forall i\in\mathcal{V}$. Moreover, $\mathcal{N}_{i}=\{j\in\mathcal{V}\,|\,(i,j)\in\mathcal{E}\}$ is used to represent the neighbor set of agent $i$. For the topology $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, a path of length $r$ from node $i_{1}$ to node $i_{r+1}$ is a sequence of $r+1$ distinct nodes $\{i_{1},\ldots,i_{r+1}\}$ such that $(i_{q},i_{q+1})\in\mathcal{E}$ for $q=1,\ldots,r$. If there exists a path between any two nodes, then $\mathcal{G}$ is said to be strongly connected. Here we make the following assumption for the communication graph.

Assumption 1

The directed graph $\mathcal{G}$ is balanced and strongly connected.

Note that in distributed coordination control, the conditions in Assumption 1 play an important role in ensuring that multi-agent systems (MAS) reach consensus [7], [15]. A simple numerical check of these conditions is sketched below.
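The following is a minimal sketch (not from the paper; the adjacency matrix and helper names are illustrative assumptions) of how the balance and strong-connectivity conditions of Assumption 1 can be checked numerically for a weighted digraph:

```python
import numpy as np

def is_balanced(A, tol=1e-12):
    # Balanced: in-weights equal out-weights at every node,
    # i.e., sum_j a_ij == sum_j a_ji for all i.
    return np.allclose(A.sum(axis=1), A.sum(axis=0), atol=tol)

def is_strongly_connected(A):
    # Strongly connected: some path of length < n joins every ordered
    # pair of nodes, i.e., sum_{r=0}^{n-1} (A > 0)^r has no zero entry.
    n = A.shape[0]
    M = (A > 0).astype(int)
    reach = np.eye(n, dtype=int)
    power = np.eye(n, dtype=int)
    for _ in range(n - 1):
        power = (power @ M > 0).astype(int)
        reach += power
    return bool((reach > 0).all())

# Directed 4-cycle with unit weights: balanced and strongly connected.
A = np.roll(np.eye(4), 1, axis=1)
print(is_balanced(A), is_strongly_connected(A))  # True True
```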

II-B Distributed optimization

Consider an MAS consisting of $n$ agents, labeled by the set $\mathcal{V}=\{1,\ldots,n\}$, where the agents communicate their local state information with their neighbors via the directed graph $\mathcal{G}$. Only agent $i\in\mathcal{V}$ knows the function $f_{i}:\mathbb{R}^{n}\to\mathbb{R}$, which is twice continuously differentiable and possibly non-convex. The traditional unconstrained optimization problem is given by

$\mathop{\rm minimize}\limits_{x\in\mathbb{R}^{n}}\Big\{f(x)\overset{\text{def}}{=}\frac{1}{n}\sum\limits_{i=1}^{n}f_{i}(x)\Big\}.$ (1)

Different from traditional distributed optimization methods, we transform the task of finding solutions to problem (1) into the updating of the state sequence within an optimal control problem. To be specific, consider the following discrete-time linear time-invariant system for each agent

$x_{i}(k+1)=x_{i}(k)+u_{i}(k),$ (2)

where $x_{i}(k)\in\mathbb{R}^{n}$ is the state with initial value $x_{i}(0)$, and $u_{i}(k)\in\mathbb{R}^{n}$ is the control input, which will be further specified later. The subscript $i$ indexes the $i$th agent, $i\in\{1,2,\ldots,n\}$.

Correspondingly, the cost function of each agent satisfies

$J_{i}=\sum\limits_{k=0}^{N}\Big(\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(k)Q_{i}e_{ij}(k)+u^{T}_{i}(k)R_{i}u_{i}(k)+f_{i}(x_{i}(k))\Big)+\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(N+1)H_{i}e_{ij}(N+1)+f_{i}(x_{i}(N+1)),$ (3)

with $e_{ij}(k)=x_{i}(k)-x_{j}(k)$, where $Q_{i}\geq 0$, $H_{i}\geq 0$ and $R_{i}>0$.

Remark 1

Compared with the traditional method, we convert the optimization problem into an optimal control problem, where the objective of each agent is to design the current control input minimizing the sum of its original objective function and the update size at future time instants. On the one hand, the problem studied in this paper includes the traditional multi-agent results: if we let $f_{i}(x_{i}(k))=f_{i}(x_{i}(N+1))=0$, the system (2) can be modified as $x_{i}(k+1)=Ax_{i}(k)+B_{i}u_{i}(k)$. On the other hand, among the existing optimization algorithms for problem (1), a typical distributed first-order gradient descent algorithm is [16]:

$x_{i}(k+1)=\sum\limits_{j=1}^{n}a_{ij}x_{j}(k)-\eta(k)\nabla f_{i}(x_{i}(k)),$ (4)

where $\eta(k)$ is the step size. Note that, in convex optimization, it has been proven that the distributed algorithm (4) attains the global optimal solution of problem (1). However, algorithm (4) has limitations in terms of selecting the step size: $\eta(k)$ should theoretically satisfy the following two conditions:

$\lim\limits_{k\to\infty}\eta(k)=0,$ (5)
$\sum\limits_{k=1}^{\infty}\eta(k)=\infty.$ (6)

In practical applications, the common practice is to select $\eta$ as a small constant. Alternatively, even if a time-varying step size $\eta(k)$ is chosen to satisfy conditions (5)-(6), the results obtained from the distributed algorithm (4) are evidently influenced by the choice of $\eta(k)$.
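As an illustration, the following is a minimal sketch (not from the paper; the quadratic local costs, consensus weights, and horizon are illustrative assumptions) of the iteration (4) with a diminishing step size satisfying (5)-(6):

```python
import numpy as np

n = 4
W = np.full((n, n), 1.0 / n)               # doubly stochastic consensus weights
targets = np.array([1.0, 2.0, 3.0, 4.0])   # local costs f_i(x) = (x - targets[i])^2 / 2

def local_grad(i, x):
    return x - targets[i]                  # gradient of f_i at the scalar x

x = np.zeros(n)                            # one scalar state per agent
for k in range(5000):
    eta = 1.0 / (k + 1)                    # diminishing step size: (5) and (6) hold
    x = W @ x - eta * np.array([local_grad(i, x[i]) for i in range(n)])

print(x)  # every agent approaches the global minimizer 2.5
```

Even in this simple example, the speed of the final approach is visibly governed by the decay of $\eta(k)$, which is exactly the limitation discussed above.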

Based on the discussion above, we are in a position to derive the distributed optimization algorithm from the viewpoint of optimal control. For convenience of the subsequent discussion, we introduce the following notation:

$x(k)=[x^{T}_{1}(k),\ldots,x^{T}_{n}(k)]^{T},\quad u(k)=[u^{T}_{1}(k),\ldots,u^{T}_{n}(k)]^{T},$
$Q={\rm diag}\{Q_{1},\ldots,Q_{n}\},\quad R={\rm diag}\{R_{1},\ldots,R_{n}\},$
$H={\rm diag}\{H_{1},\ldots,H_{n}\},\quad F(x(k))=\sum\limits_{i=1}^{n}f_{i}(x_{i}(k)).$

Let $\delta_{i}(k)$ be the stacked errors of the $i$th agent, so that $e(k)=[\delta^{T}_{1}(k),\ldots,\delta^{T}_{n}(k)]^{T}$. In this way, we have

$e(k+1)=e(k)+\sum\limits_{i=1}^{n}B_{i}u_{i}(k)=e(k)+Bu(k),$

where $B_{i}$ is the column-block matrix composed of $0$, $1$, $-1$, satisfying $B=[B_{1},\ldots,B_{n}]$, and $B^{T}B=L$ is a Laplacian matrix. To this end, (2) and (3) can be written in stacked form as

$x(k+1)=x(k)+u(k),$ (7)

and

$J_{N}=\sum\limits_{k=0}^{N}\Big(\sum\limits_{i=1}^{n}\Big(\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(k)Q_{i}e_{ij}(k)+u^{T}_{i}(k)R_{i}u_{i}(k)+f_{i}(x_{i}(k))\Big)\Big)+\sum\limits_{i=1}^{n}\Big(\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(N+1)H_{i}e_{ij}(N+1)+f_{i}(x_{i}(N+1))\Big)$
$\phantom{J_{N}}=\sum\limits_{k=0}^{N}\big(e^{T}(k)Qe(k)+u^{T}(k)Ru(k)+F(x(k))\big)+e^{T}(N+1)He(N+1)+F(x(N+1)),$ (8)

respectively. The optimal control problem to be solved in this section is stated as follows.

Problem: Find the optimal control $u(k)$ minimizing the long-term cost $J_{N}$ subject to (7).

Remark 2

In our recent work [14], we implemented a centralized optimization algorithm, which motivates the design of the system and cost function (7)-(8). Naturally, we considered the possibility of designing a distributed algorithm based on the centralized one via the maximum principle. It should be noted that if $F(x(k))=\sum_{i=1}^{n}f_{i}(x_{i}(k))=0$, then (8) reduces to the cost function of the consensus problem based on optimal control in our recent work [17].

III Distributed Optimization Method Using Optimal Control

In this section, we use optimal control theory to solve the Problem. This inspires us to develop an optimization algorithm which can be implemented in a distributed manner. Additionally, a variant of the algorithm is proposed to balance the numbers of iterations and communications.

III-A Algorithm Development

Following [18] and applying Pontryagin's maximum principle to the system (7) with the cost function (8), the following equilibrium condition and costate equation are obtained:

$0=Ru(k)+\lambda(k),$ (9)
$\lambda(k-1)=B^{T}Qe(k)+\nabla f(k)+\lambda(k),$ (10)

with the terminal value $\lambda(N)=B^{T}He(N+1)+\nabla f(N+1)$, where $\nabla f(k)=[\nabla f_{1}^{T}(x_{1}(k)),\ldots,\nabla f_{n}^{T}(x_{n}(k))]^{T}$.

To derive the analytical solution to the FBDEs (7), (9)-(10), define the Riccati equation

$P(k)=Q+P(k+1)-P(k+1)B\Gamma^{-1}(k)B^{T}P(k+1),$ (11)

with the terminal value $P(N+1)=H$, where $\Gamma(k)=R+B^{T}P(k+1)B$.
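A minimal sketch (not from the paper; the function name and convergence test are assumptions) of iterating the backward recursion (11) from the terminal value until it settles at a stable solution $P$, which is used later in Lemma 2, might look as follows:

```python
import numpy as np

def stable_riccati(Q, R, H, B, tol=1e-10, max_iter=10_000):
    # Iterate (11) backward from P(N+1) = H until the iterates stop
    # changing; Gamma(k) = R + B^T P(k+1) B as in the text.
    P = H.copy()
    for _ in range(max_iter):
        Gamma = R + B.T @ P @ B
        P_next = Q + P - P @ B @ np.linalg.solve(Gamma, B.T @ P)
        if np.linalg.norm(P_next - P) < tol:
            return P_next
        P = P_next
    raise RuntimeError("Riccati recursion did not settle within max_iter")
```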

The analytical expression for the controller is given below.

Lemma 1

If the Riccati equation (11) admits a solution such that $\Gamma(k)$ is invertible, then the optimal control satisfies

$u(k)=-\Gamma^{-1}(k)\Big[B^{T}P(k+1)e(k)+\sum\limits_{l=k+1}^{N+1}M(l)\nabla f(l)\Big],$ (12)

where $M(l)=M(l-1)R(R+B^{T}P(l)B)^{-1}$ for $l>k+1$ and $M(l)=I$ for $l=k+1$. Moreover, the costate equation is derived as

$\lambda(k-1)=B^{T}P(k)e(k)+\sum\limits_{l=k}^{N+1}M(l)\nabla f(l).$ (13)
Proof:

For $k=N$, substituting the terminal value $\lambda(N)$ into (9) yields

$0=Ru(N)+B^{T}H[e(N)+Bu(N)]+\nabla f(N+1)=[R+B^{T}HB]u(N)+B^{T}He(N)+\nabla f(N+1),$ (14)

i.e., $u(N)$ satisfies (12) for $k=N$. Accordingly, substituting (14) into (10), the costate $\lambda(N-1)$ is calculated as

$\lambda(N-1)=B^{T}Qe(N)+B^{T}H[e(N)+Bu(N)]+\nabla f(N+1)+\nabla f(N)$
$=B^{T}[Q+H]e(N)-B^{T}HB\Gamma^{-1}(N)[B^{T}P(N+1)e(N)+\nabla f(N+1)]+\nabla f(N+1)+\nabla f(N)$
$=B^{T}P(N)e(N)+M(N)\nabla f(N+1)+\nabla f(N),$

which is exactly (13) for $k=N$, where $P(N)$ satisfying (11) has been used in the last equality.

Assume that (13) holds for $k\geq s+1$; we show that it also holds for $k=s$. For $k=s$, substituting $\lambda(s)$ into (9), one has

$0=Ru(s)+B^{T}P(s+1)[e(s)+Bu(s)]+\sum\limits_{l=s+1}^{N+1}M(l)\nabla f(l),$ (15)

i.e., $u(s)$ is such that

$u(s)=-\Gamma^{-1}(s)\Big[B^{T}P(s+1)e(s)+\sum\limits_{l=s+1}^{N+1}M(l)\nabla f(l)\Big].$ (16)

Substituting it into (10) for $k=s$ yields

$\lambda(s-1)=B^{T}Qe(s)+B^{T}P(s+1)[e(s)+Bu(s)]+\sum\limits_{l=s+1}^{N+1}M(l)\nabla f(l)+\nabla f(s)=B^{T}P(s)e(s)+\sum\limits_{l=s}^{N+1}M(l)\nabla f(l),$

which is exactly (13). This completes the proof. ∎

Remark 3

It should be noted that the analytical solution $u(k)$ obtained in Lemma 1 contains future information, i.e., it depends on $\{\nabla f(l),\,l\in[k+1,N+1]\}$, which is difficult to realize in practice; thus a modified algorithm is presented.

Before proposing an algorithm based on the optimal control obtained in Lemma 1, we first show an interesting phenomenon: the controller (12) derived from optimal control theory utilizes the average value of the derivatives of the agents' local functions, referred to as the "average gradient" hereafter. That is to say, the matrix $M(l)$ has the property of taking the average value, which is proved below. It should be noted that distributed optimization algorithms that employ the average gradient have been explored in [19]. Our derived results theoretically demonstrate the rationality and correctness of adopting the average gradient. The property of $M(l)$ defined in Lemma 1 is given below.

Lemma 2

$M(l)$ defined in Lemma 1 satisfies $\lim\limits_{l\to\infty}M(l)=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$.

Proof:

If the Riccati equation (11) admits a stable solution $P$ as $k\rightarrow\infty$, then we adopt this stable solution in $M(l)$. In order to prove the convergence of $M(l)$, the following system is first constructed:

$w(k+1)=R(R+B^{T}PB)^{-1}w(k)=(R(R+B^{T}PB)^{-1})^{k+1}w(0),$ (17)

where $w(k)\in\mathbb{R}^{n}$. In this case, the key point is to prove $\lim\limits_{k\to\infty}(R(R+B^{T}PB)^{-1})^{k}=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$. Utilizing the properties of the Riccati equation's solution and $B^{T}B=L$, (17) can be rewritten as

$w(k+1)=(I+\bar{L})^{-1}w(k)=w(k)-((I+\bar{L})^{-1}\bar{L})w(k),$ (18)

where $\bar{L}=B^{T}PBR^{-1}$ is a Laplacian matrix.

According to [20], since the eigenvalues of the matrix $\bar{L}$ are greater than or equal to zero, the eigenvalues of $(I+\bar{L})$ are greater than or equal to $1$. Therefore, the eigenvalues of the inverse of $(I+\bar{L})$ lie between $0$ and $1$, indicating that the system (17) is stable.

When $k\rightarrow\infty$, the steady state $w$ of the system satisfies $-((I+\bar{L})^{-1}\bar{L})w=0$, which simplifies to $\bar{L}w=0$. Referring to the conclusion in [20], the solution of $\bar{L}w=0$ is $w^{*}=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}w(0)$, so we have $\lim\limits_{k\to\infty}(R(R+B^{T}PB)^{-1})^{k}=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$, i.e., $\lim\limits_{l\to\infty}M(l)=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$. This completes the proof. ∎

Lemma 2 indicates that (12) utilizes the average gradient. For the sake of simplifying notation, we will denote the average gradient as $g(l)$, i.e., $g(l)=M(l)\nabla f(l)$.
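The averaging property of Lemma 2 is easy to check numerically. The following minimal sketch (not from the paper; the 4-node cycle and the stand-in $P=I$ are illustrative assumptions) builds an incidence-type matrix $B$ with $B^{T}B=L$ and verifies that the powers of $R(R+B^{T}PB)^{-1}$ approach $\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$:

```python
import numpy as np

n = 4
# Oriented incidence-type matrix of the cycle 1-2-3-4-1, so that B^T B = L.
B = np.zeros((n, n))
for row, (i, j) in enumerate([(0, 1), (1, 2), (2, 3), (3, 0)]):
    B[row, i], B[row, j] = 1.0, -1.0
L = B.T @ B                                # graph Laplacian

R = np.eye(n)
P = np.eye(n)                              # stand-in for the stable solution of (11)
T = R @ np.linalg.inv(R + B.T @ P @ B)     # the factor appearing in M(l)

avg = np.ones((n, n)) / n                  # (1/n) * 1 1^T
print(np.linalg.norm(np.linalg.matrix_power(T, 200) - avg))  # effectively 0
```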

III-B Distributed Optimization Algorithm

In this subsection, we discuss the distributed optimization algorithms derived from Lemma 1, which include an algorithm with a central computing node and a decentralized algorithm.

Lemma 3

Under the condition of Lemma 1 and based on the optimal control obtained from it, there exists the iteration algorithm

$x(k+1)=x(k)+d_{k}(x(k)),$
$d_{k}(x(k))=-(\Gamma_{P}+h(k))^{-1}[g(k)-\Gamma_{P}d_{k-1}(x(k))],$
$d_{0}(x(k))=-(\Gamma_{P}+h(k))^{-1}(g(k)+B^{T}Pe(k)),$ (19)

where $g(k)=(\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T})\nabla f(k)$ is the average gradient, $h(k)={\rm diag}\{h_{1}(k),\ldots,h_{n}(k)\}$ with $h_{i}(k)=\frac{1}{n}\sum_{j=1}^{n}\nabla^{2}f_{j}(x_{j}(k))$, and $\Gamma_{P}=R+B^{T}PB$ denotes $\Gamma(k)$ evaluated at the stable solution $P$ of (11).

Proof:

The controller (12) exhibits non-causality because future average gradients are used. Fortunately, the problem of utilizing future information has been addressed in our previous research; see [14] for details. In accordance with the methodology presented in [14], controller (12) transforms into $d_{k}$ as depicted in (19). ∎

Remark 4

The connection between algorithm (19) and our previous work [14] is now discussed. On the one hand, if the agents have no local functions to optimize, i.e., $f_{i}(x_{i}(k))=f_{i}(x_{i}(N+1))=0$, then the terms $g(k)$ and $h(k)$ in (19) vanish. That is to say, (19) transforms into the following form:

$x(k+1)=x(k)+d_{k},$
$d_{k}=-\Gamma_{P}^{-1}B^{T}Pe(k).$ (20)

In this case, $d_{k}$ corresponds to the form of the controller $u_{k}$ from Lemma 1 in [17] when the matrix $A$ is the identity matrix. On the other hand, if we only consider a centralized optimization problem, or if the states of the agents coincide during the iteration process, then $e(k)$ and the related $P$ vanish. Moreover, (19) degenerates into the following form:

$x(k+1)=x(k)+d_{k}(x(k)),$
$d_{k}(x(k))=-(R+h(k))^{-1}[g(k)-Rd_{k-1}(x(k))],$
$d_{0}(x(k))=-(R+h(k))^{-1}g(k).$ (21)

In this case, (21) corresponds to Algorithm I in the previous work [14], where $g(k)$ and $h(k)$ correspond to the (average) gradient and Hessian matrix, respectively.

Based on optimal control theory, we have derived a distributed optimization algorithm (19) with a central computing node (in computer science, this node is called a parameter server [21], and we use this expression in Algorithm 1). The algorithm presented in Algorithm 1 is called the Distributed Optimal Control Method with a Central Computing Node (DOCMC). We now summarize the DOCMC algorithm from the perspective of each agent $i$.

Algorithm 1 DOCMC
1:  Initialization: Each agent checks whether it can communicate with the server, and sets $x_{i}(0)\in\mathbb{R}^{p}$.
2:  for $k=0,1,\ldots$ do
3:     for each agent $i=1,\ldots,n$ do
4:        Compute the local gradient $\nabla f_{i}(x_{i}(k))$ and local Hessian $\nabla^{2}f_{i}(x_{i}(k))$. Calculate the state error $e_{ij}(k)$ between itself and its neighbors.
5:        Send $\nabla f_{i}(x_{i}(k))$, $\nabla^{2}f_{i}(x_{i}(k))$ and $e_{ij}(k)$ to the server.
6:     end for
7:     The server computes $g(k)$, $h(k)$ and $d_{0}(x(k))=-(\Gamma_{P}+h(k))^{-1}(g(k)+B^{T}Pe(k))$.
8:     for $t=0,1,\ldots,k-1$ do
9:        $d_{t+1}(x(k))=-(\Gamma_{P}+h(k))^{-1}[g(k)-\Gamma_{P}d_{t}(x(k))]$
10:     end for
11:     Broadcast $d_{k}(x_{i}(k))$ to the $i$th agent, where $d_{k}(x_{i}(k))$ is the component of $d_{k}(x(k))$ corresponding to the $i$th agent.
12:     Each agent updates $x_{i}(k+1)=x_{i}(k)+d_{k}(x_{i}(k))$.
13:  end for
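For concreteness, a minimal sketch of the server-side computation in Algorithm 1 might look as follows (the function name, argument shapes, and the use of dense linear solves are assumptions for illustration):

```python
import numpy as np

def docmc_direction(k, grads, hessians, e, B, P, Gamma_P):
    # grads/hessians: lists of the n local gradients and Hessians at x(k);
    # e: stacked state errors; Gamma_P = R + B^T P B at the stable Riccati solution P.
    n = len(grads)
    g = np.tile(sum(grads) / n, n)                 # stacked average gradient g(k)
    h_blk = np.kron(np.eye(n), sum(hessians) / n)  # h(k) = diag{h_1(k),...,h_n(k)}
    A = Gamma_P + h_blk
    d = -np.linalg.solve(A, g + B.T @ P @ e)       # Step 7: d_0(x(k))
    for _ in range(k):                             # Steps 8-10: inner recursion (19)
        d = -np.linalg.solve(A, g - Gamma_P @ d)
    return d                                       # d_k(x(k)), broadcast in Step 11
```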

Utilizing the state errors $e_{ij}$ with neighbors can significantly enhance the computational stability of the server. In fact, if all agents share the same state, then (21) is utilized in DOCMC. However, when the states of the agents cannot be guaranteed to be consistent, due to communication processes or the influence of individual agents, relying solely on (21) may not be sufficient to achieve the optimization goals. The presence of $e_{ij}$ ensures that each agent can achieve the optimization goals even when the states are not consistent (whether due to initially different states or deviations during the optimization process), as the algorithm incorporates an optimal consensus feedback control term based on $e_{ij}$.

Next, we describe the fully distributed optimization algorithm. Reconsidering the DOCMC algorithm, the presence of $(\Gamma_{P}+h(k))^{-1}$ is a primary reason why the algorithm requires a server. Motivated by our recent work [22], which replaced the inverse matrix with an adjustable scalar $\eta>0$, we replace $(\Gamma_{P}+h(k))^{-1}$ by $\eta>0$ in the following discussion. Using the relationship $(\Gamma_{P}+h(k))^{-1}\Gamma_{P}=I-(\Gamma_{P}+h(k))^{-1}h(k)$ in (19), we propose the following algorithm:

$x(k+1)=x(k)+\overline{d}_{k}(x(k)),$
$\overline{d}_{k}(x(k))=-\eta g(k)+(I-\eta h(k))\overline{d}_{k-1}(x(k)),$
$\overline{d}_{0}(x(k))=-\eta(g(k)+B^{T}Pe(k)),$ (22)

where the second-order information $h(k)$ is used in algorithm (22) owing to the optimal control-based optimization framework. We refer to the above algorithm as the Distributed Optimization Algorithm based on Optimal Control (DOAOC).

Remark 5

Now the connection between algorithm (22) and our previous work [23] is discussed. If we only consider a centralized optimization problem, or if the states of the agents coincide during the iteration process, then $e(k)$ and the related $P$ vanish. In this case, (22) degenerates into the following form:

$x(k+1)=x(k)+\overline{d}_{k}(x(k)),$
$\overline{d}_{k}(x(k))=-\eta g(k)+(I-\eta h(k))\overline{d}_{k-1}(x(k)),$
$\overline{d}_{0}(x(k))=-\eta g(k),$ (23)

which is exactly Algorithm II in the previous work [23].

Now, a distributed optimization algorithm for each agent $i$ to handle problem (1), derived from the DOAOC iteration, is given; see Algorithm 2 for details. In the proposed distributed algorithm, as described in Steps 3 and 4, each agent only communicates with its neighbors. Step 5 initializes the loop by computing the initial $\overline{d}^{0}_{i}(k)$. Different from [22], Step 7 only requires each agent to calculate $\overline{d}^{t+1}_{i}(k)$ independently.

Algorithm 2 DOAOC
1:  Initialization: Each agent $i$ requires $\eta$ and sets $x_{i}(0)\in\mathbb{R}^{p}$.
2:  for $k=0,1,\ldots$ do
3:     Run any average consensus protocol to obtain $h_{i}(k)$ and $g_{i}(k)$.
4:     Each agent $i$ obtains $e_{ij}(k)$ by communicating with its neighbors $j\in\mathcal{N}_{i}$ and, combining the $e_{ij}(k)$, records $\delta^{T}_{i}(k)$.
5:     Each agent $i$ calculates $\overline{d}^{0}_{i}(k)=-\eta(g_{i}(k)+B_{i}^{T}P_{i}\delta^{T}_{i}(k))$.
6:     for $t=0,1,\ldots,k-1$ do
7:        Each agent $i$ calculates $\overline{d}^{t+1}_{i}(k)=\overline{d}^{t}_{i}(k)-\eta g_{i}(k)-\eta h_{i}(k)\overline{d}^{t}_{i}(k)$.
8:     end for
9:     Each agent $i$ updates $x_{i}(k+1)=x_{i}(k)+\overline{d}^{t+1}_{i}(k)$.
10:  end for
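A minimal per-agent sketch of one outer iteration of Algorithm 2 (the function name and argument shapes are assumptions; $g_i$ and $h_i$ are the consensus estimates from Step 3, and $B_i^TP_i$ is supplied as a precomputed matrix) might look as follows:

```python
import numpy as np

def doaoc_step(k, x_i, g_i, h_i, delta_i, Bi_T_Pi, eta):
    # Step 5: initial direction from the average gradient and neighbor errors.
    d = -eta * (g_i + Bi_T_Pi @ delta_i)
    # Steps 6-8: run the inner recursion of (22) for t = 0, ..., k-1.
    for _ in range(k):
        d = d - eta * g_i - eta * (h_i @ d)
    # Step 9: local state update.
    return x_i + d
```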

III-C Discussion of the DOCMC and DOAOC Algorithms

In this subsection, we discuss the compelling characteristics of DOCMC and DOAOC, together with some simplification techniques for practical use, which are summarized as follows:

  • Many distributed optimization methods commonly used in machine learning, including frameworks like TensorFlow, employ distributed algorithms with a central computing node. When considering a star network (master-slave network) structure [21] in the DOCMC algorithm (where $e_{ij}$ does not occur, as each agent independently exchanges information with the server), simplification can be achieved using (21), where $g(k)$ represents the average derivative and $h(k)$ directly represents $\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}f_{i}(x_{i}(k))$.

  • To obtain the average gradient, Step 3 in Algorithm 2 is necessary. In practical applications, after evaluating the performance of the agents, a suitable consensus protocol can be chosen: if agents have ample memory, memory-intensive consensus protocols like the DSF algorithm [24] can be employed; if agents have limited memory, a finite-time average consensus protocol [25] can be used; and in cases where agent performance is poor, traditional consensus protocols [7], [20] can still be applied (a sketch of such a protocol is given after this list). $h_{i}(k)$ in the DOAOC algorithm can be obtained not only through the above consensus protocols but can also be approximated by first-order derivative differences. In fact, in our recent work [23], Algorithm III utilizes a first-order derivative difference method to obtain the Hessian matrix.
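The following is a minimal sketch (not from the paper; the step size bound and iteration count are illustrative assumptions) of a classical Laplacian-based average consensus protocol in the style of [20], which could be run on the stacked local gradients to approximate $g(k)$ in Step 3:

```python
import numpy as np

def average_consensus(z0, A, eps, iters=200):
    # z0: (n, p) array of local values; A: adjacency matrix of a balanced,
    # strongly connected digraph; eps should be smaller than 1/(max degree).
    z = z0.copy()
    L = np.diag(A.sum(axis=1)) - A       # Laplacian of the communication graph
    for _ in range(iters):
        z = z - eps * (L @ z)            # z_i <- z_i - eps * sum_j a_ij (z_i - z_j)
    return z                             # each row approaches z0.mean(axis=0)
```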

IV CONVERGENCE ANALYSIS

In this section, we conduct a convergence analysis of the proposed algorithms. The superlinear convergence rate demonstrates the superiority of our optimal control-based algorithms. The following assumption is instrumental for drawing our conclusions.

Assumption 2

The local objective functions $f_{i}$, $i=1,\ldots,n$, are twice continuously differentiable, and there exist constants $0<m_{1}\leq m_{2}<\infty$ such that for any $x\in\mathbb{R}^{n}$, $m_{1}I\preceq\nabla^{2}f_{i}(x)\preceq m_{2}I$.

Remark 6

Assumption 2 is standard in convergence analysis; see [26]. The lower bound $m_{1}$ on $\nabla^{2}f_{i}(x)$ implies that the local objective function $f_{i}$ is strongly convex, and thus $f_{i}$ has a unique global minimum point. The upper bound $m_{2}$ on $\nabla^{2}f_{i}(x)$ means that the local objective function $f_{i}$ is smooth.

Let $x^{*}$ be the minimum point of $F(x)$, and write $g^{*}=g(x^{*})$ and $h^{*}=h(x^{*})$. Now we can proceed with the convergence analysis of our algorithms.

For the DOCMC algorithm, we can compactly write the update iteration (19) as

$x(k+1)=x(k)-[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k).$ (24)
Lemma 4

Under Assumption 2, the DOCMC algorithm converges if the weight matrix $R$ is selected appropriately.

Proof:

Firstly, we prove that $x(k)$ converges to $\overline{x}(k)$, where $\overline{x}(k)=(\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T})x(k)$:

$\|x(k+1)-\overline{x}(k+1)\|$
$=\|x(k)-[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k)-\overline{x}(k)+[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)\|$
$=\|x(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k)-\overline{x}(k)\|$
$=\|[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}\Gamma_{P}^{-1}\bar{L}]x(k)-\overline{x}(k)\|.$ (25)

Let $\epsilon=((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}\Gamma_{P}^{-1}$. Recalling that $\Gamma_{P}>0$ and $h(k)>0$, it follows immediately from [27] that every eigenvalue of $(\Gamma_{P}+h(k))^{-1}\Gamma_{P}$ is positive and less than $1$. Accordingly, all eigenvalues of $((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}$ are positive and its spectral radius is less than $1$. It should be noted that $((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}$ is a symmetric matrix, as proved in [23]. Therefore, $W=I-\epsilon\bar{L}$ satisfies the properties of $P_{\epsilon}$ in [20]. According to [19], we then have

$\|x(k+1)-\overline{x}(k+1)\|=\|(W-\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T})(x(k)-\overline{x}(k))\|\leq\sigma\|x(k)-\overline{x}(k)\|,$ (26)

where $\sigma\in(0,1)$. This means that $x(k)$ converges to $\overline{x}(k)$ as $k\rightarrow\infty$. Let $y(k)=\overline{x}(k)$; next we show that $y(k)$ converges to $x^{*}$. Using $y(k)$ in the update iteration (24), we have

$y(k+1)=y(k)-[I-((\Gamma_{P}+h(k))^{-1}R)^{k+1}]h(k)^{-1}g(k),$ (27)

which is consistent with the form of Algorithm 1 in [23]. Thus, there holds

$F(y(k+1))-F(x^{*})\leq c(F(y(k))-F(x^{*})),$ (28)

where $0\leq c<1$; more details can be found in the proof of Lemma 2 in [23]. The proof is completed. ∎

Now we are in a position to discuss the convergence rate of the DOCMC algorithm.

Theorem 1

Under Assumption 2, if $\{x(k)\}$ generated by the DOCMC algorithm is convergent, then there exists a scalar $r_{1}>0$ such that

$\|x(k+1)-x^{*}\|\leq r_{1}\rho((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}\|x(k)-x^{*}\|,$ (29)

that is to say, the DOCMC algorithm is superlinearly convergent.

Proof:

Denote

$z_{k}(x(k))=-[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k).$ (31)

It is evident from (31) that

$z_{k}(x^{*})=-[I-((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1}]{h^{*}}^{-1}g^{*}-[(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe^{*}=0,$ (32)
$z_{k}^{\prime}(x^{*})=-(I-((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1})-[(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}PB.$ (33)

From (24), one has

$x(k+1)-x^{*}=x(k)-x^{*}+z_{k}(x(k))$
$\approx x(k)-x^{*}+[z_{k}(x^{*})+z_{k}^{\prime}(x^{*})(x(k)-x^{*})]$
$=x(k)-x^{*}+z_{k}^{\prime}(x^{*})(x(k)-x^{*})$
$=[((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1}\Gamma_{P}^{-1}R](x(k)-x^{*}).$ (34)

Noting that $\Gamma_{P}$ is positive definite, there exists a matrix $C$ satisfying $\Gamma_{P}=C^{\prime}C$. To this end, $(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}=(\Gamma_{P}+h^{*})^{-1}C^{\prime}C=C^{-1}C(\Gamma_{P}+h^{*})^{-1}C^{\prime}C$. The second equality indicates $(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}=\bar{C}^{-1}\bar{\Lambda}\bar{C}$, where $\bar{\Lambda}$ is a diagonal matrix. Due to $\rho((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})<1$, it immediately follows that $\rho(\bar{\Lambda})=\rho((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})<1$ and

$\|x(k+1)-x^{*}\|=\|\bar{C}^{-1}\bar{\Lambda}^{k+1}\bar{C}\Gamma_{P}^{-1}R(x(k)-x^{*})\|$
$\leq\rho((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1}\|\bar{C}^{-1}\|_{m}\|\bar{C}\|_{m}\|\Gamma_{P}^{-1}R\|_{m}\|x(k)-x^{*}\|.$ (35)

By letting $r_{1}=\|\bar{C}^{-1}\|_{m}\|\bar{C}\|_{m}\|\Gamma_{P}^{-1}R\|_{m}$, (29) is obtained accordingly. The proof is completed. ∎

Now, the superlinear convergence of the DOAOC algorithm is proved.

Theorem 2

Under Assumption 2, selecting $0<\eta<1$ such that $c=\|I-\eta h^{*}\|<1$, if $\{x(k)\}$ generated by the DOAOC algorithm converges, then there exists a scalar $r_{2}>0$ such that

$\|x(k+1)-x^{*}\|\leq r_{2}c^{k}\|x(k)-x^{*}\|.$ (36)

That is to say, the DOAOC algorithm is superlinearly convergent.

Proof:

Direct derivation shows

$\bar{d}_{k}(x^{*})=-[I-(I-\eta h^{*})^{k+1}]{h^{*}}^{-1}g^{*}-((I-\eta h^{*})^{k}\eta B^{T}P)e^{*}=0,$ (37)
$\bar{d}^{\prime}_{k}(x^{*})=-(I-(I-\eta h^{*})^{k+1})-\eta(I-\eta h^{*})^{k}B^{T}PB.$ (38)

Let $G=I-\eta(h^{*}+B^{T}PB)$. From (22), it follows that

$x(k+1)-x^{*}=x(k)-x^{*}+\bar{d}_{k}(x(k))$
$\approx x(k)-x^{*}+[\bar{d}_{k}(x^{*})+\bar{d}^{\prime}_{k}(x^{*})(x(k)-x^{*})]$
$=x(k)-x^{*}+\bar{d}^{\prime}_{k}(x^{*})(x(k)-x^{*})$
$=(I+\bar{d}^{\prime}_{k}(x^{*}))(x(k)-x^{*})$
$=((I-\eta h^{*})^{k+1}-\eta(I-\eta h^{*})^{k}B^{T}PB)(x(k)-x^{*})$
$=[(I-\eta h^{*})^{k}G](x(k)-x^{*}).$ (39)

Since $\|I-\eta h^{*}\|<1$ when $\eta<2/m_{2}$, it follows that $\|x(k+1)-x^{*}\|\leq\|I-\eta h^{*}\|^{k}\|G\|_{m}\|x(k)-x^{*}\|$. By letting $r_{2}=\|G\|_{m}$, the superlinear convergence of the DOAOC algorithm is obtained. The proof is completed. ∎

V CONCLUSIONS

This paper has proposed distributed second-order optimization algorithms with global superlinear convergence from the viewpoint of optimal control theory. The approach successfully integrates second-order information into distributed optimization without requiring the inversion of the Hessian matrix. A connection between the proposed algorithms and traditional optimization methods has been revealed, and the convergence of the algorithms has been rigorously analyzed.

References

  • [1] D. Belegundu, and R. Chandrupatla. Optimization concepts and applications in engineering. Journal of Management, 52(11): 56, 2000.
  • [2] R. Eini and S. Abdelwahed. Distributed model predictive control for intelligent traffic systems. International Conference on Internet of Things (iThings), 909-915, 2019.
  • [3] K. Molzahn, F. Dörfler, H. Sandberg, H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei. A survey of distributed optimization and control algorithms for electric power systems. IEEE Transactions on Smart Grid, 8(6): 2941-2962, 2017.
  • [4] A. Olshevsky. Efficient Information Aggregation for Distributed Control and Signal Processing. Ph.D. dissertation, MIT, Cambridge, MA, 2010.
  • [5] Y. Xue, B. Li, and K. Nahrstedt. Optimal resource allocation in wireless ad hoc networks: a price-based approach. IEEE Transactions on Mobile Computing, 5(4):347–364, 2006
  • [6] P. Wan, and D. Lemmon. Event-triggered distributed optimization in sensor networks. International Conference on Information Processing in Sensor Networks, 49-60, 2009.
  • [7] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1): 48–61, 2009.
  • [8] A. Nedić, A. Ozdaglar, and A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4): 922-938, 2010.
  • [9] R. Islamov, X. Qian, and P. Richtárik. Distributed second order methods with fast rates and compressed communication. Proceedings of the 38th International Conference on Machine Learning, 139: 4617-4628, 2021.
  • [10] E. Wei, A. Ozdaglar, and A. Jadbabaie. A distributed Newton method for network utility maximization-I: Algorithm. IEEE Transactions on Automatic Control, 58(9): 2162-2175, 2013.
  • [11] E. Wei, A. Ozdaglar, and A. Jadbabaie. A distributed Newton method for network utility maximization-II: Convergence. IEEE Transactions on Automatic Control, 58(9): 2176-2188, 2013.
  • [12] J. Zhang, K. You, and T. Başar. Distributed adaptive Newton methods with global superlinear convergence. Automatica, 138: 110156, 2022.
  • [13] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato. Newton-Raphson consensus for distributed convex optimization. IEEE Transactions on Automatic Control, 61(4): 994-1009, 2016.
  • [14] H. Zhang, and H. Wang. Optimization methods rooting in optimal control. arXiv preprint arXiv:2312.01334, 1-8, 2024.
  • [15] V. D. Blondel, J. M. Hendrickx, A. Olshevsky, and J. N. Tsitsiklis. Convergence in multiagent coordination, consensus, and flocking. Proceedings of the 44th IEEE Conference on Decision and Control, pp. 2996-3000, 2005. IEEE.
  • [16] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • [17] L. Zhang, J. Xu, H. Zhang, and L. Xie. Distributed Optimal Control and Application to Consensus of Multi-Agent Systems. arXiv preprint arXiv:2309.12577, 2023.
  • [18] H. Zhang, H. Wang, and L. Li. Adapted and causal maximum principle and analytical solution to optimal control for stochastic multiplicative-noise systems with multiple input-delays. in Proc. 51st IEEE Conf. Decision Control, Maui, HI, USA, pp. 2122–2127, 2012.
  • [19] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245-1260, 2017. IEEE.
  • [20] R. Olfati-Saber and R. M. Murray. Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1520-1533, 2004. IEEE.
  • [21] S. Soori, K. Mishchenko, A. Mokhtari, M. M. Dehnavi, and M. Gurbuzbalaban. DAve-QN: A distributed averaged quasi-Newton method with local superlinear convergence rate. International Conference on Artificial Intelligence and Statistics, pp. 1965-1976, 2020. PMLR.
  • [22] Y. Xu, Z. Guo, K. Lu, and H. Zhang. Distributed Optimization Algorithm with Superlinear Convergence Rate. arXiv preprint arXiv:2409.12392, 2024.
  • [23] H. Wang, Y. Xu, Z. Guo, and H. Zhang. Superlinear Optimization Algorithms. arXiv preprint arXiv:2403.11115, 2024.
  • [24] J. Zhang, K. You, and T. Başar. Distributed adaptive Newton methods with global superlinear convergence. Automatica, vol. 138, pp. 110156, 2022. Elsevier.
  • [25] T. Charalambous and C. N. Hadjicostis. Laplacian-based matrix design for finite-time average consensus in digraphs. 2018 IEEE Conference on Decision and Control (CDC), pp. 3654-3659, 2018. IEEE.
  • [26] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • [27] S.G. Wang, M.X. Wu, and Z.Z. Jia. Matrix inequalities. Chinese Science Press, Beijing, 2006.