
Distributed Optimization Method Based On Optimal Control

Ziyuan Guo, Yue Sun, Yeming Xu, Liping Zhang, and Huanshui Zhang, Senior Member, IEEE. This work was supported by the Original Exploratory Program Project of National Natural Science Foundation of China (62450004), the Joint Funds of the National Natural Science Foundation of China (U23A20325), and the Major Basic Research of Natural Science Foundation of Shandong Province (ZR2021ZD14). (Corresponding author: Huanshui Zhang.) Ziyuan Guo, Yeming Xu and Liping Zhang are with the College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China (e-mail: [email protected]; [email protected]; [email protected]). Yue Sun is with the School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China (e-mail: [email protected]). Huanshui Zhang is with the College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China, and also with the School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China (e-mail: [email protected]).
Abstract

In this paper, a novel distributed optimization framework is proposed. The key idea is to convert optimization problems into optimal control problems in which the objective of each agent is to design the current control input that minimizes its own original objective function together with the update size at future time instants. In contrast to existing distributed optimization methods for minimizing a sum of convex objective functions over multiple agents, we present a distributed optimization algorithm for multi-agent systems based on results from the maximum principle. Moreover, the convergence and the superlinear convergence rate are rigorously analyzed.

I INTRODUCTION

Optimization problems, which involve maximizing or minimizing an objective function, are ubiquitous in science and engineering [1]. Traditional centralized optimization requires a central node to collect and process the information received from all other agents. With the increasing penetration of networked control systems, the division of labor and cooperation within multi-agent systems play an increasingly important role in daily life. Distributed optimization problems arise accordingly, in which several autonomous agents collectively try to achieve a global objective. Compared with centralized optimization, distributed algorithms also have the potential to respect the privacy of data, measurements, cost functions, and constraints, which is increasingly important in practical applications such as intelligent traffic systems [2], electric power systems [3], formation control [4], resource allocation [5], and so on.

Distributed optimization has gained growing renewed interest, and various distributed algorithms have been proposed in the literature. Dual decomposition, an early class of distributed optimization algorithms presented in [6], typically relies on a carefully chosen step size to ensure convergence. Due to the inexpensive computation of (sub)gradients, gradient descent algorithms are also a popular tool for distributed optimization. [7] considered an unconstrained distributed computation model for minimizing a sum of convex objective functions over multiple agents, where each agent performs a consensus step and then a descent step along the local (sub)gradient direction of its own convex objective function. For constrained distributed optimization, [8] focused on cooperative control problems where the values of the agents are constrained to lie in closed convex sets and developed corresponding distributed algorithms. To accelerate the convergence of (sub)gradient methods, the Newton method, which enjoys a superlinear convergence rate, has also been considered [9, 10, 11]. In particular, [12] proposed a distributed adaptive Newton algorithm with a global quadratic convergence rate, in which each node of a peer-to-peer network minimizes a finite sum of objective functions by communicating with its neighboring nodes. The procedure proposed in [13] lets agents distributedly compute and sequentially update an approximate Newton-Raphson direction by means of suitable average consensus ratios. However, as pointed out in [14], although subgradient and Newton methods are powerful and widely used in distributed optimization, the choice of step size and the singularity of the Hessian matrix are crucial to these algorithms.

Motivated by [14], a novel distributed optimization algorithm is proposed from the viewpoint of optimal control. To be specific, the control input of each agent is the update size of each iteration. The objective of each agent is to design the current control input that minimizes its own objective function together with the update size at future time instants. Through a simple transformation, the sum of the original objective functions of the agents is recovered. Based on the maximum principle, a distributed optimization method is then derived for minimizing a sum of convex objective functions. To show the superiority of the algorithm, its superlinear convergence rate is proved.

The outline of this paper is given below. The distributed optimization problem is reformulated as an optimal control problem in Section II. The distributed optimization algorithm based on the maximum principle is proposed in Section III. The rate of convergence is analysed in Section IV. The conclusion is given in Section V.

Notation: Throughout the paper, $A^{T}$ stands for the transpose of matrix $A$. $\mathbb{R}^{n}$ denotes the $n$-dimensional real vector space. $I$ is the identity matrix with appropriate dimensions. For a symmetric matrix $M$, $M>0$ ($M\geq 0$) means that $M$ is a positive definite (positive semi-definite) matrix. $\rho(M)$ represents the spectral radius of $M$. $\|\cdot\|$ denotes the 2-norm of vectors and $\|\cdot\|_{m}$ is the induced norm of matrices. $[a,b]$ represents all integers from integer $a$ to integer $b$.

II PROBLEM FORMULATION

II-A Graph theory

The communication graph is denoted by a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, where $\mathcal{V}=\{1,2,\ldots,n\}$ is the set of vertices (nodes), $n$ is the number of agents satisfying $n\geq 2$, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of edges, and the weighted matrix $\mathcal{A}=(a_{ij})_{n\times n}$ is a non-negative matrix of adjacency weights, with $a_{ij}\neq 0$ if and only if $(i,j)\in\mathcal{E}$. The graph $\mathcal{G}$ is said to be balanced if, for every agent, the incoming and outgoing interaction weights are equal, i.e., $\sum_{j=1}^{n}a_{ij}=\sum_{j=1}^{n}a_{ji}$, $\forall i\in\mathcal{V}$. Moreover, $\mathcal{N}_{i}=\{j\in\mathcal{V}\,|\,(i,j)\in\mathcal{E}\}$ is used to represent the neighbor set of agent $i$. For the topology $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, a path of length $r$ from node $i_{1}$ to node $i_{r+1}$ is a sequence of $r+1$ distinct nodes $\{i_{1},\ldots,i_{r+1}\}$ such that $(i_{q},i_{q+1})\in\mathcal{E}$ for $q=1,\ldots,r$. If there exists a path between any two nodes, then $\mathcal{G}$ is said to be strongly connected. Here we make the following assumption for the communication graph.

Assumption 1

The directed graph $\mathcal{G}$ is balanced and strongly connected.

Note that in distributed coordination control, the conditions in Assumption 1 play an important role in ensuring that multi-agent systems (MAS) reach consensus [7], [15]. A simple numerical check of these conditions is sketched below.
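The following is a minimal sketch (not from the paper; the adjacency matrix and helper names are illustrative assumptions) of how the balance and strong-connectivity conditions of Assumption 1 can be checked numerically for a weighted digraph:

```python
import numpy as np

def is_balanced(A, tol=1e-12):
    # Balanced: in-weights equal out-weights at every node,
    # i.e., sum_j a_ij == sum_j a_ji for all i.
    return np.allclose(A.sum(axis=1), A.sum(axis=0), atol=tol)

def is_strongly_connected(A):
    # Strongly connected: some path of length < n joins every ordered
    # pair of nodes, i.e., sum_{r=0}^{n-1} (A > 0)^r has no zero entry.
    n = A.shape[0]
    M = (A > 0).astype(int)
    reach = np.eye(n, dtype=int)
    power = np.eye(n, dtype=int)
    for _ in range(n - 1):
        power = (power @ M > 0).astype(int)
        reach += power
    return bool((reach > 0).all())

# Directed 4-cycle with unit weights: balanced and strongly connected.
A = np.roll(np.eye(4), 1, axis=1)
print(is_balanced(A), is_strongly_connected(A))  # True True
```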

II-B Distributed optimization

Consider an MAS consisting of $n$ agents, labeled by the set $\mathcal{V}=\{1,\ldots,n\}$, where the agents communicate their local state information with their neighbors via the directed graph $\mathcal{G}$. Only agent $i\in\mathcal{V}$ knows the function $f_{i}:\mathbb{R}^{n}\to\mathbb{R}$, which is twice continuously differentiable and possibly non-convex. The traditional unconstrained optimization problem is given by

$\mathop{\rm minimize}\limits_{x\in\mathbb{R}^{n}}\Big\{f(x)\overset{\text{def}}{=}\frac{1}{n}\sum\limits_{i=1}^{n}f_{i}(x)\Big\}.$ (1)

Different from traditional distributed optimization methods, we transform the task of finding solutions to problem (1) into the updating of the state sequence within an optimal control problem. To be specific, consider the following discrete-time linear time-invariant system for each agent

$x_{i}(k+1)=x_{i}(k)+u_{i}(k),$ (2)

where $x_{i}(k)\in\mathbb{R}^{n}$ is the state with initial value $x_{i}(0)$, and $u_{i}(k)\in\mathbb{R}^{n}$ is the control input, which will be further specified later. The subscript $i$ indexes the $i$th agent, $i\in\{1,2,\ldots,n\}$.

Correspondingly, the cost function of each agent satisfies

$J_{i}=\sum\limits_{k=0}^{N}\Big(\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(k)Q_{i}e_{ij}(k)+u^{T}_{i}(k)R_{i}u_{i}(k)+f_{i}(x_{i}(k))\Big)+\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(N+1)H_{i}e_{ij}(N+1)+f_{i}(x_{i}(N+1)),$ (3)

with $e_{ij}(k)=x_{i}(k)-x_{j}(k)$, where $Q_{i}\geq 0$, $H_{i}\geq 0$ and $R_{i}>0$.

Remark 1

Compared with the traditional method, we convert the optimization problem into an optimal control problem, where the objective of each agent is to design the current control input minimizing the sum of its original objective function and the update size at future time instants. On the one hand, the problem studied in this paper includes the traditional multi-agent results: if we let $f_{i}(x_{i}(k))=f_{i}(x_{i}(N+1))=0$, the system (2) can be modified as $x_{i}(k+1)=Ax_{i}(k)+B_{i}u_{i}(k)$. On the other hand, among the existing optimization algorithms for problem (1), a typical distributed first-order gradient descent algorithm is [16]:

$x_{i}(k+1)=\sum\limits_{j=1}^{n}a_{ij}x_{j}(k)-\eta(k)\nabla f_{i}(x_{i}(k)),$ (4)

where $\eta(k)$ is the step size. Note that, in convex optimization, it has been proven that the distributed algorithm (4) attains the global optimal solution of problem (1). However, algorithm (4) has limitations in terms of selecting the step size: $\eta(k)$ should theoretically satisfy the following two conditions:

$\lim\limits_{k\to\infty}\eta(k)=0,$ (5)
$\sum\limits_{k=1}^{\infty}\eta(k)=\infty.$ (6)

In practical applications, the common practice is to select $\eta$ as a small constant. Alternatively, even if a time-varying step size $\eta(k)$ is chosen to satisfy conditions (5)-(6), the results obtained from the distributed algorithm (4) are evidently influenced by the choice of $\eta(k)$.
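As an illustration, the following is a minimal sketch (not from the paper; the quadratic local costs, consensus weights, and horizon are illustrative assumptions) of the iteration (4) with a diminishing step size satisfying (5)-(6):

```python
import numpy as np

n = 4
W = np.full((n, n), 1.0 / n)               # doubly stochastic consensus weights
targets = np.array([1.0, 2.0, 3.0, 4.0])   # local costs f_i(x) = (x - targets[i])^2 / 2

def local_grad(i, x):
    return x - targets[i]                  # gradient of f_i at the scalar x

x = np.zeros(n)                            # one scalar state per agent
for k in range(5000):
    eta = 1.0 / (k + 1)                    # diminishing step size: (5) and (6) hold
    x = W @ x - eta * np.array([local_grad(i, x[i]) for i in range(n)])

print(x)  # every agent approaches the global minimizer 2.5
```

Even in this simple example, the speed of the final approach is visibly governed by the decay of $\eta(k)$, which is exactly the limitation discussed above.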

Based on the discussion above, we are in a position to derive the distributed optimization algorithm from the viewpoint of optimal control. For convenience of the subsequent discussion, we introduce the following notation:

$x(k)=[x^{T}_{1}(k),\ldots,x^{T}_{n}(k)]^{T},\quad u(k)=[u^{T}_{1}(k),\ldots,u^{T}_{n}(k)]^{T},$
$Q={\rm diag}\{Q_{1},\ldots,Q_{n}\},\quad R={\rm diag}\{R_{1},\ldots,R_{n}\},$
$H={\rm diag}\{H_{1},\ldots,H_{n}\},\quad F(x(k))=\sum\limits_{i=1}^{n}f_{i}(x_{i}(k)).$

Let $\delta_{i}(k)$ be the stacked errors of the $i$th agent, so that $e(k)=[\delta^{T}_{1}(k),\ldots,\delta^{T}_{n}(k)]^{T}$. In this way, we have

$e(k+1)=e(k)+\sum\limits_{i=1}^{n}B_{i}u_{i}(k)=e(k)+Bu(k),$

where $B_{i}$ is the column-block matrix composed of $0$, $1$, $-1$, satisfying $B=[B_{1},\ldots,B_{n}]$, and $B^{T}B=L$ is a Laplacian matrix. To this end, (2) and (3) can be written in stacked form as

$x(k+1)=x(k)+u(k),$ (7)

and

$J_{N}=\sum\limits_{k=0}^{N}\Big(\sum\limits_{i=1}^{n}\Big(\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(k)Q_{i}e_{ij}(k)+u^{T}_{i}(k)R_{i}u_{i}(k)+f_{i}(x_{i}(k))\Big)\Big)+\sum\limits_{i=1}^{n}\Big(\sum\limits_{j\in\mathcal{N}_{i}}e^{T}_{ij}(N+1)H_{i}e_{ij}(N+1)+f_{i}(x_{i}(N+1))\Big)$
$\phantom{J_{N}}=\sum\limits_{k=0}^{N}\big(e^{T}(k)Qe(k)+u^{T}(k)Ru(k)+F(x(k))\big)+e^{T}(N+1)He(N+1)+F(x(N+1)),$ (8)

respectively. The optimal control problem to be solved in this section is stated as follows.

Problem: Find the optimal control $u(k)$ minimizing the long-term cost $J_{N}$ subject to (7).

Remark 2

In our recent work [14], we implemented a centralized optimization algorithm, which motivates the design of the system and cost function (7)-(8). Naturally, we considered the possibility of designing a distributed algorithm based on the centralized one via the maximum principle. It should be noted that if $F(x(k))=\sum_{i=1}^{n}f_{i}(x_{i}(k))=0$, then (8) reduces to the cost function of the consensus problem based on optimal control in our recent work [17].

III Distributed Optimization Method Using Optimal Control

In this section, we use optimal control theory to solve the Problem. This inspires us to develop an optimization algorithm which can be implemented in a distributed manner. Additionally, a variant of the algorithm is proposed to balance the numbers of iterations and communications.

III-A Algorithm Development

Following [18] and applying Pontryagin's maximum principle to the system (7) with the cost function (8), the following equilibrium condition and costate equation are obtained:

$0=Ru(k)+\lambda(k),$ (9)
$\lambda(k-1)=B^{T}Qe(k)+\nabla f(k)+\lambda(k),$ (10)

with the terminal value $\lambda(N)=B^{T}He(N+1)+\nabla f(N+1)$, where $\nabla f(k)=[\nabla f_{1}^{T}(x_{1}(k)),\ldots,\nabla f_{n}^{T}(x_{n}(k))]^{T}$.

To derive the analytical solution to the FBDEs (7), (9)-(10), define the Riccati equation

$P(k)=Q+P(k+1)-P(k+1)B\Gamma^{-1}(k)B^{T}P(k+1),$ (11)

with the terminal value $P(N+1)=H$, where $\Gamma(k)=R+B^{T}P(k+1)B$.
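A minimal sketch (not from the paper; the function name and convergence test are assumptions) of iterating the backward recursion (11) from the terminal value until it settles at a stable solution $P$, which is used later in Lemma 2, might look as follows:

```python
import numpy as np

def stable_riccati(Q, R, H, B, tol=1e-10, max_iter=10_000):
    # Iterate (11) backward from P(N+1) = H until the iterates stop
    # changing; Gamma(k) = R + B^T P(k+1) B as in the text.
    P = H.copy()
    for _ in range(max_iter):
        Gamma = R + B.T @ P @ B
        P_next = Q + P - P @ B @ np.linalg.solve(Gamma, B.T @ P)
        if np.linalg.norm(P_next - P) < tol:
            return P_next
        P = P_next
    raise RuntimeError("Riccati recursion did not settle within max_iter")
```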

The analytical expression for the controller is given below.

Lemma 1

If the Riccati equation (11) admits a solution such that $\Gamma(k)$ is invertible, then the optimal control satisfies

$u(k)=-\Gamma^{-1}(k)\Big[B^{T}P(k+1)e(k)+\sum\limits_{l=k+1}^{N+1}M(l)\nabla f(l)\Big],$ (12)

where $M(l)=M(l-1)R(R+B^{T}P(l)B)^{-1}$ for $l>k+1$ and $M(l)=I$ for $l=k+1$. Moreover, the costate equation is derived as

$\lambda(k-1)=B^{T}P(k)e(k)+\sum\limits_{l=k}^{N+1}M(l)\nabla f(l).$ (13)
Proof:

For $k=N$, substituting the terminal value $\lambda(N)$ into (9) yields

$0=Ru(N)+B^{T}H[e(N)+Bu(N)]+\nabla f(N+1)=[R+B^{T}HB]u(N)+B^{T}He(N)+\nabla f(N+1),$ (14)

i.e., $u(N)$ satisfies (12) for $k=N$. Accordingly, substituting (14) into (10), the costate $\lambda(N-1)$ is calculated as

$\lambda(N-1)=B^{T}Qe(N)+B^{T}H[e(N)+Bu(N)]+\nabla f(N+1)+\nabla f(N)$
$=B^{T}[Q+H]e(N)-B^{T}HB\Gamma^{-1}(N)[B^{T}P(N+1)e(N)+\nabla f(N+1)]+\nabla f(N+1)+\nabla f(N)$
$=B^{T}P(N)e(N)+M(N)\nabla f(N+1)+\nabla f(N),$

which is exactly (13) for $k=N$, where $P(N)$ satisfying (11) has been used in the last equality.

Assume that (13) holds for $k\geq s+1$; we show that it also holds for $k=s$. For $k=s$, substituting $\lambda(s)$ into (9), one has

$0=Ru(s)+B^{T}P(s+1)[e(s)+Bu(s)]+\sum\limits_{l=s+1}^{N+1}M(l)\nabla f(l),$ (15)

i.e., $u(s)$ is such that

$u(s)=-\Gamma^{-1}(s)\Big[B^{T}P(s+1)e(s)+\sum\limits_{l=s+1}^{N+1}M(l)\nabla f(l)\Big].$ (16)

Substituting it into (10) for $k=s$ yields

$\lambda(s-1)=B^{T}Qe(s)+B^{T}P(s+1)[e(s)+Bu(s)]+\sum\limits_{l=s+1}^{N+1}M(l)\nabla f(l)+\nabla f(s)=B^{T}P(s)e(s)+\sum\limits_{l=s}^{N+1}M(l)\nabla f(l),$

which is exactly (13). This completes the proof. ∎

Remark 3

It should be noted that the analytical solution $u(k)$ obtained in Lemma 1 contains future information, i.e., it depends on $\{\nabla f(l),\,l\in[k+1,N+1]\}$, which is difficult to realize in practice; thus a modified algorithm is presented.

Before proposing an algorithm based on the optimal control obtained in Lemma 1, we first show an interesting phenomenon: the controller (12) derived from optimal control theory utilizes the average value of the derivatives of the agents' local functions, referred to as the "average gradient" hereafter. That is to say, the matrix $M(l)$ has the property of taking the average value, which is proved below. It should be noted that distributed optimization algorithms that employ the average gradient have been explored in [19]. Our derived results theoretically demonstrate the rationality and correctness of adopting the average gradient. The property of $M(l)$ defined in Lemma 1 is given below.

Lemma 2

$M(l)$ defined in Lemma 1 satisfies $\lim\limits_{l\to\infty}M(l)=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$.

Proof:

If the Riccati equation (11) admits a stable solution $P$ as $k\rightarrow\infty$, then we adopt this stable solution in $M(l)$. In order to prove the convergence of $M(l)$, the following system is first constructed:

$w(k+1)=R(R+B^{T}PB)^{-1}w(k)=(R(R+B^{T}PB)^{-1})^{k+1}w(0),$ (17)

where $w(k)\in\mathbb{R}^{n}$. In this case, the key point is to prove $\lim\limits_{k\to\infty}(R(R+B^{T}PB)^{-1})^{k}=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$. Utilizing the properties of the Riccati equation's solution and $B^{T}B=L$, (17) can be rewritten as

$w(k+1)=(I+\bar{L})^{-1}w(k)=w(k)-((I+\bar{L})^{-1}\bar{L})w(k),$ (18)

where $\bar{L}=B^{T}PBR^{-1}$ is a Laplacian matrix.

According to [20], since the eigenvalues of the matrix $\bar{L}$ are greater than or equal to zero, the eigenvalues of $(I+\bar{L})$ are greater than or equal to $1$. Therefore, the eigenvalues of the inverse of $(I+\bar{L})$ lie between $0$ and $1$, indicating that the system (17) is stable.

When $k\rightarrow\infty$, the steady state $w$ of the system satisfies $-((I+\bar{L})^{-1}\bar{L})w=0$, which simplifies to $\bar{L}w=0$. Referring to the conclusion in [20], the solution of $\bar{L}w=0$ is $w^{*}=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}w(0)$, so we have $\lim\limits_{k\to\infty}(R(R+B^{T}PB)^{-1})^{k}=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$, i.e., $\lim\limits_{l\to\infty}M(l)=\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$. This completes the proof. ∎

Lemma 2 indicates that (12) utilizes the average gradient. For the sake of simplifying notation, we will denote the average gradient as $g(l)$, i.e., $g(l)=M(l)\nabla f(l)$.
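The averaging property of Lemma 2 is easy to check numerically. The following minimal sketch (not from the paper; the 4-node cycle and the stand-in $P=I$ are illustrative assumptions) builds an incidence-type matrix $B$ with $B^{T}B=L$ and verifies that the powers of $R(R+B^{T}PB)^{-1}$ approach $\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T}$:

```python
import numpy as np

n = 4
# Oriented incidence-type matrix of the cycle 1-2-3-4-1, so that B^T B = L.
B = np.zeros((n, n))
for row, (i, j) in enumerate([(0, 1), (1, 2), (2, 3), (3, 0)]):
    B[row, i], B[row, j] = 1.0, -1.0
L = B.T @ B                                # graph Laplacian

R = np.eye(n)
P = np.eye(n)                              # stand-in for the stable solution of (11)
T = R @ np.linalg.inv(R + B.T @ P @ B)     # the factor appearing in M(l)

avg = np.ones((n, n)) / n                  # (1/n) * 1 1^T
print(np.linalg.norm(np.linalg.matrix_power(T, 200) - avg))  # effectively 0
```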

III-B Distributed Optimization Algorithm

In this subsection, we discuss the distributed optimization algorithms derived from Lemma 1, which include an algorithm with a central computing node and a decentralized algorithm.

Lemma 3

Under the condition of Lemma 1 and based on the optimal control obtained from it, there exists the iteration algorithm

$x(k+1)=x(k)+d_{k}(x(k)),$
$d_{k}(x(k))=-(\Gamma_{P}+h(k))^{-1}[g(k)-\Gamma_{P}d_{k-1}(x(k))],$
$d_{0}(x(k))=-(\Gamma_{P}+h(k))^{-1}(g(k)+B^{T}Pe(k)),$ (19)

where $g(k)=(\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T})\nabla f(k)$ is the average gradient, $h(k)={\rm diag}\{h_{1}(k),\ldots,h_{n}(k)\}$ with $h_{i}(k)=\frac{1}{n}\sum_{j=1}^{n}\nabla^{2}f_{j}(x_{j}(k))$, and $\Gamma_{P}=R+B^{T}PB$ denotes $\Gamma(k)$ evaluated at the stable solution $P$ of (11).

Proof:

The controller (12) exhibits non-causality because future average gradients are used. Fortunately, the problem of utilizing future information has been addressed in our previous research; see [14] for details. In accordance with the methodology presented in [14], controller (12) transforms into $d_{k}$ as depicted in (19). ∎

Remark 4

The connection between algorithm (19) and our previous work [14] is now discussed. On the one hand, if the agents have no local functions to optimize, i.e., $f_{i}(x_{i}(k))=f_{i}(x_{i}(N+1))=0$, then the terms $g(k)$ and $h(k)$ in (19) vanish. That is to say, (19) transforms into the following form:

$x(k+1)=x(k)+d_{k},$
$d_{k}=-\Gamma_{P}^{-1}B^{T}Pe(k).$ (20)

In this case, $d_{k}$ corresponds to the form of the controller $u_{k}$ from Lemma 1 in [17] when the matrix $A$ is the identity matrix. On the other hand, if we only consider a centralized optimization problem, or if the states of the agents coincide during the iteration process, then $e(k)$ and the related $P$ vanish. Moreover, (19) degenerates into the following form:

$x(k+1)=x(k)+d_{k}(x(k)),$
$d_{k}(x(k))=-(R+h(k))^{-1}[g(k)-Rd_{k-1}(x(k))],$
$d_{0}(x(k))=-(R+h(k))^{-1}g(k).$ (21)

In this case, (21) corresponds to Algorithm I in the previous work [14], where $g(k)$ and $h(k)$ correspond to the (average) gradient and Hessian matrix, respectively.

Based on optimal control theory, we have derived a distributed optimization algorithm (19) with a central computing node (in computer science, this node is called a parameter server [21], and we use this expression in Algorithm 1). The algorithm presented in Algorithm 1 is called the Distributed Optimal Control Method with a Central Computing Node (DOCMC). We now summarize the DOCMC algorithm from the perspective of each agent $i$.

Algorithm 1 DOCMC
1:  Initialization: Each agent checks whether it can communicate with the server, and sets $x_{i}(0)\in\mathbb{R}^{p}$.
2:  for $k=0,1,\ldots$ do
3:     for each agent $i=1,\ldots,n$ do
4:        Compute the local gradient $\nabla f_{i}(x_{i}(k))$ and local Hessian $\nabla^{2}f_{i}(x_{i}(k))$. Calculate the state error $e_{ij}(k)$ between itself and its neighbors.
5:        Send $\nabla f_{i}(x_{i}(k))$, $\nabla^{2}f_{i}(x_{i}(k))$ and $e_{ij}(k)$ to the server.
6:     end for
7:     The server computes $g(k)$, $h(k)$ and $d_{0}(x(k))=-(\Gamma_{P}+h(k))^{-1}(g(k)+B^{T}Pe(k))$.
8:     for $t=0,1,\ldots,k-1$ do
9:        $d_{t+1}(x(k))=-(\Gamma_{P}+h(k))^{-1}[g(k)-\Gamma_{P}d_{t}(x(k))]$
10:     end for
11:     Broadcast $d_{k}(x_{i}(k))$ to the $i$th agent, where $d_{k}(x_{i}(k))$ is the component of $d_{k}(x(k))$ corresponding to the $i$th agent.
12:     Each agent updates $x_{i}(k+1)=x_{i}(k)+d_{k}(x_{i}(k))$.
13:  end for
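For concreteness, a minimal sketch of the server-side computation in Algorithm 1 might look as follows (the function name, argument shapes, and the use of dense linear solves are assumptions for illustration):

```python
import numpy as np

def docmc_direction(k, grads, hessians, e, B, P, Gamma_P):
    # grads/hessians: lists of the n local gradients and Hessians at x(k);
    # e: stacked state errors; Gamma_P = R + B^T P B at the stable Riccati solution P.
    n = len(grads)
    g = np.tile(sum(grads) / n, n)                 # stacked average gradient g(k)
    h_blk = np.kron(np.eye(n), sum(hessians) / n)  # h(k) = diag{h_1(k),...,h_n(k)}
    A = Gamma_P + h_blk
    d = -np.linalg.solve(A, g + B.T @ P @ e)       # Step 7: d_0(x(k))
    for _ in range(k):                             # Steps 8-10: inner recursion (19)
        d = -np.linalg.solve(A, g - Gamma_P @ d)
    return d                                       # d_k(x(k)), broadcast in Step 11
```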

Utilizing the state errors $e_{ij}$ with neighbors can significantly enhance the computational stability of the server. In fact, if all agents share the same state, then (21) is utilized in DOCMC. However, when the states of the agents cannot be guaranteed to be consistent, due to communication processes or the influence of individual agents, relying solely on (21) may not be sufficient to achieve the optimization goals. The presence of $e_{ij}$ ensures that each agent can achieve the optimization goals even when the states are not consistent (whether due to initially different states or deviations during the optimization process), as the algorithm incorporates an optimal consensus feedback control term based on $e_{ij}$.

Next, we describe the fully distributed optimization algorithm. Reconsidering the DOCMC algorithm, the presence of $(\Gamma_{P}+h(k))^{-1}$ is a primary reason why the algorithm requires a server. Motivated by our recent work [22], which replaced the inverse matrix with an adjustable scalar $\eta>0$, we replace $(\Gamma_{P}+h(k))^{-1}$ by $\eta>0$ in the following discussion. Using the relationship $(\Gamma_{P}+h(k))^{-1}\Gamma_{P}=I-(\Gamma_{P}+h(k))^{-1}h(k)$ in (19), we propose the following algorithm:

$x(k+1)=x(k)+\overline{d}_{k}(x(k)),$
$\overline{d}_{k}(x(k))=-\eta g(k)+(I-\eta h(k))\overline{d}_{k-1}(x(k)),$
$\overline{d}_{0}(x(k))=-\eta(g(k)+B^{T}Pe(k)),$ (22)

where the second-order information $h(k)$ is used in algorithm (22) owing to the optimal control-based optimization framework. We refer to the above algorithm as the Distributed Optimization Algorithm based on Optimal Control (DOAOC).

Remark 5

Now the connection between algorithm (22) and our previous work [23] is discussed. If we only consider a centralized optimization problem, or if the states of the agents coincide during the iteration process, then $e(k)$ and the related $P$ vanish. In this case, (22) degenerates into the following form:

$x(k+1)=x(k)+\overline{d}_{k}(x(k)),$
$\overline{d}_{k}(x(k))=-\eta g(k)+(I-\eta h(k))\overline{d}_{k-1}(x(k)),$
$\overline{d}_{0}(x(k))=-\eta g(k),$ (23)

which is exactly Algorithm II in the previous work [23].

Now, a distributed optimization algorithm for each agent $i$ to handle problem (1), derived from the DOAOC iteration, is given; see Algorithm 2 for details. In the proposed distributed algorithm, as described in Steps 3 and 4, each agent only communicates with its neighbors. Step 5 initializes the loop by computing the initial $\overline{d}^{0}_{i}(k)$. Different from [22], Step 7 only requires each agent to calculate $\overline{d}^{t+1}_{i}(k)$ independently.

Algorithm 2 DOAOC
1:  Initialization: Each agent $i$ requires $\eta$ and sets $x_{i}(0)\in\mathbb{R}^{p}$.
2:  for $k=0,1,\ldots$ do
3:     Run any average consensus protocol to obtain $h_{i}(k)$ and $g_{i}(k)$.
4:     Each agent $i$ obtains $e_{ij}(k)$ by communicating with its neighbors $j\in\mathcal{N}_{i}$ and, combining the $e_{ij}(k)$, records $\delta^{T}_{i}(k)$.
5:     Each agent $i$ calculates $\overline{d}^{0}_{i}(k)=-\eta(g_{i}(k)+B_{i}^{T}P_{i}\delta^{T}_{i}(k))$.
6:     for $t=0,1,\ldots,k-1$ do
7:        Each agent $i$ calculates $\overline{d}^{t+1}_{i}(k)=\overline{d}^{t}_{i}(k)-\eta g_{i}(k)-\eta h_{i}(k)\overline{d}^{t}_{i}(k)$.
8:     end for
9:     Each agent $i$ updates $x_{i}(k+1)=x_{i}(k)+\overline{d}^{t+1}_{i}(k)$.
10:  end for
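A minimal per-agent sketch of one outer iteration of Algorithm 2 (the function name and argument shapes are assumptions; $g_i$ and $h_i$ are the consensus estimates from Step 3, and $B_i^TP_i$ is supplied as a precomputed matrix) might look as follows:

```python
import numpy as np

def doaoc_step(k, x_i, g_i, h_i, delta_i, Bi_T_Pi, eta):
    # Step 5: initial direction from the average gradient and neighbor errors.
    d = -eta * (g_i + Bi_T_Pi @ delta_i)
    # Steps 6-8: run the inner recursion of (22) for t = 0, ..., k-1.
    for _ in range(k):
        d = d - eta * g_i - eta * (h_i @ d)
    # Step 9: local state update.
    return x_i + d
```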

III-C Discussion of the DOCMC and DOAOC Algorithms

In this subsection, we discuss the compelling characteristics of DOCMC and DOAOC, together with some simplification techniques for practical use, which are summarized as follows:

  • Many distributed optimization methods commonly used in machine learning, including frameworks like TensorFlow, employ distributed algorithms with a central computing node. When considering a star network (master-slave network) structure [21] in the DOCMC algorithm (where $e_{ij}$ does not occur, as each agent independently exchanges information with the server), simplification can be achieved using (21), where $g(k)$ represents the average derivative and $h(k)$ directly represents $\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}f_{i}(x_{i}(k))$.

  • To obtain the average gradient, Step 3 in Algorithm 2 is necessary. In practical applications, after evaluating the performance of the agents, a suitable consensus protocol can be chosen: if agents have ample memory, memory-intensive consensus protocols like the DSF algorithm [24] can be employed; if agents have limited memory, a finite-time average consensus protocol [25] can be used; and in cases where agent performance is poor, traditional consensus protocols [7], [20] can still be applied (a sketch of such a protocol is given after this list). $h_{i}(k)$ in the DOAOC algorithm can be obtained not only through the above consensus protocols but can also be approximated by first-order derivative differences. In fact, in our recent work [23], Algorithm III utilizes a first-order derivative difference method to obtain the Hessian matrix.
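The following is a minimal sketch (not from the paper; the step size bound and iteration count are illustrative assumptions) of a classical Laplacian-based average consensus protocol in the style of [20], which could be run on the stacked local gradients to approximate $g(k)$ in Step 3:

```python
import numpy as np

def average_consensus(z0, A, eps, iters=200):
    # z0: (n, p) array of local values; A: adjacency matrix of a balanced,
    # strongly connected digraph; eps should be smaller than 1/(max degree).
    z = z0.copy()
    L = np.diag(A.sum(axis=1)) - A       # Laplacian of the communication graph
    for _ in range(iters):
        z = z - eps * (L @ z)            # z_i <- z_i - eps * sum_j a_ij (z_i - z_j)
    return z                             # each row approaches z0.mean(axis=0)
```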

IV CONVERGENCE ANALYSIS

In this section, we conduct a convergence analysis of the proposed algorithms. The superlinear convergence rate demonstrates the superiority of our optimal control-based algorithms. The following assumption is instrumental for drawing our conclusions.

Assumption 2

The local objective functions $f_{i}$, $i=1,\ldots,n$, are twice continuously differentiable, and there exist constants $0<m_{1}\leq m_{2}<\infty$ such that for any $x\in\mathbb{R}^{n}$, $m_{1}I\preceq\nabla^{2}f_{i}(x)\preceq m_{2}I$.

Remark 6

Assumption 2 is standard in convergence analysis; see [26]. The lower bound $m_{1}$ on $\nabla^{2}f_{i}(x)$ implies that the local objective function $f_{i}$ is strongly convex, and thus $f_{i}$ has a unique global minimum point. The upper bound $m_{2}$ on $\nabla^{2}f_{i}(x)$ means that the local objective function $f_{i}$ is smooth.

Let $x^{*}$ be the minimum point of $F(x)$, and write $g^{*}=g(x^{*})$ and $h^{*}=h(x^{*})$. Now we can proceed with the convergence analysis of our algorithms.

For the DOCMC algorithm, we can compactly write the update iteration (19) as

$x(k+1)=x(k)-[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k).$ (24)
Lemma 4

Under Assumption 2, the DOCMC algorithm converges if the weight matrix $R$ is selected appropriately.

Proof:

Firstly, we prove that $x(k)$ converges to $\overline{x}(k)$, where $\overline{x}(k)=(\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T})x(k)$:

$\|x(k+1)-\overline{x}(k+1)\|$
$=\|x(k)-[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k)-\overline{x}(k)+[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)\|$
$=\|x(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k)-\overline{x}(k)\|$
$=\|[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}\Gamma_{P}^{-1}\bar{L}]x(k)-\overline{x}(k)\|.$ (25)

Let $\epsilon=((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}\Gamma_{P}^{-1}$. Recalling that $\Gamma_{P}>0$ and $h(k)>0$, it follows immediately from [27] that every eigenvalue of $(\Gamma_{P}+h(k))^{-1}\Gamma_{P}$ is positive and less than $1$. Accordingly, all eigenvalues of $((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}$ are positive and its spectral radius is less than $1$. It should be noted that $((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}$ is a symmetric matrix, as proved in [23]. Therefore, $W=I-\epsilon\bar{L}$ satisfies the properties of $P_{\epsilon}$ in [20]. According to [19], we then have

$\|x(k+1)-\overline{x}(k+1)\|=\|(W-\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{T})(x(k)-\overline{x}(k))\|\leq\sigma\|x(k)-\overline{x}(k)\|,$ (26)

where $\sigma\in(0,1)$. This means that $x(k)$ converges to $\overline{x}(k)$ as $k\rightarrow\infty$. Let $y(k)=\overline{x}(k)$; next we show that $y(k)$ converges to $x^{*}$. Using $y(k)$ in the update iteration (24), we have

$y(k+1)=y(k)-[I-((\Gamma_{P}+h(k))^{-1}R)^{k+1}]h(k)^{-1}g(k),$ (27)

which is consistent with the form of Algorithm 1 in [23]. Thus, there holds

$F(y(k+1))-F(x^{*})\leq c(F(y(k))-F(x^{*})),$ (28)

where $0\leq c<1$; more details can be found in the proof of Lemma 2 in [23]. The proof is completed. ∎

Now we are in a position to discuss the convergence rate of the DOCMC algorithm.

Theorem 1

Under Assumption 2, if $\{x(k)\}$ generated by the DOCMC algorithm is convergent, then there exists a scalar $r_{1}>0$ such that

$\|x(k+1)-x^{*}\|\leq r_{1}\rho((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}\|x(k)-x^{*}\|,$ (29)

that is to say, the DOCMC algorithm is superlinearly convergent.

Proof:

Denote

$z_{k}(x(k))=-[I-((\Gamma_{P}+h(k))^{-1}\Gamma_{P})^{k+1}]h(k)^{-1}g(k)-[(\Gamma_{P}+h(k))^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe(k).$ (31)

It is evident from (31) that

$z_{k}(x^{*})=-[I-((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1}]{h^{*}}^{-1}g^{*}-[(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}Pe^{*}=0,$ (32)
$z_{k}^{\prime}(x^{*})=-(I-((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1})-[(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}]^{k+1}\Gamma_{P}^{-1}B^{T}PB.$ (33)

From (24), one has

$x(k+1)-x^{*}=x(k)-x^{*}+z_{k}(x(k))$
$\approx x(k)-x^{*}+[z_{k}(x^{*})+z_{k}^{\prime}(x^{*})(x(k)-x^{*})]$
$=x(k)-x^{*}+z_{k}^{\prime}(x^{*})(x(k)-x^{*})$
$=[((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1}\Gamma_{P}^{-1}R](x(k)-x^{*}).$ (34)

Noting that $\Gamma_{P}$ is positive definite, there exists a matrix $C$ satisfying $\Gamma_{P}=C^{\prime}C$. To this end, $(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}=(\Gamma_{P}+h^{*})^{-1}C^{\prime}C=C^{-1}C(\Gamma_{P}+h^{*})^{-1}C^{\prime}C$. The second equality indicates $(\Gamma_{P}+h^{*})^{-1}\Gamma_{P}=\bar{C}^{-1}\bar{\Lambda}\bar{C}$, where $\bar{\Lambda}$ is a diagonal matrix. Due to $\rho((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})<1$, it immediately follows that $\rho(\bar{\Lambda})=\rho((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})<1$ and

$\|x(k+1)-x^{*}\|=\|\bar{C}^{-1}\bar{\Lambda}^{k+1}\bar{C}\Gamma_{P}^{-1}R(x(k)-x^{*})\|$
$\leq\rho((\Gamma_{P}+h^{*})^{-1}\Gamma_{P})^{k+1}\|\bar{C}^{-1}\|_{m}\|\bar{C}\|_{m}\|\Gamma_{P}^{-1}R\|_{m}\|x(k)-x^{*}\|.$ (35)

By letting $r_{1}=\|\bar{C}^{-1}\|_{m}\|\bar{C}\|_{m}\|\Gamma_{P}^{-1}R\|_{m}$, (29) is obtained accordingly. The proof is completed. ∎

Now, the superlinear convergence of the DOAOC algorithm is proved.

Theorem 2

Under Assumption 2, selecting $0<\eta<1$ such that $c=\|I-\eta h^{*}\|<1$, if $\{x(k)\}$ generated by the DOAOC algorithm converges, then there exists a scalar $r_{2}>0$ such that

$\|x(k+1)-x^{*}\|\leq r_{2}c^{k}\|x(k)-x^{*}\|.$ (36)

That is to say, the DOAOC algorithm is superlinearly convergent.

Proof:

Direct derivation shows

$\bar{d}_{k}(x^{*})=-[I-(I-\eta h^{*})^{k+1}]{h^{*}}^{-1}g^{*}-((I-\eta h^{*})^{k}\eta B^{T}P)e^{*}=0,$ (37)
$\bar{d}^{\prime}_{k}(x^{*})=-(I-(I-\eta h^{*})^{k+1})-\eta(I-\eta h^{*})^{k}B^{T}PB.$ (38)

Let $G=I-\eta(h^{*}+B^{T}PB)$. From (22), it follows that

$x(k+1)-x^{*}=x(k)-x^{*}+\bar{d}_{k}(x(k))$
$\approx x(k)-x^{*}+[\bar{d}_{k}(x^{*})+\bar{d}^{\prime}_{k}(x^{*})(x(k)-x^{*})]$
$=x(k)-x^{*}+\bar{d}^{\prime}_{k}(x^{*})(x(k)-x^{*})$
$=(I+\bar{d}^{\prime}_{k}(x^{*}))(x(k)-x^{*})$
$=((I-\eta h^{*})^{k+1}-\eta(I-\eta h^{*})^{k}B^{T}PB)(x(k)-x^{*})$
$=[(I-\eta h^{*})^{k}G](x(k)-x^{*}).$ (39)

Since $\|I-\eta h^{*}\|<1$ when $\eta<2/m_{2}$, it follows that $\|x(k+1)-x^{*}\|\leq\|I-\eta h^{*}\|^{k}\|G\|_{m}\|x(k)-x^{*}\|$. By letting $r_{2}=\|G\|_{m}$, the superlinear convergence of the DOAOC algorithm is obtained. The proof is completed. ∎

V CONCLUSIONS

This paper has proposed distributed second-order optimization algorithms with global superlinear convergence from the viewpoint of optimal control theory. The approach successfully integrates second-order information into distributed optimization without requiring the inversion of the Hessian matrix. A connection between the proposed algorithms and traditional optimization methods has been revealed, and the convergence of the algorithms has been rigorously analyzed.

References

  • [1] D. Belegundu, and R. Chandrupatla. Optimization concepts and applications in engineering. Journal of Management, 52(11): 56, 2000.
  • [2] R. Eini and S. Abdelwahed. Distributed model predictive control for intelligent traffic systems. International Conference on Internet of Things (iThings), 909-915, 2019.
  • [3] K. Molzahn, F. Dörfler, H. Sandberg, H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei. A survey of distributed optimization and control algorithms for electric power systems. IEEE Transactions on Smart Grid, 8(6): 2941-2962, 2017.
  • [4] A. Olshevsky. Efficient Information Aggregation for Distributed Control and Signal Processing. Ph.D. dissertation, MIT, Cambridge, MA, 2010.
  • [5] Y. Xue, B. Li, and K. Nahrstedt. Optimal resource allocation in wireless ad hoc networks: a price-based approach. IEEE Transactions on Mobile Computing, 5(4):347–364, 2006
  • [6] P. Wan, and D. Lemmon. Event-triggered distributed optimization in sensor networks. International Conference on Information Processing in Sensor Networks, 49-60, 2009.
  • [7] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1): 48–61, 2009.
  • [8] A. Nedić, A. Ozdaglar, and A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4): 922-938, 2010.
  • [9] R. Islamov, X. Qian, and P. Richtárik. Distributed second order methods with fast rates and compressed communication. Proceedings of the 38th International Conference on Machine Learning, 139: 4617-4628, 2021.
  • [10] E. Wei, A. Ozdaglar, and A. Jadbabaie. A distributed Newton method for network utility maximization-I: Algorithm. IEEE Transactions on Automatic Control, 58(9): 2162-2175, 2013.
  • [11] E. Wei, A. Ozdaglar, and A. Jadbabaie. A distributed Newton method for network utility maximization-II: Convergence. IEEE Transactions on Automatic Control, 58(9): 2176-2188, 2013.
  • [12] J. Zhang, K. You, and T. Başar. Distributed adaptive Newton methods with global superlinear convergence. Automatica, 138: 110156, 2022.
  • [13] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato. Newton-Raphson consensus for distributed convex optimization. IEEE Transactions on Automatic Control, 61(4): 994-1009, 2016.
  • [14] H. Zhang, and H. Wang. Optimization methods rooting in optimal control. arXiv preprint arXiv:2312.01334, 1-8, 2024.
  • [15] V. D. Blondel, J. M. Hendrickx, A. Olshevsky, and J. N. Tsitsiklis. Convergence in multiagent coordination, consensus, and flocking. Proceedings of the 44th IEEE Conference on Decision and Control, pp. 2996-3000, 2005. IEEE.
  • [16] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • [17] L. Zhang, J. Xu, H. Zhang, and L. Xie. Distributed Optimal Control and Application to Consensus of Multi-Agent Systems. arXiv preprint arXiv:2309.12577, 2023.
  • [18] H. Zhang, H. Wang, and L. Li. Adapted and causal maximum principle and analytical solution to optimal control for stochastic multiplicative-noise systems with multiple input-delays. in Proc. 51st IEEE Conf. Decision Control, Maui, HI, USA, pp. 2122–2127, 2012.
  • [19] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245-1260, 2017. IEEE.
  • [20] R. Olfati-Saber and R. M. Murray. Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1520-1533, 2004. IEEE.
  • [21] S. Soori, K. Mishchenko, A. Mokhtari, M. M. Dehnavi, and M. Gurbuzbalaban. DAve-QN: A distributed averaged quasi-Newton method with local superlinear convergence rate. International Conference on Artificial Intelligence and Statistics, pp. 1965-1976, 2020. PMLR.
  • [22] Y. Xu, Z. Guo, K. Lu, and H. Zhang. Distributed Optimization Algorithm with Superlinear Convergence Rate. arXiv preprint arXiv:2409.12392, 2024.
  • [23] H. Wang, Y. Xu, Z. Guo, and H. Zhang. Superlinear Optimization Algorithms. arXiv preprint arXiv:2403.11115, 2024.
  • [24] J. Zhang, K. You, and T. Başar. Distributed adaptive Newton methods with global superlinear convergence. Automatica, vol. 138, pp. 110156, 2022. Elsevier.
  • [25] T. Charalambous and C. N. Hadjicostis. Laplacian-based matrix design for finite-time average consensus in digraphs. 2018 IEEE Conference on Decision and Control (CDC), pp. 3654-3659, 2018. IEEE.
  • [26] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • [27] S.G. Wang, M.X. Wu, and Z.Z. Jia. Matrix inequalities. Chinese Science Press, Beijing, 2006.