

Distributed Algorithms for Linearly-Solvable Optimal Control in Networked Multi-Agent Systems

Neng Wan1, Aditya Gahlawat1, Naira Hovakimyan1,
Evangelos A. Theodorou2, and Petros G. Voulgaris3
Abstract

Distributed algorithms for both discrete-time and continuous-time linearly solvable optimal control (LSOC) problems of networked multi-agent systems (MASs) are investigated in this paper. A distributed framework is proposed to partition the optimal control problem of a networked MAS into several local optimal control problems in factorial subsystems, such that each (central) agent behaves optimally to minimize the joint cost function of a subsystem that comprises a central agent and its neighboring agents, and the local control actions (policies) only rely on the knowledge of local observations. Under this framework, we not only preserve the correlations between neighboring agents, but moderate the communication and computational complexities by decentralizing the sampling and computational processes over the network. For discrete-time systems modeled by Markov decision processes, the joint Bellman equation of each subsystem is transformed into a system of linear equations and solved using parallel programming. For continuous-time systems modeled by Itô diffusion processes, the joint optimality equation of each subsystem is converted into a linear partial differential equation, whose solution is approximated by a path integral formulation and a sample-efficient relative entropy policy search algorithm, respectively. The learned control policies are generalized to solve the unlearned tasks by resorting to the compositionality principle, and illustrative examples of cooperative UAV teams are provided to verify the effectiveness and advantages of these algorithms.

1 Neng Wan, Aditya Gahlawat, and Naira Hovakimyan are with the Department of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801. {nengwan2, gahlawat, nhovakim}@illinois.edu.
2Evangelos A. Theodorou is with the Department of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332. [email protected].
3Petros G. Voulgaris is with the Department of Mechanical Engineering, University of Nevada, Reno, NV 89557. [email protected].

Keywords: Multi-Agent Systems, Linearly-Solvable Optimal Control, Path Integral Control, Relative Entropy Policy Search, Compositionality.

Introduction

The research on control and planning in multi-agent systems (MASs) has been developing rapidly during the past decade with growing demands from areas such as cooperative vehicles [1, 2], the Internet of Things [3], intelligent infrastructures [4, 5, 6], and smart manufacturing [7]. Distinct from other control problems, control of MASs is characterized by issues and challenges that include, but are not limited to, a great diversity of possible planning and execution schemes, limited information and resources of the local agents, constraints and randomness of communication networks, and the optimality and robustness of joint performance. A good summary of recent progress in multi-agent control can be found in [8, 9, 10, 11, 12, 13, 14]. Building upon these results and challenges, this paper puts forward a distributed optimal control scheme for stochastic MASs by extending linearly-solvable optimal control algorithms to MASs subject to stochastic dynamics in the presence of an explicit communication network and limited feedback information.

Linearly-solvable optimal control (LSOC) generally refers to the model-based stochastic optimal control (SOC) problems that can be linearized and solved with the facilitation of the Cole-Hopf transformation, i.e. an exponential transformation of the value function [15, 16]. Compared with other model-based SOC techniques, since LSOC formulates the optimality equations in linear form, it enjoys the advantages of an analytical solution [17] and the superposition principle [18], which make LSOC a popular control scheme for robotics [19, 20]. The LSOC technique was first introduced to linearize and solve the Hamilton–Jacobi–Bellman (HJB) equation for continuous-time SOC problems [15], and the application to discrete-time SOC, also known as the linearly-solvable Markov decision process (LSMDP), was initially studied in [16]. More recent progress on single-agent LSOC problems can be found in [21, 22, 23, 24, 17, 25].

Different from many prevailing distributed control algorithms [10, 11, 12, 13], such as consensus and synchronization, which usually assume a given behavior, multi-agent SOC allows agents to have different objectives and optimizes the action choices for more general scenarios [8]. Nevertheless, it is not straightforward to extend single-agent SOC methods to multi-agent problems. The exponential growth of dimensionality in MASs and the consequent surges in computation and data storage demand more sophisticated and preferably distributed planning and execution algorithms. The involvement of communication networks (and constraints) requires the multi-agent SOC algorithms to achieve stability and optimality subject to local observation and more involved cost functions. While the multi-agent Markov decision process (MDP) problem has received plenty of attention from both the fields of computer science and control engineering [26, 27, 8, 28, 14, 29], there are relatively fewer results focused on multi-agent LSMDP. A recent result on multi-agent LSMDP represented the MAS problem as a single-agent problem by stacking the states of all agents into a joint state vector, and the scalability of the problem was addressed by parameterizing the value function [30]; however, since the planning and execution of the control action demand the knowledge of global states as well as a centralized coordination, the parallelization scheme of the algorithm was deferred in that paper. While there are more existing results focused on multi-agent LSOC in the continuous-time setting, most of these algorithms still depend on the knowledge of the global states, i.e. a fully connected communication network, which may not be feasible or affordable to attain in practice. Some multi-agent LSOC algorithms also assume that the joint cost function can be factorized over agents, which essentially reduces the multi-agent control problem to multiple single-agent problems, so that some features and advantages of MASs are forfeited. Broek et al. investigated the multi-agent LSOC problem for continuous-time systems governed by Itô diffusion processes [31]; a path integral formula was put forward to approximate the optimal control actions, and a graphical model inference approach was adopted to predict the optimal path distribution; nonetheless, the optimal control law assumed an accurate and complete knowledge of global states, and the inference was performed on the basis of a mean-field approximation, which assumes that the cost function can be disjointly factorized over agents and ignores the correlations between agents. A distributed LSOC algorithm with infinite horizon and discounted cost was studied in [32] for solving a distance-based formation problem of a nonholonomic vehicular network without an explicit communication topology. The multi-agent LSOC problem was also recently discussed in [25] as an accessory result for a novel single-agent LSOC algorithm; an augmented dynamical model was built by stacking the dynamics of all agents, and a single-agent LSOC algorithm was then applied to the augmented system. Similar to the discrete-time result in [30], the continuous-time result resorting to augmented dynamics also presumes a fully connected network and faces the challenge that the computation and sampling schemes originating from the single-agent problem may become inefficient and possibly fail as the dimensions of the augmented state and control grow exponentially with the number of agents.

To address the aforementioned challenges, this paper investigates the distributed LSOC algorithms for discrete-time and continuous-time MASs with consideration of local observation, correlations between neighboring agents, efficient sampling and parallel computing. A distributed framework is put forward to partition the connected network into multiple factorial subsystems, each of which comprises a (central) agent and its neighboring agents, such that the local control action of each agent, depending on the local observation, optimizes the joint cost function of a factorial subsystem, and the sampling and computational complexities of each agent are related to the size of the factorial subsystem instead of the entire network. Sampling and computation are parallelized to expedite the algorithms and exploit the resource in network, with state measurements, intermediate solutions, and sampled data exchanged over the communication network. For discrete-time multi-agent LSMDP problem, we linearize the joint Bellman equation of each factorial subsystem into a system of linear equations, which can be solved with parallel programming, making both the planning and execution phases fully decentralized. For continuous-time multi-agent LSOC problem, instead of adopting the mean-field assumption and ignoring the correlations between neighboring agents, joint cost functions are permitted in the subsystems; the joint optimality equation of each subsystem is first cast into a joint stochastic HJB equation, and then solved with a distributed path integral control method and a sample-efficient relative entropy policy search (REPS) method, respectively. The compositionality of LSOC is utilized to efficiently generate a composite controller for unlearned task from the existing controllers for learned tasks. Illustrative examples of coordinated UAV teams are presented to verify the effectiveness and advantages of multi-agent LSOC algorithms. Building upon our preliminary work on distributed path integral control for continuous-time MASs [33], this paper not only integrates the distributed LSOC algorithms for both discrete-time and continuous-time MASs, but supplements the previous result with a distributed LSMDP algorithm for discrete-time MASs, a distributed REPS algorithm for continuous-time MASs, a compositionality algorithm for task generalization, and more illustrative examples.

The paper is organized as follows: Section 2 introduces the preliminaries and formulations of multi-agent LSOC problems; Section 3 presents the distributed LSOC algorithms for discrete-time and continuous-time MASs, respectively; Section 4 shows the numerical examples, and Section 5 draws the conclusions. Some notations used in this paper are defined as follows: For a set $\mathcal{S}$, $|\mathcal{S}|$ represents the cardinality of the set $\mathcal{S}$; for a matrix $X$ and a vector $v$, $\det X$ denotes the determinant of matrix $X$, and the weighted square norm is $\|v\|^{2}_{X}:=v^{\top}Xv$.

Preliminaries and Problem Formulation

Preliminaries on MASs and LSOC are introduced in this section. The communication networks underlying MASs are represented by graphs, and the discrete-time and continuous-time LSOC problems are extended from the single-agent scenario to MASs under a distributed planning and execution framework.

Multi-Agent Systems and Distributed Framework

For a MAS consisting of $N\in\mathbb{N}$ agents, $\mathcal{D}=\{1,2,\cdots,N\}$ denotes the index set of agents, and the communication network among agents is described by an undirected graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, which implies that the communication channel between any two agents is bilateral. The communication network $\mathcal{G}$ is assumed to be connected. An agent $i\in\mathcal{D}$ in the network is denoted as a vertex $v_{i}\in\mathcal{V}=\{v_{1},v_{2},\cdots,v_{N}\}$, and an undirected edge $(v_{i},v_{j})\in\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$ in graph $\mathcal{G}$ implies that agents $i$ and $j$ can measure the states of each other, where the state of agent $i\in\mathcal{D}$ is denoted by $x_{i}=[x_{i(1)},x_{i(2)},\cdots,x_{i(M)}]^{\top}\in\mathbb{R}^{M}$. Agents $i$ and $j$ are neighboring or adjacent agents if there exists a communication channel between them, and the index set of agents neighboring agent $i$ is denoted by $\mathcal{N}_{i}$ with $\bar{\mathcal{N}}_{i}=\mathcal{N}_{i}\cup\{i\}$. We use the column vector $\bar{x}_{i}=[x_{i}^{\top},x^{\top}_{j\in\mathcal{N}_{i}}]^{\top}\in\mathbb{R}^{M\cdot|\mathcal{\bar{N}}_{i}|}$ to denote the joint state of agent $i$ and its adjacent agents $\mathcal{N}_{i}$, which together form the factorial subsystem $\bar{\mathcal{N}}_{i}$ of agent $i$, and $x=[x_{1}^{\top},x_{2}^{\top},\cdots,x^{\top}_{N}]^{\top}\in\mathbb{R}^{M\cdot N}$ denotes the global state of the MAS. Figure 1 shows a MAS and all its factorial subsystems.
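To make the partition concrete, the following minimal Python sketch builds the factorial subsystems $\bar{\mathcal{N}}_{i}$ from an adjacency list of the undirected communication graph; the four-agent topology used here is only an assumed example consistent with the description of Figure 1.

# Sketch: factorial subsystems N_bar_i = N_i ∪ {i} from an undirected graph.
# The four-agent topology below is an assumed example (agent 3 linked to all).
adjacency = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}

def factorial_subsystems(adj):
    # closed neighborhood of each agent: the agent itself plus its neighbors
    return {i: sorted(set(nbrs) | {i}) for i, nbrs in adj.items()}

print(factorial_subsystems(adjacency))
# {1: [1, 2, 3], 2: [1, 2, 3], 3: [1, 2, 3, 4], 4: [3, 4]}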

To optimize the correlations between neighboring agents while not intensifying the communication and computational complexities of the network or each agent, this paper proposes a distributed planning and execution framework, as a trade-off between the ease of implementation and the optimality of performance, for multi-agent LSOC problems. Under this distributed framework, the local control action $u_{i}$ of agent $i\in\mathcal{D}$ is computed by solving the local LSOC problem defined in the factorial subsystem $\bar{\mathcal{N}}_{i}$. Instead of requiring the cost functions to be fully factorized over agents [31] or the knowledge of global states [30, 25], joint cost functions are permitted in every factorial subsystem, which captures the correlations and cooperation between neighboring agents, and the local control action $u_{i}$ only relies on the local observation $\bar{x}_{i}$ of agent $i$, which also simplifies the structure of the communication network. Meanwhile, the global computational complexity is no longer exponential with respect to the total number of agents in the network $|\mathcal{D}|$, but becomes linear with respect to it in general, and the computational complexity of local agent $i$ is only related to the number of agents in the factorial subsystem $\bar{\mathcal{N}}_{i}$. However, since the local control actions are computed from the local observations with only partial information, the distributed LSOC law obtained under this framework is usually a sub-optimal solution, although it coincides with the global optimal solution when the communication network is fully connected. More explanations and discussions on this will be given at the end of this section. Before that, we first reformulate the discrete-time and continuous-time LSOC problems from the single-agent scenario to a multi-agent setting that is compatible with our distributed framework.

Figure 1: An example of a MAS and its factorial subsystems. The MAS $\mathcal{G}$ with four agents can be partitioned into four factorial subsystems $\bar{\mathcal{N}}_{1},\bar{\mathcal{N}}_{2},\bar{\mathcal{N}}_{3}$, and $\bar{\mathcal{N}}_{4}$, and each subsystem is assumed to be fully connected.

Discrete-Time Dynamics

Discrete-time SOC for single-agent systems, also known as the single-agent MDP, is briefly reviewed and then generalized to the networked MAS scenario. We consider single-agent MDPs with finite state space and continuous control space in the first-exit setting. For a single agent $i\in\mathcal{D}$, the state variable $x_{i}$ belongs to a finite set $\mathcal{S}_{i}=\{s^{i}_{1},s^{i}_{2},\cdots\}=\mathcal{I}_{i}\cup\mathcal{B}_{i}$, which may be generated from an infinite-dimensional state problem by an appropriate coding scheme [34]. $\mathcal{I}_{i}\subset\mathcal{S}_{i}$ denotes the set of interior states of agent $i$, and $\mathcal{B}_{i}\subset\mathcal{S}_{i}$ denotes the set of boundary states. Without communication and interference from other agents, the passive dynamics of agent $i$ follows the probability distribution

$x^{\prime}_{i}\sim p_{i}(\cdot|x_{i}),$

where $x_{i},x^{\prime}_{i}\in\mathcal{S}_{i}$, and $p_{i}(x^{\prime}_{i}|x_{i})$ denotes the transition probability from state $x_{i}$ to $x^{\prime}_{i}$. When taking control action $u_{i}$ at state $x_{i}$, the controlled dynamics of agent $i$ is described by the distribution mapping

$x_{i}^{\prime}\sim u_{i}(\cdot|x_{i})=p_{i}(\cdot|x_{i},u_{i}),$   (1)

where $u_{i}(x^{\prime}_{i}|x_{i})$ or $p_{i}(x^{\prime}_{i}|x_{i},u_{i})$ denotes the transition probability from state $x_{i}$ to state $x^{\prime}_{i}$ subject to control $u_{i}$, which belongs to a continuous space. We require that $u_{i}(x^{\prime}_{i}|x_{i})=0$ whenever $p_{i}(x^{\prime}_{i}|x_{i})=0$, to prevent direct transitions to the goal states. When $x_{i}\in\mathcal{I}_{i}\subset\mathcal{S}_{i}$, the immediate or running cost function of LSMDPs is designed as:

$c_{i}(x_{i},u_{i})=q_{i}(x_{i})+\textrm{KL}(u_{i}(\cdot|x_{i})\parallel p_{i}(\cdot|x_{i})),$   (2)

where the state cost $q_{i}(x_{i})$ can be an arbitrary function encoding how (un)desirable different states are, and the KL-divergence measures the cost of control actions. (The KL-divergence, or relative entropy, between two discrete probability mass functions $p(x)$ and $q(x)$ is defined as $\textrm{KL}(p\parallel q)=\sum_{x\in\mathcal{X}}p(x)\log[p(x)/q(x)]$, which has an absolute minimum of $0$ when $p(x)=q(x),\forall x\in\mathcal{X}$. For two continuous probability density functions $p(x)$ and $q(x)$, the KL-divergence is defined as $\textrm{KL}(p\parallel q)=\int_{x\in\chi}p(x)\log[p(x)/q(x)]dx$.) When $x_{i}\in\mathcal{B}_{i}\subset\mathcal{S}_{i}$, the final cost function is defined as $\phi_{i}(x_{i})\geq 0$. The cost-to-go function of the first-exit problem starting at state-time pair $(x_{i}^{t_{0}},t_{0})$ is defined as

$J^{u_{i}}_{i}(x^{t_{0}}_{i},t_{0})=\mathbb{E}^{u_{i}}\big[\phi_{i}(x_{i}^{t_{f}})+\sum_{\tau=t_{0}}^{t_{f}-1}c_{i}(x_{i}^{\tau},u^{\tau}_{i})\big],$   (3)

where $(x^{\tau}_{i},u_{i}^{\tau})$ is the state-action pair of agent $i$ at time step $\tau$, $x^{t_{f}}_{i}$ is the terminal or exit state, and the expectation $\mathbb{E}^{u_{i}}$ is taken with respect to the probability measure under which $x_{i}$ satisfies (1) given the control law $u_{i}=(u^{t_{0}}_{i},u^{t_{1}}_{i},\cdots,u^{t_{f}-1}_{i})$ and initial condition $x_{i}^{t_{0}}$. The objective of the discrete-time stochastic optimal control problem is to find the optimal policy $u_{i}^{*}$ and value function $V_{i}(x_{i})$ by solving the Bellman equation

$V_{i}(x_{i})=\min_{u_{i}}\left\{c_{i}(x_{i},u_{i})+\mathbb{E}_{x^{\prime}_{i}\sim u_{i}(\cdot|x_{i})}[V_{i}(x^{\prime}_{i})]\right\},$   (4)

where the value function $V_{i}(x_{i})$ is defined as the expected cumulative cost for starting at state $x_{i}$ and acting optimally thereafter, i.e. $V_{i}(x_{i})=\min_{u_{i}}J^{u_{i}}_{i}(x^{t_{0}}_{i},t_{0})$.
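As a small numerical illustration of the running cost (2), the following sketch (with hypothetical transition rows) evaluates $q_{i}(x_{i})+\textrm{KL}(u_{i}(\cdot|x_{i})\parallel p_{i}(\cdot|x_{i}))$ at one state, enforcing the requirement that $u_{i}$ vanishes wherever $p_{i}$ does.

import numpy as np

# Sketch of the LSMDP running cost (2) at one state; the passive row p_row,
# controlled row u_row, and state cost q_x are hypothetical numbers.
def running_cost(q_x, u_row, p_row):
    assert np.all(u_row[p_row == 0] == 0), "u must vanish wherever p vanishes"
    support = u_row > 0                      # 0 * log(0/..) is taken as 0
    kl = np.sum(u_row[support] * np.log(u_row[support] / p_row[support]))
    return q_x + kl

p_row = np.array([0.5, 0.5, 0.0])            # passive dynamics p_i(.|x_i)
u_row = np.array([0.8, 0.2, 0.0])            # controlled dynamics u_i(.|x_i)
print(running_cost(1.0, u_row, p_row))       # q(x) + KL(u || p)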

Based on the formulations of single-agent LSMDP and augmented dynamics, we introduce a multi-agent LSMDP formulation subject to the distributed framework of the factorial subsystem. For simplicity, we assume that the passive dynamics of agents in the MAS are homogeneous and mutually independent, i.e. agents without control are governed by identical dynamics and do not interfere or collide with each other. This assumption is also posited in many previous papers on multi-agent LSOC or distributed control [35, 31, 11, 25]. Since agent $i\in\mathcal{D}$ can only observe the states of neighboring agents $\mathcal{N}_{i}$, we are interested in the subsystem $\bar{\mathcal{N}}_{i}=\mathcal{N}_{i}\cup\{i\}$ when computing the control law of agent $i$. Hence, the autonomous dynamics of subsystem $\bar{\mathcal{N}}_{i}$ follow the distribution mapping

$\bar{x}^{\prime}_{i}\sim\bar{p}_{i}(\cdot|\bar{x}_{i})=\prod_{j\in{\mathcal{\bar{N}}}_{i}}p_{j}(\cdot|x_{j}),$   (5)

where the joint state $\bar{x}_{i},\bar{x}^{\prime}_{i}\in\prod_{j\in\mathcal{\bar{N}}_{i}}\mathcal{S}_{j}=\bar{\mathcal{S}}_{i}=\{\bar{s}^{i}_{1},\bar{s}^{i}_{2},\cdots\}=\bar{\mathcal{I}}_{i}\cup\bar{\mathcal{B}}_{i}$, and the distribution information $\bar{p}_{i}(\cdot|\bar{x}_{i})$ is generally only accessible to agent $i$. Similarly, the global state of all agents satisfies $x^{\prime}\sim p(\cdot|x)=\prod_{i=1}^{N}p_{i}(\cdot|x_{i})$, which is usually not available to a local agent $i$ unless it can receive the information from all other agents $\mathcal{D}\backslash\{i\}$, e.g. agent 3 in Figure 1. Since the local control action $u_{i}$ only relies on the local observation of agent $i$, we assume that the joint posterior state $\bar{x}^{\prime}_{i}$ of subsystem $\bar{\mathcal{N}}_{i}$ is exclusively determined by the joint prior state $\bar{x}_{i}$ and joint control $\bar{u}_{i}$ when computing the optimal control actions in subsystem $\bar{\mathcal{N}}_{i}$. More intuitively, under this assumption, the local LSOC algorithm in subsystem $\bar{\mathcal{N}}_{i}$ only requires the measurement of the joint state $\bar{x}_{i}$ and treats subsystem $\bar{\mathcal{N}}_{i}$ as a completely connected network, as shown in Figure 1. When each agent in $\mathcal{\bar{N}}_{i}$ samples its control action independently, the joint controlled dynamics of the factorial subsystem $\bar{\mathcal{N}}_{i}$ satisfies

$\bar{x}^{\prime}_{i}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})=\prod_{j\in\mathcal{\bar{N}}_{i}}u_{j}(\cdot|\bar{x}_{i})=\prod_{j\in\mathcal{\bar{N}}_{i}}p_{j}(\cdot|x_{i},x_{j\in\mathcal{N}_{i}},u_{j}),$   (6)

where the joint state $\bar{x}_{i}$ and joint distribution $\bar{u}_{i}(\cdot|\bar{x}_{i})$ are only accessible to agent $i$ in general. Once the joint control distribution $\bar{u}_{i}(\cdot|\bar{x}_{i})$ for subsystem $\bar{\mathcal{N}}_{i}$ is determined, the local control distribution $u_{i}(\cdot|\bar{x}_{i})$ of agent $i$ can be retrieved by calculating the marginal distribution. The joint immediate cost function for subsystem $\bar{\mathcal{N}}_{i}$ when $\bar{x}_{i}\in\mathcal{\bar{I}}_{i}$ is defined as follows

$c_{i}(\bar{x}_{i},\bar{u}_{i})=q_{i}(\bar{x}_{i})+\textrm{KL}(\bar{u}_{i}(\cdot|\bar{x}_{i})\parallel\bar{p}_{i}(\cdot|\bar{x}_{i}))=q_{i}(\bar{x}_{i})+\sum_{j\in\bar{\mathcal{N}}_{i}}\textrm{KL}(u_{j}(\cdot|\bar{x}_{i})\parallel p_{j}(\cdot|x_{j})),$   (7)

where the state cost $q_{i}(\bar{x}_{i})$ can be an arbitrary function of the joint state $\bar{x}_{i}$, e.g. a constant or the norm of a disagreement vector, and the second equality follows from (5) and (6), which implies that the joint control cost is the cumulative sum of the local control costs. When $\bar{x}_{i}\in\mathcal{\bar{B}}_{i}$, the exit cost function is $\phi_{i}(\bar{x}_{i})=\sum_{j\in\mathcal{\bar{N}}_{i}}\omega^{i}_{j}\cdot\phi_{j}(x_{j})$, where $\omega^{i}_{j}>0$ is a weight measuring the priority of the assignment on agent $j$. In order to improve the success rate in application, it is preferable to assign $\omega_{i}^{i}$ as the largest weight when computing the control distribution $\bar{u}_{i}$ in subsystem $\mathcal{\bar{N}}_{i}$. Subsequently, the joint cost-to-go function of the first-exit problem in subsystem $\bar{\mathcal{N}}_{i}$ becomes

$J^{\bar{u}_{i}}_{i}(\bar{x}^{t_{0}}_{i},t_{0})=\mathbb{E}^{\bar{u}_{i}}\big[\phi_{i}(\bar{x}_{i}^{t_{f}})+\sum_{\tau=t_{0}}^{t_{f}-1}c_{i}(\bar{x}_{i}^{\tau},\bar{u}^{\tau}_{i})\big].$   (8)

Some abuse of notation occurs when we define the cost functions $c_{i}$ and $\phi_{i}$, the cost-to-go function $J_{i}$, and the value function $V_{i}$ in the single-agent setting and the factorial subsystem; one can differentiate the two settings from the arguments of these functions. Derived from the single-agent Bellman equation (4), the joint optimal control action $\bar{u}_{i}^{*}(\cdot|\bar{x}_{i})$ subject to the joint cost function (7) can be solved from the following joint Bellman equation in subsystem $\bar{\mathcal{N}}_{i}$

$V_{i}(\bar{x}_{i})=\min_{\bar{u}_{i}}\left\{c_{i}(\bar{x}_{i},\bar{u}_{i})+\mathbb{E}_{\bar{x}^{\prime}_{i}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})}[V_{i}(\bar{x}^{\prime}_{i})]\right\},$   (9)

where $V_{i}(\bar{x}_{i})$ is the (joint) value function of the joint state $\bar{x}_{i}$. A linearization method as well as a parallel programming method for solving (9) will be discussed in Section 3.
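Because the passive and controlled dynamics in (5) and (6) factorize over the agents of a subsystem, the joint transition row over the joint next states is simply the product of the local rows; a minimal sketch with two hypothetical local transition rows:

import numpy as np
from functools import reduce

# Sketch of the factorization in (5): the joint passive-dynamics row over the
# subsystem's joint next states is the Kronecker product of the local rows.
# The two local rows below are hypothetical.
p1 = np.array([0.7, 0.3])          # p_1(.|x_1) over agent 1's next states
p2 = np.array([0.4, 0.5, 0.1])     # p_2(.|x_2) over agent 2's next states

p_joint = reduce(np.kron, [p1, p2])   # 6 joint next states, still sums to 1
print(p_joint, p_joint.sum())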

Continuous-Time Dynamics

For continuous-time LSOC problems, we first consider the dynamics of a single agent $i$, described by the following Itô diffusion process

$dx_{i}=f_{i}(x_{i},t)dt+B_{i}(x_{i})[u_{i}(x_{i},t)dt+\sigma_{i}dw_{i}],$   (10)

where $x_{i}\in\mathbb{R}^{M}$ is agent $i$'s state vector from an uncountable state space; $f_{i}(x_{i},t)+B_{i}(x_{i})\cdot u_{i}(x_{i},t)\in\mathbb{R}^{M}$ is the deterministic drift term with passive dynamics $f_{i}(x_{i},t)$, control matrix $B_{i}(x_{i})\in\mathbb{R}^{M\times P}$ and control action $u_{i}(x_{i},t)\in\mathbb{R}^{P}$; the noise $dw_{i}\in\mathbb{R}^{P}$ is a vector of possibly correlated Brownian components with zero mean and unit rate of variance, and the positive semi-definite matrix $\sigma_{i}\in\mathbb{R}^{P\times P}$ denotes the covariance of the noise $dw_{i}$. (When the components of $d\tilde{w}_{i}=[d\tilde{w}_{i,(1)},\cdots,d\tilde{w}_{i,(P)}]^{\top}$ are correlated and satisfy a multivariate normal distribution $N(0,\Sigma_{i})$, by using the Cholesky decomposition $\Sigma_{i}=\sigma_{i}\sigma^{\top}_{i}$, we can rewrite $d\tilde{w}_{i}=\sigma_{i}dw_{i}$, where $dw_{i}$ is a vector of Brownian components with zero drift and unit-variance rate.) When $x_{i}\in\mathcal{I}_{i}$, the running cost function is defined as

$c_{i}(x_{i},u_{i})=q_{i}(x_{i})+\dfrac{1}{2}u_{i}(x_{i},t)^{\top}R_{i}u_{i}(x_{i},t),$   (11)

where $q_{i}(x_{i})\geq 0$ is the state-related cost, and $u_{i}^{\top}R_{i}u_{i}$ is the control-quadratic term with the matrix $R_{i}\in\mathbb{R}^{P\times P}$ being positive definite. When $x^{t_{f}}_{i}\in\mathcal{B}_{i}$, the terminal cost function is $\phi_{i}(x^{t_{f}}_{i})$, where $t_{f}$ is the exit time. Hence, the cost-to-go function of the first-exit problem is defined as

$J^{u_{i}}_{i}(x_{i}^{t},t)=\mathbb{E}^{u_{i}}_{x_{i}^{t},t}\left[\phi_{i}(x_{i}^{t_{f}})+\int_{t}^{t_{f}}c_{i}(x_{i}(\tau),u_{i}(\tau))\,d\tau\right],$   (12)

where the expectation is taken with respect to the probability measure under which $x_{i}$ is the solution to (10) given the control law $u_{i}$ and initial condition $x_{i}(t)$. The value function is defined as the minimal cost-to-go function $V_{i}(x_{i},t)=\min_{u_{i}}J_{i}^{u_{i}}(x_{i}^{t},t)$. Subject to the dynamics (10) and running cost function (11), the optimal control action $u_{i}^{*}$ can be solved from the following single-agent stochastic Hamilton–Jacobi–Bellman (HJB) equation:

$-\partial_{t}V_{i}(x_{i},t)=\min_{u_{i}}\Big\{c_{i}(x_{i},u_{i})+[f_{i}(x_{i},t)+B_{i}(x_{i})u_{i}(x_{i},t)]^{\top}\cdot\nabla_{x_{i}}V_{i}(x_{i},t)+\frac{1}{2}\textrm{tr}\left[B_{i}(x_{i})\sigma_{i}\sigma^{\top}_{i}B_{i}(x_{i})^{\top}\cdot\nabla^{2}_{x_{i}x_{i}}V_{i}(x_{i},t)\right]\Big\},$   (13)

where $\nabla_{x_{i}}$ and $\nabla^{2}_{x_{i}x_{i}}$ respectively refer to the gradient and Hessian matrix, with $\nabla_{x_{i}}V_{i}=[\partial V_{i}/\partial x_{i(1)},\cdots,\partial V_{i}/\partial x_{i(M)}]^{\top}$ and elements $[\nabla^{2}_{x_{i}x_{i}}V_{i}]_{m,n}=\partial^{2}V_{i}/\partial x_{i(m)}\partial x_{i(n)}$. A few methods have been proposed to solve the stochastic HJB equation (13), such as the approximation methods via discrete-time MDPs or eigenfunctions [36] and path integral approaches [37, 31, 22].

Similar to the extension of the discrete-time LSMDP from the single-agent setting to the MAS, the joint continuous-time dynamics for factorial subsystem $\bar{\mathcal{N}}_{i}$ is described by

$d\bar{x}_{i}=\bar{f}_{i}(\bar{x}_{i},t)dt+\bar{B}_{i}(\bar{x}_{i})\left[\bar{u}_{i}(\bar{x}_{i},t)dt+\bar{\sigma}_{i}d\bar{w}_{i}\right],$   (14)

where the joint passive dynamics vector is $\bar{f}_{i}(\bar{x}_{i},t)=[f_{i}(x_{i},t)^{\top},f_{j\in\mathcal{N}_{i}}(x_{j},t)^{\top}]^{\top}\in\mathbb{R}^{M\cdot|\mathcal{\bar{N}}_{i}|}$, the joint control matrix is $\bar{B}_{i}(\bar{x}_{i})=\textrm{diag}\{B_{i}(x_{i}),B_{j\in\mathcal{N}_{i}}(x_{j})\}\in\mathbb{R}^{M\cdot|\bar{\mathcal{N}}_{i}|\times P\cdot|\bar{\mathcal{N}}_{i}|}$, $\bar{u}_{i}(\bar{x}_{i},t)=[u_{i}(\bar{x}_{i},t)^{\top},u_{j\in\mathcal{N}_{i}}(\bar{x}_{i},t)^{\top}]^{\top}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|}$ is the joint control action, $d\bar{w}_{i}=[dw^{\top}_{i},dw^{\top}_{j\in\mathcal{N}_{i}}]^{\top}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|}$ is the joint noise vector, and the joint covariance matrix is $\bar{\sigma}_{i}=\textrm{diag}\{\sigma_{i},\sigma_{j\in\mathcal{N}_{i}}\}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|\times P\cdot|\bar{\mathcal{N}}_{i}|}$. Analogous to the discrete-time scenario, we assume that the passive dynamics of agents are homogeneous and mutually independent, and for the local planning algorithm on agent $i$ or subsystem $\bar{\mathcal{N}}_{i}$, which computes the local control action $u_{i}(\bar{x}_{i},t)$ for agent $i$ and the joint control action $\bar{u}_{i}(\bar{x}_{i},t)$ in subsystem $\bar{\mathcal{N}}_{i}$, the evolution of the joint state $\bar{x}_{i}$ only depends on the current values of $\bar{x}_{i}$ and the joint control $\bar{u}_{i}(\bar{x}_{i},t)$. When $\bar{x}_{i}\in\mathcal{\bar{I}}_{i}$, the joint immediate cost function for subsystem $\bar{\mathcal{N}}_{i}$ is defined as

$c_{i}(\bar{x}_{i},\bar{u}_{i})=q_{i}(\bar{x}_{i})+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},t)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},t),$   (15)

where the state-related cost $q_{i}(\bar{x}_{i})$ can be an arbitrary function measuring the (un)desirability of different joint states $\bar{x}_{i}\in\mathcal{\bar{S}}_{i}$, and $\bar{u}_{i}^{\top}\bar{R}_{i}\bar{u}_{i}$ is the control-quadratic term with the matrix $\bar{R}_{i}\in\mathbb{R}^{P\cdot|\bar{\mathcal{N}}_{i}|\times P\cdot|\bar{\mathcal{N}}_{i}|}$ being positive definite. When $\bar{R}_{i}=\textrm{diag}\{R_{i},R_{j\in\mathcal{N}_{i}}\}$ with $R_{i}$ and $R_{j}$ defined in (11), the joint control cost term in (15) satisfies $\bar{u}_{i}^{\top}\bar{R}_{i}\bar{u}_{i}=\sum_{j\in\mathcal{\bar{N}}_{i}}u_{j}^{\top}R_{j}u_{j}$, which is symmetric with respect to the relationship of the discrete-time control costs in (7). When $\bar{x}_{i}\in\bar{\mathcal{B}}_{i}$, the terminal cost function is defined as $\phi_{i}(\bar{x}_{i})=\sum_{j\in\mathcal{\bar{N}}_{i}}\omega_{j}^{i}\cdot\phi_{j}(x_{j})$, where $\omega_{j}^{i}>0$ is the weight measuring the priority of the assignment on agent $j$, and we let the weight $\omega_{i}^{i}$ dominate the other weights $\omega_{j\in\mathcal{N}_{i}}^{i}$ to improve the success rate. Compared with cost functions fully factorized over agents, the joint cost functions in (15) can gauge and facilitate the correlation and cooperation between neighboring agents. Subsequently, the joint cost-to-go function of the first-exit problem in subsystem $\bar{\mathcal{N}}_{i}$ is defined as

$J^{\bar{u}_{i}}_{i}(\bar{x}_{i}^{t},t)=\mathbb{E}^{\bar{u}_{i}}_{\bar{x}_{i}^{t},t}\left[\phi_{i}(\bar{x}_{i}^{t_{f}})+\int_{t}^{t_{f}}c_{i}(\bar{x}_{i}(\tau),\bar{u}_{i}(\tau))\,d\tau\right].$

Let the (joint) value function $V_{i}(\bar{x}_{i},t)$ be the minimal cost-to-go function, i.e. $V_{i}(\bar{x}_{i},t)=\min_{\bar{u}_{i}}J_{i}^{\bar{u}_{i}}(\bar{x}_{i}^{t},t)$. We can then compute the joint optimal control action $\bar{u}_{i}^{*}$ of subsystem $\mathcal{\bar{N}}_{i}$ by solving the following joint optimality equation

$V_{i}(\bar{x}_{i},t)=\min_{\bar{u}_{i}}\mathbb{E}^{\bar{u}_{i}}_{\bar{x}_{i}^{t},t}\left[\phi_{i}(\bar{x}_{i}^{t_{f}})+\int_{t}^{t_{f}}c_{i}(\bar{x}_{i}(\tau),\bar{u}_{i}(\tau))\,d\tau\right].$   (16)

A distributed path integral control algorithm and a distributed REPS algorithm for solving (16) will be respectively discussed in Section 3. Discussion on the relationship between discrete-time and continuous-time LSOC dynamics can be found in [16, 38].
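For reference, before moving on, a minimal sketch of how the joint terms of the subsystem dynamics (14) can be assembled from the local agent models by stacking the drifts and block-diagonalizing the control and noise matrices; the local $f_j$, $B_j$, and $\sigma_j$ used below are hypothetical placeholders.

import numpy as np
from scipy.linalg import block_diag

# Sketch: joint drift, control matrix, and noise factor of (14) assembled from
# local agent terms.  The local models below are hypothetical placeholders.
def joint_dynamics_terms(f_list, B_list, sigma_list):
    f_bar = np.concatenate(f_list)        # stacked joint passive drift
    B_bar = block_diag(*B_list)           # block-diagonal joint control matrix
    sigma_bar = block_diag(*sigma_list)   # block-diagonal joint noise factor
    return f_bar, B_bar, sigma_bar

f_list = [np.array([1.0, -0.5]), np.array([0.2, 0.0])]        # f_i, f_j
B_list = [np.array([[0.0], [1.0]]), np.array([[0.0], [1.0]])]
sigma_list = [0.1 * np.eye(1), 0.2 * np.eye(1)]
f_bar, B_bar, sigma_bar = joint_dynamics_terms(f_list, B_list, sigma_list)
print(f_bar.shape, B_bar.shape, sigma_bar.shape)              # (4,) (4, 2) (2, 2)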

Remark 1.

Although each agent $i\in\mathcal{D}$ under this framework acts optimally to minimize a joint cost-to-go function defined in its subsystem $\mathcal{\bar{N}}_{i}$, the distributed control law obtained from solving the local problem (9) or (16) is still a sub-optimal solution unless the communication network $\mathcal{G}$ is fully connected. Two main reasons account for this sub-optimality. First, when solving (9) or (16) for the joint (or local) optimal control $\bar{u}_{i}^{*}$ (or $u_{i}^{*}$), we ignore the connections of agents outside the subsystem $\bar{\mathcal{N}}_{i}$ and assume that the evolution of the joint state $\bar{x}_{i}$ only relies on the current values of $\bar{x}_{i}$ and the joint control $\bar{u}_{i}$. This simplification is reasonable and almost accurate for the central agent $i$ of subsystem $\bar{\mathcal{N}}_{i}$, but not for the non-central agents $j\in\mathcal{N}_{i}$, which are usually adjacent to other agents in $\mathcal{N}_{j}\backslash\mathcal{N}_{i}$. Second, the local optimal control actions $\bar{u}_{j}^{*}$ of the other agents $j\in\mathcal{D}\backslash\{i\}$ are respectively computed from their own subsystems $\bar{\mathcal{N}}_{j}$, which may contradict the joint optimal control $\bar{u}_{i}^{*}$ solved in subsystem $\bar{\mathcal{N}}_{i}$ and result in a sub-optimal solution. Similar conflicts widely exist in distributed control and optimization problems subject to limited communication and partial observation, and some rigorous and heuristic studies on the global- and sub-optimality of distributed subsystems have been conducted in [39, 40, 11, 41]. We will not dive into those technical details in this paper, as we believe that the executability of a sub-optimal plan with moderate communication and computational complexities should outweigh the performance gain of a globally optimal but computationally intractable plan in practice. In this regard, the distributed framework built upon factorial subsystems, which captures the correlations between neighboring agents while ignoring the further connections outside the subsystems, provides a trade-off alternative between optimality and complexity and is analogous to the structured prediction framework in supervised learning.

Distributed Linearly-Solvable Optimal Control

Subject to the multi-agent LSOC problems formulated in Section 2, the linearization methods and distributed algorithms for solving the joint discrete-time Bellman equation (9) and joint continuous-time optimality equation (16) are discussed in this section.

Discrete-Time Systems

We first consider the discrete-time MAS with dynamics (6) and immediate cost function (7), which give the joint Bellman equation (9) in subsystem $\mathcal{\bar{N}}_{i}$

$V_{i}(\bar{x}_{i})=\min_{\bar{u}_{i}}\left\{c_{i}(\bar{x}_{i},\bar{u}_{i})+\mathbb{E}_{\bar{x}^{\prime}_{i}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})}[V_{i}(\bar{x}^{\prime}_{i})]\right\}.$

In order to compute the value function $V_{i}(\bar{x}_{i})$ and optimal control action $\bar{u}_{i}^{*}(\cdot|\bar{x}_{i})$ from equation (9), an exponential or Cole-Hopf transformation is employed to linearize (9) into a system of linear equations, which can be cast as a decentralized program and solved in parallel. The local optimal control distribution $u_{i}^{*}(\cdot|\bar{x}_{i})$, which depends on the local observation of agent $i$, is then derived by marginalizing the joint control distribution $\bar{u}_{i}^{*}(\cdot|\bar{x}_{i})$.

A. Linearization of Joint Bellman Equation

Motivated by the exponential transformation employed in [38] for single-agent systems, we define the desirability function $Z_{i}(\bar{x}_{i})$ for joint state $\bar{x}_{i}\in\mathcal{\bar{I}}_{i}$ in subsystem $\mathcal{\bar{N}}_{i}$ as

$Z_{i}(\bar{x}_{i})=\exp[-V_{i}(\bar{x}_{i})],$   (17)

which implies that the desirability function $Z_{i}(\bar{x}_{i})$ is negatively correlated with the value function $V_{i}(\bar{x}_{i})$, and the value function can also be written conversely as a logarithm of the desirability function, $V_{i}(\bar{x}_{i})=\log 1/Z_{i}(\bar{x}_{i})$. For boundary states $\bar{x}_{i}\in\mathcal{\bar{B}}_{i}$, the desirability functions are defined as $Z_{i}(\bar{x}_{i})=\exp[-\phi_{i}(\bar{x}_{i})]$. Based on the transformation (17), a linearized joint Bellman equation (9) along with the joint optimal control distribution is presented in Theorem 1.

Theorem 1.

With the exponential transformation (17), the joint Bellman equation (9) for subsystem $\mathcal{\bar{N}}_{i}$ is equivalent to the following linear equation with respect to the desirability function

$Z_{i}(\bar{x}_{i})=\exp(-q_{i}(\bar{x}_{i}))\cdot\sum_{\bar{x}^{\prime}_{i}}\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i}),$   (18)

where $q_{i}(\bar{x}_{i})$ is the state-related cost defined in (7), and $\bar{p}_{i}(\bar{x}_{i}^{\prime}|\bar{x}_{i})$ is the transition probability of the passive dynamics in (5). The joint optimal control action $\bar{u}_{i}^{*}(\cdot|\bar{x}_{i})$ solving (9) satisfies

$\bar{u}_{i}^{*}(\cdot|\bar{x}_{i})=\frac{\bar{p}_{i}(\cdot|\bar{x}_{i})Z_{i}(\cdot)}{\sum_{\bar{x}^{\prime}_{i}}\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i})},$   (19)

where $\bar{u}_{i}^{*}(\bar{x}_{i}^{\prime}|\bar{x}_{i})$ is the transition probability from $\bar{x}_{i}$ to $\bar{x}_{i}^{\prime}$ in the controlled dynamics (6).

Proof.

See Appendix A for the proof. ∎

The joint optimal control action $\bar{u}^{*}_{i}(\bar{x}_{i}^{\prime}|\bar{x}_{i})$ in (19) only relies on the joint state $\bar{x}_{i}$ of subsystem $\mathcal{\bar{N}}_{i}$, i.e. the local observation of agent $i$. To execute the control action (19), we still need to determine the values of the desirability functions or value functions, which can be solved from (18). A conventional approach for solving (18), which was adopted in [16, 18, 38] for the single-agent LSMDP, is to rewrite (18) as a recursive formula and approximate the solution by iterations. We can approximate the desirability functions of the interior states $\bar{x}_{i}\in\mathcal{\bar{I}}_{i}$ by recursively executing the following update law

$Z_{\mathcal{I}}=\Theta Z_{\mathcal{I}}+\Omega Z_{\mathcal{B}},$   (20)

where $Z_{\mathcal{I}}$ and $Z_{\mathcal{B}}$ are respectively the desirability vectors of the interior states and boundary states; the matrix $\Theta=\textrm{diag}\{\exp(-q_{\mathcal{I}})\}\cdot P_{\mathcal{II}}$ with $q_{\mathcal{I}}$ denoting the state-related cost of interior states, $P_{\mathcal{II}}=[p_{mn}]$ denoting the transition probability matrix between interior states, and $p_{mn}=\bar{p}_{i}(\bar{x}_{i}^{\prime}=\bar{s}^{i}_{n}\,|\,\bar{x}_{i}=\bar{s}^{i}_{m})$ for $\bar{s}^{i}_{n},\bar{s}^{i}_{m}\in\mathcal{\bar{I}}_{i}$; the matrix $\Omega=\textrm{diag}\{\exp(-q_{\mathcal{I}})\}\cdot P_{\mathcal{IB}}$ with $P_{\mathcal{IB}}=[p_{mn}]$ denoting the transition probability matrix from interior states to boundary states, and $p_{mn}=\bar{p}_{i}(\bar{x}_{i}^{\prime}=\bar{s}^{i}_{n}\,|\,\bar{x}_{i}=\bar{s}^{i}_{m})$ for $\bar{s}^{i}_{m}\in\mathcal{\bar{I}}_{i}$ and $\bar{s}^{i}_{n}\in\mathcal{\bar{B}}_{i}$. Assigning an initial value to the desirability vector $Z_{\mathcal{I}}$, the recursive formula (20) is guaranteed to converge to a unique solution, since the spectral radius of the matrix $\Theta$ is less than $1$. More detailed convergence analysis of this iterative solver of LSMDPs has been given in [16]. However, this centralized solver is inefficient when dealing with MAS problems with high-dimensional state spaces, which requires the development of a distributed solver that exploits the resources of the network and expedites the computation.
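A minimal numerical sketch of the fixed-point iteration (20) on a hypothetical subsystem with two interior states and one boundary state (all transition probabilities and costs below are made up for illustration):

import numpy as np

# Sketch of the iterative desirability solver (20) on a hypothetical toy
# subsystem: Theta = diag(exp(-q_I)) P_II, Omega = diag(exp(-q_I)) P_IB.
P_II = np.array([[0.6, 0.2],
                 [0.3, 0.4]])          # interior -> interior transitions
P_IB = np.array([[0.2],
                 [0.3]])               # interior -> boundary transitions
q_I = np.array([0.5, 1.0])             # state costs of interior states
Z_B = np.exp(-np.array([0.0]))         # boundary desirability exp(-phi)

Theta = np.diag(np.exp(-q_I)) @ P_II
Omega = np.diag(np.exp(-q_I)) @ P_IB

Z_I = np.ones(2)                       # initial guess for interior desirability
for _ in range(200):
    Z_new = Theta @ Z_I + Omega @ Z_B
    if np.max(np.abs(Z_new - Z_I)) < 1e-12:
        break
    Z_I = Z_new
print(Z_I)   # converges since the spectral radius of Theta is less than 1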

B. Distributed Planning Algorithm

While most distributed SOC algorithms are executed by local agents in a decentralized fashion, a great number of these algorithms still demand a centralized solver in the planning phase, which becomes a bottleneck for their implementation when the number of agents scales up [8]. For a fully connected MAS with $N$ agents, when each agent has $|\mathcal{I}|$ interior states, the dimension of the vector $Z_{\mathcal{I}}$ in (20) is $|\mathcal{I}|^{N}$, and as the number of agents $N$ grows, it becomes more intractable for a central computation unit to store all the data and execute all the computation required by (20) due to the curse of dimensionality. Although the subsystem-based distributed framework can alleviate this problem by demanding a less complex network and making the dimension and computational complexity only related to the sizes of the factorial subsystems, it is still preferable to utilize the resources of the MAS by distributing the data and computational task of (20) to each local agent in $\bar{\mathcal{N}}_{i}$, instead of relying on a central planning agent. Hence, we rewrite the linear equation (20) in the following form

$(I-\Theta)Z_{\mathcal{I}}=\Omega Z_{\mathcal{B}},$   (21)

and formulate (21) as a parallel programming problem. In order to solve the desirability vector $Z_{\mathcal{I}}$ from (21) via a distributed approach, each agent in subsystem $\mathcal{\bar{N}}_{i}$ only needs to know (store) a subset (rows) of the partitioned matrix $\left[I-\Theta,\ \Omega Z_{\mathcal{B}}\right]$. Subject to the equality constraints laid down by its portion of the coefficients, agent $j\in\mathcal{\bar{N}}_{i}$ first initializes its own version of the solution $Z_{\mathcal{I},j}^{(0)}$ to (21). Every agent has access to the solutions of its neighboring agents, and the central agent $i$ can access the solutions of the agents in $\mathcal{N}_{i}$. A consensus on $Z_{\mathcal{I}}$ in (21) can be reached among $Z_{\mathcal{I},j\in\mathcal{\bar{N}}_{i}}$ when implementing the following synchronous distributed algorithm on each computational agent in $\mathcal{\bar{N}}_{i}$ [42]:

$Z^{(n+1)}_{\mathcal{I},j}=Z^{(n)}_{\mathcal{I},j}-P_{j}\left(Z^{(n)}_{\mathcal{I},j}-\dfrac{1}{d_{j}}\sum_{k\in\mathcal{N}_{j}\cap\mathcal{\bar{N}}_{i}}Z^{(n)}_{\mathcal{I},k}\right),$   (22)

where $P_{j}$ is the orthogonal projection matrix onto the kernel of $[I-\Theta]_{j}$, the rows of the matrix $I-\Theta$ stored by agent $j$; $d_{j}=|\mathcal{N}_{j}\cap\mathcal{\bar{N}}_{i}|$ is the number of neighboring agents of agent $j$ in subsystem $\mathcal{\bar{N}}_{i}$; and $n$ is the index of the update iteration. An asynchronous distributed algorithm [43], which does not require agents to update their solutions $Z_{\mathcal{I},j}$ concurrently, can also be invoked to solve (21). Under these distributed algorithms, different versions of the solution $Z_{\mathcal{I},j}$ are exchanged across the network, and the requirements of data storage and computation can be allocated evenly to each agent in the network, which improves the overall efficiency of the algorithms. Meanwhile, these fully parallelized planning algorithms can be optimized and boosted further by naturally incorporating parallel computing and Internet of Things (IoT) techniques, such as edge computing [44]. Nonetheless, for MASs with a massive population, i.e. $N\rightarrow\infty$, most of the control schemes and algorithms introduced in this paper will become fruitless, and we may resort to the mean-field theory [9, 45, 46], which describes MASs by a probability density model rather than a connected graph model.
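A minimal sketch of the projected-consensus iteration behind (22), on a hypothetical 3-agent subsystem where each agent stores one row of $(I-\Theta)Z_{\mathcal{I}}=\Omega Z_{\mathcal{B}}$; for simplicity each agent's own copy is included in the neighborhood average (a common variant of this iteration), and the system matrix is a random stand-in rather than an actual $I-\Theta$.

import numpy as np

# Sketch of the distributed projected-consensus solver underlying (22).
# Each agent keeps its own copy of Z_I, averages it with neighbors' copies,
# and projects the correction onto the kernel of its locally stored rows so
# that its own equality constraints remain satisfied at every iteration.
rng = np.random.default_rng(0)
n = 3
A = np.eye(n) - 0.3 * rng.random((n, n))      # stand-in for I - Theta
b = rng.random(n)                             # stand-in for Omega Z_B
rows = {0: [0], 1: [1], 2: [2]}               # rows stored by each agent
nbrs = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}   # averaging includes the agent itself

proj = {j: np.eye(n) - np.linalg.pinv(A[rows[j], :]) @ A[rows[j], :] for j in rows}
Z = {j: np.linalg.pinv(A[rows[j], :]) @ b[rows[j]] for j in rows}   # feasible starts

for _ in range(2000):
    avg = {j: np.mean([Z[k] for k in nbrs[j]], axis=0) for j in rows}
    Z = {j: Z[j] - proj[j] @ (Z[j] - avg[j]) for j in rows}

z_star = np.linalg.solve(A, b)
print(max(np.linalg.norm(Z[j] - z_star) for j in rows))   # shrinks toward zero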

C. Local Control Action

After the desirability functions $Z_{i}(\bar{x}_{i})$ for all the joint states $\bar{x}_{i}\in\bar{\mathcal{I}}_{i}\cup\bar{\mathcal{B}}_{i}$ in subsystem $\bar{\mathcal{N}}_{i}$ have been obtained, the local optimal control distribution $u_{i}^{*}(x^{\prime}_{i}|\bar{x}_{i})$ for the central agent $i$ is derived by calculating the marginal distribution of $\bar{u}^{*}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})$

$u_{i}^{*}(x^{\prime}_{i}|\bar{x}_{i})=\sum_{x^{\prime}_{j\in\mathcal{N}_{i}}}\bar{u}_{i}^{*}(x^{\prime}_{i},x^{\prime}_{j\in\mathcal{N}_{i}}|\bar{x}_{i}),$

where both the joint distribution $\bar{u}^{*}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})$ and the local distribution $u_{i}^{*}(x^{\prime}_{i}|\bar{x}_{i})$ rely on the local observation of the central agent $i$. By sampling the control action from the marginal distribution $u_{i}^{*}(x^{\prime}_{i}|\bar{x}_{i})$, agent $i$ behaves optimally to minimize the joint cost-to-go function (8) defined in subsystem $\bar{\mathcal{N}}_{i}$, and the local optimal control distributions $u_{k}^{*}(x^{\prime}_{k}|\bar{x}_{k})$ of the other agents $k\in\mathcal{D}\backslash\{i\}$ in the network $\mathcal{G}$ can be derived by repeating the preceding procedures in subsystems $\bar{\mathcal{N}}_{k}$. The procedures of the multi-agent LSMDP algorithm are summarized as Algorithm 1 in Appendix E.
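A small sketch of this marginalization step: the joint control distribution over a subsystem's joint next states is stored as a multi-dimensional array (one axis per agent; a hypothetical random example below), and the central agent's local policy is the sum over the neighbors' axes.

import numpy as np

# Sketch: marginalizing a joint control distribution (axes: central agent,
# neighbor 1, neighbor 2 next states) to the central agent's local policy.
# The joint array below is a hypothetical, randomly generated distribution.
rng = np.random.default_rng(1)
u_joint = rng.random((2, 3, 3))
u_joint /= u_joint.sum()                # normalize to a joint distribution

u_local = u_joint.sum(axis=(1, 2))      # marginal over the neighbors' axes
print(u_local, u_local.sum())           # local distribution, sums to 1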

Continuous-Time Systems

We now consider the LSOC problem for continuous-time MASs subject to the joint dynamics (14) and joint immediate cost function (15), which can be formulated as the joint optimality equation (16) as follows

$V_{i}(\bar{x}_{i},t)=\min_{\bar{u}_{i}}\mathbb{E}^{\bar{u}_{i}}_{\bar{x}_{i}^{t},t}\left[\phi_{i}(\bar{x}_{i}^{t_{f}})+\int_{t}^{t_{f}}c_{i}(\bar{x}_{i}(\tau),\bar{u}_{i}(\tau))\,d\tau\right].$

In order to solve (16) and derive the local optimal control action $u^{*}_{i}(\bar{x}_{i},t)$ for agent $i$, we first cast the joint optimality equation (16) into a joint stochastic HJB equation that gives an analytic form for the joint optimal control action $\bar{u}^{*}_{i}(\bar{x}_{i},t)$. To solve for the value function, by resorting to the Cole-Hopf transformation, the stochastic HJB equation is linearized into a partial differential equation (PDE) with respect to the desirability function. The Feynman-Kac formula is then invoked to express the solution of the linearized PDE and the joint optimal control action as path integral formulae forward in time, which are later approximated by a distributed Monte Carlo (MC) sampling method and a sample-efficient distributed REPS algorithm, respectively.

A. Linearization of Joint Optimality Equation

Similar to the transformation (17) for discrete-time systems, we adopt the following Cole-Hopf or exponential transformation in continuous-time systems

$Z(\bar{x}_{i},t)=\exp[-V_{i}(\bar{x}_{i},t)/\lambda_{i}],$   (23)

where $\lambda_{i}\in\mathbb{R}$ is a scalar, and $Z(\bar{x}_{i},t)$ is the desirability function of joint state $\bar{x}_{i}$ at time $t$. Conversely, we also have $V_{i}(\bar{x}_{i},t)=-\lambda_{i}\log Z(\bar{x}_{i},t)$ from (23). In the following theorem, we convert the optimality equation (16) into a joint stochastic HJB equation, which reveals an analytic form of the joint optimal control action $\bar{u}^{*}_{i}(\bar{x}_{i},t)$, and linearize the HJB equation into a linear PDE that has a closed-form solution for the desirability function.

Theorem 2.

Subject to the joint dynamics (14) and immediate cost function (15), the joint optimality equation (16) in subsystem $\mathcal{\bar{N}}_{i}$ is equivalent to the joint stochastic HJB equation

$-\partial_{t}V_{i}(\bar{x}_{i},t)=\min_{\bar{u}_{i}}\mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}\Big[\sum_{j\in\bar{\mathcal{N}}_{i}}[f_{j}(x_{j},t)+B_{j}(x_{j})u_{j}(\bar{x}_{i},t)]^{\top}\cdot\nabla_{x_{j}}V_{i}(\bar{x}_{i},t)+q_{i}(\bar{x}_{i},t)+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},t)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},t)+\frac{1}{2}\sum_{j\in\bar{\mathcal{N}}_{i}}\textrm{tr}\left(B_{j}(x_{j})\sigma_{j}\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}x_{j}}V_{i}(\bar{x}_{i},t)\right)\Big],$   (24)

with boundary condition $V_{i}(\bar{x}_{i},t_{f})=\phi_{i}(\bar{x}_{i})$, and the optimum of (24) can be attained with the joint optimal control action

$\bar{u}^{*}_{i}(\bar{x}_{i},t)=-\bar{R}_{i}^{-1}\bar{B}_{i}(\bar{x}_{i})^{\top}\nabla_{\bar{x}_{i}}V_{i}(\bar{x}_{i},t).$   (25)

Subject to the transformation (23), the control action (25) and the condition $\bar{R}_{i}=(\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}/\lambda_{i})^{-1}$, the joint stochastic HJB equation (24) can be linearized as

$\partial_{t}Z_{i}(\bar{x}_{i},t)=\Big[\frac{q_{i}(\bar{x}_{i},t)}{\lambda_{i}}-\sum_{j\in\bar{\mathcal{N}}_{i}}f_{j}(x_{j},t)^{\top}\nabla_{x_{j}}-\frac{1}{2}\sum_{j\in\bar{\mathcal{N}}_{i}}\textrm{tr}\left(B_{j}(x_{j})\sigma_{j}\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\nabla_{x_{j}x_{j}}\right)\Big]Z_{i}(\bar{x}_{i},t)$   (26)

with boundary condition $Z_{i}(\bar{x}_{i},t_{f})=\exp[-\phi_{i}(\bar{x}_{i})/\lambda_{i}]$, which has the solution

$Z_{i}(\bar{x}_{i},t)=\mathbb{E}_{\bar{x}_{i},t}\left[\exp\left(-\frac{1}{\lambda_{i}}\phi_{i}(\bar{y}^{t_{f}}_{i})-\frac{1}{\lambda_{i}}\int_{t}^{t_{f}}q_{i}(\bar{y}_{i},\tau)\,d\tau\right)\right],$   (27)

where the diffusion process $\bar{y}_{i}(\tau)$ satisfies the uncontrolled dynamics $d\bar{y}_{i}(\tau)=\bar{f}_{i}(\bar{y}_{i},\tau)d\tau+\bar{B}_{i}(\bar{y}_{i})\bar{\sigma}_{i}\cdot d\bar{w}_{i}(\tau)$ with initial condition $\bar{y}_{i}(t)=\bar{x}_{i}(t)$.

Proof.

See Appendix B for the proof. ∎

Based on the transformation (23), the gradient of the value function satisfies $\nabla_{\bar{x}_{i}}V_{i}(\bar{x}_{i},t)=-\lambda_{i}\cdot\nabla_{\bar{x}_{i}}Z_{i}(\bar{x}_{i},t)/Z_{i}(\bar{x}_{i},t)$. Hence, the joint optimal control action (25) can be rewritten as

$\bar{u}^{*}_{i}(\bar{x}_{i},t)=\lambda_{i}\bar{R}_{i}^{-1}\bar{B}^{\top}_{i}(\bar{x}_{i})\cdot\frac{\nabla_{\bar{x}_{i}}Z_{i}(\bar{x}_{i},t)}{Z_{i}(\bar{x}_{i},t)}=\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}(\bar{x}_{i})^{\top}\cdot\frac{\nabla_{\bar{x}_{i}}Z_{i}(\bar{x}_{i},t)}{Z_{i}(\bar{x}_{i},t)},$   (28)

where the second equality follows from the condition $\bar{R}_{i}=(\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}/\lambda_{i})^{-1}$. Meanwhile, by virtue of the Feynman-Kac formula, the stochastic HJB equation (24) that must be solved backward in time can now be solved by an expectation of a diffusion process evolving forward in time. While a closed-form solution of the desirability function $Z_{i}(\bar{x}_{i},t)$ is given in (27), the expectation $\mathbb{E}_{\bar{x}_{i},t}(\cdot)$ is defined on the sample space consisting of all possible uncontrolled trajectories initialized at $(\bar{x}_{i},t)$, which makes this expectation intractable to compute. A common approach in statistical physics and quantum mechanics is to first formulate this expectation as a path integral [47, 37, 48], and then approximate the integral or optimal control action with various techniques, such as MC sampling [37] and policy improvement [22]. In the following subsections, we first formulate the desirability function (26) and control action (28) as path integrals and then approximate them with a distributed MC sampling algorithm and a sample-efficient distributed REPS algorithm, respectively.
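Before turning to the path integral formulation, a minimal sketch of how the Feynman-Kac representation (27) can be estimated by plain Monte Carlo, simulating the uncontrolled joint dynamics with an Euler-Maruyama scheme; the scalar drift, noise level, state cost, and terminal cost below are hypothetical placeholders.

import numpy as np

# Sketch: Monte Carlo estimate of the desirability function (27) via the
# Feynman-Kac representation, simulating the uncontrolled dynamics with
# Euler-Maruyama.  All model ingredients below are hypothetical placeholders.
rng = np.random.default_rng(2)
lam, dt, K, n_rollouts = 1.0, 0.01, 100, 2000
f = lambda y: -y                      # passive drift
B, sigma = 1.0, 0.5                   # control matrix and noise scale
q = lambda y: 0.5 * y**2              # state-related running cost
phi = lambda y: y**2                  # terminal cost

y = np.full(n_rollouts, 1.0)          # all rollouts start at x = 1
cost = np.zeros(n_rollouts)
for _ in range(K):
    cost += q(y) * dt                 # accumulate the running state cost
    y += f(y) * dt + B * sigma * np.sqrt(dt) * rng.standard_normal(n_rollouts)

Z_hat = np.mean(np.exp(-(phi(y) + cost) / lam))
print(Z_hat)                          # estimate of Z_i at (x = 1, t)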

B. Path Integral Formulation

Before presenting a path integral formula for the desirability function $Z_{i}(\bar{x}_{i},t)$ in (27), some manipulations on the joint dynamics (14) are performed to avoid singularity problems [22]. By rearranging the components of the joint state $\bar{x}_{i}$ in (14), the joint state vector $\bar{x}_{i}$ in subsystem $\mathcal{\bar{N}}_{i}$ can be partitioned as $[\bar{x}^{\top}_{i(n)},\bar{x}^{\top}_{i(d)}]^{\top}$, where $\bar{x}_{i(n)}\in\mathbb{R}^{U\cdot|\bar{\mathcal{N}}_{i}|}$ and $\bar{x}_{i(d)}\in\mathbb{R}^{D\cdot|\bar{\mathcal{N}}_{i}|}$ respectively denote the joint non-directly actuated states and joint directly actuated states of subsystem $\mathcal{\bar{N}}_{i}$; $U$ and $D$ denote the dimensions of the non-directly actuated states and directly actuated states of a single agent. Consequently, the joint passive dynamics term $\bar{f}_{i}(\bar{x}_{i},t)$ and the joint control transition matrix $\bar{B}_{i}(\bar{x}_{i})$ in (14) are partitioned as $[\bar{f}^{\,\top}_{i(n)},\bar{f}^{\,\top}_{i(d)}]^{\top}$ and $[0,\bar{B}^{\top}_{i(d)}(\bar{x}_{i})]^{\top}$, respectively. Hence, the joint dynamics (14) can be rewritten in the partitioned vector form

$\begin{pmatrix}d\bar{x}_{i(n)}\\ d\bar{x}_{i(d)}\end{pmatrix}=\begin{pmatrix}\bar{f}_{i(n)}(\bar{x}_{i},t)\\ \bar{f}_{i(d)}(\bar{x}_{i},t)\end{pmatrix}dt+\begin{pmatrix}0\\ \bar{B}_{i(d)}(\bar{x}_{i})\end{pmatrix}\left[\bar{u}_{i}(\bar{x}_{i},t)dt+\bar{\sigma}_{i}d\bar{w}_{i}\right].$   (29)

With the partitioned dynamics (29), the path integral formulae for the desirability function (27) and joint optimal control action (28) are given in Proposition 3.

Proposition 3.

Partition the time interval from $t$ to $t_{f}$ into $K$ intervals of equal length $\varepsilon>0$, $t=t_{0}<t_{1}<\cdots<t_{K}=t_{f}$, and let the trajectory variable $\bar{x}_{i}^{(k)}=[\bar{x}_{i(n)}^{(k)\top},\bar{x}_{i(d)}^{(k)\top}]^{\top}$ denote the segments of joint uncontrolled trajectories on the time interval $[t_{k-1},t_{k})$, governed by the joint dynamics (29) with $\bar{u}_{i}(\bar{x}_{i},t)=0$ and initial condition $\bar{x}_{i}(t)=\bar{x}_{i}^{(0)}$. The desirability function (27) in subsystem $\bar{\mathcal{N}}_{i}$ can then be reformulated as a path integral

$Z_{i}(\bar{x}_{i},t)=\lim_{\varepsilon\downarrow 0}\int\exp\left(-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-KD|\mathcal{\bar{N}}_{i}|/2\cdot\log(2\pi\lambda_{i}\varepsilon)\right)d\bar{\ell}_{i},$   (30)

where the integral is over the path variable $\bar{\ell}_{i}=(\bar{x}^{(1)}_{i},\cdots,\bar{x}^{(K)}_{i})$, i.e. the set of all uncontrolled trajectories initialized at $(\bar{x}_{i},t)$, and the generalized path value is

$\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})=\frac{\phi_{i}(\bar{x}^{(K)}_{i})}{\lambda_{i}}+\frac{\varepsilon}{\lambda_{i}}\sum_{k=0}^{K-1}q_{i}(\bar{x}^{(k)}_{i},t_{k})+\frac{1}{2}\sum_{k=0}^{K-1}\log\det(H_{i}^{(k)})+\frac{\varepsilon}{2\lambda_{i}}\sum_{k=0}^{K-1}\left\|\frac{\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)}}{\varepsilon}-\bar{f}_{i(d)}(\bar{x}^{(k)}_{i},t_{k})\right\|^{2}_{\left(H_{i}^{(k)}\right)^{-1}}$   (31)

with Hi(k)=B¯i(d)(x¯i(k))σ¯iσ¯iB¯i(d)(x¯i(k))=λiB¯i(d)(x¯i(k))R¯i1B¯i(d)(x¯i(k))H_{i}^{(k)}=\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})^{\top}=\lambda_{i}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})\bar{R}_{i}^{-1}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})^{\top}. Hence, the joint optimal control action in subsystem 𝒩¯i\bar{\mathcal{N}}_{i} can be reformulated as a path integral

u¯i(x¯i,t)=λiR¯i1B¯i(d)(x¯i)limε0p~i(¯i|x¯i(0),t0)u~i(x¯i(0),¯i,t0)𝑑¯i,\bar{u}^{*}_{i}(\bar{x}_{i},t)=\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i(d)}(\bar{x}_{i})^{\top}\cdot\lim_{\varepsilon\downarrow 0}\int\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})\cdot\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})\ d\bar{\ell}_{i}, (32)

where

p~i(¯i|x¯i(0),t0)=exp(S~iε,λi(x¯i(0),¯i,t0))exp(S~iε,λi(x¯i(0),¯i,t0))𝑑¯i\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})=\frac{\exp(-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0}))}{\int\exp(-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0}))\ d\bar{\ell}_{i}} (33)

is the optimal path distribution, and

u~i(x¯i(0),¯i,t0)=ελix¯i(d)(0)qi(x¯i(0),t0)+(Hi(0))1(x¯i(d)(1)x¯i(d)(0)εf¯i(d)(x¯i(0),t0))\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})=-\frac{\varepsilon}{\lambda_{i}}\nabla_{\bar{x}^{(0)}_{i(d)}}q_{i}(\bar{x}^{(0)}_{i},t_{0})+\left(H_{i}^{(0)}\right)^{-1}\left(\frac{\bar{x}_{i(d)}^{(1)}-\bar{x}_{i(d)}^{(0)}}{\varepsilon}-\bar{f}_{i(d)}(\bar{x}_{i}^{(0)},t_{0})\right) (34)

is the initial control variable.

Proof.

See Appendix C for the proof. ∎

While Proposition 3 gives the path integral formulae for the desirability (value) function Zi(x¯i,t)Z_{i}(\bar{x}_{i},t) and the joint optimal control action u¯i(x¯i,t)\bar{u}^{*}_{i}(\bar{x}_{i},t), one of the challenges when implementing these formulae is the approximation of the optimal path distribution (33) and of the integrals in (30) and (32), since these integrals are defined on the set of all possible uncontrolled trajectories initialized at (x¯i,t)(\bar{x}_{i},t) or (x¯i(0),t0)(\bar{x}_{i}^{(0)},t_{0}), which are intractable to exhaust and might become computationally expensive to sample as the state dimension and the number of agents scale up. An intuitive solution is to formulate this problem as a statistical inference problem and predict the optimal path distribution p~i(¯i|x¯i(0),t0)\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) via various inference techniques, such as plain MC sampling [37] or Metropolis-Hastings sampling [31]. Given a batch of uncontrolled trajectories 𝒴i={(x¯i(0),¯i[y])}y=1,,Y\mathcal{Y}_{i}=\{(\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]})\}_{y=1,\cdots,Y} and a fixed ε\varepsilon, the Monte Carlo estimate of the optimal path distribution in (33) is

p~i(¯i[y]|x¯i(0),t0)exp(S~iε,λi(x¯i(0),¯i[y],t0))y=1Yexp(S~iε,λi(x¯i(0),¯i[y],t0)),\tilde{p}^{*}_{i}(\bar{\ell}_{i}^{[y]}|\bar{x}_{i}^{(0)},t_{0})\approx\frac{\exp(-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]},t_{0}))}{\sum_{y=1}^{Y}\exp(-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i},t_{0}))}, (35)

where S~iε,λi(x¯i(0),¯i[y],t0)\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]},t_{0}) denotes the generalized path value (31) of trajectory (x¯i(0),¯i[y])(\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]}), and the estimation of joint optimal control action in (32) is

u¯i(x¯i,t)=λiR¯i1B¯i(d)(xi)y=1Yp~i(¯i[y]|x¯i(0),t0)u~i(x¯i(0),¯i[y],t0),\bar{u}^{*}_{i}(\bar{x}_{i},t)=\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i(d)}(x_{i})^{\top}\cdot\sum_{y=1}^{Y}\tilde{p}^{*}_{i}(\bar{\ell}^{[y]}_{i}|\bar{x}_{i}^{(0)},t_{0})\cdot\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i},t_{0}), (36)

where u~i(x¯i(0),¯i[y],t0)\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i},t_{0}) is the initial control of sample trajectory (x¯i(0),¯i[y])(\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]}). To expedite the sampling process, the sampling tasks of 𝒴i𝒟\mathcal{Y}_{i\in\mathcal{D}} can be distributed across the agents in the network, i.e. each agent jj in MAS 𝒢\mathcal{G} only samples its local uncontrolled trajectories {(xj(0),j[y])}y=1,,Y\{(x_{j}^{(0)},\ell_{j}^{[y]})\}_{y=1,\cdots,Y}, and the optimal path distribution of each subsystem 𝒩¯i\mathcal{\bar{N}}_{i} is approximated via (35) after the central agent ii assembles 𝒴i\mathcal{Y}_{i} by collecting the samples from its neighbors j𝒩¯ij\in\mathcal{\bar{N}}_{i}. Meanwhile, the local trajectories {(xj(0),j[y])}y=1,,Y\{(x_{j}^{(0)},\ell_{j}^{[y]})\}_{y=1,\cdots,Y} of agent jj can be utilized not only by the local control algorithm of agent jj in subsystem 𝒩¯j\bar{\mathcal{N}}_{j}, but also by the local control algorithms of the neighboring agents k𝒩jk\in\mathcal{N}_{j} in their subsystems 𝒩¯k\bar{\mathcal{N}}_{k}, and the parallel computation of GPUs can further facilitate the sampling process by allowing each agent to generate multiple sample trajectories concurrently [25]. The continuous-time multi-agent LSOC algorithm based on the estimator (35) is summarized as Algorithm 2 in Appendix E.
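To make the estimators above concrete, the following Python sketch evaluates the generalized path value (31) for one sampled uncontrolled trajectory and then forms the Monte Carlo estimates (35)-(36). It is a minimal illustration rather than the implementation of Algorithm 2: the array layout and helper names are assumptions, and we take the control weight matrix to be the identity so that the matrix in (31) reduces to H^(k) = lambda_i * B_bar_{i(d)} B_bar_{i(d)}^T.

```python
import numpy as np

def path_value(x_d, f_d, B_d, q, phi, eps, lam):
    """Generalized path value S~ in (31) for one sampled uncontrolled trajectory.

    x_d : (K+1, Dd) directly actuated states along the trajectory
    f_d : (K, Dd)   passive drift evaluated at each x^(k)
    B_d : (K, Dd, m) control transition blocks evaluated at each x^(k)
    q   : (K,)      state-related costs q_i(x^(k), t_k)
    phi : float     terminal cost phi_i(x^(K))
    """
    K = len(q)
    S = phi / lam + eps / lam * q.sum()
    for k in range(K):
        H = lam * B_d[k] @ B_d[k].T            # H^(k) with R_bar_i = I (assumption)
        r = (x_d[k + 1] - x_d[k]) / eps - f_d[k]
        S += 0.5 * np.log(np.linalg.det(H))
        S += eps / (2 * lam) * r @ np.linalg.solve(H, r)
    return S

def mc_control(S_vals, u_tilde, B_d0, lam, R_inv):
    """Monte Carlo estimates (35)-(36) from Y sampled trajectories.

    S_vals  : (Y,)    path values of the samples
    u_tilde : (Y, Dd) initial control variables (34) of the samples
    B_d0    : (Dd, m) control transition block at the current state
    """
    # shifting by the minimum does not change the normalized weights; the
    # constant term in (30) likewise cancels in (35)
    w = np.exp(-(S_vals - S_vals.min()))
    p = w / w.sum()                                  # approximate path distribution (35)
    return lam * R_inv @ B_d0.T @ (p @ u_tilde)      # joint optimal control estimate (36)
```

In a distributed implementation, each agent would contribute its own slice of the trajectory arrays before the central agent evaluates these two functions.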

However, since the aforementioned auxiliary techniques still rely on the pure sampling estimator (35), the total number of samples required for a good approximation is not reduced, which hinders the implementation of Proposition 3 and estimator (35) when sample trajectories are expensive to generate. Various approaches have been investigated to mitigate this issue. For the sample-efficient approximation of the desirability (value) function Zi(x¯i,t)Z_{i}(\bar{x}_{i},t), parameterized desirability (value) functions, Laplace approximation [37], approximation based on the mean-field assumption [31], and a forward-backward framework based on Gaussian processes [17] can be used. Meanwhile, it is more common in practice to efficiently predict only the optimal path distribution p~i(¯i|x¯i(0),t0)\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) and the optimal control action u¯i(x¯i,t)\bar{u}^{*}_{i}(\bar{x}_{i},t), while leaving the value of the desirability function unspecified. Numerous statistical inference algorithms, such as Bayesian inference and variational inference [49], can be adopted to predict the optimal path distribution, and various policy improvement and policy search algorithms, such as PI2 [22] and REPS [23], can be employed to update the parameterized optimal control policy. Hence, in the following subsection, a sample-efficient distributed REPS algorithm is introduced to update the parameterized joint control policy for each subsystem and the MAS.

C. Relative Entropy Policy Search in MAS

Since the initial control u~i(x¯i(0),¯i,t0)\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0}) can be readily calculated from sample trajectories and network communication, we only need to determine the optimal path distribution p~i(¯i|x¯i(0),t0)\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) before evaluating the joint optimal control action from (32). To approximate this distribution more efficiently, we extend the REPS algorithm [21, 23] for path integral control in MAS. REPS is a model-based algorithm with a competitive convergence rate, whose objective is to search for a parametric policy πi(k)(u¯i(k)|x¯i(k))\pi^{(k)}_{i}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)}) that generates a path distribution p~iπ(¯i|x¯i(0),t0)\tilde{p}^{\pi}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) to approximate the optimal path distribution p~i(¯i|x¯i(0),t0)\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}). Compared with reinforcement learning algorithms and model-free policy improvement or search algorithms, REPS is built on a model, so the policies and trajectories it generates automatically satisfy the constraints of the model. Moreover, for MAS problems, the REPS algorithm allows us to consider a more general scenario in which the initial condition x¯i(t)=x¯i(0)\bar{x}_{i}(t)=\bar{x}^{(0)}_{i} is stochastic and follows a distribution μi(x¯i(0),t0)\mu_{i}(\bar{x}^{(0)}_{i},t_{0}). The REPS algorithm alternates between two steps: a) a learning step, which approximates the optimal path distribution from samples generated by the current or initial policy, i.e. p~iπ(¯i|x¯i(0),t0)p~i(¯i|x¯i(0),t0)\tilde{p}^{\pi}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})\rightarrow\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}); and b) an updating step, which updates the parametric policy πi(k)(u¯i(k)|x¯i(k))\pi^{(k)}_{i}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)}) to reproduce the desired path distribution p~iπ(¯i|x¯i(0),t0)\tilde{p}^{\pi}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) generated in step a). The algorithm terminates when the policy and the approximate path distribution converge.

During the learning step, we approximate the joint optimal distribution p~i(x¯i(0),¯i)=p~i(¯i|x¯i(0),t0)μi(x¯i(0),t0)\tilde{p}_{i}^{*}(\bar{x}^{(0)}_{i},\bar{\ell}_{i})=\tilde{p}_{i}^{*}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})\cdot\mu_{i}(\bar{x}^{(0)}_{i},t_{0}) by minimizing the relative entropy (KL-divergence) between an approximate distribution p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}^{(0)}_{i},\bar{\ell}_{i}) and p~i(x¯i(0),¯i)\tilde{p}_{i}^{*}(\bar{x}^{(0)}_{i},\bar{\ell}_{i}) subject to a few constraints and a batch of sample trajectories 𝒴i={(x¯i(0),¯i[y])}y=1,,Y\mathcal{Y}_{i}=\{(\bar{x}_{i}^{(0)},\allowbreak\bar{\ell}_{i}^{[y]})\}_{y=1,\cdots,Y} generated by the current or initial policy related to old distribution q~i(x¯i(0),¯i)\tilde{q}_{i}(\bar{x}^{(0)}_{i},\bar{\ell}_{i}) from prior iteration. Similar to the distributed MC sampling method, the computation task of sample trajectories 𝒴i\mathcal{Y}_{i} can either be exclusively assigned to the central agent ii or distributed among the agents in subsystem 𝒩¯i\mathcal{\bar{N}}_{i} and then collected by the central agent ii. However, unlike the local uncontrolled trajectories {(xi(0),i[y])}y=1,,Y\{(x_{i}^{(0)},\ell_{i}^{[y]})\}_{y=1,\cdots,Y} in the MC sampling method, which can be interchangeably used by the control algorithms of all agents in subsystem 𝒩¯i\bar{\mathcal{N}}_{i}, since the trajectory sets 𝒴i𝒟\mathcal{Y}_{i\in\mathcal{D}} of the REPS approach are generated subject to different policies πi(k)(ui(k)|x¯i(k))\pi^{(k)}_{i}(u^{(k)}_{i}|\bar{x}^{(k)}_{i}) for i𝒟i\in\mathcal{D}, the sample data in 𝒴i\mathcal{Y}_{i} can only be specifically used by the local REPS algorithm in subsystem 𝒩¯i\bar{\mathcal{N}}_{i}. We can then update the approximate path distribution p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}^{(0)}_{i},\bar{\ell}_{i}) by solving the following optimization problem

argmaxp~i(x¯i(0),¯i)\displaystyle\arg\max_{\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})} p~i(x¯i(0),¯i)[S~iε,λi(x¯i(0),¯i,t0)logp~i(x¯i(0),¯i)]𝑑x¯i(0)𝑑¯i,\displaystyle\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\left[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-\log\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\right]\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i},\allowdisplaybreaks (37)
s.t. p~i(x¯i(0),¯i)logp~i(x¯i(0),¯i)q~i(x¯i(0),¯i)dx¯i(0)d¯iδ,\displaystyle\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\log\frac{\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})}{\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})}\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}\leq\delta,\allowdisplaybreaks
p~i(x¯i(0),¯i)ψ(x¯i(0))𝑑x¯i(0)𝑑¯i=ψ^i(0),\displaystyle\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\psi(\bar{x}_{i}^{(0)})\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}=\hat{\psi}_{i}^{(0)},\allowdisplaybreaks
p~i(x¯i(0),¯i)𝑑x¯i(0)𝑑¯i=1,\displaystyle\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}=1,

where δ>0\delta>0 is a parameter confining the update rate of approximate path distribution; ψi(x¯i(0))\psi_{i}(\bar{x}_{i}^{(0)}) is a state feature vector of the initial condition; and ψ^i(0)\hat{\psi}^{(0)}_{i} is the expectation of state feature vector subject to initial distribution μi(x¯i(0))\mu_{i}(\bar{x}^{(0)}_{i}). When the initial state x¯i(0)\bar{x}_{i}^{(0)} is deterministic, (37) degenerates to an optimization problem for path distribution p~i(¯i|x¯i(0),t0)\tilde{p}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}), and the second constraint in (37) can be neglected. The optimization problem (37) can be solved analytically with the method of Lagrange multipliers, which gives

p~i(x¯i(0),¯i)=exp(1+κ+η1+κ)q~i(x¯i(0),¯i)κ1+κexp(S~iε,λi(x¯i(0),¯i,t0)+θψi(x¯i(0))1+κ),\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})=\exp\left(-\frac{1+\kappa+\eta}{1+\kappa}\right)\cdot\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})^{\frac{\kappa}{1+\kappa}}\cdot\exp\left(-\frac{\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})+\theta^{\top}\psi_{i}(\bar{x}_{i}^{(0)})}{1+\kappa}\right), (38)

where κ\kappa and θ\theta are the Lagrange multipliers that can be solved from the dual problem

argminκ,θg(κ,θ),s.t. κ>0,\arg\min_{\kappa,\theta}\ g(\kappa,\theta),\qquad\textrm{s.t. }\kappa>0, (39)

with objective function

g(κ,θ)\displaystyle g(\kappa,\theta) =κδ+θψ^i(0)+(1+κ)logq~i(x¯i(0),¯i)κ1+κ\displaystyle=\kappa\delta+\theta^{\top}\hat{\psi}_{i}^{(0)}+(1+\kappa)\cdot\log\int\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})^{\frac{\kappa}{1+\kappa}} (40)
×exp(S~iε,λi(x¯i(0),¯i,t0)+θψi(x¯i(0))1+κ)dx¯i(0)d¯i;\displaystyle\hskip 116.0pt\times\exp\left(-\frac{\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})+\theta^{\top}\psi_{i}(\bar{x}_{i}^{(0)})}{1+\kappa}\right)\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i};

and the dual variable η\eta can then be determined by substituting (38) into the normalization (last) constraint in (37). The approximation of the dual function g(κ,θ)g(\kappa,\theta) from sample trajectories 𝒴i\mathcal{Y}_{i}, the calculation of the old (or initial) path distribution q~i(¯i,x¯i(0))\tilde{q}_{i}(\bar{\ell}_{i},\bar{x}_{i}^{(0)}) from the current (or initial) policy πi(k)(u¯i(k)|x¯i(k))\pi^{(k)}_{i}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)}), and a brief interpretation of the formulation of optimization problem (37) are given in Appendix D.
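For illustration, the sketch below shows one common sample-based treatment of the dual (39)-(40): with trajectories drawn from the old distribution, the integral in (40) is replaced by an importance-weighted sample average, and the resulting per-sample weights approximate the ratio of the new to the old path distribution, which is what the policy update needs. The exact estimator used in this paper is the one derived in Appendix D; the form below, together with the helper names, is an assumption for exposition.

```python
import numpy as np
from scipy.optimize import minimize

def reps_learning_step(S, logq, psi, psi_hat, delta):
    """Sample-based REPS learning step, cf. (38)-(40).

    S       : (Y,)   generalized path values S~ of the sampled trajectories
    logq    : (Y,)   log-densities of the samples under the old distribution q~
    psi     : (Y, d) initial-state features of the samples
    psi_hat : (d,)   expected feature vector under the initial distribution
    delta   : float  KL update-rate bound
    Returns normalized weights proportional to p~/q~ plus the dual variables.
    """
    Y, d = psi.shape

    def dual(z):
        kappa, theta = z[0], z[1:]
        adv = S + logq + psi @ theta                   # exponent in (40) for q~-samples
        m = (-adv / (1.0 + kappa)).max()               # log-sum-exp shift for stability
        lse = m + np.log(np.mean(np.exp(-adv / (1.0 + kappa) - m)))
        return kappa * delta + theta @ psi_hat + (1.0 + kappa) * lse

    z0 = np.concatenate(([1.0], np.zeros(d)))
    res = minimize(dual, z0, bounds=[(1e-6, None)] + [(None, None)] * d)
    kappa, theta = res.x[0], res.x[1:]
    logw = -(S + logq + psi @ theta) / (1.0 + kappa)   # log importance weights, cf. (38)
    w = np.exp(logw - logw.max())
    return w / w.sum(), kappa, theta
```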

In the updating step of REPS algorithm, we update the policy πi(k)\pi_{i}^{(k)} such that the joint distribution p~iπ(x¯i(k+1)|x¯i(k))=p(x¯i(k+1)|x¯i(k),u¯i(k))πi(k)(u¯i(k)|x¯i(k))𝑑u¯i(k)\tilde{p}_{i}^{\pi}(\bar{x}_{i}^{(k+1)}|\bar{x}_{i}^{(k)})=\int p(\bar{x}_{i}^{(k+1)}|\bar{x}_{i}^{(k)},\bar{u}_{i}^{(k)})\cdot\pi_{i}^{(k)}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)})\ d\bar{u}^{(k)}_{i} generated by policy πi(k)\pi^{(k)}_{i} approximates the path distribution p~i(¯i|x¯i(0),t0)\tilde{p}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) generated in the learning step (37) and ultimately converges to the optimal path distribution p~i(¯i|x¯i(0),t0)\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) in (33). More explanations on distribution p~iπ\tilde{p}_{i}^{\pi} and old distribution q~i\tilde{q}_{i} can be found in (84) of Appendix D. In order to provide a concrete example, we consider a set of parameterized time-dependent Gaussian policies that are linear in states, i.e. πi(k)(u¯i(k)|x¯i(k),a^i(k),b^i(k),Σ^i(k))𝒩(u¯i(k)|a^i(k)x¯i(k)+b^i(k),Σ^i(k))\pi_{i}^{(k)}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)},\hat{a}_{i}^{(k)},\hat{b}_{i}^{(k)},\hat{\Sigma}^{(k)}_{i})\sim\mathcal{N}(\bar{u}_{i}^{(k)}|\hat{a}_{i}^{(k)}\bar{x}_{i}^{(k)}+\hat{b}_{i}^{(k)},\hat{\Sigma}_{i}^{(k)}) at time step tk<tft_{k}<t_{f}, where χi(k)=(a^i(k),b^i(k),Σ^i(k))\chi_{i}^{(k)}=(\hat{a}_{i}^{(k)},\hat{b}_{i}^{(k)},\hat{\Sigma}_{i}^{(k)}) are the policy parameters to be updated. For simplicity, one can also construct a stationary Gaussian policy π^i(u¯i|x¯i,a^i,b^i,Σ^i)𝒩(u¯i|a^ix¯i+b^i,Σ^i)\hat{\pi}_{i}(\bar{u}_{i}|\bar{x}_{i},\hat{a}_{i},\hat{b}_{i},\hat{\Sigma}_{i})\sim\mathcal{N}(\bar{u}_{i}|\hat{a}_{i}\bar{x}_{i}+\hat{b}_{i},\hat{\Sigma}_{i}), which was employed in [50] and shares the same philosophy as the parameter average techniques discussed in [31, 22]. The parameters χi(k)\chi_{i}^{*(k)} in policy πi(k)\pi_{i}^{(k)} can be updated by minimizing the relative entropy between p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) and p~iπ(x¯i(0),¯i)\tilde{p}^{\pi}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}), i.e.

χi(k)=argmaxχi(k)p~i(x¯i(0),¯i)logπi(k)(u¯i(k)|x¯i(k),χi(k))𝑑x¯i(0)𝑑¯i,\chi_{i}^{*(k)}=\arg{\textstyle\max_{\chi_{i}^{(k)}}}\ \int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\cdot\log{\pi}_{i}^{(k)}(\bar{u}_{i}^{*(k)}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)})\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}, (41)

where u¯i(k)\bar{u}_{i}^{*(k)} is the optimal control action. More detailed interpretation on policy update as well as approximation of (41) from sample data 𝒴i\mathcal{Y}_{i} can be found in Appendix D. When implementing the REPS algorithm introduced in this subsection, we initialize each iteration by generating a batch of sample trajectories 𝒴i\mathcal{Y}_{i} from the current (or initial) policy, which was updated to restore the old approximate path distribution q~i(x¯i(0),¯i)\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) in the prior iteration. With these sample trajectories, we update the path distribution through (37)-(40) and update the policy again through (41) till convergence of the algorithm. This REPS-based continuous-time MAS control algorithm is also summarized as Algorithm 3 in Appendix E.
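For the time-dependent linear-Gaussian policy class above, the update (41) reduces to a weighted maximum-likelihood fit at each time step. The sketch below illustrates this with a weighted least-squares regression, assuming the per-sample weights from the learning step are available; it is a sketch of the idea, not the exact procedure detailed in Appendix D.

```python
import numpy as np

def fit_linear_gaussian_policy(X, U, w):
    """Weighted ML update of a per-step policy pi(u|x) = N(a x + b, Sigma), cf. (41).

    X : (Y, n) joint states observed at this time step in the sampled trajectories
    U : (Y, m) joint controls applied at this time step
    w : (Y,)   normalized weights from the learning step
    """
    Xa = np.hstack([X, np.ones((len(X), 1))])        # append a bias column for b
    W = np.diag(w)
    theta = np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ U)   # weighted least squares
    a, b = theta[:-1].T, theta[-1]
    resid = U - Xa @ theta
    Sigma = resid.T @ W @ resid / w.sum()            # weighted residual covariance
    return a, b, Sigma
```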

D. Local Control Action

The preceding two distributed LSOC algorithms will return either a joint optimal control action u¯i(x¯i,t)\bar{u}^{*}_{i}(\bar{x}_{i},t) or a joint optimal control policy πi(u¯i(t)|x¯i(t),χi)\pi_{i}^{*}(\bar{u}^{*}_{i}(t)|\bar{x}_{i}(t),\chi_{i}^{*}) for subsystem 𝒩¯i\bar{\mathcal{N}}_{i} at joint state x¯i(t)\bar{x}_{i}(t). Similar to the treatment of distributed LSMDP in discrete-time MAS, only the central agent ii selects or samples its local control action ui(x¯i,t)u^{*}_{i}(\bar{x}_{i},t) from u¯i(x¯i,t)\bar{u}^{*}_{i}(\bar{x}_{i},t) or πi(u¯i(t)|x¯i(t),χi)\pi_{i}^{*}(\bar{u}^{*}_{i}(t)|\bar{x}_{i}(t),\chi_{i}^{*}), while other agents j𝒟\{i}j\in\mathcal{D}\backslash\{i\} are guided by the control law, u¯j(x¯j,t)\bar{u}^{*}_{j}(\bar{x}_{j},t) or πj(u¯j(t)|x¯j(t),χj)\pi_{j}^{*}(\bar{u}^{*}_{j}(t)|\bar{x}_{j}(t),\chi_{j}^{*}), computed in their own subsystems 𝒩¯j\bar{\mathcal{N}}_{j}.

For distributed LSOC based on the MC sampling estimator (36), the local optimal control action ui(x¯i,t)u^{*}_{i}(\bar{x}_{i},t) can be directly selected from the joint control action u¯i(x¯i,t)\bar{u}^{*}_{i}(\bar{x}_{i},t). For distributed LSOC based on the REPS method, the recursive algorithm generates control policy πi(k)(u¯i(k)|x¯i(k),χi(k))𝒩(u¯i(k)|a^i(k)x¯i(k)+b^i(k),Σ^i(k))\pi^{(k)}_{i}(\bar{u}^{(k)}_{i}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)})\sim\mathcal{N}(\bar{u}_{i}^{(k)}|\hat{a}_{i}^{(k)}\bar{x}_{i}^{(k)}+\hat{b}_{i}^{(k)},\hat{\Sigma}_{i}^{(k)}) for each time step t=t0<t1<<tK=tft=t_{0}<t_{1}<\cdots<t_{K}=t_{f} per iteration. When generating the trajectory set 𝒴i\mathcal{Y}_{i} from the current control policy πi(k)(u¯i(k)|x¯i(k),χi(k))\pi_{i}^{(k)}(\bar{u}^{(k)}_{i}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)}), we sample the local control action of agent ii from the marginal distribution πi(k)(ui(k)|x¯i(k),χi(k))=j𝒩iπi(k)(ui(k),uj𝒩i(k)|x¯i(k),χi(k))\pi_{i}^{(k)}(u^{(k)}_{i}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)})=\sum_{j\in\mathcal{N}_{i}}\pi^{(k)}_{i}({u}^{(k)}_{i},{u}^{(k)}_{j\in\mathcal{N}_{i}}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)}). After the convergence of local REPS algorithm in subsystem 𝒩¯i\bar{\mathcal{N}}_{i}, the local optimal control action ui(x¯i,t)u_{i}^{*}(\bar{x}_{i},t) of the central agent ii at state xi(t)x_{i}(t) can be sampled from the following marginal distribution

πi(0)(ui(0)|x¯i(0),χi(0))=j𝒩iπi(0)(ui(0),uj𝒩i(0)|x¯i(0),χi(0)),\pi_{i}^{*(0)}(u^{*(0)}_{i}|\bar{x}_{i}^{(0)},\chi_{i}^{*(0)})=\sum_{j\in\mathcal{N}_{i}}\pi^{*(0)}_{i}({u}^{*(0)}_{i},{u}^{*(0)}_{j\in\mathcal{N}_{i}}|\bar{x}_{i}^{(0)},\chi_{i}^{*(0)}), (42)

which minimizes the joint cost-to-go function (12) in subsystem 𝒩¯i\mathcal{\bar{N}}_{i} and only relies on the local observation of agent ii.
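Since the joint policy above is Gaussian, the marginal (42) is itself Gaussian and is obtained simply by selecting agent i's block of the joint mean and covariance. A minimal sketch is given below, with an assumed index layout for agent i's control components.

```python
import numpy as np

def sample_local_control(x_bar, a_hat, b_hat, Sigma_hat, idx_i, rng=None):
    """Sample agent i's local control from the marginal (42) of the joint Gaussian policy.

    a_hat, b_hat, Sigma_hat : parameters of the converged joint policy
                              N(u_bar | a_hat x_bar + b_hat, Sigma_hat)
    idx_i : integer indices of agent i's control components within u_bar
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = a_hat @ x_bar + b_hat
    # marginalizing a Gaussian keeps the corresponding sub-mean and sub-covariance
    mean_i = mean[idx_i]
    Sigma_i = Sigma_hat[np.ix_(idx_i, idx_i)]
    return rng.multivariate_normal(mean_i, Sigma_i)
```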

Generalization with Compositionality

While several accessory techniques have been introduced in Section 3.1 and Section 3.2 to facilitate the computation of optimal control law, we have to repeat all the aforementioned procedures when a new optimal control task with different terminal cost or preferred exit state is assigned. A possible approach to solve this problem and generalize LSOC is to resort to contextual learning, such as the contextual policy search algorithm [50], which adapts to different terminal conditions by introducing hyper-parameters and an additional layer of learning algorithm. However, this supplementary learning algorithm will undoubtedly increase the complexity and computational task for the entire control scheme. Thanks to the linearity of LSOC problems, a task-optimal controller for a new (unlearned) task can also be constructed from existing (learned) controllers by utilizing the compositionality principle [31, 18, 51, 17]. Suppose that we have FF previously learned (component) LSOC problems and a new (composite) LSOC problem with different terminal cost from the previous FF problems. Apart from different terminal costs, these F+1F+1 multi-agent LSOC problems share the same communication network 𝒢\mathcal{G}, state-related cost q(x¯i)q(\bar{x}_{i}), exit time tft_{f}, and dynamics, which means that the state space, control space, and sets of interior and boundary states of these F+1F+1 problems are also identical. Let ϕi{f}(x¯i)\phi_{i}^{\{f\}}(\bar{x}_{i}) with f{1,2,,F}f\in\{1,2,\cdots,F\} be the terminal costs of FF component problems in subsystem 𝒩¯i\bar{\mathcal{N}}_{i}, and ϕi(x¯i)\phi_{i}(\bar{x}_{i}) denotes the terminal cost of composite or new problem in subsystem 𝒩¯i\bar{\mathcal{N}}_{i}. We can efficiently construct a joint optimal control action u¯i(x¯i|x¯i)\bar{u}^{*}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i}) for the new task from the existing component controllers u¯i{f}(x¯i|x¯i)\bar{u}^{*\{f\}}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i}) by the compositionality principle.

For multi-agent LSMDP in discrete-time MAS, when there exists a set of weights ω¯i{f}\bar{\omega}_{i}^{\{f\}} such that

ϕi(x¯i)=log[f=1Fω¯i{f}exp(ϕi{f}(x¯i))],\phi_{i}(\bar{x}_{i})=-\log\bigg{[}\sum_{f=1}^{F}\bar{\omega}_{i}^{\{f\}}\cdot\exp\Big{(}-\phi_{i}^{\{f\}}(\bar{x}_{i})\Big{)}\bigg{]},

by the definition of the discrete-time desirability function (17), it follows that

Zi(x¯i)=f=1Fω¯i{f}Zi{f}(x¯i)Z_{i}(\bar{x}_{i})=\sum_{f=1}^{F}\bar{\omega}_{i}^{\{f\}}\cdot Z_{i}^{\{f\}}(\bar{x}_{i}) (43)

for all x¯i¯i\bar{x}_{i}\in\bar{\mathcal{B}}_{i}. Due to the linear relation in (20), identity (43) should also hold for all interior states x¯i¯i\bar{x}_{i}\in\bar{\mathcal{I}}_{i}. Substituting (43) into the optimal control action (19), the task-optimal control action for the new task with terminal cost ϕi(x¯i)\phi_{i}(\bar{x}_{i}) can be immediately generated from the existing controllers

u¯i(x¯i|x¯i)=f=1FW¯i{f}(x¯i)u¯i{f}(x¯i|x¯i),\bar{u}^{*}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})=\sum_{f=1}^{F}\bar{W}_{i}^{\{f\}}(\bar{x}_{i})\cdot\bar{u}^{*\{f\}}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i}), (44)

where W¯i{f}(x¯i)=ω¯i{f}𝒲i{f}(x¯i)/(e=1Fω¯i{e}𝒲i{e}(x¯i))\bar{W}_{i}^{\{f\}}(\bar{x}_{i})=\bar{\omega}_{i}^{\{f\}}\mathcal{W}^{\{f\}}_{i}(\bar{x}_{i})/(\sum_{e=1}^{F}\bar{\omega}_{i}^{\{e\}}\mathcal{W}_{i}^{\{e\}}(\bar{x}_{i})) and 𝒲i{f}(x¯i)=x¯ip¯i(x¯i|x¯i)Zi{f}(x¯i)\mathcal{W}_{i}^{\{f\}}(\bar{x}_{i})=\sum_{\bar{x}^{\prime}_{i}}\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z^{\{f\}}_{i}(\bar{x}^{\prime}_{i}). For LSOC in MAS, the compositionality principle can be used not only for the generalization of controllers within the same subsystem 𝒩¯i\bar{\mathcal{N}}_{i}, but also for the generalization of control laws across the network. For any two subsystems that satisfy the aforementioned compatibility conditions, the task-optimal controller of one subsystem can be directly constructed from the existing computational results of the other subsystem by resorting to (44).
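A small numerical sketch of the composition (43)-(44) over a discrete joint state space is given below; it assumes the component desirability functions and controllers are stored as dense arrays over an enumerated state space, which is only practical for small subsystems such as the grid examples in Section 4. Array names are illustrative.

```python
import numpy as np

def composite_control(Z_f, omega, p_bar, u_f):
    """Compose a task-optimal controller (44) from F learned component controllers.

    Z_f   : (F, S)    component desirability functions over the joint states
    omega : (F,)      weights satisfying (43) for the new terminal cost
    p_bar : (S, S)    joint passive transition matrix, row x to column x'
    u_f   : (F, S, S) component optimal transition policies u^{*{f}}(x'|x)
    Returns the composite policy u^*(x'|x) as an (S, S) row-stochastic matrix.
    """
    W_cal = p_bar @ Z_f.T                                   # W^{f}(x) = sum_x' p(x'|x) Z^{f}(x')
    W = (omega * W_cal) / (omega * W_cal).sum(axis=1, keepdims=True)   # mixture weights per state
    return np.einsum('sf,fst->st', W, u_f)                  # state-wise convex combination (44)
```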

The generalization of continuous-time LSOC problem can be readily inferred by symmetry. For F+1F+1 continuous-time LSOC problems that satisfy the compatible conditions, when there exist scalars λi\lambda_{i}, λi{f}\lambda_{i}^{\{f\}} and weights ω¯i{f}\bar{\omega}_{i}^{\{f\}} such that

ϕi(x¯i)=λilog[f=1Fω¯i{f}exp(1λi{f}ϕi{f}(x¯i))],\phi_{i}(\bar{x}_{i})=-\lambda_{i}\log\bigg{[}\sum_{f=1}^{F}\bar{\omega}_{i}^{\{f\}}\cdot\exp\Big{(}-\frac{1}{\lambda_{i}^{\{f\}}}\cdot\phi_{i}^{\{f\}}(\bar{x}_{i})\Big{)}\bigg{]}, (45)

by the definition of continuous-time desirability function (23), we readily have Zi(x¯i,tf)=f=1Fω¯i{f}Zi{f}(x¯i,tf)Z_{i}(\bar{x}_{i},t_{f})=\sum_{f=1}^{F}\bar{\omega}_{i}^{\{f\}}\cdot Z_{i}^{\{f\}}(\bar{x}_{i},t_{f}) for all x¯i¯i\bar{x}_{i}\in\bar{\mathcal{B}}_{i}. Since Zi{f}(x¯i,τ)Z^{\{f\}}_{i}(\bar{x}_{i},\tau) is the solution to linearized stochastic HJB equation (26), the linear combination of desirability function

Zi(x¯i,τ)=f=1Fω¯i{f}Zi{f}(x¯i,τ)Z_{i}(\bar{x}_{i},\tau)=\sum_{f=1}^{F}\bar{\omega}_{i}^{\{f\}}\cdot Z_{i}^{\{f\}}(\bar{x}_{i},\tau) (46)

holds everywhere from tt to tft_{f} on condition that (46) holds for all terminal states ¯i\bar{\mathcal{B}}_{i}, which is guaranteed by (45). Substituting (46) into the continuous-time optimal controller (28), the task-optimal composite controller u¯i(x¯i)\bar{u}_{i}^{*}(\bar{x}_{i}) can be constructed from the component controllers u¯i{f}(x¯i)\bar{u}_{i}^{\{f\}*}(\bar{x}_{i}) by

u¯i(x¯i,t)=f=1FW¯i{f}(x¯i,t)u¯i{f}(x¯i,t),\bar{u}_{i}^{*}(\bar{x}_{i},t)=\sum_{f=1}^{F}\bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)\cdot\bar{u}^{\{f\}*}_{i}(\bar{x}_{i},t),

where \bar{W}_{i}^{\{f\}}(\bar{x}_{i},t)=\bar{\omega}_{i}^{\{f\}}Z_{i}^{\{f\}}(\bar{x}_{i},t)/(\sum_{e=1}^{F}\bar{\omega}_{i}^{\{e\}}Z_{i}^{\{e\}}(\bar{x}_{i},t)). However, strictly speaking, this generalization technique does not apply to control policies obtained from trajectory optimization or policy search methods, such as PI2 [22] and the REPS method in Section 3.2, since these policy approximation methods usually only predict the optimal path distribution or the optimal control policy, while leaving the value or desirability function unknown.
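For completeness, the continuous-time composition can be sketched analogously to the discrete-time case: the composite control is a state- and time-dependent convex combination of the component controls. The callable-based representation below is an illustrative assumption.

```python
import numpy as np

def composite_control_ct(x_bar, t, omega, Z_f, u_f):
    """Continuous-time composite controller from component desirability functions.

    omega : (F,) weights in (45)
    Z_f   : list of F callables, Z_f[f](x, t) -> component desirability value
    u_f   : list of F callables, u_f[f](x, t) -> component optimal control vector
    """
    z = np.array([omega[f] * Z_f[f](x_bar, t) for f in range(len(Z_f))])
    W = z / z.sum()                                   # composite weights W_bar^{f}(x, t)
    return sum(W[f] * u_f[f](x_bar, t) for f in range(len(u_f)))
```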

Illustrative Examples

Distributed LSOC can be deployed on a variety of MASs in practice, such as distributed HVAC systems in smart buildings, cooperative unmanned aerial vehicle (UAV) teams, and synchronous wind farms. Motivated by the experiments in [31, 25, 30], we demonstrate the distributed LSOC algorithms with a cooperative UAV team in a cluttered environment. Compared with these preceding experiments, the distributed LSOC algorithms in this paper allow us to consider an explicit and simpler communication network underlying the UAV team, and the joint cost functions between neighboring agents can be constructed and optimized with less computation. Consider a cooperative UAV team consisting of three agents as illustrated in Figure 2. UAV 1 and 2, connected by a solid line, are strongly coupled via their joint cost functions q1(x¯1)q_{1}(\bar{x}_{1}) and q2(x¯2)q_{2}(\bar{x}_{2}), and their local controllers u1(x¯1)u_{1}(\bar{x}_{1}) and u2(x¯2)u_{2}(\bar{x}_{2}) are designed to drive UAV 1 and 2 towards their exit state while minimizing the distance between the two UAVs and avoiding obstacles. These setups are useful when multiple UAVs need to fly close together towards a common destination, e.g. when carrying and delivering a heavy package together or maintaining communication channels/networks. By contrast, although UAV 3 is also connected with UAV 1 and 2 via dotted lines, the immediate cost function of UAV 3 is designed to be independent of (fully factorized from) the states of UAV 1 and 2 in each subsystem, so that UAV 3 is only loosely coupled with UAVs 1 and 2 through their terminal cost functions, which recovers the scenario considered in [31, 25]. The local controller of UAV 3 is then synthesized to guide the UAV to its exit state with minimal cost. In the following subsections, we verify both the discrete-time and continuous-time distributed LSOC algorithms in this cooperative UAV team scenario.

Refer to caption
Figure 2: Communication network of UAV team. UAV 1 and 2 are strongly coupled through their running cost functions. UAV 3 is loosely coupled with UAV 1 and 2 through their terminal cost functions.

UAVs with Discrete-Time Dynamics

In order to verify the distributed LSMDP algorithm, we consider a cooperative UAV team described by the probability model introduced in Section 2.2. The flight environment is described by a 5×55\times 5 grid with total 2525 cells to locate the positions of UAVs, and the shaded cells in Figure 3(a) represent the obstacles. Each UAV is described by a state vector xi=[ri,ci]x_{i}=[r_{i},c_{i}]^{\top}, where rir_{i} and ci{1,2,3,4,5}c_{i}\in\{1,2,3,4,5\} are respectively the row and column indices locating the ithi^{\rm th} UAV. The cell with circled index ii denotes the initial state of the ithi^{\rm th} UAV, and the one with boxed index ii indicates the exit state of the ithi^{\rm th} UAV. The passive transition probabilities of interior, edge, and corner cells are shown in Figure 3(b), Figure 3(c), and Figure 3(d), respectively, and the passive probability of a UAV transiting to an adjacent cell can be interpreted as the result of random winds. To fulfill the requirements of aforementioned scenario, the controlled transition distributions should i) drive UAV 1 and 2 to their terminal state (5,5)(5,5) while shortening the distance between two UAVs, and ii) guide UAV 3 to terminal state (1,5)(1,5) with minimum cost, which is related to obstacle avoidance, control cost, length of path, and flying time.

Refer to caption
Figure 3: Flight environment and UAV’s passive transition dynamics. (a) Flight environment, initial states, and exit states of UAVs. (b) The passive transition probability of UAVs in the interior cells. (c) The passive transition probability of UAVs in the edge cells. (d) The passive transition probability of UAVs in the corner cells.

In order to realize these objectives, we consider the state-related cost functions as follows

q1(x¯1)\displaystyle q_{1}(\bar{x}_{1}) =w12(|r1r2|+|c1c2|)+o1(x1)o2(x2),\displaystyle=w_{12}\cdot(|r_{1}-r_{2}|+|c_{1}-c_{2}|)+o_{1}(x_{1})\cdot o_{2}(x_{2}),\allowdisplaybreaks
q2(x¯2)\displaystyle q_{2}(\bar{x}_{2}) =w21(|r2r1|+|c2c1|)+o1(x1)o2(x2),\displaystyle=w_{21}\cdot(|r_{2}-r_{1}|+|c_{2}-c_{1}|)+o_{1}(x_{1})\cdot o_{2}(x_{2}),\allowdisplaybreaks (47)
q3(x¯3)\displaystyle q_{3}(\bar{x}_{3}) =o3(x3),\displaystyle=o_{3}(x_{3}),\allowdisplaybreaks

where in the following examples, the weights on the relative distance x1x21\|x_{1}-x_{2}\|_{1}, defined by the Manhattan distance, are w12=w21=3.5w_{12}=w_{21}=3.5; the state and obstacle cost is oi(xi)=30o_{i}(x_{i})=30 when the ithi^{\rm th} UAV is in an obstacle cell, and oi(xi)=2.2o_{i}(x_{i})=2.2 when UAV ii is in a regular cell; and the terminal cost is qi(x¯i)=0q_{i}(\bar{x}_{i})=0 for x¯i¯i\bar{x}_{i}\in\bar{\mathcal{B}}_{i}. The state-related costs q1(x¯1)q_{1}(\bar{x}_{1}) and q2(x¯2)q_{2}(\bar{x}_{2}), which involve the states of both UAV 1 and 2, measure the joint performance of the two UAVs, while the cost q3(x¯3)q_{3}(\bar{x}_{3}) only contains the state of UAV 3. Without the costs on relative distance, i.e. w12=w21=0w_{12}=w_{21}=0, this distributed LSMDP problem degenerates to three independent shortest-path problems with obstacle avoidance, as considered in [31, 25, 30], and the optimal trajectories are straightforward. (The shortest path for UAV 1 is (1,5)(1,4)(2,4)(3,4)(3,5)(4,5)(5,5)(1,5)\rightarrow(1,4)\rightarrow(2,4)\rightarrow(3,4)\rightarrow(3,5)\rightarrow(4,5)\rightarrow(5,5); there are three different shortest paths for UAV 2, and the shortest path for UAV 3 is shown in Figure 4(c).) The desirability function Zi(x¯i)Z_{i}(\bar{x}_{i}) and local optimal control distribution ui(|x¯i)u^{*}_{i}(\cdot|\bar{x}_{i}) can then be computed by following Algorithm 1 in Appendix E. Figure 4 shows the maximum likelihood controlled trajectories of the UAV team subject to the passive dynamics in Figure 3 and the cost functions (47). Some readers may wonder why UAV 1 in Figure 4(a) stays in cell (1,4)(1,4) for two consecutive time steps rather than moving forward to (1,3)(1,3) and then flying alongside UAV 2, which would yield a lower state-related cost. Although this alternative corresponds to a lower state-related cost, it is not necessarily the trajectory minimizing the control cost and the overall immediate cost in (7). To verify this reasoning, we increase the passive transition probabilities pi(|x¯i)p_{i}(\cdot|\bar{x}_{i}) beyond certain thresholds, as shown in Figure 5(a)-(c), which can be interpreted as stronger winds in reality. With the altered passive dynamics and the cost functions in (47), the maximum likelihood controlled trajectory of UAV 1 is shown in Figure 5(d), which confirms the preceding reasoning. The trajectories of UAV 2 and 3 subject to the altered passive dynamics are identical to the results in Figure 4. To provide an intuitive view of the efficiency improvement of our distributed LSMDP algorithm, Figure 6 presents the average data size and computational complexity on each UAV (|𝒮i|=25|\mathcal{S}_{i}|=25), gauged by the row number mm of the matrices and vectors in (20) and (22), subject to the centralized programming (20) and parallel programming (22) in different communication networks, such as line, ring, complete binary tree, and fully connected topologies, which recovers the scenario with exponential complexity considered in [31, 30].
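For reference, the cost functions (47) with the parameter values used in this example can be written compactly as in the sketch below; the obstacle set corresponds to the shaded cells in Figure 3(a), and the function and variable names are illustrative.

```python
def state_cost(x1, x2, x3, obstacles, w=3.5):
    """Joint state-related costs (47) for the 5x5 grid example.

    x1, x2, x3 : (row, col) tuples for UAVs 1-3
    obstacles  : set of (row, col) cells marked as obstacles
    """
    def o(x):                                   # obstacle/state cost o_i(x_i)
        return 30.0 if x in obstacles else 2.2

    manhattan = abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])
    q1 = w * manhattan + o(x1) * o(x2)
    q2 = w * manhattan + o(x1) * o(x2)          # symmetric since w12 = w21
    q3 = o(x3)
    return q1, q2, q3
```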

Refer to caption
Figure 4: UAVs’ trajectories subject to optimal control policy. (a) Controlled trajectory of UAV 1: (1,5)(1,4)(1,4)(1,4)(2,4)(3,4)(3,5)(4,5)(5,5)(1,5)\rightarrow(1,4)\rightarrow(1,4)\rightarrow(1,4)\rightarrow(2,4)\rightarrow(3,4)\rightarrow(3,5)\rightarrow(4,5)\rightarrow(5,5). (b) Controlled trajectory of UAV 2: (1,1)(1,2)(1,3)(1,4)(2,4)(3,4)(3,5)(4,5)(5,5)(1,1)\rightarrow(1,2)\rightarrow(1,3)\rightarrow(1,4)\rightarrow(2,4)\rightarrow(3,4)\rightarrow(3,5)\rightarrow(4,5)\rightarrow(5,5). (c) Controlled trajectory of UAV 3: (1,5)(1,4)(2,4)(3,4)(3,3)(4,3)(5,3)(5,2)(5,1)(1,5)\rightarrow(1,4)\rightarrow(2,4)\rightarrow(3,4)\rightarrow(3,3)\rightarrow(4,3)\rightarrow(5,3)\rightarrow(5,2)\rightarrow(5,1).
Refer to caption
Figure 5: Altered passive transition dynamics and controlled trajectory of UAV 1 subject to the new dynamics. (a-c) are respectively the passive transition probabilities for interior, edge, and corner cells. (d) Controlled trajectory of UAV 1 subject to the altered passive dynamics: (1,5)(1,4)(1,3)(1,4)(2,4)(3,4)(3,5)(4,5)(5,5)(1,5)\rightarrow(1,4)\rightarrow(1,3)\rightarrow(1,4)\rightarrow(2,4)\rightarrow(3,4)\rightarrow(3,5)\rightarrow(4,5)\rightarrow(5,5).
Refer to caption
Figure 6: Relationship between amount of agents NN and data dimension and computational complexity O(m)O(m). Vertical axis is the value of mm. Solid and dashed lines respectively represent the results attained by centralized algorithm (20) and parallel algorithm (22). Blue asterisk, red cross, green square, and magenta triangle respectively denote the results associated with fully connected, line, ring, and complete binary tree communication networks.

UAVs with Continuous-Time Dynamics

In order to verify the continuous-time distributed LSOC algorithms, consider the continuous-time UAV dynamics described by the following equation [31, 52]:

(dxidyidvidφi)=(vicosφivisinφi00)dt+(00001001)[(uiωi)dt+(σi00νi)dwi],\left(\begin{matrix}dx_{i}\\ dy_{i}\\ dv_{i}\\ d\varphi_{i}\end{matrix}\right)=\left(\begin{matrix}v_{i}\cos\varphi_{i}\\ v_{i}\sin\varphi_{i}\\ 0\\ 0\end{matrix}\right)dt+\left(\begin{matrix}0&0\\ 0&0\\ 1&0\\ 0&1\end{matrix}\right)\left[\left(\begin{matrix}u_{i}\\ \omega_{i}\end{matrix}\right)dt+\left(\begin{matrix}\sigma_{i}&0\\ 0&\nu_{i}\end{matrix}\right)dw_{i}\right], (48)

where (xi,yi)(x_{i},y_{i}), viv_{i}, and φi\varphi_{i} respectively denote the position, forward velocity, and direction angle of the ithi^{\rm th} UAV; forward acceleration uiu_{i} and angular velocity ωi\omega_{i} are the control inputs, and disturbance wiw_{i} is a standard Brownian motion. Control transition matrix B¯i(xi)\bar{B}_{i}(x_{i}) is a constant matrix in (48), and we set noise level parameters σi=0.1\sigma_{i}=0.1 and νi=0.05\nu_{i}=0.05 in simulation. The communication network underlying the UAV team as well as the control objectives are the same as in the discrete-time scenario. Instead of adopting the Manhattan distance, the distance in continuous-time problem is associated with L2L_{2} norm. Hence, we consider the state-related costs defined as follows

q1(x¯1)=w11((x1,y1)(x1tf,y1tf)2d1max)+w12((x1,y1)(x2,y2)2d12max),q2(x¯2)=w22((x2,y2)(x2tf,y2tf)2d2max)+w21((x2,y2)(x1,y1)2d21max),q3(x¯3)=w33((x3,y3)(x3tf,y3tf)2d3max),\begin{split}q_{1}(\bar{x}_{1})&=w_{11}\cdot(\|(x_{1},y_{1})-(x_{1}^{t_{f}},y_{1}^{t_{f}})\|_{2}-d^{\max}_{1})+w_{12}\cdot(\|(x_{1},y_{1})-(x_{2},y_{2})\|_{2}-d^{\max}_{12}),\\ q_{2}(\bar{x}_{2})&=w_{22}\cdot(\|(x_{2},y_{2})-(x_{2}^{t_{f}},y_{2}^{t_{f}})\|_{2}-d^{\max}_{2})+w_{21}\cdot(\|(x_{2},y_{2})-(x_{1},y_{1})\|_{2}-d^{\max}_{21}),\\ q_{3}(\bar{x}_{3})&=w_{33}\cdot(\|(x_{3},y_{3})-(x_{3}^{t_{f}},y_{3}^{t_{f}})\|_{2}-d^{\max}_{3}),\end{split} (49)

where wiiw_{ii} is the weight on distance between the ithi^{\rm th} UAV and its exit position; wijw_{ij} is the weight on distance between the ithi^{\rm th} and jthj^{\rm th} UAVs; dimaxd^{\max}_{i} usually denotes the initial distance between the ithi^{\rm th} UAV and its destination, and dijmaxd_{ij}^{\max} denotes the initial distance between the ithi^{\rm th} and jthj^{\rm th} UAVs. The parameters dimaxd^{\max}_{i} and dijmaxd_{ij}^{\max} are chosen to regularize the numerical accuracy and stability of algorithms.
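To make the continuous-time model concrete, a minimal Euler-Maruyama discretization of the single-UAV dynamics (48) is sketched below with the nominal noise levels sigma_i = 0.1 and nu_i = 0.05; this is only an illustrative simulation step, not the sampling routine of Algorithms 2 and 3.

```python
import numpy as np

def uav_step(state, u, omega, dt, sigma=0.1, nu=0.05, rng=None):
    """One Euler-Maruyama step of the UAV dynamics (48).

    state    : (x, y, v, phi) position, forward velocity, and heading angle
    u, omega : forward acceleration and angular velocity inputs
    """
    rng = np.random.default_rng() if rng is None else rng
    x, y, v, phi = state
    dw = rng.normal(size=2) * np.sqrt(dt)       # Brownian increments
    return np.array([
        x + v * np.cos(phi) * dt,
        y + v * np.sin(phi) * dt,
        v + u * dt + sigma * dw[0],
        phi + omega * dt + nu * dw[1],
    ])
```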

To show an intuitive improvement brought by the joint state-related cost, we first verify the continuous-time LSOC algorithm in a simple flight environment without obstacle. Consider three UAVs forming the network in Figure 2 and with initial states x10=(5,5,0.5,0)x_{1}^{0}=(5,5,0.5,0)^{\top}, x20=(5,45,0.5,0)x_{2}^{0}=(5,45,0.5,0)^{\top}, x30=(5,25,0.5,0)x_{3}^{0}=(5,25,0.5,0)^{\top} and identical exit state xitf=(45,25,0,0)x_{i}^{t_{f}}=(45,25,0,0)^{\top} for i=1,2,3i=1,2,3. The exit time is tf=25t_{f}=25, and the length of each control cycle is 0.20.2. When sampling trajectory roll-outs 𝒴i\mathcal{Y}_{i}, the time interval from tt to tft_{f} is partitioned into K=7K=7 intervals of equal length ε\varepsilon, i.e. εK=tft\varepsilon K=t_{f}-t, until ε\varepsilon becomes less than 0.2. Meanwhile, to make the exploration process more aggressive, we increase the noise level parameters σi=0.75\sigma_{i}=0.75 and νi=0.65\nu_{i}=0.65 when sampling trajectory roll-outs. The size of data set 𝒴i\mathcal{Y}_{i} for estimator (35) in each control cycle is 400400 sample trajectories, which can be generated concurrently by GPU [25]. Control weight matrices R¯i\bar{R}_{i} are selected as identity matrices. Guided by the sampling-based distributed LSOC algorithm, Algorithm 2 in Appendix E, UAV trajectories and the relative distance between UAV 1 and 2 in condition of both joint and independent state-related costs are presented in Figure 7. Letting update rate constraint be δi=25\delta_{i}=25 and the size of trajectory set be |𝒴i|=400|\mathcal{Y}_{i}|=400 and 150150 for the initial and subsequent policy iterations in every control period, simulation results obtained from the REPS-based distributed LSOC algorithm, Algorithm 3 in Appendix E, are given in Figure 8. Figure 7 and Figure 8 imply that joint state-related costs can significantly influence the trajectories and shorten the relative distance between UAV 1 and 2, which fulfills our preceding control requirements.

Refer to caption
Figure 7: UAV trajectories and the distance between UAV 1 and 2 from 50 trials subject to the sampling-based distributed LSOC algorithm. (a) Trajectories of UAVs. Red, blue and green lines are trajectories for UAV 1, 2 and 3, respectively. Dashed lines are from a trial with factorized (or independent) state costs (w11=w22=0.75w_{11}=w_{22}=0.75, w33=1w_{33}=1, w12=w21=0w_{12}=w_{21}=0), and solid (transparent) lines are from trials with joint state costs (w11=w22=0.75w_{11}=w_{22}=0.75, w33=1w_{33}=1, w12=w21=1.5w_{12}=w_{21}=1.5). (b) Distance between UAV 1 and 2. Red dashed line and blue solid line are respectively the mean distances from trials with factorized state cost and joint state cost. Height of strip represents one standard deviation.
Refer to caption
Figure 8: UAV trajectories and distance between UAV 1 and 2 from 50 trials subject to the REPS-based distributed LSOC algorithm. (a) Trajectories of UAVs. Red, blue and green lines are trajectories for UAV 1, 2 and 3, respectively. Dashed lines are from a trial with factorized state costs (w11=w22=w33=0.1w_{11}=w_{22}=w_{33}=0.1, w12=w21=0w_{12}=w_{21}=0), and solid (transparent) lines are from trials with joint state costs (w11=w22=w33=0.1w_{11}=w_{22}=w_{33}=0.1, w12=w21=0.2w_{12}=w_{21}=0.2). (b) Red dashed line and blue solid line are respectively the mean distances from trials with factorized state cost and joint state cost. Height of strip represents one standard deviation.

We then consider a cluttered flight environment as in the discrete-time example. Suppose three UAVs form the network in Figure 2 with initial states x10=(45,5,0.35,π)x_{1}^{0}=(45,5,0.35,\pi)^{\top}, x20=(5,5,0.65,0)x_{2}^{0}=(5,5,0.65,0)^{\top}, x30=(45,5,0.5,π)x_{3}^{0}=(45,5,0.5,\pi)^{\top} and exit states x1tf=x2tf=(45,45,0,π/2)x_{1}^{t_{f}}=x_{2}^{t_{f}}=(45,45,0,\pi/2)^{\top}, x3tf=(5,45,0,π)x_{3}^{t_{f}}=(5,45,0,\pi)^{\top}. The exit time is tf=30t_{f}=30, and the length of each control cycle is 0.20.2. When sampling trajectory roll-outs 𝒴i\mathcal{Y}_{i}, the time interval from tt to tft_{f} is partitioned into K=18K=18 intervals of equal length. The size of 𝒴i\mathcal{Y}_{i} is 400400 trajectory roll-outs when adopting the random sampling estimator, and the other parameters are the same as in the preceding continuous-time example. Subject to the sampling-based distributed LSOC algorithm, UAV trajectories and the relative distance between UAVs 1 and 2 subject to both joint and independent state-related costs are presented in Figure 9. Letting the update rate constraint be δi=50\delta_{i}=50 and the size of the trajectory set be |𝒴i|=400|\mathcal{Y}_{i}|=400 and 150150 for the initial and subsequent policy iterations in every control period, experimental results obtained from the REPS-based distributed LSOC algorithm are given in Figure 10. Figure 9 and Figure 10 show that our continuous-time distributed LSOC algorithms can guide UAVs to their terminal states, avoid obstacles, and shorten the relative distance between UAVs 1 and 2. It is also worth noting that since there exists more than one shortest path for UAV 2 under the factorized state cost (see Section 4.1), the standard deviations of the distance in Figure 9 and Figure 10 are significantly larger than in the other cases. Lastly, we compare the sample efficiency of the sampling-based and REPS-based distributed LSOC algorithms in the preceding two continuous-time examples. Figure 11 shows the value of the immediate cost function c2(x¯2,u¯2)c_{2}(\bar{x}_{2},\bar{u}_{2}) from subsystem 𝒩¯2\bar{\mathcal{N}}_{2} versus the number of trajectory roll-outs, and the maximum numbers of trajectory roll-outs on the horizontal axes are determined by the REPS-based trials with the minimum number of sample roll-outs. Figure 11 shows that the REPS-based distributed LSOC algorithm is more sample-efficient than the sampling-based algorithm.

Refer to caption
Figure 9: UAV trajectories and relative distance between UAV 1 and 2 from 100 trials based on the random sampling estimator. (a) Trajectories of UAVs. Red, blue and green lines are trajectories for UAV 1, 2 and 3, respectively. Dashed lines are from a trial with factorized (or independent) state costs (w11=w22=w33=1w_{11}=w_{22}=w_{33}=1, w12=w21=0w_{12}=w_{21}=0), and solid (transparent) lines are from trials with joint state costs (w11=w22=w33=1w_{11}=w_{22}=w_{33}=1, w12=1.5,w21=0.5w_{12}=1.5,w_{21}=0.5). (b) Distance between UAV 1 and 2. Red dashed line and blue solid line are respectively the mean distances from trials with independent state cost and joint state cost. Height of strip represents one standard deviation.
Refer to caption
Figure 10: UAV trajectories and relative distance between UAV 1 and 2 from 100 trials based on REPS. (a) Trajectories of UAVs. Red, blue and green lines are trajectories for UAV 1, 2 and 3, respectively. Dashed lines are from a trial with factorized state costs (w11=w22=w33=0.18w_{11}=w_{22}=w_{33}=0.18, w12=w21=0w_{12}=w_{21}=0), and solid (transparent) lines are from trials with joint state costs (w11=w22=w33=0.18w_{11}=w_{22}=w_{33}=0.18, w12=0.27,w21=0.1w_{12}=0.27,w_{21}=0.1). (b) Distance between UAV 1 and 2. Red dashed line and blue solid line are respectively the mean distances from trials with independent state cost and joint state cost. Height of strip is one standard deviation.
Refer to caption
Figure 11: Sample-efficiency of continuous-time algorithms from 100 trials. (a) Immediate cost c2(x¯2,u¯2)c_{2}(\bar{x}_{2},\bar{u}_{2}) in the simple scenario without obstacles. Red dashed line is the mean immediate cost subject to the random sampling approach. Blue solid line is the mean immediate cost subject to the REPS algorithm. The height of the strip is one standard deviation. (b) Immediate cost c2(x¯2,u¯2)c_{2}(\bar{x}_{2},\bar{u}_{2}) in the complex scenario with obstacles. The interpretation is the same as in (a).

To verify the effectiveness of distributed LSOC algorithms in larger MAS with more agents, we consider a line-shape network consisting of nine UAVs as shown in Figure 12. These nine UAVs i={1,2,,9}i=\{1,2,\cdots,9\} are initially distributed at xi0=(10,10010i,0.5,0)x_{i}^{0}=(10,100-10i,0.5,0) as shown in Figure 13, and they can be divided into three groups based on their exit states. UAV 1 to 6 share a same exit state AA at x1:6tf=(90,65,0,0)x_{1:6}^{t_{f}}=(90,65,0,0); the exit state BB of UAV 7 and 8 is at x7:8tf=(90,25,0,0)x_{7:8}^{t_{f}}=(90,25,0,0), and UAV 9 is expected to exit at state CC, x9tf=(90,10,0,0)x_{9}^{t_{f}}=(90,10,0,0), where the exit time tf=40 sect_{f}=40\textrm{ sec}. As exhibited in Figure 12, UAVs from different groups are either loosely coupled through their terminal cost functions or mutually independent, where the latter scenario does not require any communication between agents. A state-related cost function qi(x¯i)q_{i}(\bar{x}_{i}) in subsystem 𝒩¯i\bar{\mathcal{N}}_{i} is designed to consider and optimize the distances between neighboring agents and towards the exit state of agent ii:

qi(x¯i)=wii((xi,yi)(xitf,yitf)2dimax)\displaystyle q_{i}(\bar{x}_{i})=w_{ii}\cdot(\|(x_{i},y_{i})-(x_{i}^{t_{f}},y_{i}^{t_{f}})\|_{2}-d_{i}^{\max}) +wi,i1((xi,yi)(xi1,yi1)di,i1max)\displaystyle+w_{i,i-1}\cdot(\|(x_{i},y_{i})-(x_{i-1},y_{i-1})\|-d^{\max}_{i,i-1})
+wi,i+1((xi,yi)(xi+1,yi+1)di,i+1max),\displaystyle+w_{i,i+1}\cdot(\|(x_{i},y_{i})-(x_{i+1},y_{i+1})\|-d^{\max}_{i,i+1}),

where wi,jw_{i,j} is the weight related to the distance between agent ii and jj; wi,j=0w_{i,j}=0 when j=0j=0 or 1010; di,jmaxd_{i,j}^{\max} is the regularization term for numerical stability, which is assigned by the initial distance between agents ii and jj in this demonstration, and the remaining notations and parameters are the same as the assignments in (49) and the first example in this subsection if not explicitly stated. Trajectories of UAV team subject to two distributed LSOC algorithms, Algorithm 2 and Algorithm 3 in Appendix E, are presented in Figure 13. For some network structures, such as line, loop, star and complete binary tree, in which the scale of every factorial subsystem is tractable, increasing the total number of agents in network will not dramatically boost the computational complexity on local agents thanks to the distributed LSOC framework proposed in this paper. Verification along with more simulation examples on the generalization of distributed LSOC controllers discussed in Section 3.3 is supplemented in [53].

Refer to caption
Figure 12: Communication network of a UAV team with nine agents. UAV 1 to 6, as well as UAV 7 and 8 are strongly coupled (represented by solid lines) through their immediate cost functions. UAV 6 and 7, as well as UAV 8 and 9 are loosely coupled (represented by dashed lines) through their terminal cost functions.
Refer to caption
Figure 13: UAV trajectories subject to two distributed LSOC algorithms. (a) Trajectories of UAV team controlled by the sampling-based distributed LSOC algorithm with w1,2=w2,3=w3,4=w6,5=w7,8=w8,7=w99=1w_{1,2}=w_{2,3}=w_{3,4}=w_{6,5}=w_{7,8}=w_{8,7}=w_{99}=1, wii=w4,3=w4,5=w5,4=w5,6=0.5w_{ii}=w_{4,3}=w_{4,5}=w_{5,4}=w_{5,6}=0.5, and w2,1=w3,2=w6,7=w7,6=w9,8=0w_{2,1}=w_{3,2}=w_{6,7}=w_{7,6}=w_{9,8}=0. (b) Trajectories of UAV team controlled by the REPS-based distributed LSOC algorithm with w1,2=w2,3=w3,4=w6,5=w7,8=w8,7=w99=0.2w_{1,2}=w_{2,3}=w_{3,4}=w_{6,5}=w_{7,8}=w_{8,7}=w_{99}=0.2, wii=w4,3=w4,5=w5,4=w5,6=0.1w_{ii}=w_{4,3}=w_{4,5}=w_{5,4}=w_{5,6}=0.1, and w2,1=w3,2=w6,7=w7,6=w9,8=0w_{2,1}=w_{3,2}=w_{6,7}=w_{7,6}=w_{9,8}=0.

Many interesting problems remain unsolved in the area of distributed (linearly solvable) stochastic optimal control and deserve further investigation. Most existing papers, including this paper, assume that the passive and controlled dynamics of different agents are mutually independent. However, when practical constraints such as collision avoidance between UAVs are considered, the passive and controlled dynamics of different agents are usually no longer mutually independent. Meanwhile, while this paper considered the scenario in which all states of local agents are fully observable, it would be interesting to study MASs of partially observable agents with hidden states. Lastly, distributed LSOC algorithms subject to random networks, communication delays, and infinite-horizon or discounted costs are also worth attention.

Conclusion

Discrete-time and continuous-time distributed LSOC algorithms for networked MASs have been investigated in this paper. A distributed control framework based on factorial subsystems has been proposed, which allows the joint state or cost function between neighboring agents to be optimized with local observations and tractable computational complexity. Under this distributed framework, the discrete-time multi-agent LSMDP problem was addressed by solving a local system of linear equations in each subsystem, and a parallel programming scheme was proposed to decentralize and expedite the computation. The optimal control action/policy for the continuous-time multi-agent LSOC problem was formulated as a path integral, which was approximated by a distributed sampling method and a distributed REPS method, respectively. Numerical examples of coordinated UAV teams were presented to verify the effectiveness and advantages of these algorithms, and some open problems were given at the end of the paper.

Acknowledgments

This work was supported in part by NSF-NRI, AFOSR, and the ZJU-UIUC Institute Research Program. The authors would like to thank Dr. Hunmin Lee and Gabriel Haberfeld for their constructive comments. In this arXiv version, the authors would also like to thank the readers and staff on arXiv.org.

Conflict of Interest Statement

The authors have agreed to publish this article and declare that there is no conflict of interest regarding its publication.

Appendix A: Proof for Theorem 1

Proof for Theorem 1: Substituting the joint running cost function (7) into the joint Bellman equation (9), and by the definitions of KL-divergence and exponential transformation (17), we have

Vi(x¯i)=minu¯i{qi(x¯i)+KL(u¯i(|x¯i)p¯i(|x¯i))+𝔼x¯iu¯i(|x¯i)[Vi(x¯i)]}=minu¯i{qi(x¯i)+𝔼x¯iu¯i(|x¯i)[logu¯i(x¯i|x¯i)p¯i(x¯i|x¯i)]+𝔼x¯u¯i(|x¯i)[log1Zi(x¯i)]}=minu¯i{qi(x¯i)+𝔼x¯iu¯i(|x¯i)[logu¯i(x¯i|x¯i)p¯i(x¯i|x¯i)Zi(x¯i)]}.\begin{split}V_{i}(\bar{x}_{i})&=\min_{\bar{u}_{i}}\left\{q_{i}(\bar{x}_{i})+\textrm{KL}(\bar{u}_{i}(\cdot|\bar{x}_{i})\ \|\ \bar{p}_{i}(\cdot|\bar{x}_{i}))+\mathbb{E}_{\bar{x}^{\prime}_{i}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})}[V_{i}(\bar{x}^{\prime}_{i})]\right\}\\ &=\min_{\bar{u}_{i}}\left\{q_{i}(\bar{x}_{i})+\mathbb{E}_{\bar{x}^{\prime}_{i}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})}\left[\log\dfrac{\bar{u}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})}{\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})}\right]+\mathbb{E}_{\bar{x}^{\prime}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})}\left[\log\frac{1}{Z_{i}(\bar{x}^{\prime}_{i})}\right]\right\}\\ &=\min_{\bar{u}_{i}}\left\{q_{i}(\bar{x}_{i})+\mathbb{E}_{\bar{x}^{\prime}_{i}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})}\left[\log\dfrac{\bar{u}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})}{\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i})}\right]\right\}.\end{split} (50)

The optimal policy becomes straightforward if we can rewrite the expectation on the RHS of (50) as a KL-divergence and exploit the minimum condition of the KL-divergence. While the control mapping u¯i(x¯i|x¯i)\bar{u}_{i}(\bar{x}_{i}^{\prime}|\bar{x}_{i}) in (50) is a probability distribution, the denominator \bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i}) is not necessarily a probability distribution. Hence, we define the following normalized term

𝒲i(x¯i)=x¯ip¯i(x¯i|x¯i)Zi(x¯i).\mathcal{W}_{i}(\bar{x}_{i})=\sum_{\bar{x}^{\prime}_{i}}\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i}). (51)

Since \bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i})/\mathcal{W}_{i}(\bar{x}_{i}) is a well-defined probability distribution, we can rewrite the joint Bellman equation (50) as follows

Vi(x¯i)=minu¯i{qi(x¯i)+𝔼x¯iu¯i(|x¯i)[logu¯i(x¯i|x¯i)p¯i(x¯i|x¯i)Zi(x¯i)/𝒲i(x¯i)log𝒲i(x¯i)]}=minu¯i{qi(x¯i)log𝒲i(x¯i)+KL(u¯i(|x¯i)p¯i(|x¯i)Zi()𝒲i(x¯i))},\begin{split}V_{i}(\bar{x}_{i})&=\min_{\bar{u}_{i}}\left\{q_{i}(\bar{x}_{i})+\mathbb{E}_{\bar{x}^{\prime}_{i}\sim\bar{u}_{i}(\cdot|\bar{x}_{i})}\left[\log\dfrac{\bar{u}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})}{\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i})/\mathcal{W}_{i}(\bar{x}_{i})}-\log\mathcal{W}_{i}(\bar{x}_{i})\right]\right\}\\ &=\min_{\bar{u}_{i}}\left\{q_{i}(\bar{x}_{i})-\log\mathcal{W}_{i}(\bar{x}_{i})+\mathrm{KL}\left(\bar{u}_{i}(\cdot|\bar{x}_{i})\ \Bigg{\|}\ \frac{\bar{p}_{i}(\cdot|\bar{x}_{i})Z_{i}(\cdot)}{\mathcal{W}_{i}(\bar{x}_{i})}\right)\right\},\end{split} (52)

where only the last term depends on the joint control action u¯i(|x¯i)\bar{u}_{i}(\cdot|\bar{x}_{i}). According to the minimum condition of KL-divergence, the last term in (52) attains its absolute minimum at 0 if and only if

u¯i(|x¯i)=p¯i(|x¯i)Zi()𝒲i(x¯i),\bar{u}_{i}^{*}(\cdot|\bar{x}_{i})=\frac{\bar{p}_{i}(\cdot|\bar{x}_{i})Z_{i}(\cdot)}{\mathcal{W}_{i}(\bar{x}_{i})},

which gives the optimal control action (19) in Theorem 1. By substituting (19) into (52), we can minimize the RHS of joint Bellman equation (9) and remove the minimum operator:

Vi(x¯i)=qi(x¯i)log𝒲i(x¯i).V_{i}(\bar{x}_{i})=q_{i}(\bar{x}_{i})-\log\mathcal{W}_{i}(\bar{x}_{i}). (53)

Exponentiating both sides of (53) and substituting (17) and (51) into the result, the Bellman equation (53) can be rewritten as a linear equation with respect to the desirability function

Zi(x¯i)=exp[Vi(x¯i)]=exp[qi(x¯i)]𝒲i(x¯i)=exp[qi(x¯i)]x¯ip¯i(x¯i|x¯i)Zi(x¯i),\begin{split}Z_{i}(\bar{x}_{i})=\exp[-V_{i}(\bar{x}_{i})]=\exp[-q_{i}(\bar{x}_{i})]\cdot\mathcal{W}_{i}(\bar{x}_{i})=\exp[-q_{i}(\bar{x}_{i})]\cdot\sum_{\bar{x}^{\prime}_{i}}\bar{p}_{i}(\bar{x}^{\prime}_{i}|\bar{x}_{i})Z_{i}(\bar{x}^{\prime}_{i}),\end{split} (54)

which implies (18) in Theorem 1. This completes the proof. ∎

Appendix B: Proof for Theorem 2

Before we present the proof for Theorem 2, the Feynman–Kac formula, which builds an important relationship between parabolic PDEs and stochastic processes, is introduced as Lemma 1.

Lemma 1.

[Feynman–Kac formula] Consider the Kolmogorov backward equation (KBE) described as follows

tZ(x,t)=q(x,t)λZ(x,t)f(x,t)xZ(x,t)12tr(B(x)σσB(x)xxZ(x,t)),\partial_{t}Z(x,t)=\frac{q(x,t)}{\lambda}\cdot Z(x,t)-f(x,t)^{\top}\cdot\nabla_{x}Z(x,t)-\frac{1}{2}\textrm{tr}\left(B(x)\sigma\sigma^{\top}B(x)^{\top}\cdot\nabla_{xx}Z(x,t)\right),

where the terminal condition is given by Z(x,tf)=exp[ϕ(x(tf))/λ]Z(x,t_{f})=\exp[-\phi(x({t_{f}}))/\lambda]. Then the solution to this KBE can be written as a conditional expectation

Z(x,t)=\mathbb{E}_{x,t}\left[\exp\left(-\frac{1}{\lambda}\phi(y(t_{f}))-\frac{1}{\lambda}\int_{t}^{t_{f}}q(y,\tau)\ d\tau\right)\ \Big{|}\ y(t)=x\right],

under the probability measure that yy is an Itô diffusion process driven by the equation dy(τ)=f(y,τ)dτ+B(y)σdw(τ)dy(\tau)=f(y,\tau)d\tau+B(y)\sigma\cdot dw(\tau) with initial condition y(t)=xy(t)=x. ∎
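As a brief illustration of Lemma 1, the following sketch estimates the conditional expectation by Monte Carlo: it simulates Euler–Maruyama sample paths of a scalar uncontrolled diffusion and averages the exponentiated path costs. The drift, cost rate, terminal cost, and constants are hypothetical choices, not taken from the paper.

```python
import numpy as np

# Monte Carlo illustration of the Feynman-Kac representation in Lemma 1 for a
# scalar diffusion.  The drift f, cost q, terminal cost phi, and all constants
# are hypothetical choices for illustration only.
rng = np.random.default_rng(1)
lam, sigma = 1.0, 0.5
t0, tf, dt = 0.0, 1.0, 1e-3
n_paths = 20000

f = lambda y, t: -y                 # uncontrolled drift
q = lambda y, t: 0.5 * y**2         # state cost rate
phi = lambda y: y**2                # terminal cost

x0 = 1.0
steps = int((tf - t0) / dt)
y = np.full(n_paths, x0)
running_cost = np.zeros(n_paths)

# Euler-Maruyama simulation of dy = f dt + sigma dw, accumulating the path cost
# integral that appears inside the exponential of the Feynman-Kac formula.
for k in range(steps):
    t = t0 + k * dt
    running_cost += q(y, t) * dt
    y += f(y, t) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

Z_hat = np.mean(np.exp(-(phi(y) + running_cost) / lam))
V_hat = -lam * np.log(Z_hat)
print(f"desirability Z(x0, t0) ~ {Z_hat:.4f},  value V(x0, t0) ~ {V_hat:.4f}")
```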

We now show the proof for Theorem 2.

Proof for Theorem 2: First, we show that the joint optimality equation (16) can be reformulated as the joint stochastic HJB equation (24), which gives an analytic expression for the optimal control action (25). Substituting the immediate cost function (15) into the optimality equation (16) and letting s be a time instant between t and t_{f}, the optimality equation (16) can be rewritten as

Vi(x¯i,t)=minu¯i𝔼x¯i,tu¯i[ϕi(x¯i,tf)+ttfqi(x¯i,τ)+12u¯i(x¯i,τ)R¯iu¯i(x¯i,τ)dτ]=minu¯i𝔼x¯i,tu¯i[Vi(x¯i,s)+tsqi(x¯i,τ)+12u¯i(x¯i,τ)R¯iu¯i(x¯i,τ)dτ].\begin{split}V_{i}(\bar{x}_{i},t)&=\min_{\bar{u}_{i}}\mathbb{E}^{\bar{u}_{i}}_{\bar{x}_{i},t}\left[\phi_{i}(\bar{x}_{i},t_{f})+\int_{t}^{t_{f}}q_{i}(\bar{x}_{i},\tau)+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},\tau)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},\tau)\ d\tau\right]\\ &=\min_{\bar{u}_{i}}\mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}\left[V_{i}(\bar{x}_{i},s)+\int_{t}^{s}q_{i}(\bar{x}_{i},\tau)+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},\tau)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},\tau)\ d\tau\right].\end{split} (55)

With some rearrangements and dividing both sides of (55) by st>0s-t>0, we have

0=minu¯i𝔼x¯i,tu¯i[Vi(x¯i,s)Vi(x¯i,t)st+1sttsqi(x¯i,τ)+12u¯i(x¯i,τ)R¯iu¯i(x¯i,τ)dτ].0=\min_{\bar{u}_{i}}\mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}\left[\frac{V_{i}(\bar{x}_{i},s)-V_{i}(\bar{x}_{i},t)}{s-t}+\frac{1}{s-t}\int_{t}^{s}q_{i}(\bar{x}_{i},\tau)+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},\tau)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},\tau)\ d\tau\right]. (56)

By letting sts\rightarrow t, the optimality equation (56) becomes

0=minu¯i𝔼x¯i,tu¯i[dVi(x¯i,t)dt+qi(x¯i,t)+12u¯i(x¯i,t)R¯iu¯i(x¯i,t)].0=\min_{\bar{u}_{i}}\mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}\left[\frac{dV_{i}(\bar{x}_{i},t)}{dt}+q_{i}(\bar{x}_{i},t)+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},t)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},t)\right]. (57)

Applying Itô’s formula [54], the differential dVi(x¯i,t)dV_{i}(\bar{x}_{i},t) in (57) can be expanded as

dVi(x¯i,t)=j𝒩¯im=1MVi(x¯i,t)xj(m)dxj(m)+Vi(x¯i,t)tdt+12j,k𝒩¯im,n=1M2Vi(x¯i,t)xj(m)xk(n)dxj(m)dxk(n).dV_{i}(\bar{x}_{i},t)=\sum_{j\in\bar{\mathcal{N}}_{i}}\sum_{m=1}^{M}\frac{\partial V_{i}(\bar{x}_{i},t)}{\partial x_{j(m)}}dx_{j(m)}+\frac{\partial V_{i}(\bar{x}_{i},t)}{\partial t}dt+\frac{1}{2}\sum_{j,k\in\bar{\mathcal{N}}_{i}}\sum_{m,n=1}^{M}\frac{\partial^{2}V_{i}(\bar{x}_{i},t)}{\partial x_{j(m)}\partial x_{k(n)}}dx_{j(m)}dx_{k(n)}. (58)

For conciseness, we will omit the indices (or subscripts) of state components, (m) and (n), in the following derivations. Dividing both sides of (58) by dt, taking the expectation over all trajectories initialized at (\bar{x}_{i}^{t},t) and subject to control action \bar{u}_{i}, and substituting the joint dynamics (14) into the result, we have

𝔼x¯i,tu¯i[dVi(x¯i,t)dt]=j𝒩¯i[fj(xj,t)+Bj(xj)uj(x¯i,t)]xjVi(x¯i,t)+Vi(x¯i,t)t+12j𝒩¯itr(Bj(xj)σjσjBj(xj)xjxjVi(x¯i,t)),\begin{split}\mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}\left[\frac{dV_{i}(\bar{x}_{i},t)}{dt}\right]=\sum_{j\in\bar{\mathcal{N}}_{i}}[f_{j}(x_{j},t)&+B_{j}(x_{j})u_{j}(\bar{x}_{i},t)]^{\top}\cdot\nabla_{x_{j}}V_{i}(\bar{x}_{i},t)+\frac{\partial V_{i}(\bar{x}_{i},t)}{\partial t}\\ &+\frac{1}{2}\sum_{j\in\bar{\mathcal{N}}_{i}}\textrm{tr}\left(B_{j}(x_{j})\sigma_{j}\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}x_{j}}V_{i}(\bar{x}_{i},t)\right),\end{split} (59)

where the identity \mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}[dx_{j(m)}dx_{k(n)}]=(\sigma_{j}\sigma^{\top}_{j})_{mm}\delta_{jk}\delta_{mn}dt, derived from the property of standard Brownian motion \mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}[dw_{j(m)}dw_{k(n)}]=\delta_{jk}\delta_{mn}dt, is invoked, and the operators \nabla_{x_{j}} and \nabla_{x_{j}x_{j}} follow the same definitions as in (13). Substituting (59) into (57), the joint stochastic HJB equation (24) in Theorem 2 is obtained

tVi(x¯i,t)=minu¯i𝔼x¯i,tu¯i[j𝒩¯i[fj(xj,t)+Bj(xj)uj(x¯i,t)]xjVi(x¯i,t)+qi(x¯i,t)+12u¯i(x¯i,t)R¯iu¯i(x¯i,t)+12j𝒩¯itr(Bj(xj)σjσjBj(xj)xjxjVi(x¯i,t))],\begin{split}-\partial_{t}V_{i}(\bar{x}_{i},t)=&\min_{\bar{u}_{i}}\mathbb{E}_{\bar{x}_{i},t}^{\bar{u}_{i}}\bigg{[}\sum_{j\in\bar{\mathcal{N}}_{i}}[f_{j}(x_{j},t)+B_{j}(x_{j})u_{j}(\bar{x}_{i},t)]^{\top}\cdot\nabla_{x_{j}}V_{i}(\bar{x}_{i},t)+q_{i}(\bar{x}_{i},t)\\ &+\frac{1}{2}\bar{u}_{i}(\bar{x}_{i},t)^{\top}\bar{R}_{i}\bar{u}_{i}(\bar{x}_{i},t)+\frac{1}{2}\sum_{j\in\bar{\mathcal{N}}_{i}}\textrm{tr}\left(B_{j}(x_{j})\sigma_{j}\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}x_{j}}V_{i}(\bar{x}_{i},t)\right)\bigg{]},\end{split}

where the boundary condition is given by V_{i}(\bar{x}_{i},t_{f})=\phi_{i}(\bar{x}_{i}). The joint optimal control action \bar{u}_{i}^{*}(\bar{x}_{i},t) can be obtained by setting the derivative of (24) with respect to \bar{u}_{i}(\bar{x}_{i},t) equal to zero. When the control weights R_{j} of the agents j\in\bar{\mathcal{N}}_{i} are coupled, i.e. the joint control weight matrix \bar{R}_{i} cannot be formulated as a block diagonal matrix, the joint optimal control action for subsystem \bar{\mathcal{N}}_{i} is given by (25) in Theorem 2, \bar{u}^{*}_{i}(\bar{x}_{i},t)=-\bar{R}_{i}^{-1}\bar{B}_{i}(\bar{x}_{i})^{\top}\cdot\nabla_{\bar{x}_{i}}V_{i}(\bar{x}_{i},t), where \nabla_{\bar{x}_{i}} denotes the gradient with respect to the joint state \bar{x}_{i}. However, it is more common in practice that the joint control weight matrix is given by \bar{R}_{i}=\textrm{diag}\{R_{i},R_{j\in\mathcal{N}_{i}}\} as in (15), and the joint control cost satisfies \frac{1}{2}\bar{u}_{i}^{\top}\bar{R}_{i}\bar{u}_{i}=\sum_{j\in\bar{\mathcal{N}}_{i}}\frac{1}{2}u_{j}^{\top}R_{j}u_{j}. Setting the derivatives of (24) with respect to u_{j}(\bar{x}_{i},t) equal to zero in the latter case gives the local optimal control action of agent j\in\bar{\mathcal{N}}_{i}

uj(x¯i,t)=Rj1Bj(xj)xjVi(x¯i,t).u^{*}_{j}(\bar{x}_{i},t)=-R_{j}^{-1}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}}V_{i}(\bar{x}_{i},t). (60)

For conciseness of the derivations and considering the formulation of (24), we will mainly focus on the latter scenario, \bar{R}_{i}=\textrm{diag}\{R_{i},R_{j\in\mathcal{N}_{i}}\}, in the remainder of this proof.
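For illustration, the sketch below evaluates the local optimal control law (60) for a hypothetical quadratic value function of agent j's own state, approximating the gradient by central finite differences; the matrices Q, B_j, and R_j are arbitrary placeholders rather than quantities from the paper.

```python
import numpy as np

# Sketch of the local optimal control law (60), u_j* = -R_j^{-1} B_j^T grad_{x_j} V_i,
# using a hypothetical quadratic value function and numerical differentiation.
n_x, n_u = 4, 2                                  # dimensions of agent j's state and control
rng = np.random.default_rng(2)

A = rng.random((n_x, n_x))
Q = A @ A.T                                      # assumed value function V(x) = 0.5 x^T Q x
B_j = rng.random((n_x, n_u))                     # control matrix B_j(x_j), taken constant here
R_j = np.diag([1.0, 2.0])                        # positive-definite control weight

V = lambda x: 0.5 * x @ Q @ x

def grad_fd(fun, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (fun(x + e) - fun(x - e)) / (2 * eps)
    return g

x_j = rng.standard_normal(n_x)
u_star = -np.linalg.solve(R_j, B_j.T @ grad_fd(V, x_j))   # equation (60)
print("local optimal control u_j*:", u_star)
print("matches analytic gradient:",
      np.allclose(u_star, -np.linalg.solve(R_j, B_j.T @ (Q @ x_j)), atol=1e-5))
```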

In order to solve the stochastic HJB equation (24) and evaluate the optimal control action (25), we linearize (24) with the Cole–Hopf transformation (23), V_{i}(\bar{x}_{i},t)=-\lambda_{i}\log Z_{i}(\bar{x}_{i},t). Subject to this transformation, the derivative and gradients in (24) satisfy

tVi(x¯i,t)=λitZi(x¯i,t)Zi(x¯i,t),\partial_{t}V_{i}(\bar{x}_{i},t)=-\lambda_{i}\cdot\frac{\partial_{t}Z_{i}(\bar{x}_{i},t)}{Z_{i}(\bar{x}_{i},t)}, (61)
xjVi(x¯i,t)=λixjZi(x¯i,t)Zi(x¯i,t),\nabla_{x_{j}}V_{i}(\bar{x}_{i},t)=-\lambda_{i}\cdot\frac{\nabla_{x_{j}}Z_{i}(\bar{x}_{i},t)}{Z_{i}(\bar{x}_{i},t)}, (62)
xjxjVi(x¯i,t)=λi[xjxjZi(x¯i,t)Zi(x¯i,t)xjZi(x¯i,t)xjZi(x¯i,t)Zi(x¯i,t)2].\nabla_{x_{j}x_{j}}V_{i}(\bar{x}_{i},t)=-\lambda_{i}\cdot\left[\frac{\nabla_{x_{j}x_{j}}Z_{i}(\bar{x}_{i},t)}{Z_{i}(\bar{x}_{i},t)}-\frac{\nabla_{x_{j}}Z_{i}(\bar{x}_{i},t)\cdot\nabla_{x_{j}}Z_{i}(\bar{x}_{i},t)^{\top}}{Z_{i}(\bar{x}_{i},t)^{2}}\right]. (63)

Equalities (62) and (63) will still hold when we replace the agent’s state xjx_{j} in xj\nabla_{x_{j}} and xjxj\nabla_{x_{j}x_{j}} by the joint state x¯i\bar{x}_{i} and reformulate the joint HJB (24) in a more compact form. Substituting local optimal control action (60), and gradients (62) and (63) into the stochastic HJB equation (24), the corresponding terms of each agent j𝒩¯ij\in\mathcal{\bar{N}}_{i} in (24) satisfy

[Bj(xj)uj(x¯i,t)]xj\displaystyle[B_{j}(x_{j})u_{j}(\bar{x}_{i},t)]^{\top}\nabla_{x_{j}} Vi(x¯i,t)+12uj(x¯i,t)Rjuj(x¯i,t)\displaystyle V_{i}(\bar{x}_{i},t)+\frac{1}{2}{u}_{j}(\bar{x}_{i},t)^{\top}{R}_{j}u_{j}(\bar{x}_{i},t)
=\displaystyle= 12xjVi(x¯i,t)Bj(xj)Rj1Bj(xj)xjVi(x¯i,t)\displaystyle-\frac{1}{2}\nabla_{x_{j}}V_{i}(\bar{x}_{i},t)^{\top}\cdot B_{j}(x_{j})R_{j}^{-1}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}}V_{i}(\bar{x}_{i},t) (64)
=\displaystyle= λi22Zi(x¯i,t)2xjZi(x¯i,t)Bj(xj)Rj1Bj(xj)xjZi(x¯i,t),\displaystyle\ \frac{-\lambda_{i}^{2}}{2\cdot Z_{i}(\bar{x}_{i},t)^{2}}\cdot\nabla_{x_{j}}Z_{i}(\bar{x}_{i},t)^{\top}\cdot B_{j}(x_{j})R_{j}^{-1}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}}Z_{i}(\bar{x}_{i},t),
12tr(Bj(xj)σj\displaystyle\frac{1}{2}\textrm{tr}\big{(}B_{j}(x_{j})\sigma_{j} σjBj(xj)xjxjVi(x¯i,t))\displaystyle\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}x_{j}}V_{i}(\bar{x}_{i},t)\big{)}
=\displaystyle= λi2Zi(x¯i,t)tr(Bj(xj)σjσjBj(xj)xjxjZi(x¯i,t))\displaystyle\ \frac{-\lambda_{i}}{2\cdot Z_{i}(\bar{x}_{i},t)}\cdot\textrm{tr}\left(B_{j}(x_{j})\sigma_{j}\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}x_{j}}Z_{i}(\bar{x}_{i},t)\right) (65)
+λi2Zi(x¯i,t)2tr(Bj(xj)σjσjBj(xj)xjZi(x¯i,t)xjZi(x¯i,t)).\displaystyle+\frac{\lambda_{i}}{2\cdot Z_{i}(\bar{x}_{i},t)^{2}}\cdot\textrm{tr}\left(B_{j}(x_{j})\sigma_{j}\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\cdot\nabla_{x_{j}}Z_{i}(\bar{x}_{i},t)\cdot\nabla_{x_{j}}Z_{i}(\bar{x}_{i},t)^{\top}\right).

By the properties of the trace operator, the quadratic terms in (64) and (65) cancel each other if

σjσj=λiRj1,\sigma_{j}\sigma^{\top}_{j}=\lambda_{i}R_{j}^{-1}, (66)

i.e. R_{j}=(\sigma_{j}\sigma^{\top}_{j}/\lambda_{i})^{-1}, which is equivalent to the condition \bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}=\lambda_{i}\bar{R}^{-1}_{i} or \bar{R}_{i}=(\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}/\lambda_{i})^{-1} subject to the joint dynamics (14). Substituting (25), (61), (64) and (65) into the stochastic HJB equation (24), we then remove the minimization operator and obtain the linearized PDE as (26) in Theorem 2

tZi(x¯i,t)=[qi(x¯i,t)λij𝒩¯ifj(xj,t)xj12j𝒩¯itr(Bj(xj)σjσjBj(xj)xjxj)]Zi(x¯i,t),\partial_{t}Z_{i}(\bar{x}_{i},t)=\left[\frac{q_{i}(\bar{x}_{i},t)}{\lambda_{i}}-\sum_{j\in\bar{\mathcal{N}}_{i}}f_{j}(x_{j},t)\nabla_{x_{j}}-\frac{1}{2}\sum_{j\in\bar{\mathcal{N}}_{i}}\textrm{tr}\left(B_{j}(x_{j})\sigma_{j}\sigma_{j}^{\top}B_{j}(x_{j})^{\top}\nabla_{x_{j}x_{j}}\right)\right]Z_{i}(\bar{x}_{i},t),

where the boundary condition is given by Z_{i}(\bar{x}_{i},t_{f})=\exp[-\phi_{i}(\bar{x}_{i})/\lambda_{i}]. Once the desirability function Z_{i}(\bar{x}_{i},t) in (26) is solved, we can readily recover the value function V_{i}(\bar{x}_{i},t) and the joint optimal control action from (23) and (25), respectively. Invoking the Feynman–Kac formula introduced in Lemma 1, a solution to (26) can be formulated as (27) in Theorem 2

Zi(x¯i,t)=𝔼x¯i,t[exp(1λiϕi(y¯itf)1λittfqi(y¯i,τ)𝑑τ)],Z_{i}(\bar{x}_{i},t)=\mathbb{E}_{\bar{x}_{i},t}\left[\exp\left(-\frac{1}{\lambda_{i}}\phi_{i}(\bar{y}^{t_{f}}_{i})-\frac{1}{\lambda_{i}}\int_{t}^{t_{f}}q_{i}(\bar{y}_{i},\tau)\ d\tau\right)\right],

where \bar{y}_{i}(\tau) satisfies the uncontrolled dynamics d\bar{y}_{i}(\tau)=\bar{f}_{i}(\bar{y}_{i},\tau)d\tau+\bar{B}_{i}(\bar{y}_{i})\bar{\sigma}_{i}\cdot d\bar{w}_{i}(\tau) with initial condition \bar{y}_{i}(t)=\bar{x}_{i}(t). This completes the proof. ∎

Appendix C: Proof of Proposition 3

Proof of Proposition 3: First, we formulate the desirability function (27) as a path integral shown in (30). Partitioning the time interval from tt to tft_{f} into KK intervals of equal length ε>0\varepsilon>0, t=t0<t1<<tK=tft=t_{0}<t_{1}<\cdots<t_{K}=t_{f}, we can rewrite (27) as the following path integral

Zi(x¯i,t)=𝔼x¯i,t[exp(1λiϕi(y¯itf)1λittfqi(y¯i,τ)𝑑τ)]=𝑑x¯i(1)exp(1λiϕi(x¯i(K)))k=0K1Zi(x¯i(k+1),tk+1;x¯i(k),tk)dx¯i(K),\begin{split}Z_{i}(\bar{x}_{i},t)=\ &\mathbb{E}_{\bar{x}_{i},t}\left[\exp\left(-\frac{1}{\lambda_{i}}\phi_{i}(\bar{y}^{t_{f}}_{i})-\frac{1}{\lambda_{i}}\int_{t}^{t_{f}}q_{i}(\bar{y}_{i},\tau)\ d\tau\right)\right]\\ =\ &\int d\bar{x}^{(1)}_{i}\cdots\int\exp\left(-\frac{1}{\lambda_{i}}\phi_{i}(\bar{x}^{(K)}_{i})\right)\cdot\prod_{k=0}^{K-1}Z_{i}(\bar{x}_{i}^{(k+1)},t_{k+1};\bar{x}_{i}^{(k)},t_{k})\ d\bar{x}_{i}^{(K)},\end{split} (67)

where the integral of variable x¯i(k)\bar{x}_{i}^{(k)} is over the set of all joint uncontrolled trajectories x¯i(τ)\bar{x}_{i}(\tau) on time interval [tk1,tk)[t_{k-1},t_{k}) and with initial condition x¯i(0)=x¯i(t0)=x¯i(t)\bar{x}_{i}^{(0)}=\bar{x}_{i}(t_{0})=\bar{x}_{i}(t), which can be measured by agent ii at initial time tt, and the function Zi(x¯i(k+1),tk+1;x¯i(k),tk)Z_{i}(\bar{x}_{i}^{(k+1)},t_{k+1};\bar{x}_{i}^{(k)},t_{k}) is implicitly defined by

\begin{split}\int f(\bar{x}_{i}^{(k+1)})\cdot&Z_{i}(\bar{x}_{i}^{(k+1)},t_{k+1};\bar{x}_{i}^{(k)},t_{k})\ d\bar{x}_{i}^{(k+1)}\\ &=\mathbb{E}_{\bar{x}_{i}^{(k)},t_{k}}\left[f(\bar{x}_{i}^{(k+1)})\cdot\exp\left(-\frac{1}{\lambda_{i}}\int_{t_{k}}^{t_{k+1}}q_{i}(\bar{y}_{i},\tau)\ d\tau\right)\Big{|}\ \bar{y}_{i}(t_{k})=\bar{x}_{i}^{(k)}\right]\end{split} (68)

for arbitrary functions f(x¯i(k+1))f(\bar{x}_{i}^{(k+1)}). Based on definition (68) and in the limit of infinitesimal ε\varepsilon, the function Zi(x¯i(k+1),tk+1;x¯i(k),tk)Z_{i}(\bar{x}_{i}^{(k+1)},t_{k+1};\bar{x}_{i}^{(k)},t_{k}) can be approximated by

Zi(x¯i(k+1),tk+1;x¯i(k),tk)=pi(x¯i(k+1),tk+1|x¯i(k),tk)exp(ελiqi(x¯i(k),tk)),Z_{i}(\bar{x}_{i}^{(k+1)},t_{k+1};\bar{x}_{i}^{(k)},t_{k})=p_{i}(\bar{x}_{i}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},t_{k})\cdot\exp\left(-\frac{\varepsilon}{\lambda_{i}}\cdot q_{i}(\bar{x}_{i}^{(k)},t_{k})\right), (69)

where pi(x¯i(k+1),tk+1|x¯i(k),tk)p_{i}(\bar{x}_{i}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},t_{k}) is the transition probability of uncontrolled dynamics from state-time pair (x¯i(k),tk)(\bar{x}_{i}^{(k)},t_{k}) to (x¯i(k+1),tk+1)(\bar{x}_{i}^{(k+1)},t_{k+1}) and can be factorized as follows

pi(x¯i(k+1),tk+1|x¯i(k),tk)=pi(x¯i(n)(k+1),tk+1|x¯i(k),tk)pi(x¯i(d)(k+1),tk+1|x¯i(k),tk)=pi(x¯i(n)(k+1),tk+1|x¯i(d)(k),x¯i(n)(k),tk)pi(x¯i(d)(k+1),tk+1|x¯i(d)(k),x¯i(n)(k),tk)pi(x¯i(d)(k+1),tk+1|x¯i(k),tk),\begin{split}p_{i}(\bar{x}_{i}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},t_{k})&=p_{i}(\bar{x}_{i(n)}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},t_{k})\cdot p_{i}(\bar{x}_{i(d)}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},t_{k})\\ &=p_{i}(\bar{x}_{i(n)}^{(k+1)},t_{k+1}|\bar{x}_{i(d)}^{(k)},\bar{x}_{i(n)}^{(k)},t_{k})\cdot p_{i}(\bar{x}_{i(d)}^{(k+1)},t_{k+1}|\bar{x}_{i(d)}^{(k)},\bar{x}_{i(n)}^{(k)},t_{k})\\ &\propto p_{i}(\bar{x}_{i(d)}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},t_{k}),\\ \end{split} (70)

where p_{i}(\bar{x}_{i(n)}^{(k+1)},t_{k+1}|\bar{x}_{i(d)}^{(k)},\bar{x}_{i(n)}^{(k)},t_{k}) is a Dirac delta function, since \bar{x}_{i(n)}^{(k+1)} can be deterministically calculated from \bar{x}_{i(n)}^{(k)} and \bar{x}_{i(d)}^{(k)}. Given the directly actuated uncontrolled dynamics from (29)

x¯i(d)(k+1)x¯i(d)(k)=f¯i(d)(x¯i(k),tk)ε+B¯i(d)(x¯i(k))σ¯iw¯i\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)}=\bar{f}_{i(d)}(\bar{x}^{(k)}_{i},t_{k})\varepsilon+\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})\cdot\bar{\sigma}_{i}\bar{w}_{i}

with Brownian motion increment \bar{w}_{i}\sim\mathcal{N}(0,\varepsilon I_{M}), the directly actuated states \bar{x}_{i(d)}^{(k)} and \bar{x}_{i(d)}^{(k+1)} follow the Gaussian distribution \bar{x}_{i(d)}^{(k+1)}\sim\mathcal{N}(\bar{x}_{i(d)}^{(k)}+\bar{f}_{i(d)}(\bar{x}^{(k)}_{i},t_{k})\varepsilon,\Sigma^{(k)}_{i}) with covariance \Sigma^{(k)}_{i}=\varepsilon\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})\bar{\sigma}_{i}\bar{\sigma}^{\top}_{i}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})^{\top}. When the condition \bar{\sigma}_{i}\bar{\sigma}^{\top}_{i}=\lambda_{i}\bar{R}_{i}^{-1} in Theorem 2 is fulfilled, the covariance becomes \Sigma_{i}^{(k)}=\varepsilon\lambda_{i}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})\bar{R}_{i}^{-1}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})^{\top}=\varepsilon H_{i}^{(k)} with H_{i}^{(k)}=\lambda_{i}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})\bar{R}_{i}^{-1}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})^{\top}=\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i(d)}(\bar{x}^{(k)}_{i})^{\top}. Hence, the transition probability in (70) satisfies

pi(x¯i(d)(k+1),\displaystyle p_{i}(\bar{x}_{i(d)}^{(k+1)}, tk+1|x¯i(k),tk)=1[det(2πΣi(k))]1/2exp(12x¯i(d)(k+1)x¯i(d)(k)f¯i(d)(x¯i(k),tk)ε(Σi(k))12)\displaystyle t_{k+1}|\bar{x}_{i}^{(k)},t_{k})=\frac{1}{[{\det(2\pi\Sigma_{i}^{(k)})}]^{1/2}}\exp\left(-\frac{1}{2}\left\|\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)}-\bar{f}_{i(d)}(\bar{x}^{(k)}_{i},t_{k})\varepsilon\right\|^{2}_{\left(\Sigma_{i}^{(k)}\right)^{-1}}\right)
=\displaystyle= 1[det(2πΣi(k))]1/2exp(ε2x¯i(d)(k+1)x¯i(d)(k)εf¯i(d)(x¯i(k),tk)(Hi(k))12).\displaystyle\frac{1}{[{\det(2\pi\Sigma_{i}^{(k)})}]^{1/2}}\cdot\exp\left(-\frac{\varepsilon}{2}\left\|\frac{\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)}}{\varepsilon}-\bar{f}_{i(d)}(\bar{x}^{(k)}_{i},t_{k})\right\|_{\left(H_{i}^{(k)}\right)^{-1}}^{2}\right). (71)
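For illustration, the following sketch evaluates the one-step transition density (71) for a hypothetical two-dimensional directly actuated subsystem under an Euler discretization; the drift, control matrix, and noise scaling are placeholder assumptions.

```python
import numpy as np

# Sketch of the discretized uncontrolled transition density (71) for the directly
# actuated states.  The drift, control matrix, and noise level are hypothetical.
rng = np.random.default_rng(3)
D, eps, lam = 2, 0.01, 1.0                        # directly actuated dimension, step, temperature

f_d = lambda x: -0.5 * x                          # directly actuated drift f_{i(d)}
B_d = np.eye(D)                                   # directly actuated control matrix B_{i(d)}
sigma = 0.3 * np.eye(D)                           # noise scaling sigma_i

x_k = rng.standard_normal(D)
H = B_d @ sigma @ sigma.T @ B_d.T                 # H_i^{(k)} in the text
Sigma = eps * H                                   # covariance of the one-step transition

# Sample x_{k+1} from the Gaussian transition and evaluate its log-density.
mean = x_k + f_d(x_k) * eps
x_kp1 = rng.multivariate_normal(mean, Sigma)

resid = (x_kp1 - x_k) / eps - f_d(x_k)
log_p = (-0.5 * eps * resid @ np.linalg.solve(H, resid)
         - 0.5 * np.log(np.linalg.det(2 * np.pi * Sigma)))
print("sampled x_{k+1}:", x_kp1)
print("log transition probability:", log_p)
```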

Substituting (69), (70) and (71) into (67) and taking the limit of infinitesimal \varepsilon, the desirability function Z_{i}(\bar{x}_{i},t) can then be rewritten as a path integral

Zi(x¯i,t)=limε0Zi(ε)(x¯i(0),t0).Z_{i}(\bar{x}_{i},t)=\lim_{\varepsilon\downarrow 0}Z_{i}^{(\varepsilon)}(\bar{x}^{(0)}_{i},t_{0}). (72)

Defining a path variable ¯i=(x¯i(1),,x¯i(K))\bar{\ell}_{i}=(\bar{x}^{(1)}_{i},\cdots,\bar{x}^{(K)}_{i}), the discretized desirability function in (72) can be expressed as

Zi(ε)(x¯i(0),t0)\displaystyle Z^{(\varepsilon)}_{i}(\bar{x}^{(0)}_{i},t_{0}) =exp(Siε,λi(x¯i(0),¯i,t0)12k=0K1logdet(2πΣi(k)))𝑑¯i\displaystyle=\int\exp\left(-S_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-\frac{1}{2}\sum_{k=0}^{K-1}\log\det(2\pi\Sigma_{i}^{(k)})\right)\ d\bar{\ell}_{i} (73)
=exp(Siε,λi(x¯i(0),¯i,t0)12k=0K1logdet(Hi(k))KD|𝒩¯i|2log(2πε))𝑑¯i\displaystyle=\int\exp\left(-S_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-\frac{1}{2}\sum_{k=0}^{K-1}\log\det(H_{i}^{(k)})-\frac{KD|\mathcal{\bar{N}}_{i}|}{2}\log(2\pi\varepsilon)\right)\ d\bar{\ell}_{i}
=exp(S~iε,λi(x¯i(0),¯i,t0)KD|𝒩¯i|2log(2πε))𝑑¯i,\displaystyle=\int\exp\left(-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-\frac{KD|\mathcal{\bar{N}}_{i}|}{2}\log(2\pi\varepsilon)\right)\ d\bar{\ell}_{i},

where S_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0}) is the path value of a trajectory (\bar{x}_{i}^{(0)},\cdots,\bar{x}_{i}^{(K)}) starting at the space-time pair (\bar{x}_{i},t), i.e. (\bar{x}_{i}^{(0)},t_{0}), and takes the form

Siε,λi(x¯i(0),¯i,t0)=ϕi(x¯i(K))λi+εk=0K1[qi(x¯i(k),tk)λi+12x¯i(d)(k+1)x¯i(d)(k)εf¯i(d)(x¯i(k),tk)(Hi(k))12];S^{\varepsilon,\lambda_{i}}_{i}(\bar{x}^{(0)}_{i},\bar{\ell}_{i},t_{0})=\frac{\phi_{i}(\bar{x}^{(K)}_{i})}{\lambda_{i}}+\varepsilon\sum_{k=0}^{K-1}\Bigg{[}\frac{q_{i}(\bar{x}^{(k)}_{i},t_{k})}{\lambda_{i}}+\frac{1}{2}\left\|\frac{\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)}}{\varepsilon}-\bar{f}_{i(d)}(\bar{x}^{(k)}_{i},t_{k})\right\|^{2}_{\left(H_{i}^{(k)}\right)^{-1}}\Bigg{]};

\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0}) is the generalized path value and satisfies \tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})=S_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})+\frac{1}{2}\sum_{k=0}^{K-1}\log\det(H_{i}^{(k)}); the constant KD|\bar{\mathcal{N}}_{i}|/2\cdot\log(2\pi\varepsilon) in (73) is related to numerical stability, which demands a careful choice of \varepsilon and a fine partition over [t,t_{f}). Identical to the expectation in (27) and the integral in (67), the integral in (73) is over the set of all uncontrolled trajectories \bar{x}_{i}(\tau) initialized at (\bar{x}_{i},t), i.e. (\bar{x}_{i}^{(0)},t_{0}). Summarizing the preceding derivations, (30) and (31) in Proposition 3 are recovered.
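To illustrate (31), the sketch below simulates one uncontrolled trajectory under hypothetical dynamics and accumulates the discretized path value S and its generalized version with the log-determinant terms; all models and constants are placeholders.

```python
import numpy as np

# Sketch of the discretized path value (31) and the generalized path value
# S_tilde = S + 0.5 * sum_k log det(H_k) for one sampled uncontrolled trajectory.
# The dynamics, costs, and dimensions below are hypothetical placeholders.
rng = np.random.default_rng(4)
D, K, eps, lam = 2, 50, 0.02, 1.0

f_d = lambda x, t: -x                             # directly actuated drift
q = lambda x, t: 0.5 * np.sum(x**2)               # immediate state cost
phi = lambda x: np.sum(x**2)                      # terminal cost
H = (0.3**2) * np.eye(D)                          # H_i^{(k)}, constant here for simplicity

# Simulate one uncontrolled trajectory with Euler-Maruyama.
x = np.zeros((K + 1, D))
x[0] = np.array([1.0, -1.0])
for k in range(K):
    x[k + 1] = (x[k] + f_d(x[k], k * eps) * eps
                + np.sqrt(eps) * np.linalg.cholesky(H) @ rng.standard_normal(D))

# Accumulate the path value S and the generalized path value S_tilde.
S = phi(x[K]) / lam
for k in range(K):
    resid = (x[k + 1] - x[k]) / eps - f_d(x[k], k * eps)
    S += eps * (q(x[k], k * eps) / lam + 0.5 * resid @ np.linalg.solve(H, resid))
S_tilde = S + 0.5 * K * np.log(np.linalg.det(H))
print(f"path value S = {S:.3f},  generalized path value S_tilde = {S_tilde:.3f}")
```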

Substituting gradient (62) and discretized desirability function (72) into the joint optimal control action (25), we have

u¯i(x¯i,t)\displaystyle\bar{u}^{*}_{i}(\bar{x}_{i},t) =λiR¯i1B¯i(x¯i)x¯iZi(x¯i,t)Zi(x¯i,t)=σ¯iσ¯iB¯i(x¯i)x¯iZi(x¯i,t)Zi(x¯i,t)\displaystyle=\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i}(\bar{x}_{i})^{\top}\frac{\nabla_{\bar{x}_{i}}Z_{i}(\bar{x}_{i},t)}{Z_{i}(\bar{x}_{i},t)}=\bar{\sigma}_{i}\bar{\sigma}_{i}^{\top}\bar{B}_{i}(\bar{x}_{i})^{\top}\frac{\nabla_{\bar{x}_{i}}Z_{i}(\bar{x}_{i},t)}{Z_{i}(\bar{x}_{i},t)} (74)
=λiR¯i1B¯i(x¯i)limε0x¯i(0)exp[S~iε,λi(x¯i(0),¯i,t0)KD|𝒩¯i|/2log(2πε)]𝑑¯iexp[S~iε,λi(x¯i(0),¯i,t0)KD|𝒩¯i|/2log(2πε)]𝑑¯i\displaystyle=\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i}(\bar{x}_{i})^{\top}\cdot\lim_{\varepsilon\downarrow 0}\frac{\nabla_{\bar{x}^{(0)}_{i}}\int\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-KD|\mathcal{\bar{N}}_{i}|/2\cdot\log(2\pi\varepsilon)]\ d\bar{\ell}_{i}}{\int\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-KD|\mathcal{\bar{N}}_{i}|/2\cdot\log(2\pi\varepsilon)]\ d\bar{\ell}_{i}}\allowdisplaybreaks
=(a)λiR¯i1B¯i(x¯i)limε0exp[KD|𝒩¯i|/2log(2πε)]x¯i(0)exp[S~iε,λi(x¯i(0),¯i,t0)]𝑑¯iexp[KD|𝒩¯i|/2log(2πε)]exp[S~iε,λi(x¯i(0),¯i,t0)]𝑑¯i\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(a)}}}}{{=}}\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i}(\bar{x}_{i})^{\top}\cdot\lim_{\varepsilon\downarrow 0}\frac{\exp[-KD|\mathcal{\bar{N}}_{i}|/2\cdot\log(2\pi\varepsilon)]\cdot\nabla_{\bar{x}^{(0)}_{i}}\int\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]\ d\bar{\ell}_{i}}{\exp[-KD|\mathcal{\bar{N}}_{i}|/2\cdot\log(2\pi\varepsilon)]\cdot\int\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]\ d\bar{\ell}_{i}}\allowdisplaybreaks
=(b)λiR¯i1B¯i(x¯i)limε0exp[S~iε,λi(x¯i(0),¯i,t0)]x¯i(0)[S~iε,λi(x¯i(0),¯i,t0)]di¯exp[S~iε,λi(x¯i(0),¯i,t0)]𝑑¯i\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(b)}}}}{{=}}\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i}(\bar{x}_{i})^{\top}\cdot\lim_{\varepsilon\downarrow 0}\frac{\int\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]\cdot\nabla_{\bar{x}^{(0)}_{i}}[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]\ d\bar{\ell_{i}}}{\int\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]\ d\bar{\ell}_{i}}
=(c)λiR¯i1B¯i(x¯i)limε0p~i(¯i|x¯i(0),t0)x¯i(0)[S~iε,λi(x¯i(0),¯i,t0)]d¯i\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(c)}}}}{{=}}\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i}(\bar{x}_{i})^{\top}\cdot\lim_{\varepsilon\downarrow 0}\int\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})\cdot\nabla_{\bar{x}^{(0)}_{i}}[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]\ d\bar{\ell}_{i}
=(d)λiR¯i1B¯i(d)(x¯i)limε0p~i(¯i|x¯i(0),t0)u~i(x¯i(0),¯i,t0)𝑑¯i.\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(d)}}}}{{=}}\lambda_{i}\bar{R}_{i}^{-1}\bar{B}_{i(d)}(\bar{x}_{i})^{\top}\cdot\lim_{\varepsilon\downarrow 0}\int\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})\cdot\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})\ d\bar{\ell}_{i}.

(a) follows from the fact that \exp[-KD|\mathcal{\bar{N}}_{i}|/2\cdot\log(2\pi\varepsilon)] is independent of the path variables (\bar{x}_{i}^{(0)},\bar{\ell}_{i}); (b) employs the differentiation rule for the exponential function and requires that the integrand \exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})] be continuously differentiable in \varepsilon and along the trajectory (\bar{x}_{i}^{(0)},\bar{\ell}_{i}); (c) follows from the optimal path distribution \tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0}) that satisfies

p~i(¯i|x¯i(0),t0)=exp[S~iε,λi(x¯i(0),¯i,t0)]exp[S~iε,λi(x¯i(0),¯i,t0)]𝑑¯i;\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})=\frac{\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]}{\int\exp[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})]\ d\bar{\ell}_{i}};

(d) follows from the partitions \bar{B}_{i}(\bar{x}_{i})=[0,\bar{B}_{i(d)}(\bar{x}_{i})^{\top}]^{\top} and -\nabla_{\bar{x}^{(0)}_{i}}\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})=[-\nabla_{\bar{x}^{(0)}_{i(n)}}\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})^{\top},-\nabla_{\bar{x}^{(0)}_{i(d)}}\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})^{\top}]^{\top}, and the initial control variable \tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0}) is determined by

x¯i(d)(0)S~iε,λi(x¯i(0),¯i,t0)=x¯i(d)(0)[ϕi(x¯i(K))λi+ελik=0K1qi(x¯i(k),tk)+12k=0K1logdet(Hi(k))\displaystyle\nabla_{\bar{x}^{(0)}_{i(d)}}\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})=\nabla_{\bar{x}^{(0)}_{i(d)}}\bigg{[}\frac{\phi_{i}(\bar{x}^{(K)}_{i})}{\lambda_{i}}+\frac{\varepsilon}{\lambda_{i}}\sum_{k=0}^{K-1}q_{i}(\bar{x}^{(k)}_{i},t_{k})+\frac{1}{2}\sum_{k=0}^{K-1}\log\det(H_{i}^{(k)}) (75)
+ε2k=0K1x¯i(d)(k+1)x¯i(d)(k)εf¯i(d)(x¯i(k),tk)(Hi(k))12].\displaystyle\hskip 191.0pt+\frac{\varepsilon}{2}\sum_{k=0}^{K-1}\left\|\frac{\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)}}{\varepsilon}-\bar{f}_{i(d)}(\bar{x}^{(k)}_{i},t_{k})\right\|^{2}_{\left(H_{i}^{(k)}\right)^{-1}}\bigg{]}.

In the following, we calculate the gradients in (75). Since the terminal cost \phi_{i}(\bar{x}^{(K)}_{i}) is usually a constant, the first gradient in (75) is zero, i.e. \nabla_{\bar{x}^{(0)}_{i(d)}}[\phi_{i}(\bar{x}_{i}^{(K)})/\lambda_{i}]=0. When the immediate cost q_{i}(\bar{x}^{(0)}_{i},t_{0}) is a function of the state \bar{x}^{(0)}_{i(d)}, the second gradient in (75) can be computed as follows

x¯i(d)(0)ελik=0K1qi(x¯i(k),tk)=ελix¯i(d)(0)qi(x¯i(0),t0);\nabla_{\bar{x}^{(0)}_{i(d)}}\frac{\varepsilon}{\lambda_{i}}\sum_{k=0}^{K-1}q_{i}(\bar{x}^{(k)}_{i},t_{k})=\frac{\varepsilon}{\lambda_{i}}\nabla_{\bar{x}^{(0)}_{i(d)}}q_{i}(\bar{x}^{(0)}_{i},t_{0}); (76)

when q_{i}(\bar{x}^{(0)}_{i},t_{0}) does not depend on the value of \bar{x}^{(0)}_{i}, e.g. when it is a constant or an indicator function, the second gradient is zero. The third gradient in (75) follows

x¯i(d)(0)12k=0K1logdet(Hi(k))=12x¯i(d)(0)logdet(Hi(0)).\nabla_{\bar{x}^{(0)}_{i(d)}}\frac{1}{2}\sum_{k=0}^{K-1}\log\det(H_{i}^{(k)})=\frac{1}{2}\nabla_{\bar{x}^{(0)}_{i(d)}}\log\det(H_{i}^{(0)}). (77)

Letting αi(k)=(x¯i(d)(k+1)x¯i(d)(k))/εf¯i(d)(x¯i(k),tk)\alpha_{i}^{(k)}=(\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)})/\varepsilon-\bar{f}_{i(d)}(\bar{x}_{i}^{(k)},t_{k}) and βi(k)=(Hi(k))1αi(k)\beta_{i}^{(k)}=(H_{i}^{(k)})^{-1}\alpha_{i}^{(k)}, the gradient of the fourth term in (75) satisfies

x¯i(d)(0)ε2k=0K1αi(k)(Hi(k))12=ε2x¯i(d)(0)(αi(0))βi(0)\displaystyle\nabla_{\bar{x}_{i(d)}^{(0)}}\frac{\varepsilon}{2}\sum_{k=0}^{K-1}\left\|\alpha_{i}^{(k)}\right\|^{2}_{\left(H_{i}^{(k)}\right)^{-1}}=\frac{\varepsilon}{2}\cdot\nabla_{\bar{x}^{(0)}_{i(d)}}(\alpha_{i}^{(0)})^{\top}\beta_{i}^{(0)}
=ε2[(x¯i(d)(0)αi(0))βi(0)+(x¯i(d)(0)βi(0))αi(0)]\displaystyle=\frac{\varepsilon}{2}\left[\left(\nabla_{\bar{x}^{(0)}_{i(d)}}\alpha_{i}^{(0)}\right)\beta_{i}^{(0)}+\left(\nabla_{\bar{x}_{i(d)}^{(0)}}\beta_{i}^{(0)}\right)\alpha_{i}^{(0)}\right] (78)
=12βi(0)ε2[(x¯i(d)(0)f¯i(d)(x¯i(0),t0))βi(0)αi(0)x¯i(d)(0)βi(0)]\displaystyle=-\frac{1}{2}\beta_{i}^{(0)}-\frac{\varepsilon}{2}\left[\left(\nabla_{\bar{x}^{(0)}_{i(d)}}\bar{f}_{i(d)}(\bar{x}_{i}^{(0)},t_{0})\right)\beta_{i}^{(0)}-\alpha_{i}^{(0)}\nabla_{\bar{x}^{(0)}_{i(d)}}\beta_{i}^{(0)}\right]
=(Hi(0))1αi(0)ε[x¯i(d)(0)f¯i(d)(x¯i(0))](Hi(0))1αi(0)+ε2(αi(0))[x¯i(d)(0)(Hi(0))1]αi(0).\displaystyle=-\left(H_{i}^{(0)}\right)^{-1}\alpha_{i}^{(0)}-\varepsilon\left[\nabla_{\bar{x}^{(0)}_{i(d)}}\bar{f}_{i(d)}(\bar{x}_{i}^{(0)})\right]\left(H_{i}^{(0)}\right)^{-1}\alpha_{i}^{(0)}+\frac{\varepsilon}{2}\left(\alpha_{i}^{(0)}\right)^{\top}\left[\nabla_{\bar{x}^{(0)}_{i(d)}}\left(H_{i}^{(0)}\right)^{-1}\right]\alpha_{i}^{(0)}.

A detailed interpretation of the calculation in (78) can be found in [22, 55]. Meanwhile, after substituting (78) into (74), one can verify that the integrals in (74) satisfy

p~i(¯i|x¯i(0),t0)ε[x¯i(d)(0)f¯i(d)(x¯i(0))](Hi(0))1αi(0)𝑑=0,\displaystyle\int\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})\cdot\varepsilon\left[\nabla_{\bar{x}^{(0)}_{i(d)}}\bar{f}_{i(d)}(\bar{x}_{i}^{(0)})\right]\left(H_{i}^{(0)}\right)^{-1}\alpha_{i}^{(0)}d\ell=0, (79)
p~i(¯i|x¯i(0),t0)ε2(αi(0))[x¯i(d)(0)(Hi(0))1]αi(0)d=12x¯i(d)(0)logdet(Hi(0)).\displaystyle\int\tilde{p}^{*}_{i}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)},t_{0})\cdot\frac{\varepsilon}{2}\left(\alpha_{i}^{(0)}\right)^{\top}\left[\nabla_{\bar{x}^{(0)}_{i(d)}}\left(H_{i}^{(0)}\right)^{-1}\right]\alpha_{i}^{(0)}d\ell=-\frac{1}{2}\nabla_{\bar{x}^{(0)}_{i(d)}}\log\det(H_{i}^{(0)}). (80)

Substituting (75-80) into (74), we obtain the initial control variable u~i(x¯i(0),¯i,t0)\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0}) as follows

u~i(x¯i(0),¯i,t0)=ελix¯i(d)(0)qi(x¯i(0),t0)+(Hi(0))1(x¯i(d)(1)x¯i(d)(0)εf¯i(d)(x¯i(0),t0)).\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})=-\frac{\varepsilon}{\lambda_{i}}\nabla_{\bar{x}^{(0)}_{i(d)}}q_{i}(\bar{x}^{(0)}_{i},t_{0})+\left(H_{i}^{(0)}\right)^{-1}\left(\frac{\bar{x}_{i(d)}^{(1)}-\bar{x}_{i(d)}^{(0)}}{\varepsilon}-\bar{f}_{i(d)}(\bar{x}_{i}^{(0)},t_{0})\right).

This completes the proof. ∎
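Putting the pieces of this proof together, the following sketch approximates the joint optimal control (74) by self-normalized importance sampling over uncontrolled sample trajectories and a weighted average of the initial control variables (34). The dynamics, costs, and noise level are hypothetical, and B_{i(d)} is taken as the identity, so this is a schematic illustration of the estimator rather than the paper's (35)-(36).

```python
import numpy as np

# Monte Carlo sketch of the path-integral control estimate (74) at the initial
# state: uncontrolled trajectories are sampled, importance weights approximating
# the optimal path distribution are formed, and the initial control variables
# (34) are averaged.  All models and constants are hypothetical placeholders.
rng = np.random.default_rng(5)
D, K, eps, lam, n_samples = 2, 40, 0.02, 1.0, 2000

f_d = lambda x: -x                               # drift of the directly actuated states
q = lambda x: 0.5 * np.sum(x**2)                 # immediate state cost
phi = lambda x: np.sum(x**2)                     # terminal cost
sigma = 0.3 * np.eye(D)
H = sigma @ sigma.T                              # H = B_d sigma sigma^T B_d^T with B_d = I
R_inv = H / lam                                  # condition sigma sigma^T = lam * R^{-1}
B_d = np.eye(D)

x0 = np.array([1.0, -1.0])
state_cost = np.zeros(n_samples)                 # phi(x_K)/lam + eps * sum_k q(x_k)/lam
u_tilde = np.zeros((n_samples, D))

for y in range(n_samples):
    x = x0.copy()
    for k in range(K):
        x_next = x + f_d(x) * eps + np.sqrt(eps) * sigma @ rng.standard_normal(D)
        state_cost[y] += eps * q(x) / lam
        if k == 0:
            resid0 = (x_next - x) / eps - f_d(x)
            # initial control variable (34): -eps/lam * grad q(x0) + H^{-1} * residual
            u_tilde[y] = -(eps / lam) * x + np.linalg.solve(H, resid0)
        x = x_next
    state_cost[y] += phi(x) / lam

# Since the paths are drawn from the uncontrolled dynamics, the quadratic terms of
# the generalized path value cancel against the sampling density, and the
# self-normalized importance weights reduce to the exponentiated state costs.
w = np.exp(-(state_cost - state_cost.min()))
w /= w.sum()
u_star = lam * R_inv @ B_d.T @ (w @ u_tilde)     # estimate of the joint optimal control (74)
print("estimated optimal control at (x0, t0):", u_star)
```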

Appendix D: Relative Entropy Policy Search

The local REPS algorithm in subsystem \bar{\mathcal{N}}_{i} alternates between two steps, learning the optimal path distribution and updating the parameterized policy, until convergence. First, we consider the learning step realized by the optimization problem (37). Since we want to minimize the relative entropy between the current approximate path distribution \tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) and the optimal path distribution \tilde{p}_{i}^{*}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}), the objective function of the learning step follows

argminp~iKL(p~i(x¯i(0),¯i)p~i(x¯i(0),¯i))=(a)argmaxp~ip~i(x¯i(0),¯i)[logp~i(¯i|x¯i(0))+logμ(x¯i(0))logp~i(x¯i(0),¯i)]𝑑x¯i(0)𝑑¯i=(b)argmaxp~ip~i(x¯i(0),¯i)[S~iε,λi(x¯i(0),¯i,t0)logp~i(x¯i(0),¯i)]𝑑x¯i(0)𝑑¯i.\begin{split}&\arg\min_{\tilde{p}_{i}}\textrm{KL}(\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\ \|\ \tilde{p}_{i}^{*}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}))\\ \stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(a)}}}}{{=}}&\arg\max_{\tilde{p}_{i}}\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\left[\log\tilde{p}_{i}^{*}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)})+\log\mu(\bar{x}_{i}^{(0)})-\log\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\right]\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}\\ \stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(b)}}}}{{=}}&\arg\max_{\tilde{p}_{i}}\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\left[-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-\log\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\right]\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}.\end{split}

(a) transforms the minimization problem into a maximization problem and adopts the identity \tilde{p}_{i}^{*}(\bar{x}^{(0)}_{i},\bar{\ell}_{i})=\tilde{p}_{i}^{*}(\bar{\ell}_{i}|\bar{x}_{i}^{(0)})\cdot\mu_{i}(\bar{x}^{(0)}_{i}), where time arguments are generally omitted in this appendix for brevity; and (b) employs identity (33) and omits the terms that are independent of the path variable \bar{\ell}_{i}, since these terms have no influence on the optimization problem. In order to constrain the information loss from the old distribution and avoid overly greedy policy updates [21, 23], we restrict the update rate with the constraint

p~i(x¯i(0),¯i)logp~i(x¯i(0),¯i)q~i(x¯i(0),¯i)dx¯i(0)d¯iδ,\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\log\frac{\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})}{\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})}\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}\leq\delta,

where \delta>0 serves as a trade-off between exploration and exploitation, and the LHS is the relative entropy between the current approximate path distribution \tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) and the old approximate path distribution \tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}). Meanwhile, the marginal distribution \tilde{p}_{i}(\bar{x}_{i}^{(0)})=\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\ d\bar{\ell}_{i} needs to match the initial distribution \mu_{i}(\bar{x}^{(0)}_{i}), which is known to the designer. However, this condition could generate an infinite number of constraints in the optimization problem (37) and is too restrictive in practice [50, 23, 34]. Hence, we relax this condition by only matching the feature averages of the initial state

p~i(x¯i(0),¯i)ψi(x¯i(0))𝑑x¯i(0)𝑑¯i=μi(x¯i(0))ψi(x¯i(0))𝑑x¯i(0)=ψ^i(0),\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\cdot\psi_{i}(\bar{x}_{i}^{(0)})\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}=\int\mu_{i}(\bar{x}_{i}^{(0)})\cdot\psi_{i}(\bar{x}_{i}^{(0)})\ d\bar{x}_{i}^{(0)}=\hat{\psi}^{(0)}_{i},

where \psi_{i}(\bar{x}_{i}^{(0)}) is a feature vector of the initial state, and \hat{\psi}^{(0)}_{i} is the expectation of the state feature vector under the initial distribution \mu_{i}(\bar{x}^{(0)}_{i}). In general, \psi_{i}(\bar{x}_{i}^{(0)}) can be a vector made up of linear and quadratic terms of the initial states, \bar{x}_{i(m)}^{(0)} and \bar{x}_{i(m)}^{(0)}\bar{x}_{i(n)}^{(0)}, such that the mean and the covariance of the marginal distribution \tilde{p}_{i}(\bar{x}_{i}^{(0)}) match those of the initial distribution \mu_{i}(\bar{x}^{(0)}_{i}). Lastly, we consider the following normalization constraint

p~i(x¯i(0),¯i)𝑑x¯i(0)𝑑¯i=1,\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}=1, (81)

which ensures that p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) defines a probability distribution.

The optimization problem (37) can be solved analytically by the method of Lagrange multipliers. Defining the Lagrange multipliers κ>0\kappa>0, η\eta\in\mathbb{R} and vector θ\theta, the Lagrangian is

=η+κδ+θψ^i(0)+p~i(x¯i(0),¯i)[\displaystyle\mathcal{L}=\eta+\kappa\delta+\theta^{\top}\hat{\psi}_{i}^{(0)}+\int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\Bigg{[} S~iε,λi(x¯i(0),¯i)logp~i(x¯i(0),¯i)η\displaystyle-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})-\log\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})-\eta (82)
θψi(x¯i(0))κlogp~i(x¯i(0),¯i)q~i(x¯i(0),¯i)]dx¯i(0)d¯i.\displaystyle-\theta^{\top}\psi_{i}(\bar{x}_{i}^{(0)})-\kappa\log\frac{\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})}{\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})}\Bigg{]}\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}.

We can maximize the Lagrangian \mathcal{L} and derive the maximizer p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) in (37) by letting /p~i(x¯i(0),¯i)=0\partial\mathcal{L}/\partial\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})=0. This condition will hold for arbitrary initial distributions μi(x¯i(0))\mu_{i}(\bar{x}_{i}^{(0)}) if and only if the derivative of the integrand in (82) is identically equal to zero, i.e.

S~iε,λi(x¯i(0),¯i,t0)(1+κ)[1+logp~i(x¯i(0),¯i)]ηθψi(x¯i(0))+κlogq~i(x¯i(0)¯i)=0,-\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i},t_{0})-(1+\kappa)\left[1+\log\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\right]-\eta-\theta^{\top}\psi_{i}(\bar{x}_{i}^{(0)})+\kappa\log\tilde{q}_{i}(\bar{x}_{i}^{(0)}\bar{\ell}_{i})=0,

from which we can find the maximizer p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) as shown in (38). In order to evaluate p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) in (38), we then determine the values of dual variables κ,η\kappa,\eta and θ\theta by solving the dual problem. Substituting (38) into the normalization constraint (81), we have the identity

exp(1+κ+η1+κ)=q~i(x¯i(0),¯i)κ1+κexp(S~iε,λi(x¯i(0),¯i)+θψi(x¯i(0))1+κ)𝑑x¯i(0)𝑑¯i,\exp\left(\frac{1+\kappa+\eta}{1+\kappa}\right)=\int\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})^{\frac{\kappa}{1+\kappa}}\cdot\exp\left(-\frac{\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})+\theta^{\top}\psi_{i}(\bar{x}_{i}^{(0)})}{1+\kappa}\right)\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}, (83)

which can be used to determine \eta once the values of \kappa and \theta are given. To determine the values of \kappa and \theta, we solve the dual problem (39), where the objective function g(\kappa,\theta) in (40) is obtained by substituting (38) and (83) into (82)

g(κ,θ)=(a)η+κδ+θψ^i(0)+1+κ=κδ+θψ^i(0)+(1+κ)1+κ+η1+κ=(b)κδ+θψ^i(0)+(1+κ)logq~i(x¯i(0),¯i)κ1+κexp(S~iε,λi(x¯i(0),¯i)+θψi(x¯i(0))1+κ)𝑑x¯i(0)𝑑¯i.\begin{split}g(\kappa,&\theta)\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(a)}}}}{{=}}\eta+\kappa\delta+\theta^{\top}\hat{\psi}_{i}^{(0)}+1+\kappa\\ &=\kappa\delta+\theta^{\top}\hat{\psi}_{i}^{(0)}+(1+\kappa)\frac{1+\kappa+\eta}{1+\kappa}\\ &\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(b)}}}}{{=}}\kappa\delta+\theta^{\top}\hat{\psi}_{i}^{(0)}+(1+\kappa)\log\int\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})^{\frac{\kappa}{1+\kappa}}\cdot\exp\left(-\frac{\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})+\theta^{\top}\psi_{i}(\bar{x}_{i}^{(0)})}{1+\kappa}\right)\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}.\end{split}

(a) substitutes (38) into (82), and (b) substitutes (83) into the result. By applying the Monte Carlo method and using the data set 𝒴i={(x¯i(0),¯i[y])}y=1,,Y\mathcal{Y}_{i}=\{(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})\}_{y=1,\cdots,Y}, the objective function (40) can be approximated from sample trajectories by

g(κ,θ)=κδ+θψ^i(0)+(1+κ)log[1Yy=1Yq~i(x¯i(0),¯i[y])κ1+κexp(S~iε,λi(x¯i(0),¯i[y])+ψi(x¯i(0))θ1+κ)],g(\kappa,\theta)=\kappa\delta+\theta^{\top}\hat{\psi}_{i}^{(0)}+(1+\kappa)\log\Bigg{[}\frac{1}{Y}\sum_{y=1}^{Y}\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})^{\frac{\kappa}{1+\kappa}}\exp\left(-\frac{\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})+\psi_{i}^{\top}(\bar{x}_{i}^{(0)})\cdot\theta}{1+\kappa}\right)\Bigg{]},

where S~iε,λi(x¯i(0),¯i[y])\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i}) and ψi(x¯i(0))\psi_{i}(\bar{x}_{i}^{(0)}) are respectively the generalized path value and state feature vector of sample trajectory (x¯i(0),¯i[y])(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i}), and the old path probability q~i(x¯i(0),¯i[y])\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i}) can be evaluated with the current or initial policy by

q~i(x¯i(0),¯i[y])=μi(x¯i(0))k=0K1pi(x¯i(k+1),tk+1|x¯i(k),u¯i(k),tk)πi(k)(u¯i(k)|x¯i(k))𝑑u¯i(k),\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})=\mu_{i}(\bar{x}_{i}^{(0)})\cdot\prod_{k=0}^{K-1}\int p_{i}(\bar{x}_{i}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},\bar{u}_{i}^{(k)},t_{k})\cdot\pi^{(k)}_{i}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)})\ d\bar{u}_{i}^{(k)}, (84)

where the state variables \bar{x}_{i}^{(0)}, \bar{x}_{i}^{(k)} and \bar{x}_{i}^{(k+1)} in (84) are from the sample trajectory (\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]}); the control policy \pi^{(k)}_{i}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)}) is given either as an initialization or as an optimization result from the updating step (85), and the controlled transition probability p_{i}(\bar{x}_{i}^{(k+1)},t_{k+1}|\bar{x}_{i}^{(k)},\bar{u}_{i}^{(k)},t_{k}) with \bar{x}_{i}^{(k+1)}\sim\mathcal{N}(\bar{x}_{i}^{(k)}+\bar{f}_{i}(\bar{x}^{(k)}_{i},t_{k})\varepsilon+\bar{B}_{i}(\bar{x}_{i}^{(k)})\bar{u}_{i}(\bar{x}_{i}^{(k)},t_{k})\varepsilon,\Sigma^{(k)}_{i}) can be obtained by following steps similar to those used in deriving the uncontrolled transition probability (71). When policy \pi_{i}^{(k)} is Gaussian, (84) can be evaluated analytically.
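As a sketch of the sample-based dual above, the code below builds g(\kappa,\theta) from synthetic generalized path values, old path probabilities, and initial-state features, and minimizes it with a standard quasi-Newton routine; all sampled quantities are random placeholders rather than data from an actual subsystem.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the sample-based REPS dual: synthetic generalized path values S_tilde,
# old path probabilities q_tilde, and initial-state features are generated as
# placeholders; only the structure of g(kappa, theta) is illustrated.
rng = np.random.default_rng(6)
Y, n_feat, delta = 200, 3, 0.1

S_tilde = rng.normal(5.0, 1.0, Y)                 # generalized path values of sample trajectories
log_q_tilde = rng.normal(-2.0, 0.5, Y)            # log of old path probabilities q_tilde
psi = rng.normal(size=(Y, n_feat))                # state features psi_i(x_i^(0)) per trajectory
psi_hat = psi.mean(axis=0)                        # target feature average psi_hat^(0)

def dual(params):
    """Sample approximation of g(kappa, theta); kappa parameterized as exp(.) > 0."""
    kappa = np.exp(params[0])
    theta = params[1:]
    expo = (kappa / (1 + kappa)) * log_q_tilde - (S_tilde + psi @ theta) / (1 + kappa)
    # log-mean-exp for numerical stability
    m = expo.max()
    log_mean = m + np.log(np.mean(np.exp(expo - m)))
    return kappa * delta + theta @ psi_hat + (1 + kappa) * log_mean

res = minimize(dual, x0=np.zeros(1 + n_feat), method="L-BFGS-B")
kappa_opt, theta_opt = np.exp(res.x[0]), res.x[1:]
print("optimal kappa:", kappa_opt)
print("optimal theta:", theta_opt)
```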

In the policy updating step, we can find the optimal parameters χi(k)\chi_{i}^{*(k)} for Gaussian policy by minimizing the relative entropy between the joint distribution p~i(x¯i(0),¯i)\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) from learning step (37) and joint distribution p~iπ(x¯i(0),¯i)\tilde{p}_{i}^{\pi}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}) generated by parametric policy πi(k)(u¯i(k)|x¯i(k),χi(k))𝒩(u¯i(k)|a^i(k)x¯i(k)+b^i(k),Σ^i(k))\pi_{i}^{(k)}(\bar{u}_{i}^{(k)}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)})\allowbreak\sim\mathcal{N}(\bar{u}_{i}^{(k)}|\hat{a}_{i}^{(k)}\bar{x}_{i}^{(k)}+\hat{b}_{i}^{(k)},\hat{\Sigma}_{i}^{(k)}). To determine the policy parameters χi(k)=(a^i(k),b^i(k),Σ^i(k))\chi_{i}^{(k)}=(\hat{a}_{i}^{(k)},\hat{b}_{i}^{(k)},\hat{\Sigma}_{i}^{(k)}) at time tkt_{k}, we need to solve the following optimization problem, which is also a weighted maximum likelihood problem

χi(k)\displaystyle\chi_{i}^{*(k)} =argminχi(k)KL(p~i(x¯i(0),¯i)p~iπ(x¯i(0),¯i))\displaystyle=\arg{\textstyle\min_{\chi_{i}^{(k)}}}\ \textrm{KL}(\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\ \|\ \tilde{p}_{i}^{\pi}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}))
=(a)argmaxχi(k)p~i(x¯i(0),¯i)logp~iπ(x¯i(k+1)|x¯i(k))p~i(x¯i(k+1)|x¯i(k))dx¯i(0)d¯i\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(a)}}}}{{=}}\arg{\textstyle\max_{\chi_{i}^{(k)}}}\ \int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\cdot\log\frac{\tilde{p}_{i}^{\pi}(\bar{x}_{i}^{(k+1)}|\bar{x}_{i}^{(k)})}{\tilde{p}_{i}(\bar{x}_{i}^{(k+1)}|\bar{x}_{i}^{(k)})}\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}\allowdisplaybreaks
(b)argmaxχi(k)p~i(x¯i(0),¯i)logπi(k)(u¯i(k)|x¯i(k),χi(k))𝑑x¯i(0)𝑑¯i\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(b)}}}}{{\approx}}\arg{\textstyle\max_{\chi_{i}^{(k)}}}\ \int\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}_{i})\cdot\log{\pi}_{i}^{(k)}(\bar{u}_{i}^{*(k)}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)})\ d\bar{x}_{i}^{(0)}d\bar{\ell}_{i}\allowdisplaybreaks (85)
=(c)argmaxχi(k)y=1Yp~i(x¯i(0),¯i[y])q~i(x¯i(0),¯i[y])logπi(k)(u¯i(k)|x¯i(k),χi(k))\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(c)}}}}{{=}}\arg{\textstyle\max_{\chi_{i}^{(k)}}}\ \sum_{y=1}^{Y}\frac{\tilde{p}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})}{\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})}\cdot\log{\pi}_{i}^{(k)}(\bar{u}_{i}^{*(k)}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)})\allowdisplaybreaks
=(d)argmaxχi(k)y=1Ydi[y]logπi(k)(u¯i(k)|x¯i(k),χi(k)).\displaystyle\stackrel{{\scriptstyle\smash{\footnotesize\mathrm{(d)}}}}{{=}}\arg{\textstyle\max_{\chi_{i}^{(k)}}}\sum_{y=1}^{Y}d_{i}^{[y]}\cdot\log{\pi}_{i}^{(k)}(\bar{u}_{i}^{*(k)}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)})\allowdisplaybreaks.

(a) converts the minimization problem to a maximization problem and replaces the joint distributions by the products of step-wise transition distributions; (b) employs the assumption that the distribution of controlled transition equals the product of passive transition distribution and control policy distribution [23], i.e. p~iπ(x¯i(k+1)|x¯i(k))=p~i(x¯i(k+1)|x¯i(k))πi(k)(u¯i(k)|x¯i(k),χi(k))\tilde{p}_{i}^{\pi}(\bar{x}_{i}^{(k+1)}|\bar{x}_{i}^{(k)})=\tilde{p}_{i}(\bar{x}_{i}^{(k+1)}|\bar{x}_{i}^{(k)})\cdot{\pi}_{i}^{(k)}(\bar{u}_{i}^{*(k)}|\bar{x}_{i}^{(k)},\chi_{i}^{(k)}), and the control action u¯i(k)=[B¯i(d)(x¯i(k))B¯i(d)(x¯i(k))]1B¯i(d)(x¯i(k))[x¯i(d)(k+1)x¯i(d)(k)εf¯i(d)(x¯i(k),tk)]/ε\bar{u}_{i}^{*(k)}=[\bar{B}_{i(d)}(\bar{x}_{i}^{(k)})^{\top}\bar{B}_{i(d)}(\bar{x}_{i}^{(k)})]^{-1}\allowbreak\bar{B}_{i(d)}(\bar{x}_{i}^{(k)})^{\top}[\bar{x}_{i(d)}^{(k+1)}-\bar{x}_{i(d)}^{(k)}-\varepsilon\bar{f}_{i(d)}(\bar{x}_{i}^{(k)},t_{k})]/\varepsilon from control affine system dynamics maximizes the likelihood function; (c) approximates the integral by using sample trajectories from q~i(x¯i(0),¯i[y])\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i}); (d) substitutes (38), and the weight of likelihood function is

di[y]=q~i(x¯i(0),¯i[y])11+κexp(S~iε,λi(x¯i(0),¯i[y])+ψi(x¯i(0))θ1+κ).d_{i}^{[y]}=\tilde{q}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})^{\frac{-1}{1+\kappa}}\cdot\exp\left(-\frac{\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i})+\psi_{i}^{\top}(\bar{x}_{i}^{(0)})\cdot\theta}{1+\kappa}\right).

Constant terms are omitted in steps (a) to (d).
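Finally, a minimal sketch of the weighted maximum likelihood update (85) for a time-indexed linear-Gaussian policy: given synthetic states, reconstructed control actions, and weights d_i^{[y]}, the policy parameters (a, b, \Sigma) follow from a weighted least-squares fit and a weighted residual covariance; the data below are placeholders.

```python
import numpy as np

# Sketch of the weighted maximum likelihood update (85) for a linear-Gaussian
# policy pi(u|x) = N(a x + b, Sigma).  The sample states, reconstructed control
# actions, and weights d^[y] are synthetic placeholders.
rng = np.random.default_rng(7)
Y, n_x, n_u = 300, 3, 2

X = rng.standard_normal((Y, n_x))                 # joint states x_i^(k) of sampled trajectories
A_true = rng.standard_normal((n_u, n_x))
U = X @ A_true.T + 0.1 * rng.standard_normal((Y, n_u))   # reconstructed controls u_i*^(k)
d = rng.random(Y)                                 # importance weights d_i^[y]
d /= d.sum()

# Weighted linear regression for (a, b): augment states with a constant feature.
Xa = np.hstack([X, np.ones((Y, 1))])
W = np.diag(d)
coef = np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ U)      # shape (n_x + 1, n_u)
a_hat, b_hat = coef[:-1].T, coef[-1]

# Weighted covariance of the residuals gives Sigma_hat.
resid = U - Xa @ coef
Sigma_hat = resid.T @ W @ resid
print("fitted gain a:\n", a_hat)
print("fitted offset b:", b_hat)
print("fitted covariance Sigma:\n", Sigma_hat)
```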

Appendix E: Multi-Agent LSOC Algorithms

Algorithm 1 gives the procedure of the distributed LSMDP algorithm introduced in Section 3.1.

  Algorithm 1 Distributed LSMDP based on factorial subsystems

 

1: agent set 𝒟\mathcal{D}, communication network 𝒢\mathcal{G}, initial time t0t_{0}, exit time tft_{f}, initial states xit0x_{i}^{t_{0}}, exit states xitfx_{i}^{t_{f}}, joint state-related costs qi(x¯i)q_{i}(\bar{x}_{i}), exit costs ϕ(xttf)\phi(x_{t}^{t_{f}}), weights on exit costs wjiw_{j}^{i}, and error bound ϵ\epsilon.
2:factorial subsystems 𝒩¯i𝒟\mathcal{\bar{N}}_{i\in\mathcal{D}}, and joint exit costs ϕi(x¯i)\phi_{i}(\bar{x}_{i}).
3: 
4:for i𝒟={1,,N}i\in\mathcal{D}=\{1,\cdots,N\} do
5:/*Calculate joint desirability Zi()Z_{i}(\cdot) for subsystem 𝒩¯i\mathcal{\bar{N}}_{i}*/
6:    Compute coefficients Θ\Theta, Ω\Omega and ZZ_{\mathcal{B}} in (20) for subsystem 𝒩¯i\bar{\mathcal{N}}_{i}. For distributed planning (22), partition the coefficients [IΘ,ΩZ]j[I-\Theta,\Omega Z_{\mathcal{B}}]_{j} and calculate the projection matrices PjP_{j} for j𝒩¯ij\in\mathcal{\bar{N}}_{i}.
7:   while  Z(n+1)Z(n)>ϵ\|Z_{\mathcal{I}}^{(n+1)}-Z_{\mathcal{I}}^{(n)}\|>\epsilon do
8:       Update desirability ZZ_{\mathcal{I}} with (20). For distributed planning, exchange local solutions Z,j(n)Z^{(n)}_{\mathcal{I},j} with neighboring agents k𝒩j𝒩¯ik\in\mathcal{N}_{j}\cap\bar{\mathcal{N}}_{i} and update desirability Z(n+1)Z_{\mathcal{I}}^{(n+1)} with (22).
9:   end while
10:/*Calculate control distribution for agent ii*/
11:   Compute the joint optimal control distribution u¯i(|x¯i)\bar{u}_{i}^{*}(\cdot|\bar{x}_{i}) of subsystem 𝒩¯i\mathcal{\bar{N}}_{i} by (19).
12:   Derive the local optimal control distribution ui(|x¯i)u_{i}^{*}(\cdot|\bar{x}_{i}) for agent ii by marginalizing u¯i(|x¯i)\bar{u}_{i}^{*}(\cdot|\bar{x}_{i}).
13:end for
14:
15:while t<tft<t_{f} or xx\notin\mathcal{B} do
16:   for i𝒟={1,,N}i\in\mathcal{D}=\{1,\cdots,N\} do
17:      Measure joint state x¯i(t)\bar{x}_{i}(t) by collecting state information from neighboring agents j𝒩ij\in\mathcal{N}_{i}.
18:      Sample control action or posterior state xix^{\prime}_{i} from ui(|x¯i)u_{i}^{*}(\cdot|\bar{x}_{i}).
19:   end for
20:end while

 

Algorithm 2 illustrates the procedure of the sampling-based distributed LSOC algorithm introduced in Section 3.2.

  Algorithm 2 Distributed LSOC based on sampling estimator

 

1: agent set 𝒟\mathcal{D}, communication network 𝒢\mathcal{G}, initial time t0t_{0}, exit time tft_{f}, initial states xit0x_{i}^{t_{0}}, exit states xitfx_{i}^{t_{f}}, joint state-related costs qi(x¯i)q_{i}(\bar{x}_{i}), control weight matrices R¯i\bar{R}_{i}, exit costs ϕ(xttf)\phi(x_{t}^{t_{f}}), and weights on exit costs wjiw_{j}^{i}
2:factorial subsystems 𝒩¯i𝒟\mathcal{\bar{N}}_{i\in\mathcal{D}}, and joint exit costs ϕi(x¯i)\phi_{i}(\bar{x}_{i}).
3: 
4:while t<tft<t_{f} or xx\notin\mathcal{B} do
5:   for i𝒟={1,,N}i\in\mathcal{D}=\{1,\cdots,N\} do
6:      Measure joint state x¯i(t)\bar{x}_{i}(t) by collecting state information from neighboring agents j𝒩ij\in\mathcal{N}_{i}.
7:       Generate uncontrolled trajectory set 𝒴i\mathcal{Y}_{i} by sampling or collecting data from neighboring agents.
8:       Evaluate generalized path value S~iε,λi(x¯i(0),¯i[y],t0)\tilde{S}_{i}^{\varepsilon,\lambda_{i}}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i},t_{0}) and initial control u~i(x¯i(0),¯i[y],t0)\tilde{u}_{i}(\bar{x}_{i}^{(0)},\bar{\ell}^{[y]}_{i},t_{0}) of each sample trajectory (x¯i(0),¯i[y])(\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]}) in 𝒴i\mathcal{Y}_{i} by (31) and (34).
9:       Approximate the optimal path distribution p~i(¯i[y]|x¯i(0),t0)\tilde{p}^{*}_{i}(\bar{\ell}_{i}^{[y]}|\bar{x}_{i}^{(0)},t_{0}) and joint optimal control action u¯i(x¯i,t)\bar{u}^{*}_{i}(\bar{x}_{i},t) by (35), (36) or other sampling techniques.
10:      Select and execute local control action ui(x¯i,t)u_{i}^{*}(\bar{x}_{i},t) from joint optimal control action u¯i(x¯i,t)\bar{u}^{*}_{i}(\bar{x}_{i},t).
11:   end for
12:end while

 

Algorithm 3 illustrates the procedure of the REPS-based distributed LSOC algorithm introduced in Section 3.2.

  Algorithm 3 Distributed LSOC based on REPS

 

1: agent set 𝒟\mathcal{D}, communication network 𝒢\mathcal{G}, initial time t0t_{0}, exit time tft_{f}, initial states xit0x_{i}^{t_{0}}, exit states xitfx_{i}^{t_{f}}, joint state-related costs qi(x¯i)q_{i}(\bar{x}_{i}), control weight matrices R¯i\bar{R}_{i}, exit costs ϕ(xttf)\phi(x_{t}^{t_{f}}), weights on exit costs wjiw_{j}^{i}, and initial policy πi(k)(u¯i(k)|x¯i(k),χi(k))\pi^{(k)}_{i}(\bar{u}^{(k)}_{i}|\bar{x}_{i}^{(k)},\chi^{(k)}_{i}).
2:factorial subsystems 𝒩¯i𝒟\mathcal{\bar{N}}_{i\in\mathcal{D}}, and joint exit costs ϕi(x¯i)\phi_{i}(\bar{x}_{i}).
3: 
4:while t<tft<t_{f} or xx\notin\mathcal{B} do
5:   for i𝒟={1,,N}i\in\mathcal{D}=\{1,\cdots,N\} do
6:      Measure joint state x¯i(t)\bar{x}_{i}(t) by collecting state information from neighboring agents j𝒩ij\in\mathcal{N}_{i}.
7:      repeat
8:          Generate trajectory set 𝒴i\mathcal{Y}_{i} by sampling with (initial) policy πi(k)(ui(k)|x¯i(k),χi(k))\pi^{(k)}_{i}(u^{(k)}_{i}|\bar{x}^{(k)}_{i},\chi^{(k)}_{i}) or collecting data from neighboring agents j𝒩ij\in\mathcal{N}_{i}.
9:         Solve dual variables κ,θ\kappa,\theta and η\eta from dual problem (39) and condition (83).
10:         Compute path distribution q~(x¯i(0),¯i[y])\tilde{q}(\bar{x}_{i}^{(0)},\bar{\ell}_{i}^{[y]}) of each trajectory in 𝒴i\mathcal{Y}_{i} by (84).
11:         Update parameter \chi^{(k)}_{i} by solving the weighted maximum likelihood problem (85).
12:      until convergence of parametric policy πi(k)(u¯i(k)|x¯i(k),χi(k))\pi^{(k)}_{i}(\bar{u}^{(k)}_{i}|\bar{x}_{i}^{(k)},\chi^{(k)}_{i}).
13:      Marginalize joint optimal control policy πi(0)(u¯i(0)|x¯i(0),χi(0))\pi^{(0)}_{i}(\bar{u}^{(0)}_{i}|\bar{x}_{i}^{(0)},\chi^{(0)}_{i}) by (42).
14:       Sample and execute local control action ui(x¯i,t)u^{*}_{i}(\bar{x}_{i},t) from local optimal control policy πi(0)(ui(0)|x¯i(0),χi(0))\pi_{i}^{*(0)}(u^{*(0)}_{i}|\bar{x}_{i}^{(0)},\chi_{i}^{*(0)}).
15:   end for
16:end while

 

References

  • [1] R. Mahony, V. Kumar, and P. Corke, “Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor,” IEEE Robotics & Automation Magazine, vol. 19, no. 3, pp. 20–32, 2012.
  • [2] V. Cichella, R. Choe, S. B. Mehdi, E. Xargay, N. Hovakimyan, V. Dobrokhodov, I. Kaminer, A. M. Pascoal, and A. P. Aguiar, “Safe coordinated maneuvering of teams of multirotor unmanned aerial vehicles: A cooperative control framework for multivehicle, time-critical missions,” IEEE Control Systems Magazine, vol. 36, no. 4, pp. 59–82, 2016.
  • [3] Y. Ota, H. Taniguchi, T. Nakajima, K. M. Liyanage, J. Baba, and A. Yokoyama, “Autonomous distributed V2G (Vehicle-to-Grid) satisfying scheduled charging,” IEEE Transactions on Smart Grid, vol. 3, no. 1, pp. 559–564, 2012.
  • [4] F. Blaabjerg, R. Teodorescu, M. Liserre, and A. V. Timbus, “Overview of control and grid synchronization for distributed power generation systems,” IEEE Transactions on Industrial Electronics, vol. 53, no. 5, pp. 1398–1409, 2006.
  • [5] J. M. Guerrero, M. Chandorkar, T. L. Lee, and P. C. Loh, “Advanced control architectures for intelligent microgrids — Part I: Decentralized and hierarchical control,” IEEE Transactions on Industrial Electronics, vol. 60, no. 4, pp. 1254–1262, 2013.
  • [6] F. Dorfler, J. W. Simpson-Porco, and F. Bullo, “Breaking the hierarchy: Distributed control and economic optimality in microgrids,” IEEE Transactions on Control of Network Systems, vol. 3, no. 3, pp. 241–253, 2016.
  • [7] P. Leitao, “Agent-based distributed manufacturing control: A state-of-the-art survey,” Engineering Applications of Artificial Intelligence, vol. 22, no. 7, pp. 979–991, 2009.
  • [8] C. Amato, G. Chowdhary, A. Geramifard, N. K. Ure, and M. J. Kochenderfer, “Decentralized control of partially observable Markov decision processes,” in 52nd IEEE Conference on Decision and Control, Florence, Italy, 2013.
  • [9] A. Bensoussan, J. Frehse, and P. Yam, Mean Field Games and Mean Field Type Control Theory.   Springer, 2013.
  • [10] Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progress in the study of distributed multi-agent coordination,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 427–438, 2013.
  • [11] F. L. Lewis, H. Zhang, K. Hengster-Movric, and A. Das, Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches.   Springer, 2013.
  • [12] K. K. Oh, M. C. Park, and H. S. Ahn, “A survey of multi-agent formation control,” Automatica, vol. 53, pp. 424–440, 2015.
  • [13] J. Qin, Q. Ma, Y. Shi, and L. Wang, “Recent advances in consensus of multi-agent systems: A brief survey,” IEEE Transactions on Industrial Electronics, vol. 64, no. 6, pp. 4972–4983, 2017.
  • [14] K. Zhang, Z. Yang, and T. Basar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” arXiv, no. 1911.10635, 2019.
  • [15] W. H. Fleming, “Exit probabilities and optimal stochastic control,” Applied Mathematics and Optimization, vol. 4, no. 1, pp. 329–346, 1977.
  • [16] E. Todorov, “Linearly-solvable Markov decision problems,” in Advances in Neural Information Processing Systems, Vancouver, Canada, 2007.
  • [17] Y. Pan, E. A. Theodorou, and M. Kontitsi, “Sample efficient path integral control under uncertainty,” in Advances in Neural Information Processing Systems, Montreal, Canada, 2015.
  • [18] E. Todorov, “Compositionality of optimal control laws,” in Advances in Neural Information Processing Systems, Vancouver, Canada, 2009.
  • [19] A. Kupcsik, M. P. Deisenroth, J. Peters, L. A. Poh, P. Vadakkepat, and G. Neumann, “Model-based contextual policy search for data-efficient generalization of robot skills,” Artificial Intelligence, vol. 247, p. 415–439, 2017.
  • [20] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou, “Information-theoretic model predictive control: Theory and applications to autonomous driving,” IEEE Transactions on Robotics, vol. 34, no. 6, pp. 1603–1622, 2018.
  • [21] J. Peters, K. Mulling, and Y. Altun, “Relative entropy policy search,” in 24th AAAI Conference on Artificial Intelligence, Atlanta, USA,, 2010.
  • [22] E. A. Theodorou, “A generalized path integral control approach to reinforcement learning,” Journal of Machine Learning Research, no. 11, pp. 3137–3181, 2010.
  • [23] V. Gomez, H. J. Kappen, J. Peters, and G. Neumann, “Policy search for path integral control,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Dublin, Ireland, 2014.
  • [24] P. Guan, M. Raginsky, and R. M. Willett, “Online Markov decision processes with Kullback–Leibler control cost,” IEEE Transactions on Automatic Control, vol. 59, no. 6, pp. 1423–1438, 2014.
  • [25] G. Williams, A. Aldrich, and E. A. Theodorou, “Model predictive path integral control: From theory to parallel computation,” Journal of Guidance, Control, and Dynamics, vol. 40, no. 2, pp. 344–357, 2017.
  • [26] C. Guestrin, D. Koller, and R. Parr, “Multiagent planning with factored MDPs,” in Advances in Neural Information Processing Systems, Vancouver, Canada, 2002.
  • [27] R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman, “Solving transition independent decentralized Markov decision processes,” Journal of Artificial Intelligence Research, vol. 22, pp. 423–455, 2004.
  • [28] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” Proceedings of Machine Learning Research, vol. 80, pp. 5872–5881, 2018.
  • [29] K. Zhang, Z. Yang, and T. Basar, “Decentralized multi-agent reinforcement learning with networked agents: Recent advances,” arXiv, no. 1912.03821, 2019.
  • [30] B. C. Daniel, “Cross-entropy method for Kullback-Leibler control in multi-agent systems,” Master’s thesis, Universitat Pompeu Fabra, 2017.
  • [31] B. van den Broek, W. Wiegerinck, and B. Kappen, “Graphical model inference in optimal control of stochastic multi-agent systems,” Journal of Artificial Intelligence Research, vol. 32, pp. 95–122, 2008.
  • [32] R. P. Anderson and D. Milutinovic, “Stochastic optimal enhancement of distributed formation control using Kalman smoothers,” Robotica, vol. 32, pp. 305–324, 2014.
  • [33] N. Wan, A. Gahlawat, N. Hovakimyan, E. A. Theodorou, and P. G. Voulgaris, “Cooperative path integral control for stochastic multi-agent systems,” arXiv, no. 2009.14775, 2020.
  • [34] R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   MIT Press, 2018.
  • [35] R. Olfati-Saber, J. A. Fax, and R. M. Murray, “Consensus and cooperation in networked multi-agent systems,” Proceedings of the IEEE, vol. 95, no. 1, pp. 215–233, 2007.
  • [36] E. Todorov, “Eigenfunction approximation methods for linearly-solvable optimal control problems,” in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Nashville, USA, 2009.
  • [37] H. J. Kappen, “Linear theory for control of nonlinear stochastic systems,” Physical Review Letters, vol. 95, p. 200201, 2005.
  • [38] E. Todorov, “Efficient computation of optimal actions,” Proceedings of the National Academy of Sciences, vol. 106, no. 28, pp. 11478–11483, 2009.
  • [39] R. Johari and J. N. Tsitsiklis, “Efficiency loss in a network resource allocation game,” Mathematics of Operations Research, vol. 29, no. 3, pp. 407–435, 2004.
  • [40] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
  • [41] P. G. Voulgaris and N. Elia, “Social optimization problems with decentralized and selfish optimal strategies,” in 56th IEEE Conference on Decision and Control, Melbourne, Australia, 2017.
  • [42] S. Mou, J. Liu, and A. S. Morse, “A distributed algorithm for solving a linear algebraic equation,” IEEE Transactions on Automatic Control, vol. 60, no. 11, pp. 2863–2878, 2015.
  • [43] J. Liu, S. Mou, and A. S. Morse, “Asynchronous distributed algorithms for solving linear algebraic equations,” IEEE Transactions on Automatic Control, vol. 63, no. 2, pp. 372–385, 2018.
  • [44] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
  • [45] K. Bakshi, D. Fan, and E. A. Theodorou, “Schrödinger approach to optimal control of large-size populations,” arXiv, no. 1810.06064, 2018.
  • [46] K. Bakshi, P. Grover, and E. A. Theodorou, “On mean field games for agents with Langevin dynamics,” IEEE Transactions on Control Systems Technology, 2019.
  • [47] P. Del Moral, Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications.   Springer, 2004.
  • [48] E. A. Theodorou, “Nonlinear stochastic control and information theoretic dualities: Connections, interdependencies and thermodynamic interpretations,” Entropy, vol. 17, no. 5, pp. 3352–3375, 2015.
  • [49] G. Boutselis, M. Pereira, and E. A. Theodorou, “Variational inference for stochastic control of infinite dimensional systems,” arXiv, no. 1809.03035, 2018.
  • [50] A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann, “Data-efficient generalization of robot skills with contextual policy search,” in AAAI Conference on Artificial Intelligence, Bellevue, USA, 2013.
  • [51] M. da Silva, F. Durand, and J. Popović, “Linear Bellman combination for control of character animation,” ACM Transactions on Graphics, vol. 28, no. 3, pp. 82:1–10, 2009.
  • [52] W. Yao, N. Qi, N. Wan, and Y. Liu, “An iterative strategy for task assignment and path planning of distributed multiple unmanned aerial vehicles,” Aerospace Science and Technology, vol. 86, pp. 455–464, 2019.
  • [53] L. Song, N. Wan, A. Gahlawat, N. Hovakimyan, and E. A. Theodorou, “Compositionality of linearly solvable optimal control in networked multi-agent systems,” arXiv, no. 2009.13609, 2020.
  • [54] J. F. Le Gall, Brownian Motion, Martingales, and Stochastic Calculus.   Springer, 2016.
  • [55] E. A. Theodorou, “Iterative path integral stochastic optimal control: Theory and applications to motor control,” Ph.D. dissertation, Dept. of Computer Sci., Univ. of Southern California, Los Angeles, California, USA, 2011.