

Karthikeya S Parunandi, Aayushman Sharma, Suman Chakravorty and Dileep Kalathil
Texas A&M University, College Station, TX 77843, USA
Emails: karthikeyasharma91@gmail.com, aayushmansharma@tamu.edu, s.chakrav@tamu.edu, dileep.kalathil@tamu.edu

D2C 2.0: Decoupled Data-Based Approach for Learning to Control Stochastic Nonlinear Systems via Model-Free ILQR

Karthikeya S Parunandi, Aayushman Sharma, Suman Chakravorty and Dileep Kalathil
Abstract

In this paper, we propose a structured linear parameterization of a feedback policy to solve the model-free stochastic optimal control problem. This parameterization is justified by a decoupling principle that is shown to yield near-optimality under a small-noise assumption, both in theory and in empirical analyses. Further, we incorporate a sample-efficient, model-free version of the Iterative Linear Quadratic Regulator (ILQR) into our framework. Simulations on systems over a range of complexities reveal that the resulting algorithm is able to harness the superior second-order convergence properties of ILQR. As a result, it is fast and scalable to a wide variety of higher dimensional systems. Comparisons are made with a state-of-the-art reinforcement learning algorithm, the Deep Deterministic Policy Gradient (DDPG) technique, to demonstrate the significant merits of our approach in terms of training efficiency.

Keywords: Machine Learning, Motion and Path Planning, Optimization and Optimal Control

1 INTRODUCTION

It is well known that a general solution to the stochastic optimal control problem is computationally intractable. Hence, most methods rely on approximate dynamic programming (see, e.g., [1]). When the system model is unknown, the problem is further complicated. This problem of learning to optimally control an unknown general nonlinear system has been well studied in the stochastic adaptive control literature [2] [3]. A very similar problem is also addressed in the field of reinforcement learning (RL), except that it specifically aims to solve a discrete-time stochastic control process (a Markov Decision Process), more often with an infinite horizon formulation [4]. RL has also attracted great attention with its recent success, especially in its sub-domain of deep RL [5] [6] [7].
Most RL methods can be divided into model-based and model-free approaches, based on whether or not a model is constructed before solving for an optimal policy. Model-based RL typically requires fewer samples, and the model can be either global or local. Global models are typically approximated with Gaussian Processes (GPs) or neural networks, and hence can be inaccurate in some portions of the state space. Local models (such as in [8][9]), in contrast, offer more sample-efficient and accurate solutions. Model-free RL, on the other hand, bypasses learning the system model and therefore requires substantially more training samples to train the policy [10]. While an overwhelming number of successful works in the recent past have relied on complex parameterizations, such as deep neural networks [7] [11][12] [13], they are challenged by non-determinism in reproducing the results and in benchmarking against existing methods [14][15]. Recent works such as [16] showed that a simple linear parameterization can be very effective and can result in performance competitive with the former methods. The current work, aimed at deriving a simple but structured, sample-efficient and thereby fast policy, falls under the local model-based regime, though we do not explicitly build the system model but rather estimate its gradients in the neighborhood of a nominal trajectory.

In the prequel of this work, called D2C [17], instead of solving for the optimal closed-loop policy via dynamic programming as in typical RL techniques, our approach searches for a computationally tractable policy that is nonetheless near-optimal to second order in a noise parameter $\epsilon$. This is done via decoupling of the open and the closed loops, wherein we first solve for an optimal open-loop trajectory, and then design an optimal linear feedback law to stabilize the linear perturbations about the nominal trajectory. The success of this approach was previously demonstrated in [18][19] for model-based problems. This way of decoupling results in a compact parameterization of the feedback law in terms of an open-loop control sequence and a linear feedback policy, and thereby constitutes a novel approach to the RL problem via the decoupled search. The contributions of this sequel paper are as follows:
i) We incorporate a data-efficient, and hence significantly faster, way to compute the gradients in the ILQR framework in a model-free fashion. This, in turn, is incorporated into our D2C framework to solve for the optimal open-loop trajectory. This is in contrast to the 'direct' gradient-descent based approach introduced in the earlier version of our algorithm (D2C 1.0).
ii) We make a comparison with a state-of-the-art reinforcement learning algorithm, DDPG [7], as well as with D2C 1.0, and show the significant merit of the current method, termed D2C 2.0, in terms of its computational efficiency.
iii) D2C 2.0 also constitutes a novel approach to solving complex robotic motion planning problems under uncertainty in a data-efficient, model-free fashion.

2 RELATED WORK

Differential Dynamic Programming (DDP) [20] is a class of iterative algorithms for trajectory optimization. One salient aspect of DDP is that it exhibits quadratic convergence for a general nonlinear optimal control problem and converges in a single step for linear systems. In spite of its origins in the 1960s, it gained popularity only in the last decade due to the success of the modified algorithm called ILQR [21]. Though DDP theoretically guarantees quadratic convergence, some of its terms involve computing second order derivatives of the system dynamics model. Since the dynamics models of most systems are multivariable vector-valued functions, whose second order derivatives are third order tensors, DDP in its original form was not effective for practical implementation. ILQR [22] [23] dropped these terms and introduced regularization schemes [24][25] to handle the absence of these computationally expensive terms. This resulted in faster and more stable convergence to the optimal solution.

One advantage of ILQR with respect to a model-free implementation is that its equations are explicit in the system model and its first order derivatives. Earlier works such as [25] and [26] employed finite differencing to compute the Jacobians in ILQR. Typically, a forward Euler scheme is chosen to independently determine each element (i.e., gradient) of a Jacobian matrix. [9] presented ILQR-LD, which learns the dynamics through a Locally Weighted Projection Regression (LWPR) approximation of the model and then obtains the corresponding analytical Jacobian. Their work also demonstrated better computational efficiency than finite differences.

Spall introduced the ‘Simultaneous Perturbation Stochastic Approximation’ (SPSA) [27] method that evaluates the function only twice to calculate the Jacobian of the cost function. In this paper, a similar formulation is derived to compute the Jacobians of the system model online, in a model-free fashion, through the least squares method in a central-difference scheme.

The rest of the paper is organized as follows. Section 3 introduces the terminology and provides the problem statement. In section 4, we present a near-optimal decoupling principle that forms the basis of our algorithm. Section 5 describes our algorithm, D2C 2.0, in detail. In Sections 6 and 7, we provide results and a discussion of the application of D2C 2.0 to several benchmark examples, along with comparisons to DDPG and the D2C technique.

3 PROBLEM FORMULATION

Let $\mathbf{x}_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$ represent the state of a system and $\mathbf{u}_t \in \mathcal{U} \subseteq \mathbb{R}^{n_u}$ be the control signal at time $t$, respectively. Let $f: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \rightarrow \mathbb{R}^{n_x}$ denote the deterministic state transition model of the system. Let $\pi_t: \mathbb{R}^{n_x} \rightarrow \mathbb{R}^{n_u}$ be the policy to be applied to the system. Let $c_t: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \rightarrow \mathbb{R}$ be the time-indexed incremental cost function for all $t \in \{0, 1, \cdots, N-1\}$ and $c_N: \mathbb{R}^{n_x} \rightarrow \mathbb{R}$ be the terminal cost function. Let $J_t^{\pi_t}: \mathbb{R}^{n_x} \rightarrow \mathbb{R}$ represent the cost-to-go function under the policy $\pi_t$ and $Q_t^{\pi_t}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \rightarrow \mathbb{R}$ be the corresponding action-value function, both at time $t$. Finally, let us denote the derivatives of a variable at time $t$ w.r.t. the state $\mathbf{x}_t$ or the control $\mathbf{u}_t$ by placing the respective variable in the subscript. For example, $J_{\mathbf{x}_t} = \frac{\partial J_t}{\partial \mathbf{x}_t}\big|_{\bar{\mathbf{x}}_t}$. Unless stated otherwise, derivatives are evaluated at the nominal state or control at time $t$.

Now, let us define the notion of process noise in the system through an important parameter called the 'noise scaling factor', denoted by $\epsilon$. We assume that the noise is additive, white, and zero-mean Gaussian. Hence, the state evolution equation of our stochastic system is represented as $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t) + \epsilon\omega_t$, where $\epsilon < 1$ and $\omega_t$ is zero-mean Gaussian distributed.

Let us define the notion of a 'nominal' trajectory, given a policy $\pi_t(\cdot)$, as follows: the nominal control actions of the policy are the control actions output by the policy when all the noise variables $\omega_t = 0$. Let us represent the nominal trajectory variables with bars over them, i.e., $\bar{\mathbf{x}}_t$ and $\bar{\mathbf{u}}_t$, both at time $t$. So, the nominal trajectory is given by $\mathbb{T}_{nom} = \{\bar{\mathbf{x}}_{0:N}, \bar{\mathbf{u}}_{0:N-1}\}$, where $N$ is the time horizon, $\bar{\mathbf{x}}_{t+1} = f(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$, and $\bar{\mathbf{u}}_t = \pi_t(\bar{\mathbf{x}}_t)$.

Using the above definitions, the problem of stochastic optimal control that we consider is to find the set of time-varying optimal control policies $\pi = \{\pi_1, \pi_2, \cdots, \pi_{N-1}\}$ (subscripts denote the time indices) that minimize the finite-horizon total expected cumulative cost, i.e.,

$$\tilde{J}^{\pi} = \mathbb{E}_{\pi}[J^{\pi}] = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{N-1} c_t(\mathbf{x}_t, \pi_t(\mathbf{x}_t)) + c_N(\mathbf{x}_N)\Big], \quad (1)$$

such that the system model, i.e., $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \pi_t(\mathbf{x}_t)) + \epsilon\omega_t$, the relevant constraints and the boundary conditions are satisfied. Since the current approach deals with the model-free problem, the system model and its constraints are assumed to be implicitly satisfied in the data that the system/agent gathers from its environment (through a simulator or a real-world experiment).
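To make the setup concrete, the following is a minimal Python sketch (ours, not from the original text) of how the cost in (1) can be evaluated for one noise realization, assuming black-box access to a simulator step `f`, a feedback policy `policy` and cost functions `c` and `c_N`; the expected cost $\tilde{J}^{\pi}$ can then be approximated by averaging over many such rollouts:

```python
import numpy as np

def rollout_cost(f, policy, c, c_N, x0, N, eps=0.0, rng=None):
    """One-sample Monte Carlo evaluation of the finite-horizon cost (1).

    f(x, u)      : black-box state transition (e.g. one simulator step),
    policy(t, x) : feedback policy pi_t,
    c(t, x, u)   : incremental cost, c_N(x): terminal cost,
    eps          : noise scaling factor epsilon.
    """
    rng = rng or np.random.default_rng()
    x, J = np.asarray(x0, dtype=float), 0.0
    for t in range(N):
        u = policy(t, x)
        J += c(t, x, u)
        # Additive white Gaussian process noise; eps = 0 recovers the nominal trajectory.
        x = f(x, u) + eps * rng.standard_normal(x.shape)
    return J + c_N(x)
```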

4 A NEAR-OPTIMAL DECOUPLING PRINCIPLE

Note: The results provided in this section are restated from our prequel paper on D2C, primarily for the reader's convenience and for the sake of completeness.
In this section, we describe a near-optimal decoupling principle which provides the theoretical foundation for our D2C technique.
Assuming that $f(\cdot)$ and $\pi_t(\cdot)$ are sufficiently smooth, let us describe the linearized dynamics about the nominal trajectory $\mathbb{T}_{nom}$ corresponding to the policy $\pi_t(\cdot)$ as follows. The state and control perturbations, given by $\delta\mathbf{x}_t = \mathbf{x}_t - \bar{\mathbf{x}}_t$ and $\delta\mathbf{u}_t = \mathbf{u}_t - \bar{\mathbf{u}}_t$ respectively, evolve as follows:

$$\delta\mathbf{x}_{t+1} = A_t\,\delta\mathbf{x}_t + B_t\,\delta\mathbf{u}_t + S_t(\delta\mathbf{x}_t) + R_t(\delta\mathbf{u}_t) + \epsilon\omega_t, \quad (2)$$
$$\delta\mathbf{u}_t = K_t\,\delta\mathbf{x}_t + \tilde{S}_t(\delta\mathbf{x}_t), \quad (3)$$

where $A_t = \frac{\partial f}{\partial\mathbf{x}}\big|_{(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)}$, $B_t = \frac{\partial f}{\partial\mathbf{u}}\big|_{(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)}$, $K_t = \frac{\partial\pi_t(\mathbf{x}_t)}{\partial\mathbf{x}}\big|_{\bar{\mathbf{x}}_t}$, and $S_t(\cdot)$, $\tilde{S}_t(\cdot)$, $R_t(\cdot)$ are the higher order terms in the respective Taylor expansions. By substituting (3) in (2), we have

$$\delta\mathbf{x}_{t+1} = \underbrace{(A_t + B_t K_t)}_{\bar{A}_t}\delta\mathbf{x}_t + \bar{S}_t(\delta\mathbf{x}_t) + \epsilon\omega_t, \quad (4)$$

where $\bar{S}_t(\delta\mathbf{x}_t) = B_t\tilde{S}_t(\delta\mathbf{x}_t) + S_t(\delta\mathbf{x}_t) + R_t(K_t\delta\mathbf{x}_t + \tilde{S}_t(\delta\mathbf{x}_t))$. Now, similarly expanding the cost function $c_t(\mathbf{x}_t, \mathbf{u}_t)$ under the policy $\pi_t(\mathbf{x}_t)$, $\forall t \in \{0, 1, \cdots, N-1\}$, about the nominal trajectory $\mathbb{T}_{nom}$ as $c_t(\mathbf{x}_t, \pi_t(\mathbf{x}_t)) = \bar{c}_t + \bar{C}_t\delta\mathbf{x}_t + \bar{H}_t(\delta\mathbf{x}_t)$, where $\bar{c}_t = c_t(\bar{\mathbf{x}}_t, \pi_t(\bar{\mathbf{x}}_t))$ and $\bar{C}_t = \frac{\partial c_t(\mathbf{x}_t, \pi_t(\mathbf{x}_t))}{\partial\mathbf{x}_t}\big|_{\bar{\mathbf{x}}_t}$, results in

$$J^{\pi} = \sum_{t=0}^{N}\bar{c}_t + \sum_{t=0}^{N}\bar{C}_t\delta\mathbf{x}_t + \sum_{t=0}^{N}\bar{H}_t(\delta\mathbf{x}_t). \quad (5)$$

Lemma 1: The state perturbation equation (4) can be equivalently separated into $\delta\mathbf{x}_t = \delta\mathbf{x}_t^l + \bar{S}_t$ and $\delta\mathbf{x}_{t+1}^l = \bar{A}_t\delta\mathbf{x}_t^l + \epsilon\omega_t$, where $\bar{S}_t$ is $O(\epsilon^2)$.
Proof: Proof is provided in the supplementary file.

From (5) and Lemma 1, we obtain

$$J^{\pi} = \underbrace{\sum_{t=0}^{N}\bar{c}_t}_{\bar{J}^{\pi}} + \underbrace{\sum_{t=0}^{N}\bar{C}_t\delta\mathbf{x}_t^l}_{\delta J_1^{\pi}} + \underbrace{\sum_{t=0}^{N}\bar{H}_t(\delta\mathbf{x}_t) + \bar{C}_t\bar{S}_t}_{\delta J_2^{\pi}}. \quad (6)$$

Proposition 1: Given a feedback policy $\pi$ satisfying the smoothness requirements,

$$\tilde{J}^{\pi} = \mathbb{E}[J^{\pi}] = \bar{J}^{\pi} + O(\epsilon^2), \quad (7)$$
$$Var(J^{\pi}) = \underbrace{Var(\delta J_1^{\pi})}_{O(\epsilon^2)} + O(\epsilon^4). \quad (8)$$

Proof: The proof is provided in the supplementary material. Noting that $\mathbf{u}_t = \pi_t(\mathbf{x}_t) = \bar{\mathbf{u}}_t + K_t\delta\mathbf{x}_t + \tilde{S}_t(\delta\mathbf{x}_t)$, the above proposition entails the following observations:
i) Observation 1: The cost-to-go along the nominal trajectory $\mathbb{T}_{nom} = \{\bar{\mathbf{x}}_{0:N}, \bar{\mathbf{u}}_{0:N-1}\}$, given by $\bar{J}^{\pi}$, is within $O(\epsilon^2)$ of the expected cost-to-go $\tilde{J}^{\pi}$ of the policy $\pi$.
ii) Observation 2: Given the nominal control sequence $\bar{\mathbf{u}}_{0:N-1}$, the variance of the cost-to-go is overwhelmingly determined by the linear feedback law $K_t\delta\mathbf{x}_t$, i.e., by the variance of the linear perturbation of the cost-to-go, $\delta J_1^{\pi}$, under the linear closed-loop system dynamics $\delta\mathbf{x}_{t+1}^l = \bar{A}_t\delta\mathbf{x}_t^l + \epsilon\omega_t$.
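As a simple illustration of Proposition 1 (ours, not part of the original development), consider the linear-quadratic special case, where all the higher order terms vanish and the $O(\epsilon^2)$ gap can be computed exactly:

$$\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t + \epsilon\omega_t, \quad \mathbf{u}_t = \bar{\mathbf{u}}_t + K\delta\mathbf{x}_t, \quad \omega_t \sim \mathcal{N}(0, W).$$

Here the perturbation covariance obeys $\Sigma_{t+1} = (A + BK)\Sigma_t(A + BK)^T + \epsilon^2 W$ with $\Sigma_0 = 0$, so $\Sigma_t = O(\epsilon^2)$, and for a quadratic cost $\sum_t (\mathbf{x}_t^T Q\mathbf{x}_t + \mathbf{u}_t^T R\mathbf{u}_t)$,

$$\mathbb{E}[J^{\pi}] = \bar{J}^{\pi} + \sum_t \mathrm{tr}\big((Q + K^T R K)\,\Sigma_t\big) = \bar{J}^{\pi} + O(\epsilon^2),$$

in agreement with (7).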

Proposition 1 and the above observations suggest that an optimal open-loop control sequence with a suitable linear feedback law (for the perturbed linear system) wrapped around it is approximately optimal. The following subsection summarizes the problems to be solved for each of them.

4.1 Decoupled Policy for Feedback Control

Open-Loop Trajectory Design: The open-loop trajectory is designed by solving the noiseless equivalent of the stochastic optimal control problem (1), as shown below:

$$\underset{\mathbb{T}_{nom}}{\textnormal{min}}~\sum_{t=0}^{N-1} c_t(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t) + c_N(\bar{\mathbf{x}}_N),$$
$$\textnormal{s.t.}~\bar{\mathbf{x}}_{t+1} = f(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t).$$

Feedback Law: A linear feedback law is determined by solving for the optimal feedback gains $K^*_{1:N-1}$ that minimize the variance of the cost corresponding to the linear perturbation system around the optimal nominal trajectory $\mathbb{T}^*_{nom}$ found above, as follows:

$$\underset{\{K_{1:N-1}\}}{\textnormal{min}}~Var(\delta J_1^{\pi}),$$
$$\delta J_1^{\pi} = \sum_{t=1}^{N}\bar{C}_t\delta\mathbf{x}_t^l,$$
$$\delta\mathbf{x}_{t+1}^l = (A_t + B_t K_t)\delta\mathbf{x}_t^l + \epsilon\omega_t.$$

Albeit convex, the above problem does not have a standard solution. However, since we are really only interested in a good variance rather than the optimal variance, we can solve an inexpensive time-varying LQR problem as a surrogate to reduce the variance of the cost-to-go. The corresponding problem can be posed as follows:

$$\underset{\delta\mathbf{u}_{1:N}}{\textnormal{min}}~\mathbb{E}\Big[\sum_{t=1}^{N-1}\delta\mathbf{x}_t^T Q_t\,\delta\mathbf{x}_t + \delta\mathbf{u}_t^T R_t\,\delta\mathbf{u}_t + \delta\mathbf{x}_N^T Q_N\,\delta\mathbf{x}_N\Big],$$
$$\textnormal{s.t.}~\delta\mathbf{x}_{t+1} = A_t\delta\mathbf{x}_t + B_t\delta\mathbf{u}_t + \epsilon\omega_t.$$
Remark 4.1.

In fact, if the feedback gain is designed carefully rather than via the heuristic LQR above, then one can obtain $O(\epsilon^4)$ near-optimality. This requires some further development; however, we leave this result out of this paper due to the paucity of space.

5 ALGORITHM

This section presents our decoupled data-based control (D2C-2.0) algorithm. There are two main steps involved in solving for the requisite policy:

  1. Design an open-loop nominal trajectory by solving the optimization problem via a model-free ILQR based approach, i.e., without explicitly making use of an analytic model of the system. This is described in subsection 5.1. This is the primary difference from the prequel D2C, where we used a first order gradient based technique to solve the optimal open-loop control problem.

  2. Determine the parameters of the time-varying linear perturbation system about its optimal nominal trajectory and design an LQR controller based on the system identification. This step is detailed in subsection 5.2.

5.1 Open-Loop Trajectory Design

An open-loop trajectory is designed by solving the corresponding problem in subsection 4.1. However, since we are solving it in a model-free manner, we need a suitable formulation that can efficiently compute the solution.
In this subsection, we present an ILQR based model-free method to solve the open-loop optimization problem. The advantage of ILQR is that its equations are explicit in the system dynamics and their gradients. Hence, in order to make it a model-free algorithm, it suffices to obtain estimates of the Jacobians explicitly. A sample-efficient way of doing this is described in subsection 5.1.1. Since ILQR/DDP is a well-established framework, we skip the details and instead present the essential equations in Algorithms 1, 2 and 3, where we also reflect the choice of our regularization scheme.

5.1.1 Estimation of Jacobians in a Model-Free Setting

From the Taylor expansions of $f$ about the nominal trajectory $(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$ on both the positive and the negative sides, we obtain the following central difference equation:

$$f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t) = 2\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix}\begin{bmatrix} \delta\mathbf{x}_t \\ \delta\mathbf{u}_t \end{bmatrix} + O(\|\delta\mathbf{x}_t\|^3 + \|\delta\mathbf{u}_t\|^3) \quad (9)$$

Multiplying both sides of the above equation by $\begin{bmatrix} \delta\mathbf{x}_t^T & \delta\mathbf{u}_t^T \end{bmatrix}$:

$$\big[f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t)\big]\begin{bmatrix} \delta\mathbf{x}_t^T & \delta\mathbf{u}_t^T \end{bmatrix} = 2\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix}\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix} + O(\|\delta\mathbf{x}_t\|^4 + \|\delta\mathbf{u}_t\|^4)$$

Assuming that $\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix}$ is invertible (which will later be shown to hold under some assumptions), let us perform the inversion on both sides of the above equation as follows:

$$\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix} = \frac{1}{2}\big[f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t)\big]\begin{bmatrix} \delta\mathbf{x}_t^T & \delta\mathbf{u}_t^T \end{bmatrix}\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix}^{-1} + O(\|\delta\mathbf{x}_t\|^2 + \|\delta\mathbf{u}_t\|^2) \quad (10)$$

Equation (10) solves for the Jacobians $f_{\mathbf{x}_t}$ and $f_{\mathbf{u}_t}$ simultaneously. Note that the above formulation requires only 2 evaluations of $f(\cdot)$, given the nominal state and control $(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$. The remaining terms (i.e., the error in the estimate) are of an order that is quadratic in $\delta\mathbf{x}_t$ and $\delta\mathbf{u}_t$. The following extends equation (10) for use in practical implementations (sampling).
We are free to choose the distribution of $\delta\mathbf{x}_t$ and $\delta\mathbf{u}_t$. We assume both are i.i.d. Gaussian random variables with zero mean and a standard deviation of $\sigma$. This ensures that $\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix}$ is invertible. More on the advantage of using this distribution is elaborated in the next paragraph. Let $n_s$ be the number of samples of each of the random variables $\delta\mathbf{x}_t$ and $\delta\mathbf{u}_t$, stacked as $\delta\mathbf{X}_t = \begin{bmatrix} \delta\mathbf{x}_t^1 & \delta\mathbf{x}_t^2 & \ldots & \delta\mathbf{x}_t^{n_s} \end{bmatrix}$ and $\delta\mathbf{U}_t = \begin{bmatrix} \delta\mathbf{u}_t^1 & \delta\mathbf{u}_t^2 & \ldots & \delta\mathbf{u}_t^{n_s} \end{bmatrix}$, respectively. Then $\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix}$ is given by the following:

$$\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix} = \frac{1}{2}\begin{bmatrix} f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t^1, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t^1) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t^1, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t^1) & \cdots & f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t^{n_s}, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t^{n_s}) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t^{n_s}, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t^{n_s}) \end{bmatrix}\begin{bmatrix} \delta\mathbf{X}_t^T & \delta\mathbf{U}_t^T \end{bmatrix}\begin{bmatrix} \delta\mathbf{X}_t\delta\mathbf{X}_t^T & \delta\mathbf{X}_t\delta\mathbf{U}_t^T \\ \delta\mathbf{U}_t\delta\mathbf{X}_t^T & \delta\mathbf{U}_t\delta\mathbf{U}_t^T \end{bmatrix}^{-1}, \quad (11)$$

where the central differences are stacked column-wise and the factor $\frac{1}{2}$ carries over from equation (10).

Let us consider the terms in the matrix $\delta\mathbf{XU}_t = \begin{bmatrix} \delta\mathbf{X}_t\delta\mathbf{X}_t^T & \delta\mathbf{X}_t\delta\mathbf{U}_t^T \\ \delta\mathbf{U}_t\delta\mathbf{X}_t^T & \delta\mathbf{U}_t\delta\mathbf{U}_t^T \end{bmatrix}$. Note that $\delta\mathbf{X}_t\delta\mathbf{X}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{x}_t^i}^T$, and similarly $\delta\mathbf{U}_t\delta\mathbf{U}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{u}_t^i}^T$, $\delta\mathbf{U}_t\delta\mathbf{X}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{x}_t^i}^T$ and $\delta\mathbf{X}_t\delta\mathbf{U}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{u}_t^i}^T$. From the definition of the sample variance, we can write the above matrix as

$$\delta\mathbf{XU}_t = \begin{bmatrix} \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{x}_t^i}^T & \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{u}_t^i}^T \\ \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{x}_t^i}^T & \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{u}_t^i}^T \end{bmatrix} \approx \begin{bmatrix} \sigma^2(n_s-1)\,\text{I}_{n_x} & 0_{n_x\times n_u} \\ 0_{n_u\times n_x} & \sigma^2(n_s-1)\,\text{I}_{n_u} \end{bmatrix} = \sigma^2(n_s-1)\,\text{I}_{(n_x+n_u)\times(n_x+n_u)}$$

Given a sufficiently large number of samples $n_s$ (typically, slightly more than $n_x + n_u$ is sufficient), the above approximation holds well. Since an identity matrix is trivially invertible, the matrix in equation (11) is invertible. Thus, one can calculate $f_{\mathbf{x}}$ and $f_{\mathbf{u}}$ this way during the backward pass. For convenience in the later sections, let us refer to this method as 'Linear Least Squares by Central Difference (LLS-CD)'. The entire algorithm to determine the optimal nominal trajectory in a model-free fashion is summarized in Algorithms 1, 2 and 3.
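For concreteness, the following is a minimal Python/NumPy sketch (ours, not from the original text) of the LLS-CD estimate in equation (11); `step_fn(x, u)` stands for one black-box simulator step, and the explicit inverse could equivalently be replaced by the $\sigma^2(n_s-1)\,\text{I}$ approximation derived above:

```python
import numpy as np

def lls_cd_jacobian(step_fn, x_bar, u_bar, n_samples=None, sigma=1e-3, rng=None):
    """LLS-CD estimate of [f_x, f_u] at (x_bar, u_bar), cf. Eq. (11).

    step_fn(x, u) -> next state, treated as a black-box simulator call.
    Returns (f_x, f_u) with shapes (n_x, n_x) and (n_x, n_u).
    """
    rng = np.random.default_rng() if rng is None else rng
    x_bar, u_bar = np.asarray(x_bar, float), np.asarray(u_bar, float)
    n_x, n_u = x_bar.size, u_bar.size
    n_s = n_samples or (n_x + n_u + 5)            # slightly more than n_x + n_u samples

    dX = sigma * rng.standard_normal((n_x, n_s))  # i.i.d. zero-mean Gaussian perturbations
    dU = sigma * rng.standard_normal((n_u, n_s))

    # Central differences of the dynamics, one column per perturbation sample.
    dF = np.column_stack([
        step_fn(x_bar + dX[:, i], u_bar + dU[:, i]) -
        step_fn(x_bar - dX[:, i], u_bar - dU[:, i])
        for i in range(n_s)
    ])                                            # shape (n_x, n_s)

    dXU = np.vstack([dX, dU])                     # shape (n_x + n_u, n_s)
    # Least-squares solution of dF ~= 2 [f_x f_u] dXU; the 1/2 follows from Eq. (10).
    J = 0.5 * dF @ dXU.T @ np.linalg.inv(dXU @ dXU.T)
    return J[:, :n_x], J[:, n_x:]
```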

Input: Initial state $\mathbf{x}_0$, system parameters $\mathcal{P}$.
$k \leftarrow 1$.   /* Initialize the iteration number $k$ to 1. */
$forward\_pass\_flag$ = true.
/* Run until the change in cost between subsequent iterations is less than an $\epsilon$ fraction of the previous cost. */
while $k == 1$ or $\big(\text{cost}(\mathbb{T}_{nom}^{k}) / \text{cost}(\mathbb{T}_{nom}^{k-1})\big) < 1 + \epsilon$ do
       /* Each iteration has a backward pass followed by a forward pass. */
       $\{k^{k}_{0:N-1}, K^{k}_{0:N-1}\}$, $backward\_pass\_success\_flag$ = Backward Pass($\mathbb{T}_{nom}^{k}$, $\mathcal{P}$).
       if $backward\_pass\_success\_flag$ == true then
              $\mathbb{T}_{nom}^{k+1}$, $forward\_pass\_flag$ = Forward Pass($\mathbb{T}_{nom}^{k}$, $\{k^{k}_{0:N-1}, K^{k}_{0:N-1}\}$, $\mathcal{P}$).
              while $forward\_pass\_flag$ == false do
                     Reduce $\alpha$ in $\mathcal{P}$.   /* Line-search step */
                     $\mathbb{T}_{nom}^{k+1}$, $forward\_pass\_flag$ = Forward Pass($\mathbb{T}_{nom}^{k}$, $\{k^{k}_{0:N-1}, K^{k}_{0:N-1}\}$, $\mathcal{P}$).
              end while
       else
              Increase $\mu$ in $\mathcal{P}$.   /* Regularization step */
       end if
       $k \leftarrow k + 1$.
       $\mathbb{T}^{*}_{nom} \leftarrow \mathbb{T}^{k+1}_{nom}$.
end while
return $\mathbb{T}^{*}_{nom}$
Algorithm 1: Open-loop trajectory optimization via model-free DDP
Input: Nominal trajectory $\mathbb{T}_{nom}^{k}$, previous iteration policy parameters $\{k_{0:N-1}, K_{0:N-1}\}$, and system and cost parameters $\mathcal{P}$.
$\{\bar{\mathbf{x}}_t^{prev}, \bar{\mathbf{u}}_t^{prev}\} \leftarrow \mathbb{T}_{nom}^{k}$.
$t \leftarrow 0$.
while $t < N$ do
       /* Simulate one step forward ($\alpha$ is the line-search parameter). */
       $\bar{\mathbf{u}}_t = \bar{\mathbf{u}}_t^{prev} + \alpha k_t + K_t(\bar{\mathbf{x}}_t - \bar{\mathbf{x}}_t^{prev})$,
       $\bar{\mathbf{x}}_{t+1} = simulate\_forward\_step(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$.
       $t \leftarrow t + 1$.
end while
$\mathbb{T}_{nom}^{k+1} \leftarrow \{\bar{\mathbf{x}}_{0:N}, \bar{\mathbf{u}}_{0:N-1}\}$.
if the change from $\mathbb{T}_{nom}^{k}$ to $\mathbb{T}_{nom}^{k+1}$ is acceptable then
       return $\mathbb{T}_{nom}^{k+1}$, true.
else
       return $\mathbb{T}_{nom}^{k}$, false.
end if
Algorithm 2: Forward Pass
Input: Nominal trajectory $\mathbb{T}_{nom}^{k}$, previous iteration policy parameters $\{k_{0:N-1}, K_{0:N-1}\}$, horizon $N$, and system and cost parameters $\mathcal{P}$.
/* The backward pass starts from the final time-step, i.e., $N-1$. */
$t \leftarrow N - 1$.
Compute $J_{\mathbf{x}_N}$ and $J_{\mathbf{x}_N\mathbf{x}_N}$ using the boundary conditions.
/* Keep a copy of the previous policy gains. */
$k\_old \leftarrow k_{0:N-1}$ and $K\_old \leftarrow K_{0:N-1}$.
while $t \geq 0$ do
       /* Obtain the Jacobians from simulator rollouts as shown in equation (11): */
       $f_{\mathbf{x}_t}, f_{\mathbf{u}_t} \leftarrow model\_free\_jacobian(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$.
       /* Obtain the partials of the Q function as follows: */
       $Q_{\mathbf{x}_t} = c_{\mathbf{x}_t} + f_{\mathbf{x}_t}^T J'_{\mathbf{x}_{t+1}}$,
       $Q_{\mathbf{u}_t} = c_{\mathbf{u}_t} + f_{\mathbf{u}_t}^T J'_{\mathbf{x}_{t+1}}$,
       $Q_{\mathbf{x}_t\mathbf{x}_t} = c_{\mathbf{x}_t\mathbf{x}_t} + f_{\mathbf{x}_t}^T J'_{\mathbf{x}_{t+1}\mathbf{x}_{t+1}} f_{\mathbf{x}_t}$,
       $Q_{\mathbf{u}_t\mathbf{x}_t} = c_{\mathbf{u}_t\mathbf{x}_t} + f_{\mathbf{u}_t}^T (J'_{\mathbf{x}_{t+1}\mathbf{x}_{t+1}} + \mu I_{n_x\times n_x}) f_{\mathbf{x}_t}$,
       $Q_{\mathbf{u}_t\mathbf{u}_t} = c_{\mathbf{u}_t\mathbf{u}_t} + f_{\mathbf{u}_t}^T (J'_{\mathbf{x}_{t+1}\mathbf{x}_{t+1}} + \mu I_{n_x\times n_x}) f_{\mathbf{u}_t}$.
       if $Q_{\mathbf{u}_t\mathbf{u}_t}$ is positive-definite then
              $k_t = -Q_{\mathbf{u}_t\mathbf{u}_t}^{-1} Q_{\mathbf{u}_t}$,
              $K_t = -Q_{\mathbf{u}_t\mathbf{u}_t}^{-1} Q_{\mathbf{u}_t\mathbf{x}_t}$.
       else
              /* If $Q_{\mathbf{u}_t\mathbf{u}_t}$ is not positive-definite, abort the backward pass. */
              return $\{k\_old, K\_old\}$, false.
       end if
       /* Obtain the partials of the value function $J_t$ as follows: */
       $J_{\mathbf{x}_t} = Q_{\mathbf{x}_t} + K_t^T Q_{\mathbf{u}_t\mathbf{u}_t} k_t + K_t^T Q_{\mathbf{u}_t} + Q_{\mathbf{u}_t\mathbf{x}_t}^T k_t$,
       $J_{\mathbf{x}_t\mathbf{x}_t} = Q_{\mathbf{x}_t\mathbf{x}_t} + K_t^T Q_{\mathbf{u}_t\mathbf{u}_t} K_t + K_t^T Q_{\mathbf{u}_t\mathbf{x}_t} + Q_{\mathbf{u}_t\mathbf{x}_t}^T K_t$.
       $t \leftarrow t - 1$.
end while
$k\_new = k_{0:N-1}$, $K\_new = K_{0:N-1}$.
return $\{k\_new, K\_new\}$, true.
Algorithm 3: Backward Pass

5.2 Feedback Law

As discussed in the previous section, a time-varying LQR feedback law is used as a surrogate for the minimum variance problem. The LQR gains $\{K_{1:N-1}\}$ can be fully determined if $Q_t$, $R_t$, $A_t$ and $B_t$ are known. The first two are tuning parameters in the cost function. Determining $A_t$ and $B_t$ of the linear perturbation system is a system identification problem that can be solved by LLS-CD, as described in subsection 5.1.1.
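A minimal sketch of this step (ours, assuming constant stage weights $Q$ and $R$; time-varying weights are a straightforward extension) is given below, where `A` and `B` are the lists of Jacobians identified by LLS-CD along the optimal nominal trajectory:

```python
import numpy as np

def tvlqr_gains(A, B, Q, R, Q_N):
    """Backward Riccati recursion for the time-varying LQR surrogate.

    A[t], B[t]: LLS-CD estimates of A_t, B_t along the nominal trajectory.
    Q, R: stage cost weights; Q_N: terminal weight.
    Returns the feedback gains K_t such that delta_u_t = K_t @ delta_x_t.
    """
    N = len(A)
    P = Q_N
    K = [None] * N
    for t in reversed(range(N)):
        S = R + B[t].T @ P @ B[t]
        K[t] = -np.linalg.solve(S, B[t].T @ P @ A[t])   # K_t = -(R + B^T P B)^{-1} B^T P A
        P = Q + A[t].T @ P @ (A[t] + B[t] @ K[t])       # Riccati update for P_t
    return K
```

The closed-loop policy is then $\mathbf{u}_t = \bar{\mathbf{u}}_t^* + K_t(\mathbf{x}_t - \bar{\mathbf{x}}_t^*)$, i.e., the decoupled parameterization of section 4.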

6 SIMULATIONS

This section reports the implementation details of D2C-2.0 in simulation on various systems and the corresponding training results. We train systems of varying complexity in terms of the dimension of the state and action spaces. The physical models of the systems are deployed in the simulation platform 'MuJoCo-2.0' [28] as a surrogate for their analytical models. The models are imported from the OpenAI gym [29] and Deepmind's control suite [30].

Figure 1 (left) shows the open-loop training plots of various systems under consideration. The data is averaged over 5 experimental runs to reflect the training variance and the repeatability of D2C-2.0 over multiple training sessions. One might notice that the episodic cost is not monotonically decreasing during training. This is because of the flexibility provided in training for the episodic cost to stay within a ‘band’ around the cost at the previous iteration. Though this might momentarily increase the cost in subsequent training iterations, it is observed to have improved the overall speed of convergence.

Figure 1: (Left) Episodic cost vs. training iteration number in D2C-2.0 for (a) Inverted Pendulum, (b) Cart-Pole, (c) 3-link Swimmer, (d) 6-link Swimmer and (e) Robotic Fish. (Right) Terminal state MSE during testing in D2C-2.0 vs. DDPG for (f) Inverted Pendulum, (g) Cart-Pole, (h) 3-link Swimmer, (i) 6-link Swimmer and (j) Fish.

7 DISCUSSION

In this section, we compare D2C-2.0 with D2C and DDPG, and discuss the merits and demerits of each of the methods.

The manuscript on D2C provided an in-depth comparison of D2C with DDPG. There, we showed that training in D2C is much more sample-efficient and fares well w.r.t. the training time and the reproducibility of the resulting policy.

Table 1: Comparison of the simulation parameters and training outcomes of D2C-2.0 with other baselines. Training time (in sec.).

System              D2C open-loop   D2C closed-loop   DDPG        D2C-2.0
Inverted Pendulum   12.9            <0.1              2261.15     0.33
Cart pole           15.0            1.33              6306.7      1.62
3-link Swimmer      4001.0          4.6               38833.64    186.2
6-link Swimmer      3585.4          16.4              88160       127.2
Fish                6011.2          75.6              124367.6    54.8

Table 1 shows the training times of D2C, DDPG and D2C-2.0. It is evident from the table that for simple, lower dimensional systems such as the pendulum and the cart pole, the policy can almost be computed online (parallelization could make it much faster). This is due to the tendency of ILQR/DDP to converge quickly when the linearization of the system model has a large basin of attraction (such as the pendulum, which is, in fact, often approximated as linear about its resting state). Moreover, for a higher dimensional system such as the robotic fish, D2C-2.0 took 54.8 seconds while the original D2C took around 6096 seconds, i.e., approximately 90 times longer. The table makes it evident that D2C-2.0 clearly exhibits much faster convergence than both D2C and DDPG, while D2C is in turn much faster than DDPG. The primary reason for the much larger training times in DDPG is the complex parameterization of the feedback policy in terms of deep neural networks, which runs into hundreds of thousands of parameters, whereas D2C's parameterization, primarily the open-loop control sequence of dimension $pT$, where $p$ is the number of inputs and $T$ is the time horizon of the problem, runs from tens to a few thousands of parameters (see Table 2).

Table 2: Parameter size comparison between D2C and DDPG.

System              No. of steps   No. of actuators   No. of parameters    No. of parameters
                                                      optimized in D2C     optimized in DDPG
Inverted Pendulum   30             1                  30                   244002
Cart pole           30             1                  30                   245602
3-link Swimmer      1600           2                  3200                 251103
6-link Swimmer      1500           5                  7500                 258006
Fish                1200           6                  7200                 266806

A major reason behind the differences in training efficiency stems from their formulations. D2C is based on gradient descent, which is linearly convergent; it is a direct and generic method of solving an optimization problem. The equations involved in D2C-2.0, on the other hand, arise from the optimality conditions that inherently exploit the recursive optimal structure and the causality in the problem. Moreover, the equations are explicit in the Jacobians of the system model. As a result, it is possible to solve for them independently and in a sample-efficient manner, at every time-step of each iteration. However, one should expect that as we consider very high dimensional systems, the algorithmic complexity of D2C-2.0 could outgrow that of D2C, because D2C-2.0 is $O(n_x^2)$ in its computational complexity. Also, D2C, by the nature of its formulation, does not need the state of the system and hence is generalizable to problems with partial state observability.

The closed-loop performance of D2C-2.0 w.r.t. that of DDPG is shown in Figure 1 (right). Noise is injected into the system via the control channel by varying the noise scaling factor $\epsilon$ and then measuring the performance. We note here that DDPG fails to converge for the robotic fish problem, and thus its nominal performance ($\epsilon = 0$) is poor. We see that the performance of D2C is better than that of DDPG when the noise levels are low; however, the performance of DDPG remains relatively similar at higher levels of noise while that of D2C degrades. In our opinion, this closed-loop performance aspect of the methods needs to be studied in more detail in order to make a definitive judgement, and is left for future work. For instance, by the time DDPG overtakes D2C in terms of average performance, the actual performance of both systems is quite poor, and thus it is not clear whether this can be construed as an advantage.

Now, as mentioned in section 2, earlier works that attempted model-free ILQR have mostly been confined to finite differences for the Jacobian computation. Table 3 shows the comparison of per-iteration computational times between finite differences (FD) and the linear least squares with central difference formulation (LLS-CD). It is clearly evident that, as the dimension of the state space increases, the method of finite differences requires many more function evaluations, and hence our LLS-CD method is much more efficient.

Table 3: Comparison of the computational times (in sec.) per iteration (averaged over 5 runs).

System              FD       LLS-CD
Inverted Pendulum   0.017    0.033
Cart pole           0.0315   0.0463
Acrobot             0.554    0.625
3-link Swimmer      4.39     1.86
6-link Swimmer      14.43    1.278
Fish                34.75    2.74

8 Conclusion

To conclude the discussion of the global vs. local policy approximation dilemma raised in section 1, how well deep RL methods extrapolate their policy to the entire continuous state and action space remains an open question. Note also that the results shown in Table 1 are based on serial implementations on an off-the-shelf computer, while D2C-2.0 is highly parallelizable. Given the order of magnitude of the numbers in the same table, we expect that, with a reasonable augmentation of computational power by parallelization, D2C-2.0 could offer a near real-time solution to high dimensional problems. In such cases, one could rely on the near-optimal policy described in this paper at low noise levels, and re-solve for the open-loop trajectory online, along with the attendant feedback, whenever the noise takes the system out of the region of attraction of the linear feedback policy. This will be our motivation for future work, along with addressing the following limitation.
Limitations: D2C-2.0, in its current form, is not reliable for hybrid models, such as legged robots, that involve impacts. Primarily, the issue is in the feedback design, which requires smoothness of the underlying dynamics, while legged systems are only piecewise smooth. Future work shall model impacts so as to tractably incorporate them when estimating the required gradients. This seems plausible, as ILQR has been used to design controllers for such systems in the literature [25].

References

  • [1] W. B. Powell (2007). Approximate Dynamic Programming: Solving the curses of dimensionality. John Wiley & Sons.
  • [2] P.A. Ioannou and J. Sun (2012). Robust adaptive control, Courier Corporation.
  • [3] K.J. Astrom and B. Wittenmark (2012). Adaptive control, Courier Corporation.
  • [4] R.S. Sutton, and A.G. Barto (2018). Reinforcement learning: An introduction, MIT press.
  • [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran , D. Wierstra, S. Legg, D. Hassabis (2015). Human-level control through deep reinforcement learning, Nature. vol. 518, pp. 529-33.
  • [6] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al. (2016). Mastering the game of go with deep neural networks and tree search, Nature, vol. 529, no. 7587, pp. 484.
  • [7] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016). Continuous control with deep reinforcement learning, International Conference on Learning Representations.
  • [8] V. Kumar, E. Todorov, and S. Levine (2016). Optimal Control with Learned Local Models: Application to Dexterous Manipulation, International Conference for Robotics and Automation (ICRA).
  • [9] D. Mitrovic, S. Klanke, and S. Vijayakumar (2010). Adaptive Optimal Feedback Control with Learned Internal Dynamics Models, In: Sigaud O., Peters J. (eds) From Motor Learning to Interaction Learning in Robots. Studies in Computational Intelligence, vol 264. Springer, Berlin, Heidelberg, pp. 65-84.
  • [10] P. Abbeel, M. Quigley, and A.Y. Ng (2006). Using inaccurate models in reinforcement learning, Proceedings of the 23rd international conference on Machine learning (ICML ’06). ACM, New York, NY, USA, 1-8. DOI: https://doi.org/10.1145/1143844.1143845.
  • [11] Y. Wu, E. Mansimov, S. Liao, R. Grosse and J. Ba (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  • [12] J. Schulman, S. Levine and P. Moritz, M. I. Jordan and P. Abbeel (2015). Trust Region Policy Optimization. arXiv:1502.05477.
  • [13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  • [14] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018). Deep reinforcement learning that matters, in Thirty Second AAAI Conference on Artificial Intelligence.
  • [15] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control, Reproducibility in Machine Learning Workshop, ICML.
  • [16] A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade (2017). Towards generalization and simplicity in continuous control, Advances in Neural Information Processing Systems, pp. 6550-6561.
  • [17] R. Wang, K.S. Parunandi, D. Yu, D. Kalathil and S. Chakravorty (2019). Decoupled Data Based Approach for Learning to Control Nonlinear Dynamical Systems, arXiv:1904.08361, under review for IEEE Tr. Aut. Control.
  • [18] M. Rafieisakhaei, S. Chakravorty, and P.R. Kumar (2017). A Near-optimal separation principle for nonlinear stochastic systems arising in robotic path planning and control, IEEE Conference on Decision and Control(CDC).
  • [19] K. S. Parunandi, S. Chakravorty (2019). T-PFC: A Trajectory-Optimized Perturbation Feedback Control Approach, IEEE Robotics and Automation Letters, vol. 4, pp. 3457-3464.
  • [20] D. H. Jacobson and D. Q. Mayne (1970). Differential Dynamic Programming, Elsevier.
  • [21] W. Li and E. Todorov (2007). Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system, International Journal of Control, vol. 80, no. 9, pp. 1439-1453.
  • [22] W. Li, and E. Todorov (2004). Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems, ICINCO.
  • [23] E. Todorov, and W. Li (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. Proceedings of the 24th American Control Conference, volume 1 of (ACC 2005), pp. 300 - 306.
  • [24] Z. Xie, C. K. Liu, and K. Hauser (2017). Differential dynamic programming with nonlinear constraints, International Conference on Robotics and Automation (ICRA), pp. 695-702.
  • [25] Y. Tassa, T. Erez, and E. Todorov (2012). Synthesis and stabilization of complex behaviors through online trajectory optimization, International Conference on Intelligent Robots and Systems, pp. 4906-4913.
  • [26] S. Levine, and V. Koltun (2013). Guided policy search, Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML’13), Sanjoy Dasgupta and David McAllester (Eds.), Vol. 28. JMLR.org III-1-III-9.
  • [27] J. C. Spall (1998). An overview of the Simultaneous Perturbation Method for efficient optimization, Johns Hopkins APL Technical Digest, vol. 19, no. 4, pp. 482-492.
  • [28] E. Todorov, T. Erez, and Y. Tassa (2012). MuJoCo: A physics engine for model-based control, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033.
  • [29] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016). OpenAI Gym, arXiv preprint arXiv:1606.01540.
  • [30] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller (2018). DeepMind control suite, arXiv preprint arXiv:1801.00690.

9 Supplementary

Proof of Lemma 1:

We proceed by induction. The first general instance of the recursion occurs at $t = 3$. It can be shown that:

$$\delta\mathbf{x}_3 = \underbrace{(\bar{A}_2\bar{A}_1(\epsilon w_0) + \bar{A}_2(\epsilon w_1) + \epsilon w_2)}_{\delta\mathbf{x}_3^l} + \underbrace{\{\bar{A}_2\bar{S}_1(\epsilon w_0) + \bar{S}_2(\bar{A}_1(\epsilon w_0) + \epsilon w_1 + \bar{S}_1(\epsilon w_0))\}}_{\bar{\bar{S}}_3}. \quad (12)$$

Noting that $\bar{S}_1(\cdot)$ and $\bar{S}_2(\cdot)$ are second and higher order terms, it follows that $\bar{\bar{S}}_3$ is $O(\epsilon^2)$.
Suppose now that $\delta\mathbf{x}_t = \delta\mathbf{x}_t^l + \bar{\bar{S}}_t$, where $\bar{\bar{S}}_t$ is $O(\epsilon^2)$. Then:

$$\delta\mathbf{x}_{t+1} = \bar{A}_{t+1}(\delta\mathbf{x}_t^l + \bar{\bar{S}}_t) + \epsilon w_t + \bar{S}_{t+1}(\delta\mathbf{x}_t) = \underbrace{(\bar{A}_{t+1}\delta\mathbf{x}_t^l + \epsilon w_t)}_{\delta\mathbf{x}_{t+1}^l} + \underbrace{\{\bar{A}_{t+1}\bar{\bar{S}}_t + \bar{S}_{t+1}(\delta\mathbf{x}_t)\}}_{\bar{\bar{S}}_{t+1}}. \quad (13)$$

Noting that $\bar{S}_{t+1}$ is $O(\epsilon^2)$ and that $\bar{\bar{S}}_t$ is $O(\epsilon^2)$ by the induction hypothesis, the result follows.

Proof of Proposition 1:

From (6), we get

$$\tilde{J}^{\pi} = \mathbb{E}[J^{\pi}] = \mathbb{E}[\bar{J}^{\pi} + \delta J_1^{\pi} + \delta J_2^{\pi}] = \bar{J}^{\pi} + \mathbb{E}[\delta J_2^{\pi}] = \bar{J}^{\pi} + O(\epsilon^2). \quad (14)$$

The second equality above follows from the fact that $\mathbb{E}[\delta\mathbf{x}_t^l] = 0$, since it is the linear part of the state perturbation driven by white noise and, by definition, $\delta\mathbf{x}_1^l = 0$. The last equality follows from the fact that $\delta J_2^{\pi}$ is an $O(\epsilon^2)$ function. Now,

$$\text{Var}(J^{\pi}) = \mathbb{E}[J^{\pi} - \tilde{J}^{\pi}]^2 = \mathbb{E}[\bar{J}_0^{\pi} + \delta J_1^{\pi} + \delta J_2^{\pi} - \bar{J}_0^{\pi} - \delta\tilde{J}_2^{\pi}]^2 = \text{Var}(\delta J_1^{\pi}) + \text{Var}(\delta J_2^{\pi}) + 2\,\mathbb{E}[\delta J_1^{\pi}\delta J_2^{\pi}]. \quad (15)$$

Since $\delta J_2^{\pi}$ is $O(\epsilon^2)$, $\text{Var}(\delta J_2^{\pi})$ is an $O(\epsilon^4)$ function. It can be shown that $\mathbb{E}[\delta J_1^{\pi}\delta J_2^{\pi}]$ is $O(\epsilon^4)$ as well (the proof is given in Lemma 2). Finally, $\text{Var}(\delta J_1^{\pi})$ is an $O(\epsilon^2)$ function because $\delta\mathbf{x}^l$ is an $O(\epsilon)$ function. Combining these gives the desired result.

Lemma 2: Let $\delta J_1^{\pi}$, $\delta J_2^{\pi}$ be as defined in (6). Then, $\mathbb{E}[\delta J_1\delta J_2]$ is an $O(\epsilon^4)$ function.

Proof of Lemma 2:
In the following, we suppress the explicit dependence on $\pi$ in $\delta J_1^{\pi}$ and $\delta J_2^{\pi}$ for notational convenience. Recall that $\delta J_1 = \sum_{t=0}^{T} c_t^x\delta\mathbf{x}_t^l$, and $\delta J_2 = \sum_{t=0}^{T}\bar{H}_t(\delta\mathbf{x}_t) + c_t^x\bar{\bar{S}}_t$. For notational convenience, let us consider the scalar case; the vector case follows readily at the expense of more elaborate notation. Let us first consider $\bar{\bar{S}}_2$. We have that $\bar{\bar{S}}_2 = \bar{A}_2\bar{S}_1(\epsilon w_0) + \bar{S}_2(\bar{A}_1(\epsilon w_0) + \epsilon w_1 + \bar{S}_1(\epsilon w_0))$. Then, it follows that:

$$\bar{\bar{S}}_2 = \bar{A}_2\bar{S}_1^{(2)}(\epsilon w_0)^2 + \bar{S}_2^{(2)}(\bar{A}_1\epsilon w_0 + \epsilon w_1)^2 + O(\epsilon^3), \quad (16)$$

where $\bar{S}_t^{(2)}$ represents the coefficient of the second order term in the expansion of $\bar{S}_t$. A similar observation holds for $\bar{H}_2(\delta\mathbf{x}_2)$, in that:

$$\bar{H}_2(\delta\mathbf{x}_2) = \bar{H}_2^{(2)}(\bar{A}_1(\epsilon w_0) + \epsilon w_1)^2 + O(\epsilon^3), \quad (17)$$

where $\bar{H}_t^{(2)}$ is the coefficient of the second order term in the expansion of $\bar{H}_t$. Note that $\epsilon w_0 = \delta\mathbf{x}_1^l$ and $\bar{A}_1(\epsilon w_0) + \epsilon w_1 = \delta\mathbf{x}_2^l$. Therefore, it follows that we may write:

$$\bar{H}_t(\delta\mathbf{x}_t) + C_t^x\bar{\bar{S}}_t = \sum_{\tau=0}^{t-1} q_{t,\tau}(\delta\mathbf{x}_{\tau}^l)^2 + O(\epsilon^3), \quad (18)$$

for suitably defined coefficients $q_{t,\tau}$. Therefore, it follows that

$$\delta J_2 = \sum_{t=1}^{T}\bar{H}_t(\delta\mathbf{x}_t) + C_t^x\bar{\bar{S}}_t = \sum_{\tau=0}^{T}\bar{q}_{T,\tau}(\delta\mathbf{x}_{\tau}^l)^2 + O(\epsilon^3), \quad (19)$$

for suitably defined $\bar{q}_{T,\tau}$. Therefore:

$$\delta J_1\delta J_2 = \sum_{t,\tau} C_{\tau}^x(\delta\mathbf{x}_{\tau}^l)\,\bar{q}_{T,t}(\delta\mathbf{x}_t^l)^2 + O(\epsilon^4). \quad (20)$$

Taking expectations on both sides:

$$\mathbb{E}[\delta J_1\delta J_2] = \sum_{t,\tau} C_{\tau}^x\bar{q}_{T,t}\,\mathbb{E}[\delta\mathbf{x}_{\tau}^l(\delta\mathbf{x}_t^l)^2] + O(\epsilon^4). \quad (21)$$

Break $\delta\mathbf{x}_t^l = (\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l) + \delta\mathbf{x}_{\tau}^l$, assuming $\tau < t$. Then, it follows that:

$$\mathbb{E}[\delta\mathbf{x}_{\tau}^l(\delta\mathbf{x}_t^l)^2] = \mathbb{E}[\delta\mathbf{x}_{\tau}^l(\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l)^2] + \mathbb{E}[(\delta\mathbf{x}_{\tau}^l)^3] + 2\,\mathbb{E}[(\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l)(\delta\mathbf{x}_{\tau}^l)^2] = \mathbb{E}[(\delta\mathbf{x}_{\tau}^l)^3], \quad (22)$$

where the first and last terms in the first equality drop out due to the independence of the increment $(\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l)$ from $\delta\mathbf{x}_{\tau}^l$, and the facts that $\mathbb{E}[\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l] = 0$ and $\mathbb{E}[\delta\mathbf{x}_{\tau}^l] = 0$. Since $\delta\mathbf{x}_{\tau}^l$ is the state of the linear system $\delta\mathbf{x}_{t+1}^l = \bar{A}_t\delta\mathbf{x}_t^l + \epsilon w_t$, it may again be shown that:

$$\mathbb{E}[(\delta x_{\tau}^l)^3] = \sum_{s_1, s_2, s_3} \Phi_{\tau, s_1}\Phi_{\tau, s_2}\Phi_{\tau, s_3}\,\mathbb{E}[w_{s_1}w_{s_2}w_{s_3}], \quad (23)$$

where $\Phi_{t,\tau}$ represents the state transition operator between times $\tau$ and $t$, which follows from the closed-loop dynamics. Now, due to the independence of the noise terms $w_t$, it follows that $\mathbb{E}[w_{s_1}w_{s_2}w_{s_3}] = 0$ regardless of $s_1, s_2, s_3$.
An analogous argument can be repeated for the case $\tau > t$. Therefore, it follows that $\mathbb{E}[\delta J_1\delta J_2] = O(\epsilon^4)$.

DDPG Algorithm: Deep Deterministic Policy Gradient (DDPG) is a policy-gradient based off-policy reinforcement learning algorithm that operates in continuous state and action spaces. It relies on two function approximation networks, one each for the actor and the critic. The critic network estimates the value $Q(s, a)$ given the state and the action taken, while the actor network outputs a policy given the current state. Both are represented by neural networks.

The off-policy characteristic of the algorithm employs a separate behavioural policy by introducing additive noise to the policy output obtained from the actor network. The critic network minimizes a loss based on the temporal-difference (TD) error and the actor network uses the deterministic policy gradient theorem to update its policy gradient, as shown below:

Critic update by minimizing the loss:

$$L = \frac{1}{N}\Sigma_i\big(y_i - Q(s_i, a_i|\theta^Q)\big)^2,$$

where $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ is the TD target computed using the target actor and critic networks.

Actor policy gradient update:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\Sigma_i\nabla_a Q(s, a|\theta^Q)\big|_{s=s_i, a=\mu(s_i)}\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})\big|_{s_i}$$

The actor and the critic networks consist of two hidden layers, with the first layer containing 400 ('relu'-activated) units followed by a second layer containing 300 ('relu'-activated) units. The output layer of the actor network has a number of ('tanh'-activated) units equal to the number of actions in the action space.

Target networks, one each for the actor and the critic, are employed for a gradual update of the network parameters, thereby reducing oscillations and providing better training stability. The target networks are softly updated with rate $\tau = 0.001$. Experience replay is another technique that improves the stability of training, by training the networks with batches of randomized data samples from the agent's experience. We have used a batch size of 32 for the inverted pendulum and the cart-pole examples, whereas it is 64 for the rest. Finally, the networks are compiled using the Adam optimizer with a learning rate of 0.001.
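As an illustration of the soft target update mentioned above (a sketch, not the exact implementation used here), the target parameters are moved a small step $\tau$ toward the learned parameters after every update:

```python
def soft_update(target_params, source_params, tau=0.001):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]
```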

To account for state-space exploration, the behavioural policy consists of an off-policy term arising from a random process. We obtain discrete samples from an Ornstein-Uhlenbeck (OU) process to generate noise, as done in the original DDPG method. The OU process has a mean-reverting property and produces temporally correlated noise samples as follows:

$$dx_t = \Theta(\mu - x_t)\,dt + \sigma\,dW$$

where $\Theta$ indicates how fast the process reverts to the mean, $\mu$ is the equilibrium or mean value and $\sigma$ corresponds to the degree of volatility of the process. $\Theta$ is set to 0.15, $\mu$ to 0 and $\sigma$ is annealed from 0.35 to 0.05 over the training process.
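A minimal sketch of the discretized OU exploration noise described above (an Euler-Maruyama step; the annealing of $\sigma$ is handled outside the class) is given below:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise for the DDPG behavioural policy."""
    def __init__(self, dim, theta=0.15, mu=0.0, sigma=0.35, dt=1.0, rng=None):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.x = np.full(dim, mu, dtype=float)
        self.rng = rng or np.random.default_rng()

    def sample(self):
        # Euler-Maruyama discretization of dx = theta * (mu - x) dt + sigma dW
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + dx
        return self.x
```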