

Karthikeya S Parunandi, Aayushman Sharma, Suman Chakravorty and Dileep Kalathil
Texas A&M University, College Station, TX 77843, USA
Emails: karthikeyasharma91@gmail.com, aayushmansharma@tamu.edu, s.chakrav@tamu.edu, dileep.kalathil@tamu.edu

D2C 2.0: Decoupled Data-Based Approach for Learning to Control Stochastic Nonlinear Systems via Model-Free ILQR

Karthikeya S Parunandi, Aayushman Sharma, Suman Chakravorty and Dileep Kalathil
Abstract

In this paper, we propose a structured linear parameterization of a feedback policy to solve the model-free stochastic optimal control problem. This parameterization is justified by a decoupling principle that is shown to yield near-optimality under a small-noise assumption, both in theory and in empirical analyses. Further, we incorporate a sample-efficient, model-free version of the Iterative Linear Quadratic Regulator (ILQR) into our framework. Simulations on systems over a range of complexities reveal that the resulting algorithm is able to harness the superior second-order convergence properties of ILQR. As a result, it is fast and scalable to a wide variety of higher dimensional systems. Comparisons are made with a state-of-the-art reinforcement learning algorithm, the Deep Deterministic Policy Gradient (DDPG) technique, to demonstrate the significant merits of our approach in terms of training efficiency.

Keywords: Machine Learning, Motion and Path Planning, Optimization and Optimal Control

1 INTRODUCTION

It is well known that a general solution to the stochastic optimal control problem is computationally intractable. Hence, most methods rely on approximate dynamic programming (see, e.g., [1]). When the system model is unknown, the problem is further complicated. This problem of learning to optimally control an unknown general nonlinear system has been well studied in the stochastic adaptive control literature [2] [3]. A very similar problem is also addressed in the field of reinforcement learning (RL), except that it specifically aims to solve a discrete-time stochastic control process (a Markov Decision Process), more often with an infinite horizon formulation [4]. RL has also attracted great attention with its recent success, especially in its sub-domain of deep RL [5] [6] [7].
Most RL methods can be divided into model-based and model-free approaches, based on whether or not a model is constructed before solving for an optimal policy. Model-based RL typically requires fewer samples, and the model can be either global or local. Global models are typically approximated with Gaussian Processes (GPs) or neural networks, and hence can be inaccurate in some portions of the state space. Local models (such as in [8][9]), in contrast, offer more sample-efficient and accurate solutions. Model-free RL, on the other hand, bypasses learning the system model and therefore requires substantially more training samples to train the policy [10]. While an overwhelming number of successful works in the recent past have relied on complex parameterizations, such as deep neural networks [7] [11][12] [13], they are challenged by non-determinism in reproducing the results and in benchmarking against existing methods [14][15]. Recent works such as [16] showed that a simple linear parameterization can be very effective and can result in performance competitive with the former methods. The current work, aimed at deriving a simple but structured, sample-efficient and thereby fast policy, falls under the local model-based regime, though we do not explicitly build the system model but rather estimate its gradients in the neighborhood of a nominal trajectory.

In the prequel of this work, called D2C [17], instead of solving for the optimal closed-loop policy via dynamic programming as in typical RL techniques, our approach searches for a computationally tractable policy that is nonetheless near-optimal to second order in a noise parameter $\epsilon$. This is done via decoupling of the open and the closed loops, wherein we first solve for an optimal open-loop trajectory, and then design an optimal linear feedback law to stabilize the linear perturbations about the nominal trajectory. The success of this approach was previously demonstrated in [18][19] for model-based problems. This way of decoupling results in a compact parameterization of the feedback law in terms of an open-loop control sequence and a linear feedback policy, and thereby constitutes a novel approach to the RL problem via the decoupled search. The contributions of this sequel paper are as follows:
i) We incorporate a data-efficient, and hence significantly faster, way to compute the gradients in the ILQR framework in a model-free fashion. This, in turn, is incorporated into our D2C framework to solve for the optimal open-loop trajectory. This is in contrast to the 'direct' gradient-descent based approach introduced in the earlier version of our algorithm (D2C 1.0).
ii) We make a comparison with a state-of-the-art reinforcement learning algorithm, DDPG [7], as well as with D2C 1.0, and show the significant merit of the current method, termed D2C 2.0, in terms of its computational efficiency.
iii) D2C 2.0 also constitutes a novel approach to solving complex robotic motion planning problems under uncertainty in a data-efficient, model-free fashion.

2 RELATED WORK

Differential Dynamic Programming (DDP) [20] is a class of iterative algorithms for trajectory optimization. One salient aspect of DDP is that it exhibits quadratic convergence for a general nonlinear optimal control problem and converges in a single step for linear systems. In spite of its origins in the 1960s, it gained popularity only in the last decade due to the success of the modified algorithm called ILQR [21]. Though DDP theoretically guarantees quadratic convergence, some of its terms involve computing second order derivatives of the system dynamics model. Since the dynamics models of most systems are multivariable vector-valued functions, whose second order derivatives are third order tensors, DDP in its original form was not effective for practical implementation. ILQR [22] [23] dropped these terms and introduced regularization schemes [24][25] to handle the absence of these computationally expensive terms. This resulted in faster and more stable convergence to the optimal solution.

One advantage of ILQR with respect to a model-free implementation is that its equations are explicit in the system model and its first order derivatives. Earlier works such as [25] and [26] employed finite differencing to compute the Jacobians in ILQR. Typically, a forward Euler scheme is chosen to independently determine each element (i.e., gradient) of a Jacobian matrix. [9] presented ILQR-LD, which learns the dynamics through a Locally Weighted Projection Regression (LWPR) approximation of the model and then obtains the corresponding analytical Jacobian. Their work also demonstrated better computational efficiency than finite differences.

Spall introduced the ‘Simultaneous Perturbation Stochastic Approximation’ (SPSA) [27] method that evaluates the function only twice to calculate the Jacobian of the cost function. In this paper, a similar formulation is derived to compute the Jacobians of the system model online, in a model-free fashion, through the least squares method in a central-difference scheme.

The rest of the paper is organized as follows. Section 3 introduces the terminology and provides the problem statement. In section 4, we present a near-optimal decoupling principle that forms the basis of our algorithm. Section 5 describes our algorithm, D2C 2.0, in detail. In Sections 6 and 7, we provide results and a discussion of the application of D2C 2.0 to several benchmark examples, along with comparisons to DDPG and the D2C technique.

3 PROBLEM FORMULATION

Let $\mathbf{x}_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$ represent the state of a system and $\mathbf{u}_t \in \mathcal{U} \subseteq \mathbb{R}^{n_u}$ be the control signal at time $t$, respectively. Let $f: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \rightarrow \mathbb{R}^{n_x}$ denote the deterministic state transition model of the system. Let $\pi_t: \mathbb{R}^{n_x} \rightarrow \mathbb{R}^{n_u}$ be the policy to be applied to the system. Let $c_t: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \rightarrow \mathbb{R}$ be the time-indexed incremental cost function for all $t \in \{0, 1, \cdots, N-1\}$ and $c_N: \mathbb{R}^{n_x} \rightarrow \mathbb{R}$ be the terminal cost function. Let $J_t^{\pi_t}: \mathbb{R}^{n_x} \rightarrow \mathbb{R}$ represent the cost-to-go function under the policy $\pi_t$ and $Q_t^{\pi_t}: \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \rightarrow \mathbb{R}$ be the corresponding action-value function, both at time $t$. Finally, let us denote the derivatives of a variable at time $t$ w.r.t. the state $\mathbf{x}_t$ or the control $\mathbf{u}_t$ by placing the respective variable in the subscript. For example, $J_{\mathbf{x}_t} = \frac{\partial J_t}{\partial \mathbf{x}_t}\big|_{\bar{\mathbf{x}}_t}$. Unless stated otherwise, derivatives are evaluated at the nominal state or control at time $t$.

Now, let us define the notion of process noise in the system through an important parameter called the 'noise scaling factor', denoted by $\epsilon$. We assume that the noise is additive, white, and zero-mean Gaussian. Hence, the state evolution equation of our stochastic system is represented as $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t) + \epsilon\omega_t$, where $\epsilon < 1$ and $\omega_t$ is zero-mean Gaussian distributed.

Let us define the notion of a 'nominal' trajectory, given a policy $\pi_t(\cdot)$, as follows: the nominal control actions of the policy are the control actions output by the policy when all the noise variables $\omega_t = 0$. Let us represent the nominal trajectory variables with bars over them, i.e., $\bar{\mathbf{x}}_t$ and $\bar{\mathbf{u}}_t$, both at time $t$. So, the nominal trajectory is given by $\mathbb{T}_{nom} = \{\bar{\mathbf{x}}_{0:N}, \bar{\mathbf{u}}_{0:N-1}\}$, where $N$ is the time horizon, $\bar{\mathbf{x}}_{t+1} = f(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$, and $\bar{\mathbf{u}}_t = \pi_t(\bar{\mathbf{x}}_t)$.

Using the above definitions, the problem of stochastic optimal control that we consider is to find the set of time-varying optimal control policies $\pi = \{\pi_1, \pi_2, \cdots, \pi_{N-1}\}$ (subscripts denote the time indices) that minimize the finite-horizon total expected cumulative cost, i.e.,

$$\tilde{J}^{\pi} = \mathbb{E}_{\pi}[J^{\pi}] = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{N-1} c_t(\mathbf{x}_t, \pi_t(\mathbf{x}_t)) + c_N(\mathbf{x}_N)\Big], \quad (1)$$

such that the system model, i.e., $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \pi_t(\mathbf{x}_t)) + \epsilon\omega_t$, the relevant constraints and the boundary conditions are satisfied. Since the current approach deals with the model-free problem, the system model and its constraints are assumed to be implicitly satisfied in the data that the system/agent gathers from its environment (through a simulator or a real-world experiment).
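To make the setup concrete, the following is a minimal Python sketch (ours, not from the original text) of how the cost in (1) can be evaluated for one noise realization, assuming black-box access to a simulator step `f`, a feedback policy `policy` and cost functions `c` and `c_N`; the expected cost $\tilde{J}^{\pi}$ can then be approximated by averaging over many such rollouts:

```python
import numpy as np

def rollout_cost(f, policy, c, c_N, x0, N, eps=0.0, rng=None):
    """One-sample Monte Carlo evaluation of the finite-horizon cost (1).

    f(x, u)      : black-box state transition (e.g. one simulator step),
    policy(t, x) : feedback policy pi_t,
    c(t, x, u)   : incremental cost, c_N(x): terminal cost,
    eps          : noise scaling factor epsilon.
    """
    rng = rng or np.random.default_rng()
    x, J = np.asarray(x0, dtype=float), 0.0
    for t in range(N):
        u = policy(t, x)
        J += c(t, x, u)
        # Additive white Gaussian process noise; eps = 0 recovers the nominal trajectory.
        x = f(x, u) + eps * rng.standard_normal(x.shape)
    return J + c_N(x)
```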

4 A NEAR-OPTIMAL DECOUPLING PRINCIPLE

Note: The results provided in this section are restated from our prequel paper on D2C, primarily for the reader's convenience and for the sake of completeness.
In this section, we describe a near-optimal decoupling principle which provides the theoretical foundation for our D2C technique.
Assuming that $f(\cdot)$ and $\pi_t(\cdot)$ are sufficiently smooth, let us describe the linearized dynamics about the nominal trajectory $\mathbb{T}_{nom}$ corresponding to the policy $\pi_t(\cdot)$ as follows. The state and control perturbations, given by $\delta\mathbf{x}_t = \mathbf{x}_t - \bar{\mathbf{x}}_t$ and $\delta\mathbf{u}_t = \mathbf{u}_t - \bar{\mathbf{u}}_t$ respectively, evolve as follows:

$$\delta\mathbf{x}_{t+1} = A_t\,\delta\mathbf{x}_t + B_t\,\delta\mathbf{u}_t + S_t(\delta\mathbf{x}_t) + R_t(\delta\mathbf{u}_t) + \epsilon\omega_t, \quad (2)$$
$$\delta\mathbf{u}_t = K_t\,\delta\mathbf{x}_t + \tilde{S}_t(\delta\mathbf{x}_t), \quad (3)$$

where $A_t = \frac{\partial f}{\partial\mathbf{x}}\big|_{(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)}$, $B_t = \frac{\partial f}{\partial\mathbf{u}}\big|_{(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)}$, $K_t = \frac{\partial\pi_t(\mathbf{x}_t)}{\partial\mathbf{x}}\big|_{\bar{\mathbf{x}}_t}$, and $S_t(\cdot)$, $\tilde{S}_t(\cdot)$, $R_t(\cdot)$ are the higher order terms in the respective Taylor expansions. By substituting (3) in (2), we have

$$\delta\mathbf{x}_{t+1} = \underbrace{(A_t + B_t K_t)}_{\bar{A}_t}\delta\mathbf{x}_t + \bar{S}_t(\delta\mathbf{x}_t) + \epsilon\omega_t, \quad (4)$$

where $\bar{S}_t(\delta\mathbf{x}_t) = B_t\tilde{S}_t(\delta\mathbf{x}_t) + S_t(\delta\mathbf{x}_t) + R_t(K_t\delta\mathbf{x}_t + \tilde{S}_t(\delta\mathbf{x}_t))$. Now, similarly expanding the cost function $c_t(\mathbf{x}_t, \mathbf{u}_t)$ under the policy $\pi_t(\mathbf{x}_t)$, $\forall t \in \{0, 1, \cdots, N-1\}$, about the nominal trajectory $\mathbb{T}_{nom}$ as $c_t(\mathbf{x}_t, \pi_t(\mathbf{x}_t)) = \bar{c}_t + \bar{C}_t\delta\mathbf{x}_t + \bar{H}_t(\delta\mathbf{x}_t)$, where $\bar{c}_t = c_t(\bar{\mathbf{x}}_t, \pi_t(\bar{\mathbf{x}}_t))$ and $\bar{C}_t = \frac{\partial c_t(\mathbf{x}_t, \pi_t(\mathbf{x}_t))}{\partial\mathbf{x}_t}\big|_{\bar{\mathbf{x}}_t}$, results in

$$J^{\pi} = \sum_{t=0}^{N}\bar{c}_t + \sum_{t=0}^{N}\bar{C}_t\delta\mathbf{x}_t + \sum_{t=0}^{N}\bar{H}_t(\delta\mathbf{x}_t). \quad (5)$$

Lemma 1: The state perturbation equation (4) can be equivalently separated into $\delta\mathbf{x}_t = \delta\mathbf{x}_t^l + \bar{S}_t$ and $\delta\mathbf{x}_{t+1}^l = \bar{A}_t\delta\mathbf{x}_t^l + \epsilon\omega_t$, where $\bar{S}_t$ is $O(\epsilon^2)$.
Proof: Proof is provided in the supplementary file.

From (5) and Lemma 1, we obtain

$$J^{\pi} = \underbrace{\sum_{t=0}^{N}\bar{c}_t}_{\bar{J}^{\pi}} + \underbrace{\sum_{t=0}^{N}\bar{C}_t\delta\mathbf{x}_t^l}_{\delta J_1^{\pi}} + \underbrace{\sum_{t=0}^{N}\bar{H}_t(\delta\mathbf{x}_t) + \bar{C}_t\bar{S}_t}_{\delta J_2^{\pi}}. \quad (6)$$

Proposition 1: Given a feedback policy $\pi$ satisfying the smoothness requirements,

$$\tilde{J}^{\pi} = \mathbb{E}[J^{\pi}] = \bar{J}^{\pi} + O(\epsilon^2), \quad (7)$$
$$Var(J^{\pi}) = \underbrace{Var(\delta J_1^{\pi})}_{O(\epsilon^2)} + O(\epsilon^4). \quad (8)$$

Proof: The proof is provided in the supplementary material. Noting that $\mathbf{u}_t = \pi_t(\mathbf{x}_t) = \bar{\mathbf{u}}_t + K_t\delta\mathbf{x}_t + \tilde{S}_t(\delta\mathbf{x}_t)$, the above proposition entails the following observations:
i) Observation 1: The cost-to-go along the nominal trajectory $\mathbb{T}_{nom} = \{\bar{\mathbf{x}}_{0:N}, \bar{\mathbf{u}}_{0:N-1}\}$, given by $\bar{J}^{\pi}$, is within $O(\epsilon^2)$ of the expected cost-to-go $\tilde{J}^{\pi}$ of the policy $\pi$.
ii) Observation 2: Given the nominal control sequence $\bar{\mathbf{u}}_{0:N-1}$, the variance of the cost-to-go is overwhelmingly determined by the linear feedback law $K_t\delta\mathbf{x}_t$, i.e., by the variance of the linear perturbation of the cost-to-go, $\delta J_1^{\pi}$, under the linear closed-loop system dynamics $\delta\mathbf{x}_{t+1}^l = \bar{A}_t\delta\mathbf{x}_t^l + \epsilon\omega_t$.
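As a simple illustration of Proposition 1 (ours, not part of the original development), consider the linear-quadratic special case, where all the higher order terms vanish and the $O(\epsilon^2)$ gap can be computed exactly:

$$\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t + \epsilon\omega_t, \quad \mathbf{u}_t = \bar{\mathbf{u}}_t + K\delta\mathbf{x}_t, \quad \omega_t \sim \mathcal{N}(0, W).$$

Here the perturbation covariance obeys $\Sigma_{t+1} = (A + BK)\Sigma_t(A + BK)^T + \epsilon^2 W$ with $\Sigma_0 = 0$, so $\Sigma_t = O(\epsilon^2)$, and for a quadratic cost $\sum_t (\mathbf{x}_t^T Q\mathbf{x}_t + \mathbf{u}_t^T R\mathbf{u}_t)$,

$$\mathbb{E}[J^{\pi}] = \bar{J}^{\pi} + \sum_t \mathrm{tr}\big((Q + K^T R K)\,\Sigma_t\big) = \bar{J}^{\pi} + O(\epsilon^2),$$

in agreement with (7).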

Proposition 1 and the above observations suggest that an optimal open-loop control sequence with a suitable linear feedback law (for the perturbed linear system) wrapped around it is approximately optimal. The following subsection summarizes the problems to be solved for each of them.

4.1 Decoupled Policy for Feedback Control

Open-Loop Trajectory Design: The open-loop trajectory is designed by solving the noiseless equivalent of the stochastic optimal control problem (1), as shown below:

$$\underset{\mathbb{T}_{nom}}{\textnormal{min}}~\sum_{t=0}^{N-1} c_t(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t) + c_N(\bar{\mathbf{x}}_N),$$
$$\textnormal{s.t.}~\bar{\mathbf{x}}_{t+1} = f(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t).$$

Feedback Law: A linear feedback law is determined by solving for the optimal feedback gains $K^*_{1:N-1}$ that minimize the variance of the cost corresponding to the linear perturbation system around the optimal nominal trajectory $\mathbb{T}^*_{nom}$ found above, as follows:

$$\underset{\{K_{1:N-1}\}}{\textnormal{min}}~Var(\delta J_1^{\pi}),$$
$$\delta J_1^{\pi} = \sum_{t=1}^{N}\bar{C}_t\delta\mathbf{x}_t^l,$$
$$\delta\mathbf{x}_{t+1}^l = (A_t + B_t K_t)\delta\mathbf{x}_t^l + \epsilon\omega_t.$$

Albeit convex, the above problem does not have a standard solution. However, since we are really only interested in a good variance rather than the optimal variance, we can solve an inexpensive time-varying LQR problem as a surrogate to reduce the variance of the cost-to-go. The corresponding problem can be posed as follows:

$$\underset{\delta\mathbf{u}_{1:N}}{\textnormal{min}}~\mathbb{E}\Big[\sum_{t=1}^{N-1}\delta\mathbf{x}_t^T Q_t\,\delta\mathbf{x}_t + \delta\mathbf{u}_t^T R_t\,\delta\mathbf{u}_t + \delta\mathbf{x}_N^T Q_N\,\delta\mathbf{x}_N\Big],$$
$$\textnormal{s.t.}~\delta\mathbf{x}_{t+1} = A_t\delta\mathbf{x}_t + B_t\delta\mathbf{u}_t + \epsilon\omega_t.$$
Remark 4.1.

In fact, if the feedback gain is designed carefully rather than via the heuristic LQR above, then one can obtain $O(\epsilon^4)$ near-optimality. This requires some further development; however, we leave this result out of this paper due to the paucity of space.

5 ALGORITHM

This section presents our decoupled data-based control (D2C-2.0) algorithm. There are two main steps involved in solving for the requisite policy:

  1. Design an open-loop nominal trajectory by solving the optimization problem via a model-free ILQR based approach, i.e., without explicitly making use of an analytic model of the system. This is described in subsection 5.1. This is the primary difference from the prequel D2C, where we used a first order gradient based technique to solve the optimal open-loop control problem.

  2. Determine the parameters of the time-varying linear perturbation system about its optimal nominal trajectory and design an LQR controller based on the system identification. This step is detailed in subsection 5.2.

5.1 Open-Loop Trajectory Design

An open-loop trajectory is designed by solving the corresponding problem in subsection 4.1. However, since we are solving it in a model-free manner, we need a suitable formulation that can efficiently compute the solution.
In this subsection, we present an ILQR based model-free method to solve the open-loop optimization problem. The advantage of ILQR is that its equations are explicit in the system dynamics and their gradients. Hence, in order to make it a model-free algorithm, it suffices to obtain estimates of the Jacobians explicitly. A sample-efficient way of doing this is described in subsection 5.1.1. Since ILQR/DDP is a well-established framework, we skip the details and instead present the essential equations in Algorithms 1, 2 and 3, where we also reflect the choice of our regularization scheme.

5.1.1 Estimation of Jacobians in a Model-Free Setting

From the Taylor expansions of $f$ about the nominal trajectory $(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$ on both the positive and the negative sides, we obtain the following central difference equation:

$$f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t) = 2\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix}\begin{bmatrix} \delta\mathbf{x}_t \\ \delta\mathbf{u}_t \end{bmatrix} + O(\|\delta\mathbf{x}_t\|^3 + \|\delta\mathbf{u}_t\|^3) \quad (9)$$

Multiplying both sides of the above equation by $\begin{bmatrix} \delta\mathbf{x}_t^T & \delta\mathbf{u}_t^T \end{bmatrix}$:

$$\big[f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t)\big]\begin{bmatrix} \delta\mathbf{x}_t^T & \delta\mathbf{u}_t^T \end{bmatrix} = 2\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix}\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix} + O(\|\delta\mathbf{x}_t\|^4 + \|\delta\mathbf{u}_t\|^4)$$

Assuming that $\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix}$ is invertible (which will later be shown to hold under some assumptions), let us perform the inversion on both sides of the above equation as follows:

$$\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix} = \frac{1}{2}\big[f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t)\big]\begin{bmatrix} \delta\mathbf{x}_t^T & \delta\mathbf{u}_t^T \end{bmatrix}\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix}^{-1} + O(\|\delta\mathbf{x}_t\|^2 + \|\delta\mathbf{u}_t\|^2) \quad (10)$$

Equation (10) solves for the Jacobians $f_{\mathbf{x}_t}$ and $f_{\mathbf{u}_t}$ simultaneously. Note that the above formulation requires only 2 evaluations of $f(\cdot)$, given the nominal state and control $(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$. The remaining terms (i.e., the error in the estimate) are of an order that is quadratic in $\delta\mathbf{x}_t$ and $\delta\mathbf{u}_t$. The following extends equation (10) for use in practical implementations (sampling).
We are free to choose the distribution of $\delta\mathbf{x}_t$ and $\delta\mathbf{u}_t$. We assume both are i.i.d. Gaussian random variables with zero mean and a standard deviation of $\sigma$. This ensures that $\begin{bmatrix} \delta\mathbf{x}_t\delta\mathbf{x}_t^T & \delta\mathbf{x}_t\delta\mathbf{u}_t^T \\ \delta\mathbf{u}_t\delta\mathbf{x}_t^T & \delta\mathbf{u}_t\delta\mathbf{u}_t^T \end{bmatrix}$ is invertible. More on the advantage of using this distribution is elaborated in the next paragraph. Let $n_s$ be the number of samples of each of the random variables $\delta\mathbf{x}_t$ and $\delta\mathbf{u}_t$, stacked as $\delta\mathbf{X}_t = \begin{bmatrix} \delta\mathbf{x}_t^1 & \delta\mathbf{x}_t^2 & \ldots & \delta\mathbf{x}_t^{n_s} \end{bmatrix}$ and $\delta\mathbf{U}_t = \begin{bmatrix} \delta\mathbf{u}_t^1 & \delta\mathbf{u}_t^2 & \ldots & \delta\mathbf{u}_t^{n_s} \end{bmatrix}$, respectively. Then $\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix}$ is given by the following:

$$\begin{bmatrix} f_{\mathbf{x}_t} & f_{\mathbf{u}_t} \end{bmatrix} = \frac{1}{2}\begin{bmatrix} f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t^1, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t^1) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t^1, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t^1) & \cdots & f(\bar{\mathbf{x}}_t + \delta\mathbf{x}_t^{n_s}, \bar{\mathbf{u}}_t + \delta\mathbf{u}_t^{n_s}) - f(\bar{\mathbf{x}}_t - \delta\mathbf{x}_t^{n_s}, \bar{\mathbf{u}}_t - \delta\mathbf{u}_t^{n_s}) \end{bmatrix}\begin{bmatrix} \delta\mathbf{X}_t^T & \delta\mathbf{U}_t^T \end{bmatrix}\begin{bmatrix} \delta\mathbf{X}_t\delta\mathbf{X}_t^T & \delta\mathbf{X}_t\delta\mathbf{U}_t^T \\ \delta\mathbf{U}_t\delta\mathbf{X}_t^T & \delta\mathbf{U}_t\delta\mathbf{U}_t^T \end{bmatrix}^{-1}, \quad (11)$$

where the central differences are stacked column-wise and the factor $\frac{1}{2}$ carries over from equation (10).

Let us consider the terms in the matrix $\delta\mathbf{XU}_t = \begin{bmatrix} \delta\mathbf{X}_t\delta\mathbf{X}_t^T & \delta\mathbf{X}_t\delta\mathbf{U}_t^T \\ \delta\mathbf{U}_t\delta\mathbf{X}_t^T & \delta\mathbf{U}_t\delta\mathbf{U}_t^T \end{bmatrix}$. Note that $\delta\mathbf{X}_t\delta\mathbf{X}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{x}_t^i}^T$, and similarly $\delta\mathbf{U}_t\delta\mathbf{U}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{u}_t^i}^T$, $\delta\mathbf{U}_t\delta\mathbf{X}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{x}_t^i}^T$ and $\delta\mathbf{X}_t\delta\mathbf{U}_t^T = \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{u}_t^i}^T$. From the definition of the sample variance, we can write the above matrix as

$$\delta\mathbf{XU}_t = \begin{bmatrix} \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{x}_t^i}^T & \sum_{i=1}^{n_s}\delta\mathbf{x}_t^i{\delta\mathbf{u}_t^i}^T \\ \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{x}_t^i}^T & \sum_{i=1}^{n_s}\delta\mathbf{u}_t^i{\delta\mathbf{u}_t^i}^T \end{bmatrix} \approx \begin{bmatrix} \sigma^2(n_s-1)\,\text{I}_{n_x} & 0_{n_x\times n_u} \\ 0_{n_u\times n_x} & \sigma^2(n_s-1)\,\text{I}_{n_u} \end{bmatrix} = \sigma^2(n_s-1)\,\text{I}_{(n_x+n_u)\times(n_x+n_u)}$$

Given a sufficiently large number of samples $n_s$ (typically, slightly more than $n_x + n_u$ is sufficient), the above approximation holds well. Since an identity matrix is trivially invertible, the matrix in equation (11) is invertible. Thus, one can calculate $f_{\mathbf{x}}$ and $f_{\mathbf{u}}$ this way during the backward pass. For convenience in the later sections, let us refer to this method as 'Linear Least Squares by Central Difference (LLS-CD)'. The entire algorithm to determine the optimal nominal trajectory in a model-free fashion is summarized in Algorithms 1, 2 and 3.
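For concreteness, the following is a minimal Python/NumPy sketch (ours, not from the original text) of the LLS-CD estimate in equation (11); `step_fn(x, u)` stands for one black-box simulator step, and the explicit inverse could equivalently be replaced by the $\sigma^2(n_s-1)\,\text{I}$ approximation derived above:

```python
import numpy as np

def lls_cd_jacobian(step_fn, x_bar, u_bar, n_samples=None, sigma=1e-3, rng=None):
    """LLS-CD estimate of [f_x, f_u] at (x_bar, u_bar), cf. Eq. (11).

    step_fn(x, u) -> next state, treated as a black-box simulator call.
    Returns (f_x, f_u) with shapes (n_x, n_x) and (n_x, n_u).
    """
    rng = np.random.default_rng() if rng is None else rng
    x_bar, u_bar = np.asarray(x_bar, float), np.asarray(u_bar, float)
    n_x, n_u = x_bar.size, u_bar.size
    n_s = n_samples or (n_x + n_u + 5)            # slightly more than n_x + n_u samples

    dX = sigma * rng.standard_normal((n_x, n_s))  # i.i.d. zero-mean Gaussian perturbations
    dU = sigma * rng.standard_normal((n_u, n_s))

    # Central differences of the dynamics, one column per perturbation sample.
    dF = np.column_stack([
        step_fn(x_bar + dX[:, i], u_bar + dU[:, i]) -
        step_fn(x_bar - dX[:, i], u_bar - dU[:, i])
        for i in range(n_s)
    ])                                            # shape (n_x, n_s)

    dXU = np.vstack([dX, dU])                     # shape (n_x + n_u, n_s)
    # Least-squares solution of dF ~= 2 [f_x f_u] dXU; the 1/2 follows from Eq. (10).
    J = 0.5 * dF @ dXU.T @ np.linalg.inv(dXU @ dXU.T)
    return J[:, :n_x], J[:, n_x:]
```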

Input: Initial state $\mathbf{x}_0$, system parameters $\mathcal{P}$.
$k \leftarrow 1$.   /* Initialize the iteration number $k$ to 1. */
$forward\_pass\_flag$ = true.
/* Run until the change in cost between subsequent iterations is less than an $\epsilon$ fraction of the previous cost. */
while $k == 1$ or $\big(\text{cost}(\mathbb{T}_{nom}^{k}) / \text{cost}(\mathbb{T}_{nom}^{k-1})\big) < 1 + \epsilon$ do
       /* Each iteration has a backward pass followed by a forward pass. */
       $\{k^{k}_{0:N-1}, K^{k}_{0:N-1}\}$, $backward\_pass\_success\_flag$ = Backward Pass($\mathbb{T}_{nom}^{k}$, $\mathcal{P}$).
       if $backward\_pass\_success\_flag$ == true then
              $\mathbb{T}_{nom}^{k+1}$, $forward\_pass\_flag$ = Forward Pass($\mathbb{T}_{nom}^{k}$, $\{k^{k}_{0:N-1}, K^{k}_{0:N-1}\}$, $\mathcal{P}$).
              while $forward\_pass\_flag$ == false do
                     Reduce $\alpha$ in $\mathcal{P}$.   /* Line-search step */
                     $\mathbb{T}_{nom}^{k+1}$, $forward\_pass\_flag$ = Forward Pass($\mathbb{T}_{nom}^{k}$, $\{k^{k}_{0:N-1}, K^{k}_{0:N-1}\}$, $\mathcal{P}$).
              end while
       else
              Increase $\mu$ in $\mathcal{P}$.   /* Regularization step */
       end if
       $k \leftarrow k + 1$.
       $\mathbb{T}^{*}_{nom} \leftarrow \mathbb{T}^{k+1}_{nom}$.
end while
return $\mathbb{T}^{*}_{nom}$
Algorithm 1: Open-loop trajectory optimization via model-free DDP
Input: Nominal trajectory $\mathbb{T}_{nom}^{k}$, previous iteration policy parameters $\{k_{0:N-1}, K_{0:N-1}\}$, and system and cost parameters $\mathcal{P}$.
$\{\bar{\mathbf{x}}_t^{prev}, \bar{\mathbf{u}}_t^{prev}\} \leftarrow \mathbb{T}_{nom}^{k}$.
$t \leftarrow 0$.
while $t < N$ do
       /* Simulate one step forward ($\alpha$ is the line-search parameter). */
       $\bar{\mathbf{u}}_t = \bar{\mathbf{u}}_t^{prev} + \alpha k_t + K_t(\bar{\mathbf{x}}_t - \bar{\mathbf{x}}_t^{prev})$,
       $\bar{\mathbf{x}}_{t+1} = simulate\_forward\_step(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$.
       $t \leftarrow t + 1$.
end while
$\mathbb{T}_{nom}^{k+1} \leftarrow \{\bar{\mathbf{x}}_{0:N}, \bar{\mathbf{u}}_{0:N-1}\}$.
if the change from $\mathbb{T}_{nom}^{k}$ to $\mathbb{T}_{nom}^{k+1}$ is acceptable then
       return $\mathbb{T}_{nom}^{k+1}$, true.
else
       return $\mathbb{T}_{nom}^{k}$, false.
end if
Algorithm 2: Forward Pass
Input: Nominal trajectory $\mathbb{T}_{nom}^{k}$, previous iteration policy parameters $\{k_{0:N-1}, K_{0:N-1}\}$, horizon $N$, and system and cost parameters $\mathcal{P}$.
/* The backward pass starts from the final time-step, i.e., $N-1$. */
$t \leftarrow N - 1$.
Compute $J_{\mathbf{x}_N}$ and $J_{\mathbf{x}_N\mathbf{x}_N}$ using the boundary conditions.
/* Keep a copy of the previous policy gains. */
$k\_old \leftarrow k_{0:N-1}$ and $K\_old \leftarrow K_{0:N-1}$.
while $t \geq 0$ do
       /* Obtain the Jacobians from simulator rollouts as shown in equation (11): */
       $f_{\mathbf{x}_t}, f_{\mathbf{u}_t} \leftarrow model\_free\_jacobian(\bar{\mathbf{x}}_t, \bar{\mathbf{u}}_t)$.
       /* Obtain the partials of the Q function as follows: */
       $Q_{\mathbf{x}_t} = c_{\mathbf{x}_t} + f_{\mathbf{x}_t}^T J'_{\mathbf{x}_{t+1}}$,
       $Q_{\mathbf{u}_t} = c_{\mathbf{u}_t} + f_{\mathbf{u}_t}^T J'_{\mathbf{x}_{t+1}}$,
       $Q_{\mathbf{x}_t\mathbf{x}_t} = c_{\mathbf{x}_t\mathbf{x}_t} + f_{\mathbf{x}_t}^T J'_{\mathbf{x}_{t+1}\mathbf{x}_{t+1}} f_{\mathbf{x}_t}$,
       $Q_{\mathbf{u}_t\mathbf{x}_t} = c_{\mathbf{u}_t\mathbf{x}_t} + f_{\mathbf{u}_t}^T (J'_{\mathbf{x}_{t+1}\mathbf{x}_{t+1}} + \mu I_{n_x\times n_x}) f_{\mathbf{x}_t}$,
       $Q_{\mathbf{u}_t\mathbf{u}_t} = c_{\mathbf{u}_t\mathbf{u}_t} + f_{\mathbf{u}_t}^T (J'_{\mathbf{x}_{t+1}\mathbf{x}_{t+1}} + \mu I_{n_x\times n_x}) f_{\mathbf{u}_t}$.
       if $Q_{\mathbf{u}_t\mathbf{u}_t}$ is positive-definite then
              $k_t = -Q_{\mathbf{u}_t\mathbf{u}_t}^{-1} Q_{\mathbf{u}_t}$,
              $K_t = -Q_{\mathbf{u}_t\mathbf{u}_t}^{-1} Q_{\mathbf{u}_t\mathbf{x}_t}$.
       else
              /* If $Q_{\mathbf{u}_t\mathbf{u}_t}$ is not positive-definite, abort the backward pass. */
              return $\{k\_old, K\_old\}$, false.
       end if
       /* Obtain the partials of the value function $J_t$ as follows: */
       $J_{\mathbf{x}_t} = Q_{\mathbf{x}_t} + K_t^T Q_{\mathbf{u}_t\mathbf{u}_t} k_t + K_t^T Q_{\mathbf{u}_t} + Q_{\mathbf{u}_t\mathbf{x}_t}^T k_t$,
       $J_{\mathbf{x}_t\mathbf{x}_t} = Q_{\mathbf{x}_t\mathbf{x}_t} + K_t^T Q_{\mathbf{u}_t\mathbf{u}_t} K_t + K_t^T Q_{\mathbf{u}_t\mathbf{x}_t} + Q_{\mathbf{u}_t\mathbf{x}_t}^T K_t$.
       $t \leftarrow t - 1$.
end while
$k\_new = k_{0:N-1}$, $K\_new = K_{0:N-1}$.
return $\{k\_new, K\_new\}$, true.
Algorithm 3: Backward Pass

5.2 Feedback Law

As discussed in the previous section, a time-varying LQR feedback law is used as a surrogate for the minimum variance problem. The LQR gains $\{K_{1:N-1}\}$ can be fully determined if $Q_t$, $R_t$, $A_t$ and $B_t$ are known. The first two are tuning parameters in the cost function. Determining $A_t$ and $B_t$ of the linear perturbation system is a system identification problem that can be solved by LLS-CD, as described in subsection 5.1.1.
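A minimal sketch of this step (ours, assuming constant stage weights $Q$ and $R$; time-varying weights are a straightforward extension) is given below, where `A` and `B` are the lists of Jacobians identified by LLS-CD along the optimal nominal trajectory:

```python
import numpy as np

def tvlqr_gains(A, B, Q, R, Q_N):
    """Backward Riccati recursion for the time-varying LQR surrogate.

    A[t], B[t]: LLS-CD estimates of A_t, B_t along the nominal trajectory.
    Q, R: stage cost weights; Q_N: terminal weight.
    Returns the feedback gains K_t such that delta_u_t = K_t @ delta_x_t.
    """
    N = len(A)
    P = Q_N
    K = [None] * N
    for t in reversed(range(N)):
        S = R + B[t].T @ P @ B[t]
        K[t] = -np.linalg.solve(S, B[t].T @ P @ A[t])   # K_t = -(R + B^T P B)^{-1} B^T P A
        P = Q + A[t].T @ P @ (A[t] + B[t] @ K[t])       # Riccati update for P_t
    return K
```

The closed-loop policy is then $\mathbf{u}_t = \bar{\mathbf{u}}_t^* + K_t(\mathbf{x}_t - \bar{\mathbf{x}}_t^*)$, i.e., the decoupled parameterization of section 4.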

6 SIMULATIONS

This section reports the implementation details of D2C-2.0 in simulation on various systems and the corresponding training results. We train systems of varying complexity in terms of the dimension of the state and action spaces. The physical models of the systems are deployed in the simulation platform 'MuJoCo-2.0' [28] as a surrogate for their analytical models. The models are imported from the OpenAI gym [29] and Deepmind's control suite [30].

Figure 1 (left) shows the open-loop training plots of various systems under consideration. The data is averaged over 5 experimental runs to reflect the training variance and the repeatability of D2C-2.0 over multiple training sessions. One might notice that the episodic cost is not monotonically decreasing during training. This is because of the flexibility provided in training for the episodic cost to stay within a ‘band’ around the cost at the previous iteration. Though this might momentarily increase the cost in subsequent training iterations, it is observed to have improved the overall speed of convergence.

Figure 1: (Left) Episodic cost vs. training iteration number in D2C-2.0 for (a) Inverted Pendulum, (b) Cart-Pole, (c) 3-link Swimmer, (d) 6-link Swimmer and (e) Robotic Fish. (Right) Terminal state MSE during testing in D2C-2.0 vs. DDPG for (f) Inverted Pendulum, (g) Cart-Pole, (h) 3-link Swimmer, (i) 6-link Swimmer and (j) Fish.

7 DISCUSSION

In this section, we compare D2C-2.0 with D2C and DDPG, and discuss the merits and demerits of each of the methods.

The manuscript on D2C provided an in-depth comparison of D2C with DDPG. There, we showed that training in D2C is much more sample-efficient and fares well w.r.t. the training time and the reproducibility of the resulting policy.

Table 1: Comparison of the simulation parameters and training outcomes of D2C-2.0 with other baselines. Training time (in sec.).

System              D2C open-loop   D2C closed-loop   DDPG        D2C-2.0
Inverted Pendulum   12.9            <0.1              2261.15     0.33
Cart pole           15.0            1.33              6306.7      1.62
3-link Swimmer      4001.0          4.6               38833.64    186.2
6-link Swimmer      3585.4          16.4              88160       127.2
Fish                6011.2          75.6              124367.6    54.8

Table 1 shows the training times of D2C, DDPG and D2C-2.0. It is evident from the table that for simple, lower dimensional systems such as the pendulum and the cart pole, the policy can almost be computed online (parallelization could make it much faster). This is due to the tendency of ILQR/DDP to converge quickly when the linearization of the system model has a large basin of attraction (such as the pendulum, which is, in fact, often approximated as linear about its resting state). Moreover, for a higher dimensional system such as the robotic fish, D2C-2.0 took 54.8 seconds while the original D2C took around 6096 seconds, i.e., approximately 90 times longer. The table makes it evident that D2C-2.0 clearly exhibits much faster convergence than both D2C and DDPG, while D2C is in turn much faster than DDPG. The primary reason for the much larger training times in DDPG is the complex parameterization of the feedback policy in terms of deep neural networks, which runs into hundreds of thousands of parameters, whereas D2C's parameterization, primarily the open-loop control sequence of dimension $pT$, where $p$ is the number of inputs and $T$ is the time horizon of the problem, runs from tens to a few thousands of parameters (see Table 2).

Table 2: Parameter size comparison between D2C and DDPG.

System              No. of steps   No. of actuators   No. of parameters    No. of parameters
                                                      optimized in D2C     optimized in DDPG
Inverted Pendulum   30             1                  30                   244002
Cart pole           30             1                  30                   245602
3-link Swimmer      1600           2                  3200                 251103
6-link Swimmer      1500           5                  7500                 258006
Fish                1200           6                  7200                 266806

A major reason behind the differences in training efficiency stems from their formulations. D2C is based on gradient descent, which is linearly convergent; it is a direct and generic method of solving an optimization problem. The equations involved in D2C-2.0, on the other hand, arise from the optimality conditions that inherently exploit the recursive optimal structure and the causality in the problem. Moreover, the equations are explicit in the Jacobians of the system model. As a result, it is possible to solve for them independently and in a sample-efficient manner, at every time-step of each iteration. However, one should expect that as we consider very high dimensional systems, the algorithmic complexity of D2C-2.0 could outgrow that of D2C, because D2C-2.0 is $O(n_x^2)$ in its computational complexity. Also, D2C, by the nature of its formulation, does not need the state of the system and hence is generalizable to problems with partial state observability.

The closed-loop performance of D2C-2.0 w.r.t. that of DDPG is shown in Figure 1 (right). Noise is injected into the system via the control channel by varying the noise scaling factor $\epsilon$ and then measuring the performance. We note here that DDPG fails to converge for the robotic fish problem, and thus its nominal performance ($\epsilon = 0$) is poor. We see that the performance of D2C is better than that of DDPG when the noise levels are low; however, the performance of DDPG remains relatively similar at higher levels of noise while that of D2C degrades. In our opinion, this closed-loop performance aspect of the methods needs to be studied in more detail in order to make a definitive judgement, and is left for future work. For instance, by the time DDPG overtakes D2C in terms of average performance, the actual performance of both systems is quite poor, and thus it is not clear whether this can be construed as an advantage.

Now, as mentioned in section 2, earlier works that attempted model-free ILQR have mostly been confined to finite differences for the Jacobian computation. Table 3 shows the comparison of per-iteration computational times between finite differences (FD) and the linear least squares with central difference formulation (LLS-CD). It is clearly evident that, as the dimension of the state space increases, the method of finite differences requires many more function evaluations, and hence our LLS-CD method is much more efficient.

Table 3: Comparison of the computational times (in sec.) per iteration (averaged over 5 runs).

System              FD       LLS-CD
Inverted Pendulum   0.017    0.033
Cart pole           0.0315   0.0463
Acrobot             0.554    0.625
3-link Swimmer      4.39     1.86
6-link Swimmer      14.43    1.278
Fish                34.75    2.74

8 Conclusion

To conclude the discussion of the global vs. local policy approximation dilemma raised in section 1, how well deep RL methods extrapolate their policy to the entire continuous state and action space remains an open question. Note also that the results shown in Table 1 are based on serial implementations on an off-the-shelf computer, while D2C-2.0 is highly parallelizable. Given the order of magnitude of the numbers in the same table, we expect that, with a reasonable augmentation of computational power by parallelization, D2C-2.0 could offer a near real-time solution to high dimensional problems. In such cases, one could rely on the near-optimal policy described in this paper at low noise levels, and re-solve for the open-loop trajectory online, along with the attendant feedback, whenever the noise takes the system out of the region of attraction of the linear feedback policy. This will be our motivation for future work, along with addressing the following limitation.
Limitations: D2C-2.0, in its current form, is not reliable for hybrid models, such as legged robots, that involve impacts. Primarily, the issue is in the feedback design, which requires smoothness of the underlying dynamics, while legged systems are only piecewise smooth. Future work shall model impacts so as to tractably incorporate them when estimating the required gradients. This seems plausible, as ILQR has been used to design controllers for such systems in the literature [25].

References

  • [1] W. B. Powell (2007). Approximate Dynamic Programming: Solving the curses of dimensionality. John Wiley & Sons.
  • [2] P.A. Ioannou and J. Sun (2012). Robust adaptive control, Courier Corporation.
  • [3] K.J. Astrom and B. Wittenmark (2012). Adaptive control, Courier Corporation.
  • [4] R.S. Sutton, and A.G. Barto (2018). Reinforcement learning: An introduction, MIT press.
  • [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran , D. Wierstra, S. Legg, D. Hassabis (2015). Human-level control through deep reinforcement learning, Nature. vol. 518, pp. 529-33.
  • [6] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al. (2016). Mastering the game of go with deep neural networks and tree search, Nature, vol. 529, no. 7587, pp. 484.
  • [7] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016). Continuous control with deep reinforcement learning, International Conference on Learning Representations.
  • [8] V. Kumar, E. Todorov, and S. Levine (2016). Optimal Control with Learned Local Models: Application to Dexterous Manipulation, International Conference for Robotics and Automation (ICRA).
  • [9] D. Mitrovic, S. Klanke, and S. Vijayakumar (2010). Adaptive Optimal Feedback Control with Learned Internal Dynamics Models, In: Sigaud O., Peters J. (eds) From Motor Learning to Interaction Learning in Robots. Studies in Computational Intelligence, vol 264. Springer, Berlin, Heidelberg, pp. 65-84.
  • [10] P. Abbeel, M. Quigley, and A.Y. Ng (2006). Using inaccurate models in reinforcement learning, Proceedings of the 23rd international conference on Machine learning (ICML ’06). ACM, New York, NY, USA, 1-8. DOI: https://doi.org/10.1145/1143844.1143845.
  • [11] Y. Wu, E. Mansimov, S. Liao, R. Grosse and J. Ba (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  • [12] J. Schulman, S. Levine and P. Moritz, M. I. Jordan and P. Abbeel (2015). Trust Region Policy Optimization. arXiv:1502.05477.
  • [13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  • [14] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018). Deep reinforcement learning that matters, in Thirty Second AAAI Conference on Artificial Intelligence.
  • [15] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control, Reproducibility in Machine Learning Workshop, ICML.
  • [16] A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade (2017). Towards generalization and simplicity in continuous control, Advances in Neural Information Processing Systems, pp. 6550-6561.
  • [17] R. Wang, K.S. Parunandi, D. Yu, D. Kalathil and S. Chakravorty (2019). Decoupled Data Based Approach for Learning to Control Nonlinear Dynamical Systems, arXiv:1904.08361, under review for IEEE Tr. Aut. Control.
  • [18] M. Rafieisakhaei, S. Chakravorty, and P.R. Kumar (2017). A Near-optimal separation principle for nonlinear stochastic systems arising in robotic path planning and control, IEEE Conference on Decision and Control(CDC).
  • [19] K. S. Parunandi, S. Chakravorty (2019). T-PFC: A Trajectory-Optimized Perturbation Feedback Control Approach, IEEE Robotics and Automation Letters, vol. 4, pp. 3457-3464.
  • [20] D. H. Jacobson and D. Q. Mayne (1970). Differential Dynamic Programming, Elsevier.
  • [21] W. Li and E. Todorov (2007). Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system, International Journal of Control, vol. 80, no. 9, pp. 1439-1453.
  • [22] W. Li, and E. Todorov (2004). Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems, ICINCO.
  • [23] E. Todorov, and W. Li (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. Proceedings of the 24th American Control Conference, volume 1 of (ACC 2005), pp. 300 - 306.
  • [24] Z. Xie, C. K. Liu, and K. Hauser (2017). Differential dynamic programming with nonlinear constraints, International Conference on Robotics and Automation (ICRA), pp. 695-702.
  • [25] Y. Tassa, T. Erez, and E. Todorov (2012). Synthesis and stabilization of complex behaviors through online trajectory optimization, International Conference on Intelligent Robots and Systems, pp. 4906-4913.
  • [26] S. Levine, and V. Koltun (2013). Guided policy search, Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML’13), Sanjoy Dasgupta and David McAllester (Eds.), Vol. 28. JMLR.org III-1-III-9.
  • [27] J. C. Spall (1998). An overview of the Simultaneous Perturbation Method for efficient optimization, Johns Hopkins APL Technical Digest, vol. 19, no. 4, pp. 482-492.
  • [28] E. Todorov, T. Erez, and Y. Tassa (2012). MuJoCo: A physics engine for model-based control, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033.
  • [29] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016). OpenAI Gym, arXiv preprint arXiv:1606.01540.
  • [30] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller (2018). DeepMind control suite, arXiv preprint arXiv:1801.00690.

9 Supplementary

Proof of Lemma 1:

We proceed by induction. The first general instance of the recursion occurs at $t = 3$. It can be shown that:

$$\delta\mathbf{x}_3 = \underbrace{(\bar{A}_2\bar{A}_1(\epsilon w_0) + \bar{A}_2(\epsilon w_1) + \epsilon w_2)}_{\delta\mathbf{x}_3^l} + \underbrace{\{\bar{A}_2\bar{S}_1(\epsilon w_0) + \bar{S}_2(\bar{A}_1(\epsilon w_0) + \epsilon w_1 + \bar{S}_1(\epsilon w_0))\}}_{\bar{\bar{S}}_3}. \quad (12)$$

Noting that $\bar{S}_1(\cdot)$ and $\bar{S}_2(\cdot)$ are second and higher order terms, it follows that $\bar{\bar{S}}_3$ is $O(\epsilon^2)$.
Suppose now that $\delta\mathbf{x}_t = \delta\mathbf{x}_t^l + \bar{\bar{S}}_t$, where $\bar{\bar{S}}_t$ is $O(\epsilon^2)$. Then:

$$\delta\mathbf{x}_{t+1} = \bar{A}_{t+1}(\delta\mathbf{x}_t^l + \bar{\bar{S}}_t) + \epsilon w_t + \bar{S}_{t+1}(\delta\mathbf{x}_t) = \underbrace{(\bar{A}_{t+1}\delta\mathbf{x}_t^l + \epsilon w_t)}_{\delta\mathbf{x}_{t+1}^l} + \underbrace{\{\bar{A}_{t+1}\bar{\bar{S}}_t + \bar{S}_{t+1}(\delta\mathbf{x}_t)\}}_{\bar{\bar{S}}_{t+1}}. \quad (13)$$

Noting that $\bar{S}_{t+1}$ is $O(\epsilon^2)$ and that $\bar{\bar{S}}_t$ is $O(\epsilon^2)$ by the induction hypothesis, the result follows.

Proof of Proposition 1:

From (6), we get

$$\tilde{J}^{\pi} = \mathbb{E}[J^{\pi}] = \mathbb{E}[\bar{J}^{\pi} + \delta J_1^{\pi} + \delta J_2^{\pi}] = \bar{J}^{\pi} + \mathbb{E}[\delta J_2^{\pi}] = \bar{J}^{\pi} + O(\epsilon^2). \quad (14)$$

The second equality above follows from the fact that $\mathbb{E}[\delta\mathbf{x}_t^l] = 0$, since it is the linear part of the state perturbation driven by white noise and, by definition, $\delta\mathbf{x}_1^l = 0$. The last equality follows from the fact that $\delta J_2^{\pi}$ is an $O(\epsilon^2)$ function. Now,

$$\text{Var}(J^{\pi}) = \mathbb{E}[J^{\pi} - \tilde{J}^{\pi}]^2 = \mathbb{E}[\bar{J}_0^{\pi} + \delta J_1^{\pi} + \delta J_2^{\pi} - \bar{J}_0^{\pi} - \delta\tilde{J}_2^{\pi}]^2 = \text{Var}(\delta J_1^{\pi}) + \text{Var}(\delta J_2^{\pi}) + 2\,\mathbb{E}[\delta J_1^{\pi}\delta J_2^{\pi}]. \quad (15)$$

Since $\delta J_2^{\pi}$ is $O(\epsilon^2)$, $\text{Var}(\delta J_2^{\pi})$ is an $O(\epsilon^4)$ function. It can be shown that $\mathbb{E}[\delta J_1^{\pi}\delta J_2^{\pi}]$ is $O(\epsilon^4)$ as well (the proof is given in Lemma 2). Finally, $\text{Var}(\delta J_1^{\pi})$ is an $O(\epsilon^2)$ function because $\delta\mathbf{x}^l$ is an $O(\epsilon)$ function. Combining these gives the desired result.

Lemma 2: Let $\delta J_1^{\pi}$, $\delta J_2^{\pi}$ be as defined in (6). Then, $\mathbb{E}[\delta J_1\delta J_2]$ is an $O(\epsilon^4)$ function.

Proof of Lemma 2:
In the following, we suppress the explicit dependence on $\pi$ in $\delta J_1^{\pi}$ and $\delta J_2^{\pi}$ for notational convenience. Recall that $\delta J_1 = \sum_{t=0}^{T} c_t^x\delta\mathbf{x}_t^l$, and $\delta J_2 = \sum_{t=0}^{T}\bar{H}_t(\delta\mathbf{x}_t) + c_t^x\bar{\bar{S}}_t$. For notational convenience, let us consider the scalar case; the vector case follows readily at the expense of more elaborate notation. Let us first consider $\bar{\bar{S}}_2$. We have that $\bar{\bar{S}}_2 = \bar{A}_2\bar{S}_1(\epsilon w_0) + \bar{S}_2(\bar{A}_1(\epsilon w_0) + \epsilon w_1 + \bar{S}_1(\epsilon w_0))$. Then, it follows that:

$$\bar{\bar{S}}_2 = \bar{A}_2\bar{S}_1^{(2)}(\epsilon w_0)^2 + \bar{S}_2^{(2)}(\bar{A}_1\epsilon w_0 + \epsilon w_1)^2 + O(\epsilon^3), \quad (16)$$

where $\bar{S}_t^{(2)}$ represents the coefficient of the second order term in the expansion of $\bar{S}_t$. A similar observation holds for $\bar{H}_2(\delta\mathbf{x}_2)$, in that:

$$\bar{H}_2(\delta\mathbf{x}_2) = \bar{H}_2^{(2)}(\bar{A}_1(\epsilon w_0) + \epsilon w_1)^2 + O(\epsilon^3), \quad (17)$$

where $\bar{H}_t^{(2)}$ is the coefficient of the second order term in the expansion of $\bar{H}_t$. Note that $\epsilon w_0 = \delta\mathbf{x}_1^l$ and $\bar{A}_1(\epsilon w_0) + \epsilon w_1 = \delta\mathbf{x}_2^l$. Therefore, it follows that we may write:

$$\bar{H}_t(\delta\mathbf{x}_t) + C_t^x\bar{\bar{S}}_t = \sum_{\tau=0}^{t-1} q_{t,\tau}(\delta\mathbf{x}_{\tau}^l)^2 + O(\epsilon^3), \quad (18)$$

for suitably defined coefficients $q_{t,\tau}$. Therefore, it follows that

$$\delta J_2 = \sum_{t=1}^{T}\bar{H}_t(\delta\mathbf{x}_t) + C_t^x\bar{\bar{S}}_t = \sum_{\tau=0}^{T}\bar{q}_{T,\tau}(\delta\mathbf{x}_{\tau}^l)^2 + O(\epsilon^3), \quad (19)$$

for suitably defined $\bar{q}_{T,\tau}$. Therefore:

$$\delta J_1\delta J_2 = \sum_{t,\tau} C_{\tau}^x(\delta\mathbf{x}_{\tau}^l)\,\bar{q}_{T,t}(\delta\mathbf{x}_t^l)^2 + O(\epsilon^4). \quad (20)$$

Taking expectations on both sides:

$$\mathbb{E}[\delta J_1\delta J_2] = \sum_{t,\tau} C_{\tau}^x\bar{q}_{T,t}\,\mathbb{E}[\delta\mathbf{x}_{\tau}^l(\delta\mathbf{x}_t^l)^2] + O(\epsilon^4). \quad (21)$$

Break $\delta\mathbf{x}_t^l = (\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l) + \delta\mathbf{x}_{\tau}^l$, assuming $\tau < t$. Then, it follows that:

$$\mathbb{E}[\delta\mathbf{x}_{\tau}^l(\delta\mathbf{x}_t^l)^2] = \mathbb{E}[\delta\mathbf{x}_{\tau}^l(\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l)^2] + \mathbb{E}[(\delta\mathbf{x}_{\tau}^l)^3] + 2\,\mathbb{E}[(\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l)(\delta\mathbf{x}_{\tau}^l)^2] = \mathbb{E}[(\delta\mathbf{x}_{\tau}^l)^3], \quad (22)$$

where the first and last terms in the first equality drop out due to the independence of the increment $(\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l)$ from $\delta\mathbf{x}_{\tau}^l$, and the facts that $\mathbb{E}[\delta\mathbf{x}_t^l - \delta\mathbf{x}_{\tau}^l] = 0$ and $\mathbb{E}[\delta\mathbf{x}_{\tau}^l] = 0$. Since $\delta\mathbf{x}_{\tau}^l$ is the state of the linear system $\delta\mathbf{x}_{t+1}^l = \bar{A}_t\delta\mathbf{x}_t^l + \epsilon w_t$, it may again be shown that:

$$\mathbb{E}[(\delta x_{\tau}^l)^3] = \sum_{s_1, s_2, s_3} \Phi_{\tau, s_1}\Phi_{\tau, s_2}\Phi_{\tau, s_3}\,\mathbb{E}[w_{s_1}w_{s_2}w_{s_3}], \quad (23)$$

where $\Phi_{t,\tau}$ represents the state transition operator between times $\tau$ and $t$, which follows from the closed-loop dynamics. Now, due to the independence of the noise terms $w_t$, it follows that $\mathbb{E}[w_{s_1}w_{s_2}w_{s_3}] = 0$ regardless of $s_1, s_2, s_3$.
An analogous argument can be repeated for the case $\tau > t$. Therefore, it follows that $\mathbb{E}[\delta J_1\delta J_2] = O(\epsilon^4)$.

DDPG Algorithm: Deep Deterministic Policy Gradient (DDPG) is a policy-gradient based off-policy reinforcement learning algorithm that operates in continuous state and action spaces. It relies on two function approximation networks, one each for the actor and the critic. The critic network estimates the value $Q(s, a)$ given the state and the action taken, while the actor network outputs a policy given the current state. Both are represented by neural networks.

The off-policy characteristic of the algorithm employs a separate behavioural policy by introducing additive noise to the policy output obtained from the actor network. The critic network minimizes a loss based on the temporal-difference (TD) error and the actor network uses the deterministic policy gradient theorem to update its policy gradient, as shown below:

Critic update by minimizing the loss:

$$L = \frac{1}{N}\Sigma_i\big(y_i - Q(s_i, a_i|\theta^Q)\big)^2,$$

where $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ is the TD target computed using the target actor and critic networks.

Actor policy gradient update:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\Sigma_i\nabla_a Q(s, a|\theta^Q)\big|_{s=s_i, a=\mu(s_i)}\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})\big|_{s_i}$$

The actor and the critic networks consist of two hidden layers, with the first layer containing 400 ('relu'-activated) units followed by a second layer containing 300 ('relu'-activated) units. The output layer of the actor network has a number of ('tanh'-activated) units equal to the number of actions in the action space.

Target networks, one each for the actor and the critic, are employed for a gradual update of the network parameters, thereby reducing oscillations and providing better training stability. The target networks are softly updated with rate $\tau = 0.001$. Experience replay is another technique that improves the stability of training, by training the networks with batches of randomized data samples from the agent's experience. We have used a batch size of 32 for the inverted pendulum and the cart-pole examples, whereas it is 64 for the rest. Finally, the networks are compiled using the Adam optimizer with a learning rate of 0.001.
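As an illustration of the soft target update mentioned above (a sketch, not the exact implementation used here), the target parameters are moved a small step $\tau$ toward the learned parameters after every update:

```python
def soft_update(target_params, source_params, tau=0.001):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]
```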

To account for state-space exploration, the behavioural policy consists of an off-policy term arising from a random process. We obtain discrete samples from an Ornstein-Uhlenbeck (OU) process to generate noise, as done in the original DDPG method. The OU process has a mean-reverting property and produces temporally correlated noise samples as follows:

$$dx_t = \Theta(\mu - x_t)\,dt + \sigma\,dW$$

where $\Theta$ indicates how fast the process reverts to the mean, $\mu$ is the equilibrium or mean value and $\sigma$ corresponds to the degree of volatility of the process. $\Theta$ is set to 0.15, $\mu$ to 0 and $\sigma$ is annealed from 0.35 to 0.05 over the training process.
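A minimal sketch of the discretized OU exploration noise described above (an Euler-Maruyama step; the annealing of $\sigma$ is handled outside the class) is given below:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise for the DDPG behavioural policy."""
    def __init__(self, dim, theta=0.15, mu=0.0, sigma=0.35, dt=1.0, rng=None):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.x = np.full(dim, mu, dtype=float)
        self.rng = rng or np.random.default_rng()

    def sample(self):
        # Euler-Maruyama discretization of dx = theta * (mu - x) dt + sigma dW
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + dx
        return self.x
```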