Optimizing Trajectories with Closed-Loop Dynamic SQP

Sumeet Singh¹, Jean-Jacques Slotine¹, Vikas Sindhwani¹ ¹ Sumeet Singh, Jean-Jacques Slotine (MIT), and Vikas Sindhwani are with Robotics at Google, New York City, NY 10011, USA {ssumeet,jeanjacquess, sindhwani}@google.com

Abstract

Indirect trajectory optimization methods such as Differential Dynamic Programming (DDP) have found considerable success when only planning under dynamic feasibility constraints. Meanwhile, nonlinear programming (NLP) has been the state-of-the-art approach when faced with additional constraints (e.g., control bounds, obstacle avoidance). However, a naïve implementation of NLP algorithms, e.g., shooting-based sequential quadratic programming (SQP), may suffer from slow convergence – caused from natural instabilities of the underlying system manifesting as poor numerical stability within the optimization. Re-interpreting the DDP closed-loop rollout policy as a sensitivity-based correction to a second-order search direction, we demonstrate how to compute analogous closed-loop policies (i.e., feedback gains) for constrained problems. Our key theoretical result introduces a novel dynamic programming-based constraint-set recursion that augments the canonical “cost-to-go” backward pass. On the algorithmic front, we develop a hybrid-SQP algorithm incorporating DDP-style closed-loop rollouts, enabled via efficient parallelized computation of the feedback gains. Finally, we validate our theoretical and algorithmic contributions on a set of increasingly challenging benchmarks, demonstrating significant improvements in convergence speed over standard open-loop SQP.

I Introduction

Trajectory optimization forms the backbone of model-based optimal control with myriad applications in robot mobility and manipulation [1, 2, 3, 4, 5]. The problem formulation is as follows: consider a robotic system with state $x\in\mathbb{R}^{n}$ , control input $u\in\mathbb{R}^{m}$ , subject to the discrete-time dynamics:

x_{k+1}=f(x_{k},u_{k}),\quad k\in\mathbb{N}_{\geq 0}.

(1)

Let $N\in\mathbb{N}_{>0}$ be some fixed planning horizon. Given some initial state $x_{0}$ , the trajectory optimization problem is as follows:


$\displaystyle\min_{\bm{u},\bm{x}}\qquad$	$\displaystyle\sum_{k=0}^{N-1}l_{k}(x_{k},u_{k})+l_{N}(x_{N})=:\mathcal{J}(\bm{u},\bm{x})$	(2a)
$\displaystyle\mathrm{s.t.}\qquad$	$\displaystyle k=0,\ldots,N-1:\quad\begin{cases}x_{k+1}=f(x_{k},u_{k})\\ c_{k}^{u}(u_{k})\geq 0\\ c_{k+1}^{x}(x_{k+1})\geq 0\end{cases},$	(2b)

where we use $(\bm{u},\bm{x})$ to denote the concatenations $(u_{0},\ldots,u_{N-1})$ and $(x_{0},\ldots,x_{N})$ , respectively. Here, $l_{k}:\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R}_{\geq 0}$ is the running cost, $l_{N}:\mathbb{R}^{n}\rightarrow\mathbb{R}_{\geq 0}$ is the terminal cost, and $c_{k}^{x}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n_{x}},c_{k}^{u}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{n_{u}}$ are the vector-valued constraint functions on the state and control input. We assume that the control constraint encodes simple box constraints: $\underline{u}\leq u_{k}\leq\overline{u}$ , though, the results in this paper may be generalized beyond this assumption.

Solution methods generally fall into one of two approaches: optimal control-based (indirect methods), or optimization-based (direct methods). The former leverages necessary conditions of optimality for optimal control, such as dynamic programming (DP), while the latter treats the problem as a pure mathematical optimization program [6]. A further sub-categorization of the direct method distinguishes between a Full or a Condensed formulation, where the former treats both the states and controls as optimization variables, subject to dynamics equality constraints, while the latter optimizes only over the control variables, with the dynamics implicit.

Refer to caption — Figure 1: Motion planning using our methods for a planar quadrotor with attached pendulum, starting at rest with the pendulum in stable equilibrium (top-left, green), and ending at rest with the pendulum upright (bottom-right, red).

Lacking constraints beyond dynamic feasibility, ubiquitous indirect methods [7, 8, 9, 10] such as Differential Dynamic Programming (DDP) and its Gauss-Newton relaxation, iterative Linear Quadratic Regulator (iLQR) rely upon the DP recursion to split the full-horizon planning problem into a sequence of one-step optimizations, and alternate between a backward and forward pass through the time-steps. The backward pass recursively forms quadratic expansions of the optimal cost-to-go function and computes a time-varying affine perturbation policy that is subsequently rolled out through the system’s dynamics in the forward pass to yield the updated trajectory iterate. Under mild assumptions, DDP locally achieves quadratic convergence, and the proof relies upon establishing the close link between DDP and the Newton method, as applied to the condensed optimization-based formulation [9]. The stability properties of the underlying nonlinear system manifest as numerical stability during the optimization process, and hence the closed-loop nature of the forward pass in DDP typically leads to better performance [11] than Newton’s method, which implements open-loop rollouts.

Computing DDP-style closed-loop updates within the constrained setting (beyond dynamics feasibility) is much more challenging, since quadratization of the cost and linearization of the dynamics and constraints yields constrained quadratic programs (QPs) with piecewise-affine optimal perturbation policies [12], and complexity growing exponentially in the number of constraints and horizon of the problem. Consequently, direct methods (featuring open-loop updates) are the prevailing solution approach, typically combined with interior-point or SQP algorithms. Aside from possessing more variables, the direct formulation must additionally resolve dynamic feasibility, which can be non-trivial and lead to slower convergence even for unconstrained systems, as compared to indirect methods. Moreover, while the condensed formulation yields smaller problems, instabilities have been observed [13] due to divergences between the predicted step from the linearized dynamics, and the open-loop nonlinear rollout, leading to vanishing step-sizes.

Contributions: Re-interpreting the canonical DDP closed-loop rollout as a sensitivity-based correction to a second-order search direction, we demonstrate how to compute a locally affine approximation to the constrained perturbation policies, i.e., a set of feedback gains similar to those employed by the DDP rollouts. Our key theoretical result states that in the constrained setting, one must first compute an optimal perturbation sequence about the current trajectory iterate by solving a full-horizon QP (as opposed to the one-step DP backward recursion), and then augment the canonical cost-to-go backward pass with a constraint-set recursion. We then demonstrate how to approximate the desired feedback gains using an efficient, parallelized algorithm, eliminating the backward pass. The closed-loop rollout is integrated into an SQP line-search, yielding a hybrid indirect/direct algorithm that combines the theoretical foundations of SQP for constrained optimization with the algorithmic efficiencies of DDP-style forward rollouts. The method is rigorously evaluated within several environments, where we confirm significant convergence speed improvements over naïve (i.e., open-loop) SQP.

Related Work: Quasi-DDP methods for constrained trajectory optimization fall into one of three main categories: control-bounds only [14],[15], modified backward pass via KKT analysis [16, 17, 18, 19, 20], and augmented Lagrangian methods [21, 4, 3, 22, 23]. We provide a comprehensive overview of these approaches in Appendix A¹¹1All appendices referenced herein may be found in the online version of this work [24]..

II Shooting SQP

We detail below the core algorithmic steps for SQP, as applied to the shooting formulation of problem (2), i.e., where dynamics are treated implicitly and we optimize only over the control sequence $\bm{u}$ . The three steps are [25, 26]: (i) solving a QP sub-problem to compute a search direction, i.e., a sequence of control perturbations $\bm{\delta u}=(\delta u_{0},\ldots,\delta u_{N-1})$ , (ii) performing line-search along $\bm{\delta u}$ using a merit function, and (iii) monitoring termination conditions. We provide some details regarding (i) and (ii) here, and refer the reader to Appendices B and C for the rest.

Let $c_{k}(x_{k},u_{k})$ denote the concatenation $(c_{k}^{x}(x_{k}),c_{k}^{u}(u_{k}))$ , $k=0,\ldots,N$ (where $c_{N}^{u}$ is null), and let $y_{k}$ denote the corresponding dual variable. Consider the Lagrangian at current primal-dual iterate $(\bm{u},\bm{y})$ :

\mathcal{L}(\bm{u},\bm{x},\bm{y})=\sum_{k=0}^{N-1}l_{k}-y_{k}^{T}c_{k}+l_{N}-y_{N}^{T}c_{N}

(3)

For brevity, we omit the explicit arguments $(x_{k},u_{k})$ where possible. As we are only optimizing over $\bm{u}$ , let $\bm{x}[\bm{u}]$ represent the state trajectory starting at $x_{0}$ obtained from propagating the open-loop control sequence $\bm{u}$ through the discrete-time dynamics in (1) for $k=0,\ldots,N-1$ . Define the “reduced” objective and Lagrangian as $\mathcal{J}_{R}(\bm{u})=\mathcal{J}(\bm{u},\bm{x}[\bm{u}])$ and $\mathcal{L}_{R}(\bm{u},\bm{y}):=\mathcal{L}(\bm{u},\bm{x}[\bm{u}],\bm{y})$ , respectively, and define the sets $\{\delta\mathcal{X}_{k}\}_{k=1}^{N}$ and $\{\delta\mathcal{U}_{k}\}_{k=0}^{N-1}$ as $\delta\mathcal{X}_{k}:=\{\delta x:c_{k}^{x}+J_{k}^{x}\delta x\geq 0\}$ and $\delta\mathcal{U}_{k}:=\{\delta u:c_{k}^{u}+J_{k}^{u}\delta u\geq 0\}$ . Then, the QP sub-problem takes the form:


$\displaystyle\min_{\bm{\delta u},\bm{\delta x}}\quad$	$\displaystyle\langle\bm{\delta u},\nabla_{\bm{u}}\mathcal{J}_{R}(\bm{u})\rangle+\dfrac{1}{2}\langle\bm{\delta u},\nabla_{\bm{u}}^{2}\mathcal{L}_{R}(\bm{u},\bm{y})\bm{\delta u}\rangle$	(4a)
$\displaystyle\mathrm{s.t.}\quad$	$\displaystyle\delta x_{k+1}=A_{k}\delta x_{k}+B_{k}\delta u_{k},\ k=0,\ldots,N-1$	(4b)
	$\displaystyle\delta u_{k}\in\delta\mathcal{U}_{k},\quad\delta x_{k+1}\in\delta\mathcal{X}_{k+1},\ k=0,\ldots,N-1,$	(4c)

where $\delta x_{0}=0$ and $\langle\cdot,\cdot\rangle$ denotes the standard Euclidean dot product; $(A_{k},B_{k})$ are the dynamics Jacobians $(\partial f/\partial x,\partial f/\partial u)(x_{k},u_{k})$ , and $(J_{k}^{u},J_{k}^{x})$ are the constraint Jacobians $(\partial c_{k}^{u}(u_{k})/\partial u,\partial c_{k}^{x}(x_{k})/\partial x)$ . The sub-problem objective (4a) has the form:

\begin{split}&\sum_{k=0}^{N-1}\overbrace{\left[q_{k}^{T}\delta x_{k}+r_{k}^{T}\delta u_{k}+\dfrac{1}{2}\begin{bmatrix}\delta x_{k}\\ \delta u_{k}\end{bmatrix}^{T}Z_{k}\begin{bmatrix}\delta x_{k}\\ \delta u_{k}\end{bmatrix}\right]}^{:=\tilde{l}_{k}(\delta x_{k},\delta u_{k})}+\\ &\quad+\underbrace{\dfrac{1}{2}\delta x_{N}^{T}Z_{N}\delta x_{N}+q_{N}^{T}\delta x_{N}}_{:=\tilde{l}_{N}(\delta x_{N})}\end{split}

(5)

where the terms $\{q_{k},r_{k},Z_{k}\}$ are provided in Appendix C. Let $(\bm{\delta u}^{*},\bm{\delta x}^{*})$ represent the optimal primal, and $\hat{\bm{y}}$ the optimal inequality dual solutions for (4), and define the dual search direction $\bm{\delta y}^{*}:=\hat{\bm{y}}-\bm{y}$ .

II-A Line-Search

For $\alpha\in(0,1]$ , define $\bm{u}[\alpha]:=\bm{u}+\alpha\bm{\delta u}^{*}$ and $\bm{x}[\bm{u}[\alpha]]:=\bm{x}[\alpha]$ . The line-search merit function is defined as:

\phi(\alpha;\rho)=\mathcal{M}_{I}(\bm{u}[\alpha],\bm{y}+\alpha\bm{\delta y}^{*},\bm{s}+\alpha\bm{\delta s^{*}};\rho),\ \bm{x}=\bm{x}[\alpha],

(6)

where $\mathcal{M}_{I}$ is the augmented Lagrangian function for problem (2), $\bm{s}$ is a vector of slack variables for the inequality constraints with search direction $\bm{\delta s}^{*}$ , introduced solely for the line-search, and $\rho=\{\rho_{k}\}_{k=0}^{N}$ is a set of penalty parameters. Please see Appendix C for details on $\{\bm{s},\bm{\delta s},\rho\}$ .

III Dynamic Programming SQP

Implicit within the line-search is the open-loop rollout along the search direction $\bm{\delta u}^{*}$ , i.e., $\bm{x}[\alpha]=\bm{x}[\bm{u}+\alpha\bm{\delta u}^{*}]$ . For unstable nonlinear systems, this state trajectory may differ significantly from $\bm{x}+\bm{\delta x}^{*}$ , the “predicted” sequence from the QP sub-problem, forcing the line-search to take sub-optimal step-sizes and slowing convergence. This observation is corroborated in [11] in context of comparing DDP and Newton methods for unconstrained problems, and within [13] in the constrained context. Our objective therefore, is to efficiently compute a set of feedback gains to perform DDP-style closed-loop rollouts within the SQP line-search. We hypothesize that such an enhancement will (i) improve the numerical stability of the line-search, and (ii) accelerate convergence of Shooting SQP. We first demonstrate how the classical DP recursion is ill-posed in the context of constrained trajectory optimization, and propose a correction inspired from sensitivity analysis.

III-A Sensitivity-Based Dynamic Programming

The starting point for the derivation of iLQR and DDP algorithms for unconstrained problems is with the Bellman form of the optimal cost-to-go function:

\small\begin{split}V_{k}(x)&:=\min_{\pi_{k}}\left[l_{k}(x,\pi_{k}(x))+V_{k+1}(f(x,\pi_{k}(x)))\right],\ k=0,\ldots,N-1\\ V_{N}(x)&:=l_{N}(x),\end{split}

where $\pi_{k}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is a policy for time-step $k$ , mapping states to controls. For a non-optimal state-control sequence $(\bm{x},\bm{u})$ , consider the local expansion of the optimal cost-to-go function:

\begin{split}&\delta V_{k}(\delta x_{k}):=V_{k}(x_{k}+\delta x_{k})\\ &\quad=\min_{\delta\pi_{k}}\left[\underbrace{l_{k}(x_{k}+\delta x_{k},u_{k}+\delta\pi_{k}(\delta x_{k}))+\delta V_{k+1}(\delta x_{k+1})}_{:=Q_{k}(\delta x_{k},\delta\pi_{k})}\right],\end{split}

where $\delta x_{k+1}=f(x_{k}+\delta x_{k},u_{k}+\delta\pi_{k}(\delta x_{k}))-x_{k+1}$ , and $\delta\pi_{k}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is a perturbation policy for time-step $k$ at $x_{k}$ , as a function of $\delta x_{k}$ . Now, by recursively (in a backward pass) taking quadratic approximations of the state-action variation function $Q_{k}$ about $(\delta x_{k},\delta u_{k})=(0,0)$ , one can solve for an affine approximation to the minimizing perturbation policy. In particular, let $\breve{Q}_{k}$ represent the quadratic approximation of $Q_{k}$ , and define $\delta\breve{\pi}_{k}^{*}(\delta x_{k}):=\operatornamewithlimits{\mathrm{argmin}}_{\delta u}\breve{Q}_{k}(\delta x_{k},\delta u)$ . Then $\delta\breve{\pi}_{k}^{*}(\delta x_{k})=\overline{\delta u}_{k}+K_{k}\delta x_{k}$ . For step-length $\alpha$ , this perturbation policy is rolled out to obtain the new trajectory iterate:

\begin{split}\delta x_{k+1}[\alpha]&=f(x_{k}+\delta x_{k}[\alpha],u_{k}+\delta u_{k}[\alpha])-x_{k+1},\\ \delta u_{k}[\alpha]&=\alpha\overline{\delta u}_{k}+K_{k}\delta x_{k}[\alpha],\quad\alpha\in(0,1],\end{split}

(7)

where $\delta x_{0}[\alpha]=0$ . Notice that one may interpret the terms of the unconstrained perturbation policy $\delta\breve{\pi}_{k}^{*}$ as follows:

\overline{\delta u}_{k}:=\delta\breve{\pi}_{k}^{*}(0),\quad K_{k}:=\dfrac{\partial\delta\breve{\pi}_{k}^{*}(0)}{\partial\delta x_{k}}.

(8)

Remark 1.

Since $\delta\breve{\pi}^{*}_{k}(\delta x_{k})$ is the solution of an unconstrained convex quadratic, the argument $0$ is redundant for the sensitivity matrix $K_{k}$ . This will not be the case in the constrained setting.

Consider now the constrained setting, and define for $k\in\{0,\ldots,N-1\}$ :

\small\delta\pi_{k}^{*}(\delta x_{k}):=\operatornamewithlimits{\mathrm{argmin}}_{\delta u_{k}\in\delta\mathcal{U}_{k}}\quad\tilde{l}_{k}(\delta x_{k},\delta u_{k})+\delta V_{k+1}(\delta x_{k+1})

(9)

where $\tilde{l}_{k}$ is the stage- $k$ term in (5), $\delta x_{k+1}=A_{k}\delta x_{k}+B_{k}\delta u_{k}$ , and $\delta V_{k+1}$ is the optimal “cost-to-go” for problem (4). That is, for $k+1=N$ , $\delta x_{N}\in\delta\mathcal{X}_{N}$ , $\delta V_{N}(\delta x_{N})=\tilde{l}_{N}(\delta x_{N})$ , while for $k+1\in\{1,\ldots,N-1\}$ , $\delta x_{k+1}\in\delta\mathcal{X}_{k+1}$ , $\delta V_{k+1}(\delta x_{k+1})$ is the optimal value of the tail-truncation of QP sub-problem (4), starting at time-step $k+1$ at $\delta x_{k+1}$ .

Notice that since $\delta V_{k+1}$ is the optimal value of a constrained QP, $\delta\pi_{k}^{*}(0)$ and the sensitivity matrix $\partial\delta\pi_{k}^{*}(0)/\partial\delta x_{k}$ (paralleling the terms defined in (8)) may be ill-defined, for instance when the tail sub-problem is infeasible at $\delta x_{k}=0$ . This is a consequence of the linearized constraints, irrespective of the objective function used to define the DP recursion.

Instead, consider the following equivalent re-arrangement of the unconstrained DDP control law in (7):

\small\begin{split}\delta u_{k}[\alpha]&=\alpha\overline{\delta u}_{k}+K_{k}\delta x_{k}[\alpha]\\ &=\alpha\overline{\delta u}_{k}+K_{k}\alpha\delta x_{k}^{L}+K_{k}(\delta x_{k}[\alpha]-\alpha\delta x_{k}^{L})\\ &=\alpha\delta u_{k}^{L}+K_{k}(\delta x_{k}[\alpha]-\alpha\delta x_{k}^{L})\end{split}

(10)

where, the sequence $(\bm{\delta x}^{L},\bm{\delta u}^{L})$ is defined by the rollout of $\delta\breve{\pi}^{*}_{k}$ via the linearized dynamics:

\begin{split}\delta x_{k+1}^{L}&=A_{k}\delta x_{k}^{L}+B_{k}\delta u_{k}^{L},\quad\delta x_{0}^{L}=0\\ \delta u_{k}^{L}&:=\delta\breve{\pi}^{*}_{k}(\delta x_{k}^{L}).\end{split}

(11)

In light of the homogeneity of the above recursion (i.e., the sequence $\alpha\bm{\delta u}^{L}$ rolled out via the linear dynamics yields $\alpha\bm{\delta x}^{L}$ ), eq. (10) suggests interpreting $\bm{\delta u}^{L}$ as a search-direction, $\alpha\bm{\delta u}^{L}$ as the search step, and the feedback term as a sensitivity-based correction. Thus, we may interpret the DDP rollout as a local sensitivity-based correction to the Newton search direction ( $\bm{\delta u}^{L}$ ). Generalizing this interpretation to the constrained setting, consider the following control law:

\delta u_{k}[\alpha]=\mathrm{clip}_{\underline{u}-u_{k}}^{\overline{u}-u_{k}}\bigg{[}\alpha\delta\pi_{k}^{*}(\delta x_{k}^{L})+\dfrac{\partial\delta\pi_{k}^{*}(\alpha\delta x_{k}^{L})}{\partial\delta x_{k}}(\delta x_{k}[\alpha]-\alpha\delta x_{k}^{L})\bigg{]},

where similarly to (11), $\bm{\delta x}^{L}$ is obtained from rolling-out $\{\delta\pi_{k}^{*}(\delta x_{k}^{L})\}_{k=0}^{N-1}$ through the linearized dynamics. Now, it follows from Bellman’s principle of optimality that $\delta\pi_{k}^{*}(\delta x_{k}^{*})=\delta u_{k}^{*}$ , where $(\bm{\delta x},\bm{\delta u})^{*}$ are the optimal solution to (4). Further since $\delta x_{0}^{L}=\delta x_{0}^{*}=0$ , it follows inductively that $\bm{\delta x}^{L}=\bm{\delta x}^{*}$ . Thus, the final control law for $\delta u_{k}[\alpha]$ becomes

\mathrm{clip}_{\underline{u}-u_{k}}^{\overline{u}-u_{k}}\bigg{[}\alpha\delta u_{k}^{*}+\dfrac{\partial\delta\pi^{*}_{k}(\alpha\delta x_{k}^{*})}{\partial\delta x_{k}}(\delta x_{k}[\alpha]-\alpha\delta x_{k}^{*})\bigg{]}.

(12)

Given that $\delta\pi^{*}_{k}$ , as defined in (9), is implicitly the solution of a variable-horizon optimization, it is computationally prohibitive to compute the sensitivity matrices above via explicit differentiation. Instead, we next define a DP recursion to exactly compute these sensitivities about the fixed sequence $\bm{\delta x}^{*}$ .

Remark 2.

Since problem (9) is a multi-parametric QP in $\delta x_{k}$ , $\delta\pi^{*}_{k}$ is a piecewise-affine function of $\delta x_{k}$ [12]. Thus, for $\alpha=1$ , the expression inside the brackets in (12) lies within $\delta\mathcal{U}_{k}$ only for $\delta x_{k}[1]$ in a local region around $\delta x_{k}^{*}$ , thereby necessitating the clipping operation (i.e., projection onto $\delta\mathcal{U}_{k}$ ).

III-B DP Recursion for Computing Sensitivity Gains

We outline the DP recursion first and characterize its correctness in Theorem 1.

Initialization: Set $P_{N}=Z_{N},p_{N}=q_{N},v_{N}=0,G^{\mathrm{cr}}_{N}=-J_{N}^{x},h^{\mathrm{cr}}_{N}=c_{N}^{x}$ .

Time-step $k=N-1,\ldots,0$ : Define the function $\delta\tilde{V}_{k+1}(\delta x_{k+1}):=(1/2)\delta x_{k+1}^{T}P_{k+1}\delta x_{k+1}+p_{k+1}^{T}\delta x_{k+1}+v_{k+1}$ . For $\delta x_{k}\in\delta\mathcal{X}_{k}$ , consider the one-step QP:

\small\begin{split}\min_{\begin{subarray}{c}\delta u_{k}\\ \delta x_{k+1}\end{subarray}}\quad&\tilde{l}_{k}(\delta x_{k},\delta u_{k})+\delta\tilde{V}_{k+1}(\delta x_{k+1})\\ \mathrm{s.t.}\quad&\delta x_{k+1}=A_{k}\delta x_{k}+B_{k}\delta u_{k}\\ &\underbrace{\begin{bmatrix}-J_{k}^{u}&0\\ 0&G^{\mathrm{cr}}_{k+1}\end{bmatrix}}_{:=\bar{G}_{k}}\begin{bmatrix}\delta u_{k}\\ \delta x_{k+1}\end{bmatrix}\leq\begin{bmatrix}c^{u}_{k}\\ h^{cr}_{k+1}\end{bmatrix}:=\bar{h}_{k}.\end{split}

(13)

Let $\delta\hat{\pi}^{*}_{k}(\delta x_{k})$ and $\hat{y}^{*}_{k}(\delta x_{k})$ denote the optimal control perturbation and inequality dual solutions for the above one-step QP, as a function of $\delta x_{k}$ . Define the sensitivity matrices $K_{k}^{u}$ and $K_{k}^{y}$ as the Jacobians [27] of $\delta\hat{\pi}^{*}_{k}$ and $\hat{y}^{*}_{k}$ respectively, evaluated at $\delta x_{k}^{*}$ , and define the affine functions:


$\displaystyle\delta\hat{\pi}^{*}_{k,a}(\delta x_{k})$	$\displaystyle=\delta\hat{\pi}^{}_{k}(\delta x_{k}^{})+K_{k}^{u}(\delta x_{k}-\delta x_{k}^{*})$	(14a)
$\displaystyle\hat{y}^{*}_{k,a}(\delta x_{k})$	$\displaystyle=\hat{y}^{}_{k}(\delta x_{k}^{})+K_{k}^{y}(\delta x_{k}-\delta x_{k}^{*}).$	(14b)

Recurse: Compute:

\small G^{\mathrm{cr}}_{k}:=\begin{bmatrix}-J_{k}^{x}\\ \bar{G}_{k}\begin{bmatrix}K^{u}_{k}\\ \bar{A}_{k}\end{bmatrix}\\ -K_{k}^{y}\end{bmatrix},\ h^{\mathrm{cr}}_{k}=\begin{bmatrix}c_{k}^{x}\\ \bar{h}_{k}-\bar{G}_{k}\begin{bmatrix}I\\ B_{k}\end{bmatrix}\delta\hat{\pi}^{*}_{k,a}(0)\\ \hat{y}^{*}_{k,a}(0)\end{bmatrix},

(15)

where $\bar{A}_{k}:=A_{k}+B_{k}K_{k}^{u}$ , and:


$\displaystyle\bar{r}_{k}$	$\displaystyle=r_{k}+B_{k}^{T}p_{k+1}\quad\bar{R}_{k}=Z_{k,uu}+B_{k}^{T}P_{k+1}B_{k}$	(16a)
$\displaystyle\bar{M}_{k}$	$\displaystyle=Z_{k,xu}+A_{k}^{T}P_{k+1}B_{k}$	(16b)
$\displaystyle p_{k}$	$\displaystyle=q_{k}+A_{k}^{T}p_{k+1}+K_{k}^{u^{T}}\bar{r}_{k}+(K_{k}^{u^{T}}\bar{R}_{k}+\bar{M}_{k})\delta\hat{\pi}^{*}_{k,a}(0)$	(16c)
$\displaystyle P_{k}$	$\displaystyle=Z_{k,xx}+A_{k}^{T}P_{k+1}A_{k}+K_{k}^{u^{T}}\bar{R}_{k}K_{k}^{u^{T}}+$
	$\displaystyle\quad+\bar{M}_{k}K_{k}^{u}+K_{k}^{u^{T}}\bar{M}_{k}^{T}$	(16d)
$\displaystyle v_{k}$	$\displaystyle=\bar{r}_{k}^{T}\delta\hat{\pi}^{}_{k,a}(0)+\dfrac{1}{2}\delta\hat{\pi}^{}_{k,a}(0)^{T}\bar{R}_{k}\delta\hat{\pi}^{*}_{k,a}(0)+v_{k+1}.$	(16e)

We now characterize the correctness of this DP recursion in the following theorem; the proof is provided in Appendix D.

Theorem 1.

Suppose that for each $k\in\{N-1,\ldots,0\}$ , the solution of the one-step QP in (13) at $\delta x_{k}=\delta x_{k}^{*}$ satisfies Linear Independent Constraint Qualification (LICQ). Then, the recursion in (15)–(16) is well-defined. Define the sets $\mathcal{CR}_{k}:\{\delta x_{k}:G^{\mathrm{cr}}_{k}\delta x_{k}\leq h^{\mathrm{cr}}_{k}\}$ , for $k=N,\ldots,0$ . It holds that:

\begin{split}&\delta x_{k}^{*}\in\mathcal{CR}_{k}\ k=N,\ldots,0\ \ \text{and}\\ &\begin{cases}\forall\delta x_{k}\in\mathcal{CR}_{k}&:\ \delta\pi^{*}_{k}(\delta x_{k})=\delta\hat{\pi}^{*}_{k,a}(\delta x_{k})\\ \forall\delta x_{k+1}\in\mathcal{CR}_{k+1}&:\ \delta V_{k+1}(\delta x_{k+1})=\delta\tilde{V}_{k+1}(\delta x_{k+1}),\end{cases}\end{split}

for $k=N-1,\ldots,0$ .

Remark 3.

A notable consequence of Theorem 1 is that the canonical cost-to-go recursion is ill-posed in the presence of constraints. One must back-propagate both the cost-to-go terms and a set of constraints (i.e., the sets $\{\mathcal{CR}_{k}\}$ ) that define the regions where the quadratic models of the cost-to-go functions are precise.

Despite the exactness of the DP recursion, there are some computational drawbacks. First, one must solve both the “full-horizon” QP defined in (4) and the one-step QPs defined in (13), serially. Second, back-propagating sets $\{\mathcal{CR}_{k}\}$ is not numerically robust, particularly if the sensitivity $K_{k}^{y}$ is ill-defined. This occurs when the LICQ condition fails and the resulting matrix solve computation for the sensitivities is singular. Thus, in the next section, we outline a parallelized and tuneable approximation to the sensitivity gains, derived from the viewpoint of interior point methods.

III-C Approximating the Sensitivity Gains

For $k=N-1,\ldots,0$ , define problem $\mathcal{P}_{k\multimap}(\delta x)$ as the tail portion of QP sub-problem (4), starting at time-step $k$ at $\delta x$ . Let $\bm{\delta u}^{*}_{k\multimap}(\delta x)$ represent the optimal solution as a function of $\delta x$ , i.e., the optimal control perturbation sequence starting at time-step $k$ . Notice that $\delta\pi_{k}^{*}(\delta x)$ , as defined in (9), corresponds to the first element of the sequence $\bm{\delta u}^{*}_{k\multimap}(\delta x)$ .

Now define the QP problem $\mathcal{P}_{k}(\delta x)$ as QP sub-problem (4), subject to an additional equality constraint: $\delta x_{k}=\delta x$ , and let $\bm{\delta u}^{*}_{k:}(\delta x)$ represent the optimal tail control perturbation sequence starting at time-step $k$ .

Notice then that for all $\delta x$ where $\mathcal{P}_{k}(\delta x)$ is feasible, we have that $\bm{\delta u}^{*}_{k\multimap}(\delta x)=\bm{\delta u}^{*}_{k:}(\delta x)$ . Thus, $\delta\pi_{k}^{*}(\delta x)$ is equal to the optimal control perturbation at time-step $k$ for problem $\mathcal{P}_{k}(\delta x)$ , hereby denoted as the function $\delta\hat{\pi}_{k}^{*}(\delta x)$ . Further, since $\mathcal{P}_{k}(\delta x_{k}^{*})$ is feasible, one may compute the desired sensitivity gains $K_{k}^{u}$ as the Jacobian $\partial\delta\hat{\pi}_{k}^{*}(\delta x_{k}^{*})/\partial\delta x$ .

Note the distinction: the sensitivity $\partial\delta\hat{\pi}_{k}^{*}/\partial\delta x$ corresponds to the Jacobian of the solution of a fixed-horizon QP (problem $\mathcal{P}_{k}(\delta x)$ ) w.r.t. a parameter ( $\delta x$ ) that defines the equality constraint $\delta x_{k}=\delta x$ . In comparison, the sensitivity $\partial\delta\pi_{k}^{*}/\partial\delta x$ corresponds to the Jacobian of the solution of a variable-horizon QP (problem $\mathcal{P}_{k\multimap}(\delta x)$ ) w.r.t. a parameter ( $\delta x$ ) that defines the “initial condition.” The former computation is easily parallelized.

Leveraging a recent result in [28], we approximate $\partial\delta\hat{\pi}_{k}^{*}(\delta x_{k}^{*})/\partial\delta x$ by the Jacobian of the solution of the following unconstrained barrier re-formulation of problem $\mathcal{P}_{k}(\delta x)$ w.r.t. $\delta x$ at $\delta x=\delta x_{k}^{*}$ :

\begin{split}\small\min_{\bm{\delta u},\bm{\delta x}}\ &\sum_{j=0}^{N-1}\bigg{[}\tilde{l}_{j}-\gamma\bm{1}^{T}\log(c_{j}^{x}+J_{j}^{x}\delta x_{j})-\gamma\bm{1}^{T}\log(c_{j}^{u}+J_{j}^{u}\delta u_{j})\bigg{]}\\ &\ +\dfrac{1}{2\gamma}\|\delta x_{k}-\delta x\|^{2}+\tilde{l}_{N}-\gamma\bm{1}^{T}\log(c_{N}^{x}+J_{N}^{x}\delta x_{N}),\end{split}

subject to the linear dynamics in (4b); where $\gamma>0$ is the barrier constant. Denote $K_{k}^{u}(\gamma)$ to be the barrier-based Jacobian with parameter $\gamma$ and let $K_{k}^{u}$ be the true Jacobian. Under appropriate conditions on the solution of QP sub-problem (4), $K_{k}^{u}(\gamma)\rightarrow K_{k}^{u}$ as $\gamma\rightarrow 0$ [28].

Practically, we compute the Jacobians $K_{k}^{u}(\gamma)$ efficiently using iLQR and a straightforward application of the Implicit Function Theorem [29]. We initialized the solver with the QP sub-problem (4) solution $\bm{\delta u}^{*}$ , and found only a handful of iterations were needed to converge, particularly since problem (III-C) is convex.

III-D Hybrid SQP Algorithm

We formally state the Hybrid SQP algorithm as a line-search modification of Shooting SQP, introduced in Section II. Thus, at the current primal-dual iterate $(\bm{u},\bm{x}[\bm{u}],\bm{y})$ :

Step 1: Solve QP-subproblem (4) to obtain the optimal perturbation sequence pair $(\bm{\delta u}^{*},\bm{\delta x}^{*})$ .

Step 2: Compute the sensitivity gains $\{K_{k}\}_{k=0}^{N-1}$ , using either the DP recursion in Section III-B (i.e., $K_{k}=K_{k}^{u}$ ), or the smoothed approximation method in Section III-C (i.e., $K_{k}=K_{k}^{u}(\gamma)$ for some $\gamma>0$ ).

Step 3: Perform line-search using (6), where $\bm{x}[\alpha]:=\bm{x}+\bm{\delta x}[\alpha]$ is now defined by the closed-loop rollout:

\begin{split}\delta u_{k}[\alpha]:=&\mathrm{clip}_{\underline{u}-u_{k}}^{\overline{u}-u_{k}}\left[\alpha\delta u_{k}^{*}+K_{k}(\delta x_{k}[\alpha]-\alpha\delta x_{k}^{*})\right]\\ \delta x_{k+1}[\alpha]&=f(x_{k}+\delta x_{k}[\alpha],u_{k}+\delta u_{k}[\alpha])-x_{k+1},\ \delta x_{0}[\alpha]=0.\end{split}

Notice that if $\alpha=1$ , the rollout corresponds with the ideal DDP rollout in (12), while for $\alpha<1$ , we end up with an approximation²²2As the sensitivity gains are only valid in a neighborhood of $\bm{\delta x}^{*}$ , it is possible (though rare in our experiments) that the computed step-length $\alpha$ falls below the user-set threshold $\underline{\alpha}$ for a specific iteration. As a backup, we compute a set of TV-LQR gains $\{K_{k}^{\mathrm{lqr}}\}$ using the linearized dynamics and the Hessian of the objective function $\mathcal{J}(\bm{u},\bm{x})$ , and perform the closed-loop rollout with these gains. This strategy is motivated by the goal of tracking the perturbation $\alpha\bm{\delta x}^{*}$ during the rollout [13]. stemming from using a fixed gain matrix.

IV Experiments

We compare the naïve, open-loop, Shooting SQP implementation introduced in Section II (referred to as OL) with the DDP-style closed-loop variation developed in Section III (referred to as CL and $\texttt{\bf{CL}}_{\gamma}$ ) on two environments. The identifiers CL and $\texttt{\bf{CL}}_{\gamma}$ distinguish between the exact DP recursion and the smoothed barrier-based approximation for computing the forward rollout gains. Please see Appendix E for details regarding problem setup, SQP hyperparameters, additional plots, and an extra worked example.

IV-A Motion Planning for a Car

The first example is taken from [17], featuring a 2D car ( $n=4$ , $m=2$ ) moving within the obstacle-ridden environment shown in Figure 2. The objective is to drive to the goal position $(3,3)$ with final velocity $0$ and orientation aligned with the horizontal axis in $N=40$ steps, while avoiding the obstacles.

Figure 2 shows the computed trajectories by the three methods for three different initial conditions, while Table I provides the solver statistics. Notice that the OL method fails to converge within 100 iterations (the limit) for two of the three cases. In contrast, both closed-loop variations are quickly able to converge to stationary solutions.

Method	Case	Converged	Iter	Obj	Viol	Time/it [s]
OL	1	✗	100	278.53	-0.0346	0.34
	2	✗	100	2.17	0.0061	0.30
	3	✓	12	21.49	3.25e-6	0.30
CL	1	✓	19	3.19	7.87e-6	13.17
	2	✓	19	6.98	1.43e-6	19.19
	3	✓	13	21.58	1.12e-5	17.90
$\texttt{\bf{CL}}_{\gamma}$	1	✓	19	3.19	1.34e-5	0.40
	2	✓	16	2.06	1.28e-5	0.41
	3	✓	11	21.58	-4.67e-6	0.38

TABLE I: Solver statistics for the car planning example. Baseline: OL, Ours:

\{\texttt{\bf{CL}},\texttt{\bf{CL}}_{\gamma}\}

. Iter, Obj, and Viol report the number of iterations, objective, and max state constraint violation (negative value indicates infeasibility), respectively, for the final solution. Time/it reports the average (over the course of the optimization) computation time per SQP iteration.

In Figure 3, we plot the re-construction error $\|\delta\hat{\pi}_{k}^{*}(\delta x_{k}^{*})-\delta u_{k}^{*}\|$ for CL and $\texttt{\bf{CL}}_{\gamma}$ over the course of the SQP iterations. Observe that for most iterations, the error is negligible for CL, with occasional spikes resulting from the numerical instability of the constrained DP recursion. In contrast, the error remains sufficiently low for all iterations of $\texttt{\bf{CL}}_{\gamma}$ , even leading to a better quality (lower objective) solution for Case #2. We hypothesize that the better numerical stability of $\texttt{\bf{CL}}_{\gamma}$ stems from a smoothing of the computed Jacobians (i.e., feedback gains), courtesy of the unconstrained barrier re-formulation. Finally, a notable advantage of $\texttt{\bf{CL}}_{\gamma}$ over CL is the computation time. While CL involves differentiating through the KKT conditions of one-step horizon QPs, this computation must happen serially in the backward pass. In contrast, $\texttt{\bf{CL}}_{\gamma}$ computes the required Jacobians across all time-steps in parallel using an efficient adjoint recursion associated with problem (III-C). Consequently, the computation times-per iteration are much closer together for OL and $\texttt{\bf{CL}}_{\gamma}$ than for OL and CL. For the remaining experiments, we only compare OL and $\texttt{\bf{CL}}_{\gamma}$ .

IV-B Quad-Pendulum

Consider a quadrotor with an attached pendulum ( $n=8$ , $m=2$ ) moving within an obstacle-cluttered 2D vertical plane, subject to viscous friction at the pendulum joint. The system is subject to operational constraints on the state and control input, as well as obstacle avoidance constraints. The task involves planning a trajectory starting at rest with the pendulum at the stable equilibrium, to a goal location, with the pendulum upright and both quadrotor and pendulum stationary. Table II provides the solver statistics for $\texttt{\bf{CL}}_{\gamma}$ and OL (up until the algorithm stalls due to infeasibility of the QP sub-problem). Figure 1 shows a timelapse of the solution for the more difficult of the two cases, highlighting the ability of $\texttt{\bf{CL}}_{\gamma}$ in solving challenging planning tasks.

Method	Case	Converged	Iter	Obj	Viol	Time/it [s]
OL	1	✗	stall (2)	2272.8	-7.33	1.49
OL	2	✗	stall (4)	488.9	-0.58	1.35
$\texttt{\bf{CL}}_{\gamma}$	1	✓	39	9.31	-2.73e-10	6.25
	2	✓	59	11.57	-4.33e-10	5.65

TABLE II: Solver statistics for the Quad-Pendulum example. Algorithm OL stalls within a handful of iterations.

V Conclusions

In this work, we re-interpret DDP rollout policies from a perspective of sensitivity-based corrections, and use this insight to develop algorithms for computing analogous policies for constrained problems. We incorporate the resulting closed-loop rollouts within a shooting-based SQP framework, and demonstrate significant improvements in convergence speed over a standard SQP implementation using open-loop rollouts.

Our work opens several avenues for future research. First, a key bottleneck of SQP involves solving the QP sub-problem at each iteration to compute a “nominal” perturbation sequence. This may be achieved with fast, but potentially, less-accurate unconstrained solvers (e.g., augmented-Lagrangian iLQR), that additionally compute the desired sensitivity gains using an efficient Riccati recursion. Second, leveraging recent results on differentiating through the solution of general convex problems, the sensitivity-based computations may be applied to the sequential-convex-programming algorithm. Finally, while the SQP algorithm was studied in the shooting context, recent work [30] has demonstrated how to incorporate nonlinear rollouts with both states and controls as optimization variables, albeit in an otherwise unconstrained setting. The sensitivity-based gain computation can be extended to this setting, potentially boosting the efficiency of “full” direct methods.

References

[1] J. Schulman, Y. Duan, J. Ho, A. Lee, I. Awwal, H. Bradlow, J. Pan, S. Patil, K. Goldberg, and P. Abbeel, “Motion planning with sequential convex optimization and convex collision checking,” The International Journal of Robotics Research, vol. 33, no. 9, pp. 1251–1270, 2014.
[2] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal, “STOMP: Stochastic trajectory optimization for motion planning,” in 2011 IEEE international conference on robotics and automation. IEEE, 2011, pp. 4569–4574.
[3] V. Sindhwani, R. Roelofs, and M. Kalakrishnan, “Sequential operator splitting for constrained nonlinear optimal control,” in 2017 American Control Conference (ACC). IEEE, 2017, pp. 4864–4871.
[4] T. A. Howell, B. E. Jackson, and Z. Manchester, “ALTRO: A fast solver for constrained trajectory optimization,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 7674–7679.
[5] Y. Tassa, T. Erez, and W. D. Smart, “Receding horizon differential dynamic programming,” in nips, 2007, pp. 1465–1472.
[6] D. Bertsekas, Nonlinear Programming. Athena Scientific; 3rd edition, 2016.
[7] W. Li and E. Todorov, “Iterative linear quadratic regulator design for nonlinear biological movement systems.” in ICINCO (1), 2004, pp. 222–229.
[8] D. Q. Mayne, “Differential dynamic programming–a unified approach to the optimization of dynamic systems,” Control and Dynamic Systems, vol. 10, pp. 179–254, 1973.
[9] D. Murray and S. Yakowitz, “Differential dynamic programming and newton’s method for discrete optimal control problems,” Journal of Optimization Theory and Applications, vol. 43, no. 3, pp. 395–414, 1984.
[10] J. C. Dunn and D. Bertsekas, “Efficient dynamic programming implementations of Newton’s method for unconstrained optimal control problems,” Journal of Optimization Theory and Applications, vol. 63, no. 1, pp. 23–38, 1989.
[11] L.-z. Liao and C. A. Shoemaker, “Advantages of differential dynamic programming over Newton’s method for discrete-time optimal control problems,” Cornell University, Tech. Rep., 1992.
[12] F. Borrelli, A. Bemporad, and M. Morari, Predictive control for linear and hybrid systems. Cambridge University Press, 2017.
[13] M. J. Tenny, S. J. Wright, and J. B. Rawlings, “Nonlinear model predictive control via feasibility-perturbed sequential quadratic programming,” Computational Optimization and Applications, vol. 28, no. 1, pp. 87–121, 2004.
[14] Y. Tassa, N. Mansard, and E. Todorov, “Control-limited differential dynamic programming,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 1168–1175.
[15] J. Marti-Saumell, J. Solà, C. Mastalli, and A. Santamaria-Navarro, “Squash-box feasibility driven differential dynamic programming,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 7637–7644.
[16] M. Giftthaler and J. Buchli, “A projection approach to equality constrained iterative linear quadratic optimal control,” in 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids). IEEE, 2017, pp. 61–66.
[17] Y. Aoyama, G. Boutselis, A. Patel, and E. A. Theodorou, “Constrained differential dynamic programming revisited,” arXiv preprint arXiv:2005.00985, 2020.
[18] Z. Xie, C. K. Liu, and K. Hauser, “Differential dynamic programming with nonlinear constraints,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 695–702.
[19] S. Yakowitz, “The stagewise Kuhn-Tucker condition and differential dynamic programming,” IEEE transactions on automatic control, vol. 31, no. 1, pp. 25–30, 1986.
[20] G. Lantoine and R. P. Russell, “A hybrid differential dynamic programming algorithm for constrained optimal control problems. part 1: Theory,” Journal of Optimization Theory and Applications, vol. 154, no. 2, pp. 382–417, 2012.
[21] B. Plancher, Z. Manchester, and S. Kuindersma, “Constrained unscented dynamic programming,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 5674–5680.
[22] J. Ma, Z. Cheng, X. Zhang, M. Tomizuka, and T. H. Lee, “Alternating direction method of multipliers for constrained iterative LQR in autonomous driving,” arXiv preprint arXiv:2011.00462, 2020.
[23] A. Pavlov, I. Shames, and C. Manzie, “Interior point differential dynamic programming,” IEEE Transactions on Control Systems Technology, 2021.
[24] S. Singh, J. J. Slotine, and V. Sindhwani, “Optimizing trajectories with closed-loop dynamic SQP,” arXiv preprint arXiv:2109.07081, 2021.
[25] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright, “User’s guide for NPSOL (version 4.0): A Fortran package for nonlinear programming.” Stanford Univ CA Systems Optimization Lab, Tech. Rep., 1986.
[26] P. E. Gill, W. Murray, and M. A. Saunders, “SNOPT: An SQP algorithm for large-scale constrained optimization,” SIAM review, vol. 47, no. 1, pp. 99–131, 2005.
[27] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter, “Differentiable convex optimization layers,” in nips, 2019, pp. 9562–9574.
[28] W. Jin, S. Mou, and G. J. Pappas, “Safe pontryagin differentiable programming,” arXiv preprint arXiv:2105.14937, 2021.
[29] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter, “Differentiable mpc for end-to-end planning and control,” in Advances in Neural Information Processing Systems, 2018, pp. 8289–8300.
[30] C. Mastalli, R. Budhiraja, W. Merkt, G. Saurel, B. Hammoud, M. Naveau, J. Carpentier, L. Righetti, S. Vijayakumar, and N. Mansard, “Crocoddyl: An efficient and versatile framework for multi-contact optimal control,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2536–2542.
[31] W. Murray and F. J. Prieto, “A sequential quadratic programming algorithm using an incomplete solution of the subproblem,” SIAM Journal on Optimization, vol. 5, no. 3, pp. 590–640, 1995.
[32] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos, “The explicit linear quadratic regulator for constrained systems,” Automatica, vol. 38, no. 1, pp. 3–20, 2002.
[33] R. Tedrake, “Underactuated robotics: Learning, planning, and control for efficient and agile machines course notes for MIT 6.832,” MIT, Tech. Rep., 2009.

Appendix A Related Work

iLQR/DDP-like methods for constrained trajectory optimization fall into one of three main categories: control-bounds only [14],[15], modification of the backward pass via KKT analysis [16, 17, 18, 19, 20], and augmented Lagrangian methods [21, 4, 3, 22, 23]. In the first category, only control-bound constraints are considered. For instance, [14] leverage the box-constrained DDP algorithm where the projected Newton algorithm is used to compute the affine perturbation policy in the backward pass, accounting for the box constraints on the control input. On the other hand, [15] composes a “squashing” function to constrain the input, and augments the objective with a barrier penalty to encourage the iterates to stay away from the plateaued regions of squashing function.

More general constraints are considered in the second and third categories [16, 17, 18, 19, 20, 21, 4, 3, 22, 23]. The work in [16] features equality constraints. The update step leverages constraint linearization and a nullspace projection to reduce the problem to a singular optimal control problem in an unconstrained control input lying within the nullspace of the linearized constraints. The works in [17, 18, 19, 20] formulate a constrained backward pass, where the one-step optimization problems now feature general state and control inequality constraints, linearized about the current trajectory iterate, yielding constrained one-step QPs. The algorithm in [18] extracts a guess of the active inequality constraints at each time-step by looking at the current trajectory iterate, and formulates the backward pass using linearized active constraints as equalities. The resulting KKT system is solved using Schur’s complement and using the dual variable extracted from the “feedforward” perturbation (i.e., the control perturbation computed assuming zero state perturbation) to yield an affine perturbation policy. The work in [19] formulates the theoretical underpinnings of such a constrained backward pass by introducing the stage-wise KKT conditions, and implements a very similar algorithm to [18], but where the active set is guessed to be equal to the active set at the feedforward solution, computed using linearized inequality constraints. [20] implements an identical backward pass, but with the active set guessed by extracting the violating constraints from implementing the feedforward perturbation (computed using a trust-region constraint). More recently, [17] leverages a slack-form of the linearized inequality constraints in the one-step optimization problems and uses an iterative procedure to refine the dual variable assuming zero state perturbation. A final solve of the KKT system using Schur’s complement yields the affine perturbation policy. The forward pass of all these methods discard this affine policy and instead use the one-step QPs to compute the control perturbation trajectory. In similar spirit, [23] introduced an interior-point variation of the backward pass, where the one-step optimization is re-written in min-max form w.r.t. the Lagrangian, and linearization of the associated perturbed KKT conditions is used to compute the affine perturbation policy.

The algorithms in [21, 4, 3, 22] incorporate the constraints by forming the augmented Lagrangian, and alternate between unconstrained trajectory optimization (with the augmented Lagrangian as the objective), and updating the dual variables and penalty parameter – loosely mimicking the method of multipliers [6]. [21] implements this in combination with a sampling-based construction of the optimal cost-to-go approximation; [4] dovetails the multipliers algorithm with Newton-based projections to project the intermediate solution onto the current active set; [3] adopts an ADMM-based solution, by leveraging indicator function representations of the constraints and alternates between an unconstrained TV-LQR problem, linearized constraint projection, consensus update, and dual update.

In contrast to the methods above, our approach does not involve guessing active constraint sets for the one-step QPs and instead re-interprets the DDP-style control law from the lens of sensitivity analysis about a known feasible perturbation sequence. We use this interpretation to derive an augmented “backward pass” where we additionally back-propagate a “next-step” constraint set, along with the quadratic cost-to-go parameters. Further, we demonstrate how to approximate the desired feedback gains across all time-steps in parallel, dispelling the need for backward passes which are numerically less robust. Finally, we embed the closed-loop rollout within a theoretically sound SQP framework.

Appendix B SQP Preliminaries

In this appendix, we outline the main steps of the SQP algorithm. These steps represent a simplified version of the commercial code NPSOL [25], and use the termination conditions from the commercial code SNOPT [26]. Consider the following smooth optimization problem:

\begin{split}\mathcal{P}:\quad\min_{x\in\mathbb{R}^{n}}\quad&f(x)\\ \mathrm{s.t.}\quad&c(x)\in\mathbb{R}^{m}_{\geq 0},\end{split}

(17)

where the functions $f$ and $c$ are at least $\mathcal{C}^{2}$ . Define the Lagrangian as:

\mathcal{L}(x,y)=f(x)-y^{T}c(x).

A standard SQP iteration consists of the following steps:

•

Solve a quadratic problem (QP) at current primal-dual point $(x,y)$ :

\begin{split}\delta x^{*}=\operatornamewithlimits{\mathrm{argmin}}_{\delta x}\quad&g(x)^{T}\delta x+\dfrac{1}{2}\delta x^{T}H(x,y)\delta x\\ \mathrm{s.t.}\quad&c(x)+J(x)\delta x\geq 0.\end{split}

where $g(x)=\nabla_{x}f(x)$ , $H(x,y)=\nabla^{2}_{xx}\mathcal{L}(x,y)$ , and $J(x)=\partial c(x)/\partial x$ . Let $\hat{y}$ be the optimal dual variable for the inequality constraint above, and define the primal-dual search directions $(\delta x^{*},\delta y^{*})$ , where $\delta y^{*}:=\hat{y}-y$ .

•

Define the Merit function to be the augmented Lagrangian:

$\mathcal{M}(x,y,s;\rho)=f(x)-y^{T}(c(x)-s)+\dfrac{1}{2}\rho\|c(x)-s\|^{2},$

where $s\in\mathbb{R}^{m}_{\geq 0}$ is the inequality “slack,” defined only for the line-search, and $\rho$ is a non-negative penalty parameter. At the current iterate $(x,y)$ and penalty value $\rho$ , set $s$ as follows:

$s=\begin{cases}\max(0,c(x))\quad&\text{if}\ \rho=0\\ \max\left(0,c(x)-\dfrac{1}{\rho}y\right)\quad&\text{otherwise}\end{cases},$

where the $\max$ operator is component-wise, and set the slack search direction $\delta s^{*}$ as $c(x)+J(x)\delta x^{*}-s$ .

•

Define the line-search function $\phi(\alpha;\rho):=\mathcal{M}(x+\alpha\delta x^{*},y+\alpha\delta y^{*},s+\alpha\delta s^{*};\rho)$ and pick the updated penalty parameter $\rho^{+}$ s.t. $\phi^{\prime}(0;\rho^{+})\leq-(1/2)\Delta^{*}$ , where $\Delta^{*}$ is the decrement, defined as: $\delta x^{*^{T}}H(x,y)\delta x^{*}$ . A simple update rule is given below [31]:

\rho^{+}\leftarrow\begin{cases}\rho\quad&\text{if}\quad\phi^{\prime}(0;\rho)\leq-\dfrac{1}{2}\Delta^{*}\\ \max\left\{2\rho,\dfrac{g(x)^{T}\delta x+\frac{1}{2}\Delta^{*}+(2y-\hat{y})^{T}(c(x)-s)}{\|c(x)-s\|^{2}}\right\}\quad&\text{otherwise}.\end{cases}

(18)

•

Compute (e.g., using backtracking, safe-guarded polynomial interpolation, etc.) the largest step length $\alpha\in[\underline{\alpha},1]$ , where $\underline{\alpha}$ is a user-specified lower-bound, s.t. the following line-search conditions are satisfied:

\phi(\alpha;\rho^{+})-\phi(0;\rho^{+})\leq\sigma\alpha\phi^{\prime}(0;\rho^{+}),\quad\text{and}\quad|\phi^{\prime}(\alpha;\rho^{+})|\leq-\eta\phi^{\prime}(0;\rho^{+}),

(19)

where $0<\sigma\leq\eta<\frac{1}{2}$ .

•

Update $(x,y)$ using the computed step length, and set $\rho\leftarrow\rho^{+}$ .

Note that this stripped-down version of SQP does not account for infeasibility detection, which involves slackened forms of the QP sub-problem.

The termination conditions are based upon specified relative tolerances $\tau_{p},\tau_{d}\in\mathbb{R}_{>0}$ . Define $\tau_{x}:=\tau_{p}(1+\|x\|)$ , and $\tau_{y}:=\tau_{d}(1+\|y\|)$ . Then, convergence to a KKT stationary point is declared and the algorithm terminated if:

\begin{cases}\min_{i}c_{i}(x)\geq-\tau_{x}\\ \min_{j}y_{j}\geq-\tau_{y}\\ \|c(x)\circ y\|_{\infty}\leq\tau_{y}\\ \|g(x)-J(x)^{T}y\|_{\infty}\leq\tau_{y}\end{cases}

(20)

where $\circ$ denotes the Hadamard product.

Appendix C Details for Shooting SQP

We provide here the explicit expressions used within various stages of Shooting SQP.

C-A QP Sub-Problem

The linear term in (4a) and (5) is given as follows:

\begin{split}\langle\bm{\delta u},\nabla_{\bm{u}}\mathcal{J}_{R}(\bm{u})\rangle&=\sum_{k=0}^{N-1}\nabla_{x}l_{k}^{T}\delta x_{k}+\nabla_{u}l_{k}^{T}\delta u_{k}+\nabla_{x}l_{N}^{T}\delta x_{N}\\ &:=\sum_{k=0}^{N-1}q_{k}^{T}\delta x_{k}+r_{k}^{T}\delta u_{k}+q_{N}^{T}\delta x_{N}\end{split}

(21)

where $\delta x_{k}$ is defined recursively using the linearized dynamics:

\delta x_{0}=0,\ \ \delta x_{k+1}=A_{k}\delta x_{k}+B_{k}\delta u_{k},\quad k=0,\ldots,N-1.

(22)

The Hessian term in (4a) and (5) is given by:

\begin{split}\langle\bm{\delta u},\nabla_{\bm{u}}^{2}\mathcal{L}_{R}(\bm{u},\bm{y})\bm{\delta u}\rangle=&\sum_{k=0}^{N-1}\begin{bmatrix}\delta x_{k}\\ \delta u_{k}\end{bmatrix}^{T}\overbrace{\begin{bmatrix}\nabla_{xx}^{2}\hat{H}_{k}&\nabla_{xu}^{2}\hat{H}_{k}\\ *&\nabla_{uu}^{2}\hat{H}_{k}\end{bmatrix}}^{:=Z_{k}}\begin{bmatrix}\delta x_{k}\\ \delta u_{k}\end{bmatrix}+\delta x_{N}^{T}\overbrace{\nabla_{xx}^{2}\hat{l}_{N}}^{:=Z_{N}}\delta x_{N},\\ &\qquad\qquad\text{where:}\\ &\hat{l}_{k}(x_{k},u_{k},y_{k}):=l_{k}(x_{k},u_{k})-y_{k}^{T}c_{k}(x_{k},u_{k}),\quad k=0,\ldots,N-1\\ &\hat{l}_{N}(x_{N},y_{N}):=l_{N}(x_{N})-y_{N}^{T}c_{N}(x_{N})\\ &\hat{H}_{k}=\hat{l}_{k}(x_{k},u_{k},y_{k})+\hat{\nu}_{k+1}^{T}f(x_{k},u_{k}),\quad k=0,\ldots,N-1\\ &\begin{cases}\hat{\nu}_{N}=\nabla_{x}\hat{l}_{N}(x_{N},y_{N})\\ \hat{\nu}_{k}=\nabla_{x}\hat{l}_{k}(x_{k},u_{k},y_{k})+A_{k}^{T}\hat{\nu}_{k+1},\ k=N-1,\ldots,1\end{cases}\end{split}

(23)

Remark 4.

The matrices $Z_{k}$ above may not be positive semi-definite. To ensure problem (4) is convex, we project $Z_{k}$ to the positive semi-definite cone with some $\epsilon>0$ tolerance.

C-B Line-Search

Define $s_{k}=(s_{k}^{x},s_{k}^{u})$ where $s_{k}^{x}$ and $s_{k}^{u}$ have the same dimensionality as $c_{k}^{x}$ and $c_{k}^{u}$ respectively. We use the augmented Lagrangian function as the merit function for line-search:

\mathcal{M}_{I}(\bm{u},\bm{y},\bm{s};\rho)=\sum_{k=0}^{N-1}l_{k}-y_{k}^{T}(c_{k}-s_{k})+\dfrac{\rho_{k}}{2}\|c_{k}-s_{k}\|^{2}+l_{N}-y_{N}^{T}(c_{N}-s_{N})+\dfrac{\rho_{N}}{2}\|c_{N}-s_{N}\|^{2},

(24)

where the dependence on $\bm{x}$ is implicit, i.e., $\bm{x}=\bm{x}[\bm{u}]$ . The non-negative constants $\{\rho_{k}\}_{k=0}^{N}$ are a set of penalty parameters, which for brevity, we denote by $\rho$ . The vectors $s_{k}$ represent a set of slack vectors, used only for the line-search. At the current iterate $(\bm{u},\bm{x})$ and set of penalty parameters $\rho$ , set $s_{k}$ as

s_{k}=\begin{cases}\max(0,c_{k})\quad&\text{if}\ \rho_{k}=0\\ \max(0,c_{k}-y_{k}/\rho)\quad&\text{otherwise}\end{cases},

and set $\delta s_{k}^{*}=(\delta s_{k}^{x^{*}},\delta s_{k}^{u^{*}})$ , the slack-search directions, as:

\begin{split}\delta s_{k}^{x^{*}}&=c_{k}^{x}(x_{k})+J_{k}^{x}\delta x_{k}^{*}-s_{k}^{x}\\ \delta s_{k}^{u^{*}}&=c_{k}^{u}(u_{k})+J_{k}^{u}\delta u_{k}^{*}-s_{k}^{u}.\end{split}

As in Appendix B, we now must compute an updated set of penalty parameters $\rho^{+}$ so that $\phi^{\prime}(0;\rho^{+})\leq-(1/2)\langle\bm{\delta u}^{*},\nabla_{\bm{u}}^{2}\mathcal{L}_{R}(\bm{u},\bm{y})\bm{\delta u}^{*}\rangle:=-(1/2)\Delta^{*}$ . To derive this update, let us consider an abstract optimization problem of the form:

\begin{split}\min_{x\in\mathbb{R}^{n}}\quad&f(x)\\ \mathrm{s.t.}\quad&c_{k}(x)\geq 0,\quad k=0,\ldots,N,\end{split}

where $\{c_{k}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m_{k}},k=0,\ldots,N\}$ is a set of vector-valued inequality constraints, with corresponding Jacobians $J_{k}(x):=\partial c_{k}(x)/\partial x$ . Let $\bm{y}=(y_{0},\ldots,y_{N})$ be the stacked dual vector for the inequality constraints, $H(x,\bm{y})$ the Lagrangian Hessian, and $g(x)$ the objective gradient. The QP sub-problem is then defined as:

\begin{split}\delta x^{*}=\operatornamewithlimits{\mathrm{argmin}}_{\delta x\in\mathbb{R}^{n}}\quad&\psi(\delta x):=g(x)^{T}\delta x+\dfrac{1}{2}\delta x^{T}H(x,\bm{y})\delta x\\ \mathrm{s.t.}\quad&c_{k}(x)+J_{k}(x)\delta x\geq 0,\quad k=0,\ldots,N.\end{split}

Let $\hat{y}_{k}$ be the optimal dual vector for the QP-subproblem for constraint $k$ and define $\delta y_{k}^{*}:=\hat{y}_{k}-y_{k}$ . Consider the Augmented Lagrangian, $\mathcal{M}$ , defined as:

\mathcal{M}(x,\bm{y},\bm{s};\{\rho_{k}\}_{k=0}^{N})=f(x)-\sum_{k=0}^{N}y_{k}^{T}(c_{k}(x)-s_{k})+\sum_{k=0}^{N}\dfrac{1}{2}\rho_{k}\|c_{k}(x)-s_{k}\|^{2}\quad(s_{k}\geq 0),

where $s_{k}\in\mathbb{R}^{m_{k}}_{\geq 0}$ is the “slack” for the $k^{\mathrm{th}}$ constraint, and $\bm{s}$ denotes the concatenatation $(s_{0},\ldots,s_{N})$ . At the current iterate $(x,\bm{y})$ set the slack vectors $s_{k}$ as follows:

s_{k}=\begin{cases}\max(0,c_{k})\quad&\text{if}\ \rho_{k}=0\\ \max(0,c_{k}-y_{k}/\rho)\quad&\text{otherwise}\end{cases},

and set the slack search directions $\delta s_{k}^{*}=c_{k}(x)+J_{k}(x)\delta x^{*}-s_{k}$ . Define the line-search function $\phi(\alpha;\rho)$ as:

\phi(\alpha;\rho):=\mathcal{M}(x+\alpha\delta x^{*},\bm{y}+\alpha\bm{\delta y},\bm{s}+\alpha\bm{\delta s};\rho).

Now, we wish to choose the set of penalty parameters $\{\rho_{k}^{+}\}_{k=0}^{N}$ s.t. $\phi^{\prime}(0;\rho^{+})\leq-\dfrac{1}{2}\delta x^{*^{T}}H(x,\bm{y})\delta x^{*}:=-\dfrac{1}{2}\Delta^{*}$ . Consider the gradient $\phi^{\prime}(0;\rho^{+})$ :

\begin{split}\phi^{\prime}(0;\rho^{+})&=g^{T}\delta x^{*}+\sum_{k=0}^{N}\left[-\delta x^{*^{T}}J_{k}^{T}y_{k}+\rho_{k}^{+}\delta x^{*^{T}}J_{k}^{T}(c_{k}-s_{k})-(c_{k}-s_{k})^{T}\delta y_{k}^{*}+y_{k}^{T}\delta s_{k}^{*}-\rho_{k}^{+}\delta s_{k}^{*^{T}}(c_{k}-s_{k})\right]\\ &=g^{T}\delta x^{*}+\sum_{k=0}^{N}(2y_{k}-\hat{y}_{k})^{T}(c_{k}-s_{k})-\rho_{k}^{+}\|c_{k}-s_{k}\|^{2},\end{split}

where we have used the definitions of $\delta s_{k}^{*}$ and $\delta y_{k}^{*}$ . Notice now that if $c_{k}-s_{k}=0$ for all $k$ , then since $s_{k}\geq 0$ , $\delta x=0$ is a feasible solution to the QP sub-problem. Thus, $\psi(\delta x^{*})\leq\psi(0)$ , from which one obtains the desired inequality. Consider then the case that the set $\mathcal{I}:=\{k:\|c_{k}-s_{k}\|>0\}$ is non-empty. Then, we can re-write $\phi^{\prime}(0;\rho^{+})$ as:

\sum_{k\in\mathcal{I}}\dfrac{1}{|\mathcal{I}|}g^{T}\delta x^{*}+(2y_{k}-\hat{y}_{k})^{T}(c_{k}-s_{k})-\rho_{k}^{+}\|c_{k}-s_{k}\|^{2}

Now, for $\rho_{k}^{+}=\hat{\rho}_{k},k\in\mathcal{I}$ , defined as:

\hat{\rho}_{k}:=\dfrac{\frac{1}{|\mathcal{I}|}\psi(\delta x^{*})+(2y_{k}-\hat{y}_{k})^{T}(c_{k}-s_{k})}{\|c_{k}-s_{k}\|^{2}},

we obtain:

\phi^{\prime}\left(0;\{\hat{\rho}_{k}\}_{k\in\mathcal{I}}\cup\{\rho_{k}\}_{k\notin\mathcal{I}}\right)=-\dfrac{1}{2}\Delta^{*}.

Thus, the update equation for $\{\rho_{k}\}$ may be written as:

\rho_{k}^{+}=\begin{cases}\max(2\rho_{k},\hat{\rho}_{k})\quad&\text{if}\quad k\in\mathcal{I}\\ \rho_{k}\quad&\text{otherwise}.\end{cases}

In context of Section II-A, the term $\psi(\delta x^{*})$ is equivalent to the optimal objective of problem (4), while the correspondences for the dual and slack vectors follows straightforwardly. The line-search acceptance conditions are as given in (19).

C-C Termination Conditions

Let $\tau_{p},\tau_{d}\in\mathbb{R}_{>0}$ be user-specified relative tolerances, and define $\tau_{x}=\tau_{p}(1+\|\bm{u}\|)$ and $\tau_{y}=\tau_{d}(1+\|\bm{y}\|)$ . Then, the KKT stationarity termination conditions are given as:

\begin{split}&k=0,\ldots,N:\begin{cases}c_{k}\geq-\tau_{x}\\ y_{k}\geq-\tau_{y}\\ \|c_{k}\circ y_{k}\|_{\infty}\leq\tau_{y}\end{cases}\\ &k=0,\ldots,N-1:\|\nabla_{u_{k}}{\hat{H}}_{k}\|_{\infty}\leq\tau_{y}\end{split}

(25)

where $\circ$ denotes the Hadamard product, and $\hat{H}_{k}$ , the Hamiltonian, is defined in (23).

Appendix D Computing Sensitivity Gains using DP

Proof of Theorem 1.

We will prove the result via induction. Notice that for the base case $k=N-1$ , the cost-to-go function $\delta V_{N}(\delta x_{N})$ is defined only for $\delta x_{N}\in\mathcal{X}_{N}$ , which by construction, is the set $\mathcal{CR}_{N}$ . Thus $\delta x_{N}^{*}\in\mathcal{CR}_{N}$ and $\delta\tilde{V}_{N}(\delta x_{N})=\delta V_{N}(\delta x_{N})$ for $\delta x_{N}\in\mathcal{CR}_{N}$ . It follows that problem (13) for $k=N-1$ coincides exactly with the definition for $\delta\pi_{N-1}^{*}$ given in (9). Thus, $\delta\hat{\pi}^{*}_{N-1}(\delta x_{N-1})=\delta\pi^{*}_{N-1}(\delta x_{N-1})$ for all $\delta x_{N-1}\in\delta X_{N-1}$ where the problem is feasible.

Now, notice that problem (13) for $k=N-1$ is a multi-parametric QP in the “parameter” $\delta x_{N-1}$ . Thus, by Theorem 2 in [32], linear independence of the active inequality constraints at $\delta\hat{\pi}^{*}_{N-1}(\delta x_{N-1}^{*})$ implies that the solution functions $\delta\hat{\pi}^{*}_{N-1}(\delta x_{N-1})$ and $\hat{y}^{*}_{N-1}(\delta x_{N-1})$ are locally³³3One can actually show that the solution function $\delta\hat{\pi}^{*}_{N-1}$ is in fact piecewise affine over $\delta\mathcal{X}_{N-1}$ however this is not necessary for the proof. affine in a region containing $\delta x_{N-1}^{*}$ , taking the form in (14). This region containing $\delta x_{N-1}^{*}$ is precisely the set:

\{\delta x_{N-1}\in\delta\mathcal{X}_{N-1}:\delta\hat{\pi}^{*}_{N-1,a}(\delta x_{N-1})=\delta\hat{\pi}^{*}_{N-1}(\delta x_{N-1})\}.

A second consequence of the aforementioned theorem is that the above set is defined by the intersection of the following two sets:

\begin{split}&\{\delta x_{N-1}\in\delta\mathcal{X}_{N-1}:\bar{G}_{N-1}\begin{bmatrix}\delta\hat{\pi}^{*}_{N-1,a}(\delta x_{N-1})\\ A_{N-1}\delta x_{N-1}+B_{N-1}\hat{\pi}^{*}_{N-1,a}(\delta x_{N-1})\end{bmatrix}\leq\bar{h}_{N-1}\}\\ &\{\delta x_{N-1}\in\delta\mathcal{X}_{N-1}:\hat{y}^{*}_{N-1,a}(\delta x_{N-1})\geq 0\},\end{split}

that is, the subset of $\delta\mathcal{X}_{N-1}$ where the locally affine solutions satisfy the primal-dual feasibility constraints of problem (13) for $k=N-1$ . This however is precisely the definition of the set $\mathcal{CR}_{N-1}$ in (15). Thus, we have established that $\delta\hat{\pi}^{*}_{N-1,a}(\delta x_{N-1})=\delta\pi^{*}_{N-1}(\delta x_{N-1})$ for $\delta x_{N-1}\in\mathcal{CR}_{N-1}$ and by construction, $\delta x_{N-1}^{*}\in\mathcal{CR}_{N-1}$ . Substituting the affine feedback law into the objective for problem (13) for $k=N-1$ gives the recursion for $\{P_{N-1},p_{N-1},v_{N-1}\}$ , as defined in (16). Thus, $\delta\tilde{V}_{N-1}(\delta x_{N-1})=\delta V_{N-1}(\delta x_{N-1})$ for $\delta x_{N-1}\in\mathcal{CR}_{N-1}$ , completing the proof for $k=N-1$ .

Suppose then that the following statements are true for some $k+1\leq N-1$ : (i) $\delta x_{k+1}^{*}\in\mathcal{CR}_{k+1}$ , and (ii) $\delta V_{k+1}(\delta x_{k+1})=\delta\tilde{V}_{k+1}(\delta x_{k+1})$ for $\delta x_{k+1}\in\mathcal{CR}_{k+1}$ . Consider the definition of $\delta\pi^{*}_{k}$ given in (9), and define:

\mathrm{Pre}^{*}(\mathcal{CR}_{k+1}):=\{\delta x_{k}\in\delta\mathcal{X}_{k}:A_{k}\delta x_{k}+B_{k}\delta\pi^{*}_{k}(\delta x_{k})\in\mathcal{CR}_{k+1}\}

Since $\delta x_{k+1}^{*}\in\mathcal{CR}_{k+1}$ by the inductive hypothesis, it follows that $\delta x_{k}^{*}\in\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ and hence, the set is non-empty. Now, for all $\delta x_{k}\in\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ , one may add the redundant constraint: $\delta x_{k+1}\in\mathcal{CR}_{k+1}$ and equivalently re-write $\delta\pi_{k}^{*}$ as:

\begin{split}\operatornamewithlimits{\mathrm{argmin}}_{\delta u_{k}\in\delta\mathcal{U}_{k}}\quad&\tilde{l}_{k}(\delta x_{k},\delta u_{k})+\delta V_{k+1}(\delta x_{k+1})\\ \mathrm{s.t.}\quad&\delta x_{k+1}\in\mathcal{CR}_{k+1}.\end{split}

Now, leveraging the inductive hypothesis that $\delta\tilde{V}_{k+1}(\delta x_{k+1})=\delta V_{k+1}(\delta x_{k+1})$ for all $\delta x_{k+1}\in\mathcal{CR}_{k+1}$ , we can re-write $\delta\pi_{k}^{*}$ as:

\begin{split}\operatornamewithlimits{\mathrm{argmin}}_{\delta u_{k}\in\delta\mathcal{U}_{k}}\quad&\tilde{l}_{k}(\delta x_{k},\delta u_{k})+\delta\tilde{V}_{k+1}(\delta x_{k+1})\\ \mathrm{s.t.}\quad&\delta x_{k+1}\in\mathcal{CR}_{k+1}.\end{split}

Comparing with problem (13), this is precisely the definition of $\delta\hat{\pi}^{*}_{k}$ . Thus, we have that $\delta\hat{\pi}^{*}_{k}(\delta x_{k})=\delta\pi^{*}_{k}(\delta x_{k})$ for $\delta x_{k}\in\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ .

Now, consider the fixed affine feedback law $\delta\hat{\pi}^{*}_{k,a}$ in (14), that is well-defined thanks to the linear independence assumption at $\delta x_{k}^{*}$ . Once again, applying Theorem 2 in [32], we have that $\delta\hat{\pi}^{*}_{k}(\delta x_{k})=\delta\hat{\pi}^{*}_{k,a}(\delta x_{k})$ for $\delta x_{k}\in\mathcal{CR}_{k}$ , as defined in (15). Substituting the affine law into the objective of problem (13) yields the recursion in (16); thus, $\delta\tilde{V}_{k}(\delta x_{k})=\delta V_{k}(\delta x_{k})$ for all $\delta x_{k}\in\mathcal{CR}_{k}\cap\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ . Further, by construction, $\delta x_{k}^{*}\in\mathcal{CR}_{k}$ ; thus, the intersection $\mathcal{CR}_{k}\cap\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ is non-empty.

To finish the proof, we must show that $\mathcal{CR}_{k}\subseteq\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ . By doing so, we establish the equivalency between the fixed affine feedback law $\delta\hat{\pi}^{*}_{k,a}$ and $\delta\pi^{*}_{k}$ and consequently, the equivalency of $\delta V_{k}$ and $\delta\tilde{V}_{k}$ for $\delta x_{k}\in\mathcal{CR}_{k}$ .

To show this last part, notice that for any $\delta x_{k}$ that lies on the boundary $\partial\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ , it holds that $A_{k}\delta x_{k}+B_{k}\delta\pi_{k}^{*}(\delta x_{k})$ lies on the boundary $\partial\mathcal{CR}_{k+1}$ . This follows from the continuity of $\delta\pi_{k}^{*}(\delta x_{k})$ . Then, by the equivalence of $\delta\pi^{*}_{k}$ and $\delta\hat{\pi}_{k}^{*}$ for $\delta x_{k}\in\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ and the equivalence of $\delta\hat{\pi}^{*}_{k}$ and $\delta\hat{\pi}^{*}_{k,a}$ for $\delta x_{k}\in\mathcal{CR}_{k}$ , for any $\delta x_{k}\in\mathcal{CR}_{k}\cap\partial\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ , it holds that $A_{k}\delta x_{k}+B_{k}\delta\hat{\pi}_{k,a}^{*}(\delta x_{k})\in\partial\mathcal{CR}_{k+1}$ . It follows that $\delta x_{k}\in\partial\mathcal{CR}_{k}$ .

Thus, since the intersection $\mathcal{CR}_{k}\cap\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ is non-empty, then either (i) $\mathcal{CR}_{k}\subset\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ , or (ii) $\mathcal{CR}_{k}\cap\partial\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ is non-empty. In the latter case, we have shown that is the set $\partial\mathcal{CR}_{k}\cap\partial\mathrm{Pre}^{*}(\mathcal{CR}_{k+1})$ , completing the proof.

∎

Appendix E Experiment Details

The following hyperparameters were held fixed for all Shooting SQP variations over all environments:

Parameter	Value
Line-search decrease ratio: $\sigma$	0.4
Line-search gradient ratio: $\eta$	0.49
Line-search step-size lower-bound: $\underline{\alpha}$	1e-5
Termination Primal tolerance: $\tau_{p}$	1e-3
Termination Dual tolerance: $\tau_{d}$	1e-3*

TABLE III: Hyperparameters for Shooting SQP. (*) For the Quad-Pendulum example, the dual tolerance

\tau_{d}

was set as

10^{-2}

In the following, we give details specific to each environment case-study.

E-A Motion Planning for a Car

This is a system with 4 states: $x=(p_{x},p_{y},\theta,v)$ , where $(p_{x},p_{y})\in\mathbb{R}^{2}$ is the 2D position, $\theta\in\mathcal{S}^{1}$ is the orientation, and $v\in\mathbb{R}$ is the forward velocity. The continuous-time dynamics are given by:

\dot{p}_{x}=v\sin\theta,\quad\dot{p}_{y}=v\cos\theta,\quad\dot{\theta}=vu^{\theta},\quad\dot{v}=u^{v},

where $(u^{\theta},u^{v})$ are the steering and acceleration inputs. As in [18], we use Euler integration with a time-step of $0.05$ s to yield the discrete-time model. The control limits are taken from [17]: $u^{\theta}\in[-\frac{\pi}{3},\frac{\pi}{3}]$ rad/s, and $u^{v}\in[-6,6]\ \mathrm{s}^{-2}$ .

The stage cost is $l_{k}(x,u)=0.05u^{T}Ru$ with $R=\mathrm{diag}(0.2,0.1)$ , and terminal cost is $l_{N}(x)=(x-x_{g})^{T}Q_{N}(x-x_{g})$ with $Q_{N}=\mathrm{diag}(50,50,50,10)$ and $x_{g}=(3,3,\frac{\pi}{2},0)$ . The horizon $N$ is 40 steps. The initial control sequence guess was all zeros. The constant $\gamma$ for $\texttt{\bf{CL}}_{\gamma}$ was set to $10^{-4}$ .

Figure 4 shows the solver progress for all algorithms and cases.

E-B Acrobot

We study the classic acrobot “swing-up” task with $n=4$ states and $m=1$ control input. The state is defined as $x=(q_{1},q_{2},v_{1},v_{2})$ where $q_{1}$ the root joint angle, $q_{2}$ is the relative elbow joint angle, and $v_{1},v_{2}$ are their respective angular rates. The dynamics were taken from [33]. Starting from an initial state $x_{0}=0$ , we require the acrobot to swing up to the goal state $x_{g}=(\pi,0,0,0)$ , subject to the control limits $u\in[-2,2]$ . The stage cost is $l_{k}(x,u)=\frac{1}{2}\left[w_{1}(\cos q_{1}+\cos(q_{1}+q_{2})+2)+w_{2}u^{2}\right]$ , and the terminal cost is $l_{N}(x)=\frac{1}{2}w_{3}\|x-x_{g}\|^{2}$ , with weights $(w_{1},w_{2},w_{3})=(0.1,0.01,10)$ . Additionally, we enforce the limit constraints⁴⁴4Due to the fact that these two state components really live on $\mathcal{S}^{1}\times\mathcal{S}^{1}$ , we employ chart switching to smoothly handle this constraint near the boundary $-\pi/\pi$ . This is achieved by adjusting the boundary constraints for each angle to lie in $[-\pi\pm\delta_{k},\pi\pm\delta_{k}]$ for the next iterate, where the per time-step adjustment $\delta_{k}$ is chosen based on the current values for $q_{1}$ and $q_{2}$ at time-step $k$ . We also wrap both angles back to the interval $[-\pi,\pi]$ at the end of each iteration. $(q_{1},q_{2})\in[-\pi,\pi]^{2}$ and the terminal constraint: $\|x_{N}-x_{g}\|^{2}\leq 0.2^{2}$ . The discrete-time dynamics were obtained using explicit Euler integration with a time-step of 0.05s, and the horizon $N$ was $150$ steps.

To generate the initial guess, we set $\bm{\chi}_{0}$ , the initial guess for the state-trajectory to be a linear interpolation between $x_{0}$ and $x_{g}$ , and $\bm{\mu}_{0}$ , the initial guess for the controls as all zeros. Then, the following rollout was used to generate the initial control trajectory for SQP:

x_{k+1}=f(x_{k},\mathrm{clip}_{\underline{u}}^{\overline{u}}[\mu_{k}+K_{k}^{\mathrm{lqr}}(x_{k}-\chi_{k})]),\ x_{0}=0,\ k=0,\ldots,N-1,

where the gains $\{K_{k}^{\mathrm{lqr}}\}$ were computed via a TV-LQR solve, with cost defined by the Hessian of the objective and dynamics linearized about $(\bm{\chi}_{0},\bm{\mu}_{0})$ .

We found that for this problem, the Gauss-Newton Hessian approximation was more stable for both algorithms. This entails dropping the second-order gradients stemming from $\hat{\nu}_{k+1}^{T}f(x_{k},u_{k})$ in (23). The constant $\gamma$ for $\texttt{\bf{CL}}_{\gamma}$ was set to $10^{-4}$ . Figure 5 illustrates solver progress.

Despite sub-optimal step-sizes for both algorithms (due to the natural instabilities of the dynamics), $\texttt{\bf{CL}}_{\gamma}$ successfully converges within 33 iterations while OL is still struggling to resolve the Lagrangian stationarity residual at 100 iterations.

E-C Quad-Pendulum

The state is defined as $x=(p_{x},p_{z},\theta,\phi,\dot{p}_{x},\dot{p}_{z},\dot{\theta},\dot{\phi})$ , where $(p_{x},p_{z})\in\mathbb{R}^{2}$ is the CoM position of the quadrotor in the vertical plane, $\theta\in\mathcal{S}^{1}$ is its orientation, and $\phi\in\mathcal{S}^{1}$ is the orientation of the attached pendulum w.r.t. the vertical. The two control inputs $u=(u_{1},u_{2})$ corresponding to the rotor thrusts are subject to the limits $[0.1,3]m_{q}g$ , where $m_{q}$ is the mass of the quadrotor and $g$ is the gravitational acceleration.

We provide the Lagrangian characterization of the dynamics. The mass matrix $M(q)$ , as a function of the generalized coordinates $q=(p_{x},p_{z},\theta,\phi)$ is given by:

\begin{bmatrix}m_{q}+m_{p}&0&0&m_{p}L\cos\phi\\ 0&m_{q}+m_{p}&0&m_{p}L\sin\phi\\ 0&0&J&0\\ m_{p}L\cos\phi&m_{p}L\sin\phi&0&m_{p}L^{2}\end{bmatrix},

where $m_{p}$ is the mass of the pendulum (assumed concentrated at the endpoint), $L$ is the pendulum length, and $J$ is the quadrotor moment of inertia. The kinetic energy is thus: $T(q,\dot{q})=\frac{1}{2}\dot{q}^{T}M(q)\dot{q}$ ; the potential energy is $V(q)=m_{q}gp_{z}+m_{p}g(p_{z}-L\cos\phi)$ , and the mechanical Lagrangian is $\mathcal{L}(q,\dot{q})=T(q,\dot{q})-V(q)$ . The dynamics are thus given as:

\dfrac{d}{dt}\left(\dfrac{\partial\mathcal{L}(q,\dot{q})}{\partial\dot{q}}\right)-\dfrac{\partial\mathcal{L}(q,\dot{q})}{\partial q}=F(q),

where $F(q)$ , the generalized force vector is given as:

F(q)=\begin{bmatrix}-(u_{1}+u_{2})\sin\theta\\ (u_{1}+u_{2})\cos\theta\\ (u_{1}-u_{2})l-\tau_{f}\\ \tau_{f}\end{bmatrix},

where $2l$ is the quadrotor wing-span, $\tau_{f}$ is the viscous frictional torque: $-\nu(\dot{\phi}-\dot{\theta})$ , where $\nu$ is the constant friction coefficient. We used the constants: $m_{q}=0.486$ , $m_{p}=0.2m_{q}$ , $l=0.25$ , $L=2l$ , $g=9.81$ , $J=0.00383$ , $\nu=0.01$ , and employed Euler discretization with time-step 0.025 s.

The stage cost is $l_{k}(x,u)=\frac{1}{2}(w_{1}(\|(p_{x},p_{z},\theta)-(g_{x},g_{z},g_{\theta})\|^{2}+(1+\cos\phi))+w_{2}\|u-u_{h}\|^{2})$ , and the terminal cost is $l_{N}(x)=\frac{1}{2}(x-x_{g})^{T}Q_{N}(x-x_{g})$ , where $x_{g}=(g_{x},g_{z},g_{\theta},g_{\phi},0,0,0,0)$ is the goal state with: $(g_{x},g_{z},g_{\theta},g_{\phi})=(3,-1.5,0,\pi)$ . The vector $u_{h}=0.5(m_{q}+m_{p})g\bm{1}$ is the hover thrust setpoint, and $Q_{N}=\mathrm{diag}(10,10,1,1,1,1,1,1)$ . The weights are $(w_{1},w_{2},w_{3})=(0.01,0.05,5)$ . The horizon $N$ was 160 steps.

In addition to the obstacle avoidance constraints, we also set the operational constraints $(p_{x},p_{z})\in[-4,4]\times[-2,2]$ and $\theta\in[-\frac{3\pi}{4},\frac{3\pi}{4}]$ . The initial control sequence guess was set to $u_{h}$ , the hover setpoint, for all timesteps.

Due to the difficulty of this problem, the constant $\gamma$ for $\texttt{\bf{CL}}_{\gamma}$ was initialized at $10^{-3}$ and reduced by a factor of 10 every SQP iteration, until a lower-bound of $10^{-5}$ .

Figure 6 shows a timelapse for the first case solution from $\texttt{\bf{CL}}_{\gamma}$ , and in Figure 7 we plot the re-construction error $\|\delta\hat{\pi}_{k}^{*}(\delta x_{k}^{*})-\delta u_{k}^{*}\|$ for both cases for $\texttt{\bf{CL}}_{\gamma}$ .