Offline Supervised Learning V.S. Online Direct Policy Optimization: A Comparative Study and A Unified Training Paradigm for Neural Network-Based Optimal Feedback Control

Yue Zhao [email protected] Jiequn Han [email protected]

Abstract

This work is concerned with solving neural network-based feedback controllers efficiently for optimal control problems. We first conduct a comparative study of two prevalent approaches: offline supervised learning and online direct policy optimization. Albeit the training part of the supervised learning approach is relatively easy, the success of the method heavily depends on the optimal control dataset generated by open-loop optimal control solvers. In contrast, direct policy optimization turns the optimal control problem into an optimization problem directly without any requirement of pre-computing, but the dynamics-related objective can be hard to optimize when the problem is complicated. Our results underscore the superiority of offline supervised learning in terms of both optimality and training time. To overcome the main challenges, dataset and optimization, in the two approaches respectively, we complement them and propose the Pre-train and Fine-tune strategy as a unified training paradigm for optimal feedback control, which further improves the performance and robustness significantly. Our code is accessible at https://github.com/yzhao98/DeepOptimalControl.

keywords:

Optimal Control , Deep Learning , Open-Loop Control , Closed-Loop Control

^†^†journal: Physica D

\affiliation

[label1]organization=Center for Data Science, Peking University,addressline=No. 5 Yiheyuan Road, city=Beijing, postcode=100871, country=China

\affiliation

[label2]organization=Flatiron Institute,addressline=162 5th Ave., city=New York, postcode=10010, state=NY, country=USA

1 Introduction

It is ubiquitous and paramount to design optimal feedback controllers [1] for various complex tasks in engineering and industry. Real-world applications are more than challenging due to high-dimensionality and speed requirements for real-time execution. In recent years, deep learning has been introduced to tackle these issues and shown astonishing performances [2, 3, 4, 5]. Generally speaking, given system dynamics explicitly, there are two prevalent types of approaches to training neural network-based optimal feedback controllers: offline supervised learning and online direct policy optimization.

Offline supervised learning is to train a feedback network controller by approximating the corresponding open-loop solutions at different states directly, leveraging the fact that open-loop control for a fixed initial state is much easier to solve than feedback control in optimal control problems (OCP).

The other way, online direct policy optimization, transforms the network-based feedback OCP concerning a distribution of initial states into an optimization problem and solves it directly without pre-computing open-loop optimal control. We call this method direct policy optimization because the objective to minimize comes exactly from the original OCP. With fully-known dynamics, the evolution of states (controlled by a neural network) is governed by a controlled Ordinary Differential Equation (ODE), and one can use stochastic gradient descent to optimize network parameters. This type of approach was first proposed to solve high-dimensional stochastic control problems [2] and recently applied to solve deterministic optimal control problems [4, 6].

The two methods have not, though, hitherto been comparatively studied. Thus, it is imperative to conduct a comparative study of them. Albeit supervised learning problems can be easily optimized, the quality of the learned controller heavily depends on the dataset generated by open-loop optimal control solvers. In this work, we demonstrate that the supervised learning approach holds advantages over direct policy optimization in terms of optimality and training time. The primary benefit of supervised learning stems from the network’s ability to directly learn the optimal control signal from a pre-computed dataset, making the process less challenging compared to the online training required for direct policy optimization. However, direct policy optimization can yield further performance enhancements with a near-optimally initialized network, since its objective takes the controlled dynamics into consideration and aligns more closely with the original goal of the OCP.

Our comparison of the two approaches shows that they are interrelated and complementary, enlightening us to combine them to overcome their inherent limitations. Thereby, we present a unified training paradigm for neural network-based optimal feedback controllers, dubbed as the Pre-Train and Fine-Tune strategy. Initial pre-training through offline supervised learning guides the network to a reasonable solution with a small loss, whilst fine-tuning by online direct policy optimization breaks the limitations of the precomputed dataset and further improves the controller’s performance and robustness.

We summarize our contributions as follows:

1.

We establish a benchmark and comprehensively compare offline supervised learning and online direct policy optimization for solving neural network-based optimal feedback control problems.
2.

We identify the challenges of two methods, which are dataset quality and network optimization, respectively.
3.

Our results underscore the superiority of offline supervised learning to online direct policy optimization in terms of both optimality and training time.
4.

Drawing from our comparative analysis, we propose a new paradigm to train neural network-based feedback controllers, namely Pre-Train and Fine-Tune, which significantly enhances performance and robustness.

2 Preliminaries and Related Works

2.1 Mathematical Formulation

We first consider an open-loop optimal control problem:

\left\{\begin{array}[]{cl}\underset{\bm{u}(t)}{\operatorname{minimize}}&J[\bm{u}(t)]=M\left(\bm{x}\left(T\right)\right)+\displaystyle{\int_{0}^{T}}{L}(\bm{x},\bm{u})dt,\\ \text{ s.t. }&\dot{\bm{x}}(t)=\bm{f}(\bm{x},\bm{u}),\quad\bm{x}(0)=\bm{x}_{0},\\ \end{array}\right.

(1)

where $\bm{x}_{0}\in\mathbb{R}^{n}$ denotes an initial state, $\bm{x}(t)\in\mathbb{R}^{n}$ denotes the state at time $t\in[0,T]$ , $\mathcal{U}$ denotes the admissible control set, and $\bm{u}(t)\in\mathcal{U}\subset\mathbb{R}^{m}$ denotes the open-loop control function. The dynamics is described by a smooth function $\bm{f}:\mathbb{R}^{n}\times\mathcal{U}\rightarrow\mathbb{R}^{n}$ . The total cost $J$ is the sum of the terminal cost $M:\mathbb{R}^{n}\rightarrow\mathbb{R}$ and an integral of the running cost $L:\mathbb{R}^{n}\times\mathcal{U}\rightarrow\mathbb{R}$ , with the assumption that both $M$ and $L$ are differentiable.¹¹1For ease of notation, we assume the dynamics and running cost are time-independent. Both approaches discussed in this paper can be applied straightforwardly to problems with time-dependent dynamics or running cost. We assume the solution to the open-loop OCP (1) exists and is unique, which is $\bm{u}^{*}:[0,T]\rightarrow\mathcal{U}$ that minimizes the total cost under the dynamics and the given initial state.

The open-loop control is designed for a fixed initial state and takes only time $t$ as the input. In contrast, the closed-loop control is designed for various initial states and thus takes both time and the current state as the input, i.e., $\bm{u}(t,\bm{x}):[0,T]\times\mathbb{R}^{n}\rightarrow\mathcal{U}\subset\mathbb{R}^{m}$ . In classical control theory [7], it is well-known that there exists a closed-loop optimal control $\bm{u}^{*}:[0,T]\times\mathbb{R}^{n}\rightarrow\mathcal{U}$ so that $\bm{u}^{*}\left(t,\bm{x}^{*}\left(t;\bm{x}_{0}\right)\right)$ is identical to the open-loop optimal control solution at time $t$ with the initial state $\bm{x}_{0}$ , where $\bm{x}^{*}$ follows $\dot{\bm{x}}^{*}\left(t;\bm{x}_{0}\right)=\bm{f}\left(\bm{x}^{*}\left(t;\bm{x}_{0}\right),\bm{u}^{*}\left(t,\bm{x}^{*}\left(t;\bm{x}_{0}\right)\right)\right.$ . The identity holds for any initial state $\bm{x}_{0}$ . In other words, we can induce a family of open-loop optimal control with any possible initial states if the closed-loop optimal control is given. Based on that fact, we slightly abuse the notation $\bm{u}$ to denote both the closed-loop control and the induced open-loop control. The context of closed-loop or open-loop control can be inferred from the arguments when necessary and will not be confusing. We also use the words feedback and closed-loop interchangeably. We focus on the design of closed-loop optimal controllers, considering they are more reliable and robust to dynamic disturbance and model misspecification in real-world applications compared to the open-loop counterpart. We operate under the assumption that there is a reliable numerical method for deriving the open-loop optimal solution, which implies that the problem is not overly complex, thus making the development of closed-loop optimal control feasible.

Traditional methods for solving the closed-loop optimal control rely on solving the corresponding Hamilton–Jacobi–Bellman (HJB) equation. It is notoriously difficult to solve those equations in high dimensions based on classical grid-based methods, due to the so-called curse of dimensionality [8]. To overcome this essential difficulty, there has been active research in recent years to approximate the control and value functions with other function approximators and proper objective functions to find the optimal approximators. Notable examples of such approximating functions include sparse polynomials [9, 10], sparse grids [11], kernel functions [12], and neural networks [2, 3, 4]. Particularly, neural networks have received significant interest due to their exceptional ability in high-dimensional function approximation and their flexibility in optimization with various loss functions. For instance, it is feasible to employ loss functions linked to the classical Hamilton-Jacobi-Bellman (HJB) equation [10] or the Pontryagin principle [13, 14], especially when the Hamiltonian minimization can be explicitly solved. In this work, we focus on neural network-based methods and loss functions that do not require the previously mentioned condition. The progress made in this direction can be categorized into two families of methods: offline supervised learning and online direct policy optimization. We introduce their details in the following subsections accordingly.

2.2 Offline Supervised Learning

In the supervised learning approach, neural networks are utilized to approximate solutions provided by an optimal control dataset. The learning signal de facto comes from the open-loop optimal control solution, which is much easier to solve than the closed-loop control. It is offline trained and then applied in real-time control. The regression problem is easily formulated and optimized. The key point of this training approach is how to generate high-quality data for training the neural network to achieve optimal control.

Data Generation

Many numerical methods can be applied to solve the open-loop optimal control, such as numerical solvers for the Boundary Value Problem (BVP, 15) and Differential Dynamical Programming (DDP, 16). Since the data is generated offline, the computation time of each trajectory is usually not the primary concern, as long as it is affordable to generate a suitable dataset in the form of $\mathcal{D}=\{(t^{(i)},\bm{x}^{(i)},\bm{u}^{(i)})\},$ where $\bm{u}^{(i)}$ denotes the optimal control at time $t^{(i)}$ and state $\bm{x}^{(i)}$ , with $i$ being the index of data. For instance, [3] solves the corresponding BVP as a necessary condition of the optimal control based on PMP and collects open-loop solutions starting from different initial states and evaluated at different temporal grid points to form the dataset.

Objective

The objective in offline supervised learning is to find a network policy $\bm{u}^{\text{NN}}(t,\bm{x};\bm{\theta})$ with parameter $\bm{\theta}$ that minimizes the least-square error based on the generated dataset $\mathcal{D}$ :

\underset{\bm{\theta}}{\operatorname{minimize}}\quad\frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|}\left\|\bm{u}^{\mathrm{NN}}\left(t^{(i)},\bm{x}^{(i)};\bm{\theta}\right)-\bm{u}^{(i)}\right\|^{2}.

(2)

We remark that this approach of learning the feedback control is similar to imitation learning [17] or behavior cloning [18], where the objective is to learn from a dataset with labels from expert policies. Our method differs from conventional imitation learning by how to obtain expert demonstrations. Instead of human demonstrations, labels are derived from solving corresponding open-loop optimal control problems, enabling the construction of an expansive dataset with optimal quality that can be scaled up. Also, this approach is not limited to learning control policies but is also applicable to value functions [3].

2.3 Online Direct Policy Optimization

The direct method [19, 20, 21, 22] is a classical family of methods in the optimal control literature that mainly refers to methods that transform the open-loop OCP into a nonlinear optimization problem. The neural network-based direct policy optimization investigated in this paper shares a similar optimization objective with those direct methods, but distinguishes from them in terms of solving closed-loop optimal controls. That is why we highlight the term “policy optimization” in its name. The method applies to both stochastic and deterministic control problems, and in this work we focus on problems with deterministic dynamics and cost.

Objective

To formulate the direct policy optimization method, we assume $\bm{x}_{0}$ is sampled from a distribution ${\mu}$ that covers the initial state space of interest, such as a uniform distribution over a compact set. We use $J(\bm{u};\bm{x_{0}})$ to denote the cost defined in (1), which allows the control to be feedback and highlights the effect of the initial state. The objective function in the direct policy optimization is defined as an expectation of the cost functional over the distribution of initial states:

\underset{\bm{\theta}}{\operatorname{minimize}}\quad\underset{\bm{x}_{0}\sim{\mu}}{\mathbb{E}}J(\bm{u}^{\mathrm{NN}}(\cdot;\bm{\theta});\bm{x}_{0}).

(3)

We classify direct policy optimization into two settings based on whether the system is explicitly known: the fully-known dynamics setting and the unknown dynamics setting.

With fully-known dynamics, we have explicit forms of $\eqref{eq1-objective}$ (the dynamics $f$ , cost functions $M$ and $L$ ) and build a large computational graph along trajectories corresponding to the objective. Then we can optimize the objective based on randomly sampled initial states to approximate the expectation. Employing fixed initial points is also a viable approach, as in [10, 23]; however, in complex scenarios, randomly sampled initial points better cover the initial distribution, leading to improved stability and performance. The gradient with respect to the network parameters $\bm{\theta}$ can be computed using two methods: back-propagation [24] or the adjoint method [6, 25]. Details of these methods and a comparison between them are provided in A.1. We report results optimized using back-propagation in the main content, as the adjoint method is much more time-consuming with indistinguishable objective improvements.

In contrast, in the setting with unknown dynamics, we can only observe trajectories generated by the underlying dynamics and cost signals on the encountered states. This scenario is commonly referred to as a reinforcement learning (RL) problem, where the goal is to learn an optimal policy from these observations without prior knowledge of the underlying dynamics.²²2In this article, unless explicitly emphasized as offline RL, RL refers to online RL. Consequently, the agent needs to collect a large amount of samples by interacting with the environment to learn the best actions and maximize its long-term cumulative rewards. This characteristic of learning from samples is a fundamental facet of RL. In A.2, we employ proximal policy optimization (PPO, 26), a popular policy-based on-policy method, to assess the effectiveness of RL algorithms in solving optimal control problems. In line with our expectations, PPO’s performance is substantially lower than direct policy optimization with fully-known dynamics due to its sample inefficiency. Additional details regarding the training process can be found in B.1.

3 Comparisons and A Unified Framework

3.1 Comparative Analysis

We summarize the characteristics of supervised learning and direct policy optimization approaches in Table 1. In the training process, the supervised learning approach depends on an offline-generated dataset, while the batches in direct policy optimization are randomly sampled online. Two approaches are developed upon two different solution philosophies in optimal control: supervised learning is related to the open-loop optimal control, while direct policy optimization comes from the direct method for closed-loop optimal control.

Table 1: Comparison on Supervised Learning (SL) and Direct Policy Optimization (DO)

Methods	Data	Training	Methodological Relevance	Challenge
SL	✓	Offline	Available open-loop method	Dataset quality
DO	✘	Online	Closed-loop direct method	Network optimization

In supervised learning, the information or the supervised signal comes from the open-loop optimal control solution, which can be obtained by any available numerical methods. Besides the aforementioned BVP or DDP solvers, we remark that the direct method can also be applied to open-loop problems, which is different from the direct policy optimization that solves the closed-loop OCP directly. It is often observed in practice that the most challenging part of supervised learning is to build an appropriate dataset using open-loop solvers rather than regression. It is why many efforts have been made, such as time-marching [3] or space-marching [27] techniques, to improve the performance of open-loop solvers. Furthermore, when the problem gets harder, the average solution time of each path gets longer. Given the same computation budget, one wishes to obtain a dataset of higher quality through adaptive sampling. One such improvement is related to data distribution, i.e., the discrepancy between the distribution of training data and the distribution of states indeed encountered by a controller over the feedback process increases over time, which is called the distribution mismatch phenomenon [28, 29]. The IVP-enhanced sampling proposed by [29] helps alleviate distribution mismatch and improves performance significantly. However, as shown in our numerical results, even with adaptive sampling and sufficiently small validation error, there might still be a gap of realized total costs between the supervised learning outcome and optimal control in challenging problems.

Different from supervised learning, direct policy optimization needs no generated dataset yet has no prior information on optimal solutions. The whole problem is turned into a large-scale nonlinear optimization problem, which is natural to compute yet without further mathematical principles to follow. All difficulties are burdened by the optimization side, which makes it hard to train and leads to much longer training time. Consequently, direct policy optimization is much more time-consuming than the supervised learning approach. Relatedly, it is often observed that, given similar training time, supervised learning surpasses direct optimization in closed-loop simulation. What is worse, in challenging cases, the direct policy optimization may fail to get a reasonable solution under long-time training, since the randomly initiated policy network has a large deviation from the optimal solution and it is hard in such a complex optimization problem to find an appropriate direction and stepsize to converge to the optimal solution.

3.2 A Unified Training Paradigm

Summarizing the limitations discussed above, we find that supervised learning suffers from the dataset to further improve, while direct optimization needs proper initialization. Challenges in the two methods are different and somehow orthogonal, which offers us an opportunity to draw on their merits and complement each other. We combine the two methods into a unified training paradigm for neural network-based optimal control problems. The new paradigm, called Pre-train and Fine-tune, can be briefly sketched to pre-train by offline supervised learning first and then fine-tune by online direct policy optimization. This training paradigm is outlined in Figure 1, in which the two separated training approaches are combined as sequential training stages. In the first stage, we pre-train a controller via supervised learning. Various methods, such as adaptive sampling, can be included to achieve better performance. However, in some challenging cases, even with those techniques, there is still a gap between the learned controller and the optimal control. Thereby, we apply direct policy optimization based on the pertained network for fine-tuning. As discussed, a bottleneck of direct policy optimization is the extremely huge initial loss, which prevents the method to find a proper way to optimize. Fortunately, the pre-training through supervised learning can provide a reasonable solution that is close enough to the optimal solution. We can validate the optimality of different approaches in closed-loop simulation. Intuitively, the performances sorted from the worst to the best should be: Direct Policy Optimization $\leq$ Supervised Learning $\leq$ Pre-train and Fine-Tune $\leq$ Optimal Control, which will be verified by experiments in Section 4.

Refer to caption — Figure 1: A unified training paradigm for neural network-based closed-loop optimal control consists of two stages. In Stage I, we first solve corresponding open-loop OCP to generate a dataset, on which we train the controller through supervised learning. In Stage II, we fine-tune the controller pre-trained in Stage I through online direct policy optimization.

The philosophy of Pre-train and Fine-tune is relatively broad, encompassing various approaches, such as pre-training on a large dataset for a general task and fine-tuning on domain-specific data for more specialized tasks. In our context, it is important to emphasize that the objectives in the two stages are distinct yet correspond to the same optimal control problems from the same distribution of initial states. This approach bears some resemblance to certain combinations of offline and online RL algorithms [30, 31, 32]. Offline RL involves learning from a fixed dataset, whereas online RL optimizes through dynamic interactions. A notable distinction in our method is its ability to generate a significant amount of optimal data when learning from a dataset of expert demonstrations. This is in contrast to many offline RL scenarios, where acquiring expert policy labels, such as human demonstrations, poses a challenge. Consequently, these scenarios often rely on static datasets, characterized by limited data and policies that may deviate from optimality. We believe that our Pre-train and Fine-tune strategy for optimal control problems offers a more systematic, controllable, and principled testbed for studying related algorithms, providing valuable insights into RL problems.

4 Experiments

4.1 Experimental Settings

We consider two optimal control problems: the optimal attitude control of a satellite and the optimal landing of a quadrotor. Both problems aim at controlling an aero-spatial vehicle from a general starting state to a target state with minimal cost, which has important engineering applications. Both problems have served as benchmarks in previous studies [3, 11, 29, 33, 34, 35, 36]. The optimal landing problem for a quadrotor is more challenging than that of the satellite due to the higher dimension and non-linear dynamics (see the detailed formulations of the problems in C). Besides, in A.4, we evaluate the total cost of an uncontrolled system and the Linear Quadratic Regulator (LQR) controller as baselines to further underscore that these simpler controllers may have a big performance gap compared to the optimal controller.

We focus on the problems where the initial states fall in a compact set $\mathcal{S}$ (defined in each example below respectively). In direct policy optimization, the initial state distribution $\mu$ in (3) is taken as the uniform distribution on $\mathcal{S}$ . The dataset in supervised learning is also generated by uniformly sampling the initial states from $\mathcal{S}$ unless otherwise specified as the adaptive dataset in Section 4.3. For comprehensive comparisons, we conduct closed-loop simulations starting from different initial states. In robustness experiments, we corrupt the input states with uniform noises, i.e., the input for controllers becomes $\bm{x}(t)+\bm{n}(t)$ , where each dimension of $\bm{n}(t)$ is independently and uniformly sampled from $[-\sigma,\sigma]$ with $\sigma$ being the disturbance scale. As for evaluation metrics, we compute the pathwise ratio between the cost under the network controller and the optimal cost starting from the same initial state, called the cost ratio. We remark that neither the BVP solution nor direct policy optimization can ensure global optimality. We use the cost ratio as a reasonable metric to establish a standard for comparing against a potentially optimal cost obtained by the open-loop solver. We plot the cumulative distribution curve and summarize different statistics of the cost ratio (with respect to the initial state distribution $\mu$ ) to evaluate different controllers. All training details and time consumption are provided in B.1.

4.2 The Optimal Altitude Control Problem of Satellite

Settings

In this section, we conduct experiments on a satellite altitude control problem with a six-dimensional state [3, 33, 11]. The state can be formulated as $\bm{x}=(\bm{v}^{\mathrm{T}},\bm{\omega}^{\mathrm{T}})^{\mathrm{T}}=(\phi,\theta,\psi,\omega_{1},\omega_{2},\omega_{3})^{\mathrm{T}}$ , where $\bm{v}$ represents the attitude of the satellite and $\bm{\omega}$ represents the angular velocity. The full dynamics can be found in C.1. The problem is to apply a torque $\bm{u}\in\mathbb{R}^{3}$ to stabilize the satellite to a final state of $\bm{v=0}$ and $\bm{\omega=0}$ at a fixed terminal time $T=20$ . The set of interest for the initial state is $\mathcal{S}_{\text{sate}}=\{\bm{v},\bm{\omega}\in\mathbb{R}^{3}\mid-\frac{\pi}{3}\leq\phi,\theta,\psi\leq\frac{\pi}{3},-\frac{\pi}{4}\leq\omega_{1},\omega_{2},\omega_{3}\leq\frac{\pi}{4}\}.$

With the time-marching technique, it takes about 0.5 seconds on average to solve an open-loop solution for one trajectory and less than 1 minute in total (without parallel computing) to generate the whole optimal control dataset. Then we train the network under supervised learning for 100 epochs in 1 minute. In direct optimization, it costs more than 1 hour to train 2000 iterations from scratch, which is much longer than supervised learning.

Table 2: Statistics of cost ratio and total computation time (min) in satellite’s optimal attitude control problem

Method	Mean	Std	Max	Min	Median	Time
Direct Optimization	1.048	0.034	1.180	1.012	1.037	87
Supervised Learning	1.003	0.001	1.006	1.002	1.003	2

Is supervised learning superior in performance? We first evaluate the learned controllers in deterministic environments and summarize statistics of the cost ratio and total time in Table 2. As shown in Table 2, the performance of supervised learning is better than online direct policy optimization. The mean of cost ratio of supervised learning is extremely close to one, implying that the learned controller is close enough to the optimal control. Under such circumstances, there is no need for further fine-tuning. However, in the next challenging example, when purely supervised learning can not achieve competitive performance to optimal control, fine-tuning matters. We further conduct experiments under different scales of disturbances $\sigma=0.01,0.025,0.05$ and plot the cumulative distribution functions of the cost ratio in Figure 2. The supervised controller can still stabilize the system and surpass direct optimization under small disturbances, demonstrating its robustness. We remark that since supervised learning is trained to fit a deterministic dataset rather than designed for stochastic systems, there may be states far away from those in the used optimal control dataset when the noise is large. Thus, supervised learning only maintains its priority to direct optimization under moderate disturbances.

What are the challenges in direct policy optimization? Exploring the optimization landscapes

In Figure 3, we investigate the challenges associated with direct policy optimization by examining the optimization landscapes using three metrics inspired by [37]. Let $l(\theta)$ denote the loss function (employed in supervised learning or direct policy optimization) and $lr$ represent the corresponding learning rate. To understand the landscape at various optimization stages given the current parameter $\hat{\theta}$ , we assess local sensitivity by computing $l(\theta^{\prime})$ and $\nabla_{\theta}l(\theta^{\prime})$ , where multiple $\theta^{\prime}$ are given by $\theta^{\prime}=\hat{\theta}-{step\_size}\times\nabla_{\theta}l(\hat{\theta})$ with multiple ${step\_size}$ , mirroring updates in stochastic gradient descent (SGD). Note that such local evaluations do not impact the training procedure itself and it is performed independently to gain insights into the behavior and characteristics of the optimization process. During local evaluations, the actual ${step\_size}$ is determined by scaling the learning rate used in training, denoted as $step\_size=step\_ratio\times lr$ . This allows us to analyze the sensitivity to different learning rates along the gradient direction. In our implementation, we adopt multiple values of the $step\_ratio$ parameter within a predefined range and observe the corresponding changes in loss, which collectively form the loss landscape. Additionally, we compute the $l_{2}$ changes in the gradient to measure approximately the “gradient predictiveness”. We also compute the effective “ $\beta$ -smoothness”, as defined in [37], which quantifies the maximum ratio between difference (in $l_{2}$ -norm) in the gradient and the distance moved in a specific gradient direction as the $step\_ratio$ varies within the range. This measure provides valuable insights into the Lipschitz continuity of the gradient. To enable a consistent comparison between methods with different losses, we have normalized each one by scaling it according to the loss value in the final iteration, which brings metrics to a unified scale.

For both supervised learning and direct policy optimization (randomly initialized or pre-trained by supervised learning), Figure 3 displays scaled variations in losses (left), scaled variations of $l_{2}$ changes in the gradient (middle), and the scaled effective $\beta$ -smoothness, the maximum ratio between gradient difference (in $l_{2}$ -norm) and parameter difference (right), as moving in the gradient direction with different $step\_ratio$ .³³3We remark that the valid domain of the $step\_ratio$ parameter used in direct policy optimization is considerably smaller compared to supervised learning since the former has a much rougher optimization landscape, as shown in Figure 3. Further details regarding this observation can be found in B.1. Our comparisons shed light on the challenges associated with direct policy optimization, particularly its sensitivity to the actual $step\_size$ . The much higher shaded regions in Figure 3 (left and middle) for the direct policy optimization (randomly initialized) show its larger sensitivity to the $step\_size$ compared to supervised learning. Furthermore, the effective $\beta$ -smoothness, which measures the Lipschitzness of the gradient, demonstrates that the optimization landscape in supervised learning is notably smoother. Significantly, the metrics for direct policy optimization which is pre-trained by supervised learning, exhibit the smoothest outcomes in the comparison, showing the critical role of proper initialization. These findings emphasize the challenges faced in direct policy optimization and highlight the advantages of supervised learning, which benefits from a smoother and more benign optimization landscape.

4.3 The Optimal Landing Problem of Quadrotor

Settings

We consider a more complex twelve-dimensional problem aiming to control a quadrotor to land at a target position in a fixed terminal time $T$ [29]. The state can be formulated as $\bm{x}=\left(\bm{p}^{\mathrm{T}},\bm{v}_{b}^{\mathrm{T}},\bm{\eta}^{\mathrm{T}},\bm{w}_{b}^{\mathrm{T}}\right)^{\mathrm{T}}$ , where $\bm{p}=(x,y,z)^{\mathrm{T}}\in\mathbb{R}^{3}$ denotes the position in Earth-fixed coordinates, $\bm{v}_{b}=(v_{x},v_{y},v_{z})^{\mathrm{T}}\in\mathbb{R}^{3}$ denotes the velocity with respect to the body frame, $\bm{\eta}=(\phi,\theta,\psi)^{\mathrm{T}}\in\mathbb{R}^{3}$ denotes the attitude in Earth-fixed coordinates, and $\bm{w}_{b}\in\mathbb{R}^{3}$ denotes the angular velocity in the body frame. The full dynamics can be found in C.2. We train a feedback controller $\bm{u}^{\text{NN}}(t,\bm{x}):[0,T]\times\mathbb{R}^{12}\rightarrow\mathbb{R}^{4}$ to control the rotor thrusts to land the quadrotor. The detailed setting is similar to that in [29]. Specifically, the set of interest for the initial state is $\mathcal{S}_{\text{quad}}=\{x,y\in[-40,40],z\in[20,40],~{}~{}v_{x},v_{y},v_{z}\in[-1,1],\theta,\phi\in[-\pi/4,\pi/4],\psi\in[-\pi,\pi],\bm{w}=\bm{0}\}$ . We also consider problems whose initial states lie in a smaller domain $\tilde{\mathcal{S}}_{\text{quad}}\subset\mathcal{S}_{\text{quad}}$ to study the effect of varying time horizons, where $\tilde{\mathcal{S}}_{\text{quad}}=\{x,y\in[-8,8],z\in[4,8],v_{x},v_{y},v_{z}\in[-0.2,0.2],\theta,\phi\in[-\pi/20,\pi/20],\psi\in[-\pi/5,\pi/5],\bm{w}=\bm{0}\}$ .

What are the challenges in direct policy optimization? Comparisons on different horizons

As discussed in Section 3, the challenge in direct policy optimization is the optimization itself. To better understand this challenge, we first consider the initial state $\bm{x}_{0}$ uniformly sampled from $\tilde{\mathcal{S}}_{\text{quad}}$ and the time horizon $T$ equals 4, 8, or 16. The cumulative distribution functions of cost ratio are plotted in Figure 4 and the corresponding statistics are reported in Table 3. From the results of the direct optimization, we can clearly observe that, as the total time $T$ gets longer, the method deteriorates dramatically, and the gap between direct optimization and supervised learning grows significantly. The reason is when the horizon is longer, the forward trajectory controlled by a randomly initialized network will severely deviate from the optimal path more and more, resulting in a larger initial loss which grows quickly as $T$ increases. Under such circumstances, the optimization is much harder since finding a proper direction for the neural network to improve is difficult. Compared to training from scratch, the significant improvements brought by a pre-trained network emphasize the main challenge in direct optimization is the optimization itself.

Table 3: Statistics of cost ratio and total computation time (min) in quadrotor’s optimal landing problem on

\bm{x}_{0}\in\tilde{\mathcal{S}}_{\text{quad}}

with different time horizons

T=4,8,16

Horizon	Method	Mean	Std	Max	Min	Median	Time
$\bm{T}$ =4	DO	1.05	0.02	1.11	1.01	1.04	106
	SL	1.01	0.00	1.03	1.00	1.00	16
	Fine-tune	1.00	0.00	1.02	1.00	1.00	16 + 4
$\bm{T}$ =8	DO	1.63	0.43	3.70	1.17	1.50	157
	SL	1.15	0.11	1.72	1.01	1.11	16
	Fine-tune	1.03	0.03	1.23	1.00	1.02	16 + 6
$\bm{T}$ =16	DO	157.71	75.65	420.18	64.64	143.10	387
	SL	9.59	11.00	51.18	1.11	4.52	16
	Fine-tune	2.31	1.22	6.93	1.11	1.83	16 + 7

What are the challenges in supervised learning? Table 4 reports results in a more difficult setting where the initial state $\bm{x}_{0}$ is uniformly sampled from a larger set $\mathcal{S}_{\text{quad}}$ and the total time horizon $T=16$ . Due to the problem’s difficulty, we consider two types of datasets of the same size in supervised learning, one from uniform sampling and one from IVP-enhanced adaptive sampling [29], and we apply fine-tuning in both cases. Supervised learning surpasses the direct method still, yet it has poor performance in the worst cases, for example, the max cost ratio is very large. This is due to the distribution mismatch phenomenon [29]: when the controller encounters states far from those in the optimal control dataset, the control output is unreliable.⁴⁴4The exceptionally large cost ratio in the worst cases, due to the distribution mismatch phenomenon, also causes the mean of the cost ratio to fluctuate significantly across multiple runs of experiments when learning from a fixed dataset. In Table 4, we report the best mean among multiple runs. Meanwhile, the median of the cost ratios across different runs fluctuates much less and consistently reflects the superiority of supervised learning over the direct method. To alleviate this phenomenon, we employ IVP-enhanced sampling to obtain an adaptive dataset for supervised learning. We remark that the validation loss of supervised learning has similar behavior and converges to a small scale of 1e-4 on both the uniform dataset and the adaptive dataset, but the closed-loop simulation performances are much different as shown in Table 4 and Figure 5. This fact supports our analysis in Section 3.1 on supervised learning whose challenge is the dataset instead of the optimization process. Furthermore, comparisons between supervised learning and direct policy optimization in Figures 4, 5 and Tables 3, 4 provide similar evidence as in the example of the satellite: supervised learning achieves better results than direct policy optimization with less time, which also implies that the learning procedure in supervised learning is much easier than direct policy optimization.

Table 4: Statistics of cost ratio and total computation time (min) in quadrotor’s optimal landing problem on

\bm{x}_{0}\in\mathcal{S}_{\text{quad}}

and

T=16

. The pre-training is conducted with a uniformly sampled dataset and an adaptively sampled dataset, respectively.

Method	Fine-tune	Mean	Std	Max	Min	Median	Time
Direct	—–	9.72	4.31	29.67	3.96	8.69	426
Pre-train (Uniform)	0 (SL)	6.45	13.18	94.96	1.11	2.97	68
	100	3.07	3.11	23.22	1.06	2.09	68 + 9
	1000	1.55	0.69	6.89	1.05	1.34	68 + 92
Pre-train (Adaptive)	0 (SL)	2.05	1.61	11.84	1.03	1.48	225
	100	1.26	0.34	3.10	1.03	1.13	225 + 9
	1000	1.06	0.04	1.28	1.01	1.05	225 + 92

Can Pre-train and Fine-tune improve performance and robustness? In all settings reported in Table 3 and Table 4, fine-tuning improves the performance of supervised learning as desired, regardless of whether the dataset in supervised sampling is uniformly or adaptively sampled. Figure 5 plots the cumulative distribution functions of cost ratio when $\bm{x}_{0}\in\mathcal{S}_{\text{quad}}$ and the measurement noise $\sigma=0$ or $0.25$ . More experiments under different measurement noises are provided in A.3, demonstrating similar improvements brought by fine-tuning on robustness. We highlight the fine-tuning time (see Table 3, 4 and B.2), which is only about a few minutes with significant improvements in both performance and robustness. Direct policy optimization costs several hours and achieves sub-optimal solutions when training from scratch, whilst fine-tuning can improve performance significantly within a few minutes based on a proper initialization. We stop the pre-training process upon observing little progress in the validation loss, followed by initiating the fine-tuning phase. As shown in Table 4, it first takes about 1 hour to pre-train and then 10 minutes to improve the performance significantly through fine-tuning 100 iterations, while the direct optimization requires about 7 hours to train from scratch. If we further fine-tune the controller to 1000 iterations, it will cost more than 1 hour yet improve less than before. Therefore, the results of the Pre-train and Fine-tune strategy we report are fine-tuned by 100 iterations, unless otherwise stated in Table 4.

We proceed with additional experiments to highlight the distinction between further training via supervised learning and fine-tuning via direct optimization. Using the same pre-trained model from the second line Table 4, we conduct additional SL training for 1000 epochs and fine-tuning for 1000 iterations, respectively. Figure 6 visually depicts the different loss behaviors observed during further training with different loss functions. Continuing with SL yields only marginal improvements in the direct loss (average of the total costs for the samples). Conversely, fine-tuning through direct optimization results in rapid improvements in the direct loss, while even having a litter adverse impact on the SL loss. This comparison shows that there can be two models which perform similarly on the SL loss but significantly different on the optimal control objective. In other words, for models that are reasonably close to the optimal control, the SL loss may not be an effective indicator of how the model solves the optimal control problem. Therefore, we emphasize the significance of fine-tuning when the improvements achieved through SL training are marginal and fail to substantially enhance the overall performance.

5 Conclusion

In this work, we conduct a comprehensive comparative study of two approaches for training a neural network-based closed-loop optimal controller. Primarily, we establish a benchmark for comparing offline supervised learning and online direct policy optimization and analyze the merits and drawbacks of the two methods in detail. The experimental results highlight the priority of offline supervised learning on both performance and training time. We point out that the main challenges to the two methods are dataset and optimization, respectively. Based on the detailed analysis, we naturally propose the Pre-train and Fine-tune strategy as a unified training paradigm for closed-loop control, which significantly enhances performance and robustness.

Appendix A Additional Results

A.1 Direct Policy Optimization with Fully-Known Dynamics

As mentioned in Section 2.3, the objective (3) of direct policy optimization with fully-known dynamics can be optimized through stochastic gradient descent. There are two ways to compute the required (stochastic) gradient: back-propagation and the adjoint method.

In back-propagation [24], we first discretize the continuous trajectory with a given initial state $\bm{x}_{0}$ by an ODE integration scheme and get a discrete trajectory to approximate the total cost $J(\bm{u};\bm{x}_{0})$ . Based on discrete trajectories, we apply back-propagation through the operations of the trajectories to compute the gradient with respect to the parameters $\bm{\theta}$ . Adaptive ODE solvers, which automatically find suitable stepsize under certain error tolerance, can often achieve a better approximation efficiency of the continuous trajectory than a fixed-stepsize integrator. However, directly leveraging adaptive ODE solvers leads to an implicit and unknown depth of the computation graph, resulting in a deterioration of performance and computation time in back-propagation. To overcome this limitation, [38] suggest deleting the computation graph of adaptive solvers and storing the final computed stepsize only, which reduces the memory cost and benefits in the final performance.

The adjoint method [25] takes another avenue to calculate the gradient of the total loss $J$ with respect to parameters $\bm{\theta}$ in a continuous viewpoint. [6] defines the adjoint $\bm{\alpha}(t)=\partial J/\partial\bm{x}(t)$ with an augmented ODE $\frac{d\bm{\alpha}(t)}{dt}=-\bm{\alpha}(t)^{\top}\frac{\partial\bm{f}(\bm{x},\bm{u})}{\partial\bm{x}}-\frac{\partial L(\bm{x},\bm{u})}{\partial\bm{x}}$ , where the notations are the same as OCP (1). Then the gradient can be calculated by $\frac{\partial J}{\partial\bm{\theta}}=\int_{0}^{T}\bm{\alpha}(t)^{\top}\frac{\partial\bm{f}}{\partial\bm{u}}\frac{\partial\bm{u}}{\partial\bm{\theta}}+\frac{\partial L}{\partial\bm{u}}\frac{\partial\bm{u}}{\partial\bm{\theta}}dt$ , which requires a forward ODE to solve the state $\bm{x}(t),t\in[0,T]$ and a backward ODE with $\bm{\alpha}(T)=\partial M/\partial\bm{x}(T)$ to solve the adjoint $\bm{\alpha}(t)$ . The numerical error and memory usage of the adjoint method can be less than the back-propagation; however, the computing time is longer due to the additional calls on ODE solvers. The idea of the adjoint method has been followed by Neural ODE [39] to view the discrete neural network model as a continuous flow. Extending the implementation of [39] back to control problems, efforts are made in neural network-based control under deterministic dynamics [4, 6]. Unfortunately, naively leveraging the adjoint method may cause catastrophic divergence, even in the simple linear quadratic regulator (LQR) case [6], which can be alleviated by adding checkpoints to ODE trajectories [6, 38, 40]. The aforementioned idea of adaptive solvers can also be applied and the stored states can serve as the checkpoints [38].

We compare direct policy optimization with back-propagation or the adjoint method in the optimal landing problem on $\tilde{\mathcal{S}}_{\text{quad}}$ with $T=4,8,16$ in Table 5. The adjoint method costs time about twice as back-propagation. We round up to 4 decimal places to show the slight differences in $T=4,8$ . When the time horizon is $T=16$ , the adjoint method deteriorates even with checkpoints added. Since memory is not the bottleneck, we use back-propagation with competitive performances and faster training speed to train the direct optimization in the main content.

Table 5: Cost ratio and training time (min) in quadrotor’s optimal landing problem on

\bm{x}_{0}\in\tilde{\mathcal{S}}_{\text{quad}}

with varying time horizons

T=4,8,16

and comparisons on back-propagation (BP) and the adjoint method.

Problem	Method	Time	Mean	Std	Max	Min
$\bm{T}$ =4	BP	106	1.0468	0.0235	1.1120	1.0125
$\bm{T}$ =4	Adjoint	207	1.0449	0.0232	1.1089	1.0123
$\bm{T}$ =8	BP	157	1.6295	0.4349	3.6979	1.1712
$\bm{T}$ =8	Adjoint	298	1.6218	0.4416	3.6976	1.1609
$\bm{T}$ =16	BP	387	157.7106	75.6489	420.1823	64.6441
$\bm{T}$ =16	Adjoint	706	174.5295	116.2190	787.1036	61.1059

A.2 Direct Policy Optimization with Reinforcement Learning

In this subsection, we compare the performance of direct policy optimization with and without fully-known dynamics. We focus on the optimal attitude control problem of a satellite. With fully-known dynamics, we optimize by back-propagating through the computational graph of the entire trajectory. With unknown dynamics, we utilize the on-policy, model-free method, Proximal Policy Optimization (PPO) [26], as a representative method in reinforcement learning, which has been empirically demonstrated to achieve state-of-the-art results in a wide range of tasks, including robotics [41] and large language models [42].

Table 6: Statistics of cost ratio in satellite’s optimal attitude control problem

Method	Mean	Std	Max	Min	Median
BP With Fully-Known Dynamics	1.048	0.034	1.180	1.012	1.037
RL (1x Samples)	12.60	8.959	78.52	4.106	10.31
RL (5x Samples)	1.628	0.392	4.832	1.254	1.548

To ensure a fair comparison, we conduct experiments with RL using either the same number of trajectories as direct policy optimization with fully-known dynamics or five times as many, denoted as 1x and 5x samples, respectively. As presented in Table 6, direct policy optimization with fully-known dynamics has a cost ratio of $1.048\pm 0.034$ , which is comparable to optimal control. In contrast, RL (5x Samples) has a worse cost ratio of $1.423\pm 0.434$ , and RL (1x Samples) is significantly worse of $12.58\pm 8.959$ . These results illustrate that RL algorithms are significantly less sample-efficient.

We demonstrate the trajectories controlled by networks trained using the above methods in Figure 7, starting from the same initial point. While the RL solutions seem plausible, they are less optimal than the ones learned through dynamics. In this case, the optimal cost is 4.03, and cost of direct optimization with fully-known dynamics is 4.06, which are very close. However, the cost of RL (5x) is 5.75 and the cost of RL (1x) is much worse, 27.20. The advantages of using fully-known dynamics are straightforward, as we possess more information about the trajectories. Further details of our experiments are provided in B.1.

A.3 Robustness Evaluation

In this section, we present robustness experiments in quadrotor’s optimal landing problem, when there is state measurement noise in the close-loop simulation. We only fine-tune 100 iterations within 10 minutes based on the network pre-trained by supervised learning.

Figure 8 and 9 demonstrate that supervised learning outperforms direct policy optimization under moderate disturbances. Additionally, we highlight the superior robustness of the Pre-train and Fine-tune strategy compared to both direct policy optimization and supervised learning. When the noise is small, supervised learning is still able to control the quadrotor and maintain its priority over direct optimization. However, since supervised learning is not designed for stochastic optimal control, it does not suit the case when the noise is very large. Also, the long time horizon enlarges the influence brought by disturbance, accumulates the error along the trajectory, and results in the distribution mismatch phenomenon and poor performances. Fine-tuning brings robustness to the pre-trained network, which can endure much larger disturbance and performs the best even when the performance of supervised learning decreases dramatically as the bottom row in Figure 8 shows. Figure 9 also shows that fine-tuning is able to improve the robustness regardless of the dataset distribution where it is pre-trained on.

A.4 Uncontrolled System and LQR Performances

In this subsection, we aim to quantify the difficulty of different problems by evaluating the performance of various controllers. We begin by computing the total cost of uncontrolled systems on the test dataset. To provide a benchmark for comparison, we also consider the Linear Quadratic Regulator (LQR) as a baseline. The LQR controller is designed to optimize the system’s performance by linearizing the full dynamics around the terminal state [43]. In the context of satellite altitude control, the terminal point is defined as $\bm{x_{f}}=0$ and $\bm{u_{f}}=0$ , while in the quadrotor’s landing problem, it is $\bm{x_{f}}=0$ and $\bm{u_{f}}=mg$ . We apply Drake [44] to solve the related finite horizon LQR problems. Following the same problem setup as discussed in the main content, the feedback control $\bm{u}(t,\bm{x})=-K(t)\bm{x}$ is utilized, where $K(t)$ represents a time-varying control matrix, which comes from solving the related continuous-time Riccati Equation. The results of the zero controller, LQR controller, and other controllers trained by different methods (as presented in Table 7) are collectively analyzed for a comprehensive comparison. The performance of both zero control and LQR is inferior, indicating the increased difficulty of the respective problems. Notably, the satellite control problem emerges as the least challenging among the scenarios explored in this study, while the quadrotor problem, especially with a longer time horizon, presents a greater level of difficulty.

Table 7: Mean of cost ratio of zero control, finite horizon LQR, direct policy optimization, offline supervised learning and fine-tuned controllers. The last three columns share the same result as Table 2 where

\bm{x}_{0}\in{\mathcal{S}}_{\text{sate}}

and Table 3 where

\bm{x}_{0}\in\tilde{\mathcal{S}}_{\text{quad}}

System	Zero Control	LQR	DO	SL	Fine-tune
Satellite, $T=20$	118.48	1.03	1.05	1.00	1.00
Quadrotor, $T=4$	2246.29	84.53	1.05	1.01	1.00
Quadrotor, $T=8$	229094.04	1602.78	1.63	1.15	1.03
Quadrotor, $T=16$	34735478.38	74110.31	151.71	9.59	2.31

Appendix B Experimental Details

All the experiments are run on a single NVIDIA 3090 GPU.

B.1 Training Details

Hyper-parameters are summarized in Table 8, and we briefly explain some details.

In our RL experiments, we use a discretization timestep of $\Delta t=0.005$ to build the environment, resulting in 4000 steps per episode. The reward at each timestep $t$ is the running cost integrated over a small time interval, $R_{t}=-\int_{t}^{t+\Delta t}{L}(\bm{x},\bm{u})dt$ , and the terminal reward additionally subtracts the terminal cost $M(\bm{x}(T))$ . In RL, the goal is to learn the policy $\bm{u}$ that maximizes the expected discounted return $G_{t}=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$ , where the discount factor $\gamma=1.0$ in our finite horizon problem. Note that the expected discounted return in RL to maximize is the total cost in OCP (1) to minimize. We use the PPO algorithm implemented in stable-baselines3 [45].⁵⁵5https://github.com/DLR-RM/stable-baselines3 This algorithm employs the generalized advantage estimation (GAE, 46) technique to reduce variance in the gradient estimate. We experiment with gae_lambda from {1.0, 0.99, 0.98, 0.97, 0.96, 0.95}, ultimately settling on gae_lambda=0.98 as it yields the best results. To ensure fair comparisons, we set the number of inner loop epochs in PPO to 1, so that the collected data is only used once in RL, just as in direct optimization with dynamics. It is worth noting that the number of trajectories used in direct optimization with dynamics is the product of its batch size and number of iterations (1024 $\times$ 2000), which is the same as the number of episodes used in RL (1x) (2048000), as shown in Table 8.

We use Dopri5 (Dormand–Prince Runge–Kutta method of order (4)5) in the satellite problem and RK23 in the quadrotor problem as the ODE solvers. The implementation of direct policy optimization in this paper includes the aforementioned advanced techniques of adaptive solvers and adding checkpoints to achieve desired performance. The adaptive solvers in training are implemented by [38].⁶⁶6https://github.com/juntang-zhuang/torch-ACA Since our main results are architecture-agnostic, we use a fully connected neural network as the closed-loop controller, which takes both time and state as the input, i.e., $\bm{u}^{\mathrm{NN}}(t,\bm{x})$ . We use time-marching in satellite and space-marching in quadrotor to successfully generate the BVP solution. We randomly generate a validation dataset of 100 initial states for each problem, and adopt BVP solvers to compute their optimal costs for comparison. As for IVP-enhanced sampling, we follow the same settings as [29] does, with temporal grid points on $0<10<14<16$ . The number of initial states in uniform sampling and adaptive sampling is the same. We remark that a learning rate decay strategy is incorporated to achieve good performance in SL, following common practices in the field. However, in the case of direct policy optimization, we are unable to employ similar learning rate decay due to its inefficient optimization process. Given the limited number of iterations currently used, reducing the learning rate in direct policy optimization would significantly slow down the training process and hurt performance. On the other hand, at the fine-tuning stage, we have the flexibility to utilize much smaller learning rates, allowing us to achieve optimal performance.

We present further detailed information on the analysis of optimization landscapes in Figure 3, following the settings outlined in [37]. Our focus is primarily on local evaluations conducted during the training process. We denote the loss function as $l(\theta)$ and the learning rate (used in the supervised learning or direct policy optimization) as $lr$ . During the local evaluation, we perform updates similar to stochastic gradient descent (SGD), $\theta^{\prime}=\hat{\theta}-{step\_size}\times\nabla_{\theta}l(\hat{\theta})$ , where $step\_size=step\_ratio\times lr$ and $\hat{\theta}$ denotes the current parameter in training. Specifically, the multiple values of $step\_ratio$ selected for supervised learning (SL) are within [1/100, 100], while for direct optimization, the values of $step\_ratio$ are within [1/100, 1/10]. Notably, in direct optimization which is randomly initialized, we are limited to choosing $step\_ratio$ values much smaller than 1.0. This restriction arises due to the differing update rules between Adam (the training optimizer) and SGD (the update rule in local evaluation). The need for this careful adjustment is driven by the high sensitivity of direct policy optimization to the choice of optimizer and learning rate, particularly in the initial stages when gradients are significantly larger. The $step\_ratio$ is not strictly limited to such small values when the weights are pre-trained and have a much smoother landscape. We remark that we use the same $step\_ratio$ in direct optimization for both randomly initialized and with pre-trained weights. By selecting appropriate $step\_ratio$ values within this constrained range, we ensure a valid evaluation process.

Table 8: Experimental Settings

Hyper-parameter	Setting
Network architecture	Three hidden layers with a hidden size of 64
Activation function	Tanh
ODE solver	Dopri5 with atol=1e-5 and rtol=1e-5 (Satellite)
	RK23 with atol=1e-5 and rtol=1e-5 (Quadrotor)
BVP technique	Time-marching [3] (Satellite)
	Space-marching [27] (Quadrotor)
Trajectories for evaluation	100
Trajectories for training	100 (Satellite)
	500 (Quadrotor in $\tilde{\mathcal{S}}_{quad}$ )
	1000 (Quadrotor in $\mathcal{S}_{quad}$ )
Epochs in SL	100 (Satellite)
	1000 (Quadrotor in $\tilde{\mathcal{S}}_{quad}$ )
	2000 (Quadrotor in $\mathcal{S}_{quad}$ )
Iterations in DO	2000 (Satellite)
	3000 (Quadrotor)
Iterations in Fine-tuning	100 (Quadrotor in $\tilde{\mathcal{S}}_{quad}$ )
	100 / 1000 (Quadrotor in $\mathcal{S}_{quad}$ )
Episodes in RL	2048000 / 10240000
Optimizer	Adam [47]
Learning rate in SL	0.01 (Satellite)
	0.001 (Quadrotor in $\tilde{\mathcal{S}}_{quad}$ )
	0.01 and decays every 500 epochs by 0.5 (uniform in $\mathcal{S}_{quad}$ )
	0.005 and decays every 500 epochs by 0.5 (adaptive in $\mathcal{S}_{quad}$ )
Learning rate in DO	0.01
Learning rate in Fine-tuning	0.0001
Learning rate in RL	0.0005
Batch size in SL	1024 (Satellite)
	4096 (Quadrotor)
Batch size in DO	1024 (Satellite)
	2048 (Quadrotor)
Batch size in Fine-tuning	2048
Batch size in RL	2048

B.2 Training Time

The column Data denotes the time for generating the whole dataset by BVP solvers according to the related marching methods. We apply parallel techniques to fasten the total procedure of data generation in quadrotor with 24 processors. The addition time in Quad_Large_Ada is related to the network training in adaptive sampling. The other columns represent the training time of supervised learning, the fine-tuning time based on the pre-trained network, and the training time by direct optimization, respectively. The rows Quad_Small_T4, Quad_Small_T8 and Quad_Small_T16 denote the experiments initiated in the small domain $\tilde{\mathcal{S}}_{\text{quad}}$ with varying time spans of $T=4,8,16$ . The last two rows are experiments in $\mathcal{S}_{\text{quad}}$ with $T=16$ , where Quad_Large_Uni denotes that the dataset for supervised learning is generated based on uniform initial states, whilst Quad_Large_Ada denotes that the dataset is generated by adaptive sampling.

Table 9 demonstrates that supervised learning costs less time than direct optimization and highlights the priority of the proposed Pre-train and Fine-tune strategy in terms of time consumption. Direct optimization is harder to train and costs more time, especially when the problem is more complex and the time horizon is longer. There is no need for fine-tuning in satellite, since the pre-trained network is already very close to the optimal control. The fine-tuning time reported is only for 100 iterations, which are all within 10 minutes. As shown in Table 3 and Table 4, with only 100 iterations of fine-tuning, the performance of the controller can be improved significantly. We also record the fine-tuning time of 1000 iterations, which is about 92 minutes for both the uniform dataset and the adaptive dataset. All experiments related to fine-tuning are fine-tuned for 100 iterations, except the ones mentioned in Table 4.

We utilize 256 parallel environments to collect data for reinforcement learning. The training process takes 85 minutes in RL (1x) and 434 minutes in RL (5x). The training times are comparable between RL (1x) and BP with fully-known dynamics when using the same number of trajectories, and are approximately five times longer for RL (5x) at a larger sample scale.

Table 9: Summary on time (min) of data generation and network training

Problem	Data	Supervised	Fine-tuning	Direct
Satellite	0.8	0.7	—	87
Quad_Small_T4	2.4	13.5	4.4	106
Quad_Small_T8	2.2	13.5	6.0	157
Quad_Small_T16	2.2	13.5	7.4	387
Quad_Large_Uni	12.8	55.5	9.1	426
Quad_Large_Ada	12.8 + 162	51.5	9.4	426

Appendix C Full Dynamics of Satellite and Quadrotor

C.1 Satellite

We introduce the full dynamics of the satellite problem [3, 33, 11] that are considered in Section 4.2.

The state is $\bm{x}=\left(\bm{v}^{\mathrm{T}},\bm{\omega}^{\mathrm{T}}\right)^{\mathrm{T}}$ , where $\bm{v}=(\phi,\theta,\psi)^{\mathrm{T}}$ denotes the attitude of the satellite in Euler angles and $\bm{\omega}=\left(\omega_{1},\omega_{2},\omega_{3}\right)^{\mathrm{T}}$ represents the angular velocity in the body frame. $\phi$ (roll), $\theta$ (pitch), and $\psi$ (yaw) are the rotation angles around the body frame.

The closed-loop control is $\bm{u}(t,\bm{x}):\left[0,T\right]\times\mathbb{R}^{6}\rightarrow\mathbb{R}^{m}$ . The output dimension $m$ represents the number of momentum wheels, which is set to 3 as the fully actuated case. Then we define the constant matrix $\bm{B}\in\mathbb{R}^{3\times m}$ , $\bm{J}\in\mathbb{R}^{3\times 3}$ to denote the combination of the inertia matrices of the momentum wheels, and $\bm{h}\in\mathbb{R}^{3}$ to denote the rigid body without wheels as

\bm{B}=\left[\begin{array}[]{ccc}1&1/20&1/10\\ 1/15&1&1/10\\ 1/10&1/15&1\end{array}\right],\quad\bm{J}=\left[\begin{array}[]{lll}2&0&0\\ 0&3&0\\ 0&0&4\end{array}\right],\quad\bm{h}=\left[\begin{array}[]{l}1\\ 1\\ 1\end{array}\right].

Finally, the full dynamics are

\left[\begin{array}[]{c}\dot{\bm{v}}\\ \bm{J}\dot{\bm{\omega}}\end{array}\right]=\left[\begin{array}[]{c}\bm{E}(\bm{v})\bm{\omega}\\ \bm{S}(\bm{\omega})\bm{R}(\bm{v})\bm{h}+\bm{Bu}\end{array}\right],

where the skew-symmetric matrix and rotation matrices $\bm{S}(\bm{\omega}),\bm{E}(\bm{v}),\bm{R}(\bm{v}):\mathbb{R}^{3}\rightarrow\mathbb{R}^{3\times 3}$ are defined as

\bm{S}(\bm{\omega})=\left[\begin{array}[]{ccc}0&\omega_{3}&-\omega_{2}\\ -\omega_{3}&0&\omega_{1}\\ \omega_{2}&-\omega_{1}&0\end{array}\right],\quad\bm{E}(\bm{v})=\left[\begin{array}[]{ccc}1&\sin\phi\tan\theta&\cos\phi\tan\theta\\ 0&\cos\phi&-\sin\phi\\ 0&\sin\phi/\cos\theta&\cos\phi/\cos\theta\end{array}\right],

\bm{R}(\bm{v})=\left[\begin{array}[]{ccc}\cos\theta\cos\psi&\cos\theta\sin\psi&-\sin\theta\\ \sin\phi\sin\theta\cos\psi-\cos\phi\sin\psi&\sin\phi\sin\theta\sin\psi+\cos\phi\cos\psi&\cos\theta\sin\phi\\ \cos\phi\sin\theta\cos\psi+\sin\phi\sin\psi&\cos\phi\sin\theta\sin\psi-\sin\phi\cos\psi&\cos\theta\cos\phi\end{array}\right].

The running cost and the final cost are defined as

L(\bm{x},\bm{u})=W_{1}\|\bm{v}\|^{2}+W_{2}\|\bm{\omega}\|^{2}+W_{3}\|\bm{u}\|^{2},

M(\bm{x}(T))={W_{4}}\left\|\bm{v}(T)\right\|^{2}+{W_{5}}\left\|\bm{\omega}(T)\right\|^{2},

where the weights are set as $W_{1}=\frac{1}{2},W_{2}=5,W_{3}=\frac{1}{4},W_{4}=\frac{1}{2},W_{5}=\frac{1}{2}$ .

C.2 Quadrotor

We describe the full dynamics of the optimal landing problem of a quadrotor [27, 29, 34, 35, 36] which are considered in Section 4.3.

The state is $\bm{x}=\left(\bm{p}^{\mathrm{T}},\bm{v}_{b}^{\mathrm{T}},\bm{\eta}^{\mathrm{T}},\bm{w}_{b}^{\mathrm{T}}\right)^{\mathrm{T}}\in\mathbb{R}^{12}$ where $\bm{p}=(x,y,z)^{\mathrm{T}}\in\mathbb{R}^{3}$ denotes the position in Earth-fixed coordinates, $\bm{v}_{b}=(v_{x},v_{y},v_{z})^{\mathrm{T}}\in\mathbb{R}^{3}$ denotes the velocity with respect to the body frame, $\bm{\eta}=(\phi,\theta,\psi)^{\mathrm{T}}\in\mathbb{R}^{3}$ denotes the attitude in Earth-fixed coordinates, and $\bm{w}_{b}\in\mathbb{R}^{3}$ denotes the angular velocity in the body frame.

The closed-loop control is $\bm{u}(t,\bm{x}):\left[0,T\right]\times\mathbb{R}^{12}\rightarrow\mathbb{R}^{4}$ . In practice, the individual rotor thrusts $\bm{F}=\left(F_{1},F_{2},F_{3},F_{4}\right)^{\mathrm{T}}$ is applied to steer the quadrotor, where $\bm{u}=E\bm{F}$ . We use $l$ as the distance from the rotor to the quadrotor’s center of gravity, and $c$ as a constant that relates the rotor angular momentum to the rotor thrust (normal force), then the matrix $E$ is defined as

E=\left[\begin{array}[]{cccc}1&1&1&1\\ 0&l&0&-l\\ -l&0&l&0\\ c&-c&c&-c\end{array}\right].

With $\bm{F}^{*}=E^{-1}\bm{u}^{*}$ , the optimal $\bm{F}^{*}$ can be directly computed, once we obtain the optimal control $\bm{u}^{*}$ .

Using the same parameters as [35], the mass and the inertia matrix are defined as $m=2kg$ and $\bm{J}=\operatorname{diag}\left(J_{x},J_{y},J_{z}\right)$ , where $J_{x}=J_{y}=\frac{1}{2}J_{z}=1.2416\mathrm{~{}kg}\cdot\mathrm{m}^{2}$ . Another constant vector is $\bm{g}=(0,0,g)^{\mathrm{T}}$ , where $g=9.81\mathrm{~{}m}/\mathrm{s}^{2}$ denotes the acceleration of gravity on Earth. The rotation matrix $\bm{R}(\bm{\eta})\in SO(3)$ denotes the transformation from the Earth-fixed coordinates to the body-fixed coordinates, and the attitude kinematic matrix $\bm{K}(\bm{\eta})$ relates the time derivate of the attitude with the associated angular rate as

\bm{R}(\bm{\eta})=\left[\begin{array}[]{ccc}\cos\theta\cos\psi&\cos\theta\sin\psi&-\sin\theta\\ \sin\theta\cos\psi\sin\phi-\sin\psi\cos\phi&\sin\theta\sin\psi\sin\phi+\cos\psi\cos\phi&\cos\theta\sin\phi\\ \sin\theta\cos\psi\cos\phi+\sin\psi\sin\phi&\sin\theta\sin\psi\cos\phi-\cos\psi\sin\phi&\cos\theta\cos\phi\end{array}\right],

\bm{K}(\bm{\eta})=\left[\begin{array}[]{ccc}1&\sin\phi\tan\theta&\cos\phi\tan\theta\\ 0&\cos\phi&-\sin\phi\\ 0&\sin\phi\sec\theta&\cos\phi\sec\theta\end{array}\right].

Finally, the dynamics are

\left\{\begin{array}[]{l}\dot{\bm{p}}=\bm{R}^{\mathrm{T}}(\bm{\eta})\bm{v}_{b}\\ \dot{\bm{v}}_{b}=-\bm{w}_{b}\times\bm{v}_{b}-\bm{R}(\bm{\eta})\bm{g}+\frac{1}{m}A\bm{u}\\ \dot{\bm{\eta}}=\bm{K}(\bm{\eta})\bm{w}_{b}\\ \dot{\bm{w}}_{b}=-\bm{J}^{-1}\bm{w}_{b}\times\bm{J}\bm{w}_{b}+\bm{J}^{-1}B\bm{u}\end{array}\right.,

where the constant matrices $A$ and $B$ are defined as

A=\left[\begin{array}[]{llll}0&0&0&0\\ 0&0&0&0\\ 1&0&0&0\end{array}\right],\quad B=\left[\begin{array}[]{llll}0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\end{array}\right].

The running cost and the final cost are defined as

L(\bm{x},\bm{u})=\left(\bm{u}-\bm{u}_{d}\right)^{\mathrm{T}}Q_{u}\left(\bm{u}-\bm{u}_{d}\right),

M(\bm{x})=\bm{p}^{\mathrm{T}}Q_{pf}\bm{p}+\bm{v}^{\mathrm{T}}Q_{vf}\bm{v}+\bm{\eta}^{\mathrm{T}}Q_{\eta f}\bm{\eta}+\bm{w}^{\mathrm{T}}Q_{wf}\bm{w}=\bm{x}^{\mathrm{T}}Q_{f}\bm{x},

where $Q_{u}=\operatorname{diag}(1,1,1,1)$ , $Q_{pf}=5I_{3}$ , $Q_{vf}=10I_{3}$ , $Q_{\eta f}=25I_{3}$ and $Q_{wf}=50I_{3}$ , where $I_{3}$ denotes the identity matrix in three-dimensional space.

References

Franklin et al. [2020] G. F. Franklin, J. D. Powell, A. Emami-Naeini, Feedback Control of Dynamic Systems, eighth ed., Pearson Education Limited, 2020.
Han and E [2016] J. Han, W. E, Deep learning approximation for stochastic control problems, arXiv preprint arXiv:1611.07422 (2016).
Nakamura-Zimmerer et al. [2021] T. Nakamura-Zimmerer, Q. Gong, W. Kang, Adaptive deep learning for high-dimensional Hamilton–Jacobi–Bellman equations, SIAM Journal on Scientific Computing 43 (2021) A1221–A1247.
Böttcher et al. [2022] L. Böttcher, N. Antulov-Fantulin, T. Asikis, AI Pontryagin or how artificial neural networks learn to control dynamical systems, Nature Communications 13 (2022) 333.
E et al. [2022] W. E, J. Han, J. Long, Empowering optimal control with machine learning: A perspective from model predictive control, arXiv preprint arXiv:2205.07990 (2022).
Ainsworth et al. [2021] S. Ainsworth, K. Lowrey, J. Thickstun, Z. Harchaoui, S. Srinivasa, Faster policy learning with continuous-time gradients, in: Learning for Dynamics and Control, PMLR, 2021, pp. 1054–1067.
Liberzon [2011] D. Liberzon, Calculus of variations and optimal control theory: a concise introduction, Princeton University Press, 2011.
Bellman [1957] R. Bellman, Dynamic Programming, Rand Coperation Research Study, Princeton University Press, 1957.
Azmi et al. [2021] B. Azmi, D. Kalise, K. Kunisch, Optimal feedback law recovery by gradient-augmented sparse polynomial regression, The Journal of Machine Learning Research 22 (2021) 2205–2236.
Kunisch et al. [2023] K. Kunisch, D. Vásquez-Varas, D. Walter, Learning optimal feedback operators and their sparse polynomial approximations, Journal of Machine Learning Research 24 (2023) 1–38.
Kang and Wilcox [2017] W. Kang, L. C. Wilcox, Mitigating the curse of dimensionality: sparse grid characteristics method for optimal feedback control and HJB equations, Computational Optimization and Applications 68 (2017) 289–315.
Weston et al. [2002] J. Weston, O. Chapelle, V. Vapnik, A. Elisseeff, B. Schölkopf, Kernel dependency estimation, Advances in neural information processing systems 15 (2002).
Meng et al. [2022] T. Meng, Z. Zhang, J. Darbon, G. Karniadakis, Sympocnet: Solving optimal control problems with applications to high-dimensional multiagent path planning problems, SIAM Journal on Scientific Computing 44 (2022) B1341–B1368.
Onken et al. [2022] D. Onken, L. Nurbekyan, X. Li, S. W. Fung, S. Osher, L. Ruthotto, A neural network approach for high-dimensional optimal control applied to multiagent path finding, IEEE Transactions on Control Systems Technology 31 (2022) 235–251.
Kierzenka and Shampine [2001] J. Kierzenka, L. F. Shampine, A bvp solver based on residual control and the maltab pse, ACM Transactions on Mathematical Software (TOMS) 27 (2001) 299–316.
Jacobson and Mayne [1970] D. H. Jacobson, D. Q. Mayne, Differential dynamic programming, Elsevier Publishing Company, 1970.
Osa et al. [2018] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al., An algorithmic perspective on imitation learning, Foundations and Trends® in Robotics 7 (2018) 1–179.
Bain and Sammut [1995] M. Bain, C. Sammut, A framework for behavioural cloning., in: Machine Intelligence 15, 1995, pp. 103–129.
Bock and Plitt [1984] H. Bock, K. Plitt, A multiple shooting algorithm for direct solution of optimal control problems, IFAC Proceedings Volumes 17 (1984) 1603–1608.
Betts [1998] J. T. Betts, Survey of numerical methods for trajectory optimization, Journal of Guidance, Control, and Dynamics 21 (1998) 193–207.
Ross and Fahroo [2002] I. M. Ross, F. Fahroo, A direct method for solving nonsmooth optimal control problems, IFAC Proceedings Volumes 35 (2002) 479–484.
Diehl et al. [2006] M. Diehl, H. G. Bock, H. Diedam, P.-B. Wieber, Fast direct multiple shooting algorithms for optimal robot control, in: Fast motions in biomechanics and robotics, Springer, 2006, pp. 65–93.
Kunisch and Walter [2021] K. Kunisch, D. Walter, Semiglobal optimal feedback stabilization of autonomous systems via deep neural network approximation, ESAIM: Control, Optimisation and Calculus of Variations 27 (2021) 16.
Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.
Pontryagin et al. [1962] L. S. Pontryagin, E. Mishchenko, G. RV, The mathematical theory of optimal processes, 1962.
Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
Zang et al. [2022] Y. Zang, J. Long, X. Zhang, W. Hu, W. E, J. Han, A machine learning enhanced algorithm for the optimal landing problem, in: 3rd Annual Conference on Mathematical and Scientific Machine Learning, PMLR, 2022, pp. 1–20.
Long and Han [2022] J. Long, J. Han, Perturbational complexity by distribution mismatch: A systematic analysis of reinforcement learning in reproducing kernel hilbert space, Journal of Machine Learning vol 1 (2022) 1–34.
Zhang et al. [2022] X. Zhang, J. Long, W. Hu, W. E, J. Han, Initial value problem enhanced sampling for closed-loop optimal control design with deep neural networks, arXiv preprint arXiv:2209.04078 (2022).
Nair et al. [2020] A. Nair, A. Gupta, M. Dalal, S. Levine, Awac: Accelerating online reinforcement learning with offline datasets, arXiv preprint arXiv:2006.09359 (2020).
Lee et al. [2022] S. Lee, Y. Seo, K. Lee, P. Abbeel, J. Shin, Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble, in: Conference on Robot Learning, PMLR, 2022, pp. 1702–1712.
Levine et al. [2020] S. Levine, A. Kumar, G. Tucker, J. Fu, Offline reinforcement learning: Tutorial, review, and perspectives on open problems, arXiv preprint arXiv:2005.01643 (2020).
Kang and Wilcox [2015] W. Kang, L. Wilcox, A causality free computational method for HJB equations with application to rigid body satellites, in: AIAA Guidance, Navigation, and Control Conference, 2015, p. 2009.
Bouabdallah et al. [2004] S. Bouabdallah, P. Murrieri, R. Siegwart, Design and control of an indoor micro quadrotor, in: IEEE International Conference on Robotics and Automation, volume 5, IEEE, 2004, pp. 4393–4398.
Madani and Benallegue [2006] T. Madani, A. Benallegue, Control of a quadrotor mini-helicopter via full state backstepping technique, in: Proceedings of the 45th IEEE Conference on Decision and Control, IEEE, 2006, pp. 1515–1520.
Mahony et al. [2012] R. Mahony, V. Kumar, P. Corke, Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor, IEEE Robotics and Automation magazine 19 (2012) 20–32.
Santurkar et al. [2018] S. Santurkar, D. Tsipras, A. Ilyas, A. Madry, How does batch normalization help optimization?, in: Advances in Neural Information Processing Systems, volume 32, 2018.
Zhuang et al. [2020] J. Zhuang, N. Dvornek, X. Li, S. Tatikonda, X. Papademetris, J. Duncan, Adaptive checkpoint adjoint method for gradient estimation in neural ODE, in: International Conference on Machine Learning, PMLR, 2020, pp. 11639–11649.
Chen et al. [2018] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, D. Duvenaud, Neural ordinary differential equations, in: Advances in Neural Information Processing Systems, volume 32, 2018.
Gholami et al. [2019] A. Gholami, K. Keutzer, G. Biros, Anode: Unconditionally accurate memory-efficient gradients for neural odes, arXiv preprint arXiv:1902.10298 (2019).
Miki et al. [2022] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning robust perceptive locomotion for quadrupedal robots in the wild, Science Robotics 7 (2022) eabk2822.
Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 27730–27744.
Mehrmann [1991] V. L. Mehrmann, The autonomous linear quadratic control problem: theory and numerical solution, Springer, 1991.
Tedrake and the Drake Development Team [2019] R. Tedrake, the Drake Development Team, Drake: Model-based design and verification for robotics, 2019. URL: https://drake.mit.edu.
Raffin et al. [2021] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (2021) 1–8.
Schulman et al. [2016] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, in: Proceedings of the International Conference on Learning Representations, 2016.
Kingma and Ba [2015] D. P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of the International Conference on Learning Representations, 2015.