
Vision-driven UAV River Following: Benchmarking with Safe Reinforcement Learning

Zihan Wang and Nina Mahmoudian School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: wang5044,ninam@ purdue.edu).
Abstract

In this study, we conduct a comprehensive benchmark of Safe Reinforcement Learning (Safe RL) algorithms for the task of vision-driven river following by an Unmanned Aerial Vehicle (UAV) in a Unity-based photo-realistic simulation environment. We empirically validate the effectiveness of a semantic-augmented image encoding method, assessing its superiority based on Relative Entropy and the quality of water pixel reconstruction. The choice of encoding dimension, guided by reconstruction loss, yields a more compact state representation that facilitates the training of Safe RL policies. Across all benchmarked Safe RL algorithms, we find that First Order Constrained Optimization in Policy Space achieves the best balance between reward acquisition and safety compliance. Notably, our results reveal that on-policy algorithms consistently outperform both off-policy and model-based counterparts in both training and testing environments. Importantly, the benchmarking outcomes and the vision encoding methodology extend beyond UAVs and are applicable to Autonomous Surface Vehicles (ASVs) engaged in autonomous navigation in confined waters.

keywords:
Safe Reinforcement Learning, Unmanned Aerial Vehicle, Vision-driven Control
This work was supported by ONR N00014-20-1-2085 and N00014-24-1-2019.

1 Introduction

Performing surveillance and rescue missions with autonomous vehicles in complex and unknown riverine environments requires a safe control policy for unmanned navigation that avoids detrimental failures such as collisions with natural or man-made obstacles. For example, during autonomous navigation missions an ASV can run aground on a river bank or sand island, and a UAV can collide with trees or bridges. Deep Reinforcement Learning (DRL) approaches have achieved promising results in solving autonomous navigation tasks of mobile robots, Wang et al. (2023); Lee and Yusuf (2022); Zhao et al. (2023). Water segmentation masks have been used for vision-driven autonomous river following by a UAV in a relatively simple curved river channel, Taufik et al. (2015); Taufik (2016). However, as illustrated in Ray et al. (2019); Ji et al. (2023a), meaningful trade-offs tend to exist between task performance and safety objectives, so a DRL model trained in an unconstrained manner may not behave safely automatically. A safe policy can be achieved by injecting safety into DRL algorithms to satisfy safety constraints throughout exploration during the training phase and inference during the testing phase. Safe RL thus bridges RL simulation experiments and real-world applications, Gu et al. (2022).

While DRL has proven effective in numerous game-like scenarios, its application to visual navigation within realistic environments presents a formidable challenge, Kulhánek et al. (2019). Li et al. (2020) tackle visual navigation in a low-resource setting with an unsupervised RL approach; Xiao et al. (2022) focus on extracting collision information from visual observations to reshape the reward for navigating a ground agent; Zieliński and Markowska-Kaczmar (2021) employ image processing modules such as an object detection network to navigate an Autonomous Underwater Vehicle (AUV) to a target point; and Knyaz and Kniaz (2020) address UAV navigation by jointly reconstructing a 3D voxel model and semantic scene segmentation from images. Nevertheless, the exploration of DRL for vision-driven navigation in riverine domains, considering the interplay between task objectives and safety, remains a relatively under-researched and challenging area. Thus it is critical to investigate how state-of-the-art Safe RL algorithms balance these trade-offs in the river following task; the safe agent in this paper is a UAV, but the approach can be extended to ASVs as well.

A riverine environment within Unreal Engine was constructed by Liang (2021) and Wei et al. (2022) to train DRL algorithms for the vision-driven UAV autonomous river following task. However, safety considerations are not integrated into either the environment or the DRL algorithms. Existing benchmarking environments that incorporate safety constraints either focus on ground vehicles (Ray et al. (2019); Ji et al. (2023a, b); Zhang et al. (2022b); Xu et al. (2023)) or use non-vision observations as policy input for aerial vehicles (Yuan et al. (2022)); few Safe RL environments target vision-driven control for autonomous navigation tasks in the riverine domain. In this regard, we present the Safe Riverine Environment (SRE), an extension of our previous work, Wang et al. (2024), on a photorealistic riverine simulation environment (RSE) for the river following task. Compared to RSE’s two maps, SRE has three difficulty levels ({Easy, Medium, Hard}, in accordance with the environment structure design in Ji et al. (2023a)) with increasing task complexity. As environment difficulty increases, the number of bridges, river turns and sand islands, and the variations of river width and depth increase. Safety constraints are introduced in SRE for the first time and are categorized into tight and loose constraints according to their severity. Fine-grained failure details in both constraint categories are provided in SRE for more detailed statistical failure analysis and algorithm comparison. Moreover, besides the pure RGB image input used for vision encoding in RSE, SRE also accepts pure Mask or RGB+Mask input for vision-driven navigation.

To facilitate DRL training, we enhance the visual representation by reducing the dimension of the Variational AutoEncoder (VAE) embedding, Kingma and Welling (2013), and augmenting RGB images with semantic water masks. This encoding strategy is validated to enhance robustness in river reconstruction and provide a more informative yet compact representation of the state. Based on this, we benchmark several state-of-the-art Safe RL algorithms in SRE with the enhanced visual encoding method, and provide an in-depth comparison and analysis of the performance-safety trade-off and failure cases among these algorithms. Specifically, the performance and safety measures are compared during the training phase in the Medium environment, then during the testing phase in all three environments. Orthographic views of the three environments in SRE are shown in Figure 1.

Figure 1: Orthographic views of the Safe Riverine Environment at the three difficulty levels: (a) Easy, (b) Medium, (c) Hard.

In summary, the main contributions of this paper are:

  • The presentation of a Unity-based photo-realistic Safe Riverine Environment that features multiple difficulty levels and fine-grained safety metrics feedback for training performant and safe autonomous agents in the river following task.

  • The validation of the superiority of the added semantic water mask for image encoding, as well as the choice of encoding dimension, for a more informative and compact state representation in vision-driven control.

  • The benchmark of several state-of-the-art Safe RL algorithms in SRE for the UAV river following task in terms of performance and safety measures during both training and testing phases, as well as a detailed comparative analysis of the strengths and weaknesses of these algorithms.

2 Methodology

This section gives a brief introduction to the Constrained Markov Decision Process (CMDP) for the Safe RL problem in Section 2.1, and the general methods to solve the constrained optimization problem of a CMDP in Section 2.2.

2.1 Preliminary

Safe RL is formalized as finding the (sub-)optimal policy in terms of reward gain while satisfying safety constraints in a CMDP, Altman (2021). A CMDP is a tuple $(S, A, R, C, P, \mu)$, where $S$ is the set of states, $A$ is the set of actions, $R: S \times A \rightarrow \mathbb{R}$ is the reward function, $C: S \times A \rightarrow \mathbb{R}$ is the cost function, $P: S \times A \times S \rightarrow [0,1]$ is the state transition probability function, and $\mu: S \rightarrow [0,1]$ is the initial state distribution. Compared to the Markov Decision Process (MDP) of regular RL, a CMDP incorporates the cost function $C$, which serves as the penalty for undesired states and (or) actions.

Safe RL aims to find a policy $\pi: S \rightarrow A$ that maximizes a performance measure $J^{R}(\pi) \in \mathbb{R}$ while constraining a safety measure $J^{C}(\pi) \leq d$, where the policy $\pi \in \Pi$ is a mapping from states to probability distributions over actions. The performance measure $J^{R}(\pi)$ is usually the expected discounted accumulated reward over an infinite horizon, $J^{R}(\pi) \doteq E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})]$; the cost measure $J^{C}(\pi)$ is defined similarly, $J^{C}(\pi) \doteq E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}C(s_{t},a_{t})]$; and $d \in \mathbb{R}_{\geq 0}$ is a cost budget hyperparameter. Here $\tau: (s_{0}, a_{0}, s_{1}, a_{1}, \ldots)$ denotes a trajectory generated by the policy $\pi$, and $\gamma$ is the discount factor. Thus, within the feasible policy space $\Pi_{C}$, the optimal Safe RL policy in a CMDP is

\pi^{\ast} = \arg\max_{\pi\in\Pi_{C}} J^{R}(\pi) \qquad (1)

To estimate the performance of a policy, the state value function $V^{R}_{\pi}(s) \doteq E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t}) \mid s_{0}=s]$ and the state-action value function $Q^{R}_{\pi}(s,a) \doteq E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t}) \mid s_{0}=s, a_{0}=a]$ are defined. $A^{R}_{\pi}(s,a) = Q^{R}_{\pi}(s,a) - V^{R}_{\pi}(s)$ is the reward advantage function. The value and advantage functions for the $m$ costs ($V^{C_{i}}_{\pi}(s)$, $Q^{C_{i}}_{\pi}(s,a)$, $A^{C_{i}}_{\pi}(s,a)$, $\forall i \in \{1,\ldots,m\}$) are defined similarly.
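For concreteness, the following is a minimal sketch (ours, not taken from the paper) of how the discounted reward and cost measures above can be estimated from a single sampled trajectory; the example arrays and the zero value baseline are purely illustrative assumptions.

```python
import numpy as np

def discounted_returns(signal, gamma=0.99):
    """Compute discounted returns G_t = sum_k gamma^k * signal_{t+k} for one episode."""
    returns = np.zeros_like(signal, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(signal))):
        running = signal[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative rollout: per-step rewards and costs from one trajectory tau ~ pi.
rewards = np.array([0.0, 0.2, 0.0, 0.2, 0.0])
costs   = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # a tight violation at episode end

J_R = discounted_returns(rewards)[0]   # sample estimate of J^R(pi)
J_C = discounted_returns(costs)[0]     # sample estimate of J^C(pi)

# With a learned value baseline V(s_t), a simple advantage estimate is
# A_t ~= G_t - V(s_t); a zero baseline is used here purely for illustration.
V = np.zeros_like(rewards)
A_R = discounted_returns(rewards) - V
```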

2.2 Benchmarked Algorithms

In this section, we briefly highlight the core contributions of several state-of-the-art on-policy and off-policy model-free, as well as model-based, Safe RL algorithms. On-policy algorithms update the policy using only experience collected by the most recent version of the policy, whereas off-policy algorithms update the policy using data collected at any point during training. Eight model-free algorithms are selected out of all ten benchmarked algorithms due to the dynamics-free and vision-driven nature of SRE. Model-free algorithms forgo the potential gains in sample efficiency and online planning from using an environment model that predicts state transitions, rewards and costs. In order to test the potential performance improvement from learning a vision-dynamics model, Grigorescu et al. (2021); Ginerica et al. (2021), we incorporate two model-based Safe RL algorithms in our benchmark. All benchmarked algorithms, with their domains and methods, are listed in Table 1.

2.2.1 Lagrangian Method

converts a constrained optimization problem into an unconstrained one by using adaptive penalty coefficients to enforce the constraints. Given the optimization objective of the CMDP in Equation 1, the unconstrained problem can be formed as a Lagrangian dual problem (the RCPO algorithm in Tessler et al. (2018))

\min_{\lambda\geq 0}\,\max_{\boldsymbol{\theta}}\, L(\lambda,\boldsymbol{\theta}) = \min_{\lambda\geq 0}\,\max_{\boldsymbol{\theta}}\,\left[J^{R}(\pi)-\lambda\left(J^{C}(\pi)-d\right)\right] \qquad (2)

where $L$ is the Lagrangian, $\lambda$ is the Lagrange multiplier, and $\boldsymbol{\theta}$ is the parameter vector of the parameterized policy $\pi(\boldsymbol{\theta})$. As $\lambda$ increases, the solution to the above equation converges to that of Equation 1. The effectiveness of primal-dual methods is justified in Paternain et al. (2022), where a zero duality gap is guaranteed under certain assumptions. Both on-policy (PPO, Schulman et al. (2017)) and off-policy (DDPG, Lillicrap et al. (2015); TD3, Fujimoto et al. (2018); SAC, Haarnoja et al. (2018)) RL algorithms can be integrated with the Lagrangian method to optimize a surrogate objective that considers both reward and costs. Due to the generality of this Lagrangian primal-dual optimization method, the algorithms with suffix "Lag" refer to the original RL algorithms combined with the Lagrangian method (Table 1).
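As an illustration, here is a minimal sketch (our own; the function names, learning rate, and example values are hypothetical) of the alternating primal-dual update used by Lagrangian methods: the policy ascends the penalized objective while the multiplier performs projected gradient ascent on the constraint violation.

```python
import torch

# Hypothetical quantities from a rollout batch.
# adv_r, adv_c: reward and cost advantages; log_ratio: log pi_theta(a|s) - log pi_old(a|s).
def lagrangian_policy_loss(log_ratio, adv_r, adv_c, lam):
    ratio = torch.exp(log_ratio)
    # Maximize J^R - lambda * J^C  ->  minimize the negative surrogate.
    return -(ratio * (adv_r - lam * adv_c)).mean()

def update_multiplier(lam, ep_cost, cost_budget_d, lr_lambda=0.01):
    # Projected gradient ascent on lambda: it grows while constraints are violated.
    lam = lam + lr_lambda * (ep_cost - cost_budget_d)
    return max(0.0, lam)

# Example multiplier update after an epoch with average episodic cost 0.6 and budget d = 0.2.
lam = update_multiplier(lam=1.0, ep_cost=0.6, cost_budget_d=0.2)
```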

Table 1: Domains and methods of benchmarked Safe RL algorithms

Domain      | Method              | Algorithms
On-Policy   | Primal-Dual         | PPOLag
On-Policy   | Convex Optimization | FOCOPS
On-Policy   | Penalty Function    | P3O
On-Policy   | Primal              | OnCRPO
Off-Policy  | Primal-Dual         | DDPGLag, TD3Lag, SACLag
Model-based | Online Plan         | SafeLOOP, CCEPETS

2.2.2 Constrained Policy Optimization Methods

provide an approach for policy search in continuous CMDPs that guarantees near-constraint satisfaction at each iteration, using a trust region, Schulman et al. (2015), to construct surrogate functions that approximate the objectives and constraints. For parameterized stationary policies, a constrained policy optimization method tries to solve the problem

\begin{split}
\pi_{k+1} &= \arg\max_{\pi\in\Pi_{\boldsymbol{\theta}}}\; \underset{\substack{s\sim d_{\pi_{k}}\\ a\sim\pi}}{E}\,[A^{R}_{\pi_{k}}(s,a)] \\
\text{s.t.}\quad J^{C_{i}}_{\pi_{k}} &\leq d_{i}-\frac{1}{1-\gamma}\underset{\substack{s\sim d_{\pi_{k}}\\ a\sim\pi}}{E}\,[A^{C_{i}}_{\pi_{k}}(s,a)] \quad \forall i \\
&\bar{D}_{KL}(\pi\,\|\,\pi_{k}) \leq \delta
\end{split} \qquad (3)

where $\bar{D}_{KL}(\pi\|\pi_{k}) = E_{s\sim\pi_{k}}[D_{KL}(\pi\|\pi_{k})[s]]$, $D_{KL}(\pi\|\pi_{k})$ is the Kullback–Leibler divergence (Equation 4) between the target policy $\pi$ and the current policy $\pi_{k}$, and $\delta$ is the target KL divergence. The first constraint in Equation 3 is derived from the objective difference between two policies (Kakade and Langford (2002)): $J(\pi^{\prime}) - J(\pi) = \frac{1}{1-\gamma}E_{s\sim d_{\pi^{\prime}},\,a\sim\pi^{\prime}}[A_{\pi}(s,a)]$, under the assumption that the second constraint holds so that policies within the trust region have nearly the same state distribution: $d_{\pi^{\prime}} \approx d_{\pi}$. In this domain, the First Order Constrained Optimization in Policy Space (FOCOPS, Zhang et al. (2020)) algorithm solves the above optimization objective using a first-order approximation to find the optimal update policy in the non-parameterized policy space, which is then projected back into the parameterized policy space.
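A simplified sketch (our own, with hypothetical tensor names and hyperparameter values) of the FOCOPS-style projection step: the non-parametric optimal policy reweights the current policy by the exponentiated, cost-penalized advantage, and the parameterized policy is regressed toward it under a per-state KL mask. This is a schematic of the published idea, not the authors' implementation.

```python
import torch

def focops_policy_loss(log_p_new, log_p_old, adv_r, adv_c, kl_per_state,
                       nu=0.1, lam=1.5, delta=0.02):
    """Per-state FOCOPS-style projection loss (schematic).

    log_p_new / log_p_old: log-probabilities of taken actions under pi_theta / pi_k
    adv_r / adv_c: reward and cost advantages
    kl_per_state: KL(pi_theta || pi_k) evaluated per sampled state
    nu: cost penalty weight, lam: temperature, delta: trust-region bound
    """
    ratio = torch.exp(log_p_new - log_p_old)
    # The gradient pushes pi_theta toward pi_k * exp((A^R - nu * A^C) / lam),
    # but only for states that still lie inside the trust region.
    inside = (kl_per_state.detach() <= delta).float()
    per_state = kl_per_state - (1.0 / lam) * ratio * (adv_r - nu * adv_c)
    return (inside * per_state).mean()
```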

2.2.3 Model-based Online Planning Methods

employ learned dynamics models for better sample efficiency and a faster convergence rate than model-free methods, Janner et al. (2019). Online planning optimizes model-predictive policies either by optimizing sampled trajectories while trading off the approximated model against value functions (SafeLOOP, Sikchi et al. (2022)), or by a cross-entropy method that updates the policy distribution from elite candidates selected according to their performance and feasibility in a population-based, gradient-free manner (CCEPETS, Wen and Topcu (2018)).
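Below is a minimal sketch (ours; the rollout model is a stand-in placeholder and the population sizes are arbitrary) of the constrained cross-entropy planning loop behind CCEPETS-style methods: candidate action sequences are sampled, scored by a learned model for predicted return and cost, and the sampling distribution is refit to the feasible elites.

```python
import numpy as np

def constrained_cem_plan(rollout_fn, horizon, act_dim, cost_budget=0.2,
                         pop=500, elites=50, iters=5):
    """rollout_fn(actions) -> (predicted_return, predicted_cost) under a learned model."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        candidates = mean + std * np.random.randn(pop, horizon, act_dim)
        scores = np.array([rollout_fn(c) for c in candidates])  # shape (pop, 2)
        returns, costs = scores[:, 0], scores[:, 1]
        feasible = costs <= cost_budget
        if feasible.any():
            # Rank feasible candidates by descending predicted return.
            key = np.where(feasible, -returns, np.inf)
        else:
            # No feasible plan found: fall back to minimizing predicted cost.
            key = costs
        elite = candidates[np.argsort(key)[:elites]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute the first action of the refined plan (MPC style)
```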

Besides the above-mentioned methods, several Safe RL algorithms using a penalty function or a primal approach are also benchmarked, as sketched after this paragraph. The Constraint-Rectified Policy Optimization (CRPO, Xu et al. (2021)) algorithm uses a primal approach, alternating policy updates between objective improvement and constraint satisfaction. Penalized Proximal Policy Optimization (P3O, Zhang et al. (2022a)) derives an equivalent unconstrained optimization problem by employing exact penalty functions instead of introducing dual variables, and optimizes a clipped surrogate objective instead of a trust-region objective. OnCRPO in Table 1 refers to the on-policy version of CRPO.
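A minimal sketch (ours, with hypothetical helper names) of the CRPO-style alternation rule: when the estimated cost of any constraint exceeds its budget plus a tolerance, the update descends that cost objective; otherwise it ascends the reward objective.

```python
def crpo_update(policy_step, est_return_grad, est_cost_grads, cost_estimates,
                budgets, tolerance=0.05):
    """One CRPO-style primal update (schematic).

    est_return_grad: gradient direction that improves the reward objective
    est_cost_grads / cost_estimates / budgets: per-constraint gradients, J^C_i estimates, d_i
    policy_step(direction): applies one ascent step along `direction`
    """
    for grad_c, j_c, d_i in zip(est_cost_grads, cost_estimates, budgets):
        if j_c > d_i + tolerance:
            # Constraint i is violated: take a step that reduces its cost.
            policy_step(-grad_c)
            return "constraint_step"
    # All constraints satisfied within tolerance: improve the reward objective.
    policy_step(est_return_grad)
    return "objective_step"
```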

3 Experiments

The components of the Safe Riverine Environment as a CMDP are discussed in Section 3.1. The experiments on the state (visual encoding) dimension and image channel dimension are described in Section 3.2. The benchmark experiments of Safe RL algorithms are detailed in Section 3.3.

3.1 Safe Riverine Environment

Built on top of RSE, SRE is a Unity-based RL training environment implemented with the ml-agents toolkit, Juliani et al. (2018). The UAV agent in both environments is abstracted as a camera facing 20 degrees down. As a CMDP abstracted by the tuple $(S, A, R, C, P, \mu)$, Section 2.1, SRE is detailed below by each component of this tuple. The state space $S$ is the set of image encoding vectors produced by the VAE. We decide the length of the vector according to the reconstruction loss of VAE networks with various latent space dimensions. The augmentation of image channels with an additional water mask channel is also validated by virtue of Relative Entropy (RE). A multi-discrete action space $A$ with four branches (each branch consisting of actions from $\{0, 1, 2\}$) is adopted to control the vertical, yaw, longitudinal and lateral movements of the UAV in a discretized manner. SRE is dynamics-free to reduce the training difficulty, so the state transition model $P$ is deterministic. For the initial state distribution $\mu$, we reset the UAV agent to a random safe pose above the circular river at the beginning of each episode.
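For illustration, a minimal sketch (ours) of the four-branch multi-discrete action space and a possible mapping to UAV velocity commands; the branch semantics {0: negative, 1: no-op, 2: positive}, the step size, and the use of gymnasium instead of ml-agents are assumptions for the example only.

```python
import numpy as np
from gymnasium.spaces import MultiDiscrete

# Four branches, each with 3 discrete choices: vertical, yaw, longitudinal, lateral.
action_space = MultiDiscrete([3, 3, 3, 3])

def action_to_command(action, step=0.5):
    """Map branch indices {0, 1, 2} to {-step, 0, +step} per control axis (illustrative)."""
    offsets = (np.asarray(action) - 1) * step
    return {"vertical": offsets[0], "yaw": offsets[1],
            "longitudinal": offsets[2], "lateral": offsets[3]}

cmd = action_to_command(action_space.sample())
```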

The safe operation space in SRE is the bounding volume lifted from the river surface. Reward is given when the UAV agent visits previously unvisited river spline segments; the reward value is the ratio of the number of newly visited segments to the total number of segments, multiplied by 10. Conversely, a positive cost is given before the environment resets if the agent travels to any unsafe pose. A safe agent stays inside the bounding volume, has no collision with bridges, and keeps its view close to the tangent line of the river spline. In SRE, we categorize the safety constraints into tight and loose ones, where violating the former results in $c = 1$ and the latter in $c = 0.2$. Tight constraints include {Collision, OutOfVolumeHorizontally, OutOfVolumeVertically}, and loose constraints contain {YawOverDeviation, Idle, MaxStepReached}. More details can be found in Wang et al. (2024).
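A minimal sketch (ours, with hypothetical function and variable names) of the reward and cost signals described above:

```python
TIGHT = {"Collision", "OutOfVolumeHorizontally", "OutOfVolumeVertically"}
LOOSE = {"YawOverDeviation", "Idle", "MaxStepReached"}

def step_reward(newly_visited_segments, total_segments):
    # Progress reward: fraction of newly visited river spline segments, scaled by 10.
    return 10.0 * newly_visited_segments / total_segments

def step_cost(violation):
    # Tight violations cost 1.0, loose violations 0.2, safe steps 0.0.
    if violation in TIGHT:
        return 1.0
    if violation in LOOSE:
        return 0.2
    return 0.0

assert step_reward(3, 120) == 0.25
assert step_cost("Collision") == 1.0 and step_cost("Idle") == 0.2
```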

3.2 Image Encoding

The length of the state vector is determined by comparing the reconstruction loss of VAE networks with different latent dimensions. The input image is of size $128 \times 128 \times 3$, and the reconstruction loss is the Mean Squared Error between the input image and the reconstructed image, averaged over 2000 randomly collected image samples. The overall reconstruction loss decreases as the latent dimension decreases, until the dimension reaches 8, which is only better than the 1024-dimensional latent space network. Since the reconstruction losses for dimensions $\{64, 32, 16\}$ show no significant difference, we choose the smallest state vector size, 16, for Safe RL training.
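The following sketch (ours; the encoder/decoder architecture and the random stand-in batch are illustrative assumptions, not the paper's implementation) shows how such a per-dimension reconstruction comparison can be set up with a small convolutional VAE in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    """Small convolutional VAE for 128x128 images (illustrative architecture)."""
    def __init__(self, in_ch=3, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, 2, 1), nn.ReLU(),   # 128 -> 64
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),      # 64 -> 32
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),     # 32 -> 16
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),    # 16 -> 8
            nn.Flatten())
        self.fc_mu = nn.Linear(256 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(256 * 8 * 8, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 256 * 8 * 8)
        self.dec = nn.Sequential(
            nn.Unflatten(1, (256, 8, 8)),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, 2, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(self.fc_dec(z)), mu, logvar

# Compare mean reconstruction loss across candidate latent dimensions (untrained here).
images = torch.rand(16, 3, 128, 128)  # stand-in for the 2000 collected samples
for dim in (1024, 64, 32, 16, 8):
    vae = ConvVAE(latent_dim=dim)
    recon, mu, logvar = vae(images)
    print(dim, F.mse_loss(recon, images).item())
```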

We use a 4-channel image input to the VAE encoder based on the intuition that the added fourth water mask channel may serve as an inductive bias for a better state representation and easier RL training. On one hand, water pixels in an image provide the most relevant and useful information for the river following task, compared to distracting background objects like sky and terrain. On the other hand, the semantic water mask could bridge the gap between simulated and real riverine environments, Wang and Mahmoudian (2023), to facilitate easier Sim2Real policy transfer. This 4-channel encoding input is empirically validated via the Relative Entropy (also named Kullback–Leibler divergence) between two encoding datasets with different input image channel sizes, which indicates straightforwardly which one contains more information than the other. Suppose $x$ belongs to some sample space $\chi$, and $P(x)$ and $Q(x)$ are two discrete probability distributions over $x$; then the RE is

RE(P\,\|\,Q) = \sum_{x\in\chi} P(x)\log\frac{P(x)}{Q(x)} \qquad (4)

In our experiment, $N_{sample} = 2000$ image-mask pairs are randomly collected from the Medium SRE, and VAEs are then trained separately with channel sizes {3: RGB, 1: Mask, 4: RGB+Mask}. Encoding vectors are linearly rescaled to the range $(-2, 2)$ and histogramized into probability distributions with bin number $N_{bin} = \lceil\sqrt{N_{sample}}\rceil = 45$. RE is averaged over the 16 features to form Table 2.
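A minimal sketch (ours; the array names are hypothetical) of the histogram-based RE estimate between two 16-dimensional encoding datasets, following the rescaling and binning procedure above:

```python
import numpy as np

def relative_entropy(enc_p, enc_q, n_bins=45, lo=-2.0, hi=2.0, eps=1e-10):
    """Average per-feature KL(P || Q) between two encoding datasets of shape (N, 16)."""
    def rescale(e):  # linearly rescale each feature to [lo, hi]
        mn, mx = e.min(axis=0), e.max(axis=0)
        return lo + (e - mn) / (mx - mn + eps) * (hi - lo)
    enc_p, enc_q = rescale(enc_p), rescale(enc_q)
    res = []
    for f in range(enc_p.shape[1]):
        p, _ = np.histogram(enc_p[:, f], bins=n_bins, range=(lo, hi))
        q, _ = np.histogram(enc_q[:, f], bins=n_bins, range=(lo, hi))
        p = p / p.sum() + eps
        q = q / q.sum() + eps
        res.append(np.sum(p * np.log(p / q)))
    return float(np.mean(res))

# e.g. relative_entropy(rgbmask_encodings, rgb_encodings) with arrays of shape (2000, 16)
```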

Table 2: Relative Entropy of the VAE-encoded 16-dimensional state vectors of images with different channel sizes

P(x) \ Q(x) | RGB                 | MASK                | RGB+MASK
RGB         | -                   | \boldsymbol{0.103}  | 0.037
MASK        | 0.075               | -                   | 0.091
RGB+MASK    | \boldsymbol{0.075}  | \boldsymbol{0.187}  | -

Since RE is non-negative and non-symmetric, we measure the information gain by comparing the relative values in the off-diagonal cells of Table 2. RGB+Mask encoding contains more information than both RGB and Mask encoding (bold numbers in the last row), and, expectedly, RGB encoding is informatively richer than Mask encoding (bold number in the first row). Qualitatively, we can validate the superiority of RGB+Mask encoding over RGB encoding in terms of the reconstruction quality of river pixels, as shown in Figure 2. It is obvious that some reconstructed RGB images lose the river information (row c), whereas all reconstructed RGB+Mask images (row d) retain the river channel pixels, which are an important cue for the UAV to follow the river based on vision input.


Figure 2: Comparison of reconstructed images of VAE trained with RGB and RGB+Mask inputs. Row a: source RGB images. Row b: source masks. Row c: reconstructed RGB images. Row d: reconstructed RGB+Mask images.

3.3 Benchmarking Specifications

All benchmarked Safe RL algorithms are adapted from the OmniSafe repository, Ji et al. (2023b). The VAE is pre-trained for 100 epochs with 2000 image-mask pairs randomly collected in the Medium environment, and the encoder part of the VAE is used in both training and testing phases for all benchmarked Safe RL algorithms. Action space shaping that discretizes the diagonal Gaussian continuous action into the multi-discrete action is adopted, Kanervisto et al. (2020). All algorithms are trained with three seeds for 200K steps in the Medium environment of SRE, and tested in all three environments with the respective seeds. During the training phase, episodic return (averaged over the last 50 episodes) and cost rate, Ray et al. (2019), are compared. Here the cost rate is the ratio of the accumulated number of environment resets due to constraint violations to the current number of training steps. In the testing phase, algorithms are evaluated in all environments and compared according to episodic return and episodic cost, both averaged over 60 episodes. The PPO algorithm, trained with reward only, serves as the baseline for comparison.
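A minimal sketch (ours; the uniform bin edges and function name are assumptions, not the exact shaping used in the paper) of the action space shaping step: each dimension of the continuous diagonal Gaussian action, squashed to [-1, 1], is binned into one of the three discrete choices per branch.

```python
import numpy as np

def continuous_to_multidiscrete(cont_action, n_choices=3):
    """Map a continuous action in [-1, 1]^4 to multi-discrete branch indices {0, 1, 2}."""
    cont_action = np.clip(cont_action, -1.0, 1.0)
    # Uniform bins over [-1, 1]: [-1, -1/3) -> 0, [-1/3, 1/3) -> 1, [1/3, 1] -> 2.
    edges = np.linspace(-1.0, 1.0, n_choices + 1)[1:-1]
    return np.digitize(cont_action, edges)

assert list(continuous_to_multidiscrete(np.array([-0.9, 0.0, 0.5, 1.0]))) == [0, 1, 2, 2]
```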

4 Results and Analysis

In this section, training results in the training environment and evaluation results in all environments are presented and analysed.

Figure 3: Episodic returns and total cost rates of Safe RL algorithms during the training phase in Medium-level SRE

From Figure 3, all off-policy algorithms (DDPGLag, TD3Lag and SACLag) and model-based algorithms (SafeLOOP, CCEPETS) are worse than the baseline (PPO) in either episodic return or cost rate. Specifically, model-based Safe RL algorithms have the lowest episodic return but the highest and non-decreasing cost rate, whereas off-policy algorithms have slightly better performance and an overall decreasing cost rate. In contrast, on-policy algorithms (PPOLag, FOCOPS, P3O) achieve overall better metrics during the training phase. Specifically, FOCOPS is the sole algorithm that exceeds the baseline in both metrics, with 1.4 times the baseline's episodic return and 65% of its cost rate.

Figure 4: Tight and loose cost rates of Safe RL algorithms during the training phase in Medium-level SRE

From Figure 4, the tight cost rates of all algorithms follow broadly similar trends to the total cost rates in Figure 3. In general, model-based algorithms have the highest violation rates of tight constraints, followed by off-policy Lagrangian algorithms, then on-policy algorithms. Combined with the loose cost rate plot, the OnCRPO algorithm commits a larger share of loose-constraint violations among all its violations, meaning the agent often triggers the less severe constraints by making consecutive non-progressive actions in situ. The same phenomenon can also be observed in Figure 5. This is in line with the core feature of CRPO: immediately switching between optimizing the objective and reducing the constraint cost whenever a constraint is violated. However, the exploration ability of this primal-based algorithm is limited compared to the Lagrangian-based ones.

Figure 5: Aggregated failure counts over 60 evaluation episodes in three environments for the 10 benchmarked algorithms.
Table 3: Evaluation results of Safe RL algorithms (trained in Medium SRE) tested in the three levels of SRE maps in terms of episodic return (EpR) and episodic cost (EpC), averaged over 60 episodes (20 episodes per seed)

Algo           | Easy EpR               | Easy EpC                            | Medium EpR             | Medium EpC                          | Hard EpR               | Hard EpC
PPO (baseline) | 9.19±2.63              | 0.09±0.28                           | 5.22±3.89              | 0.74±0.44                           | 1.34±1.34              | 0.84±0.32
PPOLag         | 5.86±3.85              | 0.32±0.38                           | 3.81±3.36              | 0.79±0.37                           | 0.96±0.87              | 0.65±0.40
FOCOPS         | \boldsymbol{7.83±3.74} | \underline{\boldsymbol{0.15±0.32}}  | \boldsymbol{6.30±4.41} | 0.49±0.49                           | \boldsymbol{2.05±1.96} | 0.79±0.35
P3O            | 6.23±3.93              | 0.37±0.44                           | 3.14±3.43              | 0.67±0.41                           | 0.85±0.82              | 0.65±0.40
OnCRPO         | 0.27±0.50              | 0.27±0.22                           | 0.10±0.14              | 0.29±0.26                           | 0.15±0.23              | 0.36±0.32
DDPGLag        | 1.75±2.55              | 0.78±0.37                           | 0.73±0.91              | 0.83±0.33                           | 0.44±0.64              | 0.80±0.35
TD3Lag         | 0.14±0.12              | 0.55±0.40                           | 0.13±0.10              | 0.69±0.39                           | 0.10±0.10              | 0.48±0.38
SACLag         | 0.06±0.18              | 0.20±0.00                           | 0.02±0.03              | \underline{\boldsymbol{0.27±0.22}}  | 0.04±0.11              | \underline{\boldsymbol{0.27±0.22}}
SafeLOOP       | 0.12±0.12              | 0.87±0.30                           | 0.12±0.11              | 0.92±0.24                           | 0.15±0.24              | 0.84±0.32
CCEPETS        | 0.08±0.10              | 0.79±0.35                           | 0.05±0.04              | 0.79±0.35                           | 0.08±0.23              | 0.69±0.39

Table 3 lists the evaluation results of agents trained in the Medium SRE and evaluated in all three environments in terms of episodic return and episodic cost. In SRE, the maximum attainable episodic return is 10, and the maximum episodic cost is 1. Notably, FOCOPS achieves the highest episodic return (bold) in all environments and the lowest episodic cost (bold with underline) in the Easy environment. FOCOPS achieves better results than the baseline in both the Medium and Hard environments, and is worse than the baseline only in the Easy environment.

Detailed failure case visualization is presented in Figure 5, where the total number of failures for each failure reason gives an intuitive impression of how the algorithms behave in terms of safety. No algorithm triggers the MaxStepReached failure (500 steps), since each river needs fewer than 500 perfect steps to traverse, so Success or an Idle failure is triggered beforehand. First, the baseline PPO shows almost linearly decreasing failure counts as the failure severity decreases (from the left to the right subplot in Figure 5), making it a decent baseline algorithm that focuses only on the task objective without safety considerations. The only algorithm resembling this trend is FOCOPS, with overall fewer failure cases. The only algorithm that reverses this trend is OnCRPO, which has fewer failures than the baseline on tight constraints but more violations of loose constraints. Second, the model-based algorithms have trouble producing consecutive valid safe actions, so that collision with bridges or being idle is never even triggered. Third, on-policy algorithms have overall fewer failures than off-policy and model-based algorithms, which is also corroborated by Figure 3 and Table 3. In this regard, we speculate that the visual dynamics and characteristics of SRE limit the effectiveness of off-policy correction methods such as Importance Sampling (IS), causing a data distribution mismatch that breaks the assumptions of off-policy algorithms.

In summary, SRE is shown to be a difficult Safe RL environment in which to operate both performantly and safely. Several findings can be highlighted from the benchmarking results to inspire future work. First, the scale and value of the cost as a penalty term in Lagrangian methods need to be investigated, since an unclipped, increasing Lagrange multiplier may limit the exploration ability of the policy. Second, an immediate response to constraint violations (CRPO) provides good safety compliance, whereas its balance with reward maximization needs more attention. Third, the discretization of the action space may cause strongly IS-dependent off-policy algorithms to struggle, so an adaptation of SRE to a continuous action space needs to be experimented with. Fourth, the effectiveness of the learned visual dynamics in model-based algorithms needs to be researched to make sure the dynamics model actually helps safe policy learning instead of harming it. Fifth, FOCOPS shows a good balance between exploration and safety, demonstrating the benefits of optimizing in the non-parameterized policy space and then projecting into the parameterized one, as well as of the simple first-order approximation. However, based on its safety violation trend in Figure 5, how to better discriminate between tight and loose constraints to train a safer policy remains an open problem for this algorithm.

5 Conclusion

In conclusion, this paper makes several contributions to the field of safe vision-driven autonomous navigation in confined waters. The development of a Unity-based photo-realistic Safe Riverine Environment (SRE) introduces multiple difficulty levels and provides fine-grained safety metrics feedback, enabling the RL training of autonomous agents for both performance and safety in the river following task. Additionally, the study validates the effectiveness of incorporating a semantic water mask for image encoding and explores the optimal encoding dimension, enhancing the informativeness and compactness of the state representation in vision-driven control. Furthermore, the paper conducts a comprehensive benchmark of several state-of-the-art Safe RL algorithms within the SRE, evaluating and analysing their performance and safety measures in terms of both tight and loose constraint violations. Notably, the proposed vision encoding method is also applicable to vision-driven navigation of ASVs in rivers, and the benchmarking results of Safe RL algorithms offer valuable insights for safe autonomy in the maritime domain.

References

  • Altman (2021) Altman, E. (2021). Constrained Markov decision processes. Routledge.
  • Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International conference on machine learning, 1587–1596. PMLR.
  • Ginerica et al. (2021) Ginerica, C., Isofache, V., and Grigorescu, S.M. (2021). Vision dynamics: Environment modelling, path planning and control based on semantic segmentation. In 2021 International Aegean Conference on Electrical Machines and Power Electronics (ACEMP) & 2021 International Conference on Optimization of Electrical and Electronic Equipment (OPTIM), 481–486. IEEE.
  • Grigorescu et al. (2021) Grigorescu, S., Ginerica, C., Zaha, M., Macesanu, G., and Trasnea, B. (2021). Lvd-nmpc: A learning-based vision dynamics approach to nonlinear model predictive control for autonomous vehicles. International Journal of Advanced Robotic Systems, 18(3), 17298814211019544.
  • Gu et al. (2022) Gu, S., Yang, L., Du, Y., Chen, G., Walter, F., Wang, J., Yang, Y., and Knoll, A. (2022). A review of safe reinforcement learning: Methods, theory and applications. arXiv preprint arXiv:2205.10330.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, 1861–1870. PMLR.
  • Janner et al. (2019) Janner, M., Fu, J., Zhang, M., and Levine, S. (2019). When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32.
  • Ji et al. (2023a) Ji, J., Zhang, B., Zhou, J., Pan, X., Huang, W., Sun, R., Geng, Y., Zhong, Y., Dai, J., and Yang, Y. (2023a). Safety gymnasium: A unified safe reinforcement learning benchmark. Advances in Neural Information Processing Systems, 36.
  • Ji et al. (2023b) Ji, J., Zhou, J., Zhang, B., Dai, J., Pan, X., Sun, R., Huang, W., Geng, Y., Liu, M., and Yang, Y. (2023b). Omnisafe: An infrastructure for accelerating safe reinforcement learning research. arXiv preprint arXiv:2305.09304.
  • Juliani et al. (2018) Juliani, A., Berges, V.P., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y., Henry, H., Mattar, M., et al. (2018). Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627.
  • Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, 267–274.
  • Kanervisto et al. (2020) Kanervisto, A., Scheller, C., and Hautamäki, V. (2020). Action space shaping in deep reinforcement learning. In 2020 IEEE conference on games (CoG), 479–486. IEEE.
  • Kingma and Welling (2013) Kingma, D.P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Knyaz and Kniaz (2020) Knyaz, V. and Kniaz, V. (2020). Object recognition for uav navigation in complex environment. In Image and Signal Processing for Remote Sensing XXVI, volume 11533, 154–166. SPIE.
  • Kulhánek et al. (2019) Kulhánek, J., Derner, E., De Bruin, T., and Babuška, R. (2019). Vision-based navigation using deep reinforcement learning. In 2019 european conference on mobile robots (ECMR), 1–8. IEEE.
  • Lee and Yusuf (2022) Lee, M.F.R. and Yusuf, S.H. (2022). Mobile robot navigation using deep reinforcement learning. Processes, 10(12), 2748.
  • Li et al. (2020) Li, J., Wang, X., Tang, S., Shi, H., Wu, F., Zhuang, Y., and Wang, W.Y. (2020). Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12123–12132.
  • Liang (2021) Liang, J.R. (2021). Vision-Based Unmanned Aerial Vehicle Navigation in Virtual Complex Environment using Deep Reinforcement Learning. University of California, Davis.
  • Lillicrap et al. (2015) Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • Paternain et al. (2022) Paternain, S., Calvo-Fullana, M., Chamon, L.F., and Ribeiro, A. (2022). Safe policies for reinforcement learning via primal-dual methods. IEEE Transactions on Automatic Control, 68(3), 1321–1336.
  • Ray et al. (2019) Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 7(1), 2.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, 1889–1897. PMLR.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Sikchi et al. (2022) Sikchi, H., Zhou, W., and Held, D. (2022). Learning off-policy with online planning. In Conference on Robot Learning, 1622–1633. PMLR.
  • Taufik (2016) Taufik, A. (2016). Multi-rotor drone to fly autonomously along a river and 3d map modeling of an environment around a river.
  • Taufik et al. (2015) Taufik, A., Okamoto, S., and Lee, J.H. (2015). Multi-rotor drone to fly autonomously along a river using a single-lens camera and image processing. International Journal of Mechanical Engineering, 4(6), 39–49.
  • Tessler et al. (2018) Tessler, C., Mankowitz, D.J., and Mannor, S. (2018). Reward constrained policy optimization. arXiv preprint arXiv:1805.11074.
  • Wang et al. (2023) Wang, X., Sun, Y., Xie, Y., Bin, J., and Xiao, J. (2023). Deep reinforcement learning-aided autonomous navigation with landmark generators. Frontiers in Neurorobotics, 17.
  • Wang et al. (2024) Wang, Z., Li, J., and Mahmoudian, N. (2024). Vision-driven autonomous flight of uav along river using deep reinforcement learning with dynamic expert guidance. arXiv preprint arXiv:2401.09332.
  • Wang and Mahmoudian (2023) Wang, Z. and Mahmoudian, N. (2023). Aerial fluvial image dataset for deep semantic segmentation neural networks and its benchmarks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
  • Wei et al. (2022) Wei, P., Liang, R., Michelmore, A., and Kong, Z. (2022). Vision-based 2d navigation of unmanned aerial vehicles in riverine environments with imitation learning. Journal of Intelligent & Robotic Systems, 104(3), 47.
  • Wen and Topcu (2018) Wen, M. and Topcu, U. (2018). Constrained cross-entropy method for safe reinforcement learning. Advances in Neural Information Processing Systems, 31.
  • Xiao et al. (2022) Xiao, W., Yuan, L., He, L., Ran, T., Zhang, J., and Cui, J. (2022). Multigoal visual navigation with collision avoidance via deep reinforcement learning. IEEE Transactions on Instrumentation and Measurement, 71, 1–9.
  • Xu et al. (2021) Xu, T., Liang, Y., and Lan, G. (2021). Crpo: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning, 11480–11491. PMLR.
  • Xu et al. (2023) Xu, Z., Liu, B., Xiao, X., Nair, A., and Stone, P. (2023). Benchmarking reinforcement learning techniques for autonomous navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 9224–9230. IEEE.
  • Yuan et al. (2022) Yuan, Z., Hall, A.W., Zhou, S., Brunke, L., Greeff, M., Panerati, J., and Schoellig, A.P. (2022). Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics. IEEE Robotics and Automation Letters, 7(4), 11142–11149.
  • Zhang et al. (2022a) Zhang, L., Shen, L., Yang, L., Chen, S., Yuan, B., Wang, X., and Tao, D. (2022a). Penalized proximal policy optimization for safe reinforcement learning. arXiv preprint arXiv:2205.11814.
  • Zhang et al. (2022b) Zhang, L., Zhang, Q., Shen, L., Yuan, B., and Wang, X. (2022b). Saferl-kit: Evaluating efficient reinforcement learning methods for safe autonomous driving. arXiv preprint arXiv:2206.08528.
  • Zhang et al. (2020) Zhang, Y., Vuong, Q., and Ross, K. (2020). First order constrained optimization in policy space. Advances in Neural Information Processing Systems, 33, 15338–15349.
  • Zhao et al. (2023) Zhao, S., Wang, W., Li, J., Huang, S., Liu, S., et al. (2023). Autonomous navigation of the uav through deep reinforcement learning with sensor perception enhancement. Mathematical Problems in Engineering, 2023.
  • Zieliński and Markowska-Kaczmar (2021) Zieliński, P. and Markowska-Kaczmar, U. (2021). 3d robotic navigation using a vision-based deep reinforcement learning model. Applied Soft Computing, 110, 107602.