A Transferability Metric Using Scene Similarity and Local Map Observation for DRL Navigation

Shiwei Lian and Feitian Zhang Shiwei Lian is with the Department of Advanced Manufacturing and Robotics, College of Engineering, Peking University, Beijing, 100871, China, [email protected]Feitian Zhang is with the Department of Advanced Manufacturing and Robotics, and the State Key Laboratory of Turbulence and Complex Systems, College of Engineering, Peking University, Beijing, 100871, China, [email protected]

Abstract

While deep reinforcement learning (DRL) has attracted a rapidly growing interest in solving the problem of navigation without global maps, DRL typically leads to a mediocre navigation performance in practice due to the gap between the training scene and the actual test scene. To quantify the transferability of a DRL agent between the training and test scenes, this paper proposes a new transferability metric — the scene similarity calculated using an improved image template matching algorithm. Specifically, two transferability performance indicators are designed including the global scene similarity that evaluates the overall robustness of a DRL algorithm and the local scene similarity that serves as a safety measure when a DRL agent is deployed without a global map. In addition, this paper proposes the use of a local map that fuses 2D LiDAR data with spatial information of both the agent and the destination as the DRL observation, aiming to improve the transferability of DRL navigation algorithms. With a wheeled robot as the case study platform, both simulation and real-world experiments are conducted in a total of 26 different scenes. The experimental results affirm the robustness of the local map observation design and demonstrate the strong correlation between the scene similarity metric and the success rate of DRL navigation algorithms.

Index Terms:

autonomous navigation, scene similarity, deep reinforcement learning, local map

I Introduction

Autonomous navigation is an indispensable ability for mobile robots to find a collision-free path towards destinations. To enhance navigational performances in unknown environments, deep reinforcement learning (DRL) has emerged as a promising approach for its generalization ability [1, 2, 3, 4]. Particularly, DRL has demonstrated great potential in navigation without a global map that takes multi-modal sensory data as observations [5, 6, 7, 8]. With low data sample efficiency, most DRL navigation algorithms first train policies in simulators and then deploy these trained policies to robots in the real world. However, the gap between simulation and reality or the difference between the training scene and the test scene often leads to poor navigation performance when a well-trained DRL is transferred to or deployed in a new environment [9, 10]. Specifically, in autonomous navigation, the DRL algorithm trains a reactive strategy that takes the action with the highest cumulative reward in training. The trained strategy, statistically speaking, has a lower navigation success rate when transferred to the test environment due to the differences between the training and test scenes in, for example, the number and the size of obstacles [11].

While a number of DRL navigation methods have been investigated to increase the success rate when applied in new environments, limited research has looked into the quantification of the transferability of DRL navigation algorithms. This paper proposes a transferability metric for the DRL navigation algorithms by quantifying the similarity between the training scene and the test scene. Specifically, this paper proposes two scene similarity performance indicators using the improved image template matching algorithm to quantify the transferability of the DRL navigation algorithm. The global scene similarity, calculated from the global maps of the training and test scenes, is designed to evaluate the overall transferability or robustness of different navigation algorithms. The local scene similarity, taking the collected local obstacle maps in the test scenes, serves as a safety indicator when a trained agent is deployed in a new environment without a global map. In addition, this paper designs a robust DRL navigation algorithm using the local map as the observation that fuses 2D LiDAR data, the agent position, and the destination position together onto the local map as the input to the action network. Experimental results through diverse scenes both in simulation and the real world further show the effectiveness of the proposed scene similarity metric and local map-based DRL navigation in quantifying and enhancing algorithmic transferability, respectively.

The main contributions of the paper are threefold. First, a novel scene similarity metric involving global and local measures is proposed to quantify the transferability of DRL navigation using the improved image template matching algorithm. To the best of the authors’ knowledge, this is the first effort to design such a transferability metric using scene similarity for DRL navigation. Second, different from common designs using range measurements as observation inputs, a robust DRL navigation algorithm is proposed using the fused local map as the input, which allows interchanging the LiDAR sensor with different fields of views (FoVs) and angular resolutions, thus exhibiting higher transferability at deployment time. Third, extensive experiments with diverse scenes constructed in both simulation and the real world are conducted. The experimental results support the validation of the proposed transferability metric and show the robustness of the designed DRL algorithm using the local map as the observation.

II Related Work

II-A Robot Navigation

This paper focuses on navigation tasks in unknown environments without a global map. Given only the local perception of the robot, traditional path/motion planning algorithms such as artificial potential fields (APF)[12] and dynamic window approach[13] would likely fail due to the local optimum. In addition, their navigation performance is highly dependent on their selected parameters, while it is generally difficult to update the parameters for an unknown environment[3].

To overcome the limitations of the traditional methods, DRL algorithms have been applied to enhance navigational performances. Sensor data used in the DRL navigation of mobile robots includes but not limited to 2D LiDAR [3, 4, 5, 6, 10, 14, 15], RGB images [16], and depth images [1, 2, 8]. 2D LiDAR data is most commonly used due to the minor sim-to-real gap and extensively represented in the form of vectors [4, 5, 6, 10]. This representation results in a fixed sensory FoV and a fixed-size measurement as the observation.

To relax the constraint on 2D LiDAR observations and enhance the DRL navigation performance, researchers have investigated a number of alternative observation designs. For example, Leiva et al. [3] converted LiDAR data into 2D point clouds with variable sizes as observations and achieved improved navigational performances. Pfeifer et al. [17] and Yao et al. [18] converted 2D LiDAR data to occupancy maps to address more complex and dynamic environments showing great potential in robot navigation with sensor fusion.

This paper proposes a local map-based DRL navigation algorithm independent of a global map. The 2D LiDAR data, the agent position, and the destination position are all fused onto the local map as the observation. Convolutional neural networks (CNNs) are utilized to efficiently extract the features of the fused spatial data and learn a robust navigation policy. The local map-based navigation algorithm is independent of the use of the transferability metric.

II-B DRL Transferability

To enhance the transferability of DRL, many studies have been conducted to bridge the gap between the training and test performances, including but not limited to domain randomization[19, 20, 21, 22], learning from demonstration[23], domain adaptation[24, 25, 26], and meta-RL[27].

For instance, Chaffre et al.[22] designed a DRL policy with incremental environment complexity to reduce the need for additional training in the real world. Hieu et al.[23] developed transfer learning with demonstrations to accelerate the training process when an autonomous vehicle entered a new environment. Bharadhwaj et al.[26] performed adversarial domain adaptation to learn an encoder that generated the same distribution of latent states over real images as the simulated images and then fine-tuned the learnt policy in real environments. Luo et al.[27] proposed meta-learning on the latent space algorithm for rapid adaptation of DRL-based visual navigation to new observations. The studies above mainly focused on the design of DRL algorithms to increase the success rate when transferred from a training scene to a test scene, but did not look into the quantification or the measure of the DRL transferability.

Chebotar et al.[19], on the other hand, quantified the DRL transferability by introducing a discrepancy function evaluated by computing the difference between the simulated and real-world robotic arm trajectories using weighted $l_{1}$ and $l_{2}$ norms. This discrepancy measure calculates the difference given the motion state inputs, however, cannot be directly applied to calculate the difference of two navigation scenes.

This paper focuses on finding an appropriate metric to quantify or measure the transferability of DRL navigation algorithms but without looking into how to re-train or update the algorithms in the new environment. This paper proposes a new transferability metric for robot navigation using the improved image template matching algorithm to measure the similarity between the training scenes and the real test scenes. Two straightforward and easy-to-implement performance indicators are designed including the global and local scene similarities. The global scene similarity is used to evaluate the transferability of different DRL navigation algorithms. The local scene similarity indicates the safety of the local map-based DRL navigation algorithm when deployed in a new environment without a global map.

III Problem Description

This paper applies the DRL method to design a navigation algorithm for a mobile robot without the prior knowledge of a global map and provides a measure of the transferability of the designed algorithm when applied to new environments. We assume that at time instant $t$ , the robot agent takes the sensor measurements and its relative position to the destination as the observation $\bm{o}_{t}$ . Apply the DRL algorithm to learn a policy $\pi$ with weights $\theta$ in a training scene $S_{\rm train}$ . When applied to a test scene $S_{\rm test}$ , the policy is expected to map the observation to a suitable agent action $\bm{a}_{t}$ that steers the agent towards the destination without collision, i.e.,

\bm{a}_{t}=\pi_{\theta|S_{\rm train}}(\bm{o}_{t}\mid S_{\rm test})

(1)

The gap between the training and test scenes is typically the major cause of failure of a learnt policy when deployed in new environments. To quantify the transferability of the DRL policy, scene similarity between the test scene and the training scene is measured. Given the global map of the test scene ${M}_{{\rm test}}$ and the global map of the training scene ${M}_{{\rm train}}$ , a scoring function $f_{1}()$ calculates the global scene similarity score $SS_{\rm global}$ through

SS_{\rm global}=f_{1}({M}_{{\rm train}},{M}_{{\rm test}})

(2)

This global metric is used to evaluate the performance of different DRL navigation algorithms when deployed in different test scenes with different $SS_{\rm global}$ .

When the global map of the test scene is not available, which is the typical case in practice, the agent only obtains local information through sensor measurements. A scoring function $f_{2}()$ calculates the local scene similarity score $SS_{\rm local}$ from the collected observations $\bm{o}_{\rm test}$ in the test scene and the global map of the training scene ${M}_{{\rm train}}$ , i.e.,

SS_{\rm local}=f_{2}({M}_{{\rm train}},\bm{o}_{\rm test})

(3)

This local metric is expected to predict the success rate of the proposed local map-based DRL navigation algorithm when deployed in new environments.

To improve the transferability performance of DRL navigation, the problem to be addressed is summarized as to design the global and local scene similarities (Eqs. (2) –(3)) to quantify the navigation transferability as well as to incorporate the DRL policy with local map-based observations (Eq. (1)) for enhanced robustness with respect to different test scenes.

IV Local Map-Based DRL Navigation

IV-A Deep Reinforcement Learning

Motion planning is considered as a Markov decision process (MDP). The MDP is represented by the tuple ${(\mathcal{S,A,T,}r,\gamma)}$ , where ${\mathcal{S}=\{s\}}$ is the state space, ${\mathcal{A}=\{a\}}$ is the action space, $\mathcal{T}$ is the state transition function indicating the probability distribution of the next state when performing action $a$ in the current state $s$ , $r$ is the reward function, and ${\gamma\in[0,1]}$ is the discount factor. At each discrete time step, the agent selects and takes an action based on the current state. The agent then receives a reward and transitions to the next state. The objective of the agent is to learn an optimal policy ${\pi^{*}(a\mid s)}$ that maximizes the expected return.

In this paper, we use DQN[28] and TD3 [29] as example DRL algorithms to learn navigation with discrete and continuous actions, respectively. Both algorithms are sample efficient with the replay buffer to collect experience during training and have been proved with high navigation performance [1, 15]. The implementation of DRL navigation is specified as follows. The observation or the input of DRL is the local map $\bm{M}$ , which will be discussed in detail in Section IV-B.

Action space: A differential drive robot is selected as the testing platform in this study. The action space of the differential drive robot includes a translational velocity $v_{t}$ and a rotational velocity $\omega_{t}$ , i.e. ${\{\bm{a_{t}}:=[v_{t},\omega_{t}]\}}$ . Specifically, for DQN, three discrete actions are considered, i.e., ${v=0.15\text{ m/s}}$ , ${\omega\in\{-1.0,0.0,1.0\}\text{ rad/s}}$ . For TD3, continuous actions are considered, i.e., ${v\in[0.,0.25]\text{ m/s}}$ , ${\omega\in[-1.0,1.0]\text{ rad/s}}$ .

Reward design: The objective of the agent is to use the oncoming local map observations to reach the destination point as fast as possible without colliding with obstacles. The reward at time $t$ , denoted by $r^{t}$ , is designed as

r^{t}=r_{\rm step}^{t}+\left\{\begin{array}[]{rcl}r_{\rm succ}^{t}&&\text{if }d_{t}<0.25\rm m\\ r_{\rm nav}^{t}&&\text{if }d_{t}>d_{t-1}\\ r_{\rm coll}^{t}&&\text{if collision}\end{array}\right.

(4)

Here, $d_{t}$ is the distance from the agent to the goal at time $t$ . $r_{\rm step}^{t}$ gives a negative reward at each time step so that the agent reaches the goal destination as quickly as possible. $r_{\rm succ}^{t}$ generates a positive reward when the agent enters a neighborhood of the goal to encourage success. $r_{\rm nav}^{t}$ is used to encourage approaching the goal and the agent receives a negative reward when it moves away from the goal. $r_{\rm coll}^{t}$ is used to avoid the collision of the agent with obstacles. It generates a negative reward when a collision occurs. In this paper, ${r_{\rm step}^{t}=0}$ , ${r_{\rm succ}^{t}=1}$ , ${r_{\rm nav}^{t}=-0.001}$ and ${r_{\rm coll}^{t}=-1}$ is set for DQN. ${r_{\rm step}^{t}=-0.05}$ , ${r_{\rm succ}^{t}=5}$ , ${r_{\rm nav}^{t}=-0.1}$ and ${r_{\rm coll}^{t}=-5}$ is set for TD3. $r_{\mathrm{step}}^{t}$ for DQN is set to 0 for the reason that a negative $r_{\mathrm{step}}^{t}$ increases the training time.

Termination conditions: During training, an episode will be terminated if one of the following conditions is satisfied. 1) The agent reaches the goal. 2) The agent collides with an obstacle. 3) The number of the time steps exceeds the maximum time step $\rm T_{max}$ . For DQN, $\rm T_{max}$ is set to 10,000. For TD3, $\rm T_{max}$ is set to 2,000.

IV-B Using Local Maps as Observations

The 2D LiDAR data, the agent position and the goal position are all spatial information. They are all fused onto the local map as the observation or the input of the DRL algorithm. Using the local map as the observation allows for learning a more robust DRL navigation policy and interchanging LiDAR sensors with different angular resolutions and FoVs at the deployment time.

The local map is denoted by $\bm{M}=\left[M_{\rm o},M_{\rm p},M_{\rm g}\right]\in\mathbb{R}^{3\times H\times W}$ . Here, $H$ and $W$ represent the height and width of the map, respectively. $M_{\rm o}$ is the obstacle map converted from 2D LiDAR data. $M_{\rm p}$ is the position map, indicating the agent’s current position. $M_{\rm g}$ is the goal map, indicating the distance and the orientational angle of the destination point from the agent. $M_{\rm o}$ , $M_{\rm p}$ and $M_{\rm g}$ are merged together to fuse all the data and generate the local map.

Refer to caption — Figure 1: The schematic of the generation of the local map at time $t$ .

The process of generating the local map is illustrated in Fig. 1. The obstacle map $M_{\rm o}$ is converted from the LiDAR data. The LiDAR data $\{(\rho_{t},\theta_{t})\}$ received at time $t$ typically consists of the distance $\rho_{t}$ and the orientation angle $\theta_{t}$ of an obstacle with respect to the agent expressed in the polar coordinates. The LiDAR polar coordinates $(\rho_{\rm A},\theta_{\rm A})$ is convertible to the image coordinates $(x_{\rm I},y_{\rm I})$ by

\left[\begin{array}[]{rcl}x_{\rm I}\\ y_{\rm I}\end{array}\right]=\left[\begin{array}[]{rcl}\rho_{\rm A}\cos{(\theta_{\rm A})}/R+W/2\\ -\rho_{\rm A}\sin{(\theta_{\rm A})}/R+H/2\end{array}\right]

(5)

Here, $R$ (meter/pixel) is the resolution of the map, $H$ and $W$ represent the height and width of the map, respectively, and the subscripts I and A indicate the image coordinate and the agent coordinate, respectively. The obstacle map is generated by a dilation operation [30] ignoring the LiDAR data outside. The dilation operation increases the thickness of LiDAR data, repairs small breaks in the map, and enlarges the features of small obstacles. Compared to the classic input of LiDAR data as a fixed-size vector, the obstacle map is independent of the sensor resolution and FoV, thus allowing the LiDAR data with arbitrary measurement sizes. The local map observation is adaptable to other sensor modules for sensor fusion.

The position map $M_{\rm p}$ encodes the agent’s shape and current position. In this paper, a circle with a radius of the same pixel as the agent is used to represent the agent’s shape. The circle in $M_{\rm p}$ is set at the center of the local map at any time step helping the network recognize the danger of collisions with nearby obstacles and the navigational orientation towards the goal destination.

The goal map $M_{\rm g}$ uses the circle to represent the locations of the destination. There are cases where the target point is outside the boundary of the map $M_{\rm g}$ . To ensure $M_{\rm g}$ is non-empty, the distance of the goal with respect to the agent at time $t$ , denoted by $d_{t}$ , is clipped to $d_{t}^{\prime}$ by

d_{t}^{\prime}=\min\left(d_{t},\frac{RW}{2|\cos\alpha_{t}|},\frac{RH}{2|\sin\alpha_{t}|}\right)

(6)

Here, $R$ (meter/pixel) is the resolution of the map, $H$ and $W$ represent the height and width of the map, respectively, and $\alpha_{t}$ is the orientation angle of the goal with respect to the agent at time $t$ . The coordinates of the center of the circle $(x_{\rm Ig},y_{\rm Ig})$ in $M_{\rm g}$ are obtained by converting $d_{t}^{\prime}$ and $\alpha_{t}$ following the same coordinate transformation described in Eq. (5).

The local map fuses all information to take advantage of the powerful feature extraction capability of CNNs. The network architectures of DQN and the actor-critic network of TD3 are shown in Fig. 2. The three convolutional layers of DQN and the TD3 actor network efficiently extract the features of the input local map $\bm{M}$ . The feature map is then processed by the global max pooling layer to extract global features. The global features are flattened and fed into the fully connected layers. The output of DQN is the $Q$ value of the three discrete actions. The output of the TD3 actor network is the predicted continuous actions $[v,\omega]$ using the activation function of Tanh.

The critic network of TD3 requires additional input, i.e., the agent’s action. The input of the critic is either formulated as a numerical vector that concatenates the agent’s action and the extracted feature vector of CNNs output $[v,\omega]$ or a fused local map that combines the local map with the action map designed as ${\bm{M}_{a}=[M_{\rm v},M_{\rm\omega}]}$ by replacing non-zero values of $M_{\rm p}$ with action values. We find that fusing $\bm{M}_{a}$ with the local map $\bm{M}$ as the input of the TD3 critic network achieves better navigation performance. The output of the critic network is the state-action estimate value.

V Scene Similarity Metric Based On Improved Image Template Matching

V-A Image Template Matching Algorithm

The image template matching is an image processing method to search and find the maximum match between a template image and another larger image. Sliding the template image over the image of interest, the algorithm calculates the similarity between the sub-images and the template image and outputs the position with the maximum similarity as the matching result. The similarity is calculated as the normalized correlation coefficient $R_{I,T}(x,y)$ by

\begin{split}&R_{I,T}(x,y)=\\ &\frac{\sum\limits_{x_{T}}\sum\limits_{y_{T}}(T^{\prime}(x_{T},y_{T})\cdot I^{\prime}(x+x_{T},y+y_{T}))}{\sqrt{\sum\limits_{x_{T}}\sum\limits_{y_{T}}T^{\prime}(x_{T},y_{T})^{2}\cdot\sum\limits_{x_{T}}\sum\limits_{y_{T}}I^{\prime}(x+x_{T},y+y_{T})^{2}}}\end{split}

(7)

where

T^{\prime}(x_{T},y_{T})=T(x_{T},y_{T})-\frac{\sum\limits_{i=0}^{w-1}\sum\limits_{j=0}^{h-1}T(i,j)}{w\cdot h}

(8)

\begin{split}I^{\prime}(x&+x_{T},y+y_{T})=\\ &I(x+x_{T},y+y_{T})-\frac{\sum\limits_{i=0}^{w-1}\sum\limits_{j=0}^{h-1}I(x+i,y+j)}{w\cdot h}\end{split}

(9)

Here, $I$ and $T$ represent the image to be matched and the template image, respectively. ${R_{I,T}(x,y)}$ is the similarity of the template $T$ to the sub-image of the image $I$ at ${(x,y)}$ . ${T(x_{T},y_{T})}$ represents the pixel value at ${(x_{T},y_{T})}$ of the image $T$ . $w$ and $h$ represent the width and height of the image $T$ , respectively. The best match score denoted by ${R^{*}(I,T)}$ is defined by

R^{*}(I,T)=\max_{x,y}R_{I,T}(x,y)

(10)

The template matching above is suitable for the case when the template and the target object in the image have the same orientation. Matching often fails when there is an orientational difference between the target and the template. Therefore, the template image needs to be rotated to generate a series of templates with different rotation angles $\phi$ . The best match score is then updated by

R^{*}\left(I,T\right)=\max_{x,y,\phi}R_{I,T_{\phi}}\left(x,y\right)

(11)

V-B Global Scene Similarity

Currently, most DRL navigation algorithms achieve good navigation performance when the test scene is sufficiently similar to the training scene. However, some abnormal behaviors of DRL agents are often observed when tested in dissimilar scenes [10]. We propose a global scene similarity performance indicator based on the image template matching algorithm described in Section V-A. This indicator quantifies the similarity between the training scene and the test scene to evaluate the transferability of DRL navigation algorithms in both similar and dissimilar scenes.

The calculation of the global scene similarity metric is illustrated in Fig. 3. The image to be matched is the global map of the training scene denoted by ${M}_{{\rm train}}$ . The template is the sub-images $T_{ij}$ of the global map of the test scene ${M}_{{\rm test}}$ generated through a sliding window. The global scene similarity score $SS_{\rm global}$ between ${M}_{{\rm train}}$ and ${M}_{{\rm test}}$ is calculated as the averaged best match scores by

SS_{\rm global}=\frac{1}{N_{1}N_{2}}\sum_{i}^{N_{1}}\sum_{j}^{N_{2}}R^{*}(M_{\rm train},T_{ij})

(12)

where

T_{\rm ij}=\begin{bmatrix}M_{\rm test}(is_{x},js_{y})&\cdots&M_{\rm test}(is_{x}+w-1,\\ &&js_{y})\\ \vdots&\ddots&\vdots\\ M_{\rm test}(is_{x},&\cdots&M_{\rm test}(is_{x}+w-1,\\ js_{y}+h-1)&&js_{y}+h-1)\\ \end{bmatrix}

(13)

Here, $R^{*}(M_{\rm train},T_{ij})$ is the best match score between ${M}_{{\rm train}}$ and $T_{ij}$ using Eq. (11) with the rotation angle ${\phi\in\{10^{\circ},20^{\circ},...,360^{\circ}\}}$ . $T_{ij}$ is the sub image of $M_{\rm test}$ generated through a sliding window with strides $s_{x}$ and $s_{y}$ to reduce computation. ${M_{\rm test}(is_{x},js_{y})}$ represents the pixel value at ${(is_{x},js_{y})}$ of ${M_{\rm test}}$ . $w$ and $h$ represent the width and height of the image $T_{ij}$ , respectively.

V-C Local Scene Similarity

The local scene similarity is proposed to quantify the transferability of the local map-based navigation algorithm when deployed in a new environment without the global map. Without the global map of the test scene, only the local obstacle maps collected along the agent’s trajectory are available. The metric calculates the similarity between the actual observations in the test scene and the observations in the training scene. Since the global map of the training scene is available and easily obtained, the proposed local scene similarity utilizes the global obstacle map for the convenience of calculation instead of the local map.

The calculation of the local scene similarity metric is based on the image template matching method described in Section V-A. The image to be matched is the global obstacle map of the training scene denoted by $M_{\rm o}^{\text{global}}$ . The template is the local obstacle maps converted from the LiDAR measurements in the test scene. The image template matching method finds the averaged best match scores between the local obstacle maps collected in the test scene and the global obstacle map of the training scene.

We find that the direct use of template matching causes the following two problems. First, as shown in Fig. 4(a), the center of the best match sub-image sometimes lies within the obstacle region in the training scene, and the agent cannot reach such a location during the training process. Second, Fig. 4(b) shows all the trajectories of the agent stored in the replay buffer during training with the darkness in red indicating the total number of arrivals. We observe a significant variance among different locations regarding how many times an observation is used in the training. Therefore, it is unfair that different positions in the global obstacle map carry the same level of importance.

To solve the aforementioned problems, the template matching method is updated by introducing a weight matrix in the calculation of the match score. The weight is calculated based on how many times the agent reaches a specific position on the global map during training, which indicates the importance of the match score at the corresponding position. The weight at position $(x,y)$ in $M_{\rm o}^{\text{global}}$ is calculated by

\text{Wt}(x,y)=\text{clip}\left(\frac{\text{clip}\left(\#(\text{arrivals})(x,y),0,N_{\text{max}}\right)}{N_{\text{max}}},0.5,1\right)

(14)

where ${\#(\text{arrivals})(x,y)}$ is the total number of times the agent reaches the position $(x,y)$ in $M_{\rm o}^{\text{global}}$ during the training. To normalize the weight and prevent it from being too small, two layers of clip functions are used. In the first layer, $N_{\text{max}}$ is the total number of the agent reaching a position that we consider as the threshold to give the highest importance. In the second layer, [0.5,1] is used to define the range of the weight values. The introduction of the weight matrix helps the metric to distinguish specific training processes.

The scene similarity score $SS$ is formulated as follows. First, the agent collects $N$ local obstacle maps ${\{M_{{\rm o}i},i\in[0,N)\}}$ from the test scene. Then, the averaged best match score between ${\{M_{{\rm o}i},i\in[0,N)\}}$ and the global obstacle map ${M_{\rm o}^{\text{global}}}$ of the training scene is calculated by

SS=\frac{1}{N}\sum_{i}^{N}\max_{x,y,\phi}\text{Wt}\left(x+\frac{W}{2},y+\frac{H}{2}\right)R_{M_{\rm o}^{\text{global}},M_{{\rm o}i\phi}}\left(x,y\right)

(15)

where $SS$ is the scene similarity score, $W$ and $H$ are the width and height of the local obstacle map, respectively, and the rotation angle ${\phi\in\{1^{\circ},2^{\circ},...,360^{\circ}\}}$ .

Considering that the similarity score of the training scene itself $SS_{\rm train}$ varies, to debias the influence of the training scene, the local scene similarity $SS_{\rm local}$ is therefore defined as the relative scene similarity score or the difference between the similarity score calculated in the test scene $SS_{\rm test}$ and the similarity score calculated in the training scene $SS_{\rm train}$ , i.e.,

SS_{\rm local}=SS_{\rm test}-SS_{\rm train}

(16)

VI Case Study on Local Scene Similarity Metric

In this section, the effectiveness of the local scene similarity metric in quantifying the transferability of the local map-based DQN navigation algorithm is shown through both simulation and real-world experiments in diverse test scenes when no global map of the test scenes is available.

VI-A Case Study Implementation Details

The configuration of the local map and the hyper parameters of DQN and TD3 are listed in Table I. DQN and TD3 are implemented in PyTorch and trained using the Adam optimizer. The generation of the local maps and the calculation of the scene similarity metric use the OpenCV library.

The training and the testing in simulation are carried out in the PyBullet simulator on a computer with an i7-12700F CPU and an NVIDIA GeForce RTX 3070 GPU. The agent is a differential drive robot equipped with a laser range finder that has a max range of 3.5m, a $360^{\circ}$ FoV and 360 measurement readings. A total of 12 scenes with a 10m $\times$ 10m two-dimensional space are constructed in simulation as shown in Fig. 5. The training of DQN and TD3 takes approximately 18 hours and 3 hours for each agent in scene (a), respectively.

TABLE I: Hyper Parameters

	Parameter	Value
local map	size ( $H,W$ )	(60, 60)
	resolution $R$	0.05 m/pixel
	kernel size of dilation	5
DQN	learning rate	$10^{-4}$
	discount factor $\gamma$	0.99
	initial exploration	1.0
	final exploration	0.05
	batch size	64
	replay buffer size	$10^{6}$
	number of training episodes	4500
TD3	actor, critic learning rate	$5\times 10^{-5}$ , $10^{-3}$
	discount factor $\gamma$	0.99
	initial exploration noise	1.0
	final exploration noise	0.05
	policy exploration noise	0.2
	policy update delay	2
	soft update factor $\tau$	0.005
	batch size	64
	replay buffer size	$2\times 10^{5}$
	number of training episodes	2500

VI-B Simulation Results

The simulation environment is set up as described in Section VI-A. 17 different agents are first trained using DQN in the same training scene (Fig. 5(a)) with the same hyper parameters listed in Table I but different DQN initial weights and exploration processes. Following that, the agents are deployed in 12 test scenes that include the training scene itself as shown in Fig. 5. The comparison results between the calculated local scene similarity scores $SS_{\rm local}$ and the navigation success rates are analyzed and presented to support the validation of the effectiveness of the proposed scene similarity metric in quantifying the transferability of the navigation algorithm.

Simulation results show that the local map-based DQN navigation algorithm successfully steers the robot to the destination in the 12 test scenes with various success rates. In each scene, 100 destination points are randomly generated to calculate the success rate.

Figure 6(a) shows the scatter plot between the navigation success rate and the local scene similarity metric over all 17 agents represented by 17 different colors. We observe that with a decreased local scene similarity score $SS_{\rm local}$ , the navigation success rate in general has a wider distribution or a larger variation along with a decreased mean value. Figure 6(c) classifies all the test data in Fig. 6(a) according to $SS_{\rm local}$ into nine equal intervals ranging from 0.04 to -0.26, showing the mean and standard deviation of the success rate for each interval. We observe a similar trend that an increased uncertainty in the success rate is correlated with a decreased scene similarity.

Figure 6(b) presents the simulation results according to the scene, using 12 different colors to represent the 12 test scenes. Figure 6(d) shows the mean and the standard deviation of the success rate and $SS_{\rm local}$ for each scene. The scene alphabet labels are consistent with that in Fig. 5. Generally speaking, as the local scene similarity score decreases, the mean of the success rate shows a decreasing trend, and the variance shows an increasing trend.

It is noteworthy that scene j has a lower similarity than scene e, but the navigation success rate is approximately equal. We conjecture that the navigation success rate is not only related to the similarity between the training and test scenes but also to the complexity of the scene itself. Although scene e has a high similarity, it contains narrower passages, and thus the agent is more likely to collide during navigation. In contrast, scene j is less similar to the training scene, but with more open space, thus making the navigation task easier.

VI-C Real-World Experiment

A TurtleBot4 wheeled robot with a maximum linear velocity of 0.31 m/s and a maximum angular velocity of 1.90 rad/s is used to experimentally test the proposed navigation algorithm and the local scene similarity metric. It is equipped with a RPLIDAR A1M8 LiDAR featuring a maximum measuring range of 12 m, a FoV of $360^{\circ}$ and an angular resolution $\leq 1^{\circ}$ . The navigation algorithm runs on a laptop that communicates with the robot in real time. The laptop has an i7-12700H CPU and an NVIDIA GeForce RTX 3060 GPU.

A total of 14 test scenes are constructed in the lab. The scenes and the sampled navigation trajectories are shown in Fig. 7. The order of the scenes is sorted by the local similarity score, arranged from the highest to the lowest. Without loss of generality, the trained agent #10 from the simulation is selected and used in the experiment. In each scene, 30 goal destinations are generated to perform the proposed local map-based DQN navigation algorithm.

Figure 8 shows the scatter plot of the experimental results between the relative scene similarity score and the navigation success rate. The proposed local map-based DQN navigation algorithm successfully steers the robot toward its destinations in the real world. The success rate varies with the local similarity score following the same pattern as observed in the simulation. As the local scene similarity score decreases, the success rate has a smaller mean value and a larger standard deviation, indicating increased uncertainties in the model.

The simulation and experimental results confirm that the local scene similarity metric and the navigation success rate of the proposed local map-based navigation algorithm are strongly correlated. The proposed scene similarity metric has a great potential in providing guidance for the design of training scenes and the transfer learning algorithm so that an improved navigation performance (e.g., increased safety) can be achieved.

VII Case Study on Global Scene Similarity and Local Map Observation

In this section, the local map-based DRL navigation is tested and compared to other DRL algorithms with classic and state-of-the-art observations of vector-based LiDAR measurements in simulation using the global scene similarity $SS_{\rm global}$ . In addition, ablation experiments are conducted to show the benefit of fusing all spatial data onto the local map. The configurations of the LiDAR sensor are changed to show that the local map is independent of the FoV and the angular resolution of the sensor at the deployment time. Finally, a comparison experiment with traditional algorithms is conducted.

VII-A Comparisons With Other Observations

To showcase the transferability and robustness of using the local map as the observation, comparison studies are conducted with selected classic and state-of-the-art observations. The global scene similarity $SS_{\rm global}$ is adopted as the evaluation metric.

The selected observations and corresponding network architectures are shown in Fig. 9. The first observation shown in Fig. 9(a) is the sparse LiDAR data proposed in [5]. The network architecture similar to [5] is used to train. The spare LiDAR data is a normalized 10-dimensional vector uniformly sampled from the raw laser measurements from angle $-90^{\circ}$ to $90^{\circ}$ . The vector is concatenated with the goal information ${\bm{g}=[d,\alpha]}$ as the input, where $d$ and $\alpha$ are the distance and the orientation angle of the goal with respect to the agent, respectively. The second observation shown in Fig. 9(b) is the LiDAR data after the min pooling operation proposed in [6]. The similar architecture proposed in [6] is used to train. The operation compresses the raw 360 laser measurements into a 36-dimensional vector. The third observation is the LiDAR data after the min pooling and the IPAPRec operation proposed in [10], which is a powerful input preprocessing approach with adaptively parametric reciprocal functions. The initial value of IPAPRec trainable parameter $\beta_{i}$ is set to -1 in this paper. The last observation shown in Fig. 9(c) is the most commonly used dense LiDAR data, which uses the raw laser range measurements as a 360-dimensional vector. The adopted network architecture is similar to [3] which utilizes 1D convolutional layers to extract features of laser measurements.

For each observation, 5 independent agents are trained using DQN and TD3 with the hyper parameters listed in Table I and the reward function in Eq. (4). The trained agents are deployed in 8 test scenes in Fig. 5. Those test scenes are selected for clear presentation of the results considering that their global scene similarity scores $SS_{\rm global}$ with respect to the training scene differ considerably apart from each other. In each test scene, the same 100 destination points are generated for each agent. The average success rate and the average navigation time of each DRL observation with respect to $SS_{\rm global}$ are shown in Fig 10.

As shown in Fig. 10(a) and Fig. 10(b), the success rates of all DRL navigation algorithms using different observations show a decreasing trend as $SS_{\rm global}$ decreases. However, compared to other observations, using the local map achieves significantly better navigation performance with much higher navigation success rates in test scenes with lower $SS_{\rm global}$ . Although there is no small obstacle in the training scene (scene a), the local map-based DRL navigation still avoids colliding with small obstacles when deployed in test scenes h, k and l. The results confirm the enhanced robustness and transferability of the local map observation design for DRL navigation. The reason for the higher transferability of the local map is that it converts LiDAR data and the agent’s position to the map, which helps the agent recognize the shapes of the surrounding obstacles as well as the colliding situation by recognizing the overlap between the obstacles and current position on the map. In comparison, Minpool, for example, only takes the minimum LiDAR measurement within a certain angle interval, not able to reflect the shape of obstacles in the angle interval and resulting in collisions when the shapes of the obstacles are dissimilar or more complex than the training scene. In Fig. 10(c) and Fig. 10(d), we observe that the navigation time of the local map in scenes (h) and (l) is slightly longer than other observations. We conjecture the reason is that the local map-based navigation takes safer and more conservative actions to avoid collisions in scenes with lower similarity scores.

VII-B Ablation Study

While the goal and velocity information are often modeled as a vector input, the local map fused all information onto a map, which helps the CNNs learn all the spatial information together. The proposed local map $\bm{M},\bm{M}_{a}$ is compared to other inputs that use the goal and/or velocity as a vector and the results are shown in Fig. 11. We use $M$ to denote the map input and $V$ to denote the vector input. All the inputs are trained using the TD3 algorithm in Fig. 2 except $M_{\rm o},\bm{V}_{\rm g}$ being trained using DQN because the TD3 agent fails to learn a good policy when using this kind of input. The four selected inputs are detailed as follows.

1) $\bm{M},\bm{M}_{a}$ denotes the use of the proposed local map $\bm{M}$ . It converts the velocity information to the action map $\bm{M}_{a}$ as the input to the TD3 critic network. 2) $\bm{M},\bm{V}_{a}$ inputs the action as a vector $\bm{V}_{a}$ to the TD3 critic network and concatenates it with the extracted features of the local map $\bm{M}$ after CNNs. 3) $M_{\rm o,p},\bm{V}_{\rm g},\bm{V}_{a}$ adds the robot position onto the obstacle map [1] denoted by $M_{\rm o,p}$ , and inputs goal and velocity information as vectors $\bm{V}_{\rm g}$ and $\bm{V}_{a}$ , respectively. 4) $M_{\rm o},\bm{V}_{\rm g}$ only inputs the obstacle map $M_{\rm o}$ and the goal information as a vector $\bm{V}_{\rm g}$ .

As shown in Fig. 11, the proposed local map input $\bm{M},\bm{M}_{a}$ achieves the best overall performance with a shorter average navigation time and higher average navigation success rates in almost all test scenes. $M_{\rm o},\bm{V}_{\rm g}$ performs worst since it only inputs the obstacle map without the agent’s position, thus very challenging for the network to learn the colliding situations with obstacles.

VII-C Changing Sensor FoV and Angular Resolution

Converting the LiDAR data to the local map allows an arbitrary number of LiDAR measurements. The FoV and the angular resolution of the LiDAR sensor are changed at the deployment time to test if the local map-based DRL navigation algorithm still performs robustly well with respect to different sensor setups. The average navigation success rates and average navigation time using different sensor setups tested in scene (a) of Fig. 5 are listed in Table II. The local map-based DRL navigation algorithm maintains a high success rate in almost all presented LiDAR setups. Only when the number of measurements is significantly small (# of Meas.=30), the navigation success rate of the DQN-based algorithm decreases slightly. The results show the robustness and transferability of the local map-based navigation algorithm that allows the interchange of the sensor’s FoV and angular resolution at the deployment time.

TABLE II: The mean and standard deviation of navigation success rate and navigation time in scene (a) with respect to LiDAR sensor configurations, i.e., the FoV and the number of laser measurements (# of Meas.).

Variant		DQN		TD3
FoV	# of Meas.	SR (%)	Time (s)	SR (%)	Time (s)
360	360	99.8 $\pm$ 0.4	46.4 $\pm$ 1.8	100.0 $\pm$ 0.0	26.7 $\pm$ 0.8
360	260	100.0 $\pm$ 0.0	46.8 $\pm$ 2.3	100.0 $\pm$ 0.0	26.7 $\pm$ 0.8
360	160	99.6 $\pm$ 0.8	46.9 $\pm$ 1.8	100.0 $\pm$ 0.0	26.7 $\pm$ 0.8
360	60	97.0 $\pm$ 2.7	46.6 $\pm$ 2.6	100.0 $\pm$ 0.0	26.7 $\pm$ 0.9
360	30	95.4 $\pm$ 1.9	45.8 $\pm$ 2.1	99.4 $\pm$ 1.2	27.1 $\pm$ 1.0
270	270	99.8 $\pm$ 0.4	46.4 $\pm$ 1.8	100.0 $\pm$ 0.0	26.7 $\pm$ 0.8
180	180	99.8 $\pm$ 0.4	46.5 $\pm$ 1.3	99.2 $\pm$ 1.2	27.0 $\pm$ 0.7

VII-D Comparison With Traditional Local Planners

A new simulation test scene is constructed to compare the local map-based DRL navigation with the traditional local planners APF[12] and DWA[13]. We use the same agents trained in Section VII-A and the results are shown in Fig.12. APF and DWA fail to reach destinations due to the local optimum, while the local map reaches a success rate of over 70% in the test scene with a low $SS_{\rm global}$ (0.374). When it is trained in a scene similar to the test scene, it is able to achieve better navigation performance. The results confirm the generalization of the local map-based DRL navigation in a new environment.

VIII Discussion

VIII-A Comparision With Other Metrics

To demonstrate the advantage of $SS_{\rm global}$ and $SS_{\rm local}$ , the proposed metrics are compared to the $l_{1}$ and $l_{2}$ norms used in the discrepancy function of[19]. The $l_{1}$ and $l_{2}$ norms require the same size of the global maps of scenes to be flattened into vectors with the same dimension, while the proposed transferability metric has no such limitation. We utilized the same data in Fig. 6 and the results of the relationship between navigation success rates and different metrics are presented in Fig. 13. There exists no clear correlation between the navigation success rate and the $l_{1}$ and $l_{2}$ norms (Fig. 13(a) and (b)). In comparison, as $SS_{\rm global}$ and $SS_{\rm local}$ decrease, the success rate decreases with a larger deviation (Fig. 13(c) and Fig. 6(d)). The results demonstrate the effectiveness of the proposed transferability metric.

VIII-B Generalization of Local Scene Similarity

To test the generalization of the local scene similarity, 3 independent agents in simulation using DQN for each observation including Dense, Minpool and IPAPRec with the same setting in Section VII-A are trained, since they use the same amount of LiDAR data as the local map. The success rate with respect to $SS_{\rm local}$ is shown in Fig. 14. With decreasing $SS_{\rm local}$ , the success rates of three observations show a decreasing trend with a wider distribution. Although the local scene similarity is designed to quantify the transferability of the local map-based navigation algorithm, the results show a great potential in applying $SS_{\rm local}$ to other observations.

IX Conclusion

This paper proposed a novel scene similarity metric using the improved image template matching algorithm for quantifying the transferability of the DRL navigation algorithm exampled by the global and local performance measures. In addition, a DRL algorithm using the local map as the observation was designed for the navigation of mobile robots. A case study with a wheeled robot was designed and extensive experiments were conducted in a total of 26 simulated and real-world test scenes. The experimental results confirmed the robustness and transferability of the proposed local map-based navigation algorithm and showed the strong correlation between the designed scene similarity metric and the success rate of the DRL navigation algorithm when applied to new environments. To implement the global scene similarity in the real world, a mapping module, such as gmapping[31], is required to generate the global map of the test scene.

In future work, we plan to refine the transferability metric, following the findings in the case study to comprehensively consider the scene similarity and the scene complexity. In addition, we plan to apply the proposed metric to other DRL-based navigation algorithms.

References

[1] Y. Chen, G. Chen, L. Pan, J. Ma, Y. Zhang, Y. Zhang, and J. Ji, “DRQN-based 3D obstacle avoidance with a limited field of view,” in 2021 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 8137–8143.
[2] D. Hoeller, L. Wellhausen, F. Farshidian, and M. Hutter, “Learning a state representation and navigation in cluttered and dynamic environments,” IEEE Robot. Automat. Lett., vol. 6, no. 3, pp. 5081–5088, 2021.
[3] F. Leiva and J. Ruiz-del Solar, “Robust RL-based map-less local planning: Using 2d point clouds as observations,” IEEE Robot. Automat. Lett., vol. 5, no. 4, pp. 5787–5794, 2020.
[4] J. Choi, K. Park, M. Kim, and S. Seok, “Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view,” in 2019 Int. Conf. Robot. Automat. (ICRA), pp. 5993–6000.
[5] L. Tai, G. Paolo, and M. Liu, “Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation,” in 2017 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 31–36.
[6] M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto, “Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations,” IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4423–4430, 2018.
[7] A. Wahid, A. Toshev, M. Fiser, and T.-W. E. Lee, “Long range neural navigation policies for the real world,” in 2019 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 82–89.
[8] Y. Jang, J. Baek, and S. Han, “Hindsight intermediate targets for mapless navigation with deep reinforcement learning,” IEEE Trans. Ind. Electron., vol. 69, no. 11, pp. 11 816–11 825, 2022.
[9] K. Lobos-Tsunekawa and T. Harada, “Point cloud based reinforcement learning for sim-to-real and partial observability in visual navigation,” in 2020 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 5871–5878.
[10] W. Zhang, Y. Zhang, N. Liu, K. Ren, and P. Wang, “IPAPRec: A promising tool for learning high-performance mapless navigation skills with deep reinforcement learning,” IEEE/ASME Trans. Mechatron., vol. 27, no. 6, pp. 5451–5461, 2022.
[11] K. Zhu and T. Zhang, “Deep reinforcement learning based mobile robot navigation: A review,” Tsinghua Sci. Technol., vol. 26, no. 5, pp. 674–691, 2021.
[12] C. Warren, “Global path planning using artificial potential fields,” in 1989 Int. Conf. Robot. and Automat., vol. 1, pp. 316–321.
[13] D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,” IEEE Robot. Automat. Mag., vol. 4, no. 1, pp. 23–33, 1997.
[14] Y. Zhu, Z. Wang, C. Chen, and D. Dong, “Rule-based reinforcement learning for efficient robot navigation with space reduction,” IEEE/ASME Trans. Mechatron., vol. 27, no. 2, pp. 846–857, 2022.
[15] H. Jiang, M. A. Esfahani, K. Wu, K. wah Wan, K. kian Heng, H. Wang, and X. Jiang, “iTD3-CLN: Learn to navigate in dynamic scene through deep reinforcement learning,” Neurocomputing, vol. 503, pp. 118–128, 2022.
[16] D. Liu, Z. Lyu, Q. Zou, X. Bian, M. Cong, and Y. Du, “Robotic navigation based on experiences and predictive map inspired by spatial cognition,” IEEE/ASME Trans. Mechatron., vol. 27, no. 6, pp. 4316–4326, 2022.
[17] M. Pfeiffer, G. Paolo, H. Sommer, J. Nieto, R. Siegwart, and C. Cadena, “A data-driven model for interaction-aware pedestrian motion prediction in object cluttered environments,” in 2018 Int. Conf. Robot. Automat. (ICRA), pp. 5921–5928.
[18] S. Yao, G. Chen, Q. Qiu, J. Ma, X. Chen, and J. Ji, “Crowd-aware robot navigation for pedestrians with multiple collision avoidance strategies via map-based deep reinforcement learning,” in 2021 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 8144–8150.
[19] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, “Closing the sim-to-real loop: Adapting simulation randomization with real world experience,” in 2019 Int. Conf. Robot. Automat. (ICRA), pp. 8973–8979.
[20] B. Balaji, S. Mallya, S. Genc, S. Gupta, L. Dirac, V. Khare, G. Roy, T. Sun, Y. Tao, B. Townsend, E. Calleja, S. Muralidhara, and D. Karuppasamy, “Deepracer: Autonomous racing platform for experimentation with sim2real reinforcement learning,” in 2020 IEEE Int. Conf. Robot. Automat. (ICRA), pp. 2746–2754.
[21] Z. Ding, N. F. Lepora, and E. Johns, “Sim-to-real transfer for optical tactile sensing,” in 2020 IEEE Int. Conf. Robot. Automat. (ICRA), pp. 1639–1645.
[22] T. Chaffre, J. Moras, A. Chan-Hon-Tong, and J. Marzat, “Sim-to-real transfer with incremental environment complexity for reinforcement learning of depth-based robot navigation,” 2020, arXiv:abs/2004.14684.
[23] N. Q. Hieu, D. T. Hoang, D. Niyato, P. Wang, D. I. Kim, and C. Yuen, “Transferable deep reinforcement learning framework for autonomous vehicles with joint radar-data communications,” IEEE Trans. Commun., vol. 70, no. 8, pp. 5164–5180, 2022.
[24] J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation for visual control,” IEEE Robot. Automat. Lett., vol. 4, no. 2, pp. 1148–1155, 2019.
[25] Y. Park, S. H. Lee, and I. H. Suh, “Sim-to-real visual grasping via state representation learning based on combining pixel-level and feature-level domain adaptation,” in 2021 IEEE Int. Conf. Robot. Automat. (ICRA), pp. 6300–6307.
[26] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull, “A data-efficient framework for training and sim-to-real transfer of navigation policies,” in 2019 Int. Conf. Robot. Automat. (ICRA), pp. 782–788.
[27] Q. Luo, M. Sorokin, and S. Ha, “A few shot adaptation of visual navigation skills to new observations using meta-learning,” in 2021 IEEE Int. Conf. Robot. Automat. (ICRA), pp. 13 231–13 237.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” 2013, arXiv:abs/1312.5602.
[29] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. 35th Int. Conf. Mach. Learn., J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1587–1596.
[30] R. Szeliski, Computer Vision: Algorithms and Applications. Cham: Springer International Publishing, 2022, ch. Image Processing, pp. 85–151.
[31] G. Grisetti, C. Stachniss, and W. Burgard, “Improved techniques for grid mapping with rao-blackwellized particle filters,” IEEE Trans. Robot., vol. 23, no. 1, pp. 34–46, 2007.