
Feedback-efficient Active Preference Learning for
Socially Aware Robot Navigation

Ruiqi Wang1, Weizheng Wang1,2, and Byung-Cheol Min1. 1SMART Laboratory, Department of Computer and Information Technology, Purdue University, West Lafayette, IN, USA. [wang5357, minb]@purdue.edu. 2College of Mechanical and Electrical Engineering, Beijing University of Chemical Technology (BUCT), Beijing, China. wz.w.robot@gmail.com.
Abstract

Socially aware robot navigation, in which a robot must optimize its trajectory to maintain comfortable and compliant spatial interactions with humans in addition to reaching its goal without collisions, is a fundamental yet challenging task in the context of human-robot interaction. While existing learning-based methods have achieved better performance than the preceding model-based ones, they still have drawbacks: reinforcement learning depends on a handcrafted reward that is unlikely to effectively quantify broad social compliance and can lead to reward exploitation problems, while inverse reinforcement learning suffers from the need for expensive human demonstrations. In this paper, we propose a feedback-efficient active preference learning approach, FAPL, that distills human comfort and expectation into a reward model to guide the robot agent to explore latent aspects of social compliance. We further introduce hybrid experience learning to improve the efficiency of human feedback and samples, and evaluate the benefits of robot behaviors learned with FAPL through extensive simulation experiments and a user study (N=10) employing a physical robot to navigate with human subjects in real-world scenarios. Source code and experiment videos for this work are available at: https://sites.google.com/view/san-fapl.

I Introduction

Advances in artificial intelligence-embedded robotics are increasingly enabling robots to work in environments that necessitate human-robot interaction (HRI). Delivery robots around university campuses, guide robots in shopping malls, elder care robots at nursing homes, and other such applications all require robots to perform socially aware navigation in human-rich environments, wherein the robots must not only consider how to complete navigation tasks successfully but also recognize and follow social etiquette to sustain comfortable spatial interaction with humans [1, 2]. For example, when navigating in a human-filled environment as depicted in Fig. 1, beyond simply reaching the final goal without collisions, the robot must maintain an acceptable distance from other pedestrians and adjust its movements to generate a comfortable interaction experience for humans.

Figure 1: An example of real-world socially aware navigation with a mobile robot in a human-filled environment.

Existing research in the field of socially aware navigation can be divided into two main categories: model-based and learning-based approaches. Model-based methods [3, 4, 5] aim to model the social conventions and dynamics of crowds to serve as additional parameters for traditional multi-agent collision-free robot navigation algorithms. However, it is quite difficult for a single model to capture the precise rules followed by all pedestrians [1], and such methods can also produce oscillatory trajectories [6].

As an alternative, learning-based approaches have achieved better performance [1]. Within this category, there are two main types: reinforcement learning (RL) [6, 7, 8] and inverse reinforcement learning (IRL) [9, 10, 11]. A primary shortcoming of RL methods is that engineering a meticulous handcrafted reward which emulates non-quantifiable social compliance is a non-trivial endeavor. Another problem stemming from handcrafted rewards is reward exploitation, that is, robots learn to achieve high rewards via undesired and unnatural actions that impair human comfort. On the other hand, IRL methods, where a policy or reward is learned from human demonstrations, can avoid reward engineering and exploitation and allow experts to inject human insight and comfort into the robot policy. However, obtaining sufficient and accurate demonstrations is expensive, and careful feature engineering is required to achieve sound performance [12].

Figure 2: Framework of the proposed FAPL method. The FAPL is composed of two parts: 1) Hybrid Experience Learning (Section III-B) and 2) Agent Learning (Section III-C, III-D). The robot agent begins with curious exploration, where it is encouraged to take different actions and reach diverse states through a state-entropy-based primitive reward. The collected exploration experiences are stored in a replay buffer along with the expert experiences from human demonstrations. Next, human teachers express preferences for a pair of robot navigation trajectories in the buffer, based on which the reward model is learned. Then, all samples are updated with a new reward value from each new generation of the reward model. At the end, the robot agent utilizes the updated samples to optimize a policy that gains maximum return from the model distilled from human preferences through off-policy RL.

In this paper, we present a Feedback-efficient Active Preference Learning approach (FAPL) for socially aware robot navigation. The framework and procedure are illustrated in Fig. 2. Our main idea is to efficiently distill a reward model with human intelligence and comfort embedded via active preference learning. The learned reward model then imitates a human teacher in the subsequent RL process to guide the robot agent to explore the latent space of social compliance. To reduce the heavy human workload typically required in active preference learning, we introduce hybrid experience learning, which consists of curious exploration based on [13, 14, 15] and expert demonstration. Diverse exploration experiences provide good state coverage, while expert experiences guarantee positive benchmarks, enabling human teachers to give instructional feedback efficiently and accelerating the learning process.

To evaluate the performance of FAPL, we conducted extensive experiments to compare it with four other state-of-the-art baseline methods [3, 6, 7, 8] and one ablation model in both simulation and real-world scenarios for quantitative measurement and qualitative analysis.

Our main contributions are: 1) To the best of our knowledge, this paper is the first to introduce active preference learning for socially aware navigation; this purely data-driven method can tailor robot behaviors to human expectations and comfort without the massive reward or feature engineering required in previous research; 2) The hybrid experience learning we introduce improves the efficiency of human feedback and samples; and 3) Our experiments show that FAPL leads to more desirable and natural robot navigation behaviors.

II Background

Related Work. In optimizing robot trajectories for human comfort, existing RL-based methods [6, 7, 8] rely on a handcrafted reward function as shown in Eq. 1, which penalizes collisions and uncomfortable distances.

R_{t}=\left\{\begin{array}{ll}-0.25&\text{if collision}\\ -0.1+\frac{d_{t}}{2}&\text{else if distance from human }d_{t}<0.2\\ 1&\text{else if reaching the goal}\\ 0&\text{otherwise}.\end{array}\right. (1)

However, distance is not the only aspect of robot movement that influences human comfort [1, 2]. As such, a handcrafted reward function is unlikely to lead to robot behaviors that satisfy the broad and non-quantifiable human expectations of robots in socially compliant navigation. Handcrafted reward functions are also prone to reward exploitation, where robots learn undesired and unnatural but highly rewarded behaviors that impair human comfort, e.g., frequently turning 180 degrees to avoid collisions or invasive proximity.
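For reference, Eq. (1) is simply a piecewise function of the collision flag, the goal flag, and the robot-human distance; a minimal Python sketch (with hypothetical inputs for the collision and goal conditions) is:

```python
def handcrafted_reward(collision: bool, reached_goal: bool, d_t: float) -> float:
    """Piecewise reward of Eq. (1); d_t is the distance (m) to the closest human."""
    if collision:
        return -0.25
    elif d_t < 0.2:              # uncomfortably close to a human
        return -0.1 + d_t / 2
    elif reached_goal:
        return 1.0
    return 0.0
```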

Another mainstream line of work concerns IRL-based methods [9, 10, 11], which aim to learn a policy or reward function from human demonstrations. Such human-in-the-loop learning approaches introduce expert intelligence to guide the robot agent, avoiding the reward engineering and exploitation characteristic of RL and introducing naturalness to robot trajectories. Nevertheless, since expert demonstrations constitute the only source of samples, it is very expensive to obtain enough accurate samples to cover all latent aspects of social compliance. For instance, learning from demonstrations that lack negative data, e.g., collisions, may result in a policy that only values comfort without safety [12]. Moreover, extensive feature engineering is required to achieve reasonable performance [9].

Active preference or feedback learning [16, 17], which relies on human feedback rather than demonstrations to provide corrective and adaptable instructions to guide the learning process, can serve as a good alternative for addressing the above-described challenges. Such interactive reinforcement learning can tailor robot behaviors to human preference naturally, without reward engineering and exploitation. In addition, it does not require and therefore is not limited by expensive human demonstrations. However, achieving good performance with active preference learning does require sufficiently frequent human feedback. Thus, it is essential to improve feedback efficiency so as to reduce the requisite human workload.

Our approach, feedback-efficient active preference learning or FAPL, can encode human comfort and expectation into a reward model and then into robot policy. To improve feedback efficiency, we introduce hybrid experience learning, which enables a human teacher to provide more critical feedback in the process of active reward learning. In contrast to existing RL methods, FAPL distills the reward model from human preferences rather than having it handcrafted, covers social compliance more intuitively and comprehensively, and avoids reward exploitation. Also, distinct from IRL methods, FAPL introduces expert intelligence without suffering from expensive, noisy, and scarce human demonstrations, nor from extensive feature engineering.

Problem Formulation. We follow the formulation in previous work [6, 7, 8]. The task of navigating with humans through a spatial HRI environment is formulated as a partially observable Markov decision process (POMDP) in RL. Consider the movement space of all agents as a two-dimensional Euclidean space. Let h^{n}_{t} denote the state of the n^{th} human observable by the robot at time t. Each h^{n}_{t} consists of the position (p^{n}_{x},p^{n}_{y}) and heading angle \beta_{h} of the n^{th} human agent. Unlike previous research, the human speed and radius are treated as unobservable for better sim-to-real transfer, since they are expensive to perceive precisely. Let w_{t} denote the robot's state, which includes its velocity (v_{x},v_{y}), maximum and minimum speed (v_{max},v_{min}), position (p_{x},p_{y}), radius \rho, heading angle \beta_{r}, and the position of the navigation goal (g_{x},g_{y}). The joint state of the environment is then defined as s^{jnt}_{t}=[w_{t},h^{1}_{t},h^{2}_{t},\dots,h^{n}_{t}]. At each time step, the robot agent starts from a joint state s^{jnt}_{t}, takes the action a_{t} generated by its policy \pi(a_{t}|s_{t}), reaches the next joint state s^{jnt}_{t+1} via an unknown stochastic transition function P(\cdot|s_{t},a_{t}) of the environment, and receives a corresponding reward r_{t}. The optimal policy \pi^{*}:\mathbf{s}_{t}^{jnt}\mapsto\mathbf{a}_{t} is the policy that maximizes the expected return \mathbb{E}[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}], where \gamma is a discount factor.
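To make the joint-state notation concrete, one possible data layout is sketched below; the class and field names are our own illustration rather than part of the original formulation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HumanState:       # h^n_t: what the robot can observe about the n-th human
    px: float
    py: float
    beta_h: float       # heading angle

@dataclass
class RobotState:       # w_t: the robot's own state
    vx: float
    vy: float
    v_max: float
    v_min: float
    px: float
    py: float
    radius: float
    beta_r: float       # heading angle
    gx: float           # navigation goal position
    gy: float

@dataclass
class JointState:       # s^jnt_t = [w_t, h^1_t, ..., h^n_t]
    robot: RobotState
    humans: List[HumanState]
```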

Soft Actor-Critic Training Framework. In this work, Soft Actor-Critic (SAC) [18] is utilized as the training framework. SAC alternates between two parameter-update steps: soft policy evaluation and soft policy improvement. In soft policy evaluation, a Q-function with parameters \theta is updated to minimize a Bellman-residual-based objective (see Eqs. (2), (3)), where \mathcal{B} denotes the replay buffer, \gamma is a discount factor, and \bar{\theta} and \alpha are the delayed and temperature parameters, respectively.

\begin{split}\mathcal{L}_{Q}(\theta)&=\mathbb{E}_{\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)\sim\mathcal{B}}\bigg[\frac{1}{2}\Big(Q_{\theta}\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)-\\ &\left(r\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)+\gamma\mathbb{E}_{\mathbf{s}_{t+1}\sim p}\left[V_{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\right)\Big)^{2}\bigg]\end{split} (2)

where,

V_{\bar{\theta}}\left(\mathbf{s}_{t}\right)=\mathbb{E}_{\mathbf{a}_{t}\sim\pi_{\phi}}\left[Q_{\bar{\theta}}\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)-\alpha\log\pi_{\phi}\left(\mathbf{a}_{t}\mid\mathbf{s}_{t}\right)\right]. (3)

In soft policy improvement, the policy \pi_{\phi} is optimized to minimize the following objective:

\mathcal{L}_{\pi}(\phi)=\mathbb{E}_{\mathbf{s}_{t}\sim\mathcal{B}}\left[\mathbb{E}_{\mathbf{a}_{t}\sim\pi_{\phi}}\left[\alpha\log\left(\pi_{\phi}\left(\mathbf{a}_{t}\mid\mathbf{s}_{t}\right)\right)-Q_{\theta}\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)\right]\right]. (4)
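For readers who prefer code, a rough PyTorch-style sketch of the objectives in Eqs. (2)-(4) follows; the q_net, target_q_net, and policy.sample interfaces are our own assumptions, not the authors' implementation.

```python
import torch

def sac_losses(batch, q_net, target_q_net, policy, alpha, gamma):
    """Compute L_Q (Eqs. (2)-(3)) and L_pi (Eq. (4)) on one minibatch from buffer B."""
    s, a, r, s_next = batch                               # tensors sampled from the replay buffer

    # Soft policy evaluation: Bellman-residual objective for Q_theta.
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)         # a_{t+1} ~ pi_phi(.|s_{t+1})
        v_next = target_q_net(s_next, a_next) - alpha * logp_next   # Eq. (3)
        q_target = r + gamma * v_next
    q_loss = 0.5 * (q_net(s, a) - q_target).pow(2).mean() # Eq. (2)

    # Soft policy improvement for pi_phi.
    a_new, logp_new = policy.sample(s)                    # reparameterized sample
    pi_loss = (alpha * logp_new - q_net(s, a_new)).mean() # Eq. (4)
    return q_loss, pi_loss
```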

III Approach

III-A Overview

In this section, we introduce FAPL: Feedback-efficient Active Preference Learning. A social-compliance-embedded reward function \hat{R}_{\mu}, a Q-function Q_{\theta}, and a policy \pi_{\phi} are updated through the following steps:

  • Expert Demonstration: Initially, we let human experts provide demonstrations of the socially aware navigation task and store them in an initialized experience replay buffer EB as expert experiences (s_{t}, a_{expert}, s_{t+1}). (Section III-B)

  • Curious Exploration: We train a pre-policy \pi_{\phi} by maximizing a state-entropy-based reward to conduct curious exploration and collect diverse samples, which are stored in the buffer EB as exploration experiences (s_{t}, a_{exploration}, s_{t+1}). Meanwhile, a temporary buffer B stores the incoherent samples generated during the early training process as temporary experiences (s_{t}, a_{temporary}, s_{t+1}). (Section III-B)

  • Active Reward Learning: The reward model \hat{R}_{\mu}, encoded with human comfort and intelligence, is distilled from human feedback via active preference learning. Then all collected samples in EB and B are updated with new reward values from \hat{R}_{\mu}. (Section III-C)

  • Off-policy Learning: The Q-function Q_{\theta} and policy \pi_{\phi} are optimized using all updated samples via SAC to gain maximum return from the distilled reward model \hat{R}_{\mu}. (Section III-D)

III-B Hybrid Experience Learning

At the start of traditional active preference learning, the initialized agent follows a random policy to interact with the environment and obtain samples for a human to judge. However, such a policy covers the state space poorly and leads to incoherent behaviors, making it hard for the human teacher to provide meaningful feedback [14]. Thus, many samples and much human effort are required to make initial progress. To address this challenge, we introduce a hybrid experience learning module, which consists of curious exploration and expert demonstration.

In curious exploration, we encourage the robot agent to explore the state space of socially aware navigation thoroughly, providing better state coverage, by using a k-NN-based state entropy estimator, adapted from [19, 14, 15], as the incipient reward function. The state entropy estimator H_{\text{state}}(s) evaluates the sparsity and randomness of the state distribution by measuring the distance between each state and its k nearest neighbors:

H_{\text{state}}(s):=\sum_{t=1}^{n}\log\left(p+\frac{1}{k}\sum_{s_{t}^{(k)}\in\mathrm{N}_{k}\left(s_{t}\right)}\left\|s_{t}-s_{t}^{(k)}\right\|\right). (5)

where p is a parameter for numerical stability (usually fixed to 1), and s_{t}^{(k)} denotes the k nearest neighbors of s_{t} in state space.

Correspondingly, the incipient exploration reward and objective function can be defined as:

R_{\mathrm{exploration}}\left(\mathbf{s}_{t}\right)=\log\left(p+\frac{1}{k}\sum_{s_{t}^{(k)}\in\mathrm{N}_{k}\left(s_{t}\right)}\left\|s_{t}-s_{t}^{(k)}\right\|\right), (6)
\phi_{\pi_{\phi}}^{\star}=\underset{\phi}{\operatorname{argmax}}\sum_{t=1}^{n}R_{\mathrm{exploration}}\left(\mathbf{s}_{t}\right). (7)
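A minimal sketch of the particle-based estimate in Eqs. (5)-(6), computed over a batch of visited states, might look as follows (our own illustration, assuming flattened joint states stored as NumPy arrays):

```python
import numpy as np

def knn_exploration_reward(states: np.ndarray, k: int = 5, p: float = 1.0) -> np.ndarray:
    """R_exploration(s_t) of Eq. (6): log(p + mean distance to the k nearest visited states)."""
    # states: (n, d) array, one flattened joint state per row
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)  # (n, n) pairwise distances
    np.fill_diagonal(dists, np.inf)          # exclude each state from its own neighborhood
    knn = np.sort(dists, axis=1)[:, :k]      # distances to the k nearest neighbors N_k(s_t)
    return np.log(p + knn.mean(axis=1))      # one intrinsic reward per state; Eq. (5) is their sum
```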

By maximizing the objective function in Eq. (7), we train an initial policy \pi_{\phi} with which the robot agent can explore a broader range of states. However, the socially aware navigation task not only requires the robot to reach the final goal without collisions, but also places a high value on human comfort, which is hidden in the state space and is thus quite difficult for the robot to cover by itself. As a result, many iteration rounds and therefore a high volume of human feedback would be needed to optimize a desired policy. To accelerate the learning process and further improve feedback efficiency, we add human demonstrations, e.g., controlling the robot with a keyboard to demonstrate the navigation task, to introduce expert experiences that cover the hidden aspects of social compliance and lend naturalness to robot movements. Human teachers can treat the expert demonstrations as benchmarks when expressing preferences, which makes it easier to offer critical feedback.

Meanwhile, it must be mentioned that human demonstrations are always accompanied by noise [9]. To avoid the influence of inaccurate demonstrations, we only store a triple (s_{t},a_{expert},s_{t+1}), without a reward value, in the replay buffer EB, which differs from traditional IRL. The human teacher can then express preference for good demonstrated trajectories or dislike for inaccurate ones, adding reward labels to these demonstration samples. In other words, the expert experiences are not regarded as oracles; they are judged by human teachers just like the exploration experiences, so we can take advantage of good demonstrations without being affected by noisy ones. The full process of hybrid experience learning is summarized in Algorithm 1.

Algorithm 1 Hybrid Experience Learning.
1: Initialize a critic Q_{\theta} and a policy \pi_{\phi}
2: Initialize two replay buffers B\leftarrow\emptyset and EB\leftarrow\emptyset
3: while Expert Demonstration do
4:     Store expert experiences
5:     EB\leftarrow EB\cup\{(s_{t},a_{\text{expert}},s_{t+1})\}
6: end while
7: while Curious Exploration do
8:     Given t_{\max} and t_{memory}
9:     t\leftarrow 1
10:     while t<t_{\max} do
11:         Take a_{t}\sim\pi_{\phi}(a_{t}|s_{t}) in s_{t} and reach s_{t+1}
12:         Compute reward r_{t}\leftarrow R_{\text{exploration}}(\mathbf{s}_{t}) in (6)
13:         if t<t_{memory} then
14:             Store temporary experiences
15:             B\leftarrow B\cup\{(s_{t},a_{\text{temporary}},s_{t+1},r_{t})\}
16:         else
17:             Store exploration experiences
18:             EB\leftarrow EB\cup\{(s_{t},a_{\text{exploration}},s_{t+1})\}
19:         end if
20:         Sample minibatch transitions \sim B
21:         Optimize \theta, \phi by following \mathcal{L}_{Q}(\theta) in (2) and \mathcal{L}_{\pi}(\phi) in (4)
22:         t\leftarrow t+1
23:     end while
24: end while
25: return B, EB, \pi_{\phi}

III-C Active Reward Learning

The objective of active reward learning is to learn a reward function \hat{R}_{\mu}, a neural network with parameters \mu, that encodes human expectation and preference about how a socially compliant robot should act in spatial HRI. It has been shown that people find it much easier to make relative judgments than to give absolute ratings [20]. Therefore, instead of asking the human teacher to rate a single robot trajectory, we provide two segments at a time for the human to express a preference, e.g., which segment is better or worse. We follow the framework of [16, 14] to optimize our reward model \hat{R}_{\mu}. Robot trajectories stored in the experience replay buffer EB are divided into several segments, where each segment \sigma contains a sequence of states and actions from the robot's movement: \sigma=(s_{t},a_{t},\dots,s_{t+n},a_{t+n}). In each feedback step, the agent queries the human preference \Upsilon, which is one of (1,0), (0,1), or (0.5,0.5), on two segments \sigma_{1},\sigma_{2}. The judgment together with the two segments is stored in a database D as (\sigma_{1},\sigma_{2},\Upsilon).

Then, a preference predictor \mathcal{P}_{\mu} is built to train the reward model \hat{R}_{\mu}, where \sigma_{1}\succ\sigma_{2} means that \sigma_{1} is preferred:

\mathcal{P}_{\mu}\left[\sigma_{1}\succ\sigma_{2}\right]=\frac{\exp\left(\sum_{s_{t},a_{t}\in\sigma_{1}}\hat{R}_{\mu}(s_{t},a_{t})\right)}{\exp\left(\sum_{s_{t},a_{t}\in\sigma_{1}}\hat{R}_{\mu}(s_{t},a_{t})\right)+\exp\left(\sum_{s_{t},a_{t}\in\sigma_{2}}\hat{R}_{\mu}(s_{t},a_{t})\right)}. (8)

The underlying assumption is that the probability that a human teacher prefers a segment \sigma_{i} in a pair (\sigma_{i},\sigma_{j}) depends exponentially on the accumulated reward of \sigma_{i} under \hat{R}_{\mu}, relative to the accumulated rewards of both segments.

Based on the preference predictor \mathcal{P}_{\mu}, we optimize the reward model \hat{R}_{\mu} by minimizing the cross-entropy loss \mathcal{L}(\hat{R}_{\mu}), which evaluates the difference between the predictions of \hat{R}_{\mu} and the ground-truth human preferences:

\mathcal{L}\left(\hat{R}_{\mu}\right)=-\sum_{\left(\sigma_{1},\sigma_{2},\Upsilon\right)\in D}\Big[\Upsilon(1)\log\mathcal{P}_{\mu}\left[\sigma_{1}\succ\sigma_{2}\right]+\Upsilon(2)\log\mathcal{P}_{\mu}\left[\sigma_{2}\succ\sigma_{1}\right]\Big]. (9)
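For one labeled pair, Eqs. (8)-(9) can be sketched as below, assuming a reward_model network that maps concatenated (s_t, a_t) tuples to per-step rewards; the names and tensor shapes are our own illustration.

```python
import torch

def preference_loss(reward_model, seg1, seg2, label):
    """Cross-entropy loss of Eq. (9) on one labeled pair (sigma_1, sigma_2, Upsilon).

    seg1, seg2: tensors of shape (T, state_dim + action_dim) holding the (s_t, a_t) tuples.
    label: the preference Upsilon, e.g. (1, 0), (0, 1), or (0.5, 0.5).
    """
    ret1 = reward_model(seg1).sum()   # accumulated predicted reward of sigma_1
    ret2 = reward_model(seg2).sum()   # accumulated predicted reward of sigma_2
    # log P_mu[sigma_1 > sigma_2] and log P_mu[sigma_2 > sigma_1] from Eq. (8),
    # computed with logsumexp for numerical stability.
    log_z = torch.logsumexp(torch.stack([ret1, ret2]), dim=0)
    y1, y2 = label
    return -(y1 * (ret1 - log_z) + y2 * (ret2 - log_z))
```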

In addition, we adapt the uncertainty-based query selection method proposed in [21] to determine which pair of segments (\sigma_{i},\sigma_{j}) of robot behaviors stored in EB is selected to query the human teacher for a preference \Upsilon each time. Such informative query selection picks behaviors with maximum entropy, which are typically "uncertain samples on the decision boundary", leading to a significant decrease in uncertainty over unlabelled behaviors and to more informative human feedback.
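One simple way to realize this selection, under the predictor of Eq. (8), is to query the candidate pair whose predicted preference probability is closest to 0.5, i.e., the pair with maximum prediction entropy; the sketch below is our own illustration adapted in spirit from [21], not the authors' exact procedure.

```python
import torch

def select_query(reward_model, candidate_pairs):
    """Return the segment pair on which the current reward model is most uncertain."""
    best_pair, best_entropy = None, -float("inf")
    for seg1, seg2 in candidate_pairs:
        ret1 = reward_model(seg1).sum()
        ret2 = reward_model(seg2).sum()
        p1 = torch.sigmoid(ret1 - ret2)      # P_mu[sigma_1 > sigma_2] from Eq. (8)
        entropy = -(p1 * torch.log(p1 + 1e-8) + (1 - p1) * torch.log(1 - p1 + 1e-8))
        if entropy > best_entropy:           # maximum entropy <=> p1 closest to 0.5
            best_pair, best_entropy = (seg1, seg2), entropy
    return best_pair
```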

III-D Off-policy Learning

To further improve sample efficiency, we adopt an off-policy RL framework, SAC, for the subsequent training. However, compared with on-policy RL frameworks, off-policy RL is less stable when using the reward model \hat{R}_{\mu} obtained from active learning, because this reward function is repeatedly updated and thus non-stationary during training; off-policy RL would otherwise reuse samples in the replay buffer whose reward values were produced by earlier, not-yet-optimized reward models. To address this issue, every time a new reward model is distilled from human preferences, all previous samples stored in both EB and B are relabeled with new reward values, and the temporary experiences in B are transferred to EB. The Q-function Q_{\theta} and the pre-trained policy \pi_{\phi} obtained from curious exploration are then optimized using the reward model \hat{R}_{\mu} together with the updated samples in EB. The processes of active reward learning and of the complete FAPL are presented in Algorithms 2 and 3, respectively.
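For clarity, the relabeling step (Algorithm 2, line 11) can be sketched as follows; the buffer layout and the reward-model interface are our own assumptions.

```python
def relabel_buffer(buffer, reward_model):
    """Overwrite (or add) the reward label of every stored transition using the newest R_mu."""
    for i, (s, a, s_next, *_) in enumerate(buffer):   # handles both 3- and 4-tuples
        buffer[i] = (s, a, s_next, float(reward_model(s, a)))
    return buffer

# After each round of reward learning, both buffers are relabeled and then merged as described above:
#   EB = relabel_buffer(EB, reward_model)
#   B  = relabel_buffer(B,  reward_model)
```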

Algorithm 2 Active Reward Learning
1: Initialize \hat{R}_{\mu} and a database D\leftarrow\emptyset
2: Given the number of feedback queries M
3: m\leftarrow 1
4: B, EB\leftarrow Algorithm 1
5: while m<M do
6:     Select segments (\sigma_{0},\sigma_{1})\sim EB
7:     Query human feedback \Upsilon
8:     Store preference D\leftarrow D\cup\{(\sigma_{0},\sigma_{1},\Upsilon)\}
9:     Sample minibatch preferences \sim D
10:     Optimize \mu by \mathcal{L}(\hat{R}_{\mu}) in (9)
11:     Update entire EB and B using \hat{R}_{\mu}
12:     m\leftarrow m+1
13: end while
14: Store EB\leftarrow EB\cup B
15: return \hat{R}_{\mu} and EB
Algorithm 3 FAPL
1: Initialize a critic Q_{\theta}
2: // Hybrid Experience Learning
3: \pi_{\phi}\leftarrow Algorithm 1
4: // Agent Learning
5: for each time step do
6:     // Active Reward Learning
7:     EB, \hat{R}_{\mu}\leftarrow Algorithm 2
8:     Take a_{t}\sim\pi_{\phi}(a_{t}|s_{t}) and reach s_{t+1}
9:     Compute reward \hat{R}_{\mu}(s_{t},a_{t})
10:     Store transitions EB\leftarrow EB\cup\{(s_{t},a_{t},s_{t+1},\hat{R}_{\mu}(s_{t},a_{t}))\}
11:     // Policy Optimization
12:     for each gradient step do
13:         Sample minibatch transitions \sim EB
14:         Optimize \theta, \phi by \mathcal{L}_{Q}(\theta) in (2) and \mathcal{L}_{\pi}(\phi) in (4)
15:     end for
16: end for
17: return \pi_{\phi}, Q_{\theta}

III-E Implementation Details

We pretrain the policy \pi_{\phi} in curious exploration for 3,000 episodes; the samples gained in the first 1,000 episodes are stored in B and the rest in EB. In the expert demonstration part, human trainers (the authors) use the keyboard to control the robot velocity along the x- and y-axis directions, (v_{x},v_{y}), to provide 500 rounds of demonstrations in simulation. One round of demonstration is complete once the robot has been controlled to reach the goal. The reward model in active reward learning is a 3-layer neural network with 265 hidden units in each layer and Leaky ReLU activations. We use Adam to train the reward model with an initial learning rate of 4\times 10^{-4}. The segment length is set to 35 (about 15 s), meaning each segment contains 35 tuples of (s_{t},a_{t}). We recruited 10 human raters, who are students or professionals in robotics, to provide 1,500 rounds of preferences on robot behaviors for our model and the ablation model respectively (Section IV-A3). One round of preference is complete once the human rater gives a judgment \Upsilon on one pair of robot trajectory segments \sigma_{1},\sigma_{2}.
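Under these settings, the reward model might be implemented roughly as follows; the layer sizes follow the text above, while the interface and the interpretation of "3-layer" (two hidden layers plus an output layer) are our assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """3-layer MLP reward model R_mu: maps concatenated (s_t, a_t) vectors to scalar rewards."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 265):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 1),    # scalar reward R_mu(s_t, a_t)
        )

    def forward(self, sa: torch.Tensor) -> torch.Tensor:
        # sa: (..., state_dim + action_dim) concatenated state-action input
        return self.net(sa)

# Trained with Adam at an initial learning rate of 4e-4, as stated above:
# optimizer = torch.optim.Adam(reward_model.parameters(), lr=4e-4)
```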

Figure 3: Simulation environment illustration. In a 22 m × 20 m two-dimensional space, the yellow circle indicates the robot. The blue dotted line illustrates the robot's FoV; humans that can be detected by the robot are shown as green circles, while those out of the robot's view are red circles. The red star is the robot's goal, and the orientation and number of each agent are presented as a red arrow and a black number, respectively.

IV Experiments And Results

IV-A Simulation Experiment

IV-A1 Simulation Environment

We adapted a simulation environment from [6, 7] for training and experiments, as shown in Fig. 3. Each agent in the environment follows holonomic kinematics, i.e., it can move in any direction at any time. The agent's action at time t is the preferred velocity along the x-axis and y-axis: a_{t}=(v^{x}_{t},v^{y}_{t}). This velocity is assumed to be achievable immediately. All human agents are controlled by ORCA [3], and the parameters of their policy are drawn from a Gaussian distribution to produce random pedestrian behaviors, e.g., different velocities and goals. To avoid learning the exceedingly aggressive behavior in which the robot compels all humans to yield, we follow previous research [7] and use an invisible-robot setting, where human agents react, e.g., by yielding, only to other human agents. Meanwhile, different from previous settings, we set the robot field-of-view (FoV) to 90^{\circ} rather than 360^{\circ} to better mimic real-world scenarios and reduce sim-to-real domain mismatch, since mounting multiple sensors on a physical robot to obtain a global FoV is impractical and expensive.

IV-A2 Baselines

We compare the proposed FAPL with the four state-of-the-art methods mentioned above: ORCA [3] serves as the baseline for model-based approaches, while CADRL [6], SARL [7] and RGL [8] serve as the baselines for learning-based approaches.

IV-A3 Ablation Study

To evaluate the feedback efficiency of FAPL, we also implement an ablation model, APL, which removes the hybrid experience learning module from FAPL and serves as the baseline for plain active preference learning.

IV-A4 Training Details

We use the same reward function as in Eq. (1) for CADRL, SARL and RGL. The architectures of all networks stay the same in each experiment. All baseline networks are trained for 1\times 10^{4} episodes following the implementation details in the original papers. We train APL and FAPL with 1,500 pieces of human feedback each (Section III-E) for 1\times 10^{4} episodes using a learning rate of 2\times 10^{-4} and the same parameters.

IV-A5 Evaluation

We compare the learning curves of all learning-based methods in terms of success rate and discomfort frequency, where discomfort frequency refers to the percentage of the navigation time during which the robot is too close to a human pedestrian. Three experimental settings, involving 5, 10 and 15 moving human agents respectively, are used to evaluate the performance of all methods. In each setting, 500 unseen testing scenarios are run and five indicators are recorded: success rate, time-out rate, collision rate, discomfort frequency and time to reach the goal. The time-out threshold is set to 30, 35 and 40 seconds in the three settings respectively, and the discomfort distance margin is set to 0.3 m based on [1].

IV-A6 Results

(a) Success Rate
(b) Discomfort Frequency
Figure 4: Learning curves of CADRL, RGL, SARL, APL and FAPL as measured on (a) success rate (per episode) and (b) discomfort frequency (mean over every five hundred episodes) under the setting of 10 humans.

Fig. 4 shows the learning curves of all learning-based methods in terms of success rate and discomfort frequency. FAPL (purple) continuously improves the success rate and reduces the discomfort frequency, achieving the best performance by the end of training and outperforming all baselines significantly. Although RGL (blue) achieves a higher success rate than ours in the first 3,000 episodes, it is very unstable and only reaches a success rate of about 0.6 by the end. Compared with APL (red), FAPL has a lower success rate and higher discomfort frequency at the beginning, because its initial policy comes from curious exploration, where the robot is only encouraged to visit different states. However, FAPL surpasses APL after around 3,000 episodes and achieves the best performance much more quickly, showing that FAPL converges significantly faster with the same amount of human feedback and quantitatively proving the benefits of the hybrid experience learning module.

Fig. 5 presents the success, time-out, and collision rates of each model. ORCA performs quite badly as expected, since its model is based on the assumption that all agents are always reciprocal, which the invisible-robot setting violates. RGL also performs badly, since it is a model-based RL method built upon a global robot FoV and precise perception of human velocities, which are unavailable in our setting. Meanwhile, the learning-based baselines are competitive with our FAPL and APL in the 5-human setting. However, as the complexity of the environment, i.e., the number of humans, increases, our methods outperform the baselines by a growing margin. This shows that APL and FAPL are more efficient and robust in complex task scenarios. Moreover, FAPL clearly outperforms APL in the most difficult scenario involving 15 humans.

TABLE I: Outcome: navigation time (seconds) and discomfort frequency (DisFq.) under 5, 10 and 15 humans. The best value in each column is marked with *.

Methods       | Time (s)                 | DisFq.
              | 5       10      15       | 5        10       15
CADRL         | 29.4    34.7    41.3     | 0.203    0.258    0.435
SARL          | 27.3    32.5    38.7     | 0.137    0.172    0.251
RGL           | 22.9*   27.4*   32.7     | 0.288    0.354    0.474
APL (Ours)    | 24.2    28.7    31.9*    | 0.029    0.043    0.061
FAPL (Ours)   | 25.3    30.1    34.1     | 0.018*   0.025*   0.048*

Table I reports the navigation time and discomfort frequency of each model except ORCA, which is excluded due to its extremely high collision rate. RGL achieves the best overall navigation time; however, it also has the highest discomfort frequency, meaning that to achieve shorter times the robot takes many aggressive actions that badly hurt human comfort. FAPL obtains the lowest discomfort frequency with shorter times than CADRL and SARL, and is competitive with RGL and APL in terms of navigation time. Taken together, the navigation time and discomfort frequency indicators show that our methods lead to robot behaviors that respect human comfort distance more than all baselines and the ablation model, without sacrificing too much time through lazy behaviors, e.g., waiting for all pedestrians to pass or excessive detouring.

Figure 5: Outcome: rates of success, collision and time-out under three experimental settings with five, ten and fifteen human agents involved respectively.

IV-B Real-world Experiment

Since social compliance is non-quantifiable and broader than maintaining a comfortable distance, evaluating it only via the discomfort frequency indicator in simulation experiments is insufficient. To further and more intuitively evaluate the social compliance of the robot trajectories learned by our methods, we recruited human participants for real-world experiments, collecting their feedback from the experience of walking alongside a robot controlled by different models as an additional indicator. This real-world experiment was reviewed by the Institutional Review Board (IRB) of BUCT.

IV-B1 Robot and Environment Setup

We utilized an Enlighten mobile robot [22] (Fig. 6, left) as the platform for the real-world experiments. A Kinect v2 camera with an approximate field-of-view of 84.1^{\circ} was mounted on top of the robot to capture human positions. YOLOv5 [23] combined with DeepSORT [24], together with the Monoloco library [25, 26], was used for robot-centered human tracking and localization (Fig. 7). To provide sufficient real-time computing power for the perception algorithms and RL controllers, we connected the robot via ROS to a host with an RTX 3090 GPU. To avoid potential damage to the robot hardware, we imposed an action restriction: the robot is forced to stop when its action involves a turn greater than 90^{\circ}. The experimental environment was an approximately 23 m × 16 m controlled open space, as shown in Fig. 6, right.

Figure 6: Illustration of the robot platform (left) and environment (right) employed for the real-world experiment.

IV-B2 Baselines

The ablation model APL and SARL, the best-performing baseline in terms of discomfort frequency in simulation, were selected as baselines.

IV-B3 Experimental Design

We recruited 10 participants (2 females and 8 males), all aged over 18 years (\mu = 21.9; \sigma = 1.92), and divided them into 2 groups. During each experiment, the robot and the participants, acting as pedestrians, were required to perform point-to-point navigation tasks in the same environment. The human pedestrians were encouraged to behave naturally, without any restrictions on their actions. We set up three scenarios, in which the robot and the human pedestrians were assigned different starting positions and goals. We asked each group to spatially interact with the robot controlled by three models, SARL, APL, and FAPL, in the three scenarios, resulting in a total of 18 tests. The order of the three models was randomized in each scenario for each group, and all participants were unaware of which model was active during each test.

Figure 7: Illustration of the perception algorithms for human tracking (left) and localization (right), where the blue circle and arrow indicate the position and orientation of a human, respectively.

IV-B4 Evaluation

After each test, the participants were asked to fill out a questionnaire rating the robot behaviors of the three models in terms of comfort and naturalness on a Likert scale [27], with 1 being strongly disagree and 5 being strongly agree, based on their experience of spatial interaction with the robot. The participants' responses to the questionnaire were used for quantitative measurement. We also conducted a semi-structured interview to collect participants' oral accounts of their experiences for qualitative analysis.

IV-B5 Quantitative Measurement

To collect fair questionnaire responses, participants in each test were divided into two categories: close-interaction participants (CI) and non-close-interaction participants (NCI), where CI refers to pedestrians who approached the robot within 0.6 m for more than 2 seconds during the test and NCI refers to those who did not. The results are summarized in Fig. 8. The benefits of FAPL's robot behaviors in terms of comfort and naturalness are clearly visible in the CI responses. FAPL does not outperform the other two models by much in the NCI comfort responses because NCI participants did not have enough direct interaction to judge and therefore tended to give all three models a middle score. The naturalness responses of CI and NCI are similar because the robot's unnatural behaviors, e.g., sudden stops, were accompanied by noises that attracted the NCI participants' attention.

IV-B6 Qualitative Analysis

After each test, all participants were asked an open-ended interview question: "Could you share your experiences and explain reasons behind?". The insights thematically coded from participants' responses are presented below along with supporting quotes.

Figure 8: Summary of participant responses to evaluate robot navigation behaviors in terms of comfort and naturalness.

Long-sighted Decisions. One reason pedestrians preferred the robot behavior of FAPL is that it makes more long-sighted decisions, avoiding potentially risky situations in which it is hard for the robot to maintain socially compliant interaction with pedestrians. We believe this is because we introduce human intelligence and insight via expert demonstration and feedback in the learning process, enabling the robot to handle complex situations by valuing long-term benefits over short-term gains. Some notable quotes are as follows:

(P5) "The reason why I like that one (FAPL) is it turned right to give me enough space before I got close to it."

(P9) "Because the third one (FAPL) went left to detour at the start instead of getting involved in the center of the crowd directly, like the first one (SARL)."

Behavior Naturalness. Another reason our FAPL is preferred is that it leads to more natural behaviors, e.g., it does not stop suddenly or turn left and right frequently. We attribute this to the human-imitative trajectory samples introduced by expert demonstration in hybrid experience learning, and to the avoidance of reward exploitation achieved by replacing the handcrafted reward function with a reward model distilled from human preferences. Some notable quotes are as follows:

(P1) "The second one (SARL) always stopped immediately, and the first (APL) also did that sometime, these made me a little nervous and annoyed."

(P9) "For me, that one (FAPL) moved more like a human, I mean it hardly stopped suddenly and turned frequently, which is preferable to me."

V Conclusion

In this work, we present FAPL, a feedback-efficient active preference learning approach for socially aware robot navigation that distills human comfort and expectation into a reward model to guide the robot to explore the latent space of social compliance. We demonstrate via both simulation and real-world experiments that our method outperforms existing state-of-the-art approaches, leading to more desirable and natural robot behaviors. We also show that by introducing hybrid experience learning, the efficiency of human feedback can be improved.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1846221.

References

  • [1] J. Rios-Martinez, A. Spalanzani, and C. Laugier, “From proxemics theory to socially-aware navigation: A survey,” International Journal of Social Robotics, vol. 7, no. 2, pp. 137–153, 2015.
  • [2] T. Kruse, A. K. Pandey, R. Alami, and A. Kirsch, “Human-aware robot navigation: A survey,” Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1726–1743, 2013.
  • [3] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha, “Reciprocal n-body collision avoidance,” in Robotics research.   Springer, 2011.
  • [4] A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling attention in human crowds,” in 2018 IEEE international Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 4601–4607.
  • [5] G. Ferrer and A. Sanfeliu, “Behavior estimation for a complete framework for human motion prediction in crowded environments,” in 2014 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2014, pp. 5940–5945.
  • [6] Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware motion planning with deep reinforcement learning,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 1343–1350.
  • [7] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 6015–6022.
  • [8] C. Chen, S. Hu, P. Nikdel, G. Mori, and M. Savva, “Relational graph learning for crowd navigation,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 10007–10013.
  • [9] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard, “Socially compliant mobile robot navigation via inverse reinforcement learning,” The International Journal of Robotics Research, vol. 35, no. 11, pp. 1289–1307, 2016.
  • [10] B. Okal and K. O. Arras, “Learning socially normative robot navigation behaviors with bayesian inverse reinforcement learning,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2016, pp. 2889–2895.
  • [11] B. Kim and J. Pineau, “Socially adaptive path planning in human environments using inverse reinforcement learning,” International Journal of Social Robotics, vol. 8, no. 1, pp. 51–66, 2016.
  • [12] C.-E. Tsai and J. Oh, “A generative approach for socially compliant navigation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 2160–2166.
  • [13] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” in International Conference on Learning Representations, 2018.
  • [14] H. Liu and P. Abbeel, “Behavior from the void: Unsupervised active pre-training,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [15] K. Lee, L. M. Smith, and P. Abbeel, “Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training,” in International Conference on Machine Learning.   PMLR, 2021, pp. 6152–6163.
  • [16] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in NIPS, 2017.
  • [17] S. Jamieson, J. P. How, and Y. Girdhar, “Active reward learning for co-robotic vision based exploration in bandwidth limited environments,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 1806–1812.
  • [18] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning.   PMLR, 2018, pp. 1861–1870.
  • [19] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American journal of mathematical and management sciences, vol. 23, no. 3-4, pp. 301–321, 2003.
  • [20] N. Wilde, A. Blidaru, S. L. Smith, and D. Kulić, “Improving user specifications for robot behavior through active preference learning: Framework and evaluation,” The International Journal of Robotics Research, vol. 39, no. 6, pp. 651–667, 2020.
  • [21] Y. Yang and M. Loog, “A benchmark and comparison of active learning for logistic regression,” Pattern Recognition, vol. 83, pp. 401–415, 2018.
  • [22] “Enlighten mobile robot.” [Online]. Available: http://www.6-robot.com
  • [23] G. Jocher, “Yolov5.” [Online]. Available: https://github.com/ultralytics/yolov5
  • [24] Z. Pei, “Deepsort.” [Online]. Available: https://github.com/ZQPei/deep_sort_pytorch
  • [25] L. Bertoni, “Monoloco library.” [Online]. Available: https://github.com/vita-epfl/monoloco
  • [26] L. Bertoni, S. Kreiss, and A. Alahi, “Perceiving humans: from monocular 3d localization to social distancing,” IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [27] A. Joshi, S. Kale, S. Chandel, and D. K. Pal, “Likert scale: Explored and explained,” British journal of applied science and technology, vol. 7, no. 4, p. 396, 2015.