\jyear

2022

[2]\pfxProf. Dr. \fnmDonglin \surWang

1]\orgdivCollege of Computer Science and Technology, \orgnameZhejiang University, \orgaddress\cityHangzhou, \postcode310024, \stateZhejiang, \countryChina

[2]\orgdivMachine Intelligence Lab (MiLAB) of the School of Engineering, \orgnameWestlake University, \orgaddress\cityHangzhou, \postcode310024, \stateZhejiang, \countryChina

STRAPPER: Preference-based Reinforcement Learning via Self-training Augmentation and Peer Regularization

\fnmYachen \surKang [email protected] \fnmLi \surHe [email protected] \fnmJinxin \surLiu [email protected] \fnmZifeng \surZhuang [email protected] [email protected] [ *

Abstract

Preference-based reinforcement learning (PbRL) promises to learn a complex reward function with binary human preference. However, such human-in-the-loop formulation requires considerable human effort to assign preference labels to segment pairs, hindering its large-scale applications. Recent approache has tried to reuse unlabeled segments, which implicitly elucidates the distribution of segments and thereby alleviates the human effort. And consistency regularization is further considered to improve the performance of semi-supervised learning. However, we notice that, unlike general classification tasks, in PbRL there exits a unique phenomenon that we defined as similarity trap in this paper. Intuitively, human can have diametrically opposite preferredness for similar segment pairs, but such similarity may trap consistency regularization fail in PbRL. Due to the existence of similarity trap, such consistency regularization improperly enhances the consistency possiblity of the model’s predictions between segment pairs, and thus reduces the confidence in reward learning, since the augmented distribution does not match with the original one in PbRL. To overcome such issue, we present a self-training method along with our proposed peer regularization, which penalizes the reward model memorizing uninformative labels and acquires confident predictions. Empirically, we demonstrate that our approach is capable of learning well a variety of locomotion and robotic manipulation behaviors using different semi-supervised alternatives and peer regularization.

keywords:

reinforcement learning, preference-based reinforcement learning, semi-supervised learning, consistency regularization

1 Introduction

Deep reinforcement learning (RL) utilizes a flexible framework for learning task-oriented behaviors NateKohl2004PolicyGR ; JensKober2008PolicySF ; JensKober2013ReinforcementLI ; DavidSilver2017MasteringTG ; DmitryKalashnikov2018QTOptSD ; OriolVinyals2019GrandmasterLI , where the design of the reward function is commonly necessary and crucial. However, many tasks are often complex or hard to be specified so that the reward design is very difficult. Agents often learn to exploit some loopholes of a misspecified reward function, resulting in unwanted behaviors (misaligned with the original intention). In addition, requirements such as operational safety and compliance with social norms are difficult to effectively state and meet through reward engineering DarioAmodei2016ConcretePI ; RohinShah2019PreferencesII ; AlexanderMattTurner2020AvoidingSE .

Preference-based RL (PbRL) PaulFChristiano2017DeepRL ; ErdemByk2018BatchAP ; DorsaSadigh2017ActivePL ; ErdemByk2020ActivePG ; KiminLee2021BPrefBP , as an alternative, provides a paradigm to elicit a reward function from human preference between two trajectory segments. PbRL permits learning a flexible reward function with only binary preferences, which requires less non-trivial human effort. However, in standard RL, learning from such sparse human prior is a notorious challenge. Compared with imitation learning, PbRL does not require demonstrators to be expert. Unfortunately, the main benefit of PbRL, relying on neither hand-crafted rewards nor expert demonstrators, is also what makes it costly or impractical. Due to the ingredient of human-in-the-loop, the straightforward PbRL inevitably requires a large amount of human feedbacks, which is costly and hinders its large-scale applications. To enable more scalable and practical human-in-the-loop learning, prior works tried to make this process more efficient in terms of human feedback, by designing various strategies to select informative queries PaulFChristiano2017DeepRL ; shin2021offline ; KiminLee2021PEBBLEFI ; liang2021reward ; BorjaIbarz2018RewardLF .

Compared to human preferences, segments can be accessed more easily, and a large amount of unlabeled segments can help implicitly elucidate the distribution of segments chapelle2009semi . So in this paper, we focus on how to reuse the unlabeled segments to help learning the reward function. It is common to apply semi-supervised learning (SSL) on the dataset that can be augmented by adding unlabeled segments. Some attempts have been made in SURF park2022surf to apply simple semi-supervised learning method (pseudo-labeling) to PbRL, and have been proven to be effective for improving the efficiency of preference samples.

Refer to caption — Figure 1: Illustration of STRAPPER. (1) Generate new transtions. (2) Sample segment pairs to query preferences from human. (3) Train teacher model using labeled data. (4) Pseudo-label unlabeled data using teacher model. (5) Train student model with consistency and peer regularization using mixed data. (6) Use the student model as the new teacher model. (7) Use the reward function trained in the student model to reward the transitions. (8) Update the policy.

A widely-used SSL method is self-training with consistency regularization (CR) laine2016temporal ; sajjadi2016regularization . Specifically, self-training has three main steps in PbRL: 1) train a teacher model on labeled segments (labeled dataset), 2) use the teacher to generate pseudo-labels on unlabeled segments (unlabeled dataset), and 3) train a student model on the combination of labeled and pseudo-labeled segments (mixed dataset), and then self-training treats the student as a teacher to relabel the unlabeled data and iterates the three steps above. Consistency regularization encourages the model to have identical output for different augmented data, by constraining model prediction invariant to input noise xie2019unsupervised . Most SOTA SSL methods adopt consistency regularization as an additional loss component; however, so far it has not been fully discussed in PbRL.

We observe that in PbRL there is a unique phenomenon we named as similarity trap, and such phenomenon hinders the direct use of consistency regularization. Specifically, in PBRL, humans can derive completely opposite preference labels from only a few, but fatal, differences. When there exits such data in unlabeled dataset, the conventional SSL method would cause model to improperly improve the consistency of predictions between disjoint data and thus reduce the confidence in reward learning. To counteract such a negative effect, we present a new PbRL framework: Self-TRaining Augmented Preference-based learning via PEer Regularization, abbreviated as STRAPPER (See Figure 1). We use an iterative self-training procedure to exploit the trajectory data without preference labels, resulting in a reward function. And we add peer regularization in self-training process, discouraging the student model from outputting the same label for peer samples (two samples that independently sampled in mixed dataset) inspired by YangLiu2020PeerLF . Intuitively, peer samples would not provide any informative signals to each other. Thus, we hope to penalize the student model to memorize uninformative labels, encouraging more confident prediction.

Here, we summarize the main contribution of STRAPPER as:

•

We propose a framework to reuse the unlabeled segments in PbRL and show the potential to reduce the human effort in PbRL using semi-supervised alternatives while keeping competitive performance.
•

We observe a crucial phenomenon in PbRL called similarity trap that hinders direct applying consistency regularization to PbRL. To address this, we propose a novel peer regularization to fix such issue and empirically verify it.
•

STRAPPER consistently outperforms prior PbRL baselines. In addition, we show that our method can generally eliminate the potential noise in preferences.

2 Related Work

2.1 Preference-based RL.

PbRL is one of RL paradigms to learn from human feedback BorjaIbarz2018RewardLF . Several works have successfully utilized feedback from real human to train RL agents DilipArumugam2019DeepRL ; PaulFChristiano2017DeepRL ; BorjaIbarz2018RewardLF ; WBradleyKnox2009InteractivelySA ; KiminLee2021PEBBLEFI ; GarrettWarnell2017DeepTI . PaulFChristiano2017DeepRL scales PbRL to utilize modern deep learning techniques, and BorjaIbarz2018RewardLF improves the efficiency of this method by introducing additional forms of feedback such as demonstrations. Recently, KiminLee2021PEBBLEFI proposes a feedback-efficient RL algorithm by utilizing off-policy learning and pre-training. park2022surf uses pseudo-labeling to utilize unlabeled segments and proposes a novel augmentation called temporal cropping to augment labeled data.

2.2 Semi-supervised learning.

Self-training denotes a learning setting where the supervision on unlabeled data is given by the own prediction of the model trained with the labeled data DavidYarowsky1995UNSUPERVISEDWS ; KamalNigam2000AnalyzingTE ; zhu2005semi ; CharlesJRosenberg2005SemiSupervisedSO . Pseudo-labeling refers to a specific variant, where model predictions are converted to hard labels lee2013pseudo . This is often used along with a confidence-based threshold that retains unlabeled examples only when the classifier is sufficiently confident. Consistency regularization was first proposed by bachman2014learning and later referred to as the “ $\Pi$ -Model” rasmus2015semi , where the main idea is to force the output of the model remain constant with randomly augmented inputs. After that, FixMatch sohn2020fixmatch first combines consistency regularization and pseudo-labeling in one simple method. On the other hand, NoisyStudent QizheXie2020SelfTrainingWN employs a self-training scheme, a teacher model generates pseudo-labels on unlabeled data, and a larger noisy student model is then trained on both labeled and unlabeled data with consistency regularization. Recently, zhu2021rich proposes an analytical framework to unify consistency regularization with explicit and implicit pseudo-labels.

3 Preliminaries

3.1 Reinforcement learning.

We consider a standard RL framework where an agent interacts with an environment in discrete time RichardSutton1988ReinforcementLA . Formally, at each timestep $t$ , the agent receives a state $\mathbf{s}_{t}$ from the environment and chooses an action $\mathbf{a}_{t}$ based on its policy $\pi_{\phi}$ . In traditional RL, the environment also returns a reward $r(\mathbf{s}_{t},\mathbf{a}_{t})$ . The return $\mathcal{R}_{t}=\sum^{\infty}_{k=0}\gamma^{k}r_{t+k}$ is the discounted sum of rewards from timestep $t$ with discount factor $\gamma\in(0,1]$ . RL then maximizes the expected return $\mathcal{R}_{0}$ with respect to $\pi_{\phi}$ .

3.2 Preference-based RL (PbRL).

However, for many complex domains and tasks, it is difficult to construct a suitable reward function. We consider the PbRL framework, where human provide preferences between two behavior segments and then the agent trains under this supervision PaulFChristiano2017DeepRL ; BorjaIbarz2018RewardLF ; KiminLee2021PEBBLEFI ; JanLeike2018ScalableAA ; KiminLee2021BPrefBP . In this work, we follow the classic framework to learn a reward function $\widehat{r}_{\phi}$ from preferences, where the function is trained to be consistent with human feedback AaronWilson2012ABA ; PaulFChristiano2017DeepRL . In this framework, a segment $\sigma$ is a sequence of states and actions $\{\mathbf{s}_{h},\mathbf{a}_{h},\dots,\mathbf{s}_{h+H},\mathbf{a}_{h+H}\}$ . Transitions $(\mathbf{s}_{t},\mathbf{a}_{t},\mathbf{s}_{t+1})$ are stored in a replay buffer $\mathcal{B}$ , and we sample segments $\sigma$ from $\mathcal{B}$ . Then, we elicit preferences $y$ for segments $\sigma^{i}$ and $\sigma^{j}$ . More specifically, $y$ indicates which segment is preferred, i.e., $y\in\{0,1,0.5\}$ , where $0$ indicates $\sigma^{i}\succ\sigma^{j}$ (the event that segment $\sigma^{i}$ is preferable to $\sigma^{j}$ ), $1$ indicates $\sigma^{j}\succ\sigma^{i}$ ( $\sigma^{j}$ is preferable to $\sigma^{i}$ ), and $0.5$ implies an equally preferable case. The judgment is recorded in a dataset $D$ as a triple $(\sigma^{i},\sigma^{j},y)$ .

By following the Bradley-Terry model Bradley1952RankAO , we have a preference predictor as follows:

P_{\psi}[\sigma^{i}\succ\sigma^{j}]=\frac{\exp\sum_{t}\widehat{r}_{\psi}(\mathbf{s}_{t}^{i},\mathbf{a}_{t}^{i})}{\sum_{m\in\{i,j\}}\exp\sum_{t}\widehat{r}_{\psi}(\mathbf{s}_{t}^{m},\mathbf{a}_{t}^{m})},

(1)

where $\widehat{r}_{\psi}$ is the reward function. Intuitively, we assume that the probability of preferring one segment is exponentially proportional to the sum of an underlying reward function over the segment. While $\widehat{r}_{\psi}$ is not a binary classifier, learning $\widehat{r}_{\psi}$ amounts to binary classification with labels $y$ provided by an annotator. Concretely, the reward function, modeled as a neural network with parameters $\psi$ , is updated by minimizing the following loss:

\begin{split}\mathcal{L}^{\text{Reward}}=-\underset{(\sigma^{i},\sigma^{j},y)\sim D}{\mathbb{E}}\Big{[}y(0)\log P_{\psi}[\sigma^{i}\succ\sigma^{j}]+y(1)\log P_{\psi}[\sigma^{j}\succ\sigma^{i}]\Big{]},\end{split}

(2)

where $y(0)$ and $y(1)$ represent whether the preference label is 0 or 1, respectively. Finally, we use the learned reward function $\widehat{r}_{\psi}$ to train the policy.

3.3 Semi-supervised Learning in PbRL.

Here, we model PbRL into the framework of the semi-supervised learning. The labeled dataset $D_{L}$ is first formed with segment pairs $(\sigma^{i},\sigma^{j})$ sampled from trajectory buffer $\mathcal{B}$ and the preferences $y$ queried from human, e.g., $D_{L}:=\{(\sigma^{i},\sigma^{j},y)\}$ . We use $D_{U}$ to denote the dataset formed with segment pairs $(\sigma^{i},\sigma^{j},\cdot)$ without human labels.

4 Similarity Trap in PbRL

Similarity Trap. We argue that the direct utilization of consistency regularization in PbRL is inappropriate, although it is already one of default components in most SSL methods. In consistency regularization, the performance is substantially effected by the quality of noise addition (i.e., generating neighboring pairs), but it is not always clear whether those data-augmentation methods (e.g., random crop) in computer vision are suitable for segments, especially in “feature-based” RL¹¹1Here we slightly abuse the notation of “feature-based” RL to emphasize the difference from visual-based RL (states are images).. Consistency regularization generally assumes that the noise-added samples have the same labels as the previous ones. But as we will discuss below, labels in PbRL are highly susceptible to small perturbations in samples. Since the label in PbRL comes from human’s preferences between two segments, it mainly conveys the information of relationship, while the label in computer vision is only related to the feature of a single input.

Based on the fact that human can be highly “picky” on small perturbations when providing preferences for segment pairs. That is, the occurrence of particular action will result in one-vote veto by human for the entire segment. As an extreme example, in the case of autonomous driving, driving straightly should be preferred to driving crookedly in most cases. However, if the former doesn’t avoid pedestrians, then it is deadly and not supposed to be preferred by humans ever, regardless of how good the previous driving behavior was (see Figure 2). It is easy for humans to give the judgment that $\sigma^{1}\succ\sigma^{2}$ and $\widetilde{\sigma}^{1}\prec\sigma^{2}$ . In other words, similar segments can have a large gap to each other when evaluated based on different task metrics (people want comfortable driving behavior but safety is more important). When we prepare data for PbRL, such two similar segment pairs, $x_{1}=(\sigma^{1},\sigma^{2})$ and $\widetilde{x}_{1}=(\widetilde{\sigma}^{1},\sigma^{2})$ , have diametrically opposite labels $(1,0)$ and $(0,1)$ , despite their similarity. Such data with similar samples but different labels can significantly exacerbate the difficulty of self-training with consistency regularization. We name such phenomenon with PbRL-specific challenging data as the similarity trap. Next, we elaborate the reasons for such difficulty.

Explanation. For self-training in computer vision, ColinWei2021TheoreticalAO has proved that by assuming expansion and separation, the fitted model will denoise the pseudolabels and achieve high accuracy on the true labels. Expansion assumption intuitively states that the data distribution has good continuity within each class. On the other hand, ColinWei2021TheoreticalAO states separation assumption as follows (please refer to original paper for more details):

Assumption 4.1.

(Separation). ColinWei2021TheoreticalAO assumes that $P$ is $\mathcal{B}$ -separated with probability $1-\mu$ by ground-truth classifier $G^{*}$ : $R_{\mathcal{B}}(G^{*})\leq\mu$ . $P$ denotes a distribution of unlabeled examples over input space $\mathcal{X}$ , $\mathcal{T}$ denotes a set of transformations obtained via data augmentation, and $\mathcal{B}(x)\triangleq\left\{x^{\prime}:\exists T\in\mathcal{T}\text{such that}\left\|x^{\prime}-T(x)\right\|\leq r\right\}$ is defined as a set of points with distance $r$ from some data augmentation of $x$ .

As above, Separation assumption states that there exist a few neighboring pairs from different classes, with a small or negligible probability $\mu$ (e.g., inverse polynomial in dimension). However, in PbRL, the presence of similarity trap precludes the separation assumption. As the segment pair and its augmentation ( $x_{1}$ and $\widetilde{x}_{1}$ as in the above example) are neighboring to each other but most likely from different classes. Since the dimension of segment is extremely small compared to the image, any small perturbations have significant probability change its tendency to be preferred. This means separation assumption does not holds anymore in PbRL and this will weaken the effectiveness of using consistency regularization.

Discussion. At the same time, we emphasize that the similarity between $\sigma^{1}$ and $\widetilde{\sigma}^{1}$ is also highly confusing, which makes the learning of $(\sigma^{1},\widetilde{\sigma}^{1})$ difficult. Compared to the significantly different segment pairs, it is difficult to solicit information from such similar segment pairs and learn a reward function that can accurately locate step with fatal error. Semi-supervised framework makes such difference more significant. When the teacher model is not converged in training, it would tend to be more conservative and label $(\sigma^{1},\widetilde{\sigma}^{1})$ as $y^{k}=(0.5,0.5)$ , while neglecting the fatal error. When using such pseudo-label to train the student model iteratively, this can lead to further difficulties for the student to learn the desired discrimination. As concluded in zhu2021rich , semi-supervised learning follows the Matthew effect: “the rich get richer”. Consequently, similarity trap in PbRL would cause similar segment pairs be easy to become the poor, and “get poorer” when conducting semi-supervised learning.

To summarize, we claim that similarity trap in PbRL leads to two issues. First, data augmentation on segment pairs can produce disjoint data, which makes the consistency regularization less effective. Second, the learning of similar segment pairs is difficult although such pairs may not be proportionally significant, where this difficulty is exacerbated in semi-supervised scenarios.

5 STRAPPER

In this section, we present a new PbRL framework: Self-TRaining Augmented Preference-based learning via PEer Regularization (STRAPPER), which makes effective use of unlabeled segments by using a self-training approach and proposes peer regularization to deal with the issues induced by the similarity trap.

5.1 Self-training Augmented PbRL

In STRAPPER, we first train the parameters of the teacher model $\psi^{k}$ on the labeled dataset $D_{L}$ to minimize the Eq.2, which we abbreviate as

\begin{split}\mathcal{L}^{\text{Reward}}(\psi^{k})=\underset{(\sigma^{i},\sigma^{j},y)\sim D_{L}}{\mathbb{E}}\ell(P_{\psi^{k}}(\sigma^{i},\sigma^{j}),y).\end{split}

(3)

Then, we use the teacher model to generate pseudo preferences $y^{k}$ for unlabeled segments pairs $(\sigma^{i},\sigma^{j})$ which are sampled independently from the buffer $\mathcal{B}$ , $y^{k}=\left(P_{\psi^{k}}[\sigma^{i}\succ\sigma^{j}],P_{\psi^{k}}[\sigma^{j}\succ\sigma^{i}]\right),(\sigma^{i},\sigma^{j})\sim D_{U}$ . We augment the segment $\sigma$ to $\tilde{\sigma}$ for consistency regularization, so $y^{k}=\left(P_{\psi^{k}}[\tilde{\sigma}^{i}\succ\tilde{\sigma}^{j}],P_{\psi^{k}}[\tilde{\sigma}^{j}\succ\tilde{\sigma}^{i}]\right),(\sigma^{i},\sigma^{j})\sim D_{U}$ . Then, the triple $(\sigma^{i},\sigma^{j},y^{k})$ is recorded in the pseudo-labeled dataset $D_{U}^{k}$ . After that, the student model is trained with $D_{L}$ and $D_{U}^{k}$ :

\begin{split}\mathcal{L}^{\text{Reward}}_{\mathbf{CR}}(\psi^{k+1})=&\underset{(\sigma^{i},\sigma^{j},y)\sim D_{L}}{\mathbb{E}}\ell(P_{\psi^{k+1}}(\sigma^{i},\sigma^{j}),y)\\ +&\underset{(\sigma^{i},\sigma^{j},y^{k})\sim D_{U}^{k}}{\mathbb{E}}\ell(P_{\psi^{k+1}}(\sigma^{i},\sigma^{j}),y^{k}).\end{split}

(4)

For simplicity, we define the mixed dataset $D_{L}\cup D_{U}^{k}$ as $\tilde{D}={(\sigma^{i},\sigma^{j},\tilde{y})}$ , where $\tilde{y}=y,\forall{(\sigma^{i},\sigma^{j})\sim D_{L}}$ , and $\tilde{y}=y^{k},\forall{(\sigma^{i},\sigma^{j})\sim D_{U}}$ , to simplify the above expression as:

\begin{split}\mathcal{L}^{\text{Reward}}_{\mathbf{CR}}(\psi^{k+1})=\underset{(\sigma^{i},\sigma^{j},\tilde{y})\sim\tilde{D}}{\mathbb{E}}\ell(P_{\psi^{k+1}}(\sigma^{i},\sigma^{j}),\tilde{y}).\end{split}

(5)

5.2 Peer Regularization

Overall Loss Function. For semi-supervised problems in computer vision, it has been both empirically and theoretically proved that consistency regularization performs well. But as discussed in Section 4, the inherent phenomenon of similarity trap in PbRL poses difficulties for the direct use of consistency regularization. Due to the presence of the similarity trap, consistency regularization improperly increases the consistency of model predictions between disjoint segment pairs and reduces the confidence in reward learning, since the augmented distribution does not match with the original one. Inspired by YangLiu2020PeerLF , we propose peer regularization to deal with such issue. By adding peer regularization to Eq.5, we obtain the final loss function for training the student model as:

\begin{split}\mathcal{L}^{\text{Reward}}_{\text{peer}}(\psi^{k+1})=&\ \mathcal{L}^{\text{Reward}}_{\mathbf{CR}}(\psi^{k+1})\\ -&\underset{\begin{subarray}{c}{\color[rgb]{1,0,0}(\sigma^{i}_{n_{1}},\sigma^{j}_{n_{1}},\tilde{y}_{n_{1}})}\sim\tilde{D}\\ {\color[rgb]{0,0,1}(\sigma^{i}_{n_{2}},\sigma^{j}_{n_{2}},\tilde{y}_{n_{2}})}\sim\tilde{D}\end{subarray}}{\mathbb{E}}\ell\left(P_{\psi^{k+1}}{\color[rgb]{1,0,0}(\sigma^{i}_{n_{1}},\sigma^{j}_{n_{1}})},{\color[rgb]{0,0,1}\tilde{y}_{n_{2}}}\right),\end{split}

(6)

where $(\sigma^{i}_{n_{1}},\sigma^{j}_{n_{1}},\tilde{y}_{n_{1}})$ and $(\sigma^{i}_{n_{2}},\sigma^{j}_{n_{2}},\tilde{y}_{n_{2}})$ are two independently sampled segment pairs and corresponding labels. We call these two samples as peer samples. The second term in Eq.6, called peer regularization, encourages the “inconsistency” of model predictions between random sample paires. Intuitively, since $\tilde{y}_{n_{2}}$ would not provide any valuable information to label $(\sigma^{i}_{n_{1}},\sigma^{j}_{n_{1}})$ , we penalize the reward model memorizing uninformative labels. Formally, the peer regularization leads the training to generate more confident predictions:

Theorem 5.1.

HaoCheng2021LearningWI When minimizing $\mathcal{L}^{\text{Reward}}_{\text{peer}}$ , the optimal solution cannot simultaneously satisfy $P_{\psi^{k+1}}[\sigma^{i}\succ\sigma^{j}]>0$ and $P_{\psi^{k+1}}[\sigma^{j}\succ\sigma^{i}]>0$ .

The above theorem implies that peer regularization leads to either $P_{\psi^{k+1}}[\sigma^{i}\succ\sigma^{j}]\rightarrow 1$ or $P_{\psi^{k+1}}[\sigma^{j}\succ\sigma^{i}]\rightarrow 1$ , which indicates a confident prediction. This counteracts the negative impact of consistency regularization. At the same time, confident prediction can help distinguish confusing pairs of similar segments $(\sigma^{1},\widetilde{\sigma}^{1})$ , tending to pseudo-label $(0,1)$ or $(1,0)$ rather than $(0.5,0.5)$ . This would lead the pseudo-label to be closer to the real label.

Algorithm 1 STRAPPER

1:number of queries

M

and pseudo-label

N

per feedback session

2:Initialize parameters of

\widehat{r}_{\psi},\pi_{\phi}

, a dataset for query

D_{L}\leftarrow\emptyset

, a pseudo-labeled dataset

D_{U}^{k}\leftarrow\emptyset

and a replay buffer

\mathcal{B}\leftarrow\emptyset

3:// Reward learning

4:for

m

1\dots M

5: Sample

\sigma^{i},\sigma^{j}

from

\mathcal{B}

and query annotator for

y

D_{L}\leftarrow D_{L}\cup\{(\sigma^{i},\sigma^{j},y)\}

6: Optimize

\mathcal{L}^{Reward}_{\text{peer}}

in Eq.(6) w.r.t

\psi

using

\{(\sigma^{i},\sigma^{j},y)\}\sim D_{L}\cup D_{U}^{k}

7:end for

8:for

n

1\dots N

9: Sample

\sigma^{i},\sigma^{j}

from

\mathcal{B}

and infer the pseudo-label

y^{k}

using

P_{\psi^{k}}

D_{U}^{k}\leftarrow D_{U}^{k}\cup\{(\sigma^{i},\sigma^{j},y^{k})\}

10:end for

11:// Policy learning

12:for each timestep

t

13: Collect

\mathbf{s}_{t+1}

by taking

\mathbf{a}_{t}\sim\pi_{\phi}(\mathbf{a}_{t}\mid\mathbf{s}_{t})

\mathcal{B}\leftarrow\mathcal{B}\cup\{(\mathbf{s}_{t},\mathbf{a}_{t},\mathbf{s}_{t+1},\widehat{r}_{\psi}(\mathbf{s}_{t},\mathbf{a}))\}

14: Update

\pi_{\phi}

15:end for

Learning with Noisy Label. In addition, we further describe our proposed peer regularization in terms of learning with noisy labels. First, the similarity trap results in a noisy pseudo-labeling process. Because consistency regularization improperly increases the consistency of model predictions between disjoint segment pairs. Here, the noise is introduced by the training process and therefore independent of the data distribution. Second, PbRL is a binary classification problem. Such two problem settings are just consistent with that in YangLiu2020PeerLF . Therefore, we can also straightforward use the method proposed in YangLiu2020PeerLF for solving the problem of training with noisy labels to derive Eq.6 built on peer loss function.

5.3 Algorithm

To sum up, we provide the full procedure of STRAPPER in Algorithm 1, which is given along with PEBBLE KiminLee2021PEBBLEFI , an off-policy PbRL algorithm.

6 Experiments

In this section, we conduct our experiments to demonstrate the effectiveness of our proposed STRAPPER. To verify this, we start from investigating the first question: 1) How do various SSL alternatives combined with SOTA PbRL algorithm perform? Can consistency regularization improve them? Experiments show that current methods with consistency regularization can perform well but not on evey task due to the existence of similarity trap. So we secondly answer next question: 2) How does our proposed STRAPPER perform when peer regularization is further introduced to solve the problem brought by similarity trap? In addition, we further study the last question: 3) How sensitive is our proposed STRAPPER to the (intrinsic) noise of human preferences from non-experts?

6.1 Setups and Details

Simulate human annotators. Similar to prior work PaulFChristiano2017DeepRL ; KiminLee2021PEBBLEFI , we obtain feedback from simulated human instead of real humans. Following KiminLee2021BPrefBP , we first build simulated annotator to provide (rational and deterministic) preferences $y$ for queries $(\sigma^{i},\sigma^{j})$ , oracle behavior for short:

y=\left\{\begin{aligned} (1,0)&\quad\text{If \ \ }\textstyle\sum_{t=1}^{H}r(\mathbf{s}_{t}^{i},\mathbf{a}_{t}^{i})>\textstyle\sum_{t=1}^{H}r(\mathbf{s}_{t}^{j},\mathbf{a}_{t}^{j})\\ (0,1)&\quad\text{otherwise},\end{aligned}\right.

where $r(\cdot,\cdot)$ denotes the oracle reward function.

Table 1: Hyperparameters

Hyperparameter	Value
Initial temperature	0.1
Length of segment	50
Learning rate	0.0003 (Meta-world)
	0.0005 (Walker)
	0.0001 (Quadruped)
Critic target update freq	2
( $\beta_{1}$ , $\beta_{2}$ )	(0.9,0.999)
Frequency of feedback	5000 (Meta-world)
	20000 (Walker)
	30000 (Quadruped)
# of ensemble models $N_{en}$	3
Hidden units per each layer	1024 (DMControl), 256 (Meta-world)
# of layers	2 (DMControl), 3 (Meta-world)
Batch Size	1024 (DMControl), 512 (Meta-world)
Optimizer	Adam
Critic EMA $\tau$	0.005
Discount $\bar{\gamma}$	0.99
Maximum budget/	1000/100,100/10 (DMControl)
# of queries per session	10000/50,4000/20 (Meta-world)
	2000/25,400/10 (Meta-world)
# of pre-training steps	10000

Hyper-parameters. The implementation of STRAPPER is based on SURF and PEBBLE, and thus inherents the hyperparameter setting from PEBBLE KiminLee2021PEBBLEFI , which is shown in Table 1. An ensemble of three reward models are initialised, while the network structures differ among environments. The reward model for DMC consists of 2 fully-connected linear layers with 1024 neurons in each layer, whereas the one for Meta-World uses three hidden layers of 256 neurons instead. After 10,000 pre-training steps, the reward model are trained via ADAM optimiser. Afterwards, the model gets updated at a fixed frequency as long as the feedback budget is available. We set the update frequency to be 5000 steps for tasks in Meta-world, while the frequencies are 20,000 steps and 30,000 steps for Walker Walk and Quadruped Walk, respectively.

For the first-time learning immeadiately after the pre-training phase, we use uniform sampling to acquire segment pairs with a segment length of 50. To improve the feedback-efficiency for the training procedure, we switch to the disagreement-based sampling scheme for all subsequent feedback sessions. The aim of the disagreement-based method is to find queries with high uncertainty based on the outputs among ensemble of reward models. In each session, the number of labeled queries provided to the reward model is dependent on the training task together with the total feedback budget, and the details are specified in Table 1. On the other hand, as a large portion of unlablled queries are dropped due to the confidence threshold, we sample unlabeled queries as 10 times of labeled queries. We further increase this figure to 100 if the maximum feedback budget is equal or greater than 1000.

Computational resources. Experiments are conducted by using a computational cluster with 22x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for 7 days.

Benchmark tasks. For simulated human annotators, we take two locomotion tasks from DeepMind Control Suit tassa2018deepmind , Walker-walk and Quadruped-walk, and four robotic manipulation tasks from Meta-world yu2021metaworld , Window Open, Button Press, Drawer Open and Door Open.

6.2 Different Semi-Supervised Learning Alternatives

We first combine the SOTA PbRL algorithm PEBBLE KiminLee2021PEBBLEFI with the following four alternative SSL methods: 1) Pseudo-labeling (PL) lee2013pseudo : the teacher model generates confident pseudo-labels using a fixed threshold and then updates student model using cross-entropy loss. This is the SSL method employed in SURF park2022surf . 2) vanilla Consistency Regularization (CR) xie2019unsupervised : constrains the output of student model between two augmented inputs using MSEloss without using pseudo-labels. 3) FixMatch(FM) sohn2020fixmatch : generates pseudo-labels with weak augmented inputs and then updates the student model with strong augmented input. 4) self-training with Noisy Student (NS) QizheXie2020SelfTrainingWN : only adds noise to the learning process of the student model. The latter three methods all employ consistency regularization as a component and use random amplitude scaling to augment the input. Random amplitude scaling multiplies a uniform random variable $z$ to the state, i.e., $\hat{s}=s\cdot z$ , where $z\sim\text{Unif}[0.995,1.005]$ . PEBBLE and SAC using the ground truth reward are also added for benchmark comparison.

Figure 3 shows SAC, PEBBLE and PEBBLE with four alternative SSL methods on locomotion tasks and robotic manipulation tasks. As shown in Fig.3a and Fig.3b, all four SSL methods improve the performance of PEBBLE (orange). And the baseline methods that utilize consistency regularization, vanilla Consistency Regularization (red), self-training with Noisy Student (purple) and FixMatch (brown), outperform or are competitive with Pseudo Labeling (green). But when facing with robotic manipulation tasks as shown in Fig.3c and Fig.3d, these three methods perform poorly. We will show in next subsection that STRAPPER we proposed can alleviate such problem on this type of tasks.

6.3 STRAPPER Performs on Benchmark Experiments

Benchmark Experiments on Robotic Manipulation Tasks. As in Section 6.2, SSL methods using consistency regularization perform well on locomotion tasks (like Walker and Quadruped) but fail on robotic manipulation tasks (like Window Open and Button Press). We attribute this to the fact that similarity trap occurs more frequently in robotic manipulation tasks, affecting the direct use of consistency regularization. Specifically, compared to locomotion tasks, most of robotic manipulation tasks require an agent to interact with objects in the environment. In such scenarios, certain states require precise operation, and subtle differences can bring about serious mistakes. Therefore we mainly focus on comparation between STRAPPER and SOTA baselines on robotic manipulation tasks. PEBBLE+SURF stands for the original implementation of SURF park2022surf , which uses PL, as mentioned in Section 6.2. The temporal data augmentation methods described in park2022surf are employed in both PEBBLE+SURF and STRAPPER. As shown in Fig.4, our method (red) outperforms or is competitive with baselines in robotic manipulation tasks. This empirically demonstrates that the peer regularization we proposed in STRAPPER can counteract the side effect of similarity trap when consistency regularization is used.

More Benchmark Experiments on Locomotion Tasks. As we mention above, the baseline methods that utilizing consistency regularization without the Peer Regularization can improve the performance upon the pseudo labeling. However, when we use Peer Regularization on such locomotion tasks, it made the performance drop, shown as STRAPPER in Fig.5. We believe this is due to the fact that, for locomotion tasks, small differences are not enough to cause fatal errors, so the impact of similarity trap is small. On the contrary, the addition of Peer Regularization causes a certain degree of exploratory deficiency, which in turn leads to performance degradation.

6.4 Experiments on Non-expert Annotators

We have shown in the experiments above that the introduction of Peer Regularization in some types of environments can improve the effectiveness of consistency regularization. We will further show that STRAPPER has additional benefits when a more realistic experimental setting is considered. In reality, human annotators are not always perfectly rational, which may provide some stochastic preferences for query. To address this, we generate noisy preferences using a stochastic model as:

P[\sigma^{i}\succ\sigma^{j}]=\frac{\exp(\beta\textstyle\sum_{t=1}^{H}\gamma^{H-t}r(\mathbf{s}_{t}^{i},\mathbf{a}_{t}^{i}))}{\sum_{m\in\{i,j\}}\exp(\beta\textstyle\sum_{t=1}^{H}\gamma^{H-t}r(\mathbf{s}_{t}^{m},\mathbf{a}_{t}^{m}))},

where $\gamma\in(0,1]$ is a discount factor, modeling myopic behavior, and $\beta\in(0,\infty]$ , modeling stochastic behavior (expert queries become perfectly rational and deterministic as $\beta\to\infty$ ). To imitate the accidental error of human expert, we flip the preference with probability of $\epsilon$ , denoted as mistake behavior. If both segments (for query) do not contain a desired behavior, human would like to discard the query; we model this noisy behavior as skipping behavior. Further, if both segments have similar returns, human would like to provide a preference with $(0.5,0.5)$ ; we model this behavior as equally behavior²²2For more behavioral details, we refer the readers to B-Pref benchmark KiminLee2021BPrefBP .. To investigate the influence of noisy preference on PbRL algorithms and study whether STRAPPER is robust to the noisy-labels, we conduct experiments with progressively enhanced noisy labels.

In Figure 6a, we plot the normalized return versus the strength of noise. We observe that the straightforward PEBBLE tends to be brittle and sensitive to the noisy preferences, especially under high-level noise. Similarly, we find that the training across three semi-supervised baselines (CR, PL, NS) can still be unstable despite of incorporating the unlabeled segments. In contrast, STRAPPER yields more robust and stable performance, which consistently outperforms the straightforward PbRL as well as semi-supervised baselines. This result serves as an evidence that unlabeled segments can be useful to eliminate the intrinsic noise in preferences.

Further, we consider both quantity-reduced and noisy preferences in Walker-walk task. In Figure 6b, we illustrate the performance of PEBBLE (top row) and PEBBLE+STRAPPER (bottom row). For PEBBLE, in this specific task, the quantity of preference seems to matter more than the quality. By reducing the amount of preference, the performance drops significantly. We observe that, by incorporating the unlabeled segments, STRAPPER eliminates the effect of the lack of labeled segments and thus leads to better performance, even under a high noise ratio.

7 Discussion and Future Work

In this work, we show how to improve performance in PbRL using semi-supervised alternatives. We observe a crucial phenomenon defined as similarity trap that hinders direct applying consistency regularization to PbRL, and then propose a novel peer regularization when training the student model to fix the issue. Empirically, we demonstrate that our proposed STRAPPER consistently outperforms prior PbRL baselines on complex locomotion and robotic manipulation tasks from DeepMind Control Suite and Meta-world. Further, we verify that our STRAPPER is useful to eliminate the intrinsic noise in preferences.

STRAPPER has a number of limitations. First, the current feedback acquisition still uses scripted teachers, i.e., the preference is obtained from the comparison of returns. However, whether this is consistent with the way humans give, the preference requires further study. Second, although semi-supervised methods improve the effectiveness of human feedback, feedback data during training is still indispensable. This limits the large-scale application of PbRL to a certain extent. It would be an interesting future work if the idea of Offline RL could be borrowed to train PbRL from data without feedbacks.

\bmhead

Supplementary information Our code is based on repository of the PEBBLE algorithm (https://github.com/rll-research/BPref). We provide our source code in the supplementary material.

References

\bibcommenthead
(1) Kohl, N., Stone, P.: Policy gradient reinforcement learning for fast quadrupedal locomotion. In: IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004 (2004). https://doi.org/10.1109/robot.2004.1307456
(2) Kober, J., Peters, J.: Policy search for motor primitives in robotics. In: Neural Information Processing Systems (2008)
(3) Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32(11), 1238–1274 (2013). https://doi.org/10.1177/0278364913495721
(4) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017). https://doi.org/10.1038/nature24270
(5) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., Levine, S.: QT-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Computer Vision and Pattern Recognition (2018)
(6) Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J.P., Jaderberg, M., Vezhnevets, A.S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T.L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., Silver, D.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019). https://doi.org/10.1038/s41586-019-1724-z
(7) Amodei, D., Olah, C., Steinhardt, J., Christiano, P.F., Schulman, J., Mané, D.: Concrete problems in AI safety. arXiv: Artificial Intelligence (2016)
(8) Shah, R., Krasheninnikov, D., Alexander, J., Abbeel, P., Dragan, A.D.: Preferences implicit in the state of the world. arXiv: Learning (2019)
(9) Turner, A.M., Ratzlaff, N., Tadepalli, P.: Avoiding side effects in complex environments. arXiv: Artificial Intelligence (2020)
(10) Christiano, P.F., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. arXiv: Machine Learning (2017)
(11) Bıyık, E., Sadigh, D.: Batch active preference-based learning of reward functions. arXiv: Learning (2018)
(12) Sadigh, D., Dragan, A., Sastry, S., Seshia, S.: Active preference-based learning of reward functions. In: Robotics: Science and Systems XIII (2017). https://doi.org/10.15607/rss.2017.xiii.053
(13) Biyik, E., Huynh, N., Kochenderfer, M., Sadigh, D.: Active preference-based Gaussian process regression for reward learning. In: Robotics: Science and Systems XVI (2020). https://doi.org/10.15607/rss.2020.xvi.041
(14) Lee, K., Smith, L., Dragan, A., Abbeel, P.: B-pref: Benchmarking preference-based reinforcement learning. (2021)
(15) Shin, D., Brown, D.S.: Offline preference-based apprenticeship learning. arXiv preprint arXiv:2107.09251 (2021)
(16) Lee, K., Smith, L.M., Abbeel, P.: PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In: International Conference on Machine Learning (2021)
(17) Liang, X., Shu, K., Lee, K., Abbeel, P.: Reward uncertainty for exploration in preference-based reinforcement learning. In: Deep RL Workshop NeurIPS 2021 (2021)
(18) Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., Amodei, D.: Reward learning from human preferences and demonstrations in Atari. In: Neural Information Processing Systems (2018)
(19) Chapelle, O., Scholkopf, B., Zien, A. Eds.: Semi-supervised learning (Chapelle, o. et al., eds.; 2006) [Book reviews]. IEEE Trans. Neural Netw. 20(3), 542–542 (2009). https://doi.org/10.1109/tnn.2009.2015974
(20) Park, J., Seo, Y., Shin, J., Lee, H., Abbeel, P., Lee, K.: Surf: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning. arXiv preprint arXiv:2203.10050 (2022)
(21) Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)
(22) Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems 29 (2016)
(23) Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., Le, Q.V.: Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848 (2019)
(24) Liu, Y., Guo, H.: Peer loss functions: Learning from noisy labels without knowing noise rates. In: International Conference on Machine Learning (2020)
(25) Arumugam, D., Lee, J.K., Saskin, S., Littman, M.L.: Deep reinforcement learning from policy-dependent human feedback. arXiv: Learning (2019)
(26) Knox, W.B., Stone, P.: Interactively shaping agents via human reinforcement. In: Proceedings of the Fifth International Conference on Knowledge Capture - K-CAP ’09 (2009). https://doi.org/10.1145/1597735.1597738
(27) Warnell, G., Waytowich, N.R., Lawhern, V.J., Stone, P.: Deep TAMER: Interactive agent shaping in high-dimensional state spaces. arXiv: Artificial Intelligence (2017)
(28) Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics - (1995). https://doi.org/10.3115/981658.981684
(29) Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the Ninth International Conference on Information and Knowledge Management - CIKM ’00 (2000). https://doi.org/10.1145/354756.354805
(30) Zhu, X.J.: Semi-supervised learning literature survey (2005)
(31) Rosenberg, C., Hebert, M., Schneiderman, H.: Semi-supervised self-training of object detection models. In: 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION’05) - Volume 1 (2005). https://doi.org/10.1109/acvmot.2005.107
(32) Lee, D.-H., et al.: Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML, vol. 3, p. 896 (2013)
(33) Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. Adv Neural Inf Process Syst 27, 3365–3373 (2014)
(34) Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. Advances in neural information processing systems 28 (2015)
(35) Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E.D., Kurakin, A., Zhang, H., Raffel, C.: FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence (2020)
(36) Xie, Q., Luong, M.-T., Hovy, E., Le, Q.V.: Self-training with noisy student improves ImageNet classification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020). https://doi.org/10.1109/cvpr42600.2020.01070
(37) Zhu, Z., Luo, T., Liu, Y.: The Rich Get Richer: Disparate Impact of Semi-Supervised Learning (2021)
(38) Sutton, R., Barto, A.G.: Reinforcement learning: An introduction. (1988)
(39) Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., Legg, S.: Scalable agent alignment via reward modeling: A research direction. arXiv: Learning (2018)
(40) Wilson, A., Fern, A., Tadepalli, P.: A Bayesian approach for policy learning from trajectory preference queries. In: Neural Information Processing Systems (2012)
(41) Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39(3/4), 324 (1952). https://doi.org/10.2307/2334029
(42) Wei, C., Shen, K., Chen, Y., Ma, T.: Theoretical analysis of self-training with deep networks on unlabeled data. In: International Conference on Learning Representations (2021)
(43) Zhu, Z., Liu, T., Liu, Y.: A second-order approach to learning with instance-dependent label noise. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021). https://doi.org/10.1109/cvpr46437.2021.00998
(44) Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., Riedmiller, M.: DeepMind Control Suite (2018)
(45) Yu, T., Quillen, D., He, Z., Julian, R., Narayan, A., Shively, H., Bellathur, A., Hausman, K., Finn, C., Levine, S.: Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning (2021)