
Using Machine Teaching to Investigate Human Assumptions when Teaching Reinforcement Learners

Yun-Shiuan Chuang¹
Department of Computer Science and Psychology
University of Wisconsin - Madison
Madison, WI 53706
[email protected]

Xuezhou Zhang¹
Department of Computer Science
University of Wisconsin - Madison
Madison, WI 53706
[email protected]

Yuzhe Ma
Department of Computer Science
University of Wisconsin - Madison
Madison, WI 53706
[email protected]

Mark K. Ho
Department of Computer Science and Psychology
Princeton University
Princeton, NJ 08540
[email protected]

Joseph L. Austerweil²
Department of Psychology and Computer Science
University of Wisconsin - Madison
Madison, WI 53706
[email protected]

Xiaojin Zhu²
Department of Computer Science
University of Wisconsin - Madison
Madison, WI 53706
[email protected]

¹Equal contribution. ²Co-senior author.
Abstract

Successful teaching requires an assumption of how the learner learns - how the learner uses experiences from the world to update their internal states. We investigate what expectations people have about a learner when they teach them in an online manner using rewards and punishment. We focus on a common reinforcement learning method, Q-learning, and examine what assumptions people have using a behavioral experiment. To do so, we first establish a normative standard, by formulating the problem as a machine teaching optimization problem. To solve the machine teaching optimization problem, we use a deep learning approximation method which simulates learners in the environment and learns to predict how feedback affects the learner's internal states. What do people assume about a learner's learning and discount rates when they teach them an idealized exploration-exploitation task? In a behavioral experiment, we find that people can teach the task to Q-learners in a relatively efficient and effective manner when the learner uses a small value for its discounting rate and a large value for its learning rate. However, their teaching is still suboptimal. We also find that providing people with real-time updates of how possible feedback would affect the Q-learner's internal states weakly helps them teach. Our results reveal how people teach using evaluative feedback and provide guidance for how engineers should design machine agents in a manner that is intuitive for people.

1 Introduction

People regularly teach other agents (e.g., children, pets, machines) in their environment using evaluative feedback (rewards and punishment). For example, Andrew is teaching his six-year-old daughter Jane to forage for wild berries. To do so, he first brings her to a bush with edible berries. What does he do next? If his goal is purely for her to eat some wild berries, then he has achieved his goal. However, to teach her to forage in a robust manner, he must provide her rewards to incentivize her to leave that bush and seek new ones. How do people teach agents using rewards and punishments, and how does their teaching depend on their knowledge of the internal dynamics of how the learner updates their beliefs?

Although not always presented from this perspective, foraging is an instance of the exploration-exploitation problem within reinforcement learning: How should an agent balance exploiting rewards based on its current knowledge against exploring for new berries? This problem has been studied extensively for humans, animals, and idealized agents. Many natural agents (adults, children, animals, and other living creatures; [3, 6, 26, 29]) stop exploiting rewards at a state to explore, which provides an implicit or explicit punishment to the agent, in a manner that is optimal within their ecological niche. However, the mechanisms they use to learn and change their behavior are often tuned to assumptions about their environment, and do not adjust well to other environments. Although foraging itself (and learning while foraging) has been studied extensively within reinforcement learning and other mathematical frameworks, to the best of our knowledge, teaching others to forage is an open question. In this paper, we explore this question in the setting of teaching Q-learning agents a full policy in a task that requires teaching sub-optimal actions so that the learner learns what it should do in states it otherwise would not encounter (because it starts close to its ultimate goal).

Teaching the correct actions to take in a domain while exploring the domain is a social task involving the interaction of the teacher, a learner, and the environment. Researchers have examined how people teach others, formalized this process, and created automated methods to teach. One unifying computational framework across these areas is the Bayesian pedagogy framework, where the learner and teacher are Bayesian agents that assume both know the teacher is providing information to help the learner [27]. However, this framework is mechanism-agnostic and assumes humans are doing ideal Bayesian updates, which can be a problematic assumption. Instead, we take the perspective of machine teaching, where we examine human teaching from the perspective of optimized teaching for a particular learning mechanism – Q-learning algorithms.

Q-learning is a family of model-free reinforcement learning algorithms that is known to learn the optimal policy as the agent interacts with the MDP [30], undisturbed by a teacher. Instead, we allow a human teacher to intervene in the agent-MDP loop, specifically by changing the reward signal, with the hope of teaching the optimal policy faster. Concretely, we design multiple Q-learning agents (students) that differ in a number of critical parameters. If human teachers can teach some agents better than others, those agents' parameters are closer to human teachers' assumptions about how students learn.

2 Prior Work

From children to adults, when asked to teach, people provide different information than if they are simply asked to convey some information to another learner. For social situations that involve goals and rewards, assuming people are optimal agents within a Markov Decision Process captures human behavior well. For example, when asked to show how to do a task, people will take actions that are strictly unnecessary for completing the task, but convey information to a learner. However, if they are asked to do a task, they only do the necessary actions for completing the task  [11].

Recent work in cognitive science and human-machine interaction has explored human teaching strategies and to what extent they are optimal. For example, work in Bayesian pedagogy [27] has shown that when teaching a range of simple concepts by example, people are capable of teaching other people near optimally. Similar research on linguistic pragmatics [7, 5] demonstrates that people’s use of language reflects an intention to be optimally informative. Moreover, these findings on human teaching have been shown to generalize to more complex settings, such as sequential decision making [11]. However, more complex settings also make it more likely that human teaching and learning processes are misaligned [12, 10], which motivates research into learning algorithms tailored to human teaching strategies [16, 21].

Although the Bayesian pedagogy and social cognition literature provide useful perspectives on how people ought to teach others who know they are being taught, they do so for an idealized Bayesian agent assumed to be maximizing environmentally provided rewards. Although this is a useful assumption for examining human teaching, people do not teach or learn like idealized Bayesian agents. In the 1990s, researchers learned how difficult it can be to train a reinforcement learner to complete simple tasks that contain necessary conditions which need to be completed before the learner completes the last steps of the task, even though those last steps are closer in state space to its current location [e.g., 23]. Based on this intuition, recent work demonstrated that people fail to teach simple model-free and model-based reinforcement learners how to get from a start state to an end state while staying on a trail in a $3\times 3$ Grid World [10]. The learners often pick up on "positive net reward cycles" which enable them to get arbitrarily large reward while not completing the task.

Computational Teaching: Since computational teaching was first proposed in [28], teaching has been studied in various learning settings; see [35] for a recent survey. Of particular interest to us are works on teaching online learners such as Online Gradient Descent (OGD) [18, 17], active learners [8, 24], and other sequential learners [13, 22, 34, 14, 19, 32]. An optimal control formulation is required when the learner is sequential. Several recent works studied teaching for Inverse Reinforcement Learning (IRL) [31, 15, 1, 9, 2], where the learner learns from teacher demonstrations. Finally, computational teaching in reinforcement learning has been studied recently [33, 25, 20], where the teacher teaches via rewards and/or state transitions. Our work instead focuses on using computational teaching theory to understand how humans teach.

3 Preparing a Family of Reinforcement Learners for Teaching

In the teaching problem of reinforcement learning, we study the interaction between three entities: the RL agent (student), the teacher, and the underlying environment. In this work, we assume that the environment is a rewardless Markov Decision Process (MDP) parametrized by $M=(\mathcal{S},\mathcal{A},P,\mu_{0})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$ is the transition probability, and $\mu_{0}:\mathcal{S}\rightarrow\mathbb{R}$ is the initial state distribution.

3.1 A Family of Q-Learners

In our human experiments, we focus on a family of Q-learners, denoted by $\mathcal{L}$, that has been widely studied as a potential theory of mind for humans. We describe two variants: (1) standard Q-learning and (2) Action Signaling (AS) [10]. The two learners mainly differ in their internal knowledge representation and how they perform learning updates.

A Q-learning agent stores an estimate of the Q table, $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, which approximates the future cumulative rewards that the agent can receive after performing an action $a\in\mathcal{A}$ in a state $s\in\mathcal{S}$. The learning update rule of Q-learning is defined by two parameters, the learning rate $\alpha$ and the discounting factor $\gamma$. $\alpha$ determines how aggressively the learner updates its current beliefs given new experience, and $\gamma$ indicates how much the learner values future rewards compared to immediate rewards. Specifically, given a new piece of experience $e_{t}=(s_{t},a_{t},s_{t+1},r_{t})$, Q-learning only updates the $(s_{t},a_{t})$ entry of its Q table as

Q_{t+1}(s_{t},a_{t})=(1-\alpha)Q_{t}(s_{t},a_{t})+\alpha\left(r_{t}+\gamma\max_{a'}Q_{t}(s_{t+1},a')\right) (1)
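
To make the update concrete, here is a minimal Python sketch of the tabular rule in Eq. (1); the array shapes, parameter values, and function names are illustrative assumptions on our part, not the code used in our experiments.

```python
import numpy as np

def q_learning_update(Q, s, a, s_next, r, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (Eq. 1): only the (s, a) entry changes."""
    target = r + gamma * np.max(Q[s_next])            # bootstrapped estimate of future value
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Example: 4 states x 2 actions (left/right), all-zero initial Q table.
Q = np.zeros((4, 2))
Q = q_learning_update(Q, s=3, a=1, s_next=3, r=-0.5)
```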

An Action Signaling (AS1) agent stores a multinomial distribution over actions for each state, representing its current belief distribution of the optimal action [10]. In this paper, we represent the multinomial distribution also with a table, $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, where $\sum_{a\in\mathcal{A}}Q(s,a)=1$. The learning update of an AS agent is defined by a single parameter, the learning rate $\kappa$, which serves the same role as $\alpha$ in Q-learning. Given a new piece of experience $e_{t}=(s_{t},a_{t},s_{t+1},r_{t})$, an AS agent will update the belief distribution over the current state $s_{t}$. If $r_{t}>0$, the probability w.r.t. $a_{t}$ will increase, whereas if $r_{t}<0$, the probability w.r.t. $a_{t}$ will decrease. In contrast to Q-learning, an AS agent does not make use of the next state $s_{t+1}$, and does not aim at optimizing long term reward. Specifically,

Q_{t+1}(s_{t},a)=\frac{Q_{t}(s_{t},a)\,e^{1_{[a=a_{t}]}\kappa r_{t}}}{\sum_{b\in\mathcal{A}}Q_{t}(s_{t},b)\,e^{1_{[b=a_{t}]}\kappa r_{t}}},\;\forall a\in\mathcal{A}. (2)

We further observe that even without the normalization step, $\operatorname*{\arg\max}_{a}Q_{t}(s,a)$ will not change. Therefore, the learning update rule (2) can be equivalently changed to $Q_{t+1}(s_{t},a)=Q_{t}(s_{t},a)e^{1_{[a=a_{t}]}\kappa r_{t}}$. We again observe that if, instead of $Q_{t}(s,a)$, we store $Q'_{t}(s,a)=\log Q_{t}(s,a)$, then the update rule becomes $Q'_{t+1}(s_{t},a)=Q'_{t}(s_{t},a)+1_{[a=a_{t}]}\kappa r_{t}$. Finally, we can drop the $\kappa$ factor, since it is a constant scaling on all $(s,a)$-pairs, and we get $Q'_{t+1}(s_{t},a_{t})=Q'_{t}(s_{t},a_{t})+r_{t}$. Therefore, we have established that AS1 is equivalent to a variant of Q-learning which simply stores the sum of rewards received on each $(s,a)$-pair. This motivates us to define another variant of the AS agent that, instead of taking the sum of rewards, takes the average of rewards on each $(s,a)$-pair.
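
As a sanity check on this equivalence, the following small Python sketch (our own illustrative code, not from the paper) runs the normalized AS1 update of Eq. (2) alongside the log-space sum-of-rewards form and confirms that they induce the same greedy action at every state.

```python
import numpy as np

def as1_update(p, s, a, r, kappa=1.0):
    """Normalized multiplicative AS1 update, Eq. (2)."""
    w = p[s] * np.exp(kappa * r * (np.arange(p.shape[1]) == a))
    p[s] = w / w.sum()
    return p

def as1_logspace_update(Q, s, a, r, kappa=1.0):
    """Equivalent log-space form: accumulate (scaled) rewards on the chosen (s, a) pair."""
    Q[s, a] += kappa * r
    return Q

rng = np.random.default_rng(0)
p, Q = np.full((4, 2), 0.5), np.zeros((4, 2))
for _ in range(20):
    s, a, r = rng.integers(4), rng.integers(2), rng.uniform(-1, 1)
    p, Q = as1_update(p, s, a, r), as1_logspace_update(Q, s, a, r)
assert np.array_equal(p.argmax(axis=1), Q.argmax(axis=1))   # same greedy policy
```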

An Action Signaling (AS2) agent stores the average of rewards on each $(s,a)$-pair, whose update rule can be equivalently written as

Q_{t+1}(s,a)=\left(1-\frac{1}{n_{t}(s,a)}\right)Q_{t}(s,a)+\frac{1}{n_{t}(s,a)}r_{t} (3)

where $n_{t}(s,a)$ denotes the number of times the current $(s,a)$-pair has been visited. Similar to AS1, AS2 does not take future rewards into account, and is in fact equivalent to a Q-learning agent with $\gamma=0$ and time-varying learning rate $\alpha_{t}=1/n_{t}(s,a)$. For readers familiar with the classic RL/multi-armed bandit literature, the AS2 agent is equivalent to a multi-armed bandit algorithm that treats the MDP as $|\mathcal{S}|$ parallel $|\mathcal{A}|$-armed bandits.
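
A minimal sketch of this running-average update (again illustrative code under our own naming assumptions):

```python
import numpy as np

def as2_update(Q, n, s, a, r):
    """AS2 update (Eq. 3): Q(s, a) tracks the mean reward received at (s, a)."""
    n[s, a] += 1
    Q[s, a] += (r - Q[s, a]) / n[s, a]   # incremental mean, i.e. alpha_t = 1/n_t(s, a)
    return Q, n

Q, n = np.zeros((4, 2)), np.zeros((4, 2), dtype=int)
for r in (1.0, -1.0, 1.0):
    Q, n = as2_update(Q, n, s=2, a=0, r=r)
print(Q[2, 0])   # 0.333..., the average of the three rewards at (s=2, a=0)
```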

We further assume that all learners behave according to the $\varepsilon$-greedy policy w.r.t. the current Q table, i.e.

a_{t}=\pi_{t}^{\varepsilon}(s_{t})=\begin{cases}\operatorname*{\arg\max}_{a}Q_{t}(s_{t},a),&\mbox{w.p. }1-\varepsilon\mbox{, break ties uniformly}\\\mbox{uniform from }\mathcal{A},&\mbox{w.p. }\varepsilon.\end{cases} (4)
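
A short Python sketch of this behavior policy (an illustrative implementation; tie-breaking and random-number handling are our own choices):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Eq. (4): exploit the current Q row w.p. 1 - epsilon, explore uniformly w.p. epsilon."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))          # uniform random action
    best = np.flatnonzero(Q[s] == Q[s].max())         # all maximizing actions
    return int(rng.choice(best))                      # break ties uniformly
```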

We distinguish two teaching settings. In a white-box teaching setting, we assume an omniscient teacher, who has knowledge of the underlying MDP $M$ and the learning agent's parameters. It also observes the current interaction $s_{t},a_{t},s_{t+1}$ as well as the internal state of the agent $Q_{t}$, before providing the reward signal $r_{t}$. In a black-box teaching setting, we assume that the teacher can still observe $(s_{t},a_{t},s_{t+1},Q_{t})$ but does not know the precise learner update rule, and therefore cannot accurately predict the consequence of the current reward signal.

The teacher's goal is to drive the learner to learn a target policy $\pi^{\dagger}$. For example, the teacher may want the learner to always perform a specific action $a^{\dagger}\in\mathcal{A}$ at state $s^{\dagger}\in\mathcal{S}$. This example goal can be expressed as a target set of Q tables, $\mathcal{Q}^{\dagger}:=\{Q:Q(s^{\dagger},a^{\dagger})>Q(s^{\dagger},a),\forall a\neq a^{\dagger}\}$. The teaching succeeds if the learner's $Q_{t}$ falls into $\mathcal{Q}^{\dagger}$ at some time step $t$, in which case the teaching process terminates. We are interested in computing a teaching strategy that achieves the teaching goal in the fewest time steps.
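
For the full target policy used later (teach the dog to go right at every tile), one natural reading of the target set is that the target action's Q value strictly dominates all others at every state. A Python sketch of that termination check (our own helper, not the paper's code):

```python
import numpy as np

def policy_taught(Q, target_policy):
    """True when Q lies in the target set: at every state, the target action's
    Q value strictly exceeds that of every other action."""
    for s, a_dagger in enumerate(target_policy):
        others = np.delete(Q[s], a_dagger)
        if not np.all(Q[s, a_dagger] > others):
            return False
    return True

# Dog task: "go right" (action 1) at each of the 4 tiles.
target = [1, 1, 1, 1]
```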

3.2 An Optimal Teacher can Teach the Family of Q-Learners Equally Fast

One of our main observations is that the optimal teaching problem in the white-box teaching setting forms a higher level teaching MDP $\mathcal{N}=(\Xi,\Delta,\rho,\tau)$:

  • The teacher observes the teacher state $\xi_{t}\in\Xi$, which jointly characterizes the environment and the learner at time $t$: $\xi_{t}:=(s_{t},a_{t},s_{t}',Q_{t})$.

  • The teacher's action space consists of all possible rewards $r_{t}\in\mathbb{R}=\Delta$.

  • The teacher receives a constant cost of $\rho_{t}=1$ for every time step before the teaching goal is accomplished.

  • The teaching state transition probability is specified by $\tau(\xi_{t+1}\mid\xi_{t},r_{t})$. The new teacher state $\xi_{t+1}=(s_{t+1},a_{t+1},s'_{t+1},Q_{t+1})$ is generated as follows: $s_{t+1}$ is copied from $s_{t}'$ in $\xi_{t}$; $a_{t+1}\sim\pi^{\varepsilon}_{t+1}(s_{t+1})$; $s'_{t+1}\sim P(\cdot|s_{t+1},a_{t+1})$; $Q_{t+1}$ is the learner's updated Q table.

The optimal teaching strategy is one where the teacher minimizes its cumulative teaching cost, $\sum_{t=0}^{T}\rho_{t}$, s.t. $Q_{T}\in\mathcal{Q}^{\dagger}$. Due to the randomness of the MDP as well as the learner's behavior policy, this quantity is a random variable, and thus we instead minimize its expected value. Formally, the teacher seeks a time-invariant teaching policy $\phi^{*}$: $\phi^{*}=\operatorname*{\arg\min}_{\phi:\Xi\mapsto\Delta}\mathbb{E}_{\mathcal{N}}\left[\sum_{t=0}^{T}\rho_{t}\mbox{, s.t. }Q_{T}\in\mathcal{Q}^{\dagger}\right]$. We will also refer to the shortest expected number of time steps to achieve the teaching goal as the teaching dimension of the corresponding teaching problem instance, i.e.

TD(M,L,Q_{0},\pi^{\dagger})=\min_{\phi:\Xi\mapsto\Delta}\mathbb{E}_{\mathcal{N}}\left[\sum_{t=0}^{T}\rho_{t}\mbox{, s.t. }Q_{T}\in\mathcal{Q}^{\dagger}\right] (5)
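
To make the teaching MDP concrete, here is a sketch of a single teaching rollout. The `env`, `learner`, and `teacher_policy` interfaces, and the `policy_taught` helper from the earlier sketch, are our own assumptions; the actual teacher is either the TD3-trained policy described below or a human participant.

```python
import numpy as np

def run_teaching_episode(env, learner, teacher_policy, target, max_steps=200, rng=None):
    """Roll out the teaching MDP: the teacher observes xi_t = (s_t, a_t, s'_t, Q_t),
    picks a reward r_t, and pays cost 1 per step until Q_t enters the target set."""
    rng = rng or np.random.default_rng()
    s = env.reset()
    for t in range(max_steps):
        a = learner.act(s, rng)                        # epsilon-greedy w.r.t. current Q
        s_next = env.step(s, a)                        # rewardless MDP transition
        r = teacher_policy(s, a, s_next, learner.Q)    # teacher chooses the reward signal
        learner.update(s, a, s_next, r)                # learner-specific Q update
        if policy_taught(learner.Q, target):           # teaching goal reached
            return t + 1                               # cumulative cost = number of steps
        s = s_next
    return max_steps                                   # did not teach within the budget
```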

The optimal control formulation allows us to computationally solve for the optimal teaching strategy on any specific teaching instance, defined by the MDP, learner type and target policy to be taught, using any optimal control solver. In this paper, we use Twin Delayed DDPG (TD3) [4], a state-of-the-art Deep Reinforcement Learning (DRL) algorithm that solves continuous control problems.

Our second theoretical insight is that any learner in the Q-learning family $\mathcal{L}$ has the same teaching dimension. Specifically,

Theorem 1.

For any two teaching instances defined by $(M,L,Q_{0},\pi^{\dagger})$ and $(M,L',Q'_{0},\pi^{\dagger})$, where $L$ and $L'$ are two Q-learning agents, if $Q_{0}$ and $Q'_{0}$ satisfy that $Q_{0}(s,a)\geq Q_{0}(s,a')$ if and only if $Q'_{0}(s,a)\geq Q'_{0}(s,a')$ for all $s\in\mathcal{S}$, $a,a'\in\mathcal{A}$, then $TD(M,L,Q_{0},\pi^{\dagger})=TD(M,L',Q'_{0},\pi^{\dagger})$.

In other words, under the white-box setting, an optimal teacher can teach every learner in the family $\mathcal{L}$ equally fast.

4 Human Teaching on the Family of Q-Learners

Our human teachers, however, do not know what algorithm or parameters the students use. This allows us to probe human assumptions about the student: we expect that when a student algorithm is closer to human preconceptions of a student, the human teacher will be better at teaching that student.

How does human teaching compare to machine teaching in a task where a person needs to teach a reinforcement learner a full policy when the learner starts one action away from its environmental goal? Figure 1 presents an idealized exploration-exploitation scenario. In the scenario, the dog's initial state is one tile left of its goal (the door, which symbolizes its home). The teaching goal is to teach the dog to go home from every tile. Whenever the dog reaches its home, it starts the next step back at the initial state. To teach the dog to get home from every square, the teacher must incentivize the dog to explore, which results in the dog moving away from its ultimate goal: the teacher must punish moving to the goal and/or reward going away from the goal. Once the dog has gotten all the way to the leftmost tile, the teacher can begin to "undo" their prior teaching and teach the dog to go right. Depending on how much reward/punishment the dog is given to explore, it may take many steps (or even be impossible given feedback with finite limits) to teach the dog to move right at every tile.

To analyze the difficulty of this task, we used the machine teaching method described in Section 3 to provide the optimal teaching policy. The learner is a Q-learner that uses an $\varepsilon$-greedy policy to select its action and starts indifferent, with $Q_{0}(s,a)=0$ for every feasible state-action pair $(s,a)$. For concreteness, we assume the learner is parametrized as follows: $\alpha=0.1$, $\gamma=0.9$, and $\varepsilon=0.1$. Given this configuration, the optimal teaching policy is precisely the one discussed above (get the dog to move all the way left with minimal reward/punishment and then teach it to move right at every state). The number of steps to teach is a random variable whose outcome depends on what actions are selected when the dog is indifferent and when suboptimal actions are taken. Thus, we approximated the expected number of steps via Monte Carlo (simulating teaching dogs using the optimal teaching policy and recording the number of steps it took for the full target policy to be taught). On average, the optimal teaching policy takes 11 steps to teach the dog the full target policy. It is 11 regardless of the value of $\alpha$ and $\gamma$ (assuming they are within the values that allow Q-learning to converge). Changing $\varepsilon$ affects the optimal teaching policy and expected number of steps quantitatively, but the optimal teaching policy follows the same qualitative procedure as before. So, we set $\varepsilon=0.1$ for the remainder of the paper.
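
The Monte Carlo estimate itself is simple; a sketch follows. Here `simulate_one_dog` is a hypothetical wrapper that runs one teaching episode (for example, the rollout sketch from Section 3.2) with the hand-derived optimal teaching policy and returns the number of steps used.

```python
import numpy as np

def expected_teaching_steps(simulate_one_dog, n_sims=10_000, seed=0):
    """Monte Carlo estimate of the expected number of teaching steps."""
    rng = np.random.default_rng(seed)
    steps = np.array([simulate_one_dog(rng) for _ in range(n_sims)])
    return steps.mean(), steps.std(ddof=1) / np.sqrt(n_sims)   # mean and its standard error
```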

Although our machine teaching results suggest all Q-learners should be trainable with the same expected number of steps, can people train all learners? Are there some learners that are easier for people to train? As discussed earlier, recent work has found that traditional parametrizations of Q-learners are extremely challenging for people to train on even simple tasks, and the action-signaling model (which we proved is a special case of Q-learning) is much easier for people to train [10]. However, that work did not test a large set of parametrizations of the discount rate ($\gamma$), and some of its analyses suggest that a Q-learner might be trainable when the discount rate is small. This is consistent with other work on hyperdiscounting [16] showing that reinforcement learners with small discount rates are easier for people to teach. In this study, we investigate the role of the discounting parameter in a simple teaching task and examine how giving people the internal dynamics of the learning update affects their ability to teach efficiently.

4.1 Experimental Design

Participants. We recruited 791 participants through Amazon MTurk. The number of participants was chosen a priori based on the authors' intuition from similar previously conducted studies. We excluded 42 participants (11 for failing to complete the task, 19 due to experiment error, and 12 who selected "do nothing" for at least 36 steps). In addition, dogs that were trained successfully in fewer steps than the optimal length (the number of steps the optimal teaching policy needs) were excluded from the analysis. For example, if the subject gave punishment on every step and the dog luckily went to the left at every tile all the way from the rightmost tile to the leftmost tile, the dog could be trained in just 4 steps. These data were excluded because they did not really reflect a success in training, but more likely a misunderstanding of the goal combined with luck. This comprises 9.03% of the total dogs (203 out of 2247 dogs).

Interface/Stimuli. The dog training task took place in a $4\times 1$ MDP with an absorbing state on the right. Figure 1 shows the visual interface that the participants interacted with. Four states were represented by four tiles that the dog could walk on, and the absorbing state was represented by a door. At each step, the dog could only go right or left from a tile to the adjacent tile or the door. If the dog went left at the leftmost tile, the dog stayed at the same tile. If the dog went right at the rightmost tile, the dog went to the door and was then placed back at the rightmost tile. The training for a dog ended if the dog had successfully learned the target policy, or if the dog had already taken 40 steps but still had not learned the target policy. Once the training for a dog ended, a new dog with a different color was shown and the participant was asked to train the new dog. The internal states were displayed as two rows provided by a "brain scanner", which corresponded to the Q table and current optimal policy. At the beginning of each dog's training, the dog was placed at the rightmost tile with an initial Q table in which the Q value of each state-action pair was zero. Participants responded via a continuous slider (feedback of -1 to 1), or could select a button to "do nothing" (feedback of zero).
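
For reference, the environment dynamics amount to a four-tile chain; a sketch is below (the tile indexing, with 0 as the leftmost tile and 3 as the rightmost tile next to the door, is our own convention).

```python
def dog_step(s, a):
    """Transitions of the 4 x 1 dog garden: a = 0 moves left, a = 1 moves right.
    Going left at tile 0 bumps into the wall; going right at tile 3 reaches the
    door (the absorbing home state), after which the dog is placed back at tile 3."""
    if a == 0:
        return max(s - 1, 0)
    return 3 if s == 3 else s + 1
```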

Figure 1: The dog training task took place in a small garden that was composed of four tiles and a door on the right. The dog's current Q table (labeled "Preference") and policy (labeled "Action Plan") were shown above the garden. In the Q table, there were four cells corresponding to the Q values of each of the four tiles. In each cell, there were arrows pointing to the left/right, which encoded the Q value of moving to the left/right at that tile. A blue solid arrow encoded a positive Q value, whereas a red dotted arrow encoded a negative Q value. The policy was based on the Q table, where the arrow pointed in the direction that the dog would go in an exploitation step. When the dog's policy at a tile dictated going towards the door, the background of the cell turned green, indicating the policy at that state matched the target policy. After the dog took its action, the participant was asked to provide feedback to the dog. The participant could use the slider to select the feedback value or click the "do nothing" button to give zero feedback. After the participant chose the feedback, the dog's Q table and policy would be updated accordingly, and the dog would perform its next action. In the bottom-right corner, there was a step counter that recorded the steps that the dog had taken so far.

Procedure. Participants were asked to teach three dogs to go home from any tile. They were given an extensive quiz about the instructions and could not continue until every question was answered correctly (see Supplementary Materials). There were two between-subjects factors: learner type and learning dynamics. Each participant was assigned to one of the four traditional Q-learners ($\gamma\in\{0.0,0.1,0.45,0.9\}$, $\alpha=0.9$) or one of the two action-signaling models. Learning dynamics were either shown (the whitebox setting) or not shown (the blackbox setting) to the participant. If they were shown, the internal states displayed to the participant were synchronized to the current position of the slider. Otherwise, the internal states only updated after the participant provided feedback. On each trial, the dog would move from one state to another. It followed an $\varepsilon$-greedy policy ($\varepsilon=0.1$), where the policy was the optimal one for the current Q-matrix. When a random action was selected for the dog, a squirrel would appear in the direction of that action (participants were told that when a squirrel appeared, the dog would move towards it no matter what its internal states were). After the dog moved, the participant would respond by dragging the feedback slider or hitting the "do nothing" button. Training concluded when the participant successfully taught the target policy or 40 steps had elapsed. Participants trained three dogs and were then given a short survey to ensure that they treated the slider symmetrically and to gather standard demographic data.

4.2 Results and Discussion

Table 1: The Descriptive Statistics of the Human Experiment
Learner Type Slider Sync # Subjects Success Rate (95% CI) Avg Steps when Successful (95% CI)
AS1 on 66 30.1 ([23.4, 36.7]) 20.6 ([18.8, 22.2])
AS1 off 71 33.3 ([26.7, 40.0]) 24.3 ([22.6, 25.9])
AS2 on 68 26.7 ([20.2, 33.1]) 21.8 ([19.6, 23.9])
AS2 off 67 26.3 ([20.1, 32.6]) 21.2 ([19.2, 23.2])
Q ($\gamma=0.00$) on 57 51.3 ([43.4, 59.2]) 20.9 ([19.5, 22.3])
Q ($\gamma=0.00$) off 57 47.4 ([39.4, 55.3]) 22.3 ([20.8, 23.8])
Q ($\gamma=0.10$) on 57 47.1 ([39.3, 54.9]) 21.3 ([19.7, 22.9])
Q ($\gamma=0.10$) off 55 46.5 ([38.4, 54.7]) 22.5 ([20.6, 24.4])
Q ($\gamma=0.45$) on 57 36.9 ([29.4, 44.5]) 21.5 ([19.5, 23.5])
Q ($\gamma=0.45$) off 61 35.2 ([27.8, 42.5]) 23.8 ([21.8, 25.8])
Q ($\gamma=0.90$) on 68 21.9 ([16.0, 27.9]) 22.0 ([19.9, 24.2])
Q ($\gamma=0.90$) off 65 24.6 ([18.4, 30.8]) 25.2 ([22.9, 27.4])

Learner Type: the learner type of the dogs being trained. Slider Sync: whether the subjects could see the effect of the feedback on the Q table while sliding the slider to decide the feedback value. # Subjects: the number of subjects included in the analysis. Success Rate (%): the proportion of dogs that were successfully trained. Average Steps when Successful: the average number of steps the subjects used to train the dogs when the training was successful. AS1: the AS1 agent. AS2: the AS2 agent. Q: the Q-learning agent.
Figure 2: (Left) Participant success at training the full policy for different learner types and internal dynamics conditions (synchronization on vs. synchronization off). (Right) Average number of steps for successfully trained dogs for different learner types and internal dynamics conditions (synchronization on vs. synchronization off). Note: error bars denote 95% confidence intervals and the dashed line corresponds to the average number of steps for an optimal teacher. AS1: The AS1 agent. AS2: The AS2 agent. Q9: The Q-learning agent with $\gamma=0.9$. Q45: The Q-learning agent with $\gamma=0.45$. Q1: The Q-learning agent with $\gamma=0.1$. Q0: The Q-learning agent with $\gamma=0$. 95% CI: the 95% confidence interval.

Figure 2 shows the success rate of participants at training the full policy and the average number of steps taken when a dog was successfully trained. To analyze the success rate, we fit a mixed-effects logistic regression model where learner type, internal dynamics, and their interaction were fixed effects, and participant was a random effect. Learner type significantly influenced the success rate for a participant ($\chi^{2}(5)=20.7$, $p<0.001$), but the internal dynamics did not ($\chi^{2}(1)=0.66$, $p=0.42$). To analyze the average number of steps for successfully trained dogs, we fit a mixed-effects linear regression where learner type, internal dynamics, and their interaction were fixed effects, and participant was a random effect. There was also no interaction ($\chi^{2}(2)=1.64$, $p=0.90$). Both learner type and internal dynamics significantly affected the average number of steps a person took to train a dog successfully ($F(5,341.76)=2.21$, $p=0.05$ and $F(1,248.82)=4.19$, $p<0.05$, respectively).

How well did people teach the dogs? Clearly they were suboptimal: at best, the average success rate was 51%, and successful training took about 21 steps (the optimal teacher has a success rate of 100% and takes an average of 11 steps). To analyze whether people adapted their teaching over time, we conducted the following permutation test. For each participant, we created 1000 simulated Q-learners. For each simulated Q-learner and state-action pair, we sampled a random permutation of the feedback provided by the participant for that state-action pair. This provides a Monte Carlo test of whether human teaching adapted with learning. Participants clearly adapted their behavior: none of the simulated Q-learners for any participant ever learned the target policy when the feedback for each state-action pair was randomly permuted. Thus, human teaching is quite successful in this task (albeit still suboptimal).
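
A sketch of this permutation test for a single participant (the `feedback_by_sa` structure and the `simulate_learner` replay helper are hypothetical names used only for illustration):

```python
import numpy as np

def permutation_test(feedback_by_sa, simulate_learner, n_perms=1000, seed=0):
    """Shuffle the participant's feedback within each (state, action) pair, replay it
    through a fresh Q-learner, and count how often the target policy is still learned."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_perms):
        shuffled = {sa: rng.permutation(rewards) for sa, rewards in feedback_by_sa.items()}
        successes += bool(simulate_learner(shuffled))   # True if the target policy is learned
    return successes / n_perms
```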

In sum, we observed that human teachers are suboptimal: they did not always succeed in teaching the policy to an agent; and when they did, they required more steps compared to the 11 steps of the optimal teaching algorithm in Section 3. This is not surprising: we never expected humans to be optimal teachers. Our important discovery is that human teachers' "favorite student" is Q0, which differs from other Q-learning agents in two key parameters: the smallest possible discounting factor $\gamma=0$, and a large Q-update step size $\alpha=0.9$. What does this mean? The favorite student is one which is predictable by the teacher (no complications from discounted future rewards because $\gamma=0$) and amenable to myopic control from the teacher (large $\alpha$ emphasizes immediate teacher reward). We thus suggest that human teachers assume such predictable, myopically controllable students when they teach. On the other hand, whether the teacher can see the "guts" of the agent (whitebox / blackbox setting with effect of reward slider on / off, respectively) turns out to not be very important for the teacher.

5 Conclusions

In this paper, we conducted the first investigation of how people teach learners to solve the exploration-exploitation tradeoff, which is a common problem in everyday life. We did so by first formulating it as a machine teaching problem where the learner is assumed to be a Q-learner. We then ran a behavioral experiment of an idealized scenario to see how well people train Q-learners to solve the exploration-exploitation tradeoff when the teaching goal is an entire policy. We found that people are suboptimal, but quite successful, and that teaching was easiest when the Q-learner had a small discount rate $\gamma$ and a large update step size $\alpha$. We also showed that revealing how possible feedback would change the learner's beliefs weakly helped people teach in a more efficient manner.

Given that our study is the first to explore how people teach learners to solve an exploration-exploitation tradeoff, there are potential concerns about the generalizability of the study. For example, it is unclear whether our results would generalize to other environments. It is also unclear how robust the results are to other types of reinforcement learners. For example, it seems plausible that a model-based reinforcement learning method would be easier to train on our idealized task. However, prior work suggests that model-based learners would struggle in more complex environments where positive net reward cycles are easily taught.

6 Broader Impacts

Over the last few decades, human-machine interactions have become a regular part of everyday life. Although these interactions are often successful, they also often fail, usually due to the machine agent not properly understanding the person's feedback. When they do succeed, it is often because people have learned to interact with the machines, rather than machines learning or being created to interact with people in a manner that is natural to people. Not everyone can learn to interact with machines, especially those who have not kept up with current technology. This creates a large gap in the tools available to different people, with those who are disadvantaged often struggling to learn how to interact properly with machines due to lack of experience.

One main method people interact with one another and machines is to teach by providing evaluative feedback. Despite much progress, scientists neither fully understand the mechanisms underlying this behavior nor have a formal description that can guide or be “plugged into” machine agents. This paper provides the first mechanism-level analysis of how a learner should be taught, as well as an analysis of how people teach via evaluative feedback in an idealized exploration-exploitation task. It provides guidance for how engineers should parametrize machine agents who use Q-learning while interacting with people. This is a step towards the larger goal of natural human-machine interaction that is equally accessible to all.

References

  • [1] Daniel S Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7749–7758, 2019.
  • [2] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
  • [3] Jonathan D Cohen, Samuel M McClure, and Angela J Yu. Should i stay or should i go? how the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1481):933–942, 2007.
  • [4] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
  • [5] N. D. Goodman and M. C. Frank. Pragmatic interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818–829, 2016.
  • [6] Alison Gopnik. Childhood as a solution to explore–exploit tensions. Philosophical Transactions of the Royal Society B, 375(1803):20190502, 2020.
  • [7] H. P. Grice. Logic and conversation. In P. Cole and J. Morgan, editors, Syntax and Semantics (Vol. 3), pages 41–58. Academic Press, 1975.
  • [8] Steve Hanneke. Teaching dimension and the complexity of active learning. In International Conference on Computational Learning Theory, pages 66–81. Springer, 2007.
  • [9] Luis Haug, Sebastian Tschiatschek, and Adish Singla. Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, pages 8464–8473, 2018.
  • [10] M. K. Ho, F. Cushman, M. L. Littman, and J. L. Austerweil. People teaching with rewards and punishments as communication, not reinforcements. Journal of Experimental Psychology: General, 148(3):520–549, 2019.
  • [11] M. K. Ho, M. Littman, J. MacGlashan, F. Cushman, and J. L. Austerweil. Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems, pages 3027–3035, 2016.
  • [12] M. K. Ho, M. L. Littman, F. Cushman, and J. L. Austerweil. Teaching with rewards and punishments: Reinforcement or communication? In Annual Proceedings of the 37th Cognitive Science Society Meeting, 2015.
  • [13] Anette Hunziker, Yuxin Chen, Oisin Mac Aodha, Manuel Gomez Rodriguez, Andreas Krause, Pietro Perona, Yisong Yue, and Adish Singla. Teaching multiple concepts to a forgetful learner. In Advances in Neural Information Processing Systems, 2019.
  • [14] Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640–3649, 2018.
  • [15] Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, and Adish Singla. Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, pages 2692–2700, 2019.
  • [16] W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The tamer framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 9–16, 2009.
  • [17] Laurent Lessard, Xuezhou Zhang, and Xiaojin Zhu. An optimal control approach to sequential machine teaching. arXiv preprint arXiv:1810.06175, 2018.
  • [18] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B Smith, James M Rehg, and Le Song. Iterative machine teaching. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2149–2158. JMLR. org, 2017.
  • [19] Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer, 2018.
  • [20] Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Jerry Zhu. Policy poisoning in batch reinforcement learning and control. In Advances in Neural Information Processing Systems, pages 14543–14553, 2019.
  • [21] J. MacGlashan, M. K. Ho, R. Loftin, B. Peng, G. Wang, D. L. Roberts, M. E. Taylor, and M. L. Littman. Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning (Vol. 70), pages 2285–2294, 2017.
  • [22] Farnam Mansouri, Yuxin Chen, Ara Vartanian, Jerry Zhu, and Adish Singla. Preference-based batch and sequential teaching: Towards a unified view of models. In Advances in Neural Information Processing Systems, pages 9195–9205, 2019.
  • [23] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
  • [24] Tomi Peltola, Mustafa Mert Çelikok, Pedram Daee, and Samuel Kaski. Machine teaching of active sequential learners. In Advances in Neural Information Processing Systems, pages 11202–11213, 2019.
  • [25] Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. arXiv preprint arXiv:2003.12909, 2020.
  • [26] C. R. Reid, H. MacDonald, R. P. Mann, J. A. R. Marshall, T. Latty, and S. Garnier. Decision-making without a brain: how an amoeboid organism solves the two-armed bandit. Journal of the Royal Society Interface, 13:20160030, 2016.
  • [27] P. Shafto, N. D. Goodman, and T. L. Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71:55–89, 2014.
  • [28] Ayumi Shinohara and Satoru Miyano. Teachability in computational learning. New Generation Computing, 8(4):337–347, 1991.
  • [29] D. W. Stephens, J. S. Brown, and R. C. Ydenberg. Foraging: Behavior and Ecology. University of Chicago Press, Chicago, USA, 2007.
  • [30] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [31] Sebastian Tschiatschek, Ahana Ghosh, Luis Haug, Rati Devidze, and Adish Singla. Learner-aware teaching: Inverse reinforcement learning with preferences and constraints. In Advances in Neural Information Processing Systems, 2019.
  • [32] Yizhen Wang and Kamalika Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
  • [33] Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. Adaptive reward-poisoning attacks against reinforcement learning. arXiv preprint arXiv:2003.12613, 2020.
  • [34] Xuezhou Zhang, Xiaojin Zhu, and Laurent Lessard. Online data poisoning attack. arXiv preprint arXiv:1903.01666, 2019.
  • [35] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. CoRR, abs/1801.05927, 2018.

Appendix A Proof of Theorem 1

Theorem 2.

For any two teaching instances defined by $(M,L,Q_{0},\pi^{\dagger})$ and $(M,L',Q'_{0},\pi^{\dagger})$, where $L$ and $L'$ are two Q-learning agents, if $Q_{0}$ and $Q'_{0}$ satisfy that $Q_{0}(s,a)\geq Q_{0}(s,a')$ if and only if $Q'_{0}(s,a)\geq Q'_{0}(s,a')$ for all $s\in\mathcal{S}$, $a,a'\in\mathcal{A}$, then $TD(M,L,Q_{0},\pi^{\dagger})=TD(M,L',Q_{0}',\pi^{\dagger})$.

Proof.

We write $Q_{t}\equiv Q'_{t}$ if $Q_{t}(s,a)\geq Q_{t}(s,a')$ if and only if $Q'_{t}(s,a)\geq Q'_{t}(s,a')$ for all $s\in\mathcal{S}$, $a,a'\in\mathcal{A}$. The proof is based on a key technical lemma:

Lemma 3.

For any teaching policy $\phi$ for teaching instance $(M,L,Q_{0},\pi^{\dagger})$, there exists a matching teaching policy $\phi'$ for teaching instance $(M,L',Q'_{0},\pi^{\dagger})$, such that for any time $t$, if $Q_{t}\equiv Q'_{t}$ and $\xi_{t}=\xi'_{t}$, then $Q_{t+1}\equiv Q'_{t+1}$.

Lemma 3 implies that, with the same random seed, the relationship $Q_{t+1}\equiv Q'_{t+1}$ will remain invariant throughout teaching when $\phi$ is used on $(M,L,Q_{0},\pi^{\dagger})$ and $\phi'$ is used on $(M,L',Q'_{0},\pi^{\dagger})$. Thus, $Q_{t}\in\mathcal{Q}^{\dagger}$ if and only if $Q'_{t}\in\mathcal{Q}^{\dagger}$. Therefore, if there exists an optimal $\phi$ that achieves the teaching dimension for $(M,L,Q_{0},\pi^{\dagger})$, then the matching $\phi'$ achieves the same teaching dimension for $(M,L',Q'_{0},\pi^{\dagger})$. Thus, the $TD$ for both teaching instances must match. What remains is to prove Lemma 3.

Proof of Lemma 3.

Given a particular $Q_{t}$ and an experience tuple $\xi_{t}=(s_{t},a_{t},s_{t}')$, the Q-learning update rule will only modify the value of $Q_{t}(s_{t},a_{t})$ based on the teacher-provided reward $r_{t}=\phi(\xi_{t})$. Define the rank of $(s_{t},a_{t})$ in $Q_{t}$ as $rank_{Q_{t}}(s_{t},a_{t})=|\{a\,|\,Q_{t}(s_{t},a)>Q_{t}(s_{t},a_{t})\}|$. Under the Q-learning update rule

Q_{t+1}(s_{t},a_{t})=(1-\alpha)Q_{t}(s_{t},a_{t})+\alpha\left(r_{t}+\gamma\max_{a'}Q_{t}(s_{t+1},a')\right), (6)

it is clear that there exists an $r_{t}$ such that the rank of $(s_{t},a_{t})$ in $Q_{t+1}$ can be any of $0,1,\ldots,|\mathcal{A}|-1$. Assume now that the rank of $(s_{t},a_{t})$ in $Q_{t+1}$ becomes $k$ after updating with $r_{t}=\phi(\xi_{t})$. Define $\phi'$ such that $r'_{t}=\phi'(\xi'_{t})$ will also update the rank of $(s_{t},a_{t})$ in $Q'_{t+1}$ to be $k$. Then, since $Q_{t}\equiv Q'_{t}$ and for both $Q_{t+1}$ and $Q'_{t+1}$ the rank of $(s_{t},a_{t})$ becomes $k$, we have $Q_{t+1}\equiv Q'_{t+1}$. This concludes the proof.

Appendix B Experiment Setting and Hyperparameters for TD3

Throughout the experiments, we use the set of hyperparameters for TD3 described in Table 2. The hyperparameters are selected via grid search on the Dog MDP. Each experiment is run for 5000 episodes, where each episode is 200 iterations long. The learned policy is evaluated every 50 episodes, and the policy with the best evaluation performance is used for the computation of the teaching dimension. In the computation of the teaching dimension, we run the best-found TD3 policy for 1000 episodes, and take the average number of steps to teach the target policy. This gives 11.0.

Parameters Values Description
exploration noise 0.5 Std of Gaussian exploration noise.
batch size 100 Batch size for both actor and critic.
discount factor 0.99 Discounting factor for the teaching problem.
policy noise 0.2 Noise added to target policy during critic update.
noise clip $[-0.5,0.5]$ Range to clip target policy noise.
action L2 weight 50 Weight for L2 regularization added to the actor network loss function.
buffer size $10^{7}$ Replay buffer size, larger than total number of iterations.
optimizer Adam Use the Adam optimizer.
learning rate critic $10^{-3}$ Learning rate for the critic network.
learning rate actor $5^{-4}$ Learning rate for the actor network.
$\tau$ 0.002 Target network update rate.
policy frequency 2 Frequency of delayed policy update.
Table 2: Hyperparameters for TD3.
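
For readers who want to reproduce a similar setup, the sketch below shows one way to map these settings onto the stable-baselines3 implementation of TD3. This mapping is an assumption on our part (the paper does not specify its TD3 code base), `TeachingEnv` is a hypothetical Gym-style wrapper for the teaching MDP of Section 3.2, and the actor-specific learning rate and actor L2 weight from Table 2 have no direct counterpart in that library.

```python
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

env = TeachingEnv()                    # hypothetical environment exposing the teaching MDP
n_act = env.action_space.shape[0]
model = TD3(
    "MlpPolicy", env,
    action_noise=NormalActionNoise(np.zeros(n_act), 0.5 * np.ones(n_act)),  # exploration noise
    batch_size=100, gamma=0.99, buffer_size=10_000_000,
    target_policy_noise=0.2, target_noise_clip=0.5,
    tau=0.002, policy_delay=2, learning_rate=1e-3,   # one rate shared by actor and critic here
)
model.learn(total_timesteps=5000 * 200)              # 5000 episodes of 200 iterations each
```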

Appendix C Datasets

C.1 Data Collection Process

Participants accessed the study on Amazon MTurk. Each participant was first given instructions about the task (Fig. 5) and was required to complete an extensive quiz about the instructions (Fig. 9); they could not continue until every question was answered correctly. Participants trained three dogs and were then given a short survey to ensure that they treated the slider symmetrically and to gather standard demographic data. Each participant was reimbursed $2.00 for taking the task.

C.2 Participants and Data Exclusion

We recruited 791 participants through Amazon MTurk. The number of participants was chosen a priori based on the authors' intuition from similar previously conducted studies. We excluded 42 participants (11 for failing to complete the task, 19 due to experiment error, and 12 who selected "do nothing" for at least 37 out of the 40 steps for training a single dog). In addition, dogs that were trained successfully in fewer steps than the optimal length (the number of steps the optimal teaching policy needs) were excluded from the analysis. For example, if the subject gave punishment on every step and the dog luckily went to the left at every tile all the way from the rightmost tile to the leftmost tile, the dog could be trained in just 4 steps. These data were excluded because they did not really reflect a success in training, but more likely a misunderstanding of the goal combined with luck. This comprises 9.03% of the total dogs (203 out of 2247 dogs). In sum, 749 subjects (2044 dogs) were included in the analysis.

Figure 3: The instruction before taking the dog training task.
Figure 4: The instruction before taking the dog training task (continued).
Figure 5: The instruction before taking the dog training task (continued).
Figure 6: The quiz before taking the dog training task.
Figure 7: The quiz before taking the dog training task (continued).
Figure 8: The quiz before taking the dog training task (continued).
Figure 9: The quiz before taking the dog training task (continued).