

Modeling Human Behavior - Part I: Learning and Belief Approaches

Andrew Fuchs, Università di Pisa, Department of Computer Science, Pisa, Italy, [email protected]; Andrea Passarella and Marco Conti, Institute for Informatics and Telematics (IIT), National Research Council (CNR), Pisa, Italy
Abstract.

There is a clear desire to model and comprehend human behavior. Trends in research covering this topic show that many view human reasoning as the presupposed standard for artificial reasoning. As such, topics such as game theory, theory of mind, and machine learning all integrate concepts which are assumed components of human reasoning. These serve as techniques to attempt to both replicate and understand the behaviors of humans. In addition, next generation autonomous and adaptive systems will largely include AI agents and humans working together as teams. To make this possible, autonomous agents will require the ability to embed practical models of human behavior, which allow them not only to replicate human models as a technique to “learn", but to understand the actions of users and anticipate their behavior, so as to truly operate in symbiosis with them. The main objective of this paper is to provide a succinct yet systematic review of the most important approaches in two areas dealing with quantitative models of human behaviors. Specifically, we focus on (i) techniques which learn a model or policy of behavior through exploration and feedback, such as Reinforcement Learning, and (ii) techniques which directly model mechanisms of human reasoning, such as beliefs and bias, without necessarily learning via trial and error.

Artificial Intelligence, Machine Learning, Human Behavior, Cognition, Bias, Human-AI Interaction, Human-Centric AI

1. Introduction

Work utilizing models of human behavior and decision-making has spanned decades and covered numerous approaches and applications. Human behavior and reasoning enable complex behaviors and social structures, which in turn become multifaceted and grow significantly in complexity. Still, humans are generally quite successful at navigating these complex social structures. There are a multitude of attempts at explaining aspects of these capabilities, and this article discusses some of the more popular or persistent methods. The motivations behind research in the area of modeling human behavior and decisions are varied, so we limit the analysis to (a representative subset of) works providing quantitative models (e.g. math models or algorithms), as these are the approaches that allow one to “code" human behavior in autonomous systems. Some examples include better autonomous driving (Wu et al., 2020; Fernando et al., 2020), comprehension of mental states (Bianco and Ognibene, 2019), and population-level modeling (Jackson et al., 2017). More generally, with the pervasive diffusion of AI systems and the recent focus on Human-Centric AI (HCAI, i.e., forms of AI where humans and AI agents work “as a team"), embedding practical models of human behavior that can guide the autonomous operations of AI systems will become a key required component of next-generation autonomous and adaptive systems.

A key distinction between the goals and approaches is often the fidelity of the replication and the expected deployment case. For instance, researchers may try to replicate the neurological pathways in an attempt to replicate the neuro-physical process underpinning reasoning (Asgari and Beauregard, 2021), or they may instead attempt to generate a computational model which is meant to mimic heuristically biased behavior (Lieder et al., 2017b). In any case, a common aspect is the desire to use humans as the template for desirable patterns of reasoning. Using human reasoning and abilities has motivated numerous research topics enabling autonomous and adaptive systems. These systems can be trained to work independently, in multi-agent systems, and in human-AI hybrid domains. In all of these cases, the resulting systems require the ability to both learn from the environment and adapt to observed changes. This enables them to act autonomously and to gain and integrate new knowledge.

Regarding the need or desire to model human behavior, there are several topic areas which involve the interaction or dependence between humans and Artificial Intelligence (AI) systems. Humans are encountering intelligent systems more frequently, so it is vital that these systems be created with the use case and user in mind. In order to consider the user, it is important to understand their behavior and capabilities. As such, we will discuss methods for modeling and replicating human behavior. In this context, we will focus primarily on concepts relating to Human-Centric AI, such as AI-assisted decision-making or Hybrid Intelligence. These topics demonstrate cases in which humans use or interact with AI systems. These interactions require differing assumptions regarding the dynamics between the human and the system and how those dynamics impact the capabilities and characteristics of the systems. In a more direct case of Human-AI Interaction (HAI), humans utilize the output of a system in their decision-making process. In this case, it is important that the user clearly understands the output of an algorithm so they can effectively utilize the provided information. As an example, consider the use of AI or Machine Learning (ML) in medical diagnostics (i.e. Human-AI Interaction and AI-assisted Decision-Making). It is not only important that the system be effective and demonstrate high accuracy, but it must also prove useful for the human user. If not, the utility of the system may go unnoticed and the system unused.

In the most relevant case of HCAI for autonomous and adaptive systems, i.e., the case of hybrid intelligence, humans and AI systems form an interdependency. This can take multiple forms (Gurcan et al., 2021; Kaluarachchi et al., 2021), but leads to cases in which the human and the AI are expected to operate in a synergistic manner. Some examples include (Dellermann et al., 2021): Co-evolution over time, Human-in-the-loop learning, Interactive or Active Learning, Socio-technological ensemble. Given these paradigms, the human and AI can be paired in multiple configurations depending on technical, social, and other considerations. According to (Wilkens et al., 2021), there are five types of understanding associated with human-centered AI:

  • Deficit-oriented: AI serves to augment the human and compensate for deficiencies in attention, concentration, or physical and mental stamina.

  • Data reliability-oriented: Provides AI as a tool for improving the use and understanding of data. (e.g. medical diagnosis)

  • Protection-oriented: Use of AI to perform tasks too dangerous for the human or to serve as an assistant that boosts safety.

  • Potential-oriented: Relates to combined intelligence to boost the performance and reasoning of the two sides as a team.

  • Political-oriented: Relates to the concepts regarding how control, labor, etc. will be distributed between humans and technology, especially as it relates to protections for the employees.

The literature on modeling HCAI systems for achieving higher levels of autonomy and adaptation can be classified as shown in Figure 1. Due to the vastness of contributions in the literature, we group works into two consistent parts, where this paper (part I) covers the first two areas, i.e., approaches for automatically learning a behavior from examples, and approaches devoted to modeling belief and reasoning aspects. On the other hand, the remainder of the topics illustrated in Figure 1 are covered in (Fuchs et al., 2022). Specifically, in this paper we will discuss some of the popular topics and applications relating to Human-Centric AI, Human-AI Interaction, and Hybrid Intelligence. These topics include: Reinforcement Learning, Meta-Learning, Theory of Mind, and more. These topics represent methods which attempt to build models mimicking, or inspired by, different biological/neurological, cognitive, and social levels of reasoning. Additionally, we will discuss how these topics align with application areas of interest. These application areas cover a wide assortment of scenarios as well as expected levels of autonomy. For instance, this can range from topics such as demographic preferences (Jackson et al., 2017) to something as safety-critical as fully autonomous driving (Wu et al., 2020; Fernando et al., 2020). The specific scenario can depend significantly on the level of autonomy expected and the level of risk or control humans are willing to allow. For these topics, we will provide underlying principles and definitions, relevant examples of use cases and their approaches, and further examples of relevant survey/review papers and related resources. Finally, we will provide additional details regarding some common application areas for these topics.

The rest of the paper will be structured as depicted in Figure 1. In Section 3, we will discuss learning methods which generate a model of behavior either by trial and error, or by learning from observations of others. These techniques learn from feedback denoting desirable behavior and adapt their policy according to this feedback. Next, in Section 4, we discuss methods which attempt to model the mental states of others, utilize world models to simulate human knowledge, or discuss bias and fairness in representations and reasoning. Similar to the previous section, these topics are focused on the bounded resources for reasoning and decision-making of humans, but in this case focus less on a direct replication of cognitive functions. Further, these techniques focus more on models of the world and those in them. Finally, Section 5 provides a brief overall comparison of the considered approaches.

We will not provide comprehensive coverage of all aspects relating to the listed topics, but instead attempt to provide a useful assortment as a demonstration of topics of interest offering potential areas of further exploration. These topics and demonstrated applications are intended to illustrate situations in which humans interact with AI systems, how those interactions rely on an understanding or model of the human’s behavior, and how systems can learn from humans or models of their abilities. In each of Sections 3-4, we discuss specific topic areas of interest (e.g. RL in Section 3), to provide examples of approaches and techniques illustrating aspects of these topics. Each section describing a topic is organized according to a common structure. First, we point to specific surveys and related works dealing with that topic in greater detail. Then, we discuss the general principles. Next, we discuss one concrete example where those principles are made practical. Finally, we briefly mention additional examples where the same principles have been applied.

Human Behavior Representation:
  • Part-I, Section 3: Learning Human Behaviors by Experience and Feedback — Section 3.1 Reinforcement Learning; Section 3.2 Inverse Reinforcement Learning; Section 3.3 Imitation Learning; Section 3.4 Active Learning
  • Part-I, Section 4: Belief and Reasoning Approaches — Section 4.1 Meta-Reasoning and Meta-Learning; Section 4.2 Theory of Mind; Section 4.3 Simulated World Knowledge
  • Part-II, Section 3: Bounded Rationality and Cognitive Limitations — Section 3.1 Cognitive Heuristics; Section 3.2 Cognitively and Biologically Plausible; Section 3.3 Attention; Section 3.4 Game Theory
  • Part-II, Section 4: Uncertainty and Irrationality — Section 4.1 Quantum Models of Decisions; Section 4.2 Bias and Fairness
Figure 1. Taxonomy of concepts

2. HCAI orthogonal concepts and samples of Application Areas

In this section, we discuss some broad concepts related to HCAI, which cut across the various modeling methods presented in detail in the remainder of the paper. Moreover, we will briefly discuss popular application areas demonstrating uses of the techniques discussed in this paper. This list is not comprehensive, but serves to demonstrate topics which are likely more familiar and of immediate interest. The approaches used demonstrate methods which serve to replicate, model, or learn from human behavior and capabilities.

2.1. Orthogonal HCAI concepts

It is important to note that different issues and considerations arise from varying concerns. For instance, it is important to ensure the AI system is not difficult to use or explain to users, not difficult to manage or maintain, and not perceived as creepy by potential users (Yang et al., 2020b; Eiband et al., 2021). Additionally, the systems will need awareness and capabilities supporting numerous types of intelligence such as social, emotional, physical, etc. in order to best understand and interact with humans (Cichocki and Kuleshov, 2021) while operating autonomously. Further, the reliability, correctness, and resulting impact of the system can be viewed as an important factor (Perrotta and Selwyn, 2020).

A key factor in human-centric paradigms such as Human-AI Interaction or Hybrid Intelligence is the fact that the behavior of the human and the AI are assumed to impact each other (Rahwan et al., 2019). In the case of hybrid intelligence, each side contributes a deeper aspect to the relationship. In general, we would require AI systems which can observe and understand humans in order to adapt and improve their behavior in this hybrid domain (Kambhampati, 2019). As an example, in some cases of Active Learning (AL) or Reinforcement Learning (RL), a model is being learned with the help of, or in observation of, a human teacher (Puig et al., 2020; Liu et al., 2019; Ramaraj et al., 2021; Navidi and Landry, 2021; Najar and Chetouani, 2021; Holzinger et al., 2019). In such a case, we expect the AI system to continually refine its model relative to the data samples while also improving its ability to know when to ask for help. Further, one could expect the human to use their understanding of the system, implicitly or explicitly, to improve the utility or informativeness of the samples as they observe the progression of the system (Schneider, 2020). In some cases, the samples or information can come as human retellings of past experience (Kreminski et al., 2019). This can also be reversed in the sense that the system can be designed to improve the types of responses in order to guide the human and improve the information received or queries of the user (Villareale and Zhu, 2021).

Systems can also be designed to learn to work with the human as a team, operating in tandem. In such cases, it is often desired to have the system augment the abilities of the human or maintain autonomous control over an aspect of the task which would be more challenging for the human (and is less critical with respect to the larger goal). For instance, AI systems can be trained to assist the human in a navigation task (Reddy et al., 2018). In such a case, it is observed that there are aspects of the problem which are much easier for either the human or the AI, so sharing the responsibilities allows for improved performance over either working independently. For more direct assistance to the human, in (Morrison et al., 2021) we see an AI system being used to augment the senses of the user in order to assist visually-impaired users in a social context. This demonstrates a case in which the AI is much more closely integrated into the sensory systems of the human and is intended as an unobtrusive augmentation of their sensory capabilities. From another perspective, the goal could also be a system which can operate independently of the human as an additional agent in the environment (Wang et al., 2020d). In these cases, the agent is expected to respond to the observed behavior of the human in order to assist or avoid interfering as both work to achieve their tasks. In either case, it is important to consider how the two (or more) are expected to interact and respond to observed behavior (Zahedi and Kambhampati, 2021; Gao et al., 2021).

Another important aspect of hybrid intelligence would be the consideration needed for when and how to delegate control between the human or artificial/autonomous system, or determining which tasks can be performed without human intervention (Ning et al., 2021; Raghu et al., 2019). There needs to be graceful handling in the event the human and AI system are both attempting to control the system. Additionally, there needs to be clear guidance regarding when either should be in control. For instance, it is important to understand the reliability of the system and where it is likely to encounter errors. Further, it is important to understand how the human perceives the error behavior of the system in order to anticipate its impact on the interaction (Bansal et al., 2019).

To support understanding and learning for the AI systems, it is also important to support methods for representing the observed behavior from the human in a manner which can be tractable and potentially simplify learning. As an example, (Xie et al., 2020) encodes the observations in a latent space and then learns a model of behavior corresponding to the latent representation. Such an encoding can allow the agent to abstract the observations to support connections between similar observations as well as other benefits. Additionally, systems require the means to observe and adapt to humans (Puig et al., 2020; Schatzmann et al., 2006). An important example is human-aware robot navigation (Möller et al., 2021; Mavrogiannis et al., 2021) and further human-robot interaction scenarios (Semeraro et al., 2021). In the navigation context, the motion, goals, and general behavior are crucial for the robot to successfully navigate the environment. Similarly, models can be generated to estimate or predict the feedback expected from a human collaborator or teacher in order to boost training of the AI system (Navidi and Landry, 2021; Navidi, 2020).

2.2. Application areas

2.2.1. Robotics

Robotics can utilize demonstrated human capabilities as well as behaviors to learn skills or behavioral policies. For instance, humans can provide demonstrations of behavior for a robot learner using a policy-gradient RL policy, which can then be improved through practice and exploration by the learner (Akbulut et al., 2021). Similarly, humans demonstrate the ability to adapt skills to variations and new scenarios. This adaptability has been explored in RL agents in robotics to enable generalization from initial demonstrations or exploration (Julian et al., 2020; Ramaraj et al., 2021), which enables aspects of learning from demonstrations and an interactive learning process as seen in Imitation Learning, Inverse Reinforcement Learning, and Active Learning. Robot agents can also learn to anticipate the movement of humans or other artificial agents in order to compensate for their movements or rendezvous at a later position (Wang et al., 2020c; Mavrogiannis et al., 2021; Möller et al., 2021). This concept can also be extended to further topics in human-robot collaboration (Semeraro et al., 2021). Additionally, robot vision can be designed to replicate models of human vision mimicking foveation (Baron et al., 1994), which allows agents to observe visual stimuli with attention and focus similar to humans. These topics demonstrate how robots can be designed to learn from humans and also learn to operate in an environment alongside humans. The approaches for these solutions span multiple disciplines, including ones described in this paper (e.g. RL, Inverse Reinforcement Learning (IRL), etc.).

2.2.2. Driver Prediction and Autonomous Driving

There have been numerous examples of research performed to model and predict behavior in a driving scenario (Kolekar et al., 2021). These topics include methods which model pedestrian behavior (Choi et al., 2019) or the behavior of other drivers (Fernando et al., 2020; Bhattacharyya et al., 2020; Chandra et al., 2020). This type of modeling serves to train autonomous vehicles how to successfully drive while predicting or compensating for the behaviors of others through the use of models based on Inverse Reinforcement Learning, Imitation Learning, and related. This predictive power is essential so systems can anticipate and react to the non-uniform nature of human behavior. In addition to modeling the behavior of other drivers, systems have been investigated which can model the vehicle control behavior of drivers (Pentland and Liu, 1999). This allows for comprehension and modeling of driver control movements when operating a vehicle.

2.2.3. AI in Games and Teaching

In the area of video and serious games, AI is being considered with respect to multiple aspects. In one aspect, researchers and developers are investigating techniques to integrate AI into game development and game character behavior (Young et al., 2004; Xia et al., 2020; Yannakakis and Togelius, 2018; Bontchev et al., 2021; Zhao et al., 2020). This allows for levels, players, etc. to be generated or controlled by adaptive behavior models to create a broader scope of experiences. These examples demonstrate uses of approaches such as planners (see Section 4), Reinforcement Learning, and related to improve the performance, adaptability, or creation of games. Non-Player Character (NPC) behavior can be supported by these techniques to generate policies based on historical gameplay data or learned models of play. This can be from learned models of behavior or based on past human player choices (de Almeida Rocha and Duarte, 2019). NPCs with these characteristics could prove better suited to respond to player choices and allow for more options. Similarly, the generation of levels and scenarios can also be expanded by allowing for more dynamic combinations of resources by the system.

3. Learning Human Behaviors by Experience and Feedback

In the following sections, we will discuss methods which learn patterns of behavior by exploration and observation of feedback. In Section 3.1, we discuss learning agents which perform RL by observing rewards which denote the desirability of actions in a particular state or context. Extending this concept of feedback-based learning, we discuss IRL in Section 3.2. These learning agents develop a model of behavior in an attempt to replicate the observed behavior of others. The agent attempts to estimate the feedback which generated the behavior observed and then learn a corresponding model of behavior. Similarly, in Imitation Learning (IL) (see Section 3.3), agents also attempt to replicate observed behavior. In this case, the learner does not attempt to replicate the feedback, but instead attempts to directly learn a model of behavior matching the observations. The final section, Section 3.4, discusses Active Learning, i.e., a learning paradigm in which the agent is able to query a teacher for feedback regarding a subset of input values. This allows the learner to learn based on an estimated confidence regarding different inputs and to utilize the expertise of the teacher to update their knowledge.

3.1. Reinforcement Learning

3.1.1. Relevant survey(s):

For relevant survey papers and related, please refer to (Schatzmann et al., 2006).

3.1.2. Principles and Definitions

RL is a method by which situations are mapped to actions so as to maximize a reward signal (Sutton and Barto, 2018). The maximization is performed by the learning agent through exploration of an environment and the possible actions. This exploration generates a feedback signal via rewards, which the agent uses to learn the behaviors resulting in the most desirable feedback. The feedback received can be provided immediately or can be a result of a sequence of actions. For example, an agent could receive a reward for each step it takes in an environment, or simply a single reward at the end of a training session which corresponds to the outcome. The different parameters of the problem lead to numerous techniques for learning optimal behavior policies.

Markov Decision Process

In RL, agents are attempting to solve a sequential decision process, which is represented by a Markov Decision Process (MDP). This representation allows the components of a scenario to be formally modeled with underlying assumptions, such as dependence on past states. An MDP can be represented by the tuple $\{S,A,T,R,p(s_{0}),\gamma\}$ (Puterman, 2014). $S$ refers to the states of the environment, which can be traversed by executing actions from $A$. The execution of actions causes a transition between states, which follows the transition probabilities $T:S\times A\rightarrow p(S)$. As a means for feedback, the reward function $R:S\times A\rightarrow\mathbb{R}$ provides reward signals based on the selected action. Note, $R$ can also be defined with the inclusion of the resulting state, $R:S\times A\times S\rightarrow\mathbb{R}$. Additionally, $p(s_{0})$ defines the probabilities over initial states and $\gamma$ defines the discount parameter (defined in Section 3.1.3).
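
To make the tuple concrete, the following minimal Python sketch encodes a tiny MDP and samples one transition from it; the state and action names, probabilities, and rewards are all illustrative assumptions rather than any specific benchmark.

import numpy as np

# A toy two-state MDP {S, A, T, R, p(s0), gamma}; all names and values are illustrative.
S = ["s0", "s1"]
A = ["stay", "move"]

# T[(s, a)] is a probability distribution over next states.
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.5, ("s1", "move"): -1.0}

p_s0 = {"s0": 1.0, "s1": 0.0}   # initial-state distribution p(s0)
gamma = 0.95                    # discount parameter

rng = np.random.default_rng(0)

def step(s, a):
    """Sample a transition: return (next_state, reward)."""
    nxt = T[(s, a)]
    s_next = rng.choice(list(nxt.keys()), p=list(nxt.values()))
    return s_next, R[(s, a)]

print(step("s0", "move"))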

3.1.3. Policies and Learning

Given a scenario or MDP, an agent can be trained to find a policy $\pi:S\rightarrow p(A)$ which defines a likelihood of actions given the current state. An agent’s policy is learned through trial and error by exploring the given MDP and observing the rewards $r\in R$ provided as output. A policy is based on an estimate of action utility derived from past observations and estimated trajectories. The agent can generate an estimate of discounted return based on a discounted sum of future rewards:

(1) G_{t}=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}

where $R_{t+k}$ denotes the reward observed $k$ time steps after time $t$ and $\gamma$ is the discount parameter. This describes a measure of estimated return based on observations of trajectories. The value of $\gamma$ determines how quickly the scale of future rewards decays, which impacts how strongly those observations impact the estimates. With this method of estimating returns, agents can generate a model of the likely utility of actions in states. This concept is used to define a value function $\nu(s)$ given state $s$, where the assumption is that the agent starts in state $s$ and executes actions according to its policy $\pi$ in following states. The distinction in this case is the fact that the estimate is based on estimated behavior determined by a policy $\pi$. This means that the estimated value will consider the estimates of future values given the current state and expected trajectory of future states. In (Sutton and Barto, 2018), $\nu(s)$ is defined as:

(2) \nu_{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}|S_{t}=s\right],\forall s\in S

Similarly, the action value function is used to estimate the value of executing an action $a$ in a given state $s$:

(3) q_{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}|s_{t}=s,a_{t}=a\right],\forall s\in S
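
As a small worked example of the discounted return underlying Equations 1-3, the sketch below computes $G_t$ for a short, arbitrary reward sequence; the value and action-value functions above are the expectations of exactly this quantity under a policy.

def discounted_return(rewards, gamma=0.9):
    # G_t = sum_k gamma^k * R_{t+k+1}, truncated to the observed horizon
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three rewards observed after time t: 1.0, 0.0, 2.0
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62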

These are fundamental equations with respect to behavior policy learning via RL. Learning methods can be constructed which provide a means to learn a representation of value following Equation 3. This is done by utilizing the trial-and-error process of RL. Agents explore the environment and select actions (following a learning scheme, e.g., epsilon-greedy) in order to observe the effect of actions. The effect is reflected in the rewards observed following an action or sequence of actions. These observed rewards are used to incrementally improve the estimated utility of actions, refining the state-action value estimate using immediate rewards and the current estimate of value. This cycle is used to learn a policy through the feedback loop created by taking an action, observing an outcome, and updating the policy of actions accordingly. As an example, a temporal difference method can be used in Q-Learning to learn a state-action value function:

(4) Q_{t+1}(s,a)=(1-\alpha)Q_{t}(s,a)+\alpha[r+\gamma\max_{a^{\prime}}Q_{t}(s^{\prime},a^{\prime})]

where $\alpha$ is the learning rate, which controls how strongly the current estimate is weighted against the newly observed information in the update. As can be seen, there is a recursive relationship between the current state’s utility and the value of future states in the discounted $\gamma\max_{a^{\prime}}Q_{t}(s^{\prime},a^{\prime})$ term. This establishes the relationship between the value of the current state and the value of future states under the assumption of following the current policy.
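
A minimal tabular sketch of the update in Equation 4, paired with the epsilon-greedy selection scheme mentioned above; the environment itself is left abstract, and the state/action indexing and transition values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """Explore with probability eps, otherwise exploit the current estimate."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q_{t+1}(s,a) = (1 - alpha) * Q_t(s,a) + alpha * [r + gamma * max_a' Q_t(s',a')]."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

# Q-table over 5 states and 2 actions; one illustrative transition (s=0, a=1, r=1.0, s'=2).
Q = np.zeros((5, 2))
a = epsilon_greedy(Q, s=0)
q_update(Q, s=0, a=1, r=1.0, s_next=2)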

In RL, the method for defining and finding optimal behavior depends on the nature of the elements of the observation available to the agent. In the most general case, such as the Q-Learning example above, the agent observes the current state and selects an action according to $\pi$. Typically, this is done by identifying $\operatorname*{argmax}_{a}q_{\pi}(s,a)$. An important aspect of basing decisions on discounted future utilities is the possibility of MDPs in which a myopic view could result in sub-optimal behavior. For instance, Figure 2 provides an example of how too nearsighted a view of state values would lead an agent to bias its behavior toward the path from state 4 to state 1, with a sum of rewards equal to 12, rather than the optimal choice from state 4 to state 7, with a sum of rewards equal to 21. Such a scenario can be quite common and is one of the many challenges for successful policy learning in RL.

Figure 2. Sample Markov Decision Process showing issue with myopic view. State values are in parentheses below state labels.

RL contains numerous examples of constraints which differentiate families of problems. For instance, it is possible to have an environment which is only partially observable (Shani et al., 2013). In this case, not all aspects of a state are observable to an agent, so there is uncertainty regarding the exact state the agent is occupying. This results in a need for the agent to estimate the current state and use that estimation in the updates to the value function. In another case, the actions may not lead to an outcome with 100% certainty. An example is a stochastic gridworld problem (e.g., OpenAI Gym FrozenLake (Brockman et al., 2016)). In this scenario, agents try to navigate a 2D world by moving up, down, left, or right. When an agent selects an action, there is a non-zero probability that the agent moves to an unintended state. For example, an “up" move could in fact move the agent to the left. In this case, the learning method must support this uncertainty. It is worth noting that the previous examples represent scenarios supported by tabular methods. The use of RL has also been extended to continuous cases, and there are methods utilizing Deep Learning. This is referred to as Deep RL and is commonly used in the case of continuous and/or large domains such as robotic control or video games (Arulkumaran et al., 2017).
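
The sketch below illustrates the kind of stochastic transition described above: the intended move is executed only with a certain probability, otherwise a random move "slips" in. The grid size and slip probability are illustrative assumptions, not the FrozenLake specification.

import numpy as np

rng = np.random.default_rng(0)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def slip_step(pos, action, grid_size=4, slip_prob=1/3):
    """Move on a grid; with probability slip_prob a random action is executed instead."""
    if rng.random() < slip_prob:
        action = str(rng.choice(list(ACTIONS)))
    dr, dc = ACTIONS[action]
    row = min(max(pos[0] + dr, 0), grid_size - 1)
    col = min(max(pos[1] + dc, 0), grid_size - 1)
    return (row, col)

print(slip_step((0, 0), "down"))   # usually (1, 0), but sometimes the agent slips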

3.1.4. Additional Relevant Results

RL is an extensively studied and applied research topic. Agents have been trained to learn how to utilize attention or contextual information to improve policy learning (Salter et al., 2019; Oroojlooy et al., 2020). Agents can also be trained to utilize memories and past experiences in a more human-like manner in the hopes of solving credit assignment, handling sparse rewards, handling non-stationarity, comprehending relational knowledge, etc. (McCallum, 1996; Hung et al., 2019, 2019; Lillicrap and Santoro, 2019; Lansdell et al., 2019; Padakandla et al., 2020; Wang et al., 2020a; Džeroski et al., 2001; Forbes and Andre, 2002). Similarly, RL agents can learn techniques to associate aspects of past experiences to contextual information for decision-making (Fortunato et al., 2019; Li et al., 2020).

The use of past experience can also be extended to other agents or humans in the environment. An agent can learn to perform simultaneous control (i.e. shared autonomy) using RL methods (Reddy et al., 2018) or the observed behavior of others can be included in the observation space to allow for action selection conditioned on likely behavior of others (Xie et al., 2020). An additional aspect of comprehending others is the prediction of behavior or goals, as is demonstrated in (Nguyen and Gonzalez, 2020a, b). Further, past experiences or behavior can be utilized as a means to predict likely future outcomes in order to guide agent behavior (Sun et al., 2019; Zhang et al., 2020).

3.2. Inverse Reinforcement Learning

3.2.1. Relevant survey(s):

For relevant survey papers and related, please refer to (Arora and Doshi, 2021; Fernando et al., 2020).

3.2.2. Principles and Definitions

IRL is a method by which an agent learns from examples of behavior without access to the underlying reward function motivating the behavior. The key distinction in this case is that the agent is trying to replicate or approximate the reward function $R_{E}$ or policy $\pi_{E}$ that caused the exemplar behavior. This results in effectively needing to learn a reward while simultaneously attempting to learn an optimal behavior policy under the current estimated reward function. As such, the agent is performing two interdependent tasks. Given a policy $\pi_{E}$ or a set of $N$ demonstrated trajectories

\mathcal{D}=\{\langle(s_{0},a_{0}),(s_{1},a_{1}),\dots,(s_{j},a_{j})\rangle_{i=1}^{N}:s_{j}\in S;a_{j}\in A;i,j,N\in\mathbb{N}\}

the agent is tasked with learning a representation which could explain the observed behavior (Arora and Doshi, 2021).

Generally speaking, there are numerous methods and approaches with respect to IRL; we will be unable to address all of them in this section. We will instead provide some preliminary examples to give some intuition regarding the common techniques and underlying principles. One method for the IRL task is that of apprenticeship learning (Abbeel and Ng, 2004). In this case, there is an assumed vector of state-related features $\phi:S\rightarrow[0,1]^{k}$ which supports the reward $R^{*}(s)=w^{*}\cdot\phi(s)$ with weight vector $w^{*}$. The feature vectors $\phi$ refer to observational data corresponding to the states (e.g., a collision detected flag). Given the definition of reward, the value of a policy $\pi$ can be measured by:

(5) \mathbb{E}_{s_{0}\sim D}[V^{\pi}(s_{0})]=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t})|\pi]=w\cdot\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t})|\pi]

with the initial states being drawn $s_{0}\sim D$ and with behavior following from the policy $\pi$. Then, the feature expectation can be defined as:

(6) \mu(\pi)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t})|\pi]

which is used to define a policy’s value $\mathbb{E}_{s_{0}\sim D}[V^{\pi}(s_{0})]=w\cdot\mu(\pi)$. Given the estimation of feature expectation $\mu(\pi)$, the goal is to find a policy $\tilde{\pi}$ which can best match the observed demonstrations. In order to do so, this requires a comparison between $\tilde{\pi}$ and $\pi_{E}$. Since the policy $\pi_{E}$ is typically not provided, an estimate $\hat{\mu}_{E}$ based on demonstrations is needed. This can be accomplished by an empirical estimate:

(7) \hat{\mu}_{E}=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{\infty}\gamma^{t}\phi(s^{(i)}_{t})

for a given set of trajectories $\{s^{(i)}_{0},s^{(i)}_{1},\dots\}^{m}_{i=1}$.
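
A sketch of the empirical estimate in Equation 7, assuming each demonstration is given as a list of per-state feature vectors $\phi(s_{t})$; the demonstrations below are random placeholders used only for illustration.

import numpy as np

def empirical_feature_expectation(trajectories, gamma=0.9):
    """mu_hat_E = (1/m) * sum_i sum_t gamma^t * phi(s_t^(i))."""
    mu = np.zeros(len(trajectories[0][0]))
    for traj in trajectories:
        for t, phi_s in enumerate(traj):
            mu += (gamma ** t) * np.asarray(phi_s)
    return mu / len(trajectories)

# Illustrative use: m = 5 demonstrations, 10 steps each, k = 3 features in [0, 1].
rng = np.random.default_rng(0)
demos = [[rng.random(3) for _ in range(10)] for _ in range(5)]
print(empirical_feature_expectation(demos))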

Given this structure, algorithms can be defined for iteratively refining the weight vectors and resulting policies. This enables the two components of the IRL paradigm. First, the estimated weight vectors $w^{(i)}$ define an estimated reward function which can be refined by updating the weight vector so as to reduce the discrepancy between the two estimated feature expectations $\hat{\mu}_{E}$ and $\mu(\pi)$ which measure agent performance. Second, the estimated reward with the current weight $w^{(i)}$ enables learning a policy which is optimal for the current estimate of policy value (Equation 5) using RL, which is used in the first step to measure value discrepancy. This provides a cyclical process of reward refinement and policy learning.

It is important to note that the extra step of finding a suitable reward function and policy increases the complexity of the problem. The cycle of improving the reward requires the agent to retrain a behavior policy which reflects the new reward. On the other hand, finding an accurate representation of the reward affords the agent a more general understanding of the behavior. Having access to a reward function allows an agent to understand desirable behavior at an abstract enough level to potentially transfer its understanding to a new environment. Of course, if the approximated reward is not accurate enough, one would expect issues in the resulting behavior. In any case, there is the potential for increased generality of the resulting agent. It is worth mentioning that an exact replication of the underlying reward is not strictly necessary: an affine transformation of the true reward function results in an equivalent policy (Arora and Doshi, 2021).

3.2.3. Applications and Recent Results

As an example of IRL, we refer to (Yang et al., 2020a), in which the authors try to learn a model replicating human gaze patterns when searching for a visual target. The authors propose the use of imagery data with human gaze fixations annotated. They used a simulated fovea to learn a plausible model of objects likely to attract attention. This is used to model how a human’s gaze shifts around an image while attempting to locate an object within it. The approach showed an improved ability to identify particular objects of interest rather than simply generating a saliency map to demonstrate attention.

In their method, they utilize an approach they refer to as Dynamic-Contextual-Belief (DCB), which is composed of three components: fovea, contextual beliefs, and dynamics. The fovea serves to mimic human vision by providing only a sub-region in high detail while blurring or masking the remaining region. The masking results in a reduction of the observation space, limiting the input to a portion of the input space. In their approach, the masking is used to select a sub-region of the image to represent in high resolution while the remaining regions are represented using a blurred version of the image. This approximates the effect of the fovea on human vision and is used to represent the fixation of the observer. The contextual beliefs are used to represent a person’s understanding of the contents of the scene, such as objects and background items of an image. Lastly, the dynamics component collects information regarding the focal fixations during search. The dynamics are represented as a transition between versions of the image which are updated based on iterations of fixation. Each region which receives fixation is replaced by its high-resolution representation, resulting in a transition represented by:

(8) B_{0}=L;\quad B_{t+1}=M_{t}\odot H+(1-M_{t})\odot B_{t}

where $B_{t}$ represents the belief state after $t$ fixations and $M_{t}$ is the mask generated by the $t^{th}$ fixation. $L$ and $H$ are belief maps which represent object and background locations for low-resolution and high-resolution images, respectively. Based on this definition, we can see that the representation of the image and the beliefs $B_{t}$ regarding contextual information and item locations are updated based on the iterative search conducted by moving the fixation around the image.
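
The update in Equation 8 amounts to compositing the high-resolution belief map into the current belief wherever the fixation mask is active. The numpy sketch below uses small random maps as stand-ins for $L$, $H$, and the masks $M_{t}$; the sizes and fixation locations are illustrative assumptions, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)
h, w = 8, 8                  # illustrative belief-map size
L = rng.random((h, w))       # low-resolution belief map (placeholder)
H = rng.random((h, w))       # high-resolution belief map (placeholder)

B = L.copy()                 # B_0 = L
for t in range(3):           # three illustrative fixations
    M = np.zeros((h, w))
    M[2 * t:2 * t + 2, 2 * t:2 * t + 2] = 1.0   # mask of the region fixated at step t
    B = M * H + (1 - M) * B                     # B_{t+1} = M_t * H + (1 - M_t) * B_t (elementwise)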

To represent the reward and learn a policy representing visual search behavior, the authors utilize Generative Adversarial Imitation Learning (GAIL). The GAIL framework utilizes the adversarial paradigm with networks representing the discriminator $D:S\times A\rightarrow(0,1)$ and generator $G$. These are used to train the system to generate data which matches the patterns of the original sampled data. The discriminator is tasked with learning to distinguish between real human data representing beliefs and fixation transition actions versus data which is generated artificially by $G$. The generator $G$ is tasked with learning to generate beliefs and actions which are convincing enough to the discriminator $D$ so as to be labeled as real rather than artificially generated. This is accomplished by maximizing an objective function:

(9) \mathcal{L}_{D}=\mathbb{E}_{r}[\log(D(S,a))]+\mathbb{E}_{f}[\log(1-D(S,a))]-\gamma\mathbb{E}_{r}[||\nabla D(S,a)||^{2}]

where $D(S,a)$ is the output of $D$ given the state-action pair $(S,a)$, $\mathbb{E}_{r}$ refers to the expectation over real state-action pairs, and $\mathbb{E}_{f}$ refers to the expectation over fake search transition samples from $G$. The gradient term at the end of Equation 9 serves to improve the convergence rate. The definition of the reward is based on the output of the discriminator:

(10) r(S,a)=\log(D(S,a))

The generator uses Equation 10 as its reward and is tasked with maximizing:

(11) \mathcal{L}_{G}=\mathbb{E}_{f}[\log(D(S,a))]=\mathbb{E}_{f}[r(S,a)]

which shows that the higher the likelihood of being real the discriminator assigns to generated data, the higher the resulting reward will be. To find an RL policy for the generator, the authors utilize Proximal Policy Optimization with the following representation:

(12) \mathcal{L}_{\pi}=\mathbb{E}_{\pi}[\log(\pi(a|S))A(S,a)]+H(\pi)

where the advantage function $A$ is estimated using generalized advantage estimation (GAE). The advantage represents the gain observed by taking action $a$ versus the policy’s default behavior. This definition of loss for the policy learning helps guide the learner toward actions which will result in higher advantage over other actions. The term $H$ is the max-entropy IRL term $H(\pi)=-\mathbb{E}_{\pi}[\log(\pi(a|S))]$, which helps improve convergence.
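
To show how Equations 10-12 fit together numerically, the sketch below turns discriminator outputs into rewards and evaluates the generator and policy objectives for a batch of sampled actions; all probabilities and advantages are synthetic placeholders rather than the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

d_out = rng.uniform(0.05, 0.95, size=32)     # D(S, a) for generated state-action pairs
rewards = np.log(d_out)                      # Eq. 10: r(S, a) = log D(S, a)

pi_a = rng.uniform(0.05, 0.95, size=32)      # pi(a|S) for the sampled actions
advantage = rng.normal(size=32)              # A(S, a), e.g., from GAE
entropy = -np.mean(np.log(pi_a))             # H(pi) estimated from the samples

generator_objective = np.mean(rewards)                           # Eq. 11 (sample mean)
policy_objective = np.mean(np.log(pi_a) * advantage) + entropy   # Eq. 12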

To test the approach, the authors utilized the COCO-Search18 dataset. Five criteria were imposed when selecting the target-present (TP) images. First, no images portraying an animal or person were used, so as to avoid the known strong biases toward these categories. Second, each image was required to contain only a single instance of the target item. Third, the size of the target item (measured by bounding box) had to be within $[0.01,0.1]$ of the area of the image. Fourth, the target had to lie outside the center of the image, which was determined by excluding items whose bounding box overlapped with a region in the center. Lastly, the image dimensions had to have a ratio in $[1.2,2.0]$ to allow for their display screen. Further filtering was performed; the reader can refer to the original text for more details.

The algorithm was tested against relevant baselines as well as against human performance. The authors utilized results from 10 participants who viewed the 6,202 images in the dataset, with eye tracking performed during their search task. The algorithms compared against were: a random scanpath, a ConvNet detector, fixation heuristics, a Behavior Cloning CNN, and a Behavior Cloning LSTM. The results show their method meeting or exceeding the performance of the compared algorithmic methods, and demonstrate the relationship between the number of fixation transitions made before finding the target and the success rate of the searches.

Additional tests were performed to test additional aspects, which can be found in the original text. As was demonstrated, this approach was able to successfully approximate the behavior demonstrated by the humans and also succeed at the identification tasks.

Additional Relevant Results

As demonstrated, IRL provides an interesting approach to behavior modeling and replication. Agents can utilize observations of exemplar behavior in order to learn aspects of the desired model. For instance, agents can learn actions or skills from observed imagery via the inclusion of a learned cost function (Das et al., 2020; Wang et al., 2020b) or by clustering observations into skills (Cockcroft et al., 2020). Other recent results have focused on analytical/theoretical aspects of the problem relating to reward function search or aspects of the behavior policy providing the demonstrations (Balakrishnan et al., 2020; Kalweit et al., 2020; Ni et al., 2020; Ramponi et al., 2020). IRL can also be extended to methods which learn in a multi-task setting (Eysenbach et al., 2020) or observe policies of multiple other agents to operate and learn in a Multi-agent Reinforcement Learning (MARL) context (Reddy et al., 2012; Gruver et al., 2020). Similar to the example in the previous section, IRL also proves useful in learning components of behavior leading to outcomes observed. For instance, learners can track or identify patterns which allow them to predict driver behavior (Wu et al., 2020), simulate actions backward in time to predict likely trajectories leading to the current state (Lindner et al., 2021), or predict gaze patterns for wheelchair drivers (Maekawa et al., 2020).

3.3. Imitation Learning

3.3.1. Relevant survey(s)

For relevant survey papers and related, please refer to (Hussein et al., 2017).

3.3.2. Principles and Definitions

IL is a process which attempts to reproduce behavior given by experts in order to learn a pattern of behavior under a given task (Osa et al., 2018). At an abstract level, IL is closely related to IRL. In both cases, the goal is to utilize observations of behavior to train a behavioral model or policy. The key distinction comes in the structure of the learning process. In IRL, the learning process attempts to learn a suitable reward function which aligns with the demonstrated behavior. Additionally, the learning process uses the learned reward to generate a policy of behavior. The learned policy of behavior should closely model the demonstrated behavior. For IL, the learning process is not designed to generate a reward function which fits the demonstrations; instead, the process attempts to directly learn a model of behavior which best fits the demonstrated behavior.

As a demonstration of the principles, IL can be formulated as follows. Given a trajectory $\tau=[\phi_{0},\dots,\phi_{T}]$ of features $\phi$, the learner is tasked with learning a policy reproducing the behavior. The vectors $\phi$ represent the features of the environment or system at each stage of the trajectory. The context, or state, $s$ of the system can be used in conjunction with an optional reward parameter $r$ as components of the underlying optimization problem.

With a set of demonstrations $\mathcal{D}=\{(\tau_{i},s_{i},r_{i})\}_{i=1}^{N}$ of trajectories $\tau$, contexts $s$, and rewards $r$ (note: the reward values $r_{i}$ might not be provided), the goal is to find a policy $\pi^{*}$. The policy should minimize the discrepancy between the distribution of features from the expert $q(\phi)$ and the distribution of features from the learner $p(\phi)$:

(13) \pi^{*}=\operatorname*{argmin}_{\pi}D(q(\phi),p(\phi))

where $D$ is a discrepancy measure such as the Kullback-Leibler divergence. A key feature of Equation 13 is the fact that it promotes the alignment of behavior through direct observation by penalizing distributions with large discrepancies from the demonstrations. This goal of low discrepancy guides the policy search toward those policies which best fit the observed distribution over features. Overall, the method for building the policy will vary with each approach, but the underlying principle of direct replication is still a key component.
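
A sketch of the selection criterion in Equation 13: given an expert feature distribution $q(\phi)$ and the feature distributions induced by a few candidate policies, pick the candidate with the smallest KL divergence. All distributions here are hypothetical histograms used only to illustrate the criterion.

import numpy as np

def kl_divergence(q, p, eps=1e-12):
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

q_expert = np.array([0.5, 0.3, 0.2])            # expert feature distribution q(phi)
candidates = {                                  # p(phi) induced by candidate policies
    "pi_1": np.array([0.7, 0.2, 0.1]),
    "pi_2": np.array([0.45, 0.35, 0.2]),
}
best = min(candidates, key=lambda name: kl_divergence(q_expert, candidates[name]))
print(best)                                     # the candidate minimizing D(q, p)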

3.3.3. Applications and Recent Results

In (Liu et al., 2019), the authors demonstrate an IL method which is capable of performing policy improvement through the creation of examples without domain knowledge from a human. This allows the learner to perform policy improvement which may avoid the biases of the human domain knowledge commonly observed in IL tasks. To accomplish this, they define a policy improvement operator and methods for generating behavior exemplars. Note, this method differs from IRL as it is attempting to directly learn and refine a policy to fit the observed behavior without learning from an estimated reward function.

Preliminary Definition and Theory

The following defines the general structure of the policy improvement operator $I$:

(14) \pi^{\prime}=I(\pi,V_{\pi})

$I$ denotes an operator producing an improved policy $\pi^{\prime}$ given a current policy $\pi$ and value function $V_{\pi}$. The improvement operator is described as a general black-box operation, which means it will support multiple approaches. The key aspect from a theoretical standpoint is that the constraint of policy improvement is satisfied, which means a newly generated policy must provide an improved estimated value. To denote this concept, they define the policy order $\pi\succ\pi^{\prime}$ for policies $\pi,\pi^{\prime}$ when $V_{\pi}(S)>V_{\pi^{\prime}}(S)$. This definition is used to prove that the proposed approach will converge to an optimal policy, given that the operator by definition must output a policy with higher value (i.e., the constraint of policy improvement). In other words, this holds in the event that the policy improvement operator satisfies $I(\pi,V_{\pi})\succ\pi$.

With a policy defined using the policy improvement technique, the method can then provide improved demonstrations $\tau_{\pi^{\prime}}$ for the IL process to create an updated policy $\pi^{\prime}$. The learning method utilizes a loss function which measures the divergence between the current policy and the improved policy, $L_{I}=D_{KL}[I(\pi,V_{\pi})||\pi]$. In the case where the divergence is high, the loss will be high and there is still room for an improved policy. Therefore, the utility is based on this measure of divergence to promote policy improvement. To estimate value, they implement a Deep Neural Network (DNN) for function approximation to provide an estimated value function given a state $s$: $\hat{V}(\theta,s)=DNN_{\theta}(s)$ (for an arbitrary $DNN_{\theta}$, e.g., CNN, ResNet, etc.). They also define a policy estimator $\hat{\pi}(a|s,w)=DNN_{w}(s)$. With these components, they define their approach, self-improving Reinforcement Learning (SI-RL). Based on the above definitions and functions, they define the loss function of the policy improvement operator as:

(15) L_{I}(w,\theta,s)=D_{KL}[I(\hat{\pi}(\cdot|s,w),\hat{V}(\theta,s))||\hat{\pi}(\cdot|s,w)]

with updates:

(16) w_{k+1}=w_{k}-I(\hat{\pi}(\cdot|s,w),\hat{V}(\theta,s))\nabla_{w}\log(\hat{\pi}(\cdot|s,w))
(17) \theta_{k+1}=\theta_{k}I(\hat{\pi}(\cdot|s,w),\hat{V}(\theta,s))\nabla_{\theta}\log(\hat{\pi}(\cdot|s,w))+\nabla_{\theta}||\hat{V}(\theta,s)-R||^{2}

for reward $R$.
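
The improvement operator itself is treated as a black box; as one hedged illustration, the sketch below uses a simple "sharpen the policy toward higher estimated action values" operator and evaluates the KL term of Equation 15 between the improved and current action distributions. The operator choice and all numbers are assumptions for illustration, not the authors' implementation.

import numpy as np

def softmax(x):
    z = np.asarray(x, float)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def improvement_operator(pi, q_values, temp=0.5):
    """One possible I(pi, V): re-weight the policy toward higher-value actions."""
    return softmax(np.log(np.asarray(pi) + 1e-12) + np.asarray(q_values) / temp)

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

pi_current = np.array([0.4, 0.4, 0.2])     # current policy over 3 actions
q_est = np.array([1.0, 2.0, 0.5])          # estimated action values (stand-in for V_pi)
pi_improved = improvement_operator(pi_current, q_est)
loss_I = kl(pi_improved, pi_current)       # L_I = D_KL[ I(pi, V_pi) || pi ]  (Eq. 15)
print(pi_improved, loss_I)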

Training via GAN

As an alternative to the use of the KL-divergence, the authors propose integrating a Generative Adversarial Network (GAN) into the training process for the imitation module. This integration is performed by defining a discriminator network $D$ which assigns low loss to improved policies and high loss to the initial policies. The generator $G$ is then tasked with generating policies which incur a low loss. This allows the system to construct policies which improve performance, while the discriminator guides the construction by discouraging policies which do not improve over past performance. This is formalized by:

(18) \min_{w}\max_{D}\mathbb{E}_{\pi_{w}}[\log D_{\psi}(s,a)]+\mathbb{E}_{\pi^{\prime}}[\log(1-D_{\psi}(s,a))]

where $\pi_{w}$ is the trained policy, $\pi^{\prime}$ is the improved policy, and $D_{\psi}(s,a)$ is the output of the discriminator network. The use of a GAN is also seen in the example application of Section 3.2, but the two differ in which components of the approach are modeled. In this case, the GAN is used specifically to learn how to generate better policies to meet the requirements of the improvement operator.

Results

As noted in their proposed approach, the method defined supports a broad class of DNN-based systems for policy improvement. In their experiments, the authors test the use of trust region policy optimization (TRPO), Monte Carlo tree search (MCTS), and a cross entropy method (CEM). For test scenarios, the authors performed tests using the game Gomoku, miniRTS, and Atari games.

The game Gomoku is a zero-sum game, and the authors test their MCTS-based implementation in 3050 games. The agent is trained via self-play and tested at intervals against different levels of MCTS opponents. As demonstrated, their agent is able to compete successfully against varying levels of opponents at the different training stages.

The next scenario tested, miniRTS, provides a rule-based opponent for testing. The authors tested their SI-GARL and SI-RL approaches against the miniRTS opponent and compared performance to DQN, REINFORCE, and A3C implementations. The results demonstrate strong performance against the miniRTS opponent. To further test their approach, the authors played their trained models against the trained models from the DQN, REINFORCE, and A3C implementations. These results again show strong performance.

Additional Relevant Results

Similar to IRL, IL provides a useful means by which exemplar behavior can be modeled or replicated. Beyond the above example, additional scenarios include uses of IL for training of non-player characters in video games (Borovikov et al., 2019) or modeling driver behavior (Bhattacharyya et al., 2020). Similarly, this can be applied to observed behavior in video games (Ross et al., 2011). Additionally, as an example, multiple techniques have been investigated regarding the improvement or efficiency of agent training (Zolna et al., 2019; Chen et al., 2020; Niu and Gu, 2020; Holzinger et al., 2019). This can include more efficient data sampling via AL-based or human-in-the-loop improvements (Niu and Gu, 2020; Holzinger et al., 2019) or improved selection of exemplars via the estimate of utility (Chen et al., 2020).

3.4. Active Learning

3.4.1. Relevant survey(s)

For relevant survey papers and related, please refer to (Settles, 2009; Ren et al., 2020; Najar and Chetouani, 2021; Ramaraj et al., 2021).

3.4.2. Principles and Definitions

AL is intended as an AI paradigm by which the learner is able to query an oracle regarding unlabeled data. The goal is to enable a more efficient usage of potentially costly data by allowing the learner to identify a representation of confidence or understanding in order to determine which items should be prioritized or require input from an oracle (Settles, 2009). This is described as a characteristic similar to the concept of curiosity in humans, since AL demonstrates a motivation to inspect items with less experience or certainty. This relates to how the system decides what it wants to learn more about, rather than being a true reproduction of human cognition. However, it has been argued that humans must regulate their priorities regarding curiosity in order to decide when and what to learn (Ten et al., 2021). When prompted, the oracle can then provide labels for the queried samples, reducing the need to label large datasets prior to the start of training.

There are numerous ways in which the learner can represent its understanding of the data to determine its confidence level. The representation is used to measure which would be the most desirable query. A possible definition could be the use of entropy to measure the confidence:

(19) x^{*}_{ENT}=\operatorname*{argmax}_{x}-\sum_{i}P(y_{i}|x;\theta)\log P(y_{i}|x;\theta)

for labels y_{i}, instance x, and model \theta. This provides an information-theoretic representation of the uncertainty. Specifically, the instance selected for oracle feedback is the one with the largest entropy over the probabilities of assignment to the various labels (as per Equation 19), i.e., the one for which the distribution of probabilities is closest to uniform, thus corresponding to the most uncertain assignment. Another approach is to query the instance with the least confident labeling:

(20) x^{*}_{LC}=\operatorname*{argmin}_{x}P(y^{*}|x;\theta)

where y^{*}=\operatorname*{argmax}_{y}P(y|x;\theta) is the most likely class labeling. This promotes querying the instances whose most likely labeling has the lowest confidence.

The underlying principle in Equation 19 and Equation 20 is a self-aware measure of how confidently the model can relate instances to labels. Utilizing this notion of confidence is what allows systems to exhibit behavior akin to curiosity, guiding the learning process and reducing the reliance on large labeled datasets.
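To make the two query criteria concrete, the following is a minimal sketch (not tied to any specific surveyed implementation) of how Equations 19 and 20 could be applied to a pool of unlabeled instances. It assumes a probabilistic classifier exposing a scikit-learn-style predict_proba method; the names model and unlabeled_pool are placeholders.

# Minimal sketch of the two query strategies in Equations 19 and 20.
# Assumes `model` exposes a scikit-learn-style predict_proba(); `model`
# and `unlabeled_pool` are placeholder names, not from the surveyed work.
import numpy as np

def entropy_query(model, unlabeled_pool):
    """Pick the instance whose predicted label distribution has maximum entropy (Eq. 19)."""
    probs = model.predict_proba(unlabeled_pool)          # shape: (n_instances, n_labels)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argmax(entropy)                            # index of the most uncertain instance

def least_confident_query(model, unlabeled_pool):
    """Pick the instance whose most likely label has the lowest probability (Eq. 20)."""
    probs = model.predict_proba(unlabeled_pool)
    best_label_conf = probs.max(axis=1)                  # P(y* | x; theta) for each instance
    return np.argmin(best_label_conf)

In either case, the returned index identifies the instance to send to the oracle; its label would then be added to the training set before the next query round.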

3.4.3. Applications and Recent Results

As a demonstration of AL, we will discuss the approach outlined in (Navidi, 2020; Navidi and Landry, 2021), in which the authors create a method of active imitation learning in an RL context. The proposed approach diverges from the examples above, but still follows the general principles of AL. In this case, the authors learn a model which can predict the oracle’s responses, both to improve performance and to compensate for potential delays between agent action and oracle response. The prediction model supports the agent during learning and guides the policy learning. Unlike the form of AL described above, the oracle feedback provides labels of good or bad behavior, so the agent’s policy learning phase learns to avoid poorly performing actions rather than querying low-confidence items. This creates a method which combines concepts from IL, AL, and RL. In this way, the behavior of the oracle encodes a model of the human’s preferences with respect to agent behavior. Regarding the underlying algorithm, the authors propose a combination of SARSA and A3C with human-in-the-loop training.

As a means of providing feedback from the teacher, agents observe a binary signal indicating a positive or negative assessment from the human teacher. This is in contrast to other methods which might provide more complicated or graded feedback by utilizing varying levels of good or bad (e.g. [-100,-50,50,100]). In order to compensate for variance in both the reaction time and the frequency of responses from the teacher (e.g. some might tire of giving feedback), the authors propose a feedback prediction system.

The feedback predictor or manager FB is designed to learn and predict the feedback behavior of the human teacher. FB is provided responses from \{-1,1\}, signifying satisfaction (1) or dissatisfaction (-1). The feedback prediction policy is denoted as:

(21) FB(O,A)=\psi^{T}\theta(O,A)

where FB(O,A) denotes the teacher feedback policy, \psi are the policy parameters, and \theta(O,A) represents a density function modeling the human response delay.
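The exact feature construction and training procedure are not reproduced here, but the following hedged sketch illustrates one way a linear feedback predictor of the form FB(O,A)=\psi^{T}\theta(O,A) could be realized. The Poisson-shaped delay weighting stands in for the response-delay density, and the (observation, action) feature vectors in the history are placeholders; none of these choices are taken from the cited work.

# Illustrative sketch (not the authors' exact construction) of a linear feedback
# predictor FB(O, A) = psi^T theta(O, A), where theta weights recent
# (observation, action) feature vectors by an assumed response-delay density.
import math
import numpy as np

def delay_density(k, mean_delay=2.0):
    """Assumed Poisson weighting over how many steps ago a step occurred."""
    return math.exp(-mean_delay) * mean_delay ** k / math.factorial(k)

def feedback_features(history, horizon=5):
    """theta(O, A): delay-weighted sum of recent feature vectors.
    `history` is a non-empty list of numpy feature vectors, most recent last."""
    recent = history[-horizon:][::-1]                    # index 0 = most recent step
    return sum(delay_density(k) * x for k, x in enumerate(recent))

def predict_feedback(psi, history):
    """Predicted teacher feedback in {-1, +1} from the linear model psi^T theta."""
    return 1.0 if float(np.dot(psi, feedback_features(history))) >= 0 else -1.0

def update(psi, history, observed_feedback, lr=0.01):
    """Simple error-driven update of psi when the real (possibly delayed) signal arrives."""
    theta = feedback_features(history)
    error = observed_feedback - float(np.dot(psi, theta))
    return psi + lr * error * theta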

The feedback is then used to guide the training of the RL policies for the agent. The authors test their approach utilizing the SARSA and A3C algorithms. The feedback weights past observations to give a teacher-based representation of the outcomes. These methods are tested in the OpenAI Gym environment, and the authors illustrate results for the Cart-Pole and Mountain-Car environments. In both cases, the proposed methods perform strongly in these environments and against the baseline methods.

Additional Relevant Results

The concept of AL can also be extended to compensate for the presence of multiple teachers with potentially heterogeneous behavior policies (Nguyen and Daumé III, 2020). In such a case, agents can encounter feedback from differing teachers, so the authors aim to learn models representing the policies of the different teachers in order to compensate for each pattern of behavior. Active learning can also be connected to heuristics for choosing between options. As demonstrated in (Parpart et al., 2017), agents can be trained via active learning to replicate human performance as well as human tendencies regarding the decision model the humans likely utilized.

4. Belief and Reasoning Approaches

In the following sections, we will discuss methods which use representations of belief or world knowledge to guide or assist in agent learning. Relating to the concept of cognitive frugality, we discuss Meta-Reasoning and Meta-Learning in Section 4.1. These concepts relate to how skills or knowledge are utilized based on the current context, as well as to how and when cognitive resources should be applied in order to determine which system will select the course of action. Compensating for the behavior or preferences of others is another important aspect of modeling and replicating human behavior. The topic of Theory of Mind (ToM), which we discuss in Section 4.2, attempts to generate a model of the mental states of others in order to anticipate their actions and collaborate with them. Lastly, we discuss methods which integrate a model of knowledge or world dynamics in Section 4.3. These methods attempt to replicate human understanding of the world in order to reason at a level which utilizes these models rather than attempting to learn them indirectly through interactions with the environment.

4.1. Meta-Reasoning and Meta-Learning

4.1.1. Relevant survey(s)

For relevant survey papers and related, please refer to (Costantini, 2002; Gershman et al., 2015; Ackerman and Thompson, 2017; Griffiths et al., 2019; Peterson and Beach, 1967; Khan et al., 2020; Hospedales et al., 2020; Wang, 2021).

4.1.2. Principles and Definitions

Meta-reasoning and meta-learning concern a level of abstraction which allows the algorithm to make determinations at two levels. In meta-reasoning, this is often referred to as ‘reasoning about reasoning’ (Costantini, 2002). Similarly, meta-learning is often referred to as ‘learning to learn’ (Khan et al., 2020). These phrases indicate the core aspect, which is the ability to perform introspection in order to dictate behavior. More concretely, this often relates to an ability to determine how resources or behavior policies will be allocated to perform a reasoning or learning task. This implies a notion of cost or effort regarding the task as well as an ability to identify the most suitable solution/behavior based on the context of the problem. The notion of allocating resources for reasoning systems can also be related to examples in human cognition and the notion of bounded rationality (Zilberstein, 2011), which we discuss in Part II.

Meta-reasoning

As noted in (Russell and Wefald, 1991), real agents are limited in capacity with respect to reasoning. Such a limitation manifests in both computational power and the time to decide and act. Further, the benefit or utility of an action can deteriorate over the time it takes to deliberate or to execute the action. Consequently, there is an implicit trade-off between the cost of an action (deliberation and execution) and its intrinsic utility. The ability to make judgments, consciously or otherwise, regarding the appropriate balance of these factors is a key aspect of meta-reasoning. In the context of computer-generated solutions, the common approach is to either minimize the time to an acceptable solution or maximize the outcome at the expense of time to completion. The ability to find an appropriate balance between these is one of the key motivations for investigating meta-reasoning topics.

From an algorithmic perspective, the meta-reasoning problem can be viewed as a method of optimizing the expected utility of performing an action versus the cost to do so. Following (Griffiths et al., 2019), this can be expressed as

(22) \textrm{VOC}(c,b)=\mathbb{E}_{p(b^{\prime}|b,c)}\left[\max_{a^{\prime}}\mathbb{E}\left[U(a^{\prime})|b^{\prime}\right]-\max_{a}\mathbb{E}\left[U(a)|b\right]\right]-cost(c),

where b is the agent’s current belief, b^{\prime} is the refined belief resulting from executing computation c, and \mathbb{E}\left[U(a)|b\right] is the expected utility of taking action a under utility U over the distribution of outcomes corresponding to belief b. Given the VOC, a rational agent should select the computation which maximizes its value (or perform no further computation and act if \textrm{VOC}<0). Unfortunately, calculating the VOC is costly, so methods need to be developed to approximate it or to define a similar measure of cost versus utility.
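As a simple numerical illustration of Equation 22 (not drawn from the cited works), the sketch below computes the VOC for a discrete belief over two hypothetical world states and a computation that fully resolves the state; the utilities, state names, and cost are invented for the example.

# A minimal numerical sketch of Equation 22 for a discrete setting: beliefs are
# dictionaries mapping hypothetical world states to probabilities, and a
# computation is summarized by the posterior beliefs it may produce.
UTILITY = {                         # assumed utilities U(a | state) for two actions
    "act":  {"good": 10.0, "bad": -20.0},
    "wait": {"good":  0.0, "bad":   0.0},
}

def expected_utility(action, belief):
    return sum(p * UTILITY[action][s] for s, p in belief.items())

def best_value(belief):
    return max(expected_utility(a, belief) for a in UTILITY)

def value_of_computation(belief, posteriors, cost):
    """posteriors: list of (probability, refined_belief) pairs describing p(b'|b,c)."""
    expected_gain = sum(p * best_value(b_prime) for p, b_prime in posteriors)
    return expected_gain - best_value(belief) - cost

# Example: a computation that resolves the state with certainty, at cost 1.
prior = {"good": 0.5, "bad": 0.5}
posteriors = [(0.5, {"good": 1.0, "bad": 0.0}), (0.5, {"good": 0.0, "bad": 1.0})]
print(value_of_computation(prior, posteriors, cost=1.0))   # 4.0 > 0, so deliberating pays off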

Meta-learning

Similar to meta-reasoning, meta-learning relies on an ability to perform introspection in order to understand how best to utilize the resources at hand. In this context, the resources are being applied to learning and utilizing behaviors. As noted in (Hospedales et al., 2020), meta-learning involves improving a learning algorithm over multiple learning episodes, operating at two levels. First, in base learning, an inner learning algorithm learns to solve a task defined by the scenario parameters (e.g. image classification). In the second phase, meta-learning, an outer-level algorithm improves the inner learning algorithm so that the resulting learned model improves one of the outer objectives. There can be multiple outer objectives, resulting in the need for the algorithm to consider which model/algorithm to apply in a given context. As with meta-reasoning, the accrual of information or the time for deliberation can be costly. Consequently, the algorithm requires a method by which it can judge this cost/benefit trade-off (Gershman et al., 2015; Griffiths et al., 2019).

For a more formal characterization, we will refer to the conventions provided by (Vanschoren, 2019). Consider the accrual of behaviors for given tasks t_{j}\in T, where T is the set of all known tasks. Given a set of learning algorithms parameterized/configured by \theta_{i}\in\Theta and evaluation measures P_{i,j}=P(\theta_{i},t_{j})\in\mathbf{P}, a meta-learner L can be trained to predict recommended configurations \Theta^{*}_{new} for a new task t_{new}. In this paradigm, it becomes a matter of learning a function f:\Theta\times T\rightarrow\{\theta^{*}_{k}\}, k=1\dots K, which can generate configurations \theta^{*}_{k}. This then allows for the creation of a portfolio of configurations, which can be associated with the tasks for which they are best suited. This paradigm allows for the generation and selection of configurations based on the experience of the algorithm. As a result, the learner develops an understanding of what to learn and how to learn it.
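The following is a minimal sketch of a meta-learner in this spirit; the nearest-task similarity measure, the task feature vectors, and the averaging rule are illustrative assumptions rather than the formulation of (Vanschoren, 2019).

# Sketch of a simple meta-learner L: it recommends the top-K configurations
# theta_k for a new task by averaging the recorded performance P(theta_i, t_j)
# of the most similar known tasks. Task features and similarity are assumptions.
import numpy as np

def recommend_configs(P, task_features, new_task_features, k_tasks=3, k_configs=2):
    """P[i, j] = performance of configuration i on known task j."""
    # Find the known tasks most similar to the new task (Euclidean distance).
    dists = np.linalg.norm(task_features - new_task_features, axis=1)
    nearest = np.argsort(dists)[:k_tasks]
    # Score each configuration by its mean performance on those tasks.
    scores = P[:, nearest].mean(axis=1)
    # Return indices of the K best configurations (the recommended portfolio).
    return np.argsort(scores)[::-1][:k_configs]

# Example with 4 configurations, 5 known tasks described by 2 features.
rng = np.random.default_rng(0)
P = rng.uniform(size=(4, 5))
task_features = rng.uniform(size=(5, 2))
print(recommend_configs(P, task_features, new_task_features=np.array([0.2, 0.7])))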

4.1.3. Applications and Recent Results

In (Lange and Sprekeler, 2020), the authors investigate memory-based meta-learning (related to the concept of ‘learning to learn’). For their scenario, the key concept is an algorithm which can determine when to rely on a learning algorithm for generating behavior versus utilizing a heuristic. This enables agents to develop policies when it is feasible given the time/computational constraints and to rely on heuristics otherwise. Relating to human cognition and behavior, the authors note a similarity between the hidden activations of LSTM-based meta-learners and the recurrent activity of neurons in the prefrontal cortex. Generally speaking, allowing a split between fast/frugal mechanisms and slower/costly reasoning systems fits examples in human cognition and the notion of bounded rationality, and we discuss these related concepts in Part II.

On the topic of learning behavior policies, the authors note three features of particular interest regarding the dependence of the meta-learning algorithm on the meta-reinforcement learning problem:

  • Ecological uncertainty: How diverse is the range of tasks the agent could encounter?

  • Task complexity: How long does it take to learn the optimal strategy for the task at hand?

  • Expected lifetime: How much time can the agent spend on exploration and exploitation?

Based on their analysis, they showed that non-adaptive behaviors are optimal in two cases: when the variance across tasks in the ensemble is low, and when time constraints do not leave sufficient time for exploration.

For the first scenario investigated, they test their approach on a two-arm Gaussian bandit task, which is noted to allow an analytical characterization of optimality. The agent interacts with the environment by performing a series of T arm pulls. The first arm yields a deterministic reward of 0, while the second yields Gaussian rewards with variable mean \mu. The mean is sampled as \mu\sim\mathcal{N}(-1,\sigma^{2}_{p}) (where \sigma_{p} specifies the scale of ecological uncertainty) at the beginning of each episode and is then used to define the reward distribution r\sim\mathcal{N}(\mu,\sigma_{l}). The scale of \sigma_{p} determines how much uncertainty the agent should expect regarding the mean reward of the stochastic arm, and consequently how difficult it is to estimate the expected utility of that arm, while \sigma_{l} controls the consistency of the observed rewards and therefore how many pulls are needed to estimate the mean and how quickly the agent can learn the policy.

Given this definition, the optimal solution can be determined analytically from the problem characteristics. Since the agent should perform n exploratory pulls before exploiting its knowledge, the goal is to identify the optimal number of trials n^{*} before concluding the exploration. The solution is as follows:

(23) n^{*} =\operatorname*{argmax}_{n}\mathbb{E}\left[\sum_{t=1}^{T}r_{t}|n,T,\sigma_{l},\sigma_{p}\right]
=\operatorname*{argmax}_{n}\left[-n+\mathbb{E}_{\mu,r}\left[(T-n)\times\mu\times p(\hat{\mu}>0)\right]\right],

where \hat{\mu} is the estimated mean reward of the second arm after n exploration trials.
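For intuition, the sketch below estimates n^{*} by Monte Carlo simulation of the stated generative model (deterministic arm with reward 0, stochastic arm with mean \mu\sim\mathcal{N}(-1,\sigma^{2}_{p}) and reward noise \sigma_{l}). The parameter values and the simulation itself are illustrative and not taken from the paper.

# Monte Carlo sketch of the optimal exploration length n* in Equation 23.
# Explore the stochastic arm for n pulls, estimate its mean, then exploit
# whichever arm looks better for the remaining T - n pulls.
import numpy as np

def expected_return(n, T, sigma_l, sigma_p, n_sims=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    mu = rng.normal(-1.0, sigma_p, size=n_sims)            # episode-specific means
    explore_rewards = rng.normal(mu, sigma_l, size=(n, n_sims)).T if n > 0 else np.zeros((n_sims, 0))
    mu_hat = explore_rewards.mean(axis=1) if n > 0 else np.full(n_sims, -np.inf)
    # Expected per-step exploit reward: mu if the stochastic arm is chosen, else 0,
    # mirroring the (T - n) * mu * p(mu_hat > 0) term in expectation.
    exploit = np.where(mu_hat > 0.0, mu, 0.0)
    return np.mean(explore_rewards.sum(axis=1) + (T - n) * exploit)

T, sigma_l, sigma_p = 50, 1.0, 1.5
returns = [expected_return(n, T, sigma_l, sigma_p) for n in range(T + 1)]
print("estimated n* =", int(np.argmax(returns)))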

Based on their experiments, the authors noted two distinct types of behavior. In the first, learning via exploration is the optimal behavior. In the second, the agent should instead forgo learning and exploit its prior knowledge. The value of forgoing learning is attributed to two aspects: 1) small ecological uncertainty can make it very unlikely that the stochastic arm is better than the deterministic one; 2) when the variance is too high, the lifetime is too short for learning to pay off. As a result, the authors note two consequences. First, the values of \sigma_{l} and \sigma_{p} create a threshold between learning and non-learning behaviors; this determines whether learning is suitable or whether the agent should instead rely on exploiting its existing knowledge. Second, the optimal strategy depends on the time allotted for learning, since the amount of knowledge an agent can attain is strictly dependent on the variance parameters and the time it has to learn the dynamics of the arms. The results indicate the relationship between complexity and uncertainty with respect to the expected number of trials, and show a clear delineation between the two behaviors.

Given the above specification, the authors trained agents on the bandit scenario and subsequently compared them to the theoretically optimal solution. The outcome indicates that the meta-learning paradigm produces agents matching the analytical solution. They further note that such an agent exhibits the two components discussed above: one which learns a behavior policy, and one which applies the hard-coded choice of the deterministic arm (the optimal choice in expectation).

To test the proposed approach in a more general case, the authors then train and test an LSTM-based actor-critic agent in an ensemble of gridworld tasks. This scenario allowed them to investigate the impact of lifetime on the exploration strategies of the agents generated. Intuitively, in the case of a long lifetime, the agent has more opportunity to search for higher-valued goals where shorter lifetimes would necessitate greedier identification of goals. This was tested by placing goals of increasing value at farther distances from the start state.

As indicated by the results, the trained agent demonstrates a learned preference for goals based on their proximity and the lifetime T provided. The agent learns to prioritize farther goals when there is enough time to discover and return to the higher-valued goal, and to focus on goals closer to the start state when T is more restricted or when the remaining time dictates it. The agent exploits its policy to find the highest-value goal while there is time and then switches to lower-value goals when only their shorter path lengths remain feasible. This indicates the agent is able to exploit different experiential knowledge while also identifying when to switch between exploration and learned behaviors. Based on the two scenarios, the authors demonstrate that their method is able to learn when to switch between learning and exploiting learned models, which demonstrates ‘learning to learn’ and an ability to minimize the cost of behavior.

Additional Relevant Results

As described above, a key aspect to meta-reasoning and meta-learning is the cost of actions and deliberation. In (Lieder et al., 2017a; Lieder and Griffiths, 2017), we can see this concept investigated in the learning process. Additional aspects such as the use of reinforcement learning, memory, and additional theoretical analysis can be seen in (Mikulik et al., 2020; Xu et al., 2020; Oh et al., 2020; Zhen et al., 2020).

4.2. Theory of Mind

4.2.1. Relevant survey(s)

For relevant survey papers and related, please refer to (Bianco and Ognibene, 2019; Graziano, 2019).

4.2.2. Principles and Definitions

ToM is what gives humans the capacity to reason about the mental states, such as beliefs and desires, of other agents in their environment (Baker et al., 2011; Wang et al., 2021; Rabinowitz et al., 2018; Freire et al., 2019). In most, if not all, aspects of daily life, humans rely on and utilize their ability to estimate the mental state of others. One can imagine numerous scenarios encountered daily in which people interact with someone while relying on this type of reasoning. Something as simple as trying to anticipate on which side to pass another pedestrian requires this kind of reasoning. It is also easy to imagine a more complex scenario. For instance, poker players must reason about both the hand likelihoods and the mental state of their opponents to ensure a higher chance of success. In the case of poker, players can utilize methods of deception in an attempt to disguise their true mental state and gain an advantage. As a result, players need to reason about the mental states of their opponents to guide their playing strategy.

ToM relies on multiple aspects of perception. A person will utilize the non-verbal cues, contextual clues, and other stimuli to form a picture of the world from the perspective of another person. ToM is what allows you to imagine yourself in someone else’s place to model the likely next steps. Such a skill allows humans to perform better both as individuals and as part of a team. As with the walking and poker examples, one could also imagine numerous scenarios where people work together to accomplish a goal. People working together will of course utilize verbal and other forms of communication, but there is also a significant reliance on each member’s ability to use a deeper understanding of the observed behaviors of their collaborators. Without this ability, we could expect that every aspect of these interactions would require explicit and comprehensive communication regarding all aspects of the interaction.

The representation of others is not necessarily exact, but can instead utilize an approximate and higher-level model (Rabinowitz et al., 2018). Further, it can be argued that humans bias their models of others based on their own perspective. It is natural to expect a person’s past experiences to impact their model of another person’s perspective. Given a system to model the perspective of others, the natural extension is using this ability to generate recursive relationships between them (Wang et al., 2021). For instance, the ability to model the mental state of others can be extended to model the other person’s model of yourself. This means a person can generate a model of how they are perceived from another person’s perspective. As is noted by (Freire et al., 2019), there are several notable approaches to address the problem of ToM. These include methods which rely on RL, Neural Networks, policy reconstruction, etc.

4.2.3. Applications and Recent Results

An aspect of ToM is the ability to predict the behavior of others in order to act accordingly. The authors of (Wang et al., 2020c) demonstrate a use case in which an RL-based learner predicts the movements of another agent to support rendezvous in a multi-agent environment. This is accomplished by generating a model for motion prediction, which supports a Hierarchical Predictive Planning (HPP) module.

Formally, the authors define the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) \mathcal{M}=\langle n,\mathcal{S},\mathcal{O},\mathcal{A}_{1,\dots,n},T,\mathcal{R},\gamma\rangle where n is the number of agents. \mathcal{S} denotes the set of states and \mathcal{O}=[\mathcal{O}_{1},\dots,\mathcal{O}_{n}] is the joint observation space, where \mathcal{O}_{i} is agent i’s observations. The joint action space \mathcal{A}_{1,\dots,n} is the combination of the individual action spaces \mathcal{A}_{i} for agents i\in\{1,\dots,n\}, which define the actions available to the agents. The transitions are defined by the function T:\mathcal{S}\times\mathcal{A}_{1,\dots,n}\times\mathcal{S}\rightarrow[0,1], which models the probability of transitioning between states given a joint action a_{1,\dots,n}. \mathcal{R} is the reward function and maps states and actions to a reward r\in\mathbb{R}. As in RL, \gamma is the discount factor for the learning process. The assumption in Dec-POMDPs is that agents receive noisy observations of the environment, so the true state s_{i} is unknown and is instead represented by the observation O_{i}. Consequently, the behavior relies on the estimated current state to identify a desirable action.

Agents are provided observations \mathcal{O}_{i}=[\mathbf{p}_{i},\mathbf{p}_{-i},\mathbf{o},\mathbf{g}] where \mathbf{p}_{i} and \mathbf{p}_{-i} refer to the agent positions, \mathbf{o} are the sensor observations, and \mathbf{g} is the agent’s goal. The agents learn a predictive model of motion via a self-supervision algorithm. The systems are tasked with learning two models: self-prediction (\mathbf{f}_{i}) and other-prediction (\mathbf{f}_{-i}). In this context, \mathbf{f}_{i} and \mathbf{f}_{-i} refer to the self and other dynamics models respectively. Note that these models are the key point where ToM is used in this example, as they enable modeling and prediction of others. These models are learned to generate a self-prediction of position \Delta\mathbf{p}^{t+1}_{i} and observation \Delta\mathbf{o}^{t+1}_{i}, using a time window of past positions which covers the previous h steps:

(24) (\Delta\mathbf{p}^{t+1}_{i},\Delta\mathbf{o}^{t+1}_{i})=\mathbf{f}_{i}(\mathbf{p}^{t-h:t}_{i},\mathbf{o}^{t-h:t}_{i},\mathbf{g})

Similarly, a model for other-prediction (i.e. other agent prediction) is defined as

(25) (\Delta\mathbf{p}^{t+1}_{-i},\Delta\mathbf{o}^{t+1}_{i})=\mathbf{f}_{-i}(\mathbf{p}^{t-h:t}_{-i},\mathbf{o}^{t-h:t}_{i},\mathbf{g})

The authors note that the models do not depend on actions, but instead utilize observations of positions and poses conditioned on goals, which avoids the need to know the action space of other agents.
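To illustrate the interface of these prediction models, the sketch below uses a simple least-squares regressor as a stand-in for the learned self- and other-prediction models (the original work learns richer models via self-supervision and also predicts observation deltas); the class and its feature construction are assumptions for illustration only.

# Sketch of the interface of the prediction models in Eqs. 24-25, with a
# least-squares regressor standing in for the learned model.
import numpy as np

class MotionPredictor:
    """Maps (position history over h steps, goal) to a predicted position delta."""
    def __init__(self, h, dim=2):
        self.h, self.dim = h, dim
        self.W = None

    def _features(self, pos_hist, goal):
        # Flatten the last h positions and append the goal plus a bias term.
        return np.concatenate([np.ravel(pos_hist[-self.h:]), goal, [1.0]])

    def fit(self, histories, goals, deltas):
        X = np.stack([self._features(p, g) for p, g in zip(histories, goals)])
        Y = np.stack(deltas)
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def predict(self, pos_hist, goal):
        return self._features(pos_hist, goal) @ self.W   # predicted delta p^{t+1}

# One instance would model the agent's own motion (f_i) and another the observed
# partner (f_-i), so both can be rolled forward under a hypothesized shared goal.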

Given the predictive models \mathbf{f}_{i} and \mathbf{f}_{-i}, a decentralized policy \Pi_{i} is generated. This is done via the cross-entropy method (CEM), which converts goal evaluations into belief updates over potential rendezvous points. The authors note the intuition behind this approach is that each agent simulates a centralized agent that fixes the goal of all agents, and these fixed goals are used to condition the motion predicted by \mathbf{f}_{i} and \mathbf{f}_{-i}. These predictions are rolled out T time steps into the future to predict the poses of the agents. Based on the rollouts, the predicted goals are scored according to

(26) \mathcal{R}(\mathbf{p}_{1,\dots,n})=\begin{cases}0,&|\mathbf{p}_{j}-\mathbf{p}_{\mu}|<d,\,\forall j\in 1,\dots,n\\ \sum_{j,k\neq j}-|\mathbf{p}_{k}-\mathbf{p}_{j}|,&\textrm{otherwise}\end{cases}

where \mathbf{p}_{\mu}=\frac{1}{n}\sum_{k\in 1,\dots,n}\mathbf{p}_{k} and d is a precision parameter. This emphasizes accuracy of the predicted future states and smaller distances between agents at rendezvous points. Based on the predicted goal values, goal states are sampled via a normal distribution and the estimation process continues in order to favor goals which bring the agents closer together. This process generates the policy \Pi_{i}, which is used to complete the rendezvous without centralized control. With the above approach, the authors demonstrate a method which utilizes the first level of ToM reasoning to predict likely agent behavior based on past observations and the current circumstances.
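A direct transcription of the scoring rule in Equation 26 is sketched below; the agent positions and the precision parameter d are illustrative, and the rollouts producing the predicted positions are assumed to have been computed already.

# Sketch of Equation 26: rollouts that end with all agents within distance d of
# their mean position score 0 (best); otherwise the score is the negative sum of
# pairwise distances over ordered pairs, as in the equation.
import numpy as np

def score_rendezvous(predicted_positions, d=0.5):
    """predicted_positions: array of shape (n_agents, 2) at the end of a rollout."""
    p = np.asarray(predicted_positions, dtype=float)
    centroid = p.mean(axis=0)
    if np.all(np.linalg.norm(p - centroid, axis=1) < d):
        return 0.0
    diffs = p[:, None, :] - p[None, :, :]                 # pairwise displacements
    return -np.sum(np.linalg.norm(diffs, axis=-1))        # diagonal terms contribute 0

print(score_rendezvous([[0.0, 0.0], [0.1, 0.1], [0.05, 0.0]]))   # 0.0 (agents converged)
print(score_rendezvous([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]))    # -24.0 (spread apart)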

Results

The authors demonstrate their algorithm’s capabilities in simulated and real-world environments. The simulated environments range in complexity, starting with no obstacles and progressing to environments with multiple obstacles. Similarly, the physical environments vary in obstacle complexity, type, and layout. The approach was tested against learned, decentralized, planning-based, and centralized baselines. For the learned baseline, the authors used the popular MADDPG algorithm, while planning was performed using the RRT planning system. For the centralized system, the midpoint, the other agent’s position, and a random point are used. As is clear from the results, the proposed approach performs strongly against the baselines, demonstrating that the agents are able to perform the rendezvous without the need for centralized control. In the case of the real-world environments, we can see a similar effectiveness of the proposed approach, indicating that the approach is able to translate from simulated environments to the real world.

Additional Relevant Results

Similar to the example provided above, there are further cases in which it is desirable to understand and predict the behavior of other agents in a multi-agent setting. In (Shum et al., 2019), the authors utilize composable team hierarchies to generate policies in multi-agent settings based on the behavior patterns of other agents or groups. Similarly, (Köpf et al., 2020) demonstrates simulating the policies of others based on past experiences. Additionally, (Dissing and Bolander, 2020) demonstrates another use of ToM in the context of robotics. Further examples of behavior conditioned on the behavior or goals of others can be seen in (Morveli-Espinoza et al., 2021; Freire et al., 2019; Chandra et al., 2020; Oguntola et al., 2021). For example, (Chandra et al., 2020) demonstrates the use of ToM to predict driver behavior patterns. Additionally, (Lee et al., 2018) demonstrates a question-answer system which models the other agent’s internal state in order to select behavior that improves information gain based on the estimated model.

4.3. Simulating Human Knowledge of World for Learners

4.3.1. Relevant survey(s)

For relevant survey papers and related, please refer to (Ullman and Tenenbaum, 2020).

4.3.2. Principles and Definitions

The topics in this section do not represent a comprehensive list, but instead serve to illustrate a type of learning which relies on a level of world comprehension to accomplish the learning task. In this case, learning agents are provided models which allow them to understand a feature of the environment instead of requiring the agent to learn these features as well. For example, humans demonstrate a capacity for modeling the fundamental characteristics of an environment such as physics, compositional structure, etc. (e.g. solidity (Ullman and Tenenbaum, 2020)). With such an understanding, the components of a scenario can be considered when learning or utilizing a skill. Such a behavior is demonstrated in multiple aspects of daily life. Something as simple as understanding that gravity allows a person to pour water into a glass is something we take for granted. Comprehension of these aspects of the world enables a rich set of behaviors. For artificial systems, such a comprehension is often not provided or demonstrated. To overcome this, researchers have investigated methods in which physical, compositional, hierarchical, or similar structure is provided as a model for simulation or planning.

Physics Models

Following the assumption that humans possess an internal model or understanding of aspects of physics (e.g. pushing an object over the edge of a table will cause it to fall), digital systems can be built to replicate these behaviors for modeling and simulation. Engines such as Unity or MuJoCo (Todorov et al., 2012) simulate the motion of objects in physical environments and can provide realistic object motion for the observations made by artificial agents. These models of system behavior can provide an agent with an internal estimator of world dynamics to support prediction and planning (Allen et al., 2020; Ota et al., 2020, 2021). This lets a learner generate predicted future states for reasoning and planning.
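As a toy illustration of using an internal dynamics model for planning (a stand-in for a full engine such as MuJoCo or Unity), the sketch below forward-simulates a simple projectile model for candidate actions and selects the action whose predicted outcome lands closest to a target; all dynamics and parameter values are assumptions for the example.

# Toy internal physics model used for planning: simulate candidate actions and
# pick the one whose predicted landing point is closest to the target.
import numpy as np

def simulate_throw(speed, angle_rad, dt=0.01, g=9.81):
    """Forward-simulate a projectile until it returns to the ground; return landing x."""
    x, y = 0.0, 0.0
    vx, vy = speed * np.cos(angle_rad), speed * np.sin(angle_rad)
    while True:
        x, y, vy = x + vx * dt, y + vy * dt, vy - g * dt
        if y <= 0.0 and vy < 0.0:
            return x

def plan_throw(target_x, speeds=np.linspace(1, 15, 30), angles=np.radians(np.arange(10, 80, 5))):
    """Search candidate actions with the internal model and return the best (speed, angle)."""
    candidates = [(s, a) for s in speeds for a in angles]
    errors = [abs(simulate_throw(s, a) - target_x) for s, a in candidates]
    return candidates[int(np.argmin(errors))]

print(plan_throw(target_x=5.0))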

Planners

Similar to a physics engine, the components and dynamics of a system can be modeled by describing the objects in the environment and how they can interact, be utilized, or modified. This allows for constraints to be specified for states which must be met in order for an action to be available for execution. The combination of the environment composition as well as the available actions allows for planning. Planning is utilized in order to convert the system state from the initial configuration to a goal state through a sequence of actions. A commonly used method for representation is via Planning Domain Definition Language (PDDL) (Aeronautiques et al., 1998). An environment specified in PDDL can then be analyzed by a solution system to identify a desirable sequence of actions for the specified goal. These solutions are generated to utilize the world model and the defined constraints. This enables identification of viable plans without the need for a learning process such as RL. On the other hand, policy generation via RL with planning-based trajectories can be performed (Zhi-Xuan et al., 2020).
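To make the precondition/effect style of world model concrete, the following is a toy STRIPS-like forward-search sketch in the spirit of what PDDL formalizes (it is not a PDDL parser or solver); the action names and facts describe a hypothetical two-block fragment.

# Minimal STRIPS-style forward-search planner sketch. States are frozensets of
# predicate facts; actions list preconditions, added facts, and deleted facts.
from collections import deque

ACTIONS = {
    "stack_A_on_B": {
        "pre": {"clear_A", "clear_B", "on_table_A"},
        "add": {"on_A_B"},
        "delete": {"clear_B", "on_table_A"},
    },
    "unstack_A_from_B": {
        "pre": {"on_A_B", "clear_A"},
        "add": {"clear_B", "on_table_A"},
        "delete": {"on_A_B"},
    },
}

def plan(initial, goal):
    """Breadth-first search over states; returns a list of action names or None."""
    frontier = deque([(frozenset(initial), [])])
    visited = {frozenset(initial)}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:                                  # all goal facts satisfied
            return path
        for name, act in ACTIONS.items():
            if act["pre"] <= state:                        # preconditions hold
                nxt = frozenset((state - act["delete"]) | act["add"])
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [name]))
    return None

print(plan({"clear_A", "clear_B", "on_table_A"}, {"on_A_B"}))   # ['stack_A_on_B']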

4.3.3. Applications and Recent Results

To demonstrate the use of a modeling system to enable deeper understanding, we will discuss (Zhi-Xuan et al., 2020), which integrates PDDL into the model for planning based on different likely goals or trajectories. The inclusion of a planning system was intended as a method by which the system could account for sub-optimal or failed plans and incorporate them into estimates of future outcomes. Further, this approach was desired as a method which could account for the difficulty of the planning phase itself. The authors noted that many methods attempting to estimate goals fail to consider the difficulty of the planning portion of the process. Additionally, they note that most methods require the assumption of optimal behavior or Boltzmann-rational action noise. The assumption of optimality would therefore require computing optimal plans for all goals in advance, which is intractable. In their approach, the authors use the integration of a planning system to represent a boundedly rational agent, where the bounds provide a resource limitation on planning and plan execution. This model provides a mechanism for Bayesian inference of plans/goals, even for plans which are sub-optimal, require backtracking, or fail irreversibly. The limitation constrains the time or resources available to make a decision, which at times forces the agent to generate only a partial plan up to the level afforded by the constraints.

As noted above, the proposed approach represents the states, observations, and goals using PDDL and a variant supporting stochastic transitions named Probabilistic PDDL. This is accomplished by representing states and goals via predicate-based facts, relations, and numeric expressions in PDDL format. This allows for modeling of the world state and of the actions, which become available when the provided preconditions are satisfied (e.g. current state, tool availability, etc.). The combination of predicates (e.g. relations such as ‘on’ or ‘at’) and fluents (e.g. fuel-level) allows the planning system to identify the system state and available resources in order to generate a chain of actions and outcomes leading to a goal state. The actions result in changes to predicates and fluents, which signify the transitions between world and system states. As a simple example, stacking blocks can be planned while considering the stacking agent’s state as well as the current positions of the blocks. A block which is covered by another would not be available for stacking, so the corresponding precondition would not be met for this block; uncovered or top-most blocks would be available, so one of them could be selected for movement. With this representation, the observer has a prior over goals P(g) specified via a probabilistic program over PDDL goal specifications. It is also noted that observation noise can be modeled for both the Boolean predicates and the numeric fluents. This is accomplished by flipping predicate values or adding continuous noise with some probability (e.g. flipping a block’s covered flag when it is covered by another block).
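A small sketch of such an observation-noise model is given below; the flip probability, the fluent noise scale, and the predicate and fluent names are illustrative assumptions.

# Sketch of predicate/fluent observation noise: flip each Boolean predicate with
# some probability and perturb numeric fluents with Gaussian noise.
import random

def noisy_observation(predicates, fluents, flip_prob=0.05, fluent_sigma=0.1, rng=random):
    obs_preds = {name: (not value if rng.random() < flip_prob else value)
                 for name, value in predicates.items()}
    obs_fluents = {name: value + rng.gauss(0.0, fluent_sigma)
                   for name, value in fluents.items()}
    return obs_preds, obs_fluents

print(noisy_observation({"on_A_B": True, "clear_A": False}, {"fuel_level": 3.0}))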

To represent bounded rationality, the authors define a budget

(27) \eta\sim\textrm{NEGATIVE-BINOMIAL}(r,q)

where r denotes the maximum failure count and q denotes the continuation probability. Therefore, \eta sets an upper bound on the solution search. If the bound is reached, the agent executes a partial plan leading to the most suitable state reachable from the plans found so far. This allows the agent to find the best plan it can given the limited resources available. The authors note that this model supports any planner capable of producing partial plans. For their scenario, the authors operate in a gridworld environment and utilize a variant of the A^{*} algorithm (Hart et al., 1968) that makes the search stochastic. They note the modification to the A^{*} algorithm relates to the state-successor component and changes how the best next states are weighted. If an agent reaches an unexpected state or the end of its plan, then a new plan is generated with a newly sampled budget \eta.
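The sketch below illustrates the overall resource-bounded planning loop described here: a search budget is drawn from a negative binomial, nodes are expanded until the budget runs out, and the best partial plan found is returned. It is a generic best-first search, not the authors' stochastic A^{*} variant, and the mapping of (r, q) onto NumPy's negative-binomial parameters is an assumption; expand and heuristic are assumed callables.

# Resource-bounded planning sketch: sample a search budget, expand at most that
# many nodes, and fall back to the best partial plan if no full plan is found.
import heapq
import numpy as np

def bounded_plan(start, goal_test, expand, heuristic, r=2, q=0.95, rng=None):
    rng = rng or np.random.default_rng(0)
    # Assumed parameter mapping: continuation probability q -> per-trial stop probability 1 - q.
    budget = max(int(rng.negative_binomial(r, 1.0 - q)), 1)
    tie = 0                                               # tie-breaker so states are never compared
    frontier = [(heuristic(start), tie, start, [])]
    best_h, best_path = heuristic(start), []
    for _ in range(budget):
        if not frontier:
            break
        h, _, state, path = heapq.heappop(frontier)
        if h < best_h:
            best_h, best_path = h, path
        if goal_test(state):
            return path                                   # full plan found within the budget
        for action, nxt in expand(state):
            tie += 1
            heapq.heappush(frontier, (heuristic(nxt), tie, nxt, path + [action]))
    return best_path                                      # partial plan toward the most promising state

# Tiny demo: navigate a 5x5 grid from (0, 0) to (4, 4) using a Manhattan heuristic.
def expand(state):
    x, y = state
    return [((dx, dy), (x + dx, y + dy)) for dx, dy in [(1, 0), (0, 1), (-1, 0), (0, -1)]
            if 0 <= x + dx < 5 and 0 <= y + dy < 5]

def heuristic(state):
    return abs(4 - state[0]) + abs(4 - state[1])

print(bounded_plan((0, 0), lambda s: s == (4, 4), expand, heuristic))

Depending on the sampled budget, the demo may return only a partial plan toward the goal, which mirrors the bounded-rationality behavior described above.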

To test the proposed Sequential Inverse Plan Search (SIPS) approach, the authors utilize several environments with goal set G and state space S:

  • Taxi (|G|=3, |S|=125): A taxi has to transport a passenger from one location to another in a gridworld

  • Doors, Keys, & Gems (|G|=3, |S|\sim 10^{5}): An agent must navigate a maze with doors, keys, and gems

  • Block Words (|G|=5, |S|\sim 10^{5}): Goals correspond to block towers that spell one of a set of five English words

  • Intrusion Detection (|G|=20, |S|\sim 10^{30}): An agent might perform a variety of attacks on a set of servers (20 possible goals corresponding to sets of attacks on up to 10 servers)

They test their approach against a Bayesian IRL (BIRL) model generated by value iteration. This allows them to compare against an approach based on RL solutions.

As a demonstration of effectiveness, the performance of SIPS and BIRL was compared to human performance in goal prediction. The results indicate a strong match between the SIPS model and the demonstrated human performance. These results were generated by collecting human goal inferences from N=8 subjects on ten trajectories, six of which were sub-optimal or failed. Human inferences were collected every six timesteps.

For evaluation of accuracy and speed, the authors tested with a dataset of optimal and non-optimal trajectories. The optimal trajectories were obtained via the A^{*} algorithm, and the non-optimal trajectories were generated using a replanning agent model with r=2, q=0.95, \gamma=0.1. They performed inference on these datasets with a uniform prior over goals. Based on their tests, they found good performance with 10 particles per goal, without the use of rejuvenation moves.

In both test cases, the proposed method demonstrates strong performance. Additionally, the performance versus computational cost is quite strong in the majority of cases in comparison with the baseline BIRL.

Additional Relevant Results

Similar to the above example, the provision of a model or planner has been investigated in further settings. For example, (Pentecost et al., 2016) demonstrates the combination of ACT-R and a physics engine to enhance the predictive power of the cognitive model to better replicate human behavior. Additionally, (Allen et al., 2020; Ota et al., 2020, 2021) demonstrate integrating a physics model into the training of a learned behavior model. In (Ota et al., 2020, 2021), this allows training of the behavior in a simulated environment followed by a transition to the real world: the agent can learn an approximate solution based on the simulator and then learn a translation which maps the simulation to the real world. In (Allen et al., 2020), the authors demonstrate a learner using a physics engine to learn how to manipulate tools and objects to achieve a sequence of actions to accomplish a task. This enables motion prediction in order to determine the best sequence based on likely trajectories.

5. Conclusion

In this paper, we have discussed methods focusing on multiple aspects of human behavior and cognition, in addition to how humans interact with artificial systems. The types and end uses of these interactions vary, but a key concept is the ability to learn from or adapt to humans. The pervasive diffusion of AI systems provides exceptional opportunities to build next-generation autonomous and adaptive systems in several application areas (which we have briefly described in Section 2). To take full advantage of this opportunity, it is of paramount importance that AI systems “learn to know and anticipate” the behavior of the involved users. This is a cornerstone of Human-Centric AI systems for autonomous and adaptive behaviors, which require embedding practical models of human behavior that can be used both to interpret users’ actions and to anticipate their reactions to (certain or likely) stimuli. Humans use multiple advanced skills daily to navigate the world and their interactions with others. Researchers have investigated several topics such as Theory of Mind, Inverse Reinforcement Learning, Active Learning, and more in an attempt to capture some of these capabilities and imbue systems with more advanced abilities. This enables systems to better understand and operate in the world, including cases in which there are humans to account for and support, or when systems are expected to operate autonomously. Further, it enables systems to perform tasks more skillfully by replicating the level of skill demonstrated by humans. In addition to learning from humans, these techniques often serve to enable richer and more intuitive interactions between humans and artificial systems. This supports both the interactions themselves and the system’s ability to account for the human and potentially gain further knowledge as a result of the interaction. In this way, systems can account for the cyclical nature of these relationships in order to improve performance and learn.

In the case of Section 3 (Learning Human Behaviors by Experience and Feedback), the presented techniques serve as useful approaches for learning models of behavior when a learner has access to an environment and a method for receiving feedback. The feedback enables the learner to explore the environment and learn which behaviors prove most useful or desirable. These methods extend from discrete to continuous cases and to varying levels of human demonstration or input. A possible downside to these approaches is that they tend to be sample inefficient, often requiring large numbers of training samples. On the other hand, they are often demonstrated as viable techniques for training autonomous systems to operate in and adapt to given environments. In the case of these learning methods (e.g. Section 3.1-Section 3.4), agents demonstrate the ability to translate experiences or demonstrated behavior into a performant policy of behavior. Often, these learned models are capable of meeting or exceeding the performance of the human counterpart. On the other hand, many of these techniques rely on exploration of the state-action space, which can be costly. Methods such as those discussed in Section 3.4 attempt to improve the efficiency of this exploration and the utility of the feedback or demonstrations utilized. In the case of Section 3.2 and Section 3.3, the learner attempts to convert observations of demonstrated behavior into a policy, enabling learning from examples. In general, the topics discussed illustrate powerful techniques for learning behaviors, but also show that care is needed to ensure accurate outcomes and sufficient resources for training or learning.

Moreover, in Section 4 (Belief and Reasoning Approaches), we discussed approaches which reason about or account for the beliefs and biases of humans. In this context, the presented approaches learn or use models of belief. When considering the limitations on the resources available for reasoning (e.g. Section 4.1), agents can learn a belief regarding the current task and which behaviors are most suitable. This belief can guide the agent with respect to how and when to dedicate effort. Such a belief can also concern the mental states of others, which enables a model that attempts to account for the unspoken aspects of human interaction; this can prove essential for autonomous systems operating in the same environment as human users or collaborators. These models, as seen in Section 4.2, demonstrate impressive levels of reasoning. However, similar to what has been discussed above, these methods can require extensive training, which is costly. Further, these methods often rely on the inclusion of domain knowledge or must cope with noisy observations, which can impact the level of performance or the ability to generalize to more realistic settings. We also discussed scenarios which allow learners to utilize world models to reason at a higher level, removing the need to learn a model of the world while simultaneously learning to use it. Similar to the previous topic, this requires the development of such models, which can be costly and challenging: the generation of a model or modeling system can be highly effective, but again requires the inclusion of domain knowledge or costly model generation. For instance, the use of a planning system/language such as PDDL (see Section 4.3) typically relies on explicit definitions of resources, actions, goals, and constraints in order to allow solution generation. This is not necessarily unreasonable, but it is a worthwhile consideration when determining the level of prior knowledge desired for an approach. As can be seen in many examples in the literature, this inclusion of domain knowledge affords an effective solution for generating plans of action for a system to achieve a desired outcome.

Overall, we think that the approaches and goals presented motivate the topics discussed in this paper and serve to promote further investigation into how humans and artificial systems interact and learn from each other.

6. Acknowledgments

This work was supported by the H2020 Humane-AI-Net project (grant #952026) and by the CHIST-ERA grant CHIST-ERA-19-XAI-010, by MUR (grant No. not yet available), FWF (grant No. I 5205), EPSRC (grant No. EP/V055712/1), NCN (grant No. 2020/02/Y/ST6/00064), ETAg (grant No. SLTAT21096), BNSF (grant No. KP-06-DOO2/5).

References

  • Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning. 1.
  • Ackerman and Thompson (2017) Rakefet Ackerman and Valerie A Thompson. 2017. Meta-reasoning: Monitoring and control of thinking and reasoning. Trends in Cognitive Sciences 21, 8 (2017), 607–617.
  • Aeronautiques et al. (1998) Constructions Aeronautiques, Adele Howe, Craig Knoblock, ISI Drew McDermott, Ashwin Ram, Manuela Veloso, Daniel Weld, David Wilkins SRI, Anthony Barrett, Dave Christianson, et al. 1998. PDDL - The Planning Domain Definition Language. Technical Report (1998).
  • Akbulut et al. (2021) Mete Akbulut, Erhan Oztop, Muhammet Yunus Seker, X Hh, Ahmet Tekden, and Emre Ugur. 2021. Acnmp: Skill transfer and task extrapolation through learning from demonstration and reinforcement learning via representation sharing. In Conference on Robot Learning. PMLR, 1896–1907.
  • Allen et al. (2020) Kelsey R Allen, Kevin A Smith, and Joshua B Tenenbaum. 2020. Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. Proceedings of the National Academy of Sciences 117, 47 (2020), 29302–29310.
  • Arora and Doshi (2021) Saurabh Arora and Prashant Doshi. 2021. A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence (2021), 103500.
  • Arulkumaran et al. (2017) Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. 2017. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine 34, 6 (2017), 26–38.
  • Asgari and Beauregard (2021) Alireza Asgari and Yvan Beauregard. 2021. Brain-Inspired Model for Decision-Making in the Selection of Beneficial Information Among Signals Received by an Unpredictable Information-Development Environment. engrXiv (2021). https://engrxiv.org/preprint/view/1525
  • Baker et al. (2011) Chris Baker, Rebecca Saxe, and Joshua Tenenbaum. 2011. Bayesian theory of mind: Modeling joint belief-desire attribution. In Proceedings of the annual meeting of the cognitive science society, Vol. 33.
  • Balakrishnan et al. (2020) Sreejith Balakrishnan, Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Harold Soh. 2020. Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization. Advances in Neural Information Processing Systems 33 (2020).
  • Bansal et al. (2019) Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S Lasecki, Daniel S Weld, and Eric Horvitz. 2019. Beyond accuracy: The role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7. 2–11.
  • Baron et al. (1994) Thierry Baron, Martin D Levine, and Yehezkel Yeshurun. 1994. Exploring with a foveated robot eye system. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 2-Conference B: Computer Vision & Image Processing.(Cat. No. 94CH3440-5). IEEE, 377–380.
  • Bhattacharyya et al. (2020) Raunak Bhattacharyya, Blake Wulfe, Derek Phillips, Alex Kuefler, Jeremy Morton, Ransalu Senanayake, and Mykel Kochenderfer. 2020. Modeling human driving behavior through generative adversarial imitation learning. arXiv preprint arXiv:2006.06412 (2020).
  • Bianco and Ognibene (2019) Francesca Bianco and Dimitri Ognibene. 2019. Functional advantages of an adaptive Theory of Mind for robotics: a review of current architectures. In 2019 11th Computer Science and Electronic Engineering (CEEC). IEEE, 139–143.
  • Bontchev et al. (2021) Boyan Paskalev Bontchev, Valentina Terzieva, and Elena Paunova-Hubenova. 2021. Personalization of serious games for learning. Interactive Technology and Smart Education (2021).
  • Borovikov et al. (2019) Igor Borovikov, Jesse Harder, Michael Sadovsky, and Ahmad Beirami. 2019. Towards interactive training of non-player characters in video games. arXiv preprint arXiv:1906.00535 (2019).
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:arXiv:1606.01540
  • Chandra et al. (2020) Rohan Chandra, Aniket Bera, and Dinesh Manocha. 2020. Stylepredict: Machine theory of mind for human driver behavior from trajectories. arXiv preprint arXiv:2011.04816 (2020).
  • Chen et al. (2020) Xinyue Chen, Zijian Zhou, Zheng Wang, Che Wang, Yanqiu Wu, and Keith Ross. 2020. BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning. Advances in Neural Information Processing Systems 33 (2020).
  • Choi et al. (2019) Chiho Choi, Srikanth Malla, Abhishek Patil, and Joon Hee Choi. 2019. DROGON: A trajectory prediction model based on intention-conditioned behavior reasoning. arXiv preprint arXiv:1908.00024 (2019).
  • Cichocki and Kuleshov (2021) Andrzej Cichocki and Alexander P Kuleshov. 2021. Future Trends for Human-AI Collaboration: A Comprehensive Taxonomy of AI/AGI Using Multiple Intelligences and Learning Styles. Computational Intelligence and Neuroscience 2021 (2021).
  • Cockcroft et al. (2020) Matthew Cockcroft, Shahil Mawjee, Steven James, and Pravesh Ranchod. 2020. Learning options from demonstration using skill segmentation. In 2020 International SAUPEC/RobMech/PRASA Conference. IEEE, 1–6.
  • Costantini (2002) Stefania Costantini. 2002. Meta-reasoning: a survey. In Computational Logic: Logic Programming and Beyond. Springer, 253–288.
  • Das et al. (2020) Neha Das, Sarah Bechtle, Todor Davchev, Dinesh Jayaraman, Akshara Rai, and Franziska Meier. 2020. Model-Based Inverse Reinforcement Learning from Visual Demonstrations. arXiv preprint arXiv:2010.09034 (2020).
  • de Almeida Rocha and Duarte (2019) Daniel de Almeida Rocha and Julio Cesar Duarte. 2019. Simulating human behaviour in games using machine learning. In 2019 18th Brazilian Symposium on Computer Games and Digital Entertainment (SBGames). IEEE, 163–172.
  • Dellermann et al. (2021) Dominik Dellermann, Adrian Calma, Nikolaus Lipusch, Thorsten Weber, Sascha Weigel, and Philipp Ebel. 2021. The future of human-AI collaboration: a taxonomy of design knowledge for hybrid intelligence systems. arXiv preprint arXiv:2105.03354 (2021).
  • Dissing and Bolander (2020) Lasse Dissing and Thomas Bolander. 2020. Implementing Theory of Mind on a Robot Using Dynamic Epistemic Logic.. In IJCAI. 1615–1621.
  • Džeroski et al. (2001) Sašo Džeroski, Luc De Raedt, and Kurt Driessens. 2001. Relational reinforcement learning. Machine learning 43, 1 (2001), 7–52.
  • Eiband et al. (2021) Malin Eiband, Daniel Buschek, and Heinrich Hussmann. 2021. How to support users in understanding intelligent systems? Structuring the discussion. In 26th International Conference on Intelligent User Interfaces. 120–132.
  • Eysenbach et al. (2020) Benjamin Eysenbach, Xinyang Geng, Sergey Levine, and Ruslan Salakhutdinov. 2020. Rewriting history with inverse rl: Hindsight inference for policy improvement. arXiv preprint arXiv:2002.11089 (2020).
  • Fernando et al. (2020) Tharindu Fernando, Simon Denman, Sridha Sridharan, and Clinton Fookes. 2020. Deep inverse reinforcement learning for behavior prediction in autonomous driving: Accurate forecasts of vehicle motion. IEEE Signal Processing Magazine 38, 1 (2020), 87–96.
  • Forbes and Andre (2002) Jeffrey Forbes and David Andre. 2002. Representations for learning control policies. In The University of New South. Citeseer.
  • Fortunato et al. (2019) Meire Fortunato, Melissa Tan, Ryan Faulkner, Steven Hansen, Adrià Puigdomènech Badia, Gavin Buttimore, Charlie Deck, Joel Z Leibo, and Charles Blundell. 2019. Generalization of reinforcement learners with working and episodic memory. arXiv preprint arXiv:1910.13406 (2019).
  • Freire et al. (2019) Ismael T Freire, Xerxes D Arsiwalla, Jordi-Ysard Puigbò, and Paul Verschure. 2019. Modeling theory of mind in multi-agent games using adaptive feedback control. arXiv preprint arXiv:1905.13225 (2019).
  • Fuchs et al. (2022) A. Fuchs, A. Passarella, and M. Conti. 2022. Modeling Human Behavior Part II - Cognitive approaches and Uncertainty. arXiv preprint arXiv:XXX (2022).
  • Gao et al. (2021) Ruijiang Gao, Maytal Saar-Tsechansky, Maria De-Arteaga, Ligong Han, Min Kyung Lee, and Matthew Lease. 2021. Human-AI Collaboration with Bandit Feedback. arXiv preprint arXiv:2105.10614 (2021).
  • Gershman et al. (2015) Samuel J Gershman, Eric J Horvitz, and Joshua B Tenenbaum. 2015. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 349, 6245 (2015), 273–278.
  • Graziano (2019) Michael SA Graziano. 2019. Attributing awareness to others: the attention schema theory and its relationship to behavioural prediction. Journal of Consciousness Studies 26, 3-4 (2019), 17–37.
  • Griffiths et al. (2019) Thomas L Griffiths, Frederick Callaway, Michael B Chang, Erin Grant, Paul M Krueger, and Falk Lieder. 2019. Doing more with less: meta-reasoning and meta-learning in humans and machines. Current Opinion in Behavioral Sciences 29 (2019), 24–30.
  • Gruver et al. (2020) Nate Gruver, Jiaming Song, Mykel J Kochenderfer, and Stefano Ermon. 2020. Multi-agent adversarial inverse reinforcement learning with latent variables. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems. 1855–1857.
  • Gurcan et al. (2021) Fatih Gurcan, Nergiz Ercil Cagiltay, and Kursat Cagiltay. 2021. Mapping human–computer interaction research themes and trends from its existence to today: A topic modeling-based review of past 60 years. International Journal of Human–Computer Interaction 37, 3 (2021), 267–280.
  • Hart et al. (1968) Peter E Hart, Nils J Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics 4, 2 (1968), 100–107.
  • Holzinger et al. (2019) Andreas Holzinger, Markus Plass, Michael Kickmeier-Rust, Katharina Holzinger, Gloria Cerasela Crişan, Camelia-M Pintea, and Vasile Palade. 2019. Interactive machine learning: experimental evidence for the human in the algorithmic loop. Applied Intelligence 49, 7 (2019), 2401–2414.
  • Hospedales et al. (2020) Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. 2020. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439 (2020).
  • Hung et al. (2019) Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. 2019. Optimizing agent behavior over long time scales by transporting value. Nature communications 10, 1 (2019), 1–12.
  • Hussein et al. (2017) Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. 2017. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR) 50, 2 (2017), 1–35.
  • Jackson et al. (2017) Joshua Conrad Jackson, David Rand, Kevin Lewis, Michael I Norton, and Kurt Gray. 2017. Agent-based modeling: A guide for social psychologists. Social Psychological and Personality Science 8, 4 (2017), 387–395.
  • Julian et al. (2020) Ryan Julian, Benjamin Swanson, Gaurav S Sukhatme, Sergey Levine, Chelsea Finn, and Karol Hausman. 2020. Never stop learning: The effectiveness of fine-tuning in robotic reinforcement learning. arXiv preprint arXiv:2004.10190 (2020).
  • Kaluarachchi et al. (2021) Tharindu Kaluarachchi, Andrew Reis, and Suranga Nanayakkara. 2021. A Review of Recent Deep Learning Approaches in Human-Centered Machine Learning. Sensors 21, 7 (2021), 2514.
  • Kalweit et al. (2020) Gabriel Kalweit, Maria Huegle, Moritz Werling, and Joschka Boedecker. 2020. Deep Inverse Q-learning with Constraints. Advances in Neural Information Processing Systems 33 (2020).
  • Kambhampati (2019) Subbarao Kambhampati. 2019. Challenges of human-aware AI systems. arXiv preprint arXiv:1910.07089 (2019).
  • Khan et al. (2020) Irfan Khan, Xianchao Zhang, Mobashar Rehman, and Rahman Ali. 2020. A literature survey and empirical study of meta-learning for classifier selection. IEEE Access 8 (2020), 10262–10281.
  • Kolekar et al. (2021) Suresh Kolekar, Shilpa Gite, Biswajeet Pradhan, and Ketan Kotecha. 2021. Behavior Prediction of Traffic Actors for Intelligent Vehicle using Artificial Intelligence Techniques: A Review. IEEE Access (2021).
  • Köpf et al. (2020) Florian Köpf, Alexander Nitsch, Michael Flad, and Sören Hohmann. 2020. Partner Approximating Learners (PAL): Simulation-Accelerated Learning with Explicit Partner Modeling in Multi-Agent Domains. In 2020 6th International Conference on Control, Automation and Robotics (ICCAR). IEEE, 746–752.
  • Kreminski et al. (2019) Max Kreminski, Ben Samuel, Edward Melcer, and Noah Wardrip-Fruin. 2019. Evaluating AI-based games through retellings. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 15. 45–51.
  • Lange and Sprekeler (2020) Robert Tjarko Lange and Henning Sprekeler. 2020. Learning not to learn: Nature versus nurture in silico. arXiv preprint arXiv:2010.04466 (2020).
  • Lansdell et al. (2019) Benjamin James Lansdell, Prashanth Ravi Prakash, and Konrad Paul Kording. 2019. Learning to solve the credit assignment problem. arXiv preprint arXiv:1906.00889 (2019).
  • Lee et al. (2018) Sang-Woo Lee, Yu-Jung Heo, and Byoung-Tak Zhang. 2018. Answerer in questioner’s mind for goal-oriented visual dialogue. In Visually-Grounded Interaction and Language Workshop (NIPS).
  • Li et al. (2020) Hongming Li, Ying Ma, and Jose Principe. 2020. Cognitive architecture for video games. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–9.
  • Lieder et al. (2017a) Falk Lieder, Frederick Callaway, Sayan Gul, Paul M Krueger, and Thomas L Griffiths. 2017a. Learning to select computations. In NIPS workshop on Cognitively Informed AI.
  • Lieder and Griffiths (2017) Falk Lieder and Thomas L Griffiths. 2017. Strategy selection as rational metareasoning. Psychological Review 124, 6 (2017), 762.
  • Lieder et al. (2017b) Falk Lieder, Paul M Krueger, and Thomas L Griffiths. 2017b. An automatic method for discovering rational heuristics for risky choice. In CogSci.
  • Lillicrap and Santoro (2019) Timothy P Lillicrap and Adam Santoro. 2019. Backpropagation through time and the brain. Current Opinion in Neurobiology 55 (2019), 82–89.
  • Lindner et al. (2021) David Lindner, Rohin Shah, Pieter Abbeel, and Anca Dragan. 2021. Learning What To Do by Simulating the Past. arXiv preprint arXiv:2104.03946 (2021).
  • Liu et al. (2019) Yang Liu, Yifeng Zeng, Yingke Chen, Jing Tang, and Yinghui Pan. 2019. Self-improving generative adversarial reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. 52–60.
  • Maekawa et al. (2020) Yamato Maekawa, Naoki Akai, Takatsugu Hirayama, Luis Yoichi Morales, Daisuke Deguchi, Yasutomo Kawanishi, Ichiro Ide, and Hiroshi Murase. 2020. Modeling Eye-Gaze Behavior of Electric Wheelchair Drivers via Inverse Reinforcement Learning. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). IEEE, 1–7.
  • Mavrogiannis et al. (2021) Christoforos Mavrogiannis, Francesca Baldini, Allan Wang, Dapeng Zhao, Pete Trautman, Aaron Steinfeld, and Jean Oh. 2021. Core Challenges of Social Robot Navigation: A Survey. arXiv preprint arXiv:2103.05668 (2021).
  • McCallum (1996) R Andrew McCallum. 1996. Hidden state and reinforcement learning with instance-based state identification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 26, 3 (1996), 464–473.
  • Mikulik et al. (2020) Vladimir Mikulik, Grégoire Delétang, Tom McGrath, Tim Genewein, Miljan Martic, Shane Legg, and Pedro A Ortega. 2020. Meta-trained agents implement Bayes-optimal agents. arXiv preprint arXiv:2010.11223 (2020).
  • Möller et al. (2021) Ronja Möller, Antonino Furnari, Sebastiano Battiato, Aki Härmä, and Giovanni Maria Farinella. 2021. A Survey on Human-aware Robot Navigation. arXiv preprint arXiv:2106.11650 (2021).
  • Morrison et al. (2021) Cecily Morrison, Edward Cutrell, Martin Grayson, Anja Thieme, Alex Taylor, Geert Roumen, Camilla Longden, Sebastian Tschiatschek, Rita Faia Marques, and Abigail Sellen. 2021. Social Sensemaking with AI: Designing an Open-ended AI experience with a Blind Child. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14.
  • Morveli-Espinoza et al. (2021) Mariela Morveli-Espinoza, Juan Carlos Nieves, and Cesar Augusto Tacla. 2021. Dealing with Conflicts between Human Activities: An Argumentation-based Approach. In AAAI-21 Workshop on Plan, Activity, and Intent Recognition (PAIR 2021), held at the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Virtual, February 9, 2021.
  • Najar and Chetouani (2021) Anis Najar and Mohamed Chetouani. 2021. Reinforcement learning with human advice: a survey. Frontiers in Robotics and AI 8 (2021).
  • Navidi (2020) Neda Navidi. 2020. Human AI interaction loop training: New approach for interactive reinforcement learning. arXiv preprint arXiv:2003.04203 (2020).
  • Navidi and Landry (2021) Neda Navidi and Rene Landry. 2021. New approach in human-AI interaction by reinforcement-imitation learning. Applied Sciences 11, 7 (2021), 3068.
  • Nguyen and Daumé III (2020) Khanh Nguyen and Hal Daumé III. 2020. Active Imitation Learning from Multiple Non-Deterministic Teachers: Formulation, Challenges, and Algorithms. arXiv preprint arXiv:2006.07777 (2020).
  • Nguyen and Gonzalez (2020a) Thuy N Nguyen and Cleotilde Gonzalez. 2020a. Cognitive machine theory of mind. Technical Report. Carnegie Mellon University.
  • Nguyen and Gonzalez (2020b) Thuy N Nguyen and Cleotilde Gonzalez. 2020b. Effects of Decision Complexity in Goal seeking Gridworlds: A Comparison of Instance Based Learning and Reinforcement Learning Agents. Technical Report. Carnegie Mellon University.
  • Ni et al. (2020) Tianwei Ni, Harshit Sikchi, Yufei Wang, Tejus Gupta, Lisa Lee, and Benjamin Eysenbach. 2020. f-IRL: Inverse reinforcement learning via state marginal matching. arXiv preprint arXiv:2011.04709 (2020).
  • Ning et al. (2021) Huansheng Ning, Rui Yin, Ata Ullah, and Feifei Shi. 2021. A Survey on Hybrid Human-Artificial Intelligence for Autonomous Driving. IEEE Transactions on Intelligent Transportation Systems (2021).
  • Niu and Gu (2020) Yaru Niu and Yijun Gu. 2020. Active Hierarchical Imitation and Reinforcement Learning. arXiv preprint arXiv:2012.07330 (2020).
  • Oguntola et al. (2021) Ini Oguntola, Dana Hughes, and Katia Sycara. 2021. Deep Interpretable Models of Theory of Mind For Human-Agent Teaming. arXiv preprint arXiv:2104.02938 (2021).
  • Oh et al. (2020) Junhyuk Oh, Matteo Hessel, Wojciech M Czarnecki, Zhongwen Xu, Hado P van Hasselt, Satinder Singh, and David Silver. 2020. Discovering Reinforcement Learning Algorithms. Advances in Neural Information Processing Systems 33 (2020).
  • Oroojlooy et al. (2020) Afshin Oroojlooy, Mohammadreza Nazari, Davood Hajinezhad, and Jorge Silva. 2020. AttendLight: Universal Attention-Based Reinforcement Learning Model for Traffic Signal Control. arXiv preprint arXiv:2010.05772 (2020).
  • Osa et al. (2018) T Osa, J Pajarinen, G Neumann, JA Bagnell, P Abbeel, and J Peters. 2018. An Algorithmic Perspective on Imitation Learning. Foundations and Trends in Robotics 7, 1-2 (2018), 1–179.
  • Ota et al. (2020) Kei Ota, Devesh K Jha, Diego Romeres, Jeroen van Baar, Kevin A Smith, Takayuki Semitsu, Tomoaki Oiki, Alan Sullivan, Daniel Nikovski, and Joshua B Tenenbaum. 2020. Towards Human-Level Learning of Complex Physical Puzzles. arXiv e-prints (2020), arXiv–2011.
  • Ota et al. (2021) Kei Ota, Devesh K Jha, Diego Romeres, Jeroen van Baar, Kevin A Smith, Takayuki Semitsu, Tomoaki Oiki, Alan Sullivan, Daniel Nikovski, and Joshua B Tenenbaum. 2021. Data-Efficient Learning for Complex and Real-Time Physical Problem Solving Using Augmented Simulation. IEEE Robotics and Automation Letters 6, 2 (2021), 4241–4248.
  • Padakandla et al. (2020) Sindhu Padakandla, KJ Prabuchandran, and Shalabh Bhatnagar. 2020. Reinforcement learning algorithm for non-stationary environments. Applied Intelligence 50, 11 (2020), 3590–3606.
  • Parpart et al. (2017) Paula Parpart, Eric Schulz, Maarten Speekenbrink, and B Love. 2017. Active learning reveals underlying decision strategies. (2017).
  • Pentecost et al. (2016) D Pentecost, Charlotte Sennersten, R Ollington, C Lindley, and B Kang. 2016. Using a physics engine in ACT-R to aid decision making. International Journal on Advances in Intelligent Systems 9, 3-4 (2016), 298–309.
  • Pentland and Liu (1999) Alex Pentland and Andrew Liu. 1999. Modeling and prediction of human behavior. Neural Computation 11, 1 (1999), 229–242.
  • Perrotta and Selwyn (2020) Carlo Perrotta and Neil Selwyn. 2020. Deep learning goes to school: Toward a relational understanding of AI in education. Learning, Media and Technology 45, 3 (2020), 251–269.
  • Peterson and Beach (1967) Cameron R Peterson and Lee Roy Beach. 1967. Man as an intuitive statistician. Psychological Bulletin 68, 1 (1967), 29.
  • Puig et al. (2020) Xavier Puig, Tianmin Shu, Shuang Li, Zilin Wang, Yuan-Hong Liao, Joshua B Tenenbaum, Sanja Fidler, and Antonio Torralba. 2020. Watch-and-help: A challenge for social perception and human-AI collaboration. arXiv preprint arXiv:2010.09890 (2020).
  • Puterman (2014) Martin L Puterman. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Rabinowitz et al. (2018) Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, SM Ali Eslami, and Matthew Botvinick. 2018. Machine theory of mind. In International Conference on Machine Learning. PMLR, 4218–4227.
  • Raghu et al. (2019) Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil Mullainathan. 2019. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220 (2019).
  • Rahwan et al. (2019) Iyad Rahwan, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W Crandall, Nicholas A Christakis, Iain D Couzin, Matthew O Jackson, et al. 2019. Machine behaviour. Nature 568, 7753 (2019), 477–486.
  • Ramaraj et al. (2021) Preeti Ramaraj, Charles L Ortiz Jr, Matthew Klenk, and Shiwali Mohan. 2021. Unpacking Human Teachers’ Intentions For Natural Interactive Task Learning. arXiv preprint arXiv:2102.06755 (2021).
  • Ramponi et al. (2020) Giorgia Ramponi, Gianluca Drappo, and Marcello Restelli. 2020. Inverse Reinforcement Learning from a Gradient-based Learner. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 2458–2468. https://proceedings.neurips.cc/paper/2020/file/19aa6c6fb4ba9fcf39e893ff1fd5b5bd-Paper.pdf
  • Reddy et al. (2018) Siddharth Reddy, Anca D Dragan, and Sergey Levine. 2018. Shared autonomy via deep reinforcement learning. arXiv preprint arXiv:1802.01744 (2018).
  • Reddy et al. (2012) Tummalapalli Sudhamsh Reddy, Vamsikrishna Gopikrishna, Gergely Zaruba, and Manfred Huber. 2012. Inverse reinforcement learning for decentralized non-cooperative multiagent systems. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 1930–1935.
  • Ren et al. (2020) Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. 2020. A survey of deep active learning. arXiv preprint arXiv:2009.00236 (2020).
  • Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 627–635.
  • Russell and Wefald (1991) Stuart Russell and Eric Wefald. 1991. Principles of metareasoning. Artificial Intelligence 49, 1-3 (1991), 361–395.
  • Salter et al. (2019) Sasha Salter, Dushyant Rao, Markus Wulfmeier, Raia Hadsell, and Ingmar Posner. 2019. Attention-privileged reinforcement learning. arXiv preprint arXiv:1911.08363 (2019).
  • Schatzmann et al. (2006) Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review 21, 2 (2006), 97–126.
  • Schneider (2020) Johannes Schneider. 2020. Humans learn too: Better Human-AI Interaction using Optimized Human Inputs. arXiv preprint arXiv:2009.09266 (2020).
  • Semeraro et al. (2021) Francesco Semeraro, Alexander Griffiths, and Angelo Cangelosi. 2021. Human-Robot Collaboration and Machine Learning: A Systematic Review of Recent Research. arXiv preprint arXiv:2110.07448 (2021).
  • Settles (2009) Burr Settles. 2009. Active Learning Literature Survey. Computer Sciences Technical Report 1648. University of Wisconsin–Madison.
  • Shani et al. (2013) Guy Shani, Joelle Pineau, and Robert Kaplow. 2013. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems 27, 1 (2013), 1–51.
  • Shum et al. (2019) Michael Shum, Max Kleiman-Weiner, Michael L Littman, and Joshua B Tenenbaum. 2019. Theory of minds: Understanding behavior in groups through inverse planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6163–6170.
  • Sun et al. (2019) Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. 2019. Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641 (2019).
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT Press.
  • Ten et al. (2021) Alexandr Ten, Pramod Kaushik, Pierre-Yves Oudeyer, and Jacqueline Gottlieb. 2021. Humans monitor learning progress in curiosity-driven exploration. Nature Communications 12, 1 (2021), 1–10.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 5026–5033.
  • Ullman and Tenenbaum (2020) Tomer D Ullman and Joshua B Tenenbaum. 2020. Bayesian models of conceptual development: Learning as building models of the world. Annual Review of Developmental Psychology 2 (2020), 533–558.
  • Vanschoren (2019) Joaquin Vanschoren. 2019. Meta-learning. In Automated Machine Learning. Springer, Cham, 35–61.
  • Villareale and Zhu (2021) Jennifer Villareale and Jichen Zhu. 2021. Understanding Mental Models of AI through Player-AI Interaction. arXiv preprint arXiv:2103.16168 (2021).
  • Wang (2021) Jane X Wang. 2021. Meta-learning in natural and artificial intelligence. Current Opinion in Behavioral Sciences 38 (2021), 90–95.
  • Wang et al. (2021) Qiaosi Wang, Koustuv Saha, Eric Gregori, David Joyner, and Ashok Goel. 2021. Towards Mutual Theory of Mind in Human-AI Interaction: How Language Reflects What Students Perceive About a Virtual Teaching Assistant. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14.
  • Wang et al. (2020c) Rose E Wang, J Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, and Aleksandra Faust. 2020c. Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous. arXiv preprint arXiv:2003.06906 (2020).
  • Wang et al. (2020d) Rose E Wang, Sarah A Wu, James A Evans, Joshua B Tenenbaum, David C Parkes, and Max Kleiman-Weiner. 2020d. Too many cooks: Coordinating multi-agent collaboration through inverse planning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems. 2032–2034.
  • Wang et al. (2020b) Tianyu Wang, Vikas Dhiman, and Nikolay Atanasov. 2020b. Learning navigation costs from demonstration in partially observable environments. In 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 4434–4440.
  • Wang et al. (2020a) Zhi Wang, Chunlin Chen, and Daoyi Dong. 2020a. Instance weighted incremental evolution strategies for reinforcement learning in dynamic environments. arXiv preprint arXiv:2010.04605 (2020).
  • Wilkens et al. (2021) Uta Wilkens, Christian Reyes, Tim Treude, and Annette Kluge. 2021. Understandings and perspectives of human-centered AI - a transdisciplinary literature review.
  • Wu et al. (2020) Guojun Wu, Yanhua Li, Shikai Luo, Ge Song, Qichao Wang, Jing He, Jieping Ye, Xiaohu Qie, and Hongtu Zhu. 2020. A Joint Inverse Reinforcement Learning and Deep Learning Model for Drivers’ Behavioral Prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2805–2812.
  • Xia et al. (2020) Boming Xia, Xiaozhen Ye, and Adnan OM Abuassba. 2020. Recent Research on AI in Games. In 2020 International Wireless Communications and Mobile Computing (IWCMC). IEEE, 505–510.
  • Xie et al. (2020) Annie Xie, Dylan P Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. 2020. Learning latent representations to influence multi-agent interaction. arXiv preprint arXiv:2011.06619 (2020).
  • Xu et al. (2020) Mengdi Xu, Wenhao Ding, Jiacheng Zhu, Zuxin Liu, Baiming Chen, and Ding Zhao. 2020. Task-Agnostic Online Reinforcement Learning with an Infinite Mixture of Gaussian Processes. Advances in Neural Information Processing Systems 33 (2020).
  • Yang et al. (2020b) Qian Yang, Aaron Steinfeld, Carolyn Rosé, and John Zimmerman. 2020b. Re-examining whether, why, and how human-AI interaction is uniquely difficult to design. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–13.
  • Yang et al. (2020a) Zhibo Yang, Lihan Huang, Yupei Chen, Zijun Wei, Seoyoung Ahn, Gregory Zelinsky, Dimitris Samaras, and Minh Hoai. 2020a. Predicting goal-directed human attention using inverse reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 193–202.
  • Yannakakis and Togelius (2018) Georgios N Yannakakis and Julian Togelius. 2018. Artificial intelligence and games. Springer.
  • Young et al. (2004) R Michael Young, Mark O Riedl, Mark Branly, Arnav Jhala, RJ Martin, and CJ Saretto. 2004. An architecture for integrating plan-based behavior generation with interactive game environments. J. Game Dev. 1, 1 (2004), 1–29.
  • Zahedi and Kambhampati (2021) Zahra Zahedi and Subbarao Kambhampati. 2021. Human-AI Symbiosis: A Survey of Current Approaches. arXiv preprint arXiv:2103.09990 (2021).
  • Zhang et al. (2020) Shangtong Zhang, Vivek Veeriah, and Shimon Whiteson. 2020. Learning Retrospective Knowledge with Reverse Reinforcement Learning. Advances in Neural Information Processing Systems 33 (2020).
  • Zhao et al. (2020) Yunqi Zhao, Igor Borovikov, Fernando de Mesentier Silva, Ahmad Beirami, Jason Rupert, Caedmon Somers, Jesse Harder, John Kolen, Jervis Pinto, Reza Pourabolghasem, et al. 2020. Winning is not everything: Enhancing game development with intelligent agents. IEEE Transactions on Games 12, 2 (2020), 199–212.
  • Zhen et al. (2020) Xiantong Zhen, Yingjun Du, Huan Xiong, Qiang Qiu, Cees GM Snoek, and Ling Shao. 2020. Learning to learn variational semantic memory. arXiv preprint arXiv:2010.10341 (2020).
  • Zhi-Xuan et al. (2020) Tan Zhi-Xuan, Jordyn Mann, Tom Silver, Josh Tenenbaum, and Vikash Mansinghka. 2020. Online Bayesian goal inference for boundedly rational planning agents. Advances in Neural Information Processing Systems 33 (2020), 19238–19250.
  • Zilberstein (2011) Shlomo Zilberstein. 2011. Metareasoning and Bounded Rationality. In Metareasoning: Thinking about Thinking. MIT Press.
  • Zolna et al. (2019) Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gomez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. 2019. Task-relevant adversarial imitation learning. arXiv preprint arXiv:1910.01077 (2019).