EMOTE: An Explainable architecture for Modelling the Other Through Empathy
Abstract
We can usually assume others have goals analogous to our own. This assumption can also, at times, be applied to multi-agent games - e.g. Agent 1’s attraction to green pellets is analogous to Agent 2’s attraction to red pellets. This “analogy” assumption is tied closely to the cognitive process known as empathy. Inspired by empathy, we design a simple and explainable architecture to model another agent’s action-value function. This involves learning an Imagination Network to transform the other agent’s observed state in order to produce a human-interpretable empathetic state which, when presented to the learning agent, produces behaviours that mimic the other agent. Our approach is applicable to multi-agent scenarios consisting of a single learning agent and other (independent) agents acting according to fixed policies. This architecture is particularly beneficial for (but not limited to) algorithms using a composite value or reward function. We show our method produces better performance in multi-agent games, where it robustly estimates the other’s model in different environment configurations. Additionally, we show that the empathetic states are human interpretable, and thus verifiable.
Keywords: Multi-agent Reinforcement Learning · Interpretability · Explainability · Opponent Modeling

1 Introduction
Getting along with everyone isn’t easy, but trying to understand them is an essential first step to a smooth and pleasant encounter. To support this, most of us possess the cognitive ability known as empathy. Hoffman [7] defines empathy as “any process where the attended perception of the object’s state generates a state in the subject that is more applicable to the object’s state or situation than to the subject’s own prior state or situation.” From this definition, empathy can be interpreted as a process which helps us understand the feelings and goals of another by using ourselves as a point of reference. How would we feel if we were in a similar situation? What objects or goals, though not exactly the same, do we feel similarly towards? An implicit assumption is the belief that whoever we are observing is analogous or similar to us, even if our goals or preferences are different. As a simple example: I love chocolate but dislike liquorice. If I were to see someone eating and enjoying liquorice, I would not immediately relate to how they feel. However, if I imagined it was instead chocolate they were consuming, I could understand their joy and subsequently infer their level of enjoyment as being similar to mine when eating chocolate. In this situation, the chocolate and liquorice are analogous features.
With the increasing prevalence of robots, virtual assistants, etc. in regular life, artificially intelligent systems in shared environments will benefit from modelling and understanding other agents or humans around them. A subset of this space involves a single learning agent coexisting with, and learning to model, one or more other agents who behave under fixed policies (e.g. a human with set preferences or a pre-trained robot) [2].
To incorporate empathetic modelling behaviours in agents being trained in such scenarios, we present EMOTE - an Explainable architecture for Modelling the Other Through Empathy. Drawing inspiration from empathy, this architecture is designed to allow a learning agent to reference its own action-value function to model another agent’s action-value function, leading to a more stable and robust representation of the other. Crucially, leveraging the learning agent’s own functions enables the other agent’s model to be human-interpretable, allowing interrogation of the inferences made by the learning agent about the other agent’s goals (in the form of a reward function). We consider settings where analogous agents share a common environment - specifically, a single learning agent aims to model the other ‘independent’ agents who are pre-trained and act according to fixed policies, the reward functions of which the learning agent is not privy to.
EMOTE consists of a two stage neural network architecture. The first, called the Imagination Network, imagines an empathetic representation of the independent agent’s state. This empathetic state can be understood to be the perception of the independent agent’s state, from the perspective of the learning agent (e.g. liquorice reimagined as chocolate). This empathetic state is fed into a second network, a copy of the learning agent’s own action-value function, to observe what values the learning agent would have associated with this empathetic state.
The benefits of EMOTE are threefold. Firstly, by referencing its own action-value function, the reward and action-value estimates made by the learning agent about the independent agent are more robust, which we demonstrate in different environment settings (adversarial/assistive) and configurations (layouts). Secondly, explaining the inferences made by the learning agent about the independent agent’s behaviour is possible through the generated empathetic state, which is human-interpretable. A user can tap into this empathetic state and compare it against the original state of the independent agent, observing which features remain the same (both agents view and react to them similarly) and which have changed (e.g. the liquorice being reimagined as chocolate), offering a useful tool for interpretable verification of the independent agent’s action values. Lastly, funnelling the empathetic state through the learning agent’s own action-value function has a constraining effect such that the action-values and corresponding inferred rewards lie on a similar scale to that of the learning agent. Often in multi-agent games, knowledge of the independent agent is used by the learning agent to guide its own behaviour. A common approach to achieve this is by constructing a composite reward function, value function or policy [15, 1]. In such approaches, ensuring similarity in the value scales of the functions being combined is important for stable performance. Fortunately, our architecture naturally ensures that the range of the inferred action-value and reward functions will be comparable to those of the learning agent, obviating the need for scaling to construct the composite function. One may also view this characteristic through the lens of inverse reinforcement learning (IRL), where the goal is to map observed behaviours to a reward function. A key challenge here is that there exists a large space of reward functions which could possibly correspond to a given arbitrary behaviour. By inferring the reward function of another agent through the use of the learning agent’s own action-value function, our architecture offers an elegant solution to naturally narrow down the space of candidate reward functions, while also enabling analogous features between the agents to be mapped to similar reward ranges.
We demonstrate our proposal by integrating our EMOTE architecture into the work by Senadeera et al. [21]. Their framework focused on building considerate learning agents by flexibly combining the learning and independent agents’ reward and action-value functions. Applied to assistive and adversarial multiagent scenarios, we show that EMOTE produces more stable models of the independent agent that are robust to various settings and layouts. Further, we observe the Imagination Network is capable of recovering interpretable empathetic states (i.e., finding analogous features between the agents). The primary contribution of this work is an architecture that produces a stable model of the independent agent’s action-value function. As a result of our design, the model permits human-interpretability of the inferences made by the learning agent about the independent agent’s value function whilst ensuring these inferred values are on a similar scale to that of the learning agent. Our method does not aim to outperform the current state-of-the-art, but rather to produce robust and stable inferences about the independent agent’s action-value and reward function. We assume that, if these are inferred correctly, performance will at least match the state-of-the-art.
2 Related Literature
Modelling the other:
Modelling other agents is a key component of multi-agent reinforcement learning (MARL). There exists a vast body of work ranging from inferring the other’s policy [5, 24, 8, 22] to their goals and beliefs [19, 11] and value functions [25, 6]. These works predominantly assume that all agents are being trained concurrently, which differs from our intended setting where only a single agent is trained. Modelling agents who behave according to a fixed pre-trained policy can be tackled with approaches from the Theory of Mind (ToM) [18] and Inverse Reinforcement Learning (IRL) [14] literature. A small subset of MARL combines these two problems to create environments in which a learning agent (to be trained) coexists with, and tries to model, a pre-trained (independent) agent [16, 21], in line with the problem setting of our work. However, such works generally do not produce interpretable models, and produce arbitrarily scaled action-value estimates, which impedes the accurate inference of agent behaviours.
Modelling based on oneself:
A selection of works model the other agent based on their own model. Using ToM, [19] trains the learning agent on all possible goals during training and uses this information to infer the hidden goal of the other agent based on its behaviour. This method differs from our setting as it is constrained to games that have set goals which can be experienced by the learner. Inspired by empathy as well, [3] proposes imposing the learning agent’s own value function directly on the independent agent, using this to infer the other’s intent. A limitation is that imposing the same value function on the independent agent assumes this agent has the same values as the learner. Our work eschews this assumption, allowing for different and even opposing intentions.
Composite value and reward functions:
In multi-agent scenarios with composite reward or value functions (e.g. the summation of two or more reward or value estimates), it is important that they are scaled appropriately to ensure stable behaviours. Approaches such as VDN [23] and QMIX [20] combine separate agent value functions to conduct centralised training, thus obviating the need for such scaling. However, this does not allow for independent agents with pre-trained policies. More closely related, [1] builds a joint function of learning agent and independent agent rewards (whose rewards are already known). A similar joint reward function is built by [21]; however, they use IRL to infer the rewards of the independent agents. As a result of the space of potential rewards that can emerge through IRL [13], the paper mitigates the issue of misalignment by scaling the independent agent’s functions by a constant (the ratio of the norms of the learning agent’s reward vector and the IRL-inferred rewards of the independent agent). This method was also applied by [15]. This simple norm-based normalisation may fail in many complex scenarios and, additionally, is constrained to problems that only sum two reward or action-value functions, motivating the need for alternatives.
3 Methodology
3.1 Problem Setting
We formulate our problem within the context of a Markov Decision Process (MDP) framework [17], defined by the tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the space of actions, $T(s' \mid s, a)$ is the transition function governing the probability of moving to the next state $s'$ having taken an action $a$ in the current state $s$, $R(s, a)$ defines the reward an agent receives for taking an action $a$ in the current state $s$, and $\gamma$ is the discount factor.
To model the behaviour of the independent agent using EMOTE, we consider environment settings consisting of a learning agent, which we train, and one (or more) independent agent(s) which shares the same action and state spaces as the learning agent and behaves as per a fixed policy. We assume that the underlying reward function of the independent agent is unknown to the learning agent. Our EMOTE architecture consists of training the action-value function $Q_L$ for the learning agent using rewards returned from the environment, and simultaneously estimating the independent agent’s action-value function as $\hat{Q}_I$, using the latter’s trajectories, which we assume to be accessible (similar to a real world setting involving robots who share sensory input information). Each agent is assumed to also have their own reward function. EMOTE uses relevant IRL methods to estimate the reward function $\hat{R}_I$ of the independent agent from the estimated action-value function $\hat{Q}_I$. Further details of the EMOTE architecture are described in the subsequent sections.
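As a rough illustration of this setting (all names and types below are ours, not the paper’s), the following minimal Python sketch captures what the learning agent can and cannot access:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class EmoteSetting:
    """Hypothetical container for the setting of Section 3.1: a single learning
    agent is trained online, while an independent agent acts under a fixed,
    pre-trained policy whose reward function is unknown to the learner."""
    n_actions: int                                         # shared action space A
    state_dim: int                                         # dimensionality of the shared state space S
    gamma: float                                           # discount factor
    independent_policy: Callable[[Sequence[float]], int]   # fixed policy; its rewards are never revealed

    def observe_independent(
        self, visited_states: List[Sequence[float]]
    ) -> List[Tuple[Sequence[float], int]]:
        """Collect the (s, a_I) pairs from the independent agent's trajectory;
        EMOTE assumes these are accessible and uses them to fit the Imagination Network."""
        return [(s, self.independent_policy(s)) for s in visited_states]
```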

3.2 The EMOTE Architecture
The components of the EMOTE architecture used to estimate the independent agent’s value function are shown in Figure 1. It consists of a two stage neural network architecture. The first of these, the Imagination Network ($IN_\phi$), parameterised by $\phi$, takes state $s$, as observed by the independent agent, as input and outputs an empathetic state $\hat{s}$ representing the independent agent’s state as perceived empathetically by the learning agent. That is:

$\hat{s} = IN_\phi(s) \qquad (1)$

such that the learning agent’s greedy action in $\hat{s}$ matches the independent agent’s action in $s$. Formally, we define the empathetic state in Definition 1:
Definition 1 (Empathetic State)
In a multiagent learning scenario involving a learning agent with action-value function $Q_L$ and an independent agent (sharing the same state space $\mathcal{S}$ and action space $\mathcal{A}$) who behaves per an arbitrary unknown policy, an empathetic state $\hat{s}$ is a state in which the learning agent’s greedy action $\arg\max_a Q_L(\hat{s}, a)$ matches the independent agent’s observed action in state $s$.
Our work is applicable when the learning and independent agents share analogous features and a sufficient degree of analogy exists. These are defined as follows:
Definition 2 (Analogous features)
A subset of features $f \subseteq F$ contained in state $s$ is said to be analogous if there exists another subset of features $f' \subseteq F$ which would produce an empathetic state $\hat{s}$, as defined in Definition 1, when $f$ is swapped with $f'$, where $F$ is the feature space.
Definition 3 (Degree of Analogy)
The degree of analogy between two agents is the fraction of analogous features in the feature space.
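For concreteness, Definition 3 can be written as a simple ratio (notation ours):

```latex
% Degree of analogy between the learning and independent agents,
% where F is the shared feature space and F_an \subseteq F is the set
% of analogous features from Definition 2.
d_{\mathrm{analogy}} = \frac{|F_{\mathrm{an}}|}{|F|}, \qquad 0 \le d_{\mathrm{analogy}} \le 1
```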
An Illustrative Example:
Figure 2 shows an example where Agent 1 desires green pellets while Agent 2 desires red pellets. If Agent 2 observes state $s$, it would move left in order to obtain the red pellet. If Agent 1 were to observe state $s$ from Agent 2’s position, it would not move left, but rather down, towards the green pellet. If, however, state $s$ is fed into a trained Imagination Network, the resulting empathetic state $\hat{s}$ would see the positions of the pellets (analogous features) swapped, as depicted on the right side of Figure 2. As a result, when Agent 1 is presented the transformed state $\hat{s}$ (instead of $s$), it would select the same action (moving left) as Agent 2 in state $s$. Hence, we can interpret that, through $\hat{s}$, Agent 1 understands that how it behaves towards and values green pellets is analogous to how Agent 2 behaves towards and values red pellets.
In order to estimate the independent agent’s action values, we pass the obtained empathetic state $\hat{s}$ through a copy of the learning agent’s own action-value function $Q_L$. The intuition is that since $\hat{s}$ for the learning agent produces the same actions (behaviours) as the independent agent in state $s$, it is reasonable to assume that, for a given action, how the independent agent values $s$ is analogous to how the learning agent values $\hat{s}$. Hence, passing $\hat{s}$ through the learning agent’s action-value function produces action-values analogous to the independent agent’s. The resulting action-values $\hat{Q}_I$ then correspond to the learning agent’s interpretation of the independent agent’s action-values, based on its own action-values (and hence, its own experiences and environment interactions). In our previous example (Figure 2), via the empathetic state which swapped the colours of the pellets, the resulting Q-value for the action taken by Agent 2 (going left towards red) is similar to that of Agent 1 taking the same action in the empathetic state (going left towards the imagined green pellet).
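To make the two-stage design concrete, below is a minimal PyTorch-style sketch; the MLP form, layer sizes and function names are illustrative assumptions rather than the authors’ implementation. The Imagination Network produces the empathetic state, which a copy of the learning agent’s Q-network then scores:

```python
import torch
import torch.nn as nn

class ImaginationNetwork(nn.Module):
    """Maps the independent agent's observed state s to an empathetic state
    s_hat of the same shape (illustrative MLP; the paper also evaluates an
    image-based variant)."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


def estimate_independent_q(imagination: ImaginationNetwork,
                           q_copy: nn.Module,
                           s: torch.Tensor) -> torch.Tensor:
    """EMOTE-style estimate of the independent agent's action values: the
    empathetic state is scored by a copy of the learning agent's own
    Q-network (the copy is only refreshed every few episodes)."""
    s_hat = imagination(s)   # imagined (empathetic) state
    return q_copy(s_hat)     # interpreted as Q_hat_I(s, .) = Q_L(s_hat, .)
```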
Having now obtained $\hat{Q}_I$, we can infer the rewards $\hat{R}_I$ of the independent agent through an existing IRL method, e.g. Cascaded Supervised Learning [9]. In this way, $\hat{R}_I$ will have a similar scale or magnitude to that of the learning agent’s rewards $R_L$. This is particularly useful for MARL algorithms which make use of a composite value or reward function in their design.
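As a rough illustration of this reward-extraction step (our paraphrase; the exact cascaded formulation is given in [9]), a sample-based inversion of the Bellman equation recovers reward estimates from the estimated action values:

```latex
% Sample-based Bellman inversion: s' is the successor state observed in the
% independent agent's trajectory after taking action a in state s.
\hat{R}_I(s, a) \;\approx\; \hat{Q}_I(s, a) \;-\; \gamma \max_{a'} \hat{Q}_I(s', a')
```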
Loss Terms
While the weights of $Q_L$ are trained using DQN, we train $IN_\phi$ using a loss term (Equation 2) comprising two parts. The first ($\mathcal{L}_{CE}$) is a categorical cross entropy (CE) loss which minimises the difference between the softmax predicted action from the $Q_L$ copy and the action ($a_I$) actually taken by the independent agent. This loss is designed to ensure the greedy action of $Q_L$, for $\hat{s}$, matches the independent agent’s action (per Equation 1).
The second loss term ($\mathcal{L}_{rec}$) focuses on state reconstruction, aiming to produce an empathetic state $\hat{s}$ matching the original state $s$. Together, the goal is to produce an empathetic state through minimal changes to $s$, so that common features (e.g. walls or floors) are reconstructed, and differences between $s$ and $\hat{s}$ reflect the analogous features needed to evoke empathy (driven by $\mathcal{L}_{CE}$). Components used to construct the two loss terms are shown in Figure 1.
The user-specified hyperparameter $\beta$ balances the importance of reconstructing the empathetic state to be as similar as possible to the original state (interpretability) against the accuracy of the learning agent’s predicted actions in $\hat{s}$. We note that in practice, the copy of $Q_L$ used by EMOTE is updated every few episodes (to maintain stability).
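A minimal sketch of how such a loss could be computed is shown below; the weighting symbol $\beta$, the mean-squared-error form of the reconstruction term, and all function names are our assumptions rather than the authors’ exact formulation of Equation 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def imagination_loss(imagination: nn.Module,
                     q_copy: nn.Module,           # frozen copy of Q_L (e.g. requires_grad_(False))
                     s: torch.Tensor,             # batch of states observed by the independent agent
                     a_independent: torch.Tensor, # actions the independent agent actually took
                     beta: float) -> torch.Tensor:
    """One plausible reading of Equation 2: a cross-entropy action-matching
    term plus a state-reconstruction term, traded off by the user-specified beta."""
    s_hat = imagination(s)                          # empathetic state
    q_values = q_copy(s_hat)                        # Q_L evaluated on the empathetic state
    ce = F.cross_entropy(q_values, a_independent)   # softmax over Q_L(s_hat) should pick a_independent
    rec = F.mse_loss(s_hat, s)                      # keep s_hat close to s for interpretability
    return (1.0 - beta) * ce + beta * rec
```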
4 Experiments




Our work is constrained to settings that meet the criteria outlined in Section 3.1. To the best of our knowledge, works that fit within these settings were those by Senadeera et al. [21], Bussmann et al. [3], Noothigattu et al. [15] and Papoudakis et al. [16]. We demonstrate integration of our proposed framework on [21] only, as the other works either (a) assumed the independent agent’s behaviour was random [3], (b) involved a complex switching policy for training the learning agent via imitation learning from human example trajectories [15], or (c) did not have access to the independent agent’s trajectories during testing [16]. The Sympathy Framework of [21] consists of both a joint action-value function and reward function in its design, providing a useful illustration of the benefits proposed by the EMOTE architecture. To demonstrate the versatility of our proposed approach, we designed experiments that (1) illustrate various environment settings (two assistive and two adversarial), (2) demonstrate the potential of EMOTE to perform well even under differing degrees of the analogy assumption, (3) show the ability for the empathetic state to be constructed either as a feature-by-feature transformation or as a whole-state (image) transformation, and (4) demonstrate how, in addition to visual features, our approach can handle non-visual features that influence the dynamics of the environment in a complex and non-linear fashion. Environments were designed using MarlGrid [12] based on MiniGrid [4] (code in Supplementary).
Games are played in finite, episodic environments where actions are taken sequentially by each agent. The goal of the learning agent (red arrow) is to complete its assigned task of collecting all red pellets. At times, the agent may be faced with situations where it has the ability to behave selfishly or even harm the independent agent (yellow arrow). The Sympathy Framework trains the learning agent to still complete its task whilst being considerate of the independent agent, even though there is no reward-driven incentive for doing so. This is done by training the learning agent on a sympathetic reward [21], a convex weighted sum of the learning agent’s reward and the independent agent’s inferred reward. The weighting is determined by a selfishness term, which is a function of both agents’ action-value functions.
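Paraphrasing the description above (notation ours; see [21] for the exact formulation), the sympathetic reward takes the form of a convex combination:

```latex
% Sketch of the composite (sympathetic) reward: a convex combination of the
% learning agent's own reward r_L and the inferred reward \hat{r}_I of the
% independent agent, weighted by a selfishness term w that is a function of
% both agents' action-value functions.
r_{\mathrm{sym}} \;=\; w \, r_L \;+\; (1 - w)\, \hat{r}_I, \qquad w \in [0, 1]
```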
EMOTE replaces the network used to model the independent agent in [21], but we use the same IRL method (Cascaded Supervised Learning [9]) to infer rewards. This IRL method applies supervised learning to train the model using state-action pairs, following which an inversion of the Bellman Equation is applied to extract the reward function. EMOTE is assessed by three criteria:
1. Performance: Whether, with EMOTE, the learning agent behaves considerately towards the independent agent while completing its own task.
2. Inferred Independent Agent Rewards: Are the rewards inferred via IRL from the EMOTE architecture comparable to the learning agent’s rewards?
3. Explainability: Are the analogous features and associated empathetic states informative and reflective of the independent agent’s behaviour?
4.1 The games
Assistive 1 and 2
Both agents try to collect their corresponding coloured pellets. When the learning agent consumes one of its pellets, it receives +10 points. In Assistive 1 the independent agent is locked behind a door and in Assistive 2 one of the independent agent’s pellets is locked behind the door. The door can only be opened by the learning agent stepping on the green button, which inflicts a small negative reward (-1) on the learning agent. Once the button is pressed, the door remains open for the remainder of the episode. The learning agent also receives an additional bonus reward (+5) when it ‘wins’ the game by consuming all of its pellets. In the absence of other rewards a step penalty (-1) applies for each step taken. Pellets are placed randomly in each episode and the episode ends if all learning-agent pellets are consumed or when the game timer runs out.
Ideally, a considerate learning agent would open the door and assist the independent agent, despite this not being necessary for winning. These games are considered to have a moderate degree of analogy, as both agents react similarly to pellets, but differ in that only the learning agent can open the door.
Adversarial 1 and 2
In Adversarial 1 both agents again try to collect their respective pellets, and the learning agent earns +20 points per pellet. In Adversarial 2 only the learning agent is concerned with pellet collection. When the game starts, the independent agent can harm the learning agent (resulting in -50 points for the latter and the game ending). If, however, the learning agent steps on the green button (switching the button status from 0 to 1), it can, for a finite period of time, harm the independent agent, resulting in a positive reward (+10) for the learning agent. Harming occurs when the harming agent is within 1 square of the other agent. We investigate whether the learning agent, by virtue of the empathetic architecture, avoids harming the independent agent despite receiving positive environment rewards for doing so. Each episode terminates when the learning agent collects all pellets (receiving +30 reward), if the game timer runs out, or if the learning agent is harmed. Adversarial 1 has a high degree of analogy as both agents have comparable reactions to elements (pellets and harming), whilst Adversarial 2 has a lower degree, as the only analogous feature is the ability to harm each other.
4.2 Baselines
In each experiment, policies are learned via DQN [10]. A 5x5 field of vision is imposed around each agent, representing its visual state information. We compare EMOTE with the following baselines:
Selfish: A selfish version of the learning agent which only has access to its own rewards from the environment (no modelling of the independent agent).
Sympathy: Agents trained as per [21].
E-Feature: A feature-based Imagination Model where each state cell is represented as a feature. $\hat{s}$ is constructed via a feature-by-feature transformation.
E-Image: An image-based Imagination Model, where the entire state is transformed to create $\hat{s}$.
Benchmark 1 - Swap (B-Vis): The Imagination Model is replaced with a rule-based state transformation, such that the colours of the two agents’ pellets are swapped. This mimics an oracle baseline that presumes the hypothesis that how the learning agent feels about its pellets is how the independent agent feels towards its own pellets.
Benchmark 2 - Swap (B-Invis): The Imagination Model is replaced with a rule-based state transformation which swaps the colours of the two agents’ pellets (as in Benchmark 1) and additionally makes the button invisible. The invisibility applies only to the observed state; the underlying button status remains. This mimics an oracle corresponding to the hypothesis that, in addition to Benchmark 1’s belief, the independent agent does not consider the button to be important (treating it the same as the floor), as it cannot press it.
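As a rough sketch of what these rule-based (oracle) transformations do, assuming states are encoded as grids of integer feature codes (the codes below are hypothetical, not the actual MarlGrid encodings):

```python
import numpy as np

# Hypothetical per-cell feature codes for illustration only.
FLOOR, LEARNING_PELLET, INDEPENDENT_PELLET, BUTTON = 0, 1, 2, 3

def b_vis_transform(state: np.ndarray) -> np.ndarray:
    """B-Vis oracle: swap the two agents' pellet codes, so the learning agent
    'sees' the independent agent's pellets as its own (and vice versa)."""
    s_hat = state.copy()
    s_hat[state == LEARNING_PELLET] = INDEPENDENT_PELLET
    s_hat[state == INDEPENDENT_PELLET] = LEARNING_PELLET
    return s_hat

def b_invis_transform(state: np.ndarray) -> np.ndarray:
    """B-Invis oracle: as B-Vis, but additionally render the button invisible
    in the observed state (the underlying button status is left untouched)."""
    s_hat = b_vis_transform(state)
    s_hat[state == BUTTON] = FLOOR
    return s_hat
```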
4.3 Performance
Table 1: Door Open, Win and Harm rates for each environment and baseline.

| Game | Metric | Selfish | Sympathy | E-Feature | E-Image | B-Vis | B-Invis |
|---|---|---|---|---|---|---|---|
| Ass. 1 | Door | 0.44 ± 0.15 | 0.99 ± 0.01 | 0.99 ± 0.01 | 0.83 ± 0.11 | 1 ± 0 | 1 ± 0 |
| Ass. 1 | Win | 0.8 ± 0.11 | 0.93 ± 0.05 | 0.93 ± 0.05 | 0.92 ± 0.06 | 0.92 ± 0.06 | 0.93 ± 0.05 |
| Ass. 2 | Door | 0.09 ± 0.07 | 0.17 ± 0.1 | 0.4 ± 0.15 | 0.38 ± 0.15 | 0.21 ± 0.11 | 0.19 ± 0.11 |
| Ass. 2 | Win | 0.89 ± 0.08 | 0.92 ± 0.06 | 0.94 ± 0.05 | 0.92 ± 0.06 | 0.89 ± 0.08 | 0.92 ± 0.06 |
| Adv. 1 | Harm | 0.5 ± 0.15 | 0.45 ± 0.15 | 0.24 ± 0.12 | 0.17 ± 0.1 | 0.1 ± 0.07 | 0.1 ± 0.07 |
| Adv. 1 | Win | 0.46 ± 0.15 | 0.26 ± 0.13 | 0.4 ± 0.15 | 0.41 ± 0.15 | 0.37 ± 0.15 | 0.34 ± 0.14 |
| Adv. 2 | Harm | 0.29 ± 0.13 | 0.12 ± 0.09 | 0.1 ± 0.08 | 0.12 ± 0.09 | 0.05 ± 0.04 | 0.03 ± 0.02 |
| Adv. 2 | Win | 0.73 ± 0.13 | 0.31 ± 0.14 | 0.44 ± 0.15 | 0.47 ± 0.15 | 0.34 ± 0.14 | 0.33 ± 0.14 |
Table 1 shows the Win Rate and Door Open rate for the two Assistive environments, as well as the Win Rate and Harm Rate (rate at which the learning agent harms the independent agent) for each of the adversarial experiments. Cells highlighted in green show where the EMOTE architecture or Benchmarks had similar or better performance than the Sympathy baseline. Cells with bold text indicate better or similar performance to the Selfish baseline.
In the Assistive environments, the E-Feature and E-Image EMOTE baselines result in high win and door open rates, producing results similar to or better than the Sympathy baseline, and at times even outperforming the Benchmarks. In the adversarial environments, all harm rates were lower than the Selfish baseline’s. E-Feature, E-Image and the Benchmarks outperformed Sympathy (the state-of-the-art baseline) with lower harm rates, and produced higher win rates. The Benchmarks, however, produced lower harm rates than EMOTE. The win rates of the other baselines compared to Selfish were either similar (Adversarial 1) or lower (Adversarial 2). Overall, EMOTE induced more considerate behaviours than Sympathy.
4.4 Inferred Reward Values




Figure 4 shows the independent agent’s inferred rewards ($\hat{R}_I$) from the EMOTE and Benchmark action-value functions ($\hat{Q}_I$), and contrasts them against those from Sympathy. For reference, the learning agent’s environmental rewards for the same features are shown alongside the independent agent’s $\hat{R}_I$.
4.4.1 Assistive
To judge whether the inferred independent agent rewards are comparable to the learning agent’s environmental rewards, we can examine similarities in reward values across various features. For instance, one would expect the baselines’ inference for the independent agent consuming its pellet (IA pellet) to look similar to the learning agent’s environmental reward for consuming its own pellet (LA pellet). In Figure 4 (a) and (b) (Assistive 1 and 2), E-Feature, E-Image and the Benchmark baselines all infer positive rewards for IA pellet, though magnitudes vary. The rewards inferred by the Benchmarks are closest to the learning agent’s reward for LA pellet (+10). Opening the door elicited a slight positive value in the EMOTE runs, and the value for taking a step was close to the learning agent’s step value (-1). Under the EMOTE and Benchmark baselines, the rewards inferred for Assistive 1 and 2 are similar. This demonstrates the potential of the architecture to learn a robust model which infers consistent reward values even when the layout of the environment changes. In contrast, Sympathy inferred different rewards for the two environments. It inferred a strong positive value for the button being pressed in Assistive 1, but a slight negative value in Assistive 2 (Figure 4 (a) and (b)). Additionally, despite the high win rate in Assistive 2, the Sympathy baseline did not result in as high a door opening rate (Table 1), as it wrongly inferred a negative reward for door opening. All baselines inferred that the independent agent associates a value close to 0 with the learning agent consuming a pellet.
4.4.2 Adversarial
For the adversarial games (Figure 4 (c) and (d)), E-Feature, E-Image and both Benchmarks were able to capture the strong negative reward that the independent agent associates with being harmed. This is similar to the environmental reward the learning agent receives when it is killed. In Adversarial 1, a positive value was also associated with the independent agent consuming its own pellet. In Adversarial 1, the Sympathy baseline was not able to infer a negative reward for the independent agent being harmed. It also failed to capture a strong enough positive reward for the independent agent consuming a pellet, and inferred a strong negative reward for taking a step and pressing the button, leading to a low win rate and high harm rate similar to those of Selfish (Table 1). We expect the poor performance is due to the order-of-magnitude difference, resulting from the norm scaling, between the rewards inferred for taking a step and pressing the button relative to the other reward features. This inappropriate scaling of the reward components in turn results in poor estimates of the sympathetic reward. In Adversarial 2, the Sympathy baseline inferred a strong negative reward for the independent agent being harmed, leading to a reduced harm rate. Despite this, Sympathy had lower win rates compared to EMOTE or the Benchmarks (Table 1). EMOTE produced consistent reward inferences, almost matching the Benchmarks, unlike Sympathy, where the inferred rewards varied with the environment configuration.
4.5 Empathetic State
EMOTE’s key benefit is its ability to produce a human-interpretable empathetic state which can be used to explain some of the performance results from Section 4.3. Figure 5 shows an original state for each of the four environments examined alongside three examples of the final empathetic state (at the end of training) generated from the E-Feature and E-Image Imagination Networks. More final empathetic states, as well as examples of the change in the empathetic state for those shown in Figure 5 during training can be found in the Supplementary.
The learning agent’s pellet (LP) is fairly consistently transformed to the floor by $IN_\phi$, indicating the lack of importance the independent agent places on it. We observed EMOTE outperforming the Benchmarks (Section 4.3). We infer that because the learning agent will have experienced more instances of the floor (relative to IP, which also disappear when consumed), this feature is better estimated in its value function, leading to better performance by reimagining the LP as floor. The colour of the independent agent’s pellet (IP) becomes that of the learning agent’s pellets (blue) in the transformed state, thus explaining the inferred rewards for IA Pellet in Figure 4. This suggests that the empathy architecture allows the learning agent to interpret the relationship between the independent agent and its pellets as analogous to that between the learning agent and its own pellets.
In the empathetic states for Assistive 1 in Figure 5, the button either remains unchanged, changes to a door, or changes to an independent agent’s pellet. The independent agent does not value the button as it cannot interact with it; thus, it is mapped to features that are similarly irrelevant in the learning agent’s own reward function. In contrast, for the Assistive 2 environment, Figure 5 shows the button transforming at times to the learning agent’s pellet. This is expected, as how the independent agent moves towards the door (in front of which the button is placed) is similar to how the learning agent moves towards its own pellet.
In the Adversarial environments, the button usually disappears in the empathetic states. This is expected, as the independent agent has no influence over the button, and it is thus not a feature of importance. However, it is important to observe the value of the predicted button status in the empathetic states, the results of which are shown in Table 2. For a button status value of 0 in $s$ (button is inactive and the independent agent can harm the learning agent), the resulting status prediction in $\hat{s}$ is closer to 1, while a button status value of 1 in $s$ (button is active and the learning agent can harm the independent agent) produces a prediction in $\hat{s}$ closer to 0. This indicates that the EMOTE architecture is able to associate the button status (a non-visual feature) with the power dynamics between agents: how the independent agent behaves when the button status is 1 (i.e. trying to avoid being harmed) is similar to how the learning agent behaves when the button status is 0 (i.e. also trying to avoid being harmed). In the adversarial environments, the Benchmarks outperformed EMOTE on the harm metric, as shown in Table 1. This is because the Benchmarks are oracle baselines, where the button status was swapped using manually encoded rules. For EMOTE, however, this was a difficult transformation (with a non-linear influence on the environment dynamics) that had to be learned (Table 2).
Table 2: Predicted button status in the empathetic state, for original button status values of 0 and 1.

| Game | E-Feature (status 0) | E-Feature (status 1) | E-Image (status 0) | E-Image (status 1) |
|---|---|---|---|---|
| Adv 1 | 0.76 ± 0.11 | 0.24 ± 0.11 | 0.785 ± 0.09 | 0.215 ± 0.10 |
| Adv 2 | 0.85 ± 0.32 | 0.15 ± 0.32 | 0.88 ± 0.16 | 0.12 ± 0.16 |
5 Discussion and Future Work
The hyperparameter $\beta$ (Equation 2) balances the trade-off between the two loss terms. As $\beta$ approaches 1, the empathetic state $\hat{s}$ becomes similar to the original state $s$, resulting in the inferred rewards of the independent agent being the same as the learning agent’s rewards. In practice, the bottleneck imposed by the second model (the copy of $Q_L$) led to the finding that high $\beta$ values were beneficial. In particular, this allowed $IN_\phi$ to reproduce common features such as walls and floors, contributing to better performance. Experimental settings for $\beta$ are in the Supplementary.
The multi-objective nature of the loss term in Equation 2 can make learning stability hard to achieve (as it is not possible for both loss terms to reach 0 concurrently), and thus could be further improved. Additionally, although there is no limit on the number of independent agents modelled, a new Imagination Network would be required for each, which could hinder scalability. Future work can look to address this limitation, perhaps by exploring transfer learning based solutions where multiple imagination transformations are learned with the same network. Future work could also look to extend EMOTE to situations where all agents are being trained, and where states are only partially observable.
6 Conclusion
In this work, we presented the EMOTE architecture, which enables a learning agent to model the action-value function of an independent agent based on its own value function and rewards, under the assumption that the agents have “analogous” features. A key benefit is the ability to generate interpretable empathetic states, allowing identification of analogous features between agents. The architecture is well suited to multiagent learning algorithms, particularly those utilising composite action-value or reward functions in their design. EMOTE was demonstrated on a previously proposed Sympathy Framework, where it produced more considerate behaviours, more consistent rewards (despite re-configurations of the environment), and insightful empathetic states. We discussed the design of our loss function, and provided insights into future research directions to improve our proposed approach.
References
- [1] Alamdari, P.A., Klassen, T.Q., Icarte, R.T., McIlraith, S.A.: Be considerate: Objectives, side effects, and deciding how to act. arXiv preprint arXiv:2106.02617 (2021)
- [2] Albrecht, S., Stone, P.: Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence 258, 66–95 (2018)
- [3] Bussmann, B., Heinerman, J., Lehman, J.: Towards empathic deep q-learning. In: Espinoza, H., Yu, H., Huang, X., Lecue, F., Chen, C., Hernández-Orallo, J., Ó hÉigeartaigh, S., Mallah, R. (eds.) Artificial Intelligence Safety 2019. pp. 1–7. CEUR Workshop Proceedings, CEUR-WS.org (8 2019)
- [4] Chevalier-Boisvert, M., Willems, L., Pal, S.: Minimalistic gridworld environment for gymnasium (2018), https://github.com/Farama-Foundation/Minigrid
- [5] Foerster, J., Chen, R., Al-Shedivat, M., Whiteson, S., Abbeel, P., Mordatch, I.: Learning with opponent-learning awareness. In: AAMAS 2018: Proceedings of the Seventeenth International Joint Conference on Autonomous Agents and Multi-Agent Systems (July 2018), http://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/foersteraamas18.pdf
- [6] He, H., Boyd-Graber, J., Kwok, K., Daumé, III, H.: Opponent modeling in deep reinforcement learning. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 1804–1813. PMLR, New York, New York, USA (20–22 Jun 2016), https://proceedings.mlr.press/v48/he16.html
- [7] Hoffman, M.L.: Empathy and moral development. The annual report of educational psychology in Japan 35, 157–162 (1996)
- [8] Hu, H., Lerer, A., Peysakhovich, A., Foerster, J.: “other-play” for zero-shot coordination. In: International Conference on Machine Learning. pp. 4399–4410. PMLR (2020)
- [9] Klein, E., Piot, B., Geist, M., Pietquin, O.: A cascaded supervised learning approach to inverse reinforcement learning. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds.) Machine Learning and Knowledge Discovery in Databases. pp. 1–16. Springer Berlin Heidelberg, Berlin, Heidelberg (2013)
- [10] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
- [11] Moreno, P., Hughes, E., McKee, K.R., Pires, B.A., Weber, T.: Neural recursive belief states in multi-agent reinforcement learning. arXiv preprint arXiv:2102.02274 (2021)
- [12] Ndousse, K.: Marlgrid. https://github.com/kandouss/marlgrid (2020)
- [13] Ng, A.Y., Harada, D., Russell, S.J.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Proceedings of the Sixteenth International Conference on Machine Learning. p. 278–287. ICML ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999)
- [14] Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML. vol. 1, p. 2 (2000)
- [15] Noothigattu, R., Bouneffouf, D., Mattei, N., Chandra, R., Madan, P., Varshney, K.R., Campbell, M., Singh, M., Rossi, F.: Teaching ai agents ethical values using reinforcement learning and policy orchestration. IBM Journal of Research and Development 63(4/5), 2–1 (2019)
- [16] Papoudakis, G., Christianos, F., Albrecht, S.: Agent modelling under partial observability for deep reinforcement learning. Advances in Neural Information Processing Systems 34, 19210–19222 (2021)
- [17] Puterman, M.L.: Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons (2014)
- [18] Rabinowitz, N., Perbet, F., Song, F., Zhang, C., Eslami, S.M.A., Botvinick, M.: Machine theory of mind. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4218–4227. PMLR (10–15 Jul 2018), https://proceedings.mlr.press/v80/rabinowitz18a.html
- [19] Raileanu, R., Denton, E.L., Szlam, A.D., Fergus, R.: Modeling others using oneself in multi-agent reinforcement learning. In: ICML (2018)
- [20] Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., Whiteson, S.: Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In: International conference on machine learning. pp. 4295–4304. PMLR (2018)
- [21] Senadeera, M., Karimpanal, T.G., Gupta, S., Rana, S.: Sympathy-based reinforcement learning agents. In: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. p. 1164–1172. AAMAS ’22, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC (2022)
- [22] Shu, T., Tian, Y.: M3 rl: Mind-aware multi-agent management reinforcement learning. arXiv preprint arXiv:1810.00147 (2018)
- [23] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K., Graepel, T.: Value-decomposition networks for cooperative multi-agent learning based on team reward. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. p. 2085–2087. AAMAS ’18, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC (2018)
- [24] Wen, Y., Yang, Y., Luo, R., Wang, J., Pan, W.: Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207 (2019)
- [25] Zhao, J., Yang, M., Zhao, Y., Hu, X., Zhou, W., Zhu, J., Li, H.: Mcmarl: Parameterizing value function via mixture of categorical distributions for multi-agent reinforcement learning (2022)