
Shaping Advice in Deep Reinforcement Learning

Baicen Xiao (1), Member, IEEE, Bhaskar Ramasubramanian (1,2), Member, IEEE, Radha Poovendran (1), Fellow, IEEE

(1) Network Security Lab, Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA. {bcxiao, rp3}@uw.edu

(2) Electrical and Computer Engineering, Department of Engineering and Design, Western Washington University, Bellingham, WA 98225, USA. {ramasub}@wwu.edu

This work was supported by the Office of Naval Research via Grant N00014-17-S-B001.
Abstract

Reinforcement learning involves agents interacting with an environment to complete tasks. When rewards provided by the environment are sparse, agents may not receive immediate feedback on the quality of actions that they take, thereby affecting learning of policies. In this paper, we propose methods to augment the reward signal from the environment with an additional reward termed shaping advice in both single- and multi-agent reinforcement learning. The shaping advice is specified as a difference of potential functions at consecutive time-steps. Each potential function is a function of observations and actions of the agents. The use of potential functions is underpinned by an insight that the total potential when starting from any state and returning to the same state is always equal to zero. We show through theoretical analyses and experimental validation that the shaping advice does not distract agents from completing tasks specified by the environment reward. Theoretically, we prove that the convergence of policy gradients and value functions when using shaping advice implies the convergence of these quantities in the absence of shaping advice. We design two algorithms: Shaping Advice in Single-agent reinforcement learning (SAS) and Shaping Advice in Multi-agent reinforcement learning (SAM). Shaping advice in SAS and SAM needs to be specified only once at the start of training, and can easily be provided by non-experts. Experimentally, we evaluate SAS and SAM on two tasks in single-agent environments and three tasks in multi-agent environments that have sparse rewards. We observe that using shaping advice results in agents learning policies to complete tasks faster and obtaining higher rewards than algorithms that do not use shaping advice. Code for our experiments is available at https://github.com/baicenxiao/Shaping-Advice.

Index Terms:
shaping advice, potential-based, reinforcement learning, centralized training with decentralized execution

I Introduction

Reinforcement learning (RL) is a framework that allows agents to complete tasks in an environment, even when a model of the environment is not known [1]. An RL agent ‘learns’ to complete tasks by maximizing an expected long-term reward, where the reward signal is provided by the environment. RL algorithms have been successfully implemented in many fields, including games [2, 3, 4], robotics [5], autonomous vehicle coordination [6], analysis of social dilemmas [7], and resource allocation in cellular networks [8].

When the environment is not known, availability of immediate feedback on quality of actions taken at each time-step is critical to the learning of behaviors to successfully complete a task. This is termed credit assignment [1]. When reward signals provided by the environment are sparse, it becomes difficult to perform effective credit assignment at intermediate steps of the learning process. One approach that has been shown to improve learning when rewards are sparse is reward shaping [9, 10, 11]. Reward shaping techniques augment the reward provided by the environment with an additional shaping reward. The shaping reward can be designed to be dense (i.e., not sparse), and agents learn policies (which action to take in a particular state) using the augmented reward.

Any additional reward can distract an agent from completing a task specified by the reward provided by the environment, and therefore will need to be provided in a systematic manner [12]. In this paper, we term the additional reward given to agents at each time-step as shaping advice. The shaping advice is specified by a difference of potential functions at consecutive time-steps, where each potential function depends on observations and actions of the agents. Potential functions satisfy a critical property that the total potential when starting from a state and returning to the same state is zero [13]. This ensures that an agent will not be distracted from completing a task specified by the reward provided by the environment.

There are additional challenges when there are multiple RL agents. In such a setting, each agent will have to interact not only with its environment, but also with other agents. As behaviors of agents evolve in the environment, the environment will become non-stationary from the perspective of any single agent. Thus, agents that independently learn behaviors by treating other agents as part of the environment can give rise to unstable learning regimes [14, 15, 16].

When multiple trained agents are deployed independently, or when communication among agents is costly, the agents will need to learn decentralized policies. Decentralized policies can be efficiently learned by adopting the centralized training with decentralized execution (CTDE) paradigm, first introduced in [17]. An agent using CTDE can make use of information about other agents’ observations and actions to aid its own learning during training, but will have to take decisions independently at test-time. However, the ability of an agent to learn decentralized policies can be affected if reward signals from the environment are sparse.

Reward shaping techniques that use potential functions satisfy an additional property: the optimal policies with and without the shaping reward are identical [13]. The state of the art in potential-based reward shaping for RL [9, 10, 11] focuses on environments with finite and discrete action spaces. To the best of our knowledge, adapting the solution techniques proposed in the above papers to more general settings with continuous action spaces has not been explored. Moreover, in the multi-agent case, these works emphasize learning joint policies for agents. In comparison, we propose to learn decentralized policies for each agent, which will enable application of our method to environments with large numbers of agents.

In this paper, we develop a comprehensive framework to enable effective credit assignment in RL environments with sparse rewards. These methods incorporate information about the task and environment to define shaping advice. We term our algorithms Shaping Advice in Single-agent reinforcement learning (SAS) when there is only a single agent, and Shaping Advice in Multi-agent reinforcement learning (SAM), when there are multiple agents. The shaping advice in SAS and SAM can be interpreted as domain knowledge that aids credit assignment [18]. This advice needs to be specified only once at the start of the training process. SAS and SAM can be applied in environments with continuous or discrete state and action spaces. We demonstrate that both algorithms do not distract agents from completing tasks specified by rewards from the environment. Specifically, our contributions are:

  • We introduce SAS and SAM to incorporate potential-based shaping advice in single- and multi-agent deep RL environments with continuous action spaces.

  • We demonstrate that shaping advice provided by SAS and SAM does not distract agents from completing tasks specified by the environment reward. We accomplish this by theoretically establishing that convergence of policy gradients and values when using the shaping advice implies convergence of these quantities in the absence of the shaping advice.

  • We verify our theoretical results through extensive experimental evaluations. Specifically,

    • we evaluate SAS on two environments with sparse rewards: the puddle-jump gridworld and continuous mountain car;

    • we evaluate SAM on three tasks in the multi-agent Particle World environment [17]. All these tasks have sparse rewards. We show that using shaping advice allows agents to learn policies to complete the tasks faster, and obtain higher rewards than: i) using sparse rewards alone, and ii) a state-of-the-art reward redistribution technique from [19].

Compared to a preliminary version that appeared in [20], in this paper we develop a comprehensive framework for providing shaping advice in both single- and multi-agent RL. We provide detailed theoretical analyses and experimental evaluations for each setting.

The remainder of this paper is organized as follows: Section II presents related work and Section III provides an introduction to single- and multi-agent RL and potential-based shaping advice. Sections IV - VI present the main contributions of this paper. We provide details on SAS and SAM and present theoretical analysis of their convergence in Sections IV and V. Experiments validating the use of SAS and SAM are reported in Section VI, and Section VII concludes the paper.

II Related Work

Techniques to improve credit assignment using feedback signals provided by a human operator have been studied in single-agent RL. Demonstrations provided by a human operator were used to synthesize a ‘baseline policy’ that was used to guide learning in [21, 22]. When expert demonstrations were available, imitation learning was used to guide exploration of the RL agent in [23, 24]. Feedback provided by a human operator was converted to a shaping reward to aid training a deep RL agent in environments with delayed rewards in [25]. These techniques, however, presume the availability of a human operator, which might limit their applicability.

In the multi-agent setting, decentralized and distributed control techniques are a popular area of research. A widely studied problem in such systems is the development of algorithms that specify how information can be exchanged among agents so that they can jointly complete tasks. A technique to ensure fixed-time consensus for multi-agent systems whose interactions were specified by a directed graph was studied in [26, 27]. The authors of [28] proposed an adaptive distributed event-triggering protocol to guarantee consensus for multi-agent systems with linear dynamics and interactions specified by an undirected graph. We direct the reader to [29] for a survey of recent developments in consensus of multi-agent systems. These works, however, assumed the availability of models of individual agents’ dynamics and of the interactions between agents.

Data-driven techniques are being increasingly adopted to solve problems in multi-agent systems. These methods do not require complete knowledge of the system model; rather, they use information about the state, input, and output of each agent to establish feasibility of solutions and guarantees on convergence of reinforcement learning-based algorithms. Input and state data were used in an online manner to design a distributed control algorithm that solves a cooperative optimal output regulation problem in leader-follower systems in [30]. Information obtained from trajectories of each player was used in [31] to develop real-time solutions to multi-player games through the design of an actor-critic-based adaptive learning algorithm. The authors of [32] identified a class of networked MARL problems where enforcing specific local interactions among players permitted exploiting an exponential decay property, which enabled the development of scalable, distributed algorithms for optimization and control. A comprehensive overview of algorithms in cooperative and competitive MARL, with a focus on theoretical analyses of their convergence and complexity, was presented in [33].

Cooperative MARL tasks are one instance of the setup described above, wherein all agents share the same global reward. The authors of [34] introduced value decomposition networks that decomposed a centralized value into a sum of individual agent values to assess contributions of individual agents to a shared global reward. An additional assumption on monotonicity of the centralized value function was imposed in QMIX [35] to assign credit to an agent by enumerating a value for each valid action at a state. The action spaces of agents in the above-mentioned works were discrete and finite, and these techniques cannot be easily adapted to settings with continuous action spaces. In comparison, we study reward shaping in MARL tasks in environments with continuous action spaces.

An alternative approach to improve credit assignment is potential-based reward shaping. Although this requires prior knowledge of the problem domain, potential-based techniques have been shown to offer guarantees on optimality and convergence of policies in both single [13] and multi-agent [10, 36, 37] cases. The aforementioned works had focused on the use of potential-based methods in environments with discrete action spaces. A preliminary version of this paper [20] introduced potential-based techniques to learn stochastic policies in single-agent RL with continuous states and actions. In this paper, we adapt and extend methods from [20] to develop a comprehensive framework for providing potential-based shaping advice in both single- and multi-agent RL.

The authors of [19] developed a method called iterative relative credit refinement (IRCR), which used a ‘surrogate objective’ to uniformly redistribute a sparse reward along the length of a trajectory in single and multi-agent RL. We empirically compare SAM with IRCR, and explain why SAM is able to guide agents to learn policies that result in higher average rewards than IRCR.

Methods to learn a potential-based reward shaping function have been investigated in recent research. When domain knowledge was available, the authors of [38] proposed a temporal-difference learning method to transform a given reward function to potential-based advice. In the absence of domain knowledge, graph convolutional networks were used in [39] to generate potential functions ‘automatically’ by performing message passing between states at which the agent received a reward. Different from the above works, an approach to perform credit assignment by conditioning the value function on future events from a trajectory was examined by the authors of [40]. We note that these works have focused on the single-agent case. An analogous approach for the multi-agent case, while interesting, is beyond the scope of the present paper, and remains a promising direction for future research.

III Background

This section presents some preliminary material on single- and multi-agent RL.

III-A Single-agent Reinforcement Learning

An MDP [41] is a tuple $(S,A,\mathbb{T},\rho_{0},R)$. $S$ is the set of states, $A$ the set of actions, and $\mathbb{T}:S\times A\times S\rightarrow[0,1]$ encodes $\mathbb{P}(s_{t+1}|s_{t},a_{t})$, the probability of transitioning to $s_{t+1}$ given current state $s_{t}$ and action $a_{t}$. $\rho_{0}$ is a probability distribution over the initial states. $R:S\times A\rightarrow\mathbb{R}$ denotes the reward that the agent receives when transitioning from $s_{t}$ while taking action $a_{t}$. In this paper, $R<\infty$.

The goal for an RL agent [1] is to learn a policy $\pi$ that maximizes $J:=\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})]$. Here, $\gamma$ is a discounting factor, and the expectation is taken over the trajectory $\tau=(s_{0},a_{0},r_{0},s_{1},\dots)$ induced by policy $\pi$. If $\pi:S\rightarrow A$, the policy is deterministic. On the other hand, a randomized policy returns a probability distribution over the set of actions, and is denoted $\pi:S\times A\rightarrow[0,1]$.

The value of a state-action pair $(s,a)$ under policy $\pi$ is represented by the Q-function, written $Q^{\pi}(s,a)=\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})|s_{0}=s,a_{0}=a]$. The Q-function allows us to calculate the state value $V^{\pi}(s)=\mathbb{E}_{a\sim\pi}[Q^{\pi}(s,a)]$. The advantage of a particular action $a$ over other actions at a state $s$ is defined by $A^{\pi}(s,a):=Q^{\pi}(s,a)-V^{\pi}(s)$.

III-B Stochastic Games for Multi-agent Reinforcement Learning

A stochastic game with $n$ players is a tuple $\mathcal{G}=(S,A^{1},\dots,A^{n},\mathbb{T},R^{1},\dots,R^{n},O^{1},\dots,O^{n},\rho_{0},\gamma)$. $S$ is the set of states, $A^{i}$ is the action set of player $i$, and $\mathbb{T}:S\times A^{1}\times\dots\times A^{n}\times S\rightarrow[0,1]$ encodes $\mathbb{P}(s_{t+1}|s_{t},a^{1}_{t},\dots,a^{n}_{t})$, the probability of transitioning to state $s_{t+1}$ from $s_{t}$ given the respective player actions. $R^{i}:S\times A^{1}\times\dots\times A^{n}\rightarrow\mathbb{R}$ is the reward obtained by agent $i$ when transitioning from $s_{t}$ while each player takes action $a^{i}_{t}$. $O^{i}$ is the set of observations for agent $i$. At every state, each agent receives an observation correlated with the state: $o^{i}:S\rightarrow O^{i}$. $\rho_{0}$ is a distribution over the initial states, and $\gamma\in[0,1]$ is a discounting factor.

A policy for agent $i$ is a distribution over actions, defined by $\pi^{i}:O^{i}\times A^{i}\rightarrow[0,1]$. Let $\bm{\pi}:=\{\pi^{1},\dots,\pi^{n}\}$ and $s:=(s^{1},\dots,s^{n})$. Following [17], in the simplest case, $s^{i}=o^{i}$ for each agent $i$, and we use this for the remainder of the paper. Additional information about states of agents can be included since we compute centralized value functions. Let $V_{i}^{\bm{\pi}}(s)=V_{i}(s,\pi^{1},\dots,\pi^{n}):=\mathbb{E}_{\bm{\pi}}[\sum_{t}\gamma^{t}R^{i}_{t}|s_{0}=s,\bm{\pi}]$ and $Q_{i}^{\bm{\pi}}(s,a^{1},\dots,a^{n}):=R^{i}+\gamma\mathbb{E}_{s^{\prime}}[V_{i}^{\bm{\pi}}(s^{\prime})]$, where $V_{i}^{\bm{\pi}}(s)=\mathbb{E}_{\{a^{i}\sim\pi^{i}\}_{i=1}^{n}}[Q_{i}^{\bm{\pi}}(s,a^{1},\dots,a^{n})]$.

III-C Reward Shaping in Reinforcement Learning

Reward shaping methods augment the environment reward $R$ with an additional reward $F\in\mathbb{R}$, $F<\infty$. This changes the structure of the original MDP $M=(S,A,\mathbb{T},\rho_{0},R)$ to $M^{\prime}=(S,A,\mathbb{T},\rho_{0},R+F)$. The goal is to choose $F$ so that an optimal policy for $M^{\prime}$, $\pi^{*}_{M^{\prime}}$, is also optimal for the original MDP $M$. Potential-based reward shaping (PBRS) schemes were shown to preserve the optimality of deterministic policies in [13]. PBRS was used in model-based RL in [20], in episodic RL in [42], and was extended to planning in partially observable domains in [43]. These works focused on the finite-horizon case. In comparison, we consider the infinite-horizon, discounted cost setting.

The additional reward $F$ in PBRS is defined as a difference of potentials $\phi(\cdot)$. Specifically, $F(s_{t},a_{t},s_{t+1}):=\gamma\phi(s_{t+1})-\phi(s_{t})$. Then, the Q-function $Q^{*}_{M}(s,a)$ of the optimal greedy policy for $M$ and the optimal Q-function $Q^{*}_{M^{\prime}}(s,a)$ for $M^{\prime}$ are related by $Q^{*}_{M^{\prime}}(s,a)=Q^{*}_{M}(s,a)-\phi(s)$. Therefore, the optimal greedy policy is not changed [13, 44], since:

\[
\pi^{*}_{M^{\prime}}(s)\in\arg\max_{a\in A}Q^{*}_{M^{\prime}}(s,a)=\arg\max_{a\in A}\big(Q^{*}_{M}(s,a)-\phi(s)\big)=\arg\max_{a\in A}Q^{*}_{M}(s,a).
\]

The authors of [45] augmented $\phi(s)$ to include the action $a$ as an argument and termed this potential-based advice (PBA). They defined two forms of PBA, look-ahead PBA and look-back PBA, respectively defined by:

\[
F(s_{t},a_{t},s_{t+1},a_{t+1})=\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t}) \quad (1)
\]
\[
F(s_{t},a_{t},s_{t-1},a_{t-1})=\phi(s_{t},a_{t})-\gamma^{-1}\phi(s_{t-1},a_{t-1}). \quad (2)
\]

For the look-ahead PBA scheme, the state-action value function for $M$ following policy $\pi$ is given by:

\[
Q^{\pi}_{M}(s,a)=Q^{\pi}_{M^{\prime}}(s,a)+\phi(s,a). \quad (3)
\]

The optimal greedy policy for $M$ can be recovered from the optimal state-action value function for $M^{\prime}$ using the fact [45]:

\[
\pi^{*}_{M}(s_{t})\in\arg\max_{a\in A}\big(Q^{*}_{M^{\prime}}(s_{t},a)+\phi(s_{t},a)\big). \quad (4)
\]

The optimal greedy policy for $M$ using look-back PBA can be recovered similarly.
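
To make the two forms of PBA concrete, the minimal sketch below (our illustration, not code from the paper; the potential `phi` is a user-supplied placeholder heuristic) computes the shaping reward for a single transition using Equations (1) and (2).

# Minimal sketch of look-ahead and look-back potential-based advice (PBA),
# following Equations (1) and (2). The potential function `phi(s, a)` is a
# placeholder supplied by the user (e.g., a domain heuristic).

def look_ahead_advice(phi, s_t, a_t, s_next, a_next, gamma=0.99):
    """F(s_t, a_t, s_{t+1}, a_{t+1}) = gamma * phi(s_{t+1}, a_{t+1}) - phi(s_t, a_t)."""
    return gamma * phi(s_next, a_next) - phi(s_t, a_t)

def look_back_advice(phi, s_t, a_t, s_prev, a_prev, gamma=0.99):
    """F(s_t, a_t, s_{t-1}, a_{t-1}) = phi(s_t, a_t) - (1 / gamma) * phi(s_{t-1}, a_{t-1})."""
    return phi(s_t, a_t) - phi(s_prev, a_prev) / gamma

if __name__ == "__main__":
    # Toy potential: prefer action 1 in every state (purely illustrative).
    phi = lambda s, a: 1.0 if a == 1 else 0.0
    print(look_ahead_advice(phi, s_t=0, a_t=0, s_next=1, a_next=1))  # 0.99
    print(look_back_advice(phi, s_t=1, a_t=1, s_prev=0, a_prev=0))   # 1.0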

In the multi-agent case, the shaping advice for an agent at each time is a function of observations and actions of all agents. The shaping advice is augmented to the environment reward during training, and can take one of two forms, look-ahead and look-back, respectively given by:

\[
F^{i}_{t}(s_{t},a^{i}_{t},a^{-i}_{t},s_{t+1},a^{i}_{t+1},a^{-i}_{t+1}):=\gamma\phi_{i}(s_{t+1},a^{i}_{t+1},a^{-i}_{t+1})-\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}) \quad (5)
\]
\[
F^{i}_{t}(s_{t},a^{i}_{t},a^{-i}_{t},s_{t-1},a^{i}_{t-1},a^{-i}_{t-1}):=\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})-\gamma^{-1}\phi_{i}(s_{t-1},a^{i}_{t-1},a^{-i}_{t-1}) \quad (6)
\]

We will denote by $\mathcal{G}^{\prime}$ the $n$-player stochastic game that is identical to $\mathcal{G}$, but with rewards $R'^{i}:=R^{i}+F^{i}$ for each $i$.

When the value of the potential function is identical for all actions in a particular state, we will term this uniform advice. On the other hand, when the value of the potential function depends on the action taken in a state, we will term this nonuniform advice. We will explicitly distinguish between uniform and nonuniform variants of shaping advice in the single and multi-agent settings subsequently.

The shaping advice is a heuristic that uses knowledge of the environment and task, along with information available to the agent [46]. For example, in the particle world tasks that we study, each agent has access to positions of other agents and of landmarks, relative to itself. This is used to design shaping advice for individual agents at each time step.
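
As an illustration of such a heuristic, the sketch below gives a hypothetical potential for a navigation-style particle-world task; the reliance on the relative landmark position and the specific weighting are our assumptions, not necessarily the exact advice used in the paper's experiments.

import numpy as np

# Hypothetical potential for a navigation-style particle-world task.
# Assumes each agent observes the landmark position relative to itself
# (`rel_landmark`) and that its action contains a 2-D movement component.
# This is only an illustrative heuristic, not the paper's exact advice.

def potential(rel_landmark: np.ndarray, action_move: np.ndarray) -> float:
    dist = np.linalg.norm(rel_landmark)
    # Higher potential when the agent is near the landmark...
    closeness = -dist
    # ...and when its movement points toward the landmark.
    if dist > 1e-6 and np.linalg.norm(action_move) > 1e-6:
        heading = float(rel_landmark @ action_move) / (dist * np.linalg.norm(action_move))
    else:
        heading = 0.0
    return closeness + 0.5 * heading

# Example: landmark 3 units to the right, agent moving right.
print(potential(np.array([3.0, 0.0]), np.array([1.0, 0.0])))  # -3.0 + 0.5 = -2.5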

In this paper, we develop a framework for incorporating shaping advice in RL environments with continuous action spaces. Moreover, in the multi-agent case, different from prior works that emphasize learning joint policies for agents, we learn decentralized policies for each agent.

IV Shaping Advice in Single-Agent RL

This section presents our results when potential-based shaping advice is used to learn stochastic policies in single-agent RL. We term this Shaping Advice in Single-agent RL (SAS). This generalizes and extends the use of potential-based methods in the literature, which have hitherto focused on augmenting value-based methods to learn optimal deterministic policies. We consider two variants: (i) SAS-Uniform, where the advice for all actions at a particular state is the same, and (ii) SAS-NonUniform, where a higher weight may be given to some ‘good’ actions over others at each state. We first prove that the optimality of stochastic policies is preserved when using SAS-Uniform. Then, we describe an approach to integrate SAS-NonUniform into policy-gradient algorithms to enable effective learning of stochastic policies in single-agent RL.

IV-A Uniform Advice

The following result shows that SAS-Uniform preserves optimality even when the optimal policy is stochastic.

Proposition IV.1.

Let $\pi_{M}^{*}$ denote the optimal policy for an MDP $M$, and suppose that $\pi_{M}^{*}$ is a stochastic policy. Let $\pi_{M^{\prime}}^{*}$ denote the optimal policy for the MDP $M^{\prime}$ whose reward is augmented by $F:=\gamma\phi(s_{t+1})-\phi(s_{t})$. Then, SAS-Uniform preserves the optimality of stochastic policies, i.e., $\pi_{M}^{*}=\pi_{M^{\prime}}^{*}$.

Proof.

The goal in the original MDP $M$ was to find a policy $\pi$ in order to maximize:

\[
\pi_{M}^{*}=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\Big]. \quad (7)
\]

When using SAS-Uniform, the goal is to determine a policy so that:

\begin{align}
\pi_{M^{\prime}}^{*}&=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(R(s_{t},a_{t})+F(s_{t},a_{t},s_{t+1},a_{t+1})\big)\Big]\nonumber\\
&=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(R(s_{t},a_{t})+\gamma\phi(s_{t+1})-\phi(s_{t})\big)\Big]\nonumber\\
&=\arg\max_{\pi}\Big[\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\Big]-\mathbb{E}_{\tau\sim\pi}\big[\phi(s_{0})\big]\Big]\nonumber\\
&=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\Big]-\int_{s}\rho_{0}(s)\phi(s)\,\mathrm{d}s. \quad (8)
\end{align}

The last term in Equation (8) is constant, and does not affect the identity of the maximizing policy of (7). ∎
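
For completeness, the telescoping step used in the third equality of (8) can be written out as follows (assuming $\phi$ is bounded and $\gamma<1$):

\[
\sum_{t=0}^{\infty}\gamma^{t}\big(\gamma\phi(s_{t+1})-\phi(s_{t})\big)
=\sum_{t=1}^{\infty}\gamma^{t}\phi(s_{t})-\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t})
=-\phi(s_{0}).
\]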

IV-B Nonuniform Advice

Although SAS-Uniform preserves optimality of policies in several settings, it suffers from the drawback of being unable to encode richer information, such as desired relations between states and actions. The authors of [45] proposed potential-based nonuniform advice, a scheme that augmented the potential function by including actions as an argument together with states. In this part, we show that when using SAS-NonUniform, recovering the optimal policy can be difficult if the optimal policy is stochastic. To overcome this barrier, we propose a novel way to impart prior information about the environment in order to use SAS-NonUniform to learn a stochastic policy.

IV-B1 Stochastic policy learning with nonuniform advice

Assume that we can compute $Q^{*}_{M}(s,a)$, the optimal value of the state-action pair $(s,a)$ in MDP $M$. The optimal stochastic policy for $M$ is $\pi^{*}_{M}=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\big[Q^{*}_{M}(s,a)\big]$. From Eqn. (3), the optimal stochastic policy for the modified MDP $M^{\prime}$ is given by $\pi^{*}_{M^{\prime}}=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\big[Q^{*}_{M}(s,a)-\phi(s,a)\big]$. In general, $\pi^{*}_{M}\neq\pi^{*}_{M^{\prime}}$. If the optimal policy is deterministic, then the policy for $M$ can be recovered easily from that for $M^{\prime}$ using Eqn. (4). However, for stochastic optimal policies, we would need to average over trajectories of the MDP, which makes it difficult to recover the optimal policy for $M$ from that of $M^{\prime}$.

In what follows, we will propose a novel way to take advantage of SAS-NonUniform in the policy gradient framework in order to directly learn a stochastic policy.

IV-B2 Imparting nonuniform advice in policy gradient

Let $J_{M}(\theta)$ denote the value of a parameterized policy $\pi_{\theta}$ in MDP $M$. That is, $J_{M}(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\big]$. Following the policy gradient theorem [1], and defining $G(s_{t},a_{t}):=\sum_{i=t}^{\infty}\gamma^{i-t}r_{i}$, the gradient of $J_{M}(\theta)$ with respect to $\theta$ is:

\[
\nabla_{\theta}J_{M}(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\big[G(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]. \quad (9)
\]

Then, $\mathbb{E}_{\tau\sim\pi_{\theta}}\big[G(s_{t},a_{t})\big]=Q^{\pi_{\theta}}(s_{t},a_{t})$.

REINFORCE [1] is a policy gradient method that uses Monte Carlo simulation to learn $\theta$, where the parameter update is performed only at the end of an episode (a trajectory of length $T$). If we apply the look-ahead variant of SAS-NonUniform as in Equation (1) along with REINFORCE, then the total return from time $t$ is given by:

\begin{align}
G^{a}(s_{t},a_{t})&=\sum_{i=t}^{T}\gamma^{i-t}r_{i}+\gamma^{T-t}\phi(s_{T},a_{T})-\phi(s_{t},a_{t})\nonumber\\
&=G(s_{t},a_{t})+\gamma^{T-t}\phi(s_{T},a_{T})-\phi(s_{t},a_{t}). \quad (10)
\end{align}

Notice that if $G^{a}(s_{t},a_{t})$ is used in Eqn. (9) instead of $G(s_{t},a_{t})$, then the policy gradient is biased. One way to resolve the problem is to add the difference $-\gamma^{T-t}\phi(s_{T},a_{T})+\phi(s_{t},a_{t})$ to $G^{a}(s_{t},a_{t})$. However, this makes the learning process identical to the original REINFORCE, and nonuniform advice is not used. When using nonuniform advice in a policy gradient setup, it is important to add the term $\phi(s,a)$ so that the policy gradient remains unbiased, while still leveraging the advantage that nonuniform advice offers during learning.

IV-C Analysis and Algorithm

To integrate SAS-NonUniform with policy gradient-based techniques, we turn to temporal difference (TD) methods. TD methods update estimates of the accumulated return based in part on other learned estimates, before the end of an episode. A popular TD-based policy gradient method is the actor-critic framework [1]. In this setup, after performing action $a_{t}$ at step $t$, the accumulated return $G(s_{t},a_{t})$ is estimated by $Q_{M}(s_{t},a_{t})$ which, in turn, is estimated by $r_{t}+\gamma V_{M}(s_{t+1})$. It should be noted that these estimates are unbiased.

When the reward is augmented with look-ahead SAS-NonUniform, the accumulated return is changed to $Q_{M^{\prime}}(s_{t},a_{t})$, which is estimated by $r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})+\gamma V_{M^{\prime}}(s_{t+1})$. From Eqn. (3), at steady state, $Q_{M}(s_{t},a_{t})-Q_{M^{\prime}}(s_{t},a_{t})=\phi(s_{t},a_{t})$. Intuitively, to keep the policy gradient unbiased when augmented with look-ahead nonuniform advice, we can add $\phi(s_{t},a_{t})$ at each training step. In other words, we can use $r_{t}+\gamma\phi(s_{t+1},a_{t+1})+\gamma V_{M^{\prime}}(s_{t+1})$ as the estimated return. It should be noted that before the policy reaches steady state, adding $\phi(s_{t},a_{t})$ at each time step will not cancel out the effect of nonuniform advice. This is unlike in REINFORCE, where the addition of this term negates the effect of using nonuniform advice. In the advantage actor-critic, an advantage term is used instead of the Q-function in order to reduce the variance of the estimated policy gradient. In this case too, the potential term $\phi(s_{t},a_{t})$ can be added in order to keep the policy gradient unbiased.

Algorithm 1 SAS: Shaping Advice in Single-agent RL
0:  Differentiable policy and value functions $\pi_{\theta}(a|s)$, $V_{\omega}(s)$; shaping advice $\phi(s,a)$; maximum episode length $T_{max}$. Initialization: policy parameter $\theta$, value parameter $\omega$, learning rates $\alpha^{\theta}$ and $\alpha^{\omega}$, discount factor $\gamma$.
1:  $T=0$
2:  repeat
3:     initialize state $s_{0}$, $t\leftarrow 0$
4:     repeat
5:        Sample action $a_{t}\sim\pi_{\theta}(\cdot|s_{t})$
6:        Take action $a_{t}$, observe reward $r_{t}$, next state $s_{t+1}$
7:        $R=\begin{cases}0,&\text{if }s_{t+1}\text{ is a terminal state},\\ V_{\omega}(s_{t+1}),&\text{otherwise.}\end{cases}$
8:        if use look-ahead advice then
9:           $\delta_{t}=r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})+\gamma R-V_{\omega}(s_{t})$
10:           Update $\theta\leftarrow\theta+\alpha^{\theta}\big(\delta_{t}+\phi(s_{t},a_{t})\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$
11:        else
12:           $\delta_{t}=r_{t}+\phi(s_{t},a_{t})-\gamma^{-1}\phi(s_{t-1},a_{t-1})+\gamma R-V_{\omega}(s_{t})$
13:           Update $\theta\leftarrow\theta+\alpha^{\theta}\delta_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$
14:        end if
15:        Update $\omega\leftarrow\omega-\alpha^{\omega}\delta_{t}\nabla_{\omega}V_{\omega}(s_{t})$
16:     until $s_{t+1}$ is a terminal state
17:     $T\leftarrow T+1$
18:  until $T>T_{max}$

A procedure for augmenting the advantage actor-critic with SAS-NonUniform is presented in Algorithm 1. $\alpha^{\theta}$ and $\alpha^{\omega}$ denote learning rates for the actor and critic, respectively. When applying look-ahead nonuniform advice, at training step $t$, the parameter $\omega$ of the critic $V_{\omega}(s)$ is updated as follows:

\begin{align}
\delta^{a}_{t}&=r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})+\gamma V_{\omega}(s_{t+1})-V_{\omega}(s_{t})\nonumber\\
\omega&=\omega-\alpha^{\omega}\delta^{a}_{t}\nabla_{\omega}V_{\omega}(s_{t}),
\end{align}

where $\delta^{a}_{t}$ is the estimation error of the state value after receiving the new reward $[r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})]$ at step $t$. To ensure an unbiased estimate of the policy gradient, the potential term $\phi(s_{t},a_{t})$ is added while updating $\theta$ as [20]:

\[
\theta=\theta+\alpha^{\theta}\big(\delta^{a}_{t}+\phi(s_{t},a_{t})\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}).
\]

A similar method can be used when learning with look-back nonuniform advice. In this case, the critic and the policy parameter are updated as follows:

\begin{align}
\delta^{b}_{t}&=r_{t}+\phi(s_{t},a_{t})-\gamma^{-1}\phi(s_{t-1},a_{t-1})+\gamma V_{\omega}(s_{t+1})-V_{\omega}(s_{t})\nonumber\\
\omega&=\omega-\alpha^{\omega}\delta^{b}_{t}\nabla_{\omega}V_{\omega}(s_{t}),\nonumber\\
\theta&=\theta+\alpha\big(\delta^{b}_{t}+\gamma^{-1}\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}) \quad (11)
\end{align}

In the above case, the potential term need not be added to ensure an unbiased estimate. Then, the policy parameter update becomes:

\[
\theta=\theta+\alpha\delta^{b}_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}), \quad (12)
\]

which is exactly the policy update of the advantage actor-critic. This is formally stated in Proposition IV.2.

Proposition IV.2.

When the actor-critic is augmented with the look-back variant of SAS-NonUniform, the updates in Equations (11) and (12) are equal in expectation. That is,

\begin{align}
&\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[\big(\delta^{b}_{t}+\gamma^{-1}\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]\nonumber\\
&\qquad=\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[\delta^{b}_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big], \quad (13)
\end{align}

where $\rho^{\pi_{\theta}}$ is the distribution induced by the policy $\pi_{\theta}$.

Proof.

It is equivalent to show that:

\[
\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]=0. \quad (14)
\]

The inner expectation $\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]$ is a function of $s_{t}$, the policy $\pi_{\theta}$, and the transition probability $\mathbb{T}$. Denoting this expectation by $f(s_{t},\pi_{\theta},\mathbb{T})$, we obtain:

\begin{align}
&\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[f(s_{t},\pi_{\theta},\mathbb{T})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]\nonumber\\
&=\mathbb{E}_{s_{t}\sim\rho^{\pi_{\theta}}}\Big[\mathbb{E}_{a_{t}\sim\pi_{\theta}}\big[f(s_{t},\pi_{\theta},\mathbb{T})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]\Big]\nonumber\\
&=\mathbb{E}_{s_{t}\sim\rho^{\pi_{\theta}}}\Big[\int_{A}\pi_{\theta}(a_{t}|s_{t})f(s_{t},\pi_{\theta},\mathbb{T})\frac{\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}\,\mathrm{d}a\Big]\nonumber\\
&=\mathbb{E}_{s_{t}\sim\rho^{\pi_{\theta}}}\Big[f(s_{t},\pi_{\theta},\mathbb{T})\nabla_{\theta}\int_{A}\pi_{\theta}(a_{t}|s_{t})\,\mathrm{d}a\Big]=0. \quad (15)
\end{align}

The last equality follows from the fact that the integral evaluates to $1$, and its gradient is $0$. ∎
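
The key identity used above, $\mathbb{E}_{a\sim\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)]=0$, can also be checked numerically; the snippet below is our illustration with a softmax policy over three actions (not part of the paper's code) and estimates this expectation by sampling.

import numpy as np

# Numerical check of the score-function identity used in Eqn. (15):
# E_{a ~ pi_theta}[ grad_theta log pi_theta(a|s) ] = 0.
# Illustrative softmax policy over 3 actions with logits `theta`.

rng = np.random.default_rng(0)
theta = np.array([0.2, -1.0, 0.7])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(a, theta):
    # d/dtheta log pi(a) = one_hot(a) - pi for a softmax policy.
    return np.eye(len(theta))[a] - softmax(theta)

pi = softmax(theta)
samples = rng.choice(len(theta), size=100_000, p=pi)
estimate = np.mean([grad_log_pi(a, theta) for a in samples], axis=0)
print(estimate)  # each component is close to 0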

Remark IV.3.

Look-back nonuniform advice could result in better performance than look-ahead nonuniform advice, since it does not involve estimating a future action.
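
To make Algorithm 1 concrete, the following minimal sketch (our illustration; the tabular setting, toy potential, and hyperparameters are assumptions, not the paper's released implementation) performs one look-ahead SAS update with a softmax policy and a linear critic over one-hot state features.

import numpy as np

# Sketch of one look-ahead SAS update (Algorithm 1, lines 9-10 and 15), assuming a
# softmax policy and a linear critic over one-hot state features. The potential
# `phi` and the toy sizes below are illustrative placeholders.

n_states, n_actions, gamma = 4, 2, 0.99
alpha_theta, alpha_omega = 0.01, 0.1
theta = np.zeros((n_states, n_actions))   # policy logits
omega = np.zeros(n_states)                # critic weights, V(s) = omega[s]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def phi(s, a):
    # Placeholder nonuniform advice: mildly prefer action 1 everywhere.
    return 0.1 if a == 1 else 0.0

def sas_lookahead_update(s, a, r, s_next, a_next, terminal):
    R = 0.0 if terminal else omega[s_next]
    # TD-error with look-ahead advice (Algorithm 1, line 9).
    delta = r + gamma * phi(s_next, a_next) - phi(s, a) + gamma * R - omega[s]
    # Actor update: phi(s, a) is added back to keep the policy gradient unbiased (line 10).
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * (delta + phi(s, a)) * grad_log_pi
    # Critic update via semi-gradient TD(0) (cf. Algorithm 1, line 15; sign convention may differ).
    omega[s] += alpha_omega * delta

# Example transition.
sas_lookahead_update(s=0, a=1, r=0.0, s_next=1, a_next=0, terminal=False)
print(theta[0], omega[0])

The look-back variant would instead use the error in Eqn. (12), without adding the potential term back to the actor update.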

V Shaping Advice in Multi-Agent RL

Figure 1: Schematic of SAM. A centralized critic estimates value functions $V_{\omega_{1}},\dots,V_{\omega_{n}}$. Actions for an agent $i$ are sampled from its policy $\pi_{\theta_{i}}$ in a decentralized manner. Actions and observations of all agents are used to determine shaping advice $F^{1},\dots,F^{n}$. The advice $F^{i}$ is augmented to the reward $r^{i}$ from the RL environment. The workflow shown by blue arrows in the outer box is required only during training. During execution, only the workflow shown by the red arrows inside the inner boxes is needed.

This section introduces shaping advice in multi-agent reinforcement learning (SAM). The goal of SAM is to augment shaping advice to the reward supplied by the MARL environment to provide immediate feedback to agents on their actions. SAM uses the CTDE paradigm wherein agents share parameters with each other during the training phase, but execute decentralized policies using only their own observations at test-time. Figure 1 shows a schematic of SAM. We subsequently detail how the shaping advice is provided to the agents, and analyze the optimality and convergence of policies when using SAM.

V-A Centralized Critic

SAM uses a centralized critic during the training phase. Information about observations and actions of all agents is used to learn a decentralized policy for each agent. One way to do this is by using an actor-critic framework, which combines policy gradients with temporal difference (TD) techniques.

At time $t$, the joint action $(a^{1}_{t},\dots,a^{n}_{t})$ is used to estimate the accumulated return for each agent $i$ as $r^{i}_{t}+\gamma V^{i}(s_{t+1})$. This quantity is called the TD-target. Subtracting $V^{i}(s_{t})$ from the TD-target gives the TD-error, which is an unbiased estimate of the agent's advantage [1]. Each actor can then be updated following a gradient based on this TD-error.

We learn a separate critic for each agent, as in [17]. However, the learning process can be affected when rewards provided by the environment are sparse. SAM uses a potential-based heuristic as shaping advice that is augmented to the reward received from the environment. The resulting reward is less sparse and can be used by the agents to learn policies.

V-B Shaping Advice in Multi-Agent Actor-Critic

We describe how to augment shaping advice to the multi-agent policy gradient to assign credit. We use the actor-critic framework with a centralized critic and decentralized actors.

For an agent $i$, shaping advice $F^{i}$ is augmented to the environment reward $r^{i}$ at each time step. $F^{i}$ is specified by a difference of potentials (Eqn. (5) or (6)). The centralized critic allows using observations and actions of all agents to specify $F^{i}$. Using look-ahead advice, the $Q$-values in the modified game $\mathcal{G}^{\prime}$ with reward $R^{i}+F^{i}$ and the original game $\mathcal{G}$ with reward $R^{i}$ are related as [45]:

\[
[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}}=[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}) \quad (16)
\]

The accumulated return in $\mathcal{G}^{\prime}$ for agent $i$ is then estimated by $r^{i}_{t}+\gamma\phi_{i}(s_{t+1},a^{i}_{t+1},a^{-i}_{t+1})-\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})+\gamma V^{i}(s_{t+1})$. From Eqn. (16), we can add $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$ to the TD-target in $\mathcal{G}^{\prime}$ at each time $t$ to keep the policy gradient unbiased in $\mathcal{G}$.

Let the critic and actor in SAM for agent $i$ be parameterized by $\omega_{i}$ and $\theta_{i}$, respectively. When the actor is updated at a slower rate than the critic, the asymptotic behavior of the critic can be analyzed by keeping the actor fixed, using two time-scale stochastic approximation methods [47]. For agent $i$, the TD-error at time $t$ is given by:

\[
\delta^{i}_{t}:=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}}(s_{t+1})-V_{\omega_{i}}(s_{t}). \quad (17)
\]

The update of the critic can be expressed as a first-order ordinary differential equation (ODE) in $\omega_{i}$, given by:

\[
\dot{\omega}_{i}=\mathbb{E}_{\bm{\pi_{\theta}}}[\delta^{i}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})] \quad (18)
\]

Under an appropriate parameterization of the value function, this ODE will converge to an asymptotically stable equilibrium, denoted $\omega_{i}(\bm{\theta})$. At this equilibrium, the TD-error for agent $i$ is $\delta_{t,\omega_{i}(\bm{\theta})}^{i}=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}(\bm{\theta})}(s_{t+1})-V_{\omega_{i}(\bm{\theta})}(s_{t})$.

The update of the actor can then be determined by solving a first-order ODE in $\theta_{i}$. With look-ahead advice, a term corresponding to the shaping advice at time $t$ will have to be added to ensure an unbiased policy gradient (Eqn. (16)). This ODE can be written as:

\[
\dot{\theta}_{i}=\mathbb{E}_{\bm{\pi_{\theta}}}[(\delta_{t,\omega_{i}(\bm{\theta})}^{i}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}))\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})] \quad (19)
\]

A potential term will not have to be added to ensure an unbiased policy gradient when utilizing look-back advice. This insight follows from Proposition 3 in [20] since we consider decentralized policies.
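
To make these updates concrete, the sketch below (our illustration, with placeholder potentials, sizes, and a tabular joint-state encoding; the actors condition on the joint state index here for brevity, whereas in SAM they condition on each agent's own observation) performs one look-ahead SAM step per agent using Eqns. (17) and (19).

import numpy as np

# Sketch of SAM's per-agent updates with look-ahead advice (Eqns. (17) and (19)),
# assuming two agents, discrete actions, softmax decentralized actors, and linear
# centralized critics over a one-hot encoding of the joint state. The potential
# `phi_i` and the sizes below are illustrative placeholders, not the paper's code.

n_joint_states, n_actions, n_agents = 5, 2, 2
gamma, alpha_theta, alpha_omega = 0.95, 0.01, 0.1
theta = [np.zeros((n_joint_states, n_actions)) for _ in range(n_agents)]
omega = [np.zeros(n_joint_states) for _ in range(n_agents)]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def phi_i(i, s, joint_a):
    # Placeholder advice: each agent is mildly encouraged to pick action 1.
    return 0.1 if joint_a[i] == 1 else 0.0

def sam_lookahead_step(s, joint_a, rewards, s_next, joint_a_next, terminal):
    for i in range(n_agents):
        F_i = gamma * phi_i(i, s_next, joint_a_next) - phi_i(i, s, joint_a)
        V_next = 0.0 if terminal else omega[i][s_next]
        # TD-error in the modified game G' (Eqn. (17)).
        delta = rewards[i] + F_i + gamma * V_next - omega[i][s]
        # Decentralized actor update; phi_i is added back so the policy
        # gradient stays unbiased in the original game G (Eqn. (19)).
        grad_log_pi = -softmax(theta[i][s])
        grad_log_pi[joint_a[i]] += 1.0
        theta[i][s] += alpha_theta * (delta + phi_i(i, s, joint_a)) * grad_log_pi
        # Centralized critic update (semi-gradient TD(0)).
        omega[i][s] += alpha_omega * delta

sam_lookahead_step(s=0, joint_a=(1, 0), rewards=(0.0, 0.0),
                   s_next=2, joint_a_next=(0, 1), terminal=False)
print(theta[0][0], omega[0][0])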

V-C Analysis and Algorithm

In this part, we present a proof of the convergence of the actor and critic parameters when learning with shaping advice. We also demonstrate that convergence of policy gradients and values when using SAM implies convergence of these quantities in the absence of SAM. This will guarantee that policies learned in the modified stochastic game 𝒢\mathcal{G}^{\prime} will be locally optimal in the original game 𝒢\mathcal{G}.

For agent $i$, the update dynamics of the critic can be expressed by the ODE in Eqn. (18). Assuming parameterization of $V(s)$ over a linear family, this ODE will converge to an asymptotically stable equilibrium [47]. The actor update is then given by the ODE in Eqn. (19). The parameters associated with the critics are assumed to be updated on a faster timescale than those of the actors. Then, the behaviors of the actor and critic can be analyzed separately using two-timescale stochastic approximation techniques [47].

Assumption V.1.

We make the following assumptions:

  1. At any time $t$, an agent is aware of the actions taken by all other agents. Rewards received by the agents at each time step are uniformly bounded.

  2. The Markov chain induced by the agent policies is irreducible and aperiodic.

  3. For each agent $i$, the update of its policy parameter $\theta_{i}$ includes a projection operator $\Gamma_{i}$, which projects $\theta_{i}$ onto a compact set $\Theta_{i}$. We assume that $\Theta_{i}$ includes a stationary point of $\nabla_{\theta_{i}}J_{i}(\bm{\theta})$ for each $i$.

  4. For each agent $i$, its value function is parameterized by a linear family. That is, $V_{\omega_{i}}(s)=\Phi_{i}\omega_{i}$, where $\Phi_{i}$ is a known, full-rank feature matrix for each $i$.

  5. For each agent $i$, the TD-error at each time $t$ and the gradients $\nabla_{\omega_{i}}V_{\omega_{i}}(s)$ are bounded, and the gradients $\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(\cdot|s_{t})$ are Lipschitz with bounded norm.

  6. The learning rates satisfy $\sum_{t}\alpha^{\theta}_{t}=\sum_{t}\alpha^{\omega}_{t}=\infty$, $\sum_{t}[(\alpha^{\theta}_{t})^{2}+(\alpha^{\omega}_{t})^{2}]<\infty$, and $\lim_{t\rightarrow\infty}\alpha^{\theta}_{t}/\alpha^{\omega}_{t}=0$ (one concrete choice of such rates is sketched below).
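
For concreteness, one choice of learning rates satisfying the last assumption (an illustrative example, not one prescribed in the paper) is

\[
\alpha^{\omega}_{t}=(t+1)^{-0.6},\qquad \alpha^{\theta}_{t}=(t+1)^{-0.9}.
\]

Both sums $\sum_{t}\alpha^{\omega}_{t}$ and $\sum_{t}\alpha^{\theta}_{t}$ diverge, $\sum_{t}[(\alpha^{\theta}_{t})^{2}+(\alpha^{\omega}_{t})^{2}]<\infty$ since the squared exponents $1.8$ and $1.2$ both exceed $1$, and $\alpha^{\theta}_{t}/\alpha^{\omega}_{t}=(t+1)^{-0.3}\rightarrow 0$, so the actor is updated on a slower timescale than the critic.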

We first state a useful result from [48].

Lemma V.2 ([48]).

Let $\Gamma:\mathbb{R}^{k}\rightarrow\mathbb{R}^{k}$ be a projection onto a compact set $K\subset\mathbb{R}^{k}$. Define

\[
\hat{\Gamma}(h(x)):=\lim_{\epsilon\downarrow 0}\frac{\Gamma(x+\epsilon h(x))-x}{\epsilon}
\]

for $x\in K$ and $h:\mathbb{R}^{k}\rightarrow\mathbb{R}^{k}$ continuous on $K$. Consider the update $x_{t+1}=\Gamma(x_{t}+\alpha_{t}(h(x_{t})+\xi_{t,1}+\xi_{t,2}))$ and its associated ODE $\dot{x}=\hat{\Gamma}(h(x))$. Assume that:

i) $\{\alpha_{t}\}$ is such that $\sum_{t}\alpha_{t}=\infty$ and $\sum_{t}\alpha_{t}^{2}<\infty$;

ii) $\{\xi_{t,1}\}$ is such that for all $\epsilon>0$, $\lim_{t}\mathbb{P}(\sup_{n\geq t}||\sum_{\tau=t}^{n}\alpha_{\tau}\xi_{\tau,1}||\geq\epsilon)=0$;

iii) $\{\xi_{t,2}\}$ is an almost surely bounded random sequence, and $\xi_{t,2}\rightarrow 0$ almost surely.

Then, if the set of asymptotically stable equilibria of the ODE in $\dot{x}$ is compact, denoted $K_{eq}$, the updates $x_{t+1}$ converge almost surely to $K_{eq}$.

Let $\{\mathcal{F}^{\omega}_{t}\}$ be the filtration where $\mathcal{F}^{\omega}_{t}:=\sigma(s_{\tau},r^{1}_{\tau},\dots,r^{n}_{\tau},\omega_{1_{\tau}},\dots,\omega_{n_{\tau}}:\tau\leq t)$ is an increasing $\sigma$-algebra generated by iterates of $\omega_{i}$ up to time $t$. We first analyze the behavior of the critic when the parameters of the actor are fixed.

Theorem V.3.

For a fixed policy $\bm{\pi_{\theta}}$, the update $\omega_{i}\leftarrow\omega_{i}-\alpha_{t}^{\omega}\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})$ converges almost surely to the set of asymptotically stable equilibria of the ODE $\dot{\omega}_{i}=h_{i}(\omega_{i}):=\mathbb{E}_{\bm{\pi_{\theta}}}[\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})|\mathcal{F}^{\omega}_{t}]$.

Proof.

Let $\xi^{i}_{t,1}:=\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})-\mathbb{E}_{\bm{\pi_{\theta}}}[\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})|\mathcal{F}^{\omega}_{t}]$. Then, the $\omega_{i}$ update can be written as $\omega_{i}\leftarrow\omega_{i}-\alpha_{t}^{\omega}[h_{i}(\omega_{i})+\xi^{i}_{t,1}]$, where $h_{i}(\omega_{i})$ is continuous in $\omega_{i}$. Since $\delta^{i}_{t}$ and $\nabla_{\omega_{i}}V_{\omega_{i}}(s)$ are bounded, $\xi^{i}_{t,1}$ is almost surely bounded.

Let $M^{i}_{t}:=\sum_{\tau=0}^{t}\alpha^{\omega}_{\tau}\xi^{i}_{\tau,1}$. Then $\{M^{i}_{t}\}$ is a martingale (a martingale [49] is a stochastic process $S_{1},S_{2},\dots$ that satisfies $\mathbb{E}(|S_{n}|)<\infty$ and $\mathbb{E}(S_{n+1}|S_{1},\dots,S_{n})=S_{n}$ for each $n=1,2,\dots$), and $\sum_{t}||M^{i}_{t}-M^{i}_{t-1}||^{2}=\sum_{t}||\alpha^{\omega}_{t}\xi^{i}_{t,1}||^{2}<\infty$ almost surely. Therefore, from the martingale convergence theorem [49], the sequence $\{M^{i}_{t}\}$ converges almost surely, and the conditions in Lemma V.2 are satisfied.

Since $V_{\omega_{i}}=\Phi_{i}\omega_{i}$, with $\Phi_{i}$ a full-rank matrix, $h_{i}(\omega_{i})$ is a linear function, and the ODE has a unique equilibrium point. This equilibrium is asymptotically stable since the ODE dynamics are governed by a matrix of the form $(\gamma T_{\pi}-I)$, where $I$ is an identity matrix and $T_{\pi}$ is a stochastic state-transition matrix under policy $\pi$, so that the eigenvalues of $(\gamma T_{\pi}-I)$ have (strictly) negative real parts [50]. Denote this asymptotically stable equilibrium by $\omega_{i}(\bm{\theta})$. ∎

We can now analyze the behavior of the actor, assuming that the critic parameters have converged to an asymptotically stable equilibrium. With $\omega_{i}(\bm{\theta})$ a limit point of the critic update, let $\delta_{t,\omega_{i}(\bm{\theta})}^{i}=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}(\bm{\theta})}(s_{t+1})-V_{\omega_{i}(\bm{\theta})}(s_{t})$. When using look-ahead or look-back advice, define $\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}$ as:

\begin{align}
\text{look-ahead:}\quad&\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}:=\delta_{t,\omega_{i}(\bm{\theta})}^{i}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})\nonumber\\
\text{look-back:}\quad&\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}:=\delta_{t,\omega_{i}(\bm{\theta})}^{i}. \quad (20)
\end{align}

Let $\{\mathcal{F}^{\theta}_{t}\}$ be a filtration where $\mathcal{F}^{\theta}_{t}:=\sigma(\bm{\theta}_{\tau}:=[\theta_{1_{\tau}}\dots\theta_{n_{\tau}}]:\tau\leq t)$ is an increasing $\sigma$-algebra generated by iterates of $\theta_{i}$ up to time $t$.

Theorem V.4.

The update $\theta_{i}\leftarrow\Gamma_{i}[\theta_{i}+\alpha_{t}^{\theta}\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})]$ converges almost surely to the set of asymptotically stable equilibria of the ODE $\dot{\theta}_{i}=\hat{\Gamma}_{i}(h_{i}(\theta_{i}))$, where $h_{i}(\theta_{i})=\mathbb{E}_{\bm{\pi_{\theta}}}[\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})|\mathcal{F}^{\theta}_{t}]$.

Proof.

Let $\xi^{i}_{t,1}:=\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})-\mathbb{E}_{\bm{\pi_{\theta}}}[\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})|\mathcal{F}^{\theta}_{t}]$ and $\xi^{i}_{t,2}:=\mathbb{E}_{\bm{\pi_{\theta}}}[(\tilde{\delta}^{i}_{t}-\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})})\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})|\mathcal{F}^{\theta}_{t}]$. Then, the update of $\theta_{i}$ can be written as $\theta_{i}\leftarrow\theta_{i}+\alpha^{\theta}_{t}[h_{i}(\theta_{i})+\xi^{i}_{t,1}+\xi^{i}_{t,2}]$, where $h_{i}(\theta_{i})$ is continuous in $\theta_{i}$. We now need to verify that the conditions in Lemma V.2 are satisfied.

Since the critic parameters converge almost surely to a fixed point, $\tilde{\delta}^{i}_{t}-\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}\rightarrow 0$ almost surely. Therefore, $\xi^{i}_{t,2}\rightarrow 0$ almost surely, verifying Condition iii) in Lemma V.2.

Since $\tilde{\delta}^{i}_{t}$ and $\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})$ are bounded, $\xi^{i}_{t,1}$ is continuous in $\theta_{i}$, and $\theta_{i}$ belongs to a compact set, the sequence $\{\xi^{i}_{t,1}\}$ is bounded almost surely [51]. If $M^{i}_{t}:=\sum_{\tau=0}^{t}\alpha^{\theta}_{\tau}\xi^{i}_{\tau,1}$, then $\{M^{i}_{t}\}$ is a martingale, and $\sum_{t}||M^{i}_{t}-M^{i}_{t-1}||^{2}=\sum_{t}||\alpha^{\theta}_{t}\xi^{i}_{t,1}||^{2}<\infty$ almost surely. Then, $\{M^{i}_{t}\}$ converges almost surely [49], satisfying Condition ii) of Lemma V.2. Condition i) is true by assumption, completing the proof. ∎

Theorems V.3 and V.4 demonstrate the convergence of critic and actor parameters in the stochastic game with the shaped reward, $\mathcal{G}^{\prime}$. However, our objective is to provide a guarantee of convergence in the original game $\mathcal{G}$. We establish such a guarantee when the parameterization of the value function results in small errors and the policy gradients in $\mathcal{G}^{\prime}$ are bounded.

Definition V.5.

For a probability measure $\mu$ on a finite set $\mathcal{M}$, the $\ell_{2}$-norm of a function $f$ with respect to $\mu$ is defined as $||f||_{\mu}:=\big[\int_{\mathcal{M}}|f(X)|^{2}d\mu(X)\big]^{\frac{1}{2}}=\big[\mathbb{E}_{\mu}(|f(X)|^{2})\big]^{\frac{1}{2}}$.

Theorem V.6.

In the stochastic game $\mathcal{G}^{\prime}$, let $(\gamma+1)||V_{i}^{\pi_{\bm{\theta}}}(s)-V_{\omega_{i}(\bm{\theta})}(s)||_{\pi_{\bm{\theta}}}\leq\mathcal{E}_{i}(\bm{\theta})$, and let $||\nabla_{\theta_{i}}\log\pi_{\theta_{i}}||_{\pi_{\bm{\theta}}}\leq C_{i}(\bm{\theta})$. Let $(\bm{\theta}^{*},\omega(\bm{\theta})^{*})$ be the set of limit points of SAM.

Then, in the original stochastic game $\mathcal{G}$, for each agent $i$, $||\nabla_{\theta_{i}}J_{i}(\bm{\theta}^{*})||_{2}\leq C_{i}(\bm{\theta}^{*})\mathcal{E}_{i}(\bm{\theta}^{*})$.

Proof.

Let $\Theta_{i_{eq}}$ denote the set of asymptotically stable equilibria of the ODE in $\theta_{i}$. Let $\Theta_{eq}:=\Theta_{1_{eq}}\times\dots\times\Theta_{n_{eq}}$. Then, in the set $\Theta_{eq}$, $\dot{\theta}_{i}=0$ for each agent $i$.

Consider a policy $\pi_{\bm{\theta}}$, $\bm{\theta}\in\Theta_{eq}$. In the original game $\mathcal{G}$,

\[
\nabla_{\theta_{i}}J_{i}(\bm{\theta})=\mathbb{E}_{\bm{\pi_{\theta}}}[\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a^{i}_{t}|o^{i}_{t})Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})] \quad (21)
\]

From Equation (16), $[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}}=[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$. Since we use an advantage actor-critic, we replace $[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}$ with an advantage term, defined as $[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}-V^{\bm{\pi_{\theta}}}_{i}(s_{t})$. Substituting these quantities in Equation (21),

\begin{align}
\nabla_{\theta_{i}}J_{i}(\bm{\theta})=\mathbb{E}_{\bm{\pi_{\theta}}}\big[&\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a^{i}_{t}|o^{i}_{t})\cdot\big(r^{i}_{t}+F^{i}_{t}+\gamma V_{i}^{\pi_{\bm{\theta}}}(s_{t+1})\nonumber\\
&-V^{\bm{\pi_{\theta}}}_{i}(s_{t})+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})\big)\big] \quad (22)
\end{align}

At equilibrium, $\dot{\theta}_{i}=0$ in Equation (19). Subtracting this from Equation (22),

\begin{align}
\nabla_{\theta_{i}}J_{i}(\bm{\theta})-\dot{\theta}_{i}&=\nabla_{\theta_{i}}J_{i}(\bm{\theta})\nonumber\\
&=\mathbb{E}_{\bm{\pi_{\theta}}}\big[\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a^{i}_{t}|o^{i}_{t})\cdot\big(\gamma(V_{i}^{\pi_{\bm{\theta}}}(s_{t+1})-V_{\omega_{i}(\bm{\theta})}(s_{t+1}))\nonumber\\
&\qquad-(V_{i}^{\pi_{\bm{\theta}}}(s_{t})-V_{\omega_{i}(\bm{\theta})}(s_{t}))\big)\big]
\end{align}

Using the Cauchy-Schwarz inequality,

$\|\nabla_{\theta_{i}}J_{i}(\bm{\theta}^{*})\|_{2}\leq|\gamma+1|\cdot\|V_{i}^{\pi_{\bm{\theta}}}(s)-V_{\omega_{i}(\bm{\theta})}(s)\|_{\pi_{\bm{\theta}}}\cdot\|\nabla_{\theta_{i}}\log\pi_{\theta_{i}}\|_{\pi_{\bm{\theta}}}\leq C_{i}(\bm{\theta}^{*})\mathcal{E}_{i}(\bm{\theta}^{*})$ (23)

Each term on the right-hand side of Equation (23) is bounded. Thus, $J_{i}(\bm{\theta})$ converges for each agent $i$ in the original game $\mathcal{G}$, even though policies are synthesized in the modified game $\mathcal{G}^{\prime}$. ∎

Theorem V.6 demonstrates that the additional reward $F^{i}$ provided by SAM to guide the agents does not distract them from accomplishing the task objective originally specified by the environment reward $R^{i}$.

Under similar assumptions, we obtain Corollary V.7 for the single-agent case, and show guarantees on the convergence of Algorithm 1 using the theory of two time-scale stochastic approximation. Corollary V.7 bounds the error introduced by approximating the value function $V_{M^{\prime}}$ with $V_{M^{\prime}}^{\omega}$ in the MDP $M^{\prime}$. This error term is small if the linear function family for $V_{M^{\prime}}^{\omega}$ is rich. In fact, if the critic is updated in batches, a tighter bound can be achieved, as shown in Proposition 1 of [52].

Corollary V.7.

Let $\mathcal{E}(\theta):=\left\lVert V^{\omega(\theta)}_{M^{\prime}}(s)-V^{\pi_{\theta}}_{M^{\prime}}(s)\right\rVert_{\rho^{\pi_{\theta}}}$. Then, for any limit point $(\theta^{*},\omega^{*}):=\lim_{T_{max}\to\infty}(\theta_{T_{max}},\omega_{T_{max}})$ of Algorithm 1, $\left\lVert\nabla_{\theta}J_{M}(\theta^{*})\right\rVert_{2}\leq C\,\mathcal{E}(\theta^{*})$.

Algorithm 2 describes SAM. The shaping advice is specified as a difference of potential functions (Line 15), and is added to the reward received from the environment. We use an advantage-based actor-critic, with the TD-error serving as an estimate of the advantage (Line 16). This estimate is used to update the actor and critic parameters of each agent (Lines 18-19).

Algorithm 2 SAM: Shaping Advice in Multi-agent RL
0:  For each agent $i$: differentiable policy $\pi_{\theta_{i}}$; differentiable value function $V_{\omega_{i}}$; shaping advice $\phi_{i}(s,a^{i},a^{-i})$. Maximum episode length $T_{max}$. Initialization: policy and value function parameters $\theta_{i},\omega_{i}$ for all agents $i$, learning rates $\alpha^{\theta},\alpha^{\omega}$.
1:  $T=0$
2:  repeat
3:     $t\leftarrow-1$; $\phi_{i}(s_{-1},a^{i}_{-1},a^{-i}_{-1})=0$ for all $i$
4:     Initialize information $s_{0}=[o_{0}^{1},\dots,o_{0}^{n}]$
5:     repeat
6:        $t\leftarrow t+1$
7:        for agent $i=1$ to $n$ do
8:           sample $a^{i}_{t}\sim\pi_{\theta_{i}}(\cdot|o^{i}_{t})$
9:        end for
10:        Take action $a_{t}=[a^{1}_{t},\dots,a^{n}_{t}]$, observe new information $s_{t+1}$, and obtain reward $r^{i}_{t}$ for each agent. Use $a_{t}$ to determine $\phi_{i}(s_{t},a_{t})$ for all agents
11:        if $s_{t+1}$ is terminal then
12:           $V_{\omega_{i}}(s_{t+1})=0$
13:        end if
14:        for agent $i=1$ to $n$ do
15:           compute $F^{i}_{t}$ based on Equations (5) and (6)
16:           TD-error: $\delta^{i}_{t}:=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}}(s_{t+1})-V_{\omega_{i}}(s_{t})$
17:           compute $\tilde{\delta}^{i}_{t}$ based on Equation (20)
18:           Update actor: $\theta_{i}\leftarrow\Gamma_{i}[\theta_{i}+\alpha_{t}^{\theta}\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})]$
19:           Update critic: $\omega_{i}\leftarrow\omega_{i}-\alpha_{t}^{\omega}\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})$
20:        end for
21:     until $s_{t+1}$ is terminal
22:     $T\leftarrow T+1$
23:  until $T>T_{max}$
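To make the per-step computation concrete, a minimal single-agent, tabular sketch of Lines 15-19 is given below; the tabular softmax actor, the toy problem sizes, taking $\Gamma_{i}$ to be the identity, and using the raw TD-error in place of $\tilde{\delta}^{i}_{t}$ from Equation (20) are illustrative assumptions (the experiments in Section VI use neural network approximators).

```python
import numpy as np

n_states, n_actions = 5, 3
gamma, alpha_theta, alpha_omega = 0.95, 0.2, 0.01

theta = np.zeros((n_states, n_actions))   # actor parameters, one per (s, a) pair
omega = np.zeros(n_states)                # tabular critic, one value per state

def policy(s):
    """Softmax policy pi_theta(.|s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sam_update(s, a, r, s_next, F, terminal):
    """One pass through Lines 15-19 for a single agent.
    F is the shaping advice, a difference of potentials (Eq. (5)/(6))."""
    v_next = 0.0 if terminal else omega[s_next]
    delta = r + F + gamma * v_next - omega[s]        # TD-error (Line 16)
    # Actor step (Line 18, with Gamma_i the identity and delta used
    # in place of delta-tilde).
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0                            # gradient of log-softmax
    theta[s] += alpha_theta * delta * grad_log_pi
    # Critic step: gradient descent on the squared TD-error.
    omega[s] += alpha_omega * delta

sam_update(s=0, a=1, r=0.0, s_next=2, F=0.3, terminal=False)
```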
Remark V.8.

We note that our objective is to maximize the rewards obtained by agents equipped with shaping advice. Our algorithms are said to converge when the values of these rewards reach a ‘steady state’. This is distinct from game-theoretic notions of equilibrium (e.g., Nash equilibrium), as studied in [53, 54]. The flavor of convergence analysis that we adopt also allows better interpretability of our experimental evaluations, reported in Section VI.

VI Experiments

Our experiments study the incorporation of shaping advice in both single-agent and multi-agent environments. The code for our experimental evaluations is available at https://github.com/baicenxiao/Shaping-Advice.

VI-A Shaping advice for single-agent RL

In the single-agent case, we compare the performance of actor-critic without shaping advice, SAS-Uniform, and SAS-NonUniform. For both SAS-Uniform and SAS-NonUniform, actor-critic [1] is adopted as the base RL algorithm. We consider two setups. The first is a puddle-jump gridworld [55], where the state and action spaces are discrete. The second is a mountain car environment with continuous state and action spaces [56]. In each experiment, we compare the rewards received by the agent under the following schemes: i) actor-critic with sparse rewards (Sparse); ii) SAS-Uniform; iii) SAS-NonUniform. We apply SAS-NonUniform in the look-back manner, since this does not require estimating future actions.

VI-A1 Puddle-Jump Gridworld

Figure 2: Schematic of the puddle-jump gridworld. The state of the agent is its position $(x,y)$. The shaded row (row $2$) represents the puddle the agent should jump over. The two blue cells denote states that are indistinguishable to the agent. The agent can choose an action from the set $\{up,down,left,right,jump\}$ at each step.

Figure 2 depicts the puddle-jump gridworld environment as a $10\times 10$ grid. The state $s=(x,y)$ denotes the position of the agent in the grid, where $x,y\in\{0,1,\dots,9\}$. The goal of the agent is to navigate from the start state $S=(0,0)$ to the goal $G=(9,9)$. At each step, the agent can choose an action from the set $A=\{up,down,left,right,jump\}$. There is a puddle along row $2$ which the agent should jump over. Further, the states $(9,8)$ and $(8,9)$ (blue squares in Figure 2) are indistinguishable to the agent. As a result, any optimal policy for the agent will be a stochastic policy.

If the $jump$ action is chosen in row $3$ or row $1$, the agent lands on the other side of the puddle with probability $p_{j}$, and remains in the same state otherwise. The $jump$ action chosen in any other row keeps the agent in its current state. Any action that would move the agent off the grid leaves its state unchanged. The agent receives a reward of $-0.05$ for each action, and $+1000$ for reaching $G$.
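For concreteness, a minimal sketch of these dynamics follows; treating the row index as the $y$-coordinate and treating the puddle row as impassable to ordinary moves are modeling assumptions consistent with the description above, not the authors' implementation.

```python
import random

GOAL = (9, 9)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(s, a, p_j=0.2):
    """One transition of the puddle-jump gridworld; returns (s_next, reward, done)."""
    x, y = s
    if a == "jump":
        # Jumping from row 1 or row 3 clears the puddle (row 2) with probability p_j;
        # jumping from any other row leaves the state unchanged.
        if y == 1 and random.random() < p_j:
            y = 3
        elif y == 3 and random.random() < p_j:
            y = 1
    else:
        dx, dy = MOVES[a]
        nx, ny = x + dx, y + dy
        # Off-grid moves keep the state unchanged; barring ordinary moves into
        # the puddle row is an assumption.
        if 0 <= nx <= 9 and 0 <= ny <= 9 and ny != 2:
            x, y = nx, ny
    done = (x, y) == GOAL
    return (x, y), -0.05 + (1000.0 if done else 0.0), done
```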

When using SAS-Uniform, we set $\phi^{U}(s):=u_{0}$ for states in rows $0$ and $1$, and $\phi^{U}(s):=u_{1}$ for all other states. We need $u_{1}>u_{0}$ to encourage the agent to jump over the puddle. Unlike SAS-Uniform, SAS-NonUniform can provide the agent with more information about the actions it can take. We set $\phi^{NU}(s,a)$ to a ‘large’ value if action $a$ at state $s$ results in the agent moving closer to the goal according to the $\ell_{1}$ norm, $(|G_{x}-x|+|G_{y}-y|)$. We additionally stipulate that $\frac{1}{|A|}\sum_{a\in A}\phi^{NU}(s,a)=\phi^{U}(s)$. That is, the state potential of SAS-NonUniform is the same as that of SAS-Uniform under a uniform distribution over actions. This ensures a fair comparison between SAS-Uniform and SAS-NonUniform.
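A minimal sketch of these two potential functions is given below; the values $u_{0},u_{1}$, the bonus scale $c$, and the 0/1 ‘closer-to-goal’ score are assumptions, chosen only so that $u_{1}>u_{0}$ and the action-average of $\phi^{NU}(s,\cdot)$ equals $\phi^{U}(s)$.

```python
import numpy as np

GOAL = (9, 9)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0),
         "right": (1, 0), "jump": (0, 2)}
u0, u1, c = 1.0, 5.0, 2.0          # assumed values with u1 > u0

def phi_U(s):
    """Uniform potential: lower value below the puddle (rows 0 and 1)."""
    return u0 if s[1] in (0, 1) else u1

def l1_to_goal(s):
    return abs(GOAL[0] - s[0]) + abs(GOAL[1] - s[1])

def phi_NU(s, a):
    """Non-uniform potential: bonus for actions that reduce the l1 distance
    to G, centered so that the action-average equals phi_U(s)."""
    def raw(action):
        dx, dy = MOVES[action]
        nxt = (min(max(s[0] + dx, 0), 9), min(max(s[1] + dy, 0), 9))
        return 1.0 if l1_to_goal(nxt) < l1_to_goal(s) else 0.0
    mean_raw = np.mean([raw(b) for b in MOVES])
    return phi_U(s) + c * (raw(a) - mean_raw)

# The stipulated constraint holds by construction:
s = (2, 5)
assert abs(np.mean([phi_NU(s, a) for a in MOVES]) - phi_U(s)) < 1e-9
```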

In our experiments, we set the discount factor $\gamma=1$. Since the dimensions of the state and action spaces are not large, we do not use a function approximator for the policy $\pi$. A parameter $\theta_{s,a}$ is associated with each state-action pair, and the policy is computed as $\pi_{\theta}(a|s)=\frac{\exp(\theta_{s,a})}{\sum_{a^{\prime}\in A}\exp(\theta_{s,a^{\prime}})}$. We fix $\alpha^{\omega}=0.001$ and $\alpha^{\theta}=0.2$ in all cases.

From Figure 3, we observe that the SAS-NonUniform scheme performs best, in that the agent converges to the goal in five times fewer episodes ($25$ vs. $125$ episodes) than A2C without advice (Sparse). When A2C is augmented with SAS-Uniform, convergence to the goal is slightly faster than without any reward shaping.

Figure 3: Average rewards in the puddle-jump gridworld when the jump success probability $p_{j}=0.2$. The baseline is advantage actor-critic (A2C) without advice.

A smaller jump success probability $p_{j}$ indicates that it is more difficult for the agent to reach the goal state $G$. Figure 4 shows that SAS-NonUniform results in the highest reward for the more difficult task (lower $p_{j}$), when compared with the other reward schemes.

Figure 4: Average reward over the first 100 episodes as a function of the jump success probability $p_{j}$.

VI-A2 Continuous Mountain Car

In the mountain car (MC) environment, an under-powered car in a valley has to drive up a steep hill to reach the goal. In order to achieve this, the car should learn to accumulate momentum. A schematic for this environment is shown in Figure 5.

Figure 5: Schematic of the mountain-car environment. The agent's state is its position $p_{t}$ (along the $x$-coordinate) and velocity $v_{t}$. The action $a_{t}$ is a force applied to the car. The goal is marked by a flag.

This environment has continuous state and action spaces. The state $s=(p,v)$ comprises the position $p\in[-1.2,0.6]$ and the velocity $v\in[-0.07,0.07]$. The action $a\in[-1,+1]$. The continuous action space makes it difficult to use classic value-based methods such as Q-learning and SARSA. The reward provided by the environment depends on the action and on whether the car reaches the goal. Specifically, the car receives $+100$ once it reaches the goal; before that, the reward at time $t$ is $-|a_{t}|^{2}$. This reward structure discourages wasting energy, but it also acts as a barrier to learning, because there is a sub-optimal solution in which the agent remains at the bottom of the valley. Moreover, the reward for reaching the goal is significantly delayed, which makes it difficult for the conventional actor-critic algorithm to learn a good policy.

One choice of potential function for SAS-Uniform in this environment is $\phi^{U}(s_{t}):=p_{t}+P$. Since the position $p_{t}$ takes values in $[-1.2,0.6]$, the offset $P$ is chosen so that the potential $\phi^{U}(s_{t})$ is always positive; we use $P=2$. An interpretation of this scheme is: ‘the state value is larger when the car is horizontally closer to the goal.’ The SAS-NonUniform scheme we use for this environment encourages the car to accumulate momentum: the direction of the action is encouraged to match the current direction of the car's velocity. At the same time, we discourage inaction, so the potential advice function takes a larger value if $a_{t}\neq 0$. We let $\phi^{NU}(s_{t},a_{t})=1$ if $a_{t}v_{t}>0$, and $\phi^{NU}(s_{t},a_{t})=0$ otherwise.
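A minimal sketch of these two potentials is given below; the function names are illustrative, and the values follow the text.

```python
P = 2.0   # offset keeping phi_U positive, since p lies in [-1.2, 0.6]

def phi_U(p, v):
    """'The state value is larger when the car is horizontally closer to the goal.'"""
    return p + P

def phi_NU(p, v, a):
    """Reward actions aligned with the current velocity; inaction gets no bonus."""
    return 1.0 if a * v > 0 else 0.0
```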

In our experiments, we set $\gamma=0.99$. To deal with the continuous state space, we use a neural network (NN) as a function approximator. The policy distribution $\pi_{\theta}(a|s)$ is approximated by a normal distribution, the mean and variance of which are the outputs of the NN. The value function is also represented by an NN. We set $\alpha^{\theta}=1\times 10^{-5}$ and $\alpha^{\omega}=5.6\times 10^{-4}$, and use Adam [57] to update the NN parameters. The results we report are averaged over 10 different environment seeds.
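A minimal sketch of such a Gaussian policy head is shown below; the layer sizes, the softplus parameterization of the spread, and clamping the sampled action to $[-1,1]$ are assumptions, since the text only states that the NN outputs the mean and variance.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.std_head = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, obs):
        h = self.body(obs)
        # Normal distribution whose mean and spread are network outputs
        return torch.distributions.Normal(self.mean_head(h), self.std_head(h) + 1e-5)

policy = GaussianPolicy()
dist = policy(torch.tensor([[0.3, 0.01]]))   # state (position, velocity)
action = dist.sample().clamp(-1.0, 1.0)      # force applied to the car
log_prob = dist.log_prob(action)             # used in the actor update
```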

Figure 6: Average rewards for the continuous mountain car problem (averaged over 10 different environment random seeds). The baseline is A2C without advice.

Our experiments indicate that the learned policy drives the agent to one of two outcomes: reaching the goal, or remaining stationary at the bottom of the valley. When learning with the base algorithm (A2C with sparse rewards), the agent reached the goal in only $10\%$ of the trials (out of 10 trials), and was stuck at the sub-optimal solution in the remaining trials. With SAS-Uniform, the agent converged correctly in only $20\%$ of the trials. This is because the agent may have to take an action that moves it away from the goal in order to accumulate momentum, and the potential function $\phi^{U}(\cdot)$ discourages such actions. SAS-NonUniform performs best: the agent reached the goal in $100\%$ of the trials. Figure 6 reflects these observations; SAS-NonUniform results in the agent obtaining the highest rewards.

Task | $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$: SAM-Uniform | $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$: SAM-NonUniform
CN | $\alpha_{1}\exp(-\beta_{1}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$ | $-M_{1}\theta_{a^{i}_{t}L_{i}}+\alpha_{2}\exp(-\beta_{2}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$
PD | $\alpha_{3}\exp(-\beta_{3}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$ | $-M_{2}\theta_{a^{i}_{t}L_{i}}+\alpha_{4}\exp(-\beta_{4}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$
PP | $\alpha_{5}\exp(-\beta_{5}\sum_{j=1}^{N}dist(s_{t}^{pred_{j}},s_{t}^{prey}))$ | $-M_{3}\sum_{j=1}^{N}\theta_{a^{pred_{j}}_{t}s_{t}^{prey}}+\alpha_{6}\exp(-\beta_{6}\sum_{j=1}^{N}dist(s_{t}^{pred_{j}},s_{t}^{prey}))$
TABLE I: Shaping advice $F^{i}_{t}$ provided by SAM is given by Equation (5) or (6). The table lists the potential functions used in the Cooperative Navigation (CN), Physical Deception (PD), and Predator-Prey (PP) tasks. $L_{j}$ is the landmark to which agent $j$ is anchored. $dist(\cdot,\cdot)$ denotes the Euclidean distance. $\theta_{a^{j}_{t}L_{j}}\in[0,\pi]$ is the angle between the direction of the action taken by agent $j$ and the vector from its current position to $L_{j}$. In SAM-Uniform, the advice is the same for every action at a given $s_{t}$. In SAM-NonUniform, agents are additionally penalized if their actions are not in the direction of their target. In each case, $F^{i}_{t}$ is positive when an agent takes actions that move it towards its target.

VI-B Shaping Advice for Multi-agent RL

This section describes the multi-agent tasks that we evaluate SAM on, and these include tasks with cooperative and competitive objectives. In each case, the rewards provided to the agents are sparse, which affects the agents’ ability to obtain immediate feedback on the quality of their actions at each time-step. Shaping advice provided by SAM is used to guide the agents to obtain higher rewards than in the case without advice. We conclude the section by presenting the results of our experiments evaluating SAM on these tasks.

VI-B1 Task Descriptions and Shaping Advice

Figure 7: Representations of tasks from the Particle World environment [17] that we study. (Left to right) Predator-Prey (PP), Cooperative Navigation (CN), and Physical Deception (PD). In PP, predators (red) seek to catch the prey (green) while avoiding obstacles (grey). In CN, agents (green) each seek to navigate to a different landmark ($\times$) and are penalized for collisions with each other. In PD, one of the agents (green) must reach the true landmark (red $\times$), while preventing the adversary from reaching it. In all tasks, rewards are sparse: agents receive a reward or penalty only when a corresponding reachability or collision criterion is satisfied.

We examine three tasks from the Particle World environment [17] where multiple agents share a two-dimensional space with continuous states and actions. An illustration of the tasks is shown in Figure 7, and we describe them below.

Predator-Prey: This task has $N$ predator agents who cooperate to capture $1$ faster-moving prey. Predators are rewarded when one of them collides with the prey, while the prey is penalized for the collision. The reward at all other times is zero. Two landmarks impede the movement of the agents.

Cooperative Navigation: This task has $N$ agents and $N$ landmarks. Each agent is rewarded $r$ when an agent reaches a landmark, and penalized for collisions with other agents. The reward at all other times is zero. Therefore, the maximum reward the agents can obtain is $rN$. Thus, agents must learn to cover the landmarks while not colliding with each other.

Physical Deception: This task has $1$ adversary, $N$ agents, and $N$ landmarks. Only one landmark is the true target. Agents are rewarded when any one of them reaches the target, and penalized if the adversary reaches the target. At all other times, the agents get a reward of zero. The adversary also wants to reach the target, but it does not know which landmark is the true target. Thus, agents have to learn to split up and cover the landmarks to deceive the adversary.

In each environment, SAM provides shaping advice to guide agents to obtain a higher positive reward. This advice is added to the reward received from the environment. The advice is a heuristic given by a difference of potential functions (Equation (5) or (6)), and only needs to be specified once at the start of the training process.

In the Cooperative Navigation and Physical Deception tasks, we anchor each agent to a (distinct) landmark. The shaping advice then depends on the distance of an agent to the landmark it is anchored to. Although distances computed in this manner depend on the order in which the agents and landmarks are chosen, we observe that this scheme works well empirically across multiple training episodes in which the positions of landmarks and the initial positions of agents are generated randomly. The advice provided by SAM is positive when agents move closer to the landmarks they are anchored to. In the absence of anchoring, agents may get distracted and move towards different landmarks at different time-steps. Anchoring results in agents learning to cover the landmarks faster.

We consider two variants of advice for each task. In SAM-Uniform, the advice for every action taken is the same. In SAM-NonUniform, a higher weight is given to some ‘good’ actions over others for each $s_{t}$. We enumerate the advice for each task in Table I. We use MADDPG as the base RL algorithm [17]. We compare the performance of agents trained with SAM (SAM-Uniform or SAM-NonUniform) to the performance of agents trained using the sparse reward from the environment. We also compare SAM with a state-of-the-art reward redistribution technique introduced in [19] called Iterative Relative Credit Assignment (IRCR).
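As an illustration of the Table I potentials, a minimal sketch for the Cooperative Navigation task is given below; passing positions explicitly, anchoring agent $j$ to landmark $L_{j}$ by index, and computing the angle via an arccosine of the normalized inner product are implementation assumptions, not the authors' code.

```python
import numpy as np

def phi_uniform(agent_pos, landmarks, alpha1=100.0, beta1=1.0):
    """SAM-Uniform potential for CN: alpha1 * exp(-beta1 * sum_j dist(s^j, L_j))."""
    dists = [np.linalg.norm(p - L) for p, L in zip(agent_pos, landmarks)]
    return alpha1 * np.exp(-beta1 * sum(dists))

def phi_nonuniform(i, action_i, agent_pos, landmarks,
                   alpha2=100.0, beta2=1.0, M1=1.0, eps=1e-8):
    """SAM-NonUniform potential for agent i: -M1 * theta plus the uniform term."""
    to_landmark = landmarks[i] - agent_pos[i]
    cos = np.dot(action_i, to_landmark) / (
        np.linalg.norm(action_i) * np.linalg.norm(to_landmark) + eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))   # angle in [0, pi]
    return -M1 * theta + phi_uniform(agent_pos, landmarks, alpha2, beta2)

# Example with N = 3 agents, each anchored to the landmark with the same index.
agents = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
marks = [np.array([0.5, 0.0]), np.array([1.0, 2.0]), np.array([-1.0, 1.5])]
print(phi_nonuniform(0, action_i=np.array([1.0, 0.0]), agent_pos=agents, landmarks=marks))
```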

VI-B2 Implementation details

When we tested SAM-NonUniform, we used look-back advice following Equation (6). This was done to avoid estimating a ‘future’ action, i.e., $a^{i}_{t+1}$, at each time-step (since the replay buffer contains tuples of the form $(s_{t},a^{1}_{t},\dots,a^{n}_{t},r^{1}_{t},\dots,r^{n}_{t},s_{t+1})$). Noisy estimates of $a^{i}_{t+1}$ can cause oscillations in the rewards obtained by the agent.

We adopt the same hyperparameter values and network architectures as used in [17]. We let $\Gamma_{i}$ in Line 18 of Algorithm 2 be the identity matrix. We list the values of $\alpha_{\cdot},\beta_{\cdot},M_{\cdot}$ from Table I that were used in our experiments. Sparse reward setting: for the results reported in Figures 8 and 9, we use the following parameters for the shaping advice:

CN ($N=3$) | $\alpha_{1}=\alpha_{2}=100$ | $\beta_{1}=\beta_{2}=1$ | $M_{1}=1$
CN ($N=6$) | $\alpha_{1}=\alpha_{2}=1000$ | $\beta_{1}=\beta_{2}=1$ | $M_{1}=10$
PD ($N=2$) | $\alpha_{3}=\alpha_{4}=500$ | $\beta_{3}=\beta_{4}=1$ | $M_{2}=1$
PD ($N=4$) | $\alpha_{3}=\alpha_{4}=500$ | $\beta_{3}=\beta_{4}=1$ | $M_{2}=10$
PP ($N=3$) | $\alpha_{5}=\alpha_{6}=100$ | $\beta_{5}=\beta_{6}=1$ | $M_{3}=1$

Other forms of shaping rewards: In the three tasks that we evaluated, we observed that ‘linear’ distance-based advice (i.e., advice of the form $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}):=\sum_{j}dist(s_{t}^{j},L_{j})$) did not work. From Equations (5) and (6), with this form of advice an agent receives the same additional reward for a step towards the target regardless of its distance to the target. For example, an agent $100$ steps from the target would receive the same shaping advice for one step towards the target as an agent that takes such a step from a state $50$ steps from the target.
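To make this concrete, the following short calculation compares the two forms of potential (a sketch taking $\gamma\approx 1$ so that the advice reduces to $\phi(s_{t})-\phi(s_{t-1})$, suppressing action arguments, writing $d_{t}$ for an agent's distance to its target, $\Delta:=d_{t-1}-d_{t}>0$ for the step size, and choosing signs so that approaching the target is rewarded):

```latex
% Linear potential: the advice depends only on the step size \Delta.
\phi_{\mathrm{lin}}(s_t) := -d_t
\quad\Rightarrow\quad
F_t \approx d_{t-1} - d_t = \Delta .

% Exponential potential (Table I): the same step is worth more near the target.
\phi_{\exp}(s_t) := \alpha e^{-\beta d_t}
\quad\Rightarrow\quad
F_t \approx \alpha\bigl(e^{-\beta d_t} - e^{-\beta d_{t-1}}\bigr)
          = \alpha e^{-\beta d_{t-1}}\bigl(e^{\beta\Delta} - 1\bigr).
```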

VI-B3 Results

Figure 8: Comparison between SAM (SAM-NonUniform (blue) or SAM-Uniform (purple)) augmented to MADDPG and classical MADDPG policies (orange) on cooperative and competitive tasks with sparse rewards. The score for a task is the average agent reward in cooperative tasks, and the average agent advantage ($=$ agent reward $-$ adversary reward) in competitive tasks. Each bar cluster shows normalized $0$-$1$ scores, averaged over the last $1000$ training episodes. A higher score is better. SAM-NonUniform outperforms SAM-Uniform and the classical MADDPG baseline by a larger margin when there are more agents in the cooperative navigation and physical deception tasks. The scores of IRCR are not shown here, since it performs consistently worse than the other approaches in these three tasks.
(a) Cooperative Navigation ($N=6$)
(b) Physical Deception ($N=4$)
(c) Predator-Prey ($3$ pred., $1$ prey)
Figure 9: Average and variance of scores when agents use SAM-NonUniform (blue), SAM-Uniform (purple), IRCR (green), and sparse rewards (orange). SAM-NonUniform results in the highest average scores. SAM-Uniform compares favorably, and both significantly outperform agents trained using only sparse rewards. IRCR is not able to guide agents to obtain higher rewards in all three tasks.

Figure 8 shows normalized $0$-$1$ scores, averaged over the last 1000 training episodes, comparing MADDPG augmented with SAM against classical MADDPG policies. The score for a task is the average agent reward in cooperative tasks, and the average agent advantage ($=$ agent reward $-$ adversary reward) in competitive tasks [58]. Agents equipped with SAM-NonUniform have the best performance. This is because SAM-NonUniform provides specific feedback on the quality of agents' actions. SAM-Uniform also performs well in these tasks. SAM-NonUniform outperforms SAM-Uniform and the classical MADDPG baseline by a significant margin when there are more agents.

In cooperative navigation with sparse rewards, agents using the MADDPG baseline are not able to learn policies that even partially cover the landmarks. In comparison, SAM guides agents to learn to adapt to each other's policies and cover all the landmarks. SAM-NonUniform results in much higher rewards than the other methods in the more complex task with $N=6$ agents.

We observe a similar phenomenon in physical deception, where SAM guides agents to learn policies that cover the landmarks. This behavior helps deceive the adversary and keep it away from the true landmark, thereby resulting in lower rewards for the adversary. Therefore, the agent advantage is higher with SAM.

In the predator-prey task, we see that the performance of SAM is comparable to that of MADDPG. We believe this is because the task might have a well-defined and unique equilibrium to which agent policies eventually converge.

Figure 9 shows the average and variance of the score during different stages of the training process. The score for a task is the average agent reward in cooperative tasks, and the average agent advantage ($=$ agent reward $-$ adversary reward) in competitive tasks [58]. In terms of agent scores averaged over the last $1000$ training episodes, agents equipped with SAM-NonUniform have the best performance. This is because SAM-NonUniform provides specific feedback on the quality of agents' actions. SAM-Uniform also performs well in these tasks.

In cooperative navigation, when agents use only the sparse rewards from the environment, they are not able to learn policies that even partially cover the landmarks. In comparison, SAM guides agents to learn to adapt to each other's policies and cover all the landmarks. A similar phenomenon is observed in physical deception, where SAM guides agents to learn policies that cover the landmarks. This behavior helps deceive the adversary and keep it away from the true landmark, thereby resulting in lower final rewards for the adversary.

We additionally compare the performance of SAM with a technique called IRCR that was introduced in [19]. We observe that agents using IRCR receive the lowest scores in all three tasks. We believe that a possible reason for this is that in each training episode, IRCR accumulates rewards till the end of the episode, and then uniformly redistributes the accumulated reward along the length of the episode. A consequence of this is that an agent may find it more difficult to identify the time-step at which it reaches a landmark or a collision occurs. For example, in the Predator-Prey task, suppose that the length of an episode is $T_{ep}$. Consider a scenario where one of the predators collides with the prey at a time $T<T_{ep}$, and subsequently moves away from the prey. When IRCR is applied to this scenario, the redistributed reward at time $T$ will be the same as that at other time-steps before $T_{ep}$. This property makes it difficult to identify the critical time-steps at which collisions between agents happen.
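As a sketch of the uniform redistribution described above (a simplified reading for illustration, not the authors' implementation of IRCR):

```python
def redistribute(episode_rewards):
    """Spread the episode return evenly over all time-steps."""
    total = sum(episode_rewards)
    return [total / len(episode_rewards)] * len(episode_rewards)

# A single +10 collision reward at t = 3 in a 5-step episode becomes
# indistinguishable from the other steps:
print(redistribute([0, 0, 0, 10, 0]))   # [2.0, 2.0, 2.0, 2.0, 2.0]
```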

The authors of [17] observed that, in competitive environments, the inability of agent policies to adapt to each other resulted in oscillations in rewards. Figure 9(b) indicates that SAM is able to alleviate this problem: policies learned by agents using SAM in the physical deception task exhibit much smaller oscillations in reward than those learned using sparse rewards alone.

VII Conclusion

This paper presented a comprehensive framework to incorporate domain knowledge through shaping advice in single- and multi-agent reinforcement learning environments with sparse rewards. The shaping advice for each agent was a heuristic specified as a difference of potential functions, and was added to the reward provided by the environment. The modified reward signal provided agents with immediate feedback on the quality of the actions taken at each time-step. In the single-agent case, our algorithm SAS enabled the agent to obtain higher rewards and reach a target state more reliably than without a shaping reward. In the multi-agent setting, SAM used the centralized training with decentralized execution paradigm to efficiently learn decentralized policies that use only each agent's local observations. We showed through theoretical analyses and experimental validation that the shaping advice provided by SAS and SAM did not distract agents from accomplishing task objectives specified by the environment reward. Using SAS or SAM allowed the agents to obtain a higher average reward in fewer training episodes.

Future research will aim to extend the scope of shaping-advice-based techniques. Two research questions are: i) can we design principled methods based on SAS and SAM to adaptively learn the shaping advice, rather than fixing it for the duration of training? and ii) how might such an adaptive procedure affect the sample efficiency and the number of training episodes required? We believe that answers to these questions will be a step towards broadening the application of shaping advice to more challenging real-world RL environments.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   MIT press, 2018.
  • [2] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, 2015.
  • [3] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, 2016.
  • [4] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS One, vol. 12, no. 4, 2017.
  • [5] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in International Conference on Learning and Representations, 2016.
  • [6] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electronic Imaging, vol. 2017, no. 19, pp. 70–76, 2017.
  • [7] J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel, “Multi-agent reinforcement learning in sequential social dilemmas,” in Conference on Autonomous Agents and Multi-Agent Systems, 2017, pp. 464–473.
  • [8] S. Yin and F. R. Yu, “Resource allocation and trajectory design in UAV-aided cellular networks based on multi-agent reinforcement learning,” IEEE Internet of Things Journal, 2021.
  • [9] A. K. Agogino and K. Tumer, “Analyzing and visualizing multiagent rewards in dynamic and stochastic domains,” Autonomous Agents and Multi-Agent Systems, vol. 17, no. 2, pp. 320–338, 2008.
  • [10] S. Devlin, D. Kudenko, and M. Grześ, “An empirical study of potential-based reward shaping and advice in complex, multi-agent systems,” Advances in Complex Systems, vol. 14, pp. 251–278, 2011.
  • [11] S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer, “Potential-based difference rewards for multiagent reinforcement learning,” in Conference on Autonomous Agents and Multi-Agent Systems, 2014.
  • [12] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping.” in International Conference on Machine Learning, 1998.
  • [13] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning, 1999, pp. 278–287.
  • [14] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multi-agent reinforcement learning,” in International Conference on Machine Learning, 2017, pp. 1146–1155.
  • [15] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems.” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
  • [16] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in International Conference on Machine Learning, 1993.
  • [17] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
  • [18] P. Mannion, S. Devlin, J. Duggan, and E. Howley, “Reward shaping for knowledge-based multi-objective multi-agent reinforcement learning,” The Knowledge Engineering Review, vol. 33, 2018.
  • [19] T. Gangwani, Y. Zhou, and J. Peng, “Learning guidance rewards with trajectory-space smoothing,” in Neural Information Processing Systems, 2020.
  • [20] B. Xiao, B. Ramasubramanian, A. Clark, H. Hajishirzi, L. Bushnell, and R. Poovendran, “Potential-based advice for stochastic policy learning,” in IEEE Conference on Decision and Control, 2019, pp. 1842–1849.
  • [21] M. E. Taylor, H. B. Suay, and S. Chernova, “Integrating reinforcement learning with human demonstrations of varying ability,” in Conference on Autonomous Agents and Multiagent Systems, 2011, pp. 617–624.
  • [22] Z. Wang and M. E. Taylor, “Improving reinforcement learning with confidence-based demonstrations.” in International Joint Conference on Artificial Intelligence, 2017, pp. 3027–3033.
  • [23] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer, “HG-DAgger: Interactive imitation learning with human experts,” in International Conference on Robotics and Automation.   IEEE, 2019, pp. 8077–8083.
  • [24] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
  • [25] B. Xiao, Q. Lu, B. Ramasubramanian, A. Clark, L. Bushnell, and R. Poovendran, “FRESH: Interactive reward shaping in high-dimensional state spaces using human feedback,” in Conference on Autonomous Agents and Multi-Agent Systems, 2020, pp. 1512–1520.
  • [26] Z. Zuo, “Nonsingular fixed-time consensus tracking for second-order multi-agent networks,” Automatica, vol. 54, pp. 305–309, 2015.
  • [27] B. Tian, H. Lu, Z. Zuo, and W. Yang, “Fixed-time leader–follower output feedback consensus for second-order multiagent systems,” IEEE Transactions on Cybernetics, vol. 49, pp. 1545–1550, 2018.
  • [28] X. Li, Y. Tang, and H. R. Karimi, “Consensus of multi-agent systems via fully distributed event-triggered control,” Automatica, vol. 116, p. 108898, 2020.
  • [29] J. Qin, Q. Ma, Y. Shi, and L. Wang, “Recent advances in consensus of multi-agent systems: A brief survey,” IEEE Transactions on Industrial Electronics, vol. 64, no. 6, pp. 4972–4983, 2016.
  • [30] W. Gao, M. Mynuddin, D. C. Wunsch, and Z.-P. Jiang, “Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [31] K. G. Vamvoudakis, H. Modares, B. Kiumarsi, and F. L. Lewis, “Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,” IEEE Control Systems Magazine, vol. 37, no. 1, pp. 33–52, 2017.
  • [32] G. Qu, Y. Lin, A. Wierman, and N. Li, “Scalable multi-agent reinforcement learning for networked systems with average reward,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [33] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
  • [34] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., “Value-decomposition networks for cooperative multi-agent learning based on team reward.” in Conference on Autonomous Agents and Multiagent Systems, 2018, pp. 2085–2087.
  • [35] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in International Conference on Machine Learning, 2018, pp. 4295–4304.
  • [36] S. Devlin and D. Kudenko, “Theoretical considerations of potential-based reward shaping for multi-agent systems,” in Conference on Autonomous Agents and Multi-Agent Systems, 2011, pp. 225–232.
  • [37] X. Lu, H. M. Schwartz, and S. N. Givigi, “Policy invariance under reward transformations for general-sum stochastic games,” Journal of Artificial Intelligence Research, vol. 41, pp. 397–406, 2011.
  • [38] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowe, “Expressing arbitrary reward functions as potential-based advice,” in AAAI Conference on Artificial Intelligence, 2015, pp. 2652–2658.
  • [39] M. Klissarov and D. Precup, “Reward propagation using graph convolutional networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [40] T. Mesnard, T. Weber, F. Viola, S. Thakoor, A. Saade, A. Harutyunyan, W. Dabney, T. Stepleton, N. Heess, A. Guez et al., “Counterfactual credit assignment in model-free reinforcement learning,” International Conference on Machine Learning, 2021.
  • [41] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming.   John Wiley & Sons, 2014.
  • [42] M. Grześ, “Reward shaping in episodic reinforcement learning,” in Autonomous Agents and MultiAgent Systems, 2017, pp. 565–573.
  • [43] A. Eck, L.-K. Soh, S. Devlin, and D. Kudenko, “Potential-based reward shaping for finite horizon online POMDP planning,” Autonomous Agents and Multi-Agent Systems, vol. 30, no. 3, 2016.
  • [44] S. M. Devlin and D. Kudenko, “Dynamic potential-based reward shaping,” in Conference on Autonomous Agents and Multiagent Systems, 2012, pp. 433–440.
  • [45] E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled methods for advising reinforcement learning agents,” in International Conference on Machine Learning, 2003, pp. 792–799.
  • [46] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in Conference on Autonomous Agents and Multiagent Systems, 2017, pp. 66–83.
  • [47] V. S. Borkar, Stochastic approximation: A dynamical systems viewpoint.   Springer, 2009, vol. 48.
  • [48] H. J. Kushner and D. S. Clark, Stochastic approximation methods for constrained and unconstrained systems.   Springer Science & Business Media, 2012, vol. 26.
  • [49] D. Williams, Probability with Martingales.   Cambridge University Press, 1991.
  • [50] H. Prasad, L. Prashanth, and S. Bhatnagar, “Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games,” in Conference on Autonomous Agents and Multiagent Systems, 2015, pp. 1371–1379.
  • [51] W. Rudin, Principles of Mathematical Analysis.   McGraw-Hill, New York, 1964.
  • [52] Z. Yang, K. Zhang, M. Hong, and T. Başar, “A finite sample analysis of the actor-critic algorithm,” in IEEE Conference on Decision and Control (CDC), 2018, pp. 2759–2764.
  • [53] Y. Jiang and F. Tan, “An enhanced model-free reinforcement learning algorithm to solve Nash equilibrium for multi-agent cooperative game systems,” IEEE Access, vol. 8, pp. 223 743–223 755, 2020.
  • [54] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multi-agent reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5571–5580.
  • [55] O. Marom and B. Rosman, “Belief reward shaping in reinforcement learning,” in AAAI, 2018, pp. 3762–3769.
  • [56] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv:1606.01540, 2016.
  • [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
  • [58] Y. Wen, Y. Yang, R. Luo, J. Wang, and W. Pan, “Probabilistic recursive reasoning for multi-agent reinforcement learning,” in International Conference on Learning Representations, 2019.