
Shaping Advice in Deep Reinforcement Learning

Baicen Xiao (1), Member, IEEE, Bhaskar Ramasubramanian (1,2), Member, IEEE, Radha Poovendran (1), Fellow, IEEE

(1) Network Security Lab, Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA. {bcxiao, rp3}@uw.edu

(2) Electrical and Computer Engineering, Department of Engineering and Design, Western Washington University, Bellingham, WA 98225, USA. {ramasub}@wwu.edu

This work was supported by the Office of Naval Research via Grant N00014-17-S-B001.
Abstract

Reinforcement learning involves agents interacting with an environment to complete tasks. When rewards provided by the environment are sparse, agents may not receive immediate feedback on the quality of actions that they take, thereby affecting learning of policies. In this paper, we propose methods to augment the reward signal from the environment with an additional reward termed shaping advice in both single- and multi-agent reinforcement learning. The shaping advice is specified as a difference of potential functions at consecutive time-steps. Each potential function is a function of observations and actions of the agents. The use of potential functions is underpinned by an insight that the total potential when starting from any state and returning to the same state is always equal to zero. We show through theoretical analyses and experimental validation that the shaping advice does not distract agents from completing tasks specified by the environment reward. Theoretically, we prove that the convergence of policy gradients and value functions when using shaping advice implies the convergence of these quantities in the absence of shaping advice. We design two algorithms: Shaping Advice in Single-agent reinforcement learning (SAS) and Shaping Advice in Multi-agent reinforcement learning (SAM). Shaping advice in SAS and SAM needs to be specified only once at the start of training, and can easily be provided by non-experts. Experimentally, we evaluate SAS and SAM on two tasks in single-agent environments and three tasks in multi-agent environments that have sparse rewards. We observe that using shaping advice results in agents learning policies to complete tasks faster and obtaining higher rewards than algorithms that do not use shaping advice. Code for our experiments is available at https://github.com/baicenxiao/Shaping-Advice.

Index Terms:
shaping advice, potential-based, reinforcement learning, centralized training with decentralized execution

I Introduction

Reinforcement learning (RL) is a framework that allows agents to complete tasks in an environment, even when a model of the environment is not known [1]. An RL agent ‘learns’ to complete tasks by maximizing an expected long-term reward, where the reward signal is provided by the environment. RL algorithms have been successfully implemented in many fields, including games [2, 3, 4], robotics [5], autonomous vehicle coordination [6], analysis of social dilemmas [7], and resource allocation in cellular networks [8].

When the environment is not known, availability of immediate feedback on quality of actions taken at each time-step is critical to the learning of behaviors to successfully complete a task. This is termed credit assignment [1]. When reward signals provided by the environment are sparse, it becomes difficult to perform effective credit assignment at intermediate steps of the learning process. One approach that has been shown to improve learning when rewards are sparse is reward shaping [9, 10, 11]. Reward shaping techniques augment the reward provided by the environment with an additional shaping reward. The shaping reward can be designed to be dense (i.e., not sparse), and agents learn policies (which action to take in a particular state) using the augmented reward.

Any additional reward can distract an agent from completing a task specified by the reward provided by the environment, and therefore will need to be provided in a systematic manner [12]. In this paper, we term the additional reward given to agents at each time-step as shaping advice. The shaping advice is specified by a difference of potential functions at consecutive time-steps, where each potential function depends on observations and actions of the agents. Potential functions satisfy a critical property that the total potential when starting from a state and returning to the same state is zero [13]. This ensures that an agent will not be distracted from completing a task specified by the reward provided by the environment.

There are additional challenges when there are multiple RL agents. In such a setting, each agent will have to interact not only with its environment, but also with other agents. As behaviors of agents evolve in the environment, the environment will become non-stationary from the perspective of any single agent. Thus, agents that independently learn behaviors by treating other agents as part of the environment can give rise to unstable learning regimes [14, 15, 16].

When multiple trained agents are deployed independently, or when communication among agents is costly, the agents will need to learn decentralized policies. Decentralized policies can be efficiently learned by adopting the centralized training with decentralized execution (CTDE) paradigm, first introduced in [17]. An agent using CTDE can make use of information about other agents’ observations and actions to aid its own learning during training, but will have to take decisions independently at test-time. However, the ability of an agent to learn decentralized policies can be affected if reward signals from the environment are sparse.

Reward shaping techniques that use potential functions satisfy an additional property: the optimal policies with and without the shaping reward are identical [13]. The state of the art in potential-based reward shaping for RL [9, 10, 11] focuses on environments with finite and discrete action spaces. To the best of our knowledge, adapting the solution techniques proposed in the above papers to more general settings with continuous action spaces has not been explored. Moreover, in the multi-agent case, these works emphasize learning joint policies for agents. In comparison, we propose to learn decentralized policies for each agent, which will enable application of our method to environments with large numbers of agents.

In this paper, we develop a comprehensive framework to enable effective credit assignment in RL environments with sparse rewards. These methods incorporate information about the task and environment to define shaping advice. We term our algorithms Shaping Advice in Single-agent reinforcement learning (SAS) when there is only a single agent, and Shaping Advice in Multi-agent reinforcement learning (SAM), when there are multiple agents. The shaping advice in SAS and SAM can be interpreted as domain knowledge that aids credit assignment [18]. This advice needs to be specified only once at the start of the training process. SAS and SAM can be applied in environments with continuous or discrete state and action spaces. We demonstrate that both algorithms do not distract agents from completing tasks specified by rewards from the environment. Specifically, our contributions are:

  • We introduce SAS and SAM to incorporate potential-based shaping advice in single- and multi-agent deep RL environments with continuous action spaces.

  • We demonstrate that shaping advice provided by SAS and SAM does not distract agents from completing tasks specified by the environment reward. We accomplish this by theoretically establishing that convergence of policy gradients and values when using the shaping advice implies convergence of these quantities in the absence of the shaping advice.

  • We verify our theoretical results through extensive experimental evaluations. Specifically,

    • we evaluate SAS on two environments with sparse rewards: the puddle-jump gridworld and continuous mountain car;

    • we evaluate SAM on three tasks in the multi-agent Particle World environment [17]. All these tasks have sparse rewards. We show that using shaping advice allows agents to learn policies to complete the tasks faster, and obtain higher rewards than: i) using sparse rewards alone, and ii) a state-of-the-art reward redistribution technique from [19].

Compared to a preliminary version that appeared in [20], in this paper we develop a comprehensive framework for providing shaping advice in both single- and multi-agent RL. We provide detailed theoretical analyses and experimental evaluations for each setting.

The remainder of this paper is organized as follows: Section II presents related work and Section III provides an introduction to single- and multi-agent RL and potential-based shaping advice. Sections IV - VI present the main contributions of this paper. We provide details on SAS and SAM and present theoretical analysis of their convergence in Sections IV and V. Experiments validating the use of SAS and SAM are reported in Section VI, and Section VII concludes the paper.

II Related Work

Techniques to improve credit assignment using feedback signals provided by a human operator have been studied in single-agent RL. Demonstrations provided by a human operator were used to synthesize a ‘baseline policy’ that was used to guide learning in [21, 22]. When expert demonstrations were available, imitation learning was used to guide exploration of the RL agent in [23, 24]. Feedback provided by a human operator was converted to a shaping reward to aid training a deep RL agent in environments with delayed rewards in [25]. These techniques, however, presume the availability of a human operator, which might limit their applicability.

In the multi-agent setting, decentralized and distributed control techniques are a popular area of research. A widely studied problem in such systems is the development of algorithms that specify how information can be exchanged among agents so that they can jointly complete tasks. A technique to ensure fixed-time consensus for multi-agent systems whose interactions were specified by a directed graph was studied in [26, 27]. The authors of [28] proposed an adaptive distributed event-triggering protocol to guarantee consensus for multi-agent systems with linear dynamics and interactions specified by an undirected graph. We direct the reader to [29] for a survey of recent developments in consensus of multi-agent systems. These works, however, assumed the availability of models of individual agents’ dynamics and of the interactions between agents.

Data-driven techniques are being increasingly adopted to solve problems in multi-agent systems. These methods do not require complete knowledge of the system model; rather, they use information about the state, input, and output of each agent to establish feasibility of solutions and guarantees on convergence of reinforcement learning-based algorithms. Input and state data were used in an online manner to design a distributed control algorithm that solves a cooperative optimal output regulation problem in leader-follower systems in [30]. Information obtained from trajectories of each player was used in [31] to develop real-time solutions to multi-player games through the design of an actor-critic-based adaptive learning algorithm. The authors of [32] identified a class of networked MARL problems where enforcing specific local interactions among players permitted exploiting an exponential decay property, which enabled the development of scalable, distributed algorithms for optimization and control. A comprehensive overview of algorithms in cooperative and competitive MARL, with a focus on theoretical analyses of their convergence and complexity, was presented in [33].

Cooperative MARL tasks are one instance of the setup described above, wherein all agents share the same global reward. The authors of [34] introduced value decomposition networks that decomposed a centralized value into a sum of individual agent values to assess contributions of individual agents to a shared global reward. An additional assumption on monotonicity of the centralized value function was imposed in QMIX [35] to assign credit to an agent by enumerating a value for each valid action at a state. The action spaces of agents in the above-mentioned works were discrete and finite, and these techniques cannot be easily adapted to settings with continuous action spaces. In comparison, we study reward shaping in MARL tasks in environments with continuous action spaces.

An alternative approach to improve credit assignment is potential-based reward shaping. Although this requires prior knowledge of the problem domain, potential-based techniques have been shown to offer guarantees on optimality and convergence of policies in both single [13] and multi-agent [10, 36, 37] cases. The aforementioned works had focused on the use of potential-based methods in environments with discrete action spaces. A preliminary version of this paper [20] introduced potential-based techniques to learn stochastic policies in single-agent RL with continuous states and actions. In this paper, we adapt and extend methods from [20] to develop a comprehensive framework for providing potential-based shaping advice in both single- and multi-agent RL.

The authors of [19] developed a method called iterative relative credit refinement (IRCR), which used a ‘surrogate objective’ to uniformly redistribute a sparse reward along the length of a trajectory in single and multi-agent RL. We empirically compare SAM with IRCR, and explain why SAM is able to guide agents to learn policies that result in higher average rewards than IRCR.

Methods to learn a potential-based reward shaping function have been investigated in recent research. When domain knowledge was available, the authors of [38] proposed a temporal-difference learning method to transform a given reward function to potential-based advice. In the absence of domain knowledge, graph convolutional networks were used in [39] to generate potential functions ‘automatically’ by performing message passing between states at which the agent received a reward. Different from the above works, an approach to perform credit assignment by conditioning the value function on future events from a trajectory was examined by the authors of [40]. We note that these works have focused on the single-agent case. An analogous approach for the multi-agent case, while interesting, is beyond the scope of the present paper, and remains a promising direction for future research.

III Background

This section presents some preliminary material on single- and multi-agent RL.

III-A Single-agent Reinforcement Learning

An MDP [41] is a tuple $(S,A,\mathbb{T},\rho_{0},R)$. $S$ is the set of states, $A$ the set of actions, and $\mathbb{T}:S\times A\times S\rightarrow[0,1]$ encodes $\mathbb{P}(s_{t+1}|s_{t},a_{t})$, the probability of transitioning to $s_{t+1}$ given current state $s_{t}$ and action $a_{t}$. $\rho_{0}$ is a probability distribution over the initial states. $R:S\times A\rightarrow\mathbb{R}$ denotes the reward that the agent receives when transitioning from $s_{t}$ while taking action $a_{t}$. In this paper, $R<\infty$.

The goal for an RL agent [1] is to learn a policy $\pi$ that maximizes $J:=\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})]$. Here, $\gamma$ is a discounting factor, and the expectation is taken over the trajectory $\tau=(s_{0},a_{0},r_{0},s_{1},\dots)$ induced by policy $\pi$. If $\pi:S\rightarrow A$, the policy is deterministic. On the other hand, a randomized policy returns a probability distribution over the set of actions, and is denoted $\pi:S\times A\rightarrow[0,1]$.

The value of a state-action pair $(s,a)$ under policy $\pi$ is represented by the Q-function, written $Q^{\pi}(s,a)=\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})|s_{0}=s,a_{0}=a]$. The Q-function allows us to calculate the state value $V^{\pi}(s)=\mathbb{E}_{a\sim\pi}[Q^{\pi}(s,a)]$. The advantage of a particular action $a$ over other actions at a state $s$ is defined by $A^{\pi}(s,a):=Q^{\pi}(s,a)-V^{\pi}(s)$.

III-B Stochastic Games for Multi-agent Reinforcement Learning

A stochastic game with $n$ players is a tuple $\mathcal{G}=(S,A^{1},\dots,A^{n},\mathbb{T},R^{1},\dots,R^{n},O^{1},\dots,O^{n},\rho_{0},\gamma)$. $S$ is the set of states, $A^{i}$ is the action set of player $i$, and $\mathbb{T}:S\times A^{1}\times\dots\times A^{n}\times S\rightarrow[0,1]$ encodes $\mathbb{P}(s_{t+1}|s_{t},a^{1}_{t},\dots,a^{n}_{t})$, the probability of transitioning to state $s_{t+1}$ from $s_{t}$ given the respective player actions. $R^{i}:S\times A^{1}\times\dots\times A^{n}\rightarrow\mathbb{R}$ is the reward obtained by agent $i$ when transitioning from $s_{t}$ while each player takes action $a^{i}_{t}$. $O^{i}$ is the set of observations for agent $i$. At every state, each agent receives an observation correlated with the state: $o^{i}:S\rightarrow O^{i}$. $\rho_{0}$ is a distribution over the initial states, and $\gamma\in[0,1]$ is a discounting factor.

A policy for agent $i$ is a distribution over actions, defined by $\pi^{i}:O^{i}\times A^{i}\rightarrow[0,1]$. Let $\bm{\pi}:=\{\pi^{1},\dots,\pi^{n}\}$ and $s:=(s^{1},\dots,s^{n})$. Following [17], in the simplest case, $s^{i}=o^{i}$ for each agent $i$, and we use this for the remainder of the paper. Additional information about states of agents can be included since we compute centralized value functions. Let $V_{i}^{\bm{\pi}}(s)=V_{i}(s,\pi^{1},\dots,\pi^{n}):=\mathbb{E}_{\bm{\pi}}[\sum_{t}\gamma^{t}R^{i}_{t}|s_{0}=s,\bm{\pi}]$ and $Q_{i}^{\bm{\pi}}(s,a^{1},\dots,a^{n}):=R^{i}+\gamma\mathbb{E}_{s^{\prime}}[V_{i}^{\bm{\pi}}(s^{\prime})]$, where $V_{i}^{\bm{\pi}}(s)=\mathbb{E}_{\{a^{i}\sim\pi^{i}\}_{i=1}^{n}}[Q_{i}^{\bm{\pi}}(s,a^{1},\dots,a^{n})]$.

III-C Reward Shaping in Reinforcement Learning

Reward shaping methods augment the environment reward $R$ with an additional reward $F\in\mathbb{R}$, $F<\infty$. This changes the structure of the original MDP $M=(S,A,\mathbb{T},\rho_{0},R)$ to $M^{\prime}=(S,A,\mathbb{T},\rho_{0},R+F)$. The goal is to choose $F$ so that an optimal policy for $M^{\prime}$, $\pi^{*}_{M^{\prime}}$, is also optimal for the original MDP $M$. Potential-based reward shaping (PBRS) schemes were shown to preserve the optimality of deterministic policies in [13]. PBRS was used in model-based RL in [20], in episodic RL in [42], and was extended to planning in partially observable domains in [43]. These works focused on the finite-horizon case. In comparison, we consider the infinite-horizon, discounted cost setting.

The additional reward $F$ in PBRS is defined as a difference of potentials $\phi(\cdot)$. Specifically, $F(s_{t},a_{t},s_{t+1}):=\gamma\phi(s_{t+1})-\phi(s_{t})$. Then, the Q-function $Q^{*}_{M}(s,a)$ of the optimal greedy policy for $M$ and the optimal Q-function $Q^{*}_{M^{\prime}}(s,a)$ for $M^{\prime}$ are related by $Q^{*}_{M^{\prime}}(s,a)=Q^{*}_{M}(s,a)-\phi(s)$. Therefore, the optimal greedy policy is not changed [13, 44], since:

\[
\pi^{*}_{M^{\prime}}(s)\in\arg\max_{a\in A}Q^{*}_{M^{\prime}}(s,a)=\arg\max_{a\in A}\big(Q^{*}_{M}(s,a)-\phi(s)\big)=\arg\max_{a\in A}Q^{*}_{M}(s,a).
\]

The authors of [45] augmented $\phi(s)$ to include the action $a$ as an argument and termed this potential-based advice (PBA). They defined two forms of PBA, look-ahead PBA and look-back PBA, respectively defined by:

\[
F(s_{t},a_{t},s_{t+1},a_{t+1})=\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t}) \quad (1)
\]
\[
F(s_{t},a_{t},s_{t-1},a_{t-1})=\phi(s_{t},a_{t})-\gamma^{-1}\phi(s_{t-1},a_{t-1}). \quad (2)
\]

For the look-ahead PBA scheme, the state-action value function for $M$ following policy $\pi$ is given by:

\[
Q^{\pi}_{M}(s,a)=Q^{\pi}_{M^{\prime}}(s,a)+\phi(s,a). \quad (3)
\]

The optimal greedy policy for $M$ can be recovered from the optimal state-action value function for $M^{\prime}$ using the fact [45]:

\[
\pi^{*}_{M}(s_{t})\in\arg\max_{a\in A}\big(Q^{*}_{M^{\prime}}(s_{t},a)+\phi(s_{t},a)\big). \quad (4)
\]

The optimal greedy policy for $M$ using look-back PBA can be recovered similarly.
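
To make the two forms of PBA concrete, the minimal sketch below (our illustration, not code from the paper; the potential `phi` is a user-supplied placeholder heuristic) computes the shaping reward for a single transition using Equations (1) and (2).

# Minimal sketch of look-ahead and look-back potential-based advice (PBA),
# following Equations (1) and (2). The potential function `phi(s, a)` is a
# placeholder supplied by the user (e.g., a domain heuristic).

def look_ahead_advice(phi, s_t, a_t, s_next, a_next, gamma=0.99):
    """F(s_t, a_t, s_{t+1}, a_{t+1}) = gamma * phi(s_{t+1}, a_{t+1}) - phi(s_t, a_t)."""
    return gamma * phi(s_next, a_next) - phi(s_t, a_t)

def look_back_advice(phi, s_t, a_t, s_prev, a_prev, gamma=0.99):
    """F(s_t, a_t, s_{t-1}, a_{t-1}) = phi(s_t, a_t) - (1 / gamma) * phi(s_{t-1}, a_{t-1})."""
    return phi(s_t, a_t) - phi(s_prev, a_prev) / gamma

if __name__ == "__main__":
    # Toy potential: prefer action 1 in every state (purely illustrative).
    phi = lambda s, a: 1.0 if a == 1 else 0.0
    print(look_ahead_advice(phi, s_t=0, a_t=0, s_next=1, a_next=1))  # 0.99
    print(look_back_advice(phi, s_t=1, a_t=1, s_prev=0, a_prev=0))   # 1.0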

In the multi-agent case, the shaping advice for an agent at each time is a function of observations and actions of all agents. The shaping advice is augmented to the environment reward during training, and can take one of two forms, look-ahead and look-back, respectively given by:

\[
F^{i}_{t}(s_{t},a^{i}_{t},a^{-i}_{t},s_{t+1},a^{i}_{t+1},a^{-i}_{t+1}):=\gamma\phi_{i}(s_{t+1},a^{i}_{t+1},a^{-i}_{t+1})-\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}) \quad (5)
\]
\[
F^{i}_{t}(s_{t},a^{i}_{t},a^{-i}_{t},s_{t-1},a^{i}_{t-1},a^{-i}_{t-1}):=\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})-\gamma^{-1}\phi_{i}(s_{t-1},a^{i}_{t-1},a^{-i}_{t-1}) \quad (6)
\]

We will denote by $\mathcal{G}^{\prime}$ the $n$-player stochastic game that is identical to $\mathcal{G}$, but with rewards $R'^{i}:=R^{i}+F^{i}$ for each $i$.

When the value of the potential function is identical for all actions in a particular state, we will term this uniform advice. On the other hand, when the value of the potential function depends on the action taken in a state, we will term this nonuniform advice. We will explicitly distinguish between uniform and nonuniform variants of shaping advice in the single and multi-agent settings subsequently.

The shaping advice is a heuristic that uses knowledge of the environment and task, along with information available to the agent [46]. For example, in the particle world tasks that we study, each agent has access to positions of other agents and of landmarks, relative to itself. This is used to design shaping advice for individual agents at each time step.
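
As an illustration of such a heuristic, the sketch below gives a hypothetical potential for a navigation-style particle-world task; the reliance on the relative landmark position and the specific weighting are our assumptions, not necessarily the exact advice used in the paper's experiments.

import numpy as np

# Hypothetical potential for a navigation-style particle-world task.
# Assumes each agent observes the landmark position relative to itself
# (`rel_landmark`) and that its action contains a 2-D movement component.
# This is only an illustrative heuristic, not the paper's exact advice.

def potential(rel_landmark: np.ndarray, action_move: np.ndarray) -> float:
    dist = np.linalg.norm(rel_landmark)
    # Higher potential when the agent is near the landmark...
    closeness = -dist
    # ...and when its movement points toward the landmark.
    if dist > 1e-6 and np.linalg.norm(action_move) > 1e-6:
        heading = float(rel_landmark @ action_move) / (dist * np.linalg.norm(action_move))
    else:
        heading = 0.0
    return closeness + 0.5 * heading

# Example: landmark 3 units to the right, agent moving right.
print(potential(np.array([3.0, 0.0]), np.array([1.0, 0.0])))  # -3.0 + 0.5 = -2.5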

In this paper, we develop a framework for incorporating shaping advice in RL environments with continuous action spaces. Moreover, in the multi-agent case, different from prior works that emphasize learning joint policies for agents, we learn decentralized policies for each agent.

IV Shaping Advice in Single-Agent RL

This section presents our results when potential-based shaping advice is used to learn stochastic policies in single-agent RL. We term this Shaping Advice in Single-agent RL (SAS). This generalizes and extends the use of potential-based methods in the literature, which have hitherto focused on augmenting value-based methods to learn optimal deterministic policies. We consider two variants: (i) SAS-Uniform, where the advice for all actions at a particular state is the same, and (ii) SAS-NonUniform, where a higher weight may be given to some ‘good’ actions over others at each state. We first prove that the optimality of stochastic policies is preserved when using SAS-Uniform. Then, we describe an approach to integrate SAS-NonUniform into policy-gradient algorithms to enable effective learning of stochastic policies in single-agent RL.

IV-A Uniform Advice

The following result shows that SAS-Uniform preserves optimality even when the optimal policy is stochastic.

Proposition IV.1.

Let $\pi_{M}^{*}$ denote the optimal policy for an MDP $M$, and suppose that $\pi_{M}^{*}$ is a stochastic policy. Let $\pi_{M^{\prime}}^{*}$ denote the optimal policy for the MDP $M^{\prime}$ whose reward is augmented by $F:=\gamma\phi(s_{t+1})-\phi(s_{t})$. Then, SAS-Uniform preserves the optimality of stochastic policies, i.e., $\pi_{M}^{*}=\pi_{M^{\prime}}^{*}$.

Proof.

The goal in the original MDP $M$ was to find a policy $\pi$ in order to maximize:

\[
\pi_{M}^{*}=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\Big]. \quad (7)
\]

When using SAS-Uniform, the goal is to determine a policy so that:

\begin{align}
\pi_{M^{\prime}}^{*}&=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(R(s_{t},a_{t})+F(s_{t},a_{t},s_{t+1},a_{t+1})\big)\Big]\nonumber\\
&=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(R(s_{t},a_{t})+\gamma\phi(s_{t+1})-\phi(s_{t})\big)\Big]\nonumber\\
&=\arg\max_{\pi}\Big[\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\Big]-\mathbb{E}_{\tau\sim\pi}\big[\phi(s_{0})\big]\Big]\nonumber\\
&=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\Big]-\int_{s}\rho_{0}(s)\phi(s)\,\mathrm{d}s. \quad (8)
\end{align}

The last term in Equation (8) is constant, and does not affect the identity of the maximizing policy of (7). ∎
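
For completeness, the telescoping step used in the third equality of (8) can be written out as follows (assuming $\phi$ is bounded and $\gamma<1$):

\[
\sum_{t=0}^{\infty}\gamma^{t}\big(\gamma\phi(s_{t+1})-\phi(s_{t})\big)
=\sum_{t=1}^{\infty}\gamma^{t}\phi(s_{t})-\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t})
=-\phi(s_{0}).
\]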

IV-B Nonuniform Advice

Although SAS-Uniform preserves optimality of policies in several settings, it suffers from the drawback of being unable to encode richer information, such as desired relations between states and actions. The authors of [45] proposed potential-based nonuniform advice, a scheme that augmented the potential function by including actions as an argument together with states. In this part, we show that when using SAS-NonUniform, recovering the optimal policy can be difficult if the optimal policy is stochastic. To overcome this barrier, we propose a novel way to impart prior information about the environment in order to use SAS-NonUniform to learn a stochastic policy.

IV-B1 Stochastic policy learning with nonuniform advice

Assume that we can compute $Q^{*}_{M}(s,a)$, the optimal value of the state-action pair $(s,a)$ in MDP $M$. The optimal stochastic policy for $M$ is $\pi^{*}_{M}=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\big[Q^{*}_{M}(s,a)\big]$. From Eqn. (3), the optimal stochastic policy for the modified MDP $M^{\prime}$ is given by $\pi^{*}_{M^{\prime}}=\arg\max_{\pi}\mathbb{E}_{\tau\sim\pi}\big[Q^{*}_{M}(s,a)-\phi(s,a)\big]$. In general, $\pi^{*}_{M}\neq\pi^{*}_{M^{\prime}}$. If the optimal policy is deterministic, then the policy for $M$ can be recovered easily from that for $M^{\prime}$ using Eqn. (4). However, for stochastic optimal policies, we would need to average over trajectories of the MDP, which makes it difficult to recover the optimal policy for $M$ from that of $M^{\prime}$.

In what follows, we will propose a novel way to take advantage of SAS-NonUniform in the policy gradient framework in order to directly learn a stochastic policy.

IV-B2 Imparting nonuniform advice in policy gradient

Let $J_{M}(\theta)$ denote the value of a parameterized policy $\pi_{\theta}$ in MDP $M$. That is, $J_{M}(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\big]$. Following the policy gradient theorem [1], and defining $G(s_{t},a_{t}):=\sum_{i=t}^{\infty}\gamma^{i-t}r_{i}$, the gradient of $J_{M}(\theta)$ with respect to $\theta$ is:

\[
\nabla_{\theta}J_{M}(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\big[G(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]. \quad (9)
\]

Then, $\mathbb{E}_{\tau\sim\pi_{\theta}}\big[G(s_{t},a_{t})\big]=Q^{\pi_{\theta}}(s_{t},a_{t})$.

REINFORCE [1] is a policy gradient method that uses Monte Carlo simulation to learn $\theta$, where the parameter update is performed only at the end of an episode (a trajectory of length $T$). If we apply the look-ahead variant of SAS-NonUniform as in Equation (1) along with REINFORCE, then the total return from time $t$ is given by:

\begin{align}
G^{a}(s_{t},a_{t})&=\sum_{i=t}^{T}\gamma^{i-t}r_{i}+\gamma^{T-t}\phi(s_{T},a_{T})-\phi(s_{t},a_{t})\nonumber\\
&=G(s_{t},a_{t})+\gamma^{T-t}\phi(s_{T},a_{T})-\phi(s_{t},a_{t}). \quad (10)
\end{align}

Notice that if $G^{a}(s_{t},a_{t})$ is used in Eqn. (9) instead of $G(s_{t},a_{t})$, then the policy gradient is biased. One way to resolve the problem is to add the difference $-\gamma^{T-t}\phi(s_{T},a_{T})+\phi(s_{t},a_{t})$ to $G^{a}(s_{t},a_{t})$. However, this makes the learning process identical to the original REINFORCE, and nonuniform advice is not used. When using nonuniform advice in a policy gradient setup, it is important to add the term $\phi(s,a)$ so that the policy gradient remains unbiased, while still leveraging the advantage that nonuniform advice offers during learning.

IV-C Analysis and Algorithm

To integrate SAS-NonUniform with policy gradient-based techniques, we turn to temporal difference (TD) methods. TD methods update estimates of the accumulated return based in part on other learned estimates, before the end of an episode. A popular TD-based policy gradient method is the actor-critic framework [1]. In this setup, after performing action $a_{t}$ at step $t$, the accumulated return $G(s_{t},a_{t})$ is estimated by $Q_{M}(s_{t},a_{t})$ which, in turn, is estimated by $r_{t}+\gamma V_{M}(s_{t+1})$. It should be noted that these estimates are unbiased.

When the reward is augmented with look-ahead SAS-NonUniform, the accumulated return is changed to $Q_{M^{\prime}}(s_{t},a_{t})$, which is estimated by $r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})+\gamma V_{M^{\prime}}(s_{t+1})$. From Eqn. (3), at steady state, $Q_{M}(s_{t},a_{t})-Q_{M^{\prime}}(s_{t},a_{t})=\phi(s_{t},a_{t})$. Intuitively, to keep the policy gradient unbiased when augmented with look-ahead nonuniform advice, we can add $\phi(s_{t},a_{t})$ at each training step. In other words, we can use $r_{t}+\gamma\phi(s_{t+1},a_{t+1})+\gamma V_{M^{\prime}}(s_{t+1})$ as the estimated return. It should be noted that before the policy reaches steady state, adding $\phi(s_{t},a_{t})$ at each time step will not cancel out the effect of nonuniform advice. This is unlike in REINFORCE, where the addition of this term negates the effect of using nonuniform advice. In the advantage actor-critic, an advantage term is used instead of the Q-function in order to reduce the variance of the estimated policy gradient. In this case too, the potential term $\phi(s_{t},a_{t})$ can be added in order to keep the policy gradient unbiased.

Algorithm 1 SAS: Shaping Advice in Single-agent RL
0:  Differentiable policy and value functions $\pi_{\theta}(a|s)$, $V_{\omega}(s)$; shaping advice $\phi(s,a)$; maximum episode length $T_{max}$. Initialization: policy parameter $\theta$, value parameter $\omega$, learning rates $\alpha^{\theta}$ and $\alpha^{\omega}$, discount factor $\gamma$.
1:  $T=0$
2:  repeat
3:     initialize state $s_{0}$, $t\leftarrow 0$
4:     repeat
5:        Sample action $a_{t}\sim\pi_{\theta}(\cdot|s_{t})$
6:        Take action $a_{t}$, observe reward $r_{t}$, next state $s_{t+1}$
7:        $R=\begin{cases}0,&\text{if }s_{t+1}\text{ is a terminal state},\\ V_{\omega}(s_{t+1}),&\text{otherwise.}\end{cases}$
8:        if use look-ahead advice then
9:           $\delta_{t}=r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})+\gamma R-V_{\omega}(s_{t})$
10:           Update $\theta\leftarrow\theta+\alpha^{\theta}\big(\delta_{t}+\phi(s_{t},a_{t})\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$
11:        else
12:           $\delta_{t}=r_{t}+\phi(s_{t},a_{t})-\gamma^{-1}\phi(s_{t-1},a_{t-1})+\gamma R-V_{\omega}(s_{t})$
13:           Update $\theta\leftarrow\theta+\alpha^{\theta}\delta_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$
14:        end if
15:        Update $\omega\leftarrow\omega-\alpha^{\omega}\delta_{t}\nabla_{\omega}V_{\omega}(s_{t})$
16:     until $s_{t+1}$ is a terminal state
17:     $T\leftarrow T+1$
18:  until $T>T_{max}$

A procedure for augmenting the advantage actor-critic with SAS-NonUniform is presented in Algorithm 1. $\alpha^{\theta}$ and $\alpha^{\omega}$ denote learning rates for the actor and critic, respectively. When applying look-ahead nonuniform advice, at training step $t$, the parameter $\omega$ of the critic $V_{\omega}(s)$ is updated as follows:

\begin{align}
\delta^{a}_{t}&=r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})+\gamma V_{\omega}(s_{t+1})-V_{\omega}(s_{t})\nonumber\\
\omega&=\omega-\alpha^{\omega}\delta^{a}_{t}\nabla_{\omega}V_{\omega}(s_{t}),
\end{align}

where $\delta^{a}_{t}$ is the estimation error of the state value after receiving the new reward $[r_{t}+\gamma\phi(s_{t+1},a_{t+1})-\phi(s_{t},a_{t})]$ at step $t$. To ensure an unbiased estimate of the policy gradient, the potential term $\phi(s_{t},a_{t})$ is added while updating $\theta$ as [20]:

\[
\theta=\theta+\alpha^{\theta}\big(\delta^{a}_{t}+\phi(s_{t},a_{t})\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}).
\]

A similar method can be used when learning with look-back nonuniform advice. In this case, the critic and the policy parameter are updated as follows:

\begin{align}
\delta^{b}_{t}&=r_{t}+\phi(s_{t},a_{t})-\gamma^{-1}\phi(s_{t-1},a_{t-1})+\gamma V_{\omega}(s_{t+1})-V_{\omega}(s_{t})\nonumber\\
\omega&=\omega-\alpha^{\omega}\delta^{b}_{t}\nabla_{\omega}V_{\omega}(s_{t}),\nonumber\\
\theta&=\theta+\alpha\big(\delta^{b}_{t}+\gamma^{-1}\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}) \quad (11)
\end{align}

In the above case, the potential term need not be added to ensure an unbiased estimate. Then, the policy parameter update becomes:

\[
\theta=\theta+\alpha\delta^{b}_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}), \quad (12)
\]

which is exactly the policy update of the advantage actor-critic. This is formally stated in Proposition IV.2.

Proposition IV.2.

When the actor-critic is augmented with the look-back variant of SAS-NonUniform, the updates in Equations (11) and (12) are equal in expectation. That is,

\begin{align}
&\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[\big(\delta^{b}_{t}+\gamma^{-1}\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]\big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]\nonumber\\
&\qquad=\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[\delta^{b}_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big], \quad (13)
\end{align}

where $\rho^{\pi_{\theta}}$ is the distribution induced by the policy $\pi_{\theta}$.

Proof.

It is equivalent to show that:

\[
\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]=0. \quad (14)
\]

The inner expectation $\mathbb{E}\big[\phi(s_{t-1},a_{t-1})|s_{t}\big]$ is a function of $s_{t}$, the policy $\pi_{\theta}$, and the transition probability $\mathbb{T}$. Denoting this expectation by $f(s_{t},\pi_{\theta},\mathbb{T})$, we obtain:

\begin{align}
&\mathbb{E}_{(s_{t},a_{t})\sim\rho^{\pi_{\theta}}}\big[f(s_{t},\pi_{\theta},\mathbb{T})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]\nonumber\\
&=\mathbb{E}_{s_{t}\sim\rho^{\pi_{\theta}}}\Big[\mathbb{E}_{a_{t}\sim\pi_{\theta}}\big[f(s_{t},\pi_{\theta},\mathbb{T})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\big]\Big]\nonumber\\
&=\mathbb{E}_{s_{t}\sim\rho^{\pi_{\theta}}}\Big[\int_{A}\pi_{\theta}(a_{t}|s_{t})f(s_{t},\pi_{\theta},\mathbb{T})\frac{\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}\,\mathrm{d}a\Big]\nonumber\\
&=\mathbb{E}_{s_{t}\sim\rho^{\pi_{\theta}}}\Big[f(s_{t},\pi_{\theta},\mathbb{T})\nabla_{\theta}\int_{A}\pi_{\theta}(a_{t}|s_{t})\,\mathrm{d}a\Big]=0. \quad (15)
\end{align}

The last equality follows from the fact that the integral evaluates to $1$, and its gradient is $0$. ∎
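
The key identity used above, $\mathbb{E}_{a\sim\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)]=0$, can also be checked numerically; the snippet below is our illustration with a softmax policy over three actions (not part of the paper's code) and estimates this expectation by sampling.

import numpy as np

# Numerical check of the score-function identity used in Eqn. (15):
# E_{a ~ pi_theta}[ grad_theta log pi_theta(a|s) ] = 0.
# Illustrative softmax policy over 3 actions with logits `theta`.

rng = np.random.default_rng(0)
theta = np.array([0.2, -1.0, 0.7])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(a, theta):
    # d/dtheta log pi(a) = one_hot(a) - pi for a softmax policy.
    return np.eye(len(theta))[a] - softmax(theta)

pi = softmax(theta)
samples = rng.choice(len(theta), size=100_000, p=pi)
estimate = np.mean([grad_log_pi(a, theta) for a in samples], axis=0)
print(estimate)  # each component is close to 0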

Remark IV.3.

Look-back nonuniform advice could result in better performance than look-ahead nonuniform advice, since it does not involve estimating a future action.
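
To make Algorithm 1 concrete, the following minimal sketch (our illustration; the tabular setting, toy potential, and hyperparameters are assumptions, not the paper's released implementation) performs one look-ahead SAS update with a softmax policy and a linear critic over one-hot state features.

import numpy as np

# Sketch of one look-ahead SAS update (Algorithm 1, lines 9-10 and 15), assuming a
# softmax policy and a linear critic over one-hot state features. The potential
# `phi` and the toy sizes below are illustrative placeholders.

n_states, n_actions, gamma = 4, 2, 0.99
alpha_theta, alpha_omega = 0.01, 0.1
theta = np.zeros((n_states, n_actions))   # policy logits
omega = np.zeros(n_states)                # critic weights, V(s) = omega[s]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def phi(s, a):
    # Placeholder nonuniform advice: mildly prefer action 1 everywhere.
    return 0.1 if a == 1 else 0.0

def sas_lookahead_update(s, a, r, s_next, a_next, terminal):
    R = 0.0 if terminal else omega[s_next]
    # TD-error with look-ahead advice (Algorithm 1, line 9).
    delta = r + gamma * phi(s_next, a_next) - phi(s, a) + gamma * R - omega[s]
    # Actor update: phi(s, a) is added back to keep the policy gradient unbiased (line 10).
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * (delta + phi(s, a)) * grad_log_pi
    # Critic update via semi-gradient TD(0) (cf. Algorithm 1, line 15; sign convention may differ).
    omega[s] += alpha_omega * delta

# Example transition.
sas_lookahead_update(s=0, a=1, r=0.0, s_next=1, a_next=0, terminal=False)
print(theta[0], omega[0])

The look-back variant would instead use the error in Eqn. (12), without adding the potential term back to the actor update.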

V Shaping Advice in Multi-Agent RL

Figure 1: Schematic of SAM. A centralized critic estimates value functions $V_{\omega_{1}},\dots,V_{\omega_{n}}$. Actions for an agent $i$ are sampled from its policy $\pi_{\theta_{i}}$ in a decentralized manner. Actions and observations of all agents are used to determine shaping advice $F^{1},\dots,F^{n}$. The advice $F^{i}$ is augmented to the reward $r^{i}$ from the RL environment. The workflow shown by blue arrows in the outer box is required only during training. During execution, only the workflow shown by the red arrows inside the inner boxes is needed.

This section introduces shaping advice in multi-agent reinforcement learning (SAM). The goal of SAM is to augment shaping advice to the reward supplied by the MARL environment to provide immediate feedback to agents on their actions. SAM uses the CTDE paradigm wherein agents share parameters with each other during the training phase, but execute decentralized policies using only their own observations at test-time. Figure 1 shows a schematic of SAM. We subsequently detail how the shaping advice is provided to the agents, and analyze the optimality and convergence of policies when using SAM.

V-A Centralized Critic

SAM uses a centralized critic during the training phase. Information about observations and actions of all agents is used to learn a decentralized policy for each agent. One way to do this is by using an actor-critic framework, which combines policy gradients with temporal difference (TD) techniques.

At time $t$, the joint action $(a^{1}_{t},\dots,a^{n}_{t})$ is used to estimate the accumulated return for each agent $i$ as $r^{i}_{t}+\gamma V^{i}(s_{t+1})$. This quantity is called the TD-target. Subtracting $V^{i}(s_{t})$ from the TD-target gives the TD-error, which is an unbiased estimate of the agent's advantage [1]. Each actor can then be updated following a gradient based on this TD-error.

We learn a separate critic for each agent, as in [17]. However, the learning process can be affected when rewards provided by the environment are sparse. SAM uses a potential-based heuristic as shaping advice that is augmented to the reward received from the environment. The resulting reward is less sparse and can be used by the agents to learn policies.

V-B Shaping Advice in Multi-Agent Actor-Critic

We describe how to augment shaping advice to the multi-agent policy gradient to assign credit. We use the actor-critic framework with a centralized critic and decentralized actors.

For an agent $i$, shaping advice $F^{i}$ is augmented to the environment reward $r^{i}$ at each time step. $F^{i}$ is specified by a difference of potentials (Eqn. (5) or (6)). The centralized critic allows using observations and actions of all agents to specify $F^{i}$. Using look-ahead advice, the $Q$-values in the modified game $\mathcal{G}^{\prime}$ with reward $R^{i}+F^{i}$ and the original game $\mathcal{G}$ with reward $R^{i}$ are related as [45]:

\[
[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}}=[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}) \quad (16)
\]

The accumulated return in $\mathcal{G}^{\prime}$ for agent $i$ is then estimated by $r^{i}_{t}+\gamma\phi_{i}(s_{t+1},a^{i}_{t+1},a^{-i}_{t+1})-\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})+\gamma V^{i}(s_{t+1})$. From Eqn. (16), we can add $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$ to the TD-target in $\mathcal{G}^{\prime}$ at each time $t$ to keep the policy gradient unbiased in $\mathcal{G}$.

Let the critic and actor in SAM for agent $i$ be parameterized by $\omega_{i}$ and $\theta_{i}$, respectively. When the actor is updated at a slower rate than the critic, the asymptotic behavior of the critic can be analyzed by keeping the actor fixed, using two time-scale stochastic approximation methods [47]. For agent $i$, the TD-error at time $t$ is given by:

\[
\delta^{i}_{t}:=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}}(s_{t+1})-V_{\omega_{i}}(s_{t}). \quad (17)
\]

The update of the critic can be expressed as a first-order ordinary differential equation (ODE) in $\omega_{i}$, given by:

\[
\dot{\omega}_{i}=\mathbb{E}_{\bm{\pi_{\theta}}}[\delta^{i}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})] \quad (18)
\]

Under an appropriate parameterization of the value function, this ODE will converge to an asymptotically stable equilibrium, denoted $\omega_{i}(\bm{\theta})$. At this equilibrium, the TD-error for agent $i$ is $\delta_{t,\omega_{i}(\bm{\theta})}^{i}=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}(\bm{\theta})}(s_{t+1})-V_{\omega_{i}(\bm{\theta})}(s_{t})$.

The update of the actor can then be determined by solving a first-order ODE in $\theta_{i}$. With look-ahead advice, a term corresponding to the shaping advice at time $t$ will have to be added to ensure an unbiased policy gradient (Eqn. (16)). This ODE can be written as:

\[
\dot{\theta}_{i}=\mathbb{E}_{\bm{\pi_{\theta}}}[(\delta_{t,\omega_{i}(\bm{\theta})}^{i}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}))\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})] \quad (19)
\]

A potential term will not have to be added to ensure an unbiased policy gradient when utilizing look-back advice. This insight follows from Proposition 3 in [20] since we consider decentralized policies.
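
To make these updates concrete, the sketch below (our illustration, with placeholder potentials, sizes, and a tabular joint-state encoding; the actors condition on the joint state index here for brevity, whereas in SAM they condition on each agent's own observation) performs one look-ahead SAM step per agent using Eqns. (17) and (19).

import numpy as np

# Sketch of SAM's per-agent updates with look-ahead advice (Eqns. (17) and (19)),
# assuming two agents, discrete actions, softmax decentralized actors, and linear
# centralized critics over a one-hot encoding of the joint state. The potential
# `phi_i` and the sizes below are illustrative placeholders, not the paper's code.

n_joint_states, n_actions, n_agents = 5, 2, 2
gamma, alpha_theta, alpha_omega = 0.95, 0.01, 0.1
theta = [np.zeros((n_joint_states, n_actions)) for _ in range(n_agents)]
omega = [np.zeros(n_joint_states) for _ in range(n_agents)]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def phi_i(i, s, joint_a):
    # Placeholder advice: each agent is mildly encouraged to pick action 1.
    return 0.1 if joint_a[i] == 1 else 0.0

def sam_lookahead_step(s, joint_a, rewards, s_next, joint_a_next, terminal):
    for i in range(n_agents):
        F_i = gamma * phi_i(i, s_next, joint_a_next) - phi_i(i, s, joint_a)
        V_next = 0.0 if terminal else omega[i][s_next]
        # TD-error in the modified game G' (Eqn. (17)).
        delta = rewards[i] + F_i + gamma * V_next - omega[i][s]
        # Decentralized actor update; phi_i is added back so the policy
        # gradient stays unbiased in the original game G (Eqn. (19)).
        grad_log_pi = -softmax(theta[i][s])
        grad_log_pi[joint_a[i]] += 1.0
        theta[i][s] += alpha_theta * (delta + phi_i(i, s, joint_a)) * grad_log_pi
        # Centralized critic update (semi-gradient TD(0)).
        omega[i][s] += alpha_omega * delta

sam_lookahead_step(s=0, joint_a=(1, 0), rewards=(0.0, 0.0),
                   s_next=2, joint_a_next=(0, 1), terminal=False)
print(theta[0][0], omega[0][0])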

V-C Analysis and Algorithm

In this part, we present a proof of the convergence of the actor and critic parameters when learning with shaping advice. We also demonstrate that convergence of policy gradients and values when using SAM implies convergence of these quantities in the absence of SAM. This will guarantee that policies learned in the modified stochastic game 𝒢\mathcal{G}^{\prime} will be locally optimal in the original game 𝒢\mathcal{G}.

For agent $i$, the update dynamics of the critic can be expressed by the ODE in Eqn. (18). Assuming parameterization of $V(s)$ over a linear family, this ODE will converge to an asymptotically stable equilibrium [47]. The actor update is then given by the ODE in Eqn. (19). The parameters associated with the critics are assumed to be updated on a faster timescale than those of the actors. Then, the behaviors of the actor and critic can be analyzed separately using two-timescale stochastic approximation techniques [47].

Assumption V.1.

We make the following assumptions:

  1. At any time $t$, an agent is aware of the actions taken by all other agents. Rewards received by the agents at each time step are uniformly bounded.

  2. The Markov chain induced by the agent policies is irreducible and aperiodic.

  3. For each agent $i$, the update of its policy parameter $\theta_{i}$ includes a projection operator $\Gamma_{i}$, which projects $\theta_{i}$ onto a compact set $\Theta_{i}$. We assume that $\Theta_{i}$ includes a stationary point of $\nabla_{\theta_{i}}J_{i}(\bm{\theta})$ for each $i$.

  4. For each agent $i$, its value function is parameterized by a linear family. That is, $V_{\omega_{i}}(s)=\Phi_{i}\omega_{i}$, where $\Phi_{i}$ is a known, full-rank feature matrix for each $i$.

  5. For each agent $i$, the TD-error at each time $t$ and the gradients $\nabla_{\omega_{i}}V_{\omega_{i}}(s)$ are bounded, and the gradients $\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(\cdot|s_{t})$ are Lipschitz with bounded norm.

  6. The learning rates satisfy $\sum_{t}\alpha^{\theta}_{t}=\sum_{t}\alpha^{\omega}_{t}=\infty$, $\sum_{t}[(\alpha^{\theta}_{t})^{2}+(\alpha^{\omega}_{t})^{2}]<\infty$, and $\lim_{t\rightarrow\infty}\alpha^{\theta}_{t}/\alpha^{\omega}_{t}=0$ (one concrete choice of such rates is sketched below).
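
For concreteness, one choice of learning rates satisfying the last assumption (an illustrative example, not one prescribed in the paper) is

\[
\alpha^{\omega}_{t}=(t+1)^{-0.6},\qquad \alpha^{\theta}_{t}=(t+1)^{-0.9}.
\]

Both sums $\sum_{t}\alpha^{\omega}_{t}$ and $\sum_{t}\alpha^{\theta}_{t}$ diverge, $\sum_{t}[(\alpha^{\theta}_{t})^{2}+(\alpha^{\omega}_{t})^{2}]<\infty$ since the squared exponents $1.8$ and $1.2$ both exceed $1$, and $\alpha^{\theta}_{t}/\alpha^{\omega}_{t}=(t+1)^{-0.3}\rightarrow 0$, so the actor is updated on a slower timescale than the critic.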

We first state a useful result from [48].

Lemma V.2 ([48]).

Let $\Gamma:\mathbb{R}^{k}\rightarrow\mathbb{R}^{k}$ be a projection onto a compact set $K\subset\mathbb{R}^{k}$. Define

\[
\hat{\Gamma}(h(x)):=\lim_{\epsilon\downarrow 0}\frac{\Gamma(x+\epsilon h(x))-x}{\epsilon}
\]

for $x\in K$ and $h:\mathbb{R}^{k}\rightarrow\mathbb{R}^{k}$ continuous on $K$. Consider the update $x_{t+1}=\Gamma(x_{t}+\alpha_{t}(h(x_{t})+\xi_{t,1}+\xi_{t,2}))$ and its associated ODE $\dot{x}=\hat{\Gamma}(h(x))$. Assume that:

i) $\{\alpha_{t}\}$ is such that $\sum_{t}\alpha_{t}=\infty$ and $\sum_{t}\alpha_{t}^{2}<\infty$;

ii) $\{\xi_{t,1}\}$ is such that for all $\epsilon>0$, $\lim_{t}\mathbb{P}(\sup_{n\geq t}||\sum_{\tau=t}^{n}\alpha_{\tau}\xi_{\tau,1}||\geq\epsilon)=0$;

iii) $\{\xi_{t,2}\}$ is an almost surely bounded random sequence, and $\xi_{t,2}\rightarrow 0$ almost surely.

Then, if the set of asymptotically stable equilibria of the ODE in $\dot{x}$ is compact, denoted $K_{eq}$, the updates $x_{t+1}$ converge almost surely to $K_{eq}$.

Let $\{\mathcal{F}^{\omega}_{t}\}$ be the filtration where $\mathcal{F}^{\omega}_{t}:=\sigma(s_{\tau},r^{1}_{\tau},\dots,r^{n}_{\tau},\omega_{1_{\tau}},\dots,\omega_{n_{\tau}}:\tau\leq t)$ is an increasing $\sigma$-algebra generated by iterates of $\omega_{i}$ up to time $t$. We first analyze the behavior of the critic when the parameters of the actor are fixed.

Theorem V.3.

For a fixed policy $\bm{\pi_{\theta}}$, the update $\omega_{i}\leftarrow\omega_{i}-\alpha_{t}^{\omega}\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})$ converges almost surely to the set of asymptotically stable equilibria of the ODE $\dot{\omega}_{i}=h_{i}(\omega_{i}):=\mathbb{E}_{\bm{\pi_{\theta}}}[\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})|\mathcal{F}^{\omega}_{t}]$.

Proof.

Let $\xi^{i}_{t,1}:=\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})-\mathbb{E}_{\bm{\pi_{\theta}}}[\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})|\mathcal{F}^{\omega}_{t}]$. Then, the $\omega_{i}$ update can be written as $\omega_{i}\leftarrow\omega_{i}-\alpha_{t}^{\omega}[h_{i}(\omega_{i})+\xi^{i}_{t,1}]$, where $h_{i}(\omega_{i})$ is continuous in $\omega_{i}$. Since $\delta^{i}_{t}$ and $\nabla_{\omega_{i}}V_{\omega_{i}}(s)$ are bounded, $\xi^{i}_{t,1}$ is almost surely bounded.

Let $M^{i}_{t}:=\sum_{\tau=0}^{t}\alpha^{\omega}_{\tau}\xi^{i}_{\tau,1}$. Then $\{M^{i}_{t}\}$ is a martingale (a martingale [49] is a stochastic process $S_{1},S_{2},\dots$ that satisfies $\mathbb{E}(|S_{n}|)<\infty$ and $\mathbb{E}(S_{n+1}|S_{1},\dots,S_{n})=S_{n}$ for each $n=1,2,\dots$), and $\sum_{t}||M^{i}_{t}-M^{i}_{t-1}||^{2}=\sum_{t}||\alpha^{\omega}_{t}\xi^{i}_{t,1}||^{2}<\infty$ almost surely. Therefore, from the martingale convergence theorem [49], the sequence $\{M^{i}_{t}\}$ converges almost surely, and the conditions in Lemma V.2 are satisfied.

Since $V_{\omega_{i}}=\Phi_{i}\omega_{i}$, with $\Phi_{i}$ a full-rank matrix, $h_{i}(\omega_{i})$ is a linear function, and the ODE has a unique equilibrium point. This equilibrium is asymptotically stable since the ODE dynamics are governed by a matrix of the form $(\gamma T_{\pi}-I)$, where $I$ is an identity matrix and $T_{\pi}$ is a stochastic state-transition matrix under policy $\pi$, so that the eigenvalues of $(\gamma T_{\pi}-I)$ have (strictly) negative real parts [50]. Denote this asymptotically stable equilibrium by $\omega_{i}(\bm{\theta})$. ∎

We can now analyze the behavior of the actor, assuming that the critic parameters have converged to an asymptotically stable equilibrium. With $\omega_{i}(\bm{\theta})$ a limit point of the critic update, let $\delta_{t,\omega_{i}(\bm{\theta})}^{i}=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}(\bm{\theta})}(s_{t+1})-V_{\omega_{i}(\bm{\theta})}(s_{t})$. When using look-ahead or look-back advice, define $\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}$ as:

\begin{align}
\text{look-ahead:}\quad&\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}:=\delta_{t,\omega_{i}(\bm{\theta})}^{i}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})\nonumber\\
\text{look-back:}\quad&\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}:=\delta_{t,\omega_{i}(\bm{\theta})}^{i}. \quad (20)
\end{align}

Let $\{\mathcal{F}^{\theta}_{t}\}$ be a filtration where $\mathcal{F}^{\theta}_{t}:=\sigma(\bm{\theta}_{\tau}:=[\theta_{1_{\tau}}\dots\theta_{n_{\tau}}]:\tau\leq t)$ is an increasing $\sigma$-algebra generated by iterates of $\theta_{i}$ up to time $t$.

Theorem V.4.

The update $\theta_{i}\leftarrow\Gamma_{i}[\theta_{i}+\alpha_{t}^{\theta}\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})]$ converges almost surely to the set of asymptotically stable equilibria of the ODE $\dot{\theta}_{i}=\hat{\Gamma}_{i}(h_{i}(\theta_{i}))$, where $h_{i}(\theta_{i})=\mathbb{E}_{\bm{\pi_{\theta}}}[\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})|\mathcal{F}^{\theta}_{t}]$.

Proof.

Let $\xi^{i}_{t,1}:=\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})-\mathbb{E}_{\bm{\pi_{\theta}}}[\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})|\mathcal{F}^{\theta}_{t}]$ and $\xi^{i}_{t,2}:=\mathbb{E}_{\bm{\pi_{\theta}}}[(\tilde{\delta}^{i}_{t}-\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})})\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})|\mathcal{F}^{\theta}_{t}]$. Then, the update of $\theta_{i}$ can be written as $\theta_{i}\leftarrow\theta_{i}+\alpha^{\theta}_{t}[h_{i}(\theta_{i})+\xi^{i}_{t,1}+\xi^{i}_{t,2}]$, where $h_{i}(\theta_{i})$ is continuous in $\theta_{i}$. We now need to verify that the conditions in Lemma V.2 are satisfied.

Since the critic parameters converge almost surely to a fixed point, $\tilde{\delta}^{i}_{t}-\tilde{\delta}^{i}_{t,\omega_{i}(\bm{\theta})}\rightarrow 0$ almost surely. Therefore, $\xi^{i}_{t,2}\rightarrow 0$ almost surely, verifying Condition iii) in Lemma V.2.

Since $\tilde{\delta}^{i}_{t}$ and $\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})$ are bounded, $\xi^{i}_{t,1}$ is continuous in $\theta_{i}$, and $\theta_{i}$ belongs to a compact set, the sequence $\{\xi^{i}_{t,1}\}$ is bounded almost surely [51]. If $M^{i}_{t}:=\sum_{\tau=0}^{t}\alpha^{\theta}_{\tau}\xi^{i}_{\tau,1}$, then $\{M^{i}_{t}\}$ is a martingale, and $\sum_{t}||M^{i}_{t}-M^{i}_{t-1}||^{2}=\sum_{t}||\alpha^{\theta}_{t}\xi^{i}_{t,1}||^{2}<\infty$ almost surely. Then, $\{M^{i}_{t}\}$ converges almost surely [49], satisfying Condition ii) of Lemma V.2. Condition i) is true by assumption, completing the proof. ∎

Theorems V.3 and V.4 demonstrate the convergence of critic and actor parameters in the stochastic game with the shaped reward, $\mathcal{G}^{\prime}$. However, our objective is to provide a guarantee of convergence in the original game $\mathcal{G}$. We establish such a guarantee when the parameterization of the value function results in small errors and the policy gradients in $\mathcal{G}^{\prime}$ are bounded.

Definition V.5.

For a probability measure $\mu$ on a finite set $\mathcal{M}$, the $\ell_{2}$-norm of a function $f$ with respect to $\mu$ is defined as $||f||_{\mu}:=\big[\int_{\mathcal{M}}|f(X)|^{2}d\mu(X)\big]^{\frac{1}{2}}=\big[\mathbb{E}_{\mu}(|f(X)|^{2})\big]^{\frac{1}{2}}$.

Theorem V.6.

In the stochastic game $\mathcal{G}^{\prime}$, let $(\gamma+1)||V_{i}^{\pi_{\bm{\theta}}}(s)-V_{\omega_{i}(\bm{\theta})}(s)||_{\pi_{\bm{\theta}}}\leq\mathcal{E}_{i}(\bm{\theta})$, and let $||\nabla_{\theta_{i}}\log\pi_{\theta_{i}}||_{\pi_{\bm{\theta}}}\leq C_{i}(\bm{\theta})$. Let $(\bm{\theta}^{*},\omega(\bm{\theta})^{*})$ be the set of limit points of SAM.

Then, in the original stochastic game $\mathcal{G}$, for each agent $i$, $||\nabla_{\theta_{i}}J_{i}(\bm{\theta}^{*})||_{2}\leq C_{i}(\bm{\theta}^{*})\mathcal{E}_{i}(\bm{\theta}^{*})$.

Proof.

Let $\Theta_{i_{eq}}$ denote the set of asymptotically stable equilibria of the ODE in $\theta_{i}$. Let $\Theta_{eq}:=\Theta_{1_{eq}}\times\dots\times\Theta_{n_{eq}}$. Then, in the set $\Theta_{eq}$, $\dot{\theta}_{i}=0$ for each agent $i$.

Consider a policy $\pi_{\bm{\theta}}$, $\bm{\theta}\in\Theta_{eq}$. In the original game $\mathcal{G}$,

\[
\nabla_{\theta_{i}}J_{i}(\bm{\theta})=\mathbb{E}_{\bm{\pi_{\theta}}}[\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a^{i}_{t}|o^{i}_{t})Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})] \quad (21)
\]

From Equation (16), $[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}}=[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$. Since we use an advantage actor-critic, we replace $[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}$ with an advantage term, defined as $[Q^{\bm{\pi_{\theta}}}_{i}(s_{t},a^{i}_{t},a^{-i}_{t})]_{\mathcal{G}^{\prime}}-V^{\bm{\pi_{\theta}}}_{i}(s_{t})$. Substituting these quantities in Equation (21),

\begin{align}
\nabla_{\theta_{i}}J_{i}(\bm{\theta})=\mathbb{E}_{\bm{\pi_{\theta}}}\big[&\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a^{i}_{t}|o^{i}_{t})\cdot\big(r^{i}_{t}+F^{i}_{t}+\gamma V_{i}^{\pi_{\bm{\theta}}}(s_{t+1})\nonumber\\
&-V^{\bm{\pi_{\theta}}}_{i}(s_{t})+\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})\big)\big] \quad (22)
\end{align}

At equilibrium, $\dot{\theta}_{i}=0$ in Equation (19). Subtracting this from Equation (22),

\begin{align}
\nabla_{\theta_{i}}J_{i}(\bm{\theta})-\dot{\theta}_{i}&=\nabla_{\theta_{i}}J_{i}(\bm{\theta})\nonumber\\
&=\mathbb{E}_{\bm{\pi_{\theta}}}\big[\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a^{i}_{t}|o^{i}_{t})\cdot\big(\gamma(V_{i}^{\pi_{\bm{\theta}}}(s_{t+1})-V_{\omega_{i}(\bm{\theta})}(s_{t+1}))\nonumber\\
&\qquad-(V_{i}^{\pi_{\bm{\theta}}}(s_{t})-V_{\omega_{i}(\bm{\theta})}(s_{t}))\big)\big]
\end{align}

Using the Cauchy-Schwarz inequality,

$\|\nabla_{\theta_{i}}J_{i}(\bm{\theta}^{*})\|_{2}\leq|\gamma+1|\cdot\|V_{i}^{\pi_{\bm{\theta}}}(s)-V_{\omega_{i}(\bm{\theta})}(s)\|_{\pi_{\bm{\theta}}}\cdot\|\nabla_{\theta_{i}}\log\pi_{\theta_{i}}\|_{\pi_{\bm{\theta}}}\leq C_{i}(\bm{\theta}^{*})\mathcal{E}_{i}(\bm{\theta}^{*})$ (23)

Each term on the right-hand side of Equation (23) is bounded. Thus, $J_{i}(\bm{\theta})$ converges for each agent $i$ in the original game $\mathcal{G}$, even though policies are synthesized in the modified game $\mathcal{G}^{\prime}$. ∎

Theorem V.6 demonstrates that the additional reward $F^{i}$ provided by SAM to guide the agents does not distract them from accomplishing the task objective originally specified by the environment reward $R^{i}$.

Under similar assumptions, we obtain Corollary V.7 for the single-agent case, and show guarantees on the convergence of Algorithm 1 using the theory of two time-scale stochastic approximation. Corollary V.7 bounds the error introduced by approximating the value function $V_{M^{\prime}}$ with $V_{M^{\prime}}^{\omega}$ in the MDP $M^{\prime}$. This error term is small if the linear function family for $V_{M^{\prime}}^{\omega}$ is rich. In fact, if the critic is updated in batches, a tighter bound can be achieved, as shown in Proposition 1 of [52].

Corollary V.7.

Let $\mathcal{E}(\theta):=\left\lVert V^{\omega(\theta)}_{M^{\prime}}(s)-V^{\pi_{\theta}}_{M^{\prime}}(s)\right\rVert_{\rho^{\pi_{\theta}}}$. Then, for any limit point $(\theta^{*},\omega^{*}):=\lim_{T_{max}\to\infty}(\theta_{T_{max}},\omega_{T_{max}})$ of Algorithm 1, $\left\lVert\nabla_{\theta}J_{M}(\theta^{*})\right\rVert_{2}\leq C\,\mathcal{E}(\theta^{*})$.

Algorithm 2 describes SAM. The shaping advice is specified as a difference of potential functions (Line 15), and is added to the reward received from the environment. We use an advantage-based actor-critic, with the TD-error serving as an estimate of the advantage (Line 16). This estimate is used to update the actor and critic parameters of each agent (Lines 18-19).

Algorithm 2 SAM: Shaping Advice in Multi-agent RL
0:  For each agent $i$: differentiable policy $\pi_{\theta_{i}}$; differentiable value function $V_{\omega_{i}}$; shaping advice $\phi_{i}(s,a^{i},a^{-i})$. Maximum episode length $T_{max}$. Initialization: policy and value function parameters $\theta_{i},\omega_{i}$ for all agents $i$, learning rates $\alpha^{\theta},\alpha^{\omega}$.
1:  $T=0$
2:  repeat
3:     $t\leftarrow-1$; $\phi_{i}(s_{-1},a^{i}_{-1},a^{-i}_{-1})=0$ for all $i$
4:     Initialize information $s_{0}=[o_{0}^{1},\dots,o_{0}^{n}]$
5:     repeat
6:        $t\leftarrow t+1$
7:        for agent $i=1$ to $n$ do
8:           sample $a^{i}_{t}\sim\pi_{\theta_{i}}(\cdot|o^{i}_{t})$
9:        end for
10:        Take action $a_{t}=[a^{1}_{t},\dots,a^{n}_{t}]$, observe new information $s_{t+1}$, and obtain reward $r^{i}_{t}$ for each agent. Use $a_{t}$ to determine $\phi_{i}(s_{t},a_{t})$ for all agents
11:        if $s_{t+1}$ is terminal then
12:           $V_{\omega_{i}}(s_{t+1})=0$
13:        end if
14:        for agent $i=1$ to $n$ do
15:           compute $F^{i}_{t}$ based on Equations (5) and (6)
16:           TD-error: $\delta^{i}_{t}:=r^{i}_{t}+F^{i}_{t}+\gamma V_{\omega_{i}}(s_{t+1})-V_{\omega_{i}}(s_{t})$
17:           compute $\tilde{\delta}^{i}_{t}$ based on Equation (20)
18:           Update actor: $\theta_{i}\leftarrow\Gamma_{i}[\theta_{i}+\alpha_{t}^{\theta}\tilde{\delta}^{i}_{t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{t}^{i}|o^{i}_{t})]$
19:           Update critic: $\omega_{i}\leftarrow\omega_{i}-\alpha_{t}^{\omega}\delta^{i}_{t}\nabla_{\omega_{i}}V_{\omega_{i}}(s_{t})$
20:        end for
21:     until $s_{t+1}$ is terminal
22:     $T\leftarrow T+1$
23:  until $T>T_{max}$
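To make the per-step computation concrete, a minimal single-agent, tabular sketch of Lines 15-19 is given below; the tabular softmax actor, the toy problem sizes, taking $\Gamma_{i}$ to be the identity, and using the raw TD-error in place of $\tilde{\delta}^{i}_{t}$ from Equation (20) are illustrative assumptions (the experiments in Section VI use neural network approximators).

```python
import numpy as np

n_states, n_actions = 5, 3
gamma, alpha_theta, alpha_omega = 0.95, 0.2, 0.01

theta = np.zeros((n_states, n_actions))   # actor parameters, one per (s, a) pair
omega = np.zeros(n_states)                # tabular critic, one value per state

def policy(s):
    """Softmax policy pi_theta(.|s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sam_update(s, a, r, s_next, F, terminal):
    """One pass through Lines 15-19 for a single agent.
    F is the shaping advice, a difference of potentials (Eq. (5)/(6))."""
    v_next = 0.0 if terminal else omega[s_next]
    delta = r + F + gamma * v_next - omega[s]        # TD-error (Line 16)
    # Actor step (Line 18, with Gamma_i the identity and delta used
    # in place of delta-tilde).
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0                            # gradient of log-softmax
    theta[s] += alpha_theta * delta * grad_log_pi
    # Critic step: gradient descent on the squared TD-error.
    omega[s] += alpha_omega * delta

sam_update(s=0, a=1, r=0.0, s_next=2, F=0.3, terminal=False)
```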
Remark V.8.

We note that our objective is to maximize the rewards obtained by agents equipped with shaping advice. Our algorithms are said to converge when the values of these rewards reach a ‘steady state’. This is distinct from game-theoretic notions of equilibrium (e.g., Nash equilibrium), as studied in [53, 54]. The flavor of convergence analysis that we adopt also allows better interpretability of our experimental evaluations, reported in Section VI.

VI Experiments

Our experiments study the incorporation of shaping advice in both single-agent and multi-agent environments. The code for our experimental evaluations is available at https://github.com/baicenxiao/Shaping-Advice.

VI-A Shaping advice for single-agent RL

In the single-agent case, we compare the performance of actor-critic without shaping advice, SAS-Uniform, and SAS-NonUniform. For both SAS-Uniform and SAS-NonUniform, actor-critic [1] is adopted as the base RL algorithm. We consider two setups. The first is a puddle-jump gridworld [55], where the state and action spaces are discrete. The second is a mountain car environment with continuous state and action spaces [56]. In each experiment, we compare the rewards received by the agent under the following schemes: i) actor-critic with sparse rewards (Sparse); ii) SAS-Uniform; iii) SAS-NonUniform. We apply SAS-NonUniform in the look-back manner, since this does not require estimating future actions.

VI-A1 Puddle-Jump Gridworld

Figure 2: Schematic of the puddle-jump gridworld. The state of the agent is its position $(x,y)$. The shaded row (row $2$) represents the puddle the agent should jump over. The two blue cells denote states that are indistinguishable to the agent. The agent can choose an action from the set $\{up,down,left,right,jump\}$ at each step.

Figure 2 depicts the puddle-jump gridworld environment as a $10\times 10$ grid. The state $s=(x,y)$ denotes the position of the agent in the grid, where $x,y\in\{0,1,\dots,9\}$. The goal of the agent is to navigate from the start state $S=(0,0)$ to the goal $G=(9,9)$. At each step, the agent can choose an action from the set $A=\{up,down,left,right,jump\}$. There is a puddle along row $2$ which the agent should jump over. Further, the states $(9,8)$ and $(8,9)$ (blue squares in Figure 2) are indistinguishable to the agent. As a result, any optimal policy for the agent will be a stochastic policy.

If the $jump$ action is chosen in row $3$ or row $1$, the agent lands on the other side of the puddle with probability $p_{j}$, and remains in the same state otherwise. The $jump$ action chosen in any other row keeps the agent in its current state. Any action that would move the agent off the grid leaves its state unchanged. The agent receives a reward of $-0.05$ for each action, and $+1000$ for reaching $G$.
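For concreteness, a minimal sketch of these dynamics follows; treating the row index as the $y$-coordinate and treating the puddle row as impassable to ordinary moves are modeling assumptions consistent with the description above, not the authors' implementation.

```python
import random

GOAL = (9, 9)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(s, a, p_j=0.2):
    """One transition of the puddle-jump gridworld; returns (s_next, reward, done)."""
    x, y = s
    if a == "jump":
        # Jumping from row 1 or row 3 clears the puddle (row 2) with probability p_j;
        # jumping from any other row leaves the state unchanged.
        if y == 1 and random.random() < p_j:
            y = 3
        elif y == 3 and random.random() < p_j:
            y = 1
    else:
        dx, dy = MOVES[a]
        nx, ny = x + dx, y + dy
        # Off-grid moves keep the state unchanged; barring ordinary moves into
        # the puddle row is an assumption.
        if 0 <= nx <= 9 and 0 <= ny <= 9 and ny != 2:
            x, y = nx, ny
    done = (x, y) == GOAL
    return (x, y), -0.05 + (1000.0 if done else 0.0), done
```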

When using SAS-Uniform, we set $\phi^{U}(s):=u_{0}$ for states in rows $0$ and $1$, and $\phi^{U}(s):=u_{1}$ for all other states. We need $u_{1}>u_{0}$ to encourage the agent to jump over the puddle. Unlike SAS-Uniform, SAS-NonUniform can provide the agent with more information about the actions it can take. We set $\phi^{NU}(s,a)$ to a ‘large’ value if action $a$ at state $s$ results in the agent moving closer to the goal according to the $\ell_{1}$ norm, $(|G_{x}-x|+|G_{y}-y|)$. We additionally stipulate that $\frac{1}{|A|}\sum_{a\in A}\phi^{NU}(s,a)=\phi^{U}(s)$. That is, the state potential of SAS-NonUniform is the same as that of SAS-Uniform under a uniform distribution over actions. This ensures a fair comparison between SAS-Uniform and SAS-NonUniform.
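A minimal sketch of these two potential functions is given below; the values $u_{0},u_{1}$, the bonus scale $c$, and the 0/1 ‘closer-to-goal’ score are assumptions, chosen only so that $u_{1}>u_{0}$ and the action-average of $\phi^{NU}(s,\cdot)$ equals $\phi^{U}(s)$.

```python
import numpy as np

GOAL = (9, 9)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0),
         "right": (1, 0), "jump": (0, 2)}
u0, u1, c = 1.0, 5.0, 2.0          # assumed values with u1 > u0

def phi_U(s):
    """Uniform potential: lower value below the puddle (rows 0 and 1)."""
    return u0 if s[1] in (0, 1) else u1

def l1_to_goal(s):
    return abs(GOAL[0] - s[0]) + abs(GOAL[1] - s[1])

def phi_NU(s, a):
    """Non-uniform potential: bonus for actions that reduce the l1 distance
    to G, centered so that the action-average equals phi_U(s)."""
    def raw(action):
        dx, dy = MOVES[action]
        nxt = (min(max(s[0] + dx, 0), 9), min(max(s[1] + dy, 0), 9))
        return 1.0 if l1_to_goal(nxt) < l1_to_goal(s) else 0.0
    mean_raw = np.mean([raw(b) for b in MOVES])
    return phi_U(s) + c * (raw(a) - mean_raw)

# The stipulated constraint holds by construction:
s = (2, 5)
assert abs(np.mean([phi_NU(s, a) for a in MOVES]) - phi_U(s)) < 1e-9
```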

In our experiments, we set the discount factor $\gamma=1$. Since the dimensions of the state and action spaces are not large, we do not use a function approximator for the policy $\pi$. A parameter $\theta_{s,a}$ is associated with each state-action pair, and the policy is computed as $\pi_{\theta}(a|s)=\frac{\exp(\theta_{s,a})}{\sum_{a^{\prime}\in A}\exp(\theta_{s,a^{\prime}})}$. We fix $\alpha^{\omega}=0.001$ and $\alpha^{\theta}=0.2$ in all cases.

From Figure 3, we observe that the SAS-NonUniform scheme performs best, in that the agent converges to the goal in five times fewer episodes ($25$ vs. $125$ episodes) than A2C without advice (Sparse). When A2C is augmented with SAS-Uniform, convergence to the goal is slightly faster than without any reward shaping.

Figure 3: Average rewards in the puddle-jump gridworld when the jump success probability $p_{j}=0.2$. The baseline is advantage actor-critic (A2C) without advice.

A smaller jump success probability $p_{j}$ indicates that it is more difficult for the agent to reach the goal state $G$. Figure 4 shows that SAS-NonUniform results in the highest reward for the more difficult task (lower $p_{j}$), when compared with the other reward schemes.

Figure 4: Average reward over the first 100 episodes as a function of the jump success probability $p_{j}$.

VI-A2 Continuous Mountain Car

In the mountain car (MC) environment, an under-powered car in a valley has to drive up a steep hill to reach the goal. In order to achieve this, the car should learn to accumulate momentum. A schematic for this environment is shown in Figure 5.

Figure 5: Schematic of the mountain-car environment. The agent's state is its position $p_{t}$ (along the $x$-coordinate) and velocity $v_{t}$. The action $a_{t}$ is a force applied to the car. The goal is marked by a flag.

This environment has continuous state and action spaces. The state $s=(p,v)$ comprises the position $p\in[-1.2,0.6]$ and the velocity $v\in[-0.07,0.07]$. The action $a\in[-1,+1]$. The continuous action space makes it difficult to use classic value-based methods such as Q-learning and SARSA. The reward provided by the environment depends on the action and on whether the car reaches the goal. Specifically, the car receives $+100$ once it reaches the goal; before that, the reward at time $t$ is $-|a_{t}|^{2}$. This reward structure discourages wasting energy, but it also acts as a barrier to learning, because there is a sub-optimal solution in which the agent remains at the bottom of the valley. Moreover, the reward for reaching the goal is significantly delayed, which makes it difficult for the conventional actor-critic algorithm to learn a good policy.

One choice of potential function for SAS-Uniform in this environment is $\phi^{U}(s_{t}):=p_{t}+P$. Since the position $p_{t}$ takes values in $[-1.2,0.6]$, the offset $P$ is chosen so that the potential $\phi^{U}(s_{t})$ is always positive; we use $P=2$. An interpretation of this scheme is: ‘the state value is larger when the car is horizontally closer to the goal.’ The SAS-NonUniform scheme we use for this environment encourages the car to accumulate momentum: the direction of the action is encouraged to match the current direction of the car's velocity. At the same time, we discourage inaction, so the potential advice function takes a larger value if $a_{t}\neq 0$. We let $\phi^{NU}(s_{t},a_{t})=1$ if $a_{t}v_{t}>0$, and $\phi^{NU}(s_{t},a_{t})=0$ otherwise.
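A minimal sketch of these two potentials is given below; the function names are illustrative, and the values follow the text.

```python
P = 2.0   # offset keeping phi_U positive, since p lies in [-1.2, 0.6]

def phi_U(p, v):
    """'The state value is larger when the car is horizontally closer to the goal.'"""
    return p + P

def phi_NU(p, v, a):
    """Reward actions aligned with the current velocity; inaction gets no bonus."""
    return 1.0 if a * v > 0 else 0.0
```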

In our experiments, we set $\gamma=0.99$. To deal with the continuous state space, we use a neural network (NN) as a function approximator. The policy distribution $\pi_{\theta}(a|s)$ is approximated by a normal distribution, the mean and variance of which are the outputs of the NN. The value function is also represented by an NN. We set $\alpha^{\theta}=1\times 10^{-5}$ and $\alpha^{\omega}=5.6\times 10^{-4}$, and use Adam [57] to update the NN parameters. The results we report are averaged over 10 different environment seeds.
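A minimal sketch of such a Gaussian policy head is shown below; the layer sizes, the softplus parameterization of the spread, and clamping the sampled action to $[-1,1]$ are assumptions, since the text only states that the NN outputs the mean and variance.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.std_head = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, obs):
        h = self.body(obs)
        # Normal distribution whose mean and spread are network outputs
        return torch.distributions.Normal(self.mean_head(h), self.std_head(h) + 1e-5)

policy = GaussianPolicy()
dist = policy(torch.tensor([[0.3, 0.01]]))   # state (position, velocity)
action = dist.sample().clamp(-1.0, 1.0)      # force applied to the car
log_prob = dist.log_prob(action)             # used in the actor update
```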

Figure 6: Average rewards for the continuous mountain car problem (averaged over 10 different environment random seeds). The baseline is A2C without advice.

Our experiments indicate that the learned policy drives the agent to one of two outcomes: reaching the goal, or remaining stationary at the bottom of the valley. When learning with the base algorithm (A2C with sparse rewards), the agent reached the goal in only $10\%$ of the trials (out of 10 trials), and was stuck at the sub-optimal solution in the remaining trials. With SAS-Uniform, the agent converged correctly in only $20\%$ of the trials. This is because the agent may have to take an action that moves it away from the goal in order to accumulate momentum, and the potential function $\phi^{U}(\cdot)$ discourages such actions. SAS-NonUniform performs best: the agent reached the goal in $100\%$ of the trials. Figure 6 reflects these observations; SAS-NonUniform results in the agent obtaining the highest rewards.

Task | $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$: SAM-Uniform | $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t})$: SAM-NonUniform
CN | $\alpha_{1}\exp(-\beta_{1}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$ | $-M_{1}\theta_{a^{i}_{t}L_{i}}+\alpha_{2}\exp(-\beta_{2}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$
PD | $\alpha_{3}\exp(-\beta_{3}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$ | $-M_{2}\theta_{a^{i}_{t}L_{i}}+\alpha_{4}\exp(-\beta_{4}\sum_{j=1}^{N}dist(s_{t}^{j},L_{j}))$
PP | $\alpha_{5}\exp(-\beta_{5}\sum_{j=1}^{N}dist(s_{t}^{pred_{j}},s_{t}^{prey}))$ | $-M_{3}\sum_{j=1}^{N}\theta_{a^{pred_{j}}_{t}s_{t}^{prey}}+\alpha_{6}\exp(-\beta_{6}\sum_{j=1}^{N}dist(s_{t}^{pred_{j}},s_{t}^{prey}))$
TABLE I: Shaping advice $F^{i}_{t}$ provided by SAM is given by Equation (5) or (6). The table lists the potential functions used in the Cooperative Navigation (CN), Physical Deception (PD), and Predator-Prey (PP) tasks. $L_{j}$ is the landmark to which agent $j$ is anchored. $dist(\cdot,\cdot)$ denotes the Euclidean distance. $\theta_{a^{j}_{t}L_{j}}\in[0,\pi]$ is the angle between the direction of the action taken by agent $j$ and the vector from its current position to $L_{j}$. In SAM-Uniform, the advice is the same for every action at a given $s_{t}$. In SAM-NonUniform, agents are additionally penalized if their actions are not in the direction of their target. In each case, $F^{i}_{t}$ is positive when an agent takes actions that move it towards its target.

VI-B Shaping Advice for Multi-agent RL

This section describes the multi-agent tasks that we evaluate SAM on, and these include tasks with cooperative and competitive objectives. In each case, the rewards provided to the agents are sparse, which affects the agents’ ability to obtain immediate feedback on the quality of their actions at each time-step. Shaping advice provided by SAM is used to guide the agents to obtain higher rewards than in the case without advice. We conclude the section by presenting the results of our experiments evaluating SAM on these tasks.

VI-B1 Task Descriptions and Shaping Advice

Figure 7: Representations of tasks from the Particle World environment [17] that we study. (Left to right) Predator-Prey (PP), Cooperative Navigation (CN), and Physical Deception (PD). In PP, predators (red) seek to catch the prey (green) while avoiding obstacles (grey). In CN, agents (green) each seek to navigate to a different landmark ($\times$) and are penalized for collisions with each other. In PD, one of the agents (green) must reach the true landmark (red $\times$), while preventing the adversary from reaching it. In all tasks, rewards are sparse: agents receive a reward or penalty only when a corresponding reachability or collision criterion is satisfied.

We examine three tasks from the Particle World environment [17] where multiple agents share a two-dimensional space with continuous states and actions. An illustration of the tasks is shown in Figure 7, and we describe them below.

Predator-Prey: This task has $N$ predator agents who cooperate to capture $1$ faster-moving prey. Predators are rewarded when one of them collides with the prey, while the prey is penalized for the collision. The reward at all other times is zero. Two landmarks impede the movement of the agents.

Cooperative Navigation: This task has $N$ agents and $N$ landmarks. Each agent is rewarded $r$ when an agent reaches a landmark, and penalized for collisions with other agents. The reward at all other times is zero. Therefore, the maximum reward the agents can obtain is $rN$. Thus, agents must learn to cover the landmarks while not colliding with each other.

Physical Deception: This task has $1$ adversary, $N$ agents, and $N$ landmarks. Only one landmark is the true target. Agents are rewarded when any one of them reaches the target, and penalized if the adversary reaches the target. At all other times, the agents get a reward of zero. The adversary also wants to reach the target, but it does not know which landmark is the true target. Thus, agents have to learn to split up and cover the landmarks to deceive the adversary.

In each environment, SAM provides shaping advice to guide agents to obtain a higher positive reward. This advice is added to the reward received from the environment. The advice is a heuristic given by a difference of potential functions (Equation (5) or (6)), and only needs to be specified once at the start of the training process.

In the Cooperative Navigation and Physical Deception tasks, we anchor each agent to a (distinct) landmark. The shaping advice then depends on the distance of an agent to the landmark it is anchored to. Although distances computed in this manner depend on the order in which the agents and landmarks are chosen, we observe that this scheme works well empirically across multiple training episodes in which the positions of landmarks and the initial positions of agents are generated randomly. The advice provided by SAM is positive when agents move closer to the landmarks they are anchored to. In the absence of anchoring, agents may get distracted and move towards different landmarks at different time-steps. Anchoring results in agents learning to cover the landmarks faster.

We consider two variants of advice for each task. In SAM-Uniform, the advice for every action taken is the same. In SAM-NonUniform, a higher weight is given to some ‘good’ actions over others for each $s_{t}$. We enumerate the advice for each task in Table I. We use MADDPG as the base RL algorithm [17]. We compare the performance of agents trained with SAM (SAM-Uniform or SAM-NonUniform) to the performance of agents trained using the sparse reward from the environment. We also compare SAM with a state-of-the-art reward redistribution technique introduced in [19] called Iterative Relative Credit Assignment (IRCR).
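As an illustration of the Table I potentials, a minimal sketch for the Cooperative Navigation task is given below; passing positions explicitly, anchoring agent $j$ to landmark $L_{j}$ by index, and computing the angle via an arccosine of the normalized inner product are implementation assumptions, not the authors' code.

```python
import numpy as np

def phi_uniform(agent_pos, landmarks, alpha1=100.0, beta1=1.0):
    """SAM-Uniform potential for CN: alpha1 * exp(-beta1 * sum_j dist(s^j, L_j))."""
    dists = [np.linalg.norm(p - L) for p, L in zip(agent_pos, landmarks)]
    return alpha1 * np.exp(-beta1 * sum(dists))

def phi_nonuniform(i, action_i, agent_pos, landmarks,
                   alpha2=100.0, beta2=1.0, M1=1.0, eps=1e-8):
    """SAM-NonUniform potential for agent i: -M1 * theta plus the uniform term."""
    to_landmark = landmarks[i] - agent_pos[i]
    cos = np.dot(action_i, to_landmark) / (
        np.linalg.norm(action_i) * np.linalg.norm(to_landmark) + eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))   # angle in [0, pi]
    return -M1 * theta + phi_uniform(agent_pos, landmarks, alpha2, beta2)

# Example with N = 3 agents, each anchored to the landmark with the same index.
agents = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
marks = [np.array([0.5, 0.0]), np.array([1.0, 2.0]), np.array([-1.0, 1.5])]
print(phi_nonuniform(0, action_i=np.array([1.0, 0.0]), agent_pos=agents, landmarks=marks))
```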

VI-B2 Implementation details

When we tested SAM-NonUniform, we used look-back advice following Equation (6). This was done to avoid estimating a ‘future’ action, i.e., $a^{i}_{t+1}$, at each time-step (since the replay buffer contains tuples of the form $(s_{t},a^{1}_{t},\dots,a^{n}_{t},r^{1}_{t},\dots,r^{n}_{t},s_{t+1})$). Noisy estimates of $a^{i}_{t+1}$ can cause oscillations in the rewards obtained by the agent.

We adopt the same hyperparameter values and network architectures as used in [17]. We let $\Gamma_{i}$ in Line 18 of Algorithm 2 be the identity matrix. We list the values of $\alpha_{\cdot},\beta_{\cdot},M_{\cdot}$ from Table I that were used in our experiments. Sparse reward setting: for the results reported in Figures 8 and 9, we use the following parameters for the shaping advice:

CN ($N=3$) | $\alpha_{1}=\alpha_{2}=100$ | $\beta_{1}=\beta_{2}=1$ | $M_{1}=1$
CN ($N=6$) | $\alpha_{1}=\alpha_{2}=1000$ | $\beta_{1}=\beta_{2}=1$ | $M_{1}=10$
PD ($N=2$) | $\alpha_{3}=\alpha_{4}=500$ | $\beta_{3}=\beta_{4}=1$ | $M_{2}=1$
PD ($N=4$) | $\alpha_{3}=\alpha_{4}=500$ | $\beta_{3}=\beta_{4}=1$ | $M_{2}=10$
PP ($N=3$) | $\alpha_{5}=\alpha_{6}=100$ | $\beta_{5}=\beta_{6}=1$ | $M_{3}=1$

Other forms of shaping rewards: In the three tasks that we evaluated, we observed that ‘linear’ distance-based advice (i.e., advice of the form $\phi_{i}(s_{t},a^{i}_{t},a^{-i}_{t}):=\sum_{j}dist(s_{t}^{j},L_{j})$) did not work. From Equations (5) and (6), with this form of advice an agent receives the same additional reward for a step towards the target regardless of its distance to the target. For example, an agent $100$ steps from the target would receive the same shaping advice for one step towards the target as an agent that takes such a step from a state $50$ steps from the target.
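To make this concrete, the following short calculation compares the two forms of potential (a sketch taking $\gamma\approx 1$ so that the advice reduces to $\phi(s_{t})-\phi(s_{t-1})$, suppressing action arguments, writing $d_{t}$ for an agent's distance to its target, $\Delta:=d_{t-1}-d_{t}>0$ for the step size, and choosing signs so that approaching the target is rewarded):

```latex
% Linear potential: the advice depends only on the step size \Delta.
\phi_{\mathrm{lin}}(s_t) := -d_t
\quad\Rightarrow\quad
F_t \approx d_{t-1} - d_t = \Delta .

% Exponential potential (Table I): the same step is worth more near the target.
\phi_{\exp}(s_t) := \alpha e^{-\beta d_t}
\quad\Rightarrow\quad
F_t \approx \alpha\bigl(e^{-\beta d_t} - e^{-\beta d_{t-1}}\bigr)
          = \alpha e^{-\beta d_{t-1}}\bigl(e^{\beta\Delta} - 1\bigr).
```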

VI-B3 Results

Figure 8: Comparison between SAM (SAM-NonUniform (blue) or SAM-Uniform (purple)) augmented to MADDPG and classical MADDPG policies (orange) on cooperative and competitive tasks with sparse rewards. The score for a task is the average agent reward in cooperative tasks, and the average agent advantage ($=$ agent reward $-$ adversary reward) in competitive tasks. Each bar cluster shows normalized $0$-$1$ scores, averaged over the last $1000$ training episodes. A higher score is better. SAM-NonUniform outperforms SAM-Uniform and the classical MADDPG baseline by a larger margin when there are more agents in the cooperative navigation and physical deception tasks. The scores of IRCR are not shown here, since it performs consistently worse than the other approaches in these three tasks.
(a) Cooperative Navigation ($N=6$)
(b) Physical Deception ($N=4$)
(c) Predator-Prey ($3$ pred., $1$ prey)
Figure 9: Average and variance of scores when agents use SAM-NonUniform (blue), SAM-Uniform (purple), IRCR (green), and sparse rewards (orange). SAM-NonUniform results in the highest average scores. SAM-Uniform compares favorably, and both significantly outperform agents trained using only sparse rewards. IRCR is not able to guide agents to obtain higher rewards in all three tasks.

Figure 8 shows normalized $0$-$1$ scores, averaged over the last 1000 training episodes, comparing MADDPG augmented with SAM against classical MADDPG policies. The score for a task is the average agent reward in cooperative tasks, and the average agent advantage ($=$ agent reward $-$ adversary reward) in competitive tasks [58]. Agents equipped with SAM-NonUniform have the best performance. This is because SAM-NonUniform provides specific feedback on the quality of agents' actions. SAM-Uniform also performs well in these tasks. SAM-NonUniform outperforms SAM-Uniform and the classical MADDPG baseline by a significant margin when there are more agents.

In cooperative navigation with sparse rewards, agents using the MADDPG baseline are not able to learn policies that even partially cover the landmarks. In comparison, SAM guides agents to learn to adapt to each other's policies and cover all the landmarks. SAM-NonUniform results in much higher rewards than the other methods in the more complex task with $N=6$ agents.

We observe a similar phenomenon in physical deception, where SAM guides agents to learn policies that cover the landmarks. This behavior helps deceive the adversary and keep it away from the true landmark, thereby resulting in lower rewards for the adversary. Therefore, the agent advantage is higher with SAM.

In the predator-prey task, we see that the performance of SAM is comparable to that of MADDPG. We believe this is because the task might have a well-defined and unique equilibrium to which agent policies eventually converge.

Figure 9 shows the average and variance of the score during different stages of the training process. The score for a task is the average agent reward in cooperative tasks, and the average agent advantage ($=$ agent reward $-$ adversary reward) in competitive tasks [58]. In terms of agent scores averaged over the last $1000$ training episodes, agents equipped with SAM-NonUniform have the best performance. This is because SAM-NonUniform provides specific feedback on the quality of agents' actions. SAM-Uniform also performs well in these tasks.

In cooperative navigation, when agents use only the sparse rewards from the environment, they are not able to learn policies that even partially cover the landmarks. In comparison, SAM guides agents to learn to adapt to each other's policies and cover all the landmarks. A similar phenomenon is observed in physical deception, where SAM guides agents to learn policies that cover the landmarks. This behavior helps deceive the adversary and keep it away from the true landmark, thereby resulting in lower final rewards for the adversary.

We additionally compare the performance of SAM with a technique called IRCR that was introduced in [19]. We observe that agents using IRCR receive the lowest scores in all three tasks. We believe that a possible reason for this is that in each training episode, IRCR accumulates rewards till the end of the episode, and then uniformly redistributes the accumulated reward along the length of the episode. A consequence of this is that an agent may find it more difficult to identify the time-step at which it reaches a landmark or a collision occurs. For example, in the Predator-Prey task, suppose that the length of an episode is $T_{ep}$. Consider a scenario where one of the predators collides with the prey at a time $T<T_{ep}$, and subsequently moves away from the prey. When IRCR is applied to this scenario, the redistributed reward at time $T$ will be the same as that at other time-steps before $T_{ep}$. This property makes it difficult to identify the critical time-steps at which collisions between agents happen.
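As a sketch of the uniform redistribution described above (a simplified reading for illustration, not the authors' implementation of IRCR):

```python
def redistribute(episode_rewards):
    """Spread the episode return evenly over all time-steps."""
    total = sum(episode_rewards)
    return [total / len(episode_rewards)] * len(episode_rewards)

# A single +10 collision reward at t = 3 in a 5-step episode becomes
# indistinguishable from the other steps:
print(redistribute([0, 0, 0, 10, 0]))   # [2.0, 2.0, 2.0, 2.0, 2.0]
```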

The authors of [17] observed that, in competitive environments, the inability of agent policies to adapt to each other resulted in oscillations in rewards. Figure 9(b) indicates that SAM is able to alleviate this problem: policies learned by agents using SAM in the physical deception task exhibit much smaller oscillations in reward than those learned using sparse rewards alone.

VII Conclusion

This paper presented a comprehensive framework to incorporate domain knowledge through shaping advice in single- and multi-agent reinforcement learning environments with sparse rewards. The shaping advice for each agent was a heuristic specified as a difference of potential functions, and was added to the reward provided by the environment. The modified reward signal provided agents with immediate feedback on the quality of the actions taken at each time-step. In the single-agent case, our algorithm SAS enabled the agent to obtain higher rewards and reach a target state more reliably than without a shaping reward. In the multi-agent setting, SAM used the centralized training with decentralized execution paradigm to efficiently learn decentralized policies that use only each agent's local observations. We showed through theoretical analyses and experimental validation that the shaping advice provided by SAS and SAM did not distract agents from accomplishing task objectives specified by the environment reward. Using SAS or SAM allowed the agents to obtain a higher average reward in fewer training episodes.

Future research will aim to extend the scope of shaping-advice-based techniques. Two research questions are: i) can we design principled methods based on SAS and SAM to adaptively learn the shaping advice, rather than fixing it for the duration of training? and ii) how might such an adaptive procedure affect the sample efficiency and the number of training episodes required? We believe that answers to these questions will be a step towards broadening the application of shaping advice to more challenging real-world RL environments.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   MIT press, 2018.
  • [2] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, 2015.
  • [3] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, 2016.
  • [4] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS One, vol. 12, no. 4, 2017.
  • [5] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in International Conference on Learning and Representations, 2016.
  • [6] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electronic Imaging, vol. 2017, no. 19, pp. 70–76, 2017.
  • [7] J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel, “Multi-agent reinforcement learning in sequential social dilemmas,” in Conference on Autonomous Agents and Multi-Agent Systems, 2017, pp. 464–473.
  • [8] S. Yin and F. R. Yu, “Resource allocation and trajectory design in UAV-aided cellular networks based on multi-agent reinforcement learning,” IEEE Internet of Things Journal, 2021.
  • [9] A. K. Agogino and K. Tumer, “Analyzing and visualizing multiagent rewards in dynamic and stochastic domains,” Autonomous Agents and Multi-Agent Systems, vol. 17, no. 2, pp. 320–338, 2008.
  • [10] S. Devlin, D. Kudenko, and M. Grześ, “An empirical study of potential-based reward shaping and advice in complex, multi-agent systems,” Advances in Complex Systems, vol. 14, pp. 251–278, 2011.
  • [11] S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer, “Potential-based difference rewards for multiagent reinforcement learning,” in Conference on Autonomous Agents and Multi-Agent Systems, 2014.
  • [12] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping.” in International Conference on Machine Learning, 1998.
  • [13] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning, 1999, pp. 278–287.
  • [14] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multi-agent reinforcement learning,” in International Conference on Machine Learning, 2017, pp. 1146–1155.
  • [15] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems.” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
  • [16] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in International Conference on Machine Learning, 1993.
  • [17] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
  • [18] P. Mannion, S. Devlin, J. Duggan, and E. Howley, “Reward shaping for knowledge-based multi-objective multi-agent reinforcement learning,” The Knowledge Engineering Review, vol. 33, 2018.
  • [19] T. Gangwani, Y. Zhou, and J. Peng, “Learning guidance rewards with trajectory-space smoothing,” in Neural Information Processing Systems, 2020.
  • [20] B. Xiao, B. Ramasubramanian, A. Clark, H. Hajishirzi, L. Bushnell, and R. Poovendran, “Potential-based advice for stochastic policy learning,” in IEEE Conference on Decision and Control, 2019, pp. 1842–1849.
  • [21] M. E. Taylor, H. B. Suay, and S. Chernova, “Integrating reinforcement learning with human demonstrations of varying ability,” in Conference on Autonomous Agents and Multiagent Systems, 2011, pp. 617–624.
  • [22] Z. Wang and M. E. Taylor, “Improving reinforcement learning with confidence-based demonstrations.” in International Joint Conference on Artificial Intelligence, 2017, pp. 3027–3033.
  • [23] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer, “HG-DAgger: Interactive imitation learning with human experts,” in International Conference on Robotics and Automation.   IEEE, 2019, pp. 8077–8083.
  • [24] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
  • [25] B. Xiao, Q. Lu, B. Ramasubramanian, A. Clark, L. Bushnell, and R. Poovendran, “FRESH: Interactive reward shaping in high-dimensional state spaces using human feedback,” in Conference on Autonomous Agents and Multi-Agent Systems, 2020, pp. 1512–1520.
  • [26] Z. Zuo, “Nonsingular fixed-time consensus tracking for second-order multi-agent networks,” Automatica, vol. 54, pp. 305–309, 2015.
  • [27] B. Tian, H. Lu, Z. Zuo, and W. Yang, “Fixed-time leader–follower output feedback consensus for second-order multiagent systems,” IEEE Transactions on Cybernetics, vol. 49, pp. 1545–1550, 2018.
  • [28] X. Li, Y. Tang, and H. R. Karimi, “Consensus of multi-agent systems via fully distributed event-triggered control,” Automatica, vol. 116, p. 108898, 2020.
  • [29] J. Qin, Q. Ma, Y. Shi, and L. Wang, “Recent advances in consensus of multi-agent systems: A brief survey,” IEEE Transactions on Industrial Electronics, vol. 64, no. 6, pp. 4972–4983, 2016.
  • [30] W. Gao, M. Mynuddin, D. C. Wunsch, and Z.-P. Jiang, “Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [31] K. G. Vamvoudakis, H. Modares, B. Kiumarsi, and F. L. Lewis, “Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,” IEEE Control Systems Magazine, vol. 37, no. 1, pp. 33–52, 2017.
  • [32] G. Qu, Y. Lin, A. Wierman, and N. Li, “Scalable multi-agent reinforcement learning for networked systems with average reward,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [33] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
  • [34] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., “Value-decomposition networks for cooperative multi-agent learning based on team reward.” in Conference on Autonomous Agents and Multiagent Systems, 2018, pp. 2085–2087.
  • [35] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in International Conference on Machine Learning, 2018, pp. 4295–4304.
  • [36] S. Devlin and D. Kudenko, “Theoretical considerations of potential-based reward shaping for multi-agent systems,” in Conference on Autonomous Agents and Multi-Agent Systems, 2011, pp. 225–232.
  • [37] X. Lu, H. M. Schwartz, and S. N. Givigi, “Policy invariance under reward transformations for general-sum stochastic games,” Journal of Artificial Intelligence Research, vol. 41, pp. 397–406, 2011.
  • [38] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowe, “Expressing arbitrary reward functions as potential-based advice,” in AAAI Conference on Artificial Intelligence, 2015, pp. 2652–2658.
  • [39] M. Klissarov and D. Precup, “Reward propagation using graph convolutional networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [40] T. Mesnard, T. Weber, F. Viola, S. Thakoor, A. Saade, A. Harutyunyan, W. Dabney, T. Stepleton, N. Heess, A. Guez et al., “Counterfactual credit assignment in model-free reinforcement learning,” International Conference on Machine Learning, 2021.
  • [41] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming.   John Wiley & Sons, 2014.
  • [42] M. Grześ, “Reward shaping in episodic reinforcement learning,” in Autonomous Agents and MultiAgent Systems, 2017, pp. 565–573.
  • [43] A. Eck, L.-K. Soh, S. Devlin, and D. Kudenko, “Potential-based reward shaping for finite horizon online POMDP planning,” Autonomous Agents and Multi-Agent Systems, vol. 30, no. 3, 2016.
  • [44] S. M. Devlin and D. Kudenko, “Dynamic potential-based reward shaping,” in Conference on Autonomous Agents and Multiagent Systems, 2012, pp. 433–440.
  • [45] E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled methods for advising reinforcement learning agents,” in International Conference on Machine Learning, 2003, pp. 792–799.
  • [46] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in Conference on Autonomous Agents and Multiagent Systems, 2017, pp. 66–83.
  • [47] V. S. Borkar, Stochastic approximation: A dynamical systems viewpoint.   Springer, 2009, vol. 48.
  • [48] H. J. Kushner and D. S. Clark, Stochastic approximation methods for constrained and unconstrained systems.   Springer Science & Business Media, 2012, vol. 26.
  • [49] D. Williams, Probability with Martingales.   Cambridge University Press, 1991.
  • [50] H. Prasad, L. Prashanth, and S. Bhatnagar, “Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games,” in Conference on Autonomous Agents and Multiagent Systems, 2015, pp. 1371–1379.
  • [51] W. Rudin, Principles of Mathematical Analysis.   McGraw-Hill, New York, 1964.
  • [52] Z. Yang, K. Zhang, M. Hong, and T. Başar, “A finite sample analysis of the actor-critic algorithm,” in IEEE Conference on Decision and Control (CDC), 2018, pp. 2759–2764.
  • [53] Y. Jiang and F. Tan, “An enhanced model-free reinforcement learning algorithm to solve Nash equilibrium for multi-agent cooperative game systems,” IEEE Access, vol. 8, pp. 223 743–223 755, 2020.
  • [54] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multi-agent reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5571–5580.
  • [55] O. Marom and B. Rosman, “Belief reward shaping in reinforcement learning,” in AAAI, 2018, pp. 3762–3769.
  • [56] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv:1606.01540, 2016.
  • [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
  • [58] Y. Wen, Y. Yang, R. Luo, J. Wang, and W. Pan, “Probabilistic recursive reasoning for multi-agent reinforcement learning,” in International Conference on Learning Representations, 2019.