
Differential Advising in Multi-Agent Reinforcement Learning

Dayong Ye, Tianqing Zhu*, Zishuo Cheng, Wanlei Zhou and Philip S. Yu *Tianqing Zhu is the corresponding author. T. Zhu is with the School of Computer Science, China University of Geosciences, Wuhan, P. R. China. D. Ye, Z. Cheng and W. Zhou are with the Centre for Cyber Security and Privacy and with the School of Computer Science, University of Technology, Sydney, Australia. Philip S. Yu is with Department of Computer Science, University of Illinois at Chicago, USA. Email: [email protected], {Dayong.Ye, 13367166, Wanlei.Zhou}@uts.edu.au, [email protected]. This work is supported by National Natural Science Foundation of China (No.61972366), ARC LP170100123 and DP190100981, Australia, and NSF under grants III-1763325, III-1909323, and SaTC-1930941, USA.
Abstract

Agent advising is one of the main approaches to improving agent learning performance by enabling agents to share advice. Existing advising methods have a common limitation: an adviser agent can offer advice to an advisee agent only if the advice was created in the same state as the advisee's concerned state. In complex environments, however, requiring two states to be identical is a very strong requirement, because a state may consist of multiple dimensions and two states are the same only if all of these dimensions are correspondingly identical. This requirement may therefore limit the applicability of existing advising methods to complex environments. In this paper, inspired by the differential privacy scheme, we propose a differential advising method which relaxes this requirement by enabling agents to use advice in a state even if the advice was created in a slightly different state. Compared with existing methods, agents using the proposed method have more opportunities to take advice from others. This paper is the first to adopt the concept of differential privacy in advising to improve agent learning performance rather than to address security issues. The experimental results demonstrate that the proposed method is more efficient in complex environments than existing methods.

Keywords - Multi-Agent Reinforcement Learning, Agent Advising, Differential Privacy

1 Introduction

Multi-agent reinforcement learning (MARL) is one of the fundamental research topics in artificial intelligence [1]. In regular MARL methods, agents typically need a large number of interactions with the environment and other agents to learn proper behaviors. To improve agent learning speed, the agent advising technique was introduced [2, 3, 4], which enables agents to ask each other for advice.

Existing advising methods share a common limitation that an adviser agent can offer advice to an advisee agent only if the advice is created in the same state as the advisee agent's concerned state [5]. However, in complex environments, requiring two states to be the same is usually a strong requirement, because a state may be composed of multiple dimensions and two states being the same implies that all these dimensions in the two states are correspondingly identical. This requirement may hinder the application of existing methods to complex environments. For example, in a multi-robot search and rescue problem (Fig. 1), the aim of each robot is to search and rescue victims as quickly as possible. Each robot is a learning agent which can observe only its surrounding area. To improve robot learning performance, robots can ask each other for advice. In this problem, an observation of a robot is interpreted as a state of the robot. An observation consists of eight dimensions, where each dimension stands for a small cell around the robot. The format of an observation could be s=\langle 0,0,1,0,0,0,1,0\rangle, where 0 means an empty cell and 1 means an obstacle in a cell. A slightly different observation could be s^{\prime}=\langle 0,0,1,0,1,0,1,0\rangle, as the two states, s and s^{\prime}, differ only in the fifth dimension (see Definition 4 for more detail). In existing advising methods, when a robot i in state s asks for advice from another robot j, robot j can offer advice to robot i only if robot j has visited state s a number of times.

Figure 1: A multi-robot search and rescue example

In this paper, we relax this requirement and develop a differential advising method which enables agents to use advice in a state even if the advice is created in a slightly different state. Thus, in the above example, by using our method, even if robot j has never visited state s, robot j can still offer advice to robot i as long as robot j has visited a neighboring state of state s, e.g., state s^{\prime}. However, developing this differential advising method is a difficult task due to the following two challenges. First, how to measure and delimit the difference between two states is a challenge, because the amount of difference between two states may significantly affect the quality of advice. Second, how to use advice in a concerned state when the advice is created in another state is also a challenge, because improper advice may harm the learning performance of agents [6]. To address these two challenges, the differential privacy technique is applied.

Differential privacy is a privacy model which guarantees that a query to two slightly different datasets yields almost the same results [7, 8]. This means that the query result yielded from one dataset can be considered approximately identical to the query result yielded from the other dataset. This property of differential privacy can be carried over to our method. Specifically, two slightly different states are analogous to two slightly different datasets, and advice generated from a state is analogous to a result yielded from a dataset. Since two results from two slightly different datasets can be considered approximately identical, two pieces of advice generated from two slightly different states can also be considered approximately identical. This property guarantees that advice created in one state can still be used in another, slightly different state.

In summary, this paper makes two contributions. First, we are the first to develop a differential advising method which allows agents to use advice created in different states. This method enables agents to receive more advice than existing methods. Second, we are the first to take advantage of the differential privacy technique in agent advising to improve agent learning performance rather than to address security issues.

2 Related Work

Torrey and Taylor [9] introduced a teacher-student framework, where a teacher agent provides advice to a student agent to accelerate the student’s learning. Their framework is teacher-initiated, where the teacher determines when to give advice to the student. They developed a set of heuristic approaches for the teacher agent to decide when to provide advice. By extending Torrey and Taylor’s framework [9], Amir et al. [10] proposed an interactive student-teacher framework, where the student agent and the teacher agent jointly determine when to ask for/provide advice. They also developed a set of heuristic methods for both the student agent and the teacher agent to decide when to ask for/provide advice.

Later, Silva et al. [2] extended Amir et al.’s [10] work by taking simultaneity into consideration, where multiple agents simultaneously learn in an environment and an agent can be both a teacher and a student at the same time. The decisions regarding advice request and provision are based on a set of heuristic methods.

Wang et al. [11] proposed a teacher-student framework to accelerate the lexicon convergence speed among agents. Their framework is similar to Silva et al.’s framework [2]. The major difference is that, in Wang et al.’s framework, instead of broadcasting a request to all the neighboring agents, a student agent uses a multicast manner to probabilistically ask each neighboring agent for advice. The asking probability is based on the distance between the student and the teacher.

Ye et al. [6] also proposed an agent advising approach in a simultaneous learning environment. Unlike other advising methods which consider only benign and cooperative agents, Ye et al. introduced malicious agents into the environment, which provide false advice to hurt the learning performance of other agents. Ye et al. then used the differential privacy technique to preserve the learning performance of benign agents and reduce the impact of malicious agents.

Omidshafiei et al. [12] presented an advising framework. Unlike existing methods which use only heuristic methods to decide when to advise, their framework enables agents to learn when to teach/request and what to teach. Specifically, an advisee agent has a student policy to decide when to request advice using advising-level observation. The advising-level observation is based on the advisee agent's task-level observation and action-value vectors. Similarly, an adviser agent has a teacher policy to decide when and what to advise using advising-level observation. This advising-level observation is based on 1) the adviser agent's own task-level knowledge in the advisee agent's state, and 2) the adviser agent's measure of the advisee agent's task-level state and knowledge.

Zhu et al. [13] developed a Q-value sharing framework named PSAF (partaker-sharer advising framework). Unlike other advising methods where agents transfer recommended actions as advice, in PSAF, agents transfer Q-values as advice. Using Q-values as advice is more flexible than using actions as advice, because the policies of agents may be continuously changing and thus an advisee agent's learning performance may be impaired by following an adviser's recommended action. By contrast, if the advice is a Q-value, the advisee can update its policy using this Q-value and then use its own action selection method to pick an action.

Ilhan et al. [14] developed an action advising method in multi-agent deep reinforcement learning (MADRL). Unlike regular MARL, in MADRL, the number of states may be very large or even infinite. Thus, it is infeasible to use the number of visits to a state to measure the confidence in that state as commonly used in existing methods. They, hence, adopt the random network distillation to measure the confidence in a state by measuring the mean squared error between two neural networks: the target network and the predictor network.

Unlike the above research which focuses on the advising process, some works have different focuses. Fachantidis et al. [15] studied the critical factors that affect advice quality. Gupta et al. [16] suggested that teacher agents should not only advise the action to take in a given state but also provide more informative advice using the synthesis of knowledge they have gained. They proposed an advising framework where a teacher augments a student’s knowledge by providing not only the advised action but also the expected long-term reward of following that action. Vanhee et al. [17] introduced Advice-MDPs which extend Markov decision processes for generating policies that take into account advising on the desirability, undesirability and prohibition of certain states and actions.

The above-mentioned works have a common limitation that advice can be offered only if it is created in the same state as the given state. This limitation may deter the application of these works to complex environments. State similarity has also been researched by Castro [18], who defined a bisimulation metric. However, Castro's research focuses on computing state similarity in deterministic MDPs rather than on agent advising. Also, the definition of bisimulation provided in [18] may not be applicable to our research, because that definition requires an agent to have full knowledge of the reward functions, the state space, and the state transition functions. In our research, the environments are partially observable and thus the full state space is unknown to any agent. Moreover, in our research, as multiple agents coexist in an environment, the state transition of an agent highly depends on the actions of other agents. Hence, state transition functions are hard to pre-define. In this paper, we propose a differential advising method which overcomes the common limitation of existing methods by taking advantage of differential privacy.

3 Preliminaries

3.1 Multi-agent reinforcement learning

Reinforcement learning is usually used to solve sequential decision-making problems. A sequential decision-making process can be formally modeled as a Markov decision process (MDP) [19]. An MDP is typically a tuple \langle S,A,T,R\rangle, where S is the set of states, A is a set of actions available to the agent, T is the transition function, and R is the reward function.

At each step, an agent observes the state s\in S of the environment, and selects an action a\in A based on its policy π, which is a probability distribution over the available actions. After performing the action, the agent receives a real-valued reward r and enters a new state s^{\prime}\in S. The agent then updates its policy π based on the reward r and the new state s^{\prime}. In this way, the agent gradually accumulates knowledge and improves its policy to maximize its accumulated long-term expected reward.

An MDP can be extended to model a multi-agent reinforcement learning process as a tuple \langle S,A_{1,...,n},T,R_{1,...,n}\rangle, where S is the Cartesian product of the sets of local states of each agent, S=S_{1}\times S_{2}\times\cdots\times S_{n}, A_{i} is the available action set of agent i, T is the transition function, and R_{i} is the reward function of agent i. In multi-agent reinforcement learning, state transitions are based on the joint actions of all the agents. Each agent's individual reward is also based on the joint actions of all the agents.

3.2 Differential privacy

Differential privacy is a prevalent privacy model and has been broadly applied to various applications [20, 21, 22]. Differential privacy can guarantee an individual's privacy independent of whether the individual is in or out of a dataset [7]. Two datasets D and D^{\prime} are neighboring datasets if they differ in at most one record. Let f be a query that maps dataset D to a k-dimensional vector in \mathbb{R}^{k}: f:D\rightarrow\mathbb{R}^{k}. The maximal difference in the results of query f is defined as the sensitivity Δf of the query, which determines how much perturbation is required for a privacy-preserving answer. The formal definition of sensitivity is given as follows.

Definition 1 (Sensitivity).

For a query f:D\rightarrow\mathbb{R}^{k}, the sensitivity of f is defined as

\Delta f=\max\limits_{||D-D^{\prime}||_{1}\leq 1}||f(D)-f(D^{\prime})||_{1}, (1)

where ||D-D^{\prime}||_{1}\leq 1 means that the two datasets D and D^{\prime} differ in at most one record, and ||f(D)-f(D^{\prime})||_{1}=\sum_{1\leq i\leq k}|f(D)_{i}-f(D^{\prime})_{i}|.
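As an illustration of Definition 1, the following minimal Python sketch (our own helper, not a library routine) evaluates the L1 change of a query over a supplied set of neighboring dataset pairs; in general the maximum is taken over all such pairs.

```python
import numpy as np

def l1_sensitivity(f, neighboring_pairs):
    """Largest L1 change of query f over the supplied neighboring dataset pairs (Equation 1)."""
    return max(
        float(np.sum(np.abs(np.asarray(f(D)) - np.asarray(f(D_prime)))))
        for D, D_prime in neighboring_pairs
    )

# A counting query changes by at most 1 when one record is added or removed.
count_query = lambda D: [len(D)]
print(l1_sensitivity(count_query, [([1, 2, 3], [1, 2])]))  # -> 1.0
```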

The aim of differential privacy is to mask the difference in the answer of query f between two neighboring datasets. To achieve this aim, differential privacy provides a randomized mechanism \mathcal{M} to access a dataset. In ϵ-differential privacy, the parameter ϵ is defined as the privacy budget, which controls the privacy guarantee level of mechanism \mathcal{M}. A smaller ϵ represents stronger privacy. The formal definition of differential privacy is presented as follows.

Definition 2 (ϵ-Differential Privacy).

A randomized mechanism \mathcal{M} gives ϵ-differential privacy if, for any pair of neighboring datasets D and D^{\prime}, and for every set of outcomes Ω, \mathcal{M} satisfies:

Pr[\mathcal{M}(D)\in\Omega]\leq\exp(\epsilon)\cdot Pr[\mathcal{M}(D^{\prime})\in\Omega] (2)

One of the prevalent differential privacy mechanisms is the Laplace mechanism. The Laplace mechanism adds Laplace noise to query results. We use Lap(b) to represent noise sampled from the Laplace distribution with scale b. The mechanism is described as follows.

Definition 3 (Laplace mechanism).

Given any function f:D\rightarrow\mathbb{R}^{k}, the Laplace mechanism \mathcal{M}_{L} is defined as

\mathcal{M}_{L}(D,f,\epsilon)=f(D)+(y_{1},...,y_{k}), (3)

where y_{1},...,y_{k} are random noise drawn from Lap(\frac{\Delta f}{\epsilon}).
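A minimal Python sketch of the Laplace mechanism, assuming NumPy; the function name and argument names are illustrative.

```python
import numpy as np

def laplace_mechanism(query_answer, sensitivity, epsilon, rng=None):
    """Perturb a k-dimensional query answer with Laplace noise of scale b = sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    answer = np.asarray(query_answer, dtype=float)
    scale = sensitivity / epsilon
    return answer + rng.laplace(loc=0.0, scale=scale, size=answer.shape)

# Example: a 4-dimensional answer released with sensitivity 1 and privacy budget 0.5.
noisy_answer = laplace_mechanism([3.0, 7.0, 2.0, 5.0], sensitivity=1.0, epsilon=0.5)
```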

3.3 Differential advising

Based on the definitions of differential privacy, we provide the definitions of differential advising. The aim of providing the definitions of differential advising is to link the properties of differential privacy to the properties of differential advising. This linkage can guarantee that the differential privacy mechanisms can be used as differential advising mechanisms, as later shown in Lemma 1 and Theorem 1 in Section 5.

The reason for applying differential privacy to advising is to ensure that the knowledge learned in a state can still be used in a different but similar state, i.e., a neighboring state (Definition 4). In previous work, knowledge can only be re-used in identical states. With the introduction of the ‘neighboring state’, a piece of knowledge has more chance of being applied by agents than in previous work. This idea is reasonable, as we humans can easily borrow knowledge from a similar environment.

To achieve this goal, we can consider acquiring knowledge, i.e., asking for advice, as querying a dataset, and consider the Q-values of state/action pairs as constituting a dataset, i.e., a Q-table. Then, the differential advising process can be simulated as a differentially private query process. To implement differential advising, we apply the Laplace mechanism to mask the difference between the knowledge learned in two neighboring states. Specifically, a Q-table, shown below, is a two-dimensional matrix which stores the knowledge learned by an agent. Differential advising operates on Q-tables.

States/Actions  a_1  a_2  a_3  a_4
s_1  Q(s_1,a_1)  Q(s_1,a_2)  Q(s_1,a_3)  Q(s_1,a_4)
s_2  Q(s_2,a_1)  Q(s_2,a_2)  Q(s_2,a_3)  Q(s_2,a_4)
s_3  Q(s_3,a_1)  Q(s_3,a_2)  Q(s_3,a_3)  Q(s_3,a_4)

Without loss of generality, we assume that a state \boldsymbol{s} consists of m dimensions: \boldsymbol{s}=(s_{1},...,s_{m}). Each dimension s_{i} could be either an integer or a real number, representing discrete or continuous states. The difference between two states \boldsymbol{s} and \boldsymbol{s}^{\prime} is defined as:

Definition 4 (Difference between states).
||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}=\sum_{1\leq i\leq m}|s_{i}-s^{\prime}_{i}|. (4)

Two states \boldsymbol{s} and \boldsymbol{s}^{\prime} are neighboring, i.e., slightly different, if their difference is at most 1, i.e., ||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1. The difference threshold is set to 1 because this is the prerequisite for using differential privacy. As described in Definition 2, differential privacy works when two datasets differ in at most one record. In fact, Definition 4 is slightly different from the prerequisite of differential privacy. This is because data records in a dataset are discrete, and thus ||D-D^{\prime}||_{1}\leq 1 means that one record is different in the two datasets. In Definition 4, if states are discrete and each dimension is represented as an integer, ||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1 also means that one dimension is different in the two states, as in the multi-robot example in Section 1. If states are continuous, ||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1 could mean that the difference spans multiple dimensions and the sum of the differences is less than 1. We leave further research on continuous states as one of our future works.
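As a concrete illustration of Definition 4 and the neighboring-state check (assuming discrete, integer-valued dimensions as in the multi-robot example):

```python
import numpy as np

def state_difference(s, s_prime):
    """L1 difference between two m-dimensional states (Definition 4)."""
    return float(np.sum(np.abs(np.asarray(s) - np.asarray(s_prime))))

def are_neighboring(s, s_prime):
    """Two states are neighboring if their difference is at most 1."""
    return state_difference(s, s_prime) <= 1

s = [0, 0, 1, 0, 0, 0, 1, 0]        # observation from the multi-robot example
s_prime = [0, 0, 1, 0, 1, 0, 1, 0]  # differs only in the fifth dimension
assert are_neighboring(s, s_prime)
```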

Similarly, the advice sensitivity is defined as follows.

Definition 5 (Advice sensitivity).

For an advice generation function Q:S\rightarrow\mathbb{R}^{k}, where S is the state set and k is the number of actions, the sensitivity of Q is defined as

\Delta Q=\max\limits_{\boldsymbol{s},\boldsymbol{s}^{\prime}\in S\wedge||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1}||Q(\boldsymbol{s})-Q(\boldsymbol{s}^{\prime})||_{1}. (5)

As advice takes the form of a Q-vector, for simplicity, we use the letter "Q" to represent both the advice generation function and the Q-function. Details will be given in the next section.

Definition 6 (ϵ-differential advising).

An advising method \mathcal{M}_{A} satisfies ϵ-differential advising if, for any pair of neighboring states \boldsymbol{s} and \boldsymbol{s}^{\prime}, and for every set of advice Ad, \mathcal{M}_{A} satisfies:

Pr[\mathcal{M}_{A}(\boldsymbol{s})\in Ad]\leq\exp(\epsilon)\cdot Pr[\mathcal{M}_{A}(\boldsymbol{s}^{\prime})\in Ad] (6)

The essence of differential privacy is to guarantee that the query results of two neighboring datasets have a very high probability of being the same. Similarly, in differential advising, we require that the query results of two neighboring Q-tables have a very high probability of being the same, so that an agent querying two neighboring Q-tables receives almost the same knowledge. Here, two Q-tables are neighboring if 1) they differ in one record and 2) the two different records correspond to two neighboring states. The query in differential advising is known as differentially private selection [23], which produces the best answer from a space of outcomes. To implement differential advising, we follow the spirit of the Laplace mechanism by adding Laplace noise to query results to mask the difference between two neighboring Q-tables. The details will be given in the next section.

Table I describes the notations and terms used in this paper.

TABLE I: The meaning of each notation
S: a set of states of the environment
A: a set of actions available to an agent
r: an immediate reward obtained by an agent for taking an action under an observation
Q(s,a): a reinforcement value for an agent to take action a in state s
π: a probability distribution over the available actions of an agent
α, ζ: learning rates, both of which are in (0,1)
γ: a discount factor which is in (0,1)
Δf: the sensitivity of a query
ΔQ: the advice sensitivity, i.e., the sensitivity of the advice generation function (Definition 5)
ϵ: privacy budget
b: the scale parameter in the Laplace mechanism
δ, β: parameters of the (δ,β)-usefulness guarantee (Definition 7)
n_{\boldsymbol{s}}: the number of times that an agent has visited a state \boldsymbol{s}
C: the communication budget of an agent
P_{ask}: the probability that an agent asks for advice
P_{give}: the probability that an agent provides advice

4 The Differential Advising Method

Our method is developed in a simultaneous learning framework, where agents learn simultaneously and can take both the roles of adviser and advisee. Each agent i has a communication budget C_{i} to control its communication overhead. Every time an agent i asks for or provides advice from/to another agent, agent i's communication budget is reduced by 1 until the budget C_{i} is used up. Here, we use a combined communication budget C_{i} for asking for and giving advice, instead of two separate budgets, because 1) using a combined budget is more suitable for real-world applications than using two separate budgets, and 2) using a combined budget simplifies the description of our method. For example, in wireless sensor networks, the communication budget of a sensor is based only on its battery power, irrespective of the communication type.

Each agent i has the following knowledge:

  1. its available actions in each state;

  2. the Q-value of each available action;

  3. the existence of its neighboring agents.

Moreover, we assume that 1) a slight change in a state will not significantly change the reward function of an agent, and 2) an agent in two neighboring states has the same action set.

4.1 Overview of the method

In our method, during advising, agents have to address the following three sub-problems:

  • whether to ask for advice;

  • whether to give advice;

  • how to use advice.

The overview of our method is outlined in Fig. 2, and the detail of our method is formally given in Algorithm 1. In summary, Algorithm 1 describes the workflow of our method, which consists of two parts: the advising part (Lines 1-15) and the learning part (Lines 16-23). The advising part of Algorithm 1 depicts the process of an advisee agent asking for and making use of advice. The detail of making use of advice is given in Algorithm 2, where the differential privacy mechanism is introduced. The learning part of Algorithm 1 is a regular reinforcement learning process.

Figure 2: The overview of agent i asking for advice from agent j

1  /* Take agent i at time step t as an example */
2  Observes the environment and receives state \boldsymbol{s}^{t};
3  Decides whether to ask for advice;
4  if agent i decides to ask for advice then
5      if agent i itself has a neighboring state \boldsymbol{s}^{\prime} then
6          Uses Algorithm 2 to obtain \boldsymbol{\pi}(\boldsymbol{s}^{t});
7          /* Internal advice from itself */
8      else
9          broadcasts a message to all neighboring agents;
10         Receives advice: \boldsymbol{Q}(\boldsymbol{s}^{\prime});
11         if \boldsymbol{s}^{\prime}=\boldsymbol{s}^{t} then
12             \boldsymbol{\pi}(\boldsymbol{s}^{t})\leftarrow\boldsymbol{\pi}(\boldsymbol{s}^{\prime});
13         else
14             Uses Algorithm 2 to obtain \boldsymbol{\pi}(\boldsymbol{s}^{t});
15             /* External advice from another agent */
16 Selects an action a_{l} based on the probability distribution \boldsymbol{\pi}(\boldsymbol{s}^{t})=(\pi(\boldsymbol{s}^{t},a_{1}),...,\pi(\boldsymbol{s}^{t},a_{k}));
17 r\leftarrow the reward for performing action a_{l} in state \boldsymbol{s}^{t};
18 Q(\boldsymbol{s}^{t},a_{l})\leftarrow(1-\alpha)Q(\boldsymbol{s}^{t},a_{l})+\alpha[r+\gamma\max_{a\in A_{i}}Q(\boldsymbol{s}^{t+1},a)];
19 \overline{r}\leftarrow\sum_{a\in A_{i}}\pi(\boldsymbol{s}^{t},a)Q(\boldsymbol{s}^{t},a), where A_{i} is the set of available actions of agent i in state \boldsymbol{s}^{t};
20 for each action a\in A_{i} do
21     \pi(\boldsymbol{s}^{t},a)\leftarrow\pi(\boldsymbol{s}^{t},a)+\zeta(Q(\boldsymbol{s}^{t},a)-\overline{r});
22 \boldsymbol{\pi}(\boldsymbol{s}^{t})\leftarrow Normalize(\boldsymbol{\pi}(\boldsymbol{s}^{t}));
23 \alpha\leftarrow\frac{t}{t+1}\alpha;
Algorithm 1: The complete advising process

In Lines 1-3, at time step t, agent i observes a state \boldsymbol{s}^{t} and decides whether to ask for advice from other agents (Section 4.2). Specifically, an advice is the Q-value vector of a state: \boldsymbol{Q}(\boldsymbol{s})=(Q(\boldsymbol{s},a_{1}),...,Q(\boldsymbol{s},a_{k})). In Lines 4 and 5, if agent i decides to ask for advice, agent i checks whether it has a neighboring state \boldsymbol{s}^{\prime} itself. If so, agent i takes \boldsymbol{Q}(\boldsymbol{s}^{\prime}) as self-advice. Agent i thereafter uses Algorithm 2, a differentially private algorithm, to process this advice and selects an action based on it (Line 6). Here, allowing an agent to query itself conserves its communication budget; this is called "self-advising". In Lines 8 and 9, if agent i has not visited any neighboring states, agent i sends a broadcast message to all communication-reachable, i.e., neighboring, agents and its communication budget C_{i} is reduced by 1. If any neighboring agent j decides to offer advice, agent j sends the advice back to agent i and agent j's communication budget C_{j} is reduced by 1 (Section 4.3).

In Lines 10 and 11, once agent i receives the advice from agent j, agent i checks whether this advice was created based on state \boldsymbol{s}^{t} or a neighboring state of \boldsymbol{s}^{t}. If the advice was created based on \boldsymbol{s}^{t}, agent i directly uses this advice to choose an action (Line 12). Otherwise, in Lines 13 and 14, agent i uses Algorithm 2 to modify this advice and then chooses an action by following the modified advice (Section 4.4). In the case that agent i receives advice from multiple agents, it selects the advice whose state has the minimum difference to agent i's current state (recall Equation 4). In the case that no advice is received or agent i decides not to ask for advice, agent i simply chooses an action based on its own experience (Lines 16-23). Specifically, Lines 16-23 describe a regular reinforcement learning process. In Line 16, agent i randomly selects an action based on the probability distribution \boldsymbol{\pi}(\boldsymbol{s}^{t}) over its available actions in state \boldsymbol{s}^{t}. Then, in Line 17, agent i performs the selected action and receives a reward r. In Line 18, agent i uses reward r to update the Q-value of the selected action. In Line 19, agent i computes an average reward by multiplying the probability distribution in state \boldsymbol{s}^{t} with the Q-values of the actions in state \boldsymbol{s}^{t}. After that, in Lines 20 and 21, agent i uses the average reward to update the probability distribution \boldsymbol{\pi}(\boldsymbol{s}^{t}). In Line 22, \boldsymbol{\pi}(\boldsymbol{s}^{t}) is normalized to be a valid probability distribution, where each element of \boldsymbol{\pi}(\boldsymbol{s}^{t}) is in (0,1) and the sum of all the elements is 1. In Line 23, the learning rate α is decayed to guarantee the convergence of the algorithm.
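For concreteness, a minimal tabular Python sketch of the learning part (Lines 16-23); the dictionary-of-arrays data structures and the small positive floor applied before normalization are our own illustrative choices, the latter mirroring the mapping lower bound β mentioned in Section 5.3.

```python
import numpy as np

def learning_step(Q, pi, s, s_next, reward_fn, alpha, gamma, zeta, t, rng):
    """One pass of Lines 16-23 of Algorithm 1. Q[s] and pi[s] are arrays over the k actions."""
    a = rng.choice(len(pi[s]), p=pi[s])                  # Line 16: sample an action from pi(s^t)
    r = reward_fn(s, a)                                  # Line 17: reward for the selected action
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * np.max(Q[s_next]))  # Line 18
    r_bar = float(np.dot(pi[s], Q[s]))                   # Line 19: average reward
    pi[s] = pi[s] + zeta * (Q[s] - r_bar)                # Lines 20-21: policy update
    pi[s] = np.clip(pi[s], 1e-6, None)                   # keep probabilities positive
    pi[s] = pi[s] / pi[s].sum()                          # Line 22: normalize
    alpha = t / (t + 1) * alpha                          # Line 23: decay the learning rate
    return alpha
```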

It has to be noted that in our method, an advice is a Q-vector instead of an action suggestion or a Q-function. This is because our method is differential advising rather than direct advising, and using Q-vectors enables advisee agents to properly modify the advice before they take it; modifying Q-vectors is easier and more reliable than modifying action suggestions or Q-functions. Moreover, using Q-vectors as advice does not mean that adviser agents must have a complete Q-table. As long as an adviser agent has the Q-values of the concerned state s (or of a neighboring state s^{\prime}), the adviser can offer advice.

4.2 Deciding whether to ask for advice

At each time step, an agent i observes a state \boldsymbol{s}. Agent i decides whether to ask for advice based on probability P_{ask} (Equation 7). A random number between 0 and 1 is generated and compared with P_{ask}. If the random number is less than P_{ask}, agent i asks for advice.

The calculation of P_{ask} is based on 1) agent i's confidence in the current state \boldsymbol{s}, and 2) agent i's remaining communication budget C^{\prime}_{i}. Agent i's confidence in state \boldsymbol{s} depends on how many times agent i has visited state \boldsymbol{s}, denoted as n^{i}_{\boldsymbol{s}}. A higher n^{i}_{\boldsymbol{s}} value means a higher confidence and results in a lower probability of asking for advice. Agent i's remaining communication budget C^{\prime}_{i} depends on how many times agent i has asked for or provided advice. Once agent i asks for advice, its communication budget is reduced by 1: C^{\prime}_{i}\leftarrow C^{\prime}_{i}-1. A lower remaining communication budget results in a lower probability of asking for advice. Formally, the probability function P_{ask} takes n^{i}_{\boldsymbol{s}} and C^{\prime}_{i} as input and outputs a probability in [0,1). The calculation of P_{ask} is shown in Equation 7, where the square root is used to reduce the decay speed of P_{ask}, and the positive integer N is a threshold used to prevent agent i from using up its communication budget in very early stages.

P_{ask}=\begin{cases}\frac{1}{\sqrt{n^{i}_{\boldsymbol{s}}}}\cdot\sqrt{\frac{C^{\prime}_{i}}{C_{i}}},&n^{i}_{\boldsymbol{s}}\geq N\\0,&n^{i}_{\boldsymbol{s}}<N\end{cases} (7)
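A direct Python transcription of Equation 7 (the threshold value N=5 below is illustrative only):

```python
import math

def p_ask(n_visits, budget_left, budget_total, N=5):
    """Probability that an agent asks for advice (Equation 7)."""
    if n_visits < N:
        return 0.0
    return (1.0 / math.sqrt(n_visits)) * math.sqrt(budget_left / budget_total)
```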

If agent i decides to ask for advice, agent i checks whether it has any neighboring states. This check is based on Equation 4, which computes the difference between the new state \boldsymbol{s} and the stored states of agent i. If the difference between \boldsymbol{s} and a stored state is less than or equal to 1, that stored state is a neighboring state of \boldsymbol{s}.

On one hand, if agent i has a set of neighboring states, agent i selects the neighboring state \boldsymbol{s}^{\prime} which has the minimum difference to its current state \boldsymbol{s}. Agent i utilizes the proposed differential advising method to modify \boldsymbol{Q}(\boldsymbol{s}^{\prime}) and uses the modified \boldsymbol{Q}(\boldsymbol{s}^{\prime}) to take an action (Section 4.4). On the other hand, if agent i does not have any neighboring states, it asks for advice from its neighboring agents (Section 4.3).

4.3 Deciding whether to give advice

When an agent j receives an advice request message from agent i, agent j needs to decide whether to provide advice to agent i. This decision is based on probability P_{give} (Equation 8). The calculation of P_{give} is based on 1) agent j's confidence in agent i's concerned state \boldsymbol{s} or in neighboring states of \boldsymbol{s}, and 2) agent j's remaining communication budget C^{\prime}_{j}. If agent j has more than one neighboring state, agent j uses the one with the minimum difference to state \boldsymbol{s}. The higher the confidence of agent j, the higher the probability that agent j will provide advice. The lower the remaining communication budget of agent j, the lower the probability that agent j will provide advice.

Formally, the probability function P_{give} takes n^{j}_{\boldsymbol{s}} and C^{\prime}_{j} as input and outputs a probability in [0,1). Here, for simplicity, we use n^{j}_{\boldsymbol{s}} to denote the number of times that agent j has visited either state \boldsymbol{s} or a neighboring state of \boldsymbol{s}, and C^{\prime}_{j} to denote the remaining communication budget of agent j. Once agent j gives advice, its remaining communication budget is reduced by 1: C^{\prime}_{j}\leftarrow C^{\prime}_{j}-1. It should be noted that agent j does not need to know the full state space to find neighboring states, because in some environments, such as the multi-robot example in Section 1, it is infeasible for any individual agent to know the full state space. Thus, agent j simply checks the states which it has visited. If none of the visited states is neighboring to the concerned state, agent j does not have experience related to the concerned state and should not offer any advice to agent i. Formally, the calculation of P_{give} is given in Equation 8.

P_{give}=\begin{cases}(1-\frac{1}{\sqrt{n^{j}_{\boldsymbol{s}}}})\cdot\sqrt{\frac{C^{\prime}_{j}}{C_{j}}},&n^{j}_{\boldsymbol{s}}\geq n^{i}_{\boldsymbol{s}}\\0,&n^{j}_{\boldsymbol{s}}<n^{i}_{\boldsymbol{s}}\end{cases} (8)

In Equation 8, we set a threshold n^{i}_{\boldsymbol{s}} for adviser agents. If an adviser agent has visited a state fewer than n^{i}_{\boldsymbol{s}} times, the adviser agent is less experienced than the advisee agent in state \boldsymbol{s} and thus is not allowed to offer advice to the advisee agent for that state. This setting is based on the fact that, in theory, an adviser agent can generate advice for any possible state, even one it has never visited. For example, an adviser agent could generate advice for a never-visited state using the initialized Q-values of the actions in that state. However, since the Q-value of each action can be randomly initialized, such advice is useless and may even harm the learning performance of the advisee.
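Analogously, a direct Python transcription of Equation 8:

```python
import math

def p_give(n_adviser, n_advisee, budget_left, budget_total):
    """Probability that an adviser gives advice (Equation 8); n_adviser counts visits
    to the concerned state or to its closest neighboring state."""
    if n_adviser < n_advisee:
        return 0.0
    return (1.0 - 1.0 / math.sqrt(n_adviser)) * math.sqrt(budget_left / budget_total)
```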

4.4 How to use advice

When agent i receives agent j's advice \boldsymbol{Q}(\boldsymbol{s}^{\prime}), agent i checks whether its concerned state \boldsymbol{s} is the same as state \boldsymbol{s}^{\prime}. The aim of this check is to decide whether this advice was created based on the same state (\boldsymbol{s}=\boldsymbol{s}^{\prime}) or a neighboring state (\boldsymbol{s}\neq\boldsymbol{s}^{\prime}). If \boldsymbol{s}=\boldsymbol{s}^{\prime}, agent i directly takes agent j's advice and selects an action based on it. If \boldsymbol{s}\neq\boldsymbol{s}^{\prime}, agent i uses the Laplace mechanism to add noise to agent j's advice \boldsymbol{Q}(\boldsymbol{s}^{\prime}), as shown in Algorithm 2. Here is an example. Agent i arrives at state \boldsymbol{s} and asks for advice from agent j. Agent j has visited three states, \boldsymbol{s}_{0},\boldsymbol{s}_{1},\boldsymbol{s}_{2}, and is confident in these states. After calculation, agent j finds that none of the three states is the same as state \boldsymbol{s} and that only \boldsymbol{s}_{0} is a neighboring state of \boldsymbol{s}. Thus, agent j sends its advice, the Q-vector \boldsymbol{Q}(\boldsymbol{s}_{0}), to agent i. After agent i receives this advice, agent i identifies that it was created based on a neighboring state \boldsymbol{s}_{0}. Agent i then adopts Algorithm 2 to add Laplace noise to the advice \boldsymbol{Q}(\boldsymbol{s}_{0}) and uses the modified advice to guide its action selection.

1 Input: \boldsymbol{Q}(\boldsymbol{s}^{\prime})=(Q(\boldsymbol{s}^{\prime},a_{1}),...,Q(\boldsymbol{s}^{\prime},a_{k}));
2 for each action a\in A_{i} do
3     adjust the Q-value Q(\boldsymbol{s}^{\prime},a) as: Q(\boldsymbol{s}^{\prime},a)\leftarrow Q(\boldsymbol{s}^{\prime},a)+Lap(\frac{\Delta Q}{\epsilon});
4 \overline{r}\leftarrow\sum_{a\in A_{i}}\pi(\boldsymbol{s}^{t},a)Q(\boldsymbol{s}^{\prime},a);
5 for each action a\in A_{i} do
6     \pi(\boldsymbol{s}^{t},a)\leftarrow\pi(\boldsymbol{s}^{t},a)+\zeta(Q(\boldsymbol{s}^{\prime},a)-\overline{r});
7 \boldsymbol{\pi}(\boldsymbol{s}^{t})\leftarrow Normalize(\boldsymbol{\pi}(\boldsymbol{s}^{t}));
8 Output: \pi(\boldsymbol{s}^{t},a_{1}),...,\pi(\boldsymbol{s}^{t},a_{k});
Algorithm 2: Differential advising

In Lines 2 and 3, agent j's Q-values are adjusted by adding Laplace noise. Specifically, Laplace noise is added to each of the Q-values: Q(\boldsymbol{s}^{\prime},a_{1})+Lap_{1},...,Q(\boldsymbol{s}^{\prime},a_{k})+Lap_{k}, where each noise term Lap_{i} is a random number drawn from the Laplace distribution. Based on the adjusted Q-values, in Lines 4-6, agent i computes its average reward in state \boldsymbol{s}^{t} and updates its probability distribution \pi(\boldsymbol{s}^{t},a). Note that the computation of the average reward multiplies the probability distribution in state \boldsymbol{s}^{t} with the Q-values of the available actions in state \boldsymbol{s}^{\prime}. As we assume that an agent has the same action set in two neighboring states, this multiplication is valid. In Line 7, \pi(\boldsymbol{s}^{t},a) is normalized to be a valid probability distribution.
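The following minimal Python sketch mirrors Algorithm 2, assuming NumPy; the small positive floor applied before normalization is our own illustrative choice.

```python
import numpy as np

def differential_advising(Q_advice, pi_current, zeta, delta_Q, epsilon, rng=None):
    """Perturb the received Q-vector with Laplace noise (Lines 2-3) and update the
    advisee's policy in its current state (Lines 4-7)."""
    rng = rng or np.random.default_rng()
    Q_advice = np.asarray(Q_advice, dtype=float)
    noisy_Q = Q_advice + rng.laplace(scale=delta_Q / epsilon, size=Q_advice.shape)
    r_bar = float(np.dot(pi_current, noisy_Q))           # Line 4: average reward
    pi_new = pi_current + zeta * (noisy_Q - r_bar)       # Lines 5-6: policy update
    pi_new = np.clip(pi_new, 1e-6, None)                 # keep probabilities positive
    return pi_new / pi_new.sum()                         # Line 7: normalize
```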

In the Laplace mechanism in Line 3, ΔQ is the sensitivity of the Q-values (recall Definition 5). Typically, ΔQ is based on Q-functions, and Q-functions are based on rewards. Thus, essentially, ΔQ is based on rewards, and we can use rewards to approximately compute ΔQ. Note that it is infeasible to accurately compute ΔQ before the learning starts, because the accurate computation of ΔQ needs the final Q-values of all the states, and these final Q-values can only be obtained when the learning finishes.

We use an example to demonstrate how to approximately compute ΔQ. Definition 5 states that ΔQ is the maximum difference between the Q-vectors of any two neighboring states. In the multi-robot problem exemplified in Section 1, for any two neighboring states, the maximum reward difference is 10-(-5)=15, given that the maximum reward 10 is for finding a victim and the minimum reward -5 is for hitting an obstacle. This is because in any two neighboring states, only one dimension, i.e., one cell, is different. This different cell incurs the maximum reward difference if the cell contains a victim in one state and an obstacle in the other. Moreover, a Q-value is usually updated with a learning rate α. Thus, in the multi-robot problem, ΔQ can be approximately computed as α\cdot 15. When we set α=0.2, ΔQ=3.
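The corresponding back-of-the-envelope computation, as a short Python snippet:

```python
alpha = 0.2                       # learning rate used in the example
max_reward_gap = 10 - (-5)        # victim (+10) versus obstacle (-5) in the differing cell
delta_Q = alpha * max_reward_gap  # approximately 3.0
```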

The rationale of Algorithm 2 is discussed as follows. According to Definition 3 in Section 3.2, the Laplace mechanism adds Laplace noise to the numerical output of a query f to a dataset D, so that both the privacy and the utility of the output can be guaranteed. Typically, a query f maps a dataset D to k real numbers: f:D\rightarrow\mathbb{R}^{k}. In comparison, in advising, the Q-table of an adviser agent is interpreted as a dataset, where each entry in the Q-table is the Q-value of a state-action pair, Q(s,a). An advice request from an advisee agent is interpreted as a query to the Q-table of the adviser agent. The output of the query is the advice of the adviser agent, which consists of k real numbers, Q(\boldsymbol{s},a_{1}),...,Q(\boldsymbol{s},a_{k}), where k is the number of actions in state \boldsymbol{s}. As discussed in Section 3.3, we have connected the properties of differential privacy with the properties of differential advising. Hence, by adding Laplace noise to Q(\boldsymbol{s},a_{1}),...,Q(\boldsymbol{s},a_{k}), both the validity and the utility of the advice can be guaranteed, as shown in Theorems 1, 2 and 3 in Section 5. Thus, in state \boldsymbol{s}, agent i can still use the advice \boldsymbol{Q}(\boldsymbol{s}^{\prime}), which is created based on a neighboring state \boldsymbol{s}^{\prime}.

5 Theoretical Analysis

5.1 Differential advising analysis

Lemma 1.

Any method which satisfies ϵ-differential privacy and meets the input and output requirements of \mathcal{M}_{A} in Definition 6 also satisfies ϵ-differential advising.

Proof.

In Definition 2, the input of \mathcal{M} can be any type of dataset, while the output can be any valid result. In Definition 6, the input of \mathcal{M}_{A} is a state \boldsymbol{s}, which is a vector, while the output is also a vector \boldsymbol{Q}(\boldsymbol{s}). Thus, Definition 6 can be considered a special case of Definition 2. Formally, let D\in\mathbb{N}^{|\mathcal{X}|} and \boldsymbol{s}\in S, where \mathbb{N}^{|\mathcal{X}|} is a data universe with |\mathcal{X}| attributes and S is a state space. As \mathbb{N}^{|\mathcal{X}|} is a data universe and can be arbitrarily large, we have S\subset\mathbb{N}^{|\mathcal{X}|}. Similarly, we also have Ad\subset\Omega. Therefore, any method which satisfies ϵ-differential privacy and meets the input and output requirements of \mathcal{M}_{A} also satisfies ϵ-differential advising. ∎

Lemma 1 creates a link between differential privacy and differential advising. Then, differential privacy mechanisms, e.g., the Laplace mechanism, can be used to implement differential advising.

Theorem 1.

The proposed method satisfies ϵ-differential advising.

Proof.

According to Lemma 1, to prove this theorem, we only need to prove that the proposed method satisfies ϵ-differential privacy. In the proposed method, Algorithm 2 utilizes a differential privacy mechanism, i.e., the Laplace mechanism, in Line 3. We re-write Line 3 in Q-vector form as \mathcal{M}_{a}(\boldsymbol{s},\boldsymbol{Q},\epsilon)=\boldsymbol{Q}(\boldsymbol{s})+(Lap_{1},...,Lap_{k}), where Lap_{1},...,Lap_{k} are random noise sampled from Lap(\frac{\Delta Q}{\epsilon}) and k is the number of actions in state \boldsymbol{s}. Since the Laplace mechanism is differentially private, by comparing Equation 3 with the re-written form of Line 3, we can conclude that \mathcal{M}_{a}(\boldsymbol{s},\boldsymbol{Q},\epsilon) satisfies ϵ-differential privacy. ∎

Theorem 1 theoretically proves that the proposed method is a valid differential advising method which enables agents to use advice created based on similar states. Moreover, the parameter ϵ is used to control the amount of noise added to the advice. The tuning of ϵ is left to users.

5.2 Average reward analysis

As shown in Line 4 of Algorithm 2, the average reward of an agent is \overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a). We will demonstrate that, with a very high probability, the average reward of an agent using differential advising is greater than the average reward without using differential advising.

Definition 7 ((δ,β)-useful).

A differential advising multi-agent system is (δ,β)-useful if for each agent we have Pr(\overline{r}^{\prime}-\overline{r}>\delta)<1-\beta, where \overline{r}^{\prime} and \overline{r} are the average rewards of an agent using and not using differential advising, respectively, and δ>0 and 0<β<1.

Theorem 2.

For any 0<β<1, with probability at least 1-β, the average reward of an agent using differential advising is greater than the average reward without using it. The difference is bounded by δ, which satisfies 0<\delta<-\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta), where t is the number of learning iterations.

Proof.

The average reward of an agent is \overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a). After applying the Laplace mechanism, the average reward becomes \overline{r}^{\prime}=\sum_{a\in A}\pi(\boldsymbol{s},a)(Q(\boldsymbol{s},a)+Lap(b)). Now, we need to prove that Pr(\overline{r}^{\prime}-\overline{r}>\delta)>1-\beta.

By calculation, we have \overline{r}^{\prime}-\overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)(Q(\boldsymbol{s},a)+Lap_{a}(b))-\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a)=\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b). Thus, Pr(\overline{r}^{\prime}-\overline{r}>\delta)=Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\delta). Because \sum_{a\in A}\pi(\boldsymbol{s},a)=1, we have Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\delta)=Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\sum_{a\in A}\pi(\boldsymbol{s},a)\delta). The property of Lap(b) is presented in Lemma 2.

Lemma 2.

(Laplace Random Variables). For every δ>0,

Pr(Lap(b)>\delta)=\frac{1}{2}\exp(-\frac{\delta}{b}). (9)
Proof.

The proof of this lemma can be found in [24]. ∎

Based on this property, we have Pr(\overline{r}^{\prime}-\overline{r}>\delta)=Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\sum_{a\in A}\pi(\boldsymbol{s},a)\delta)=\frac{1}{2}\exp(-\frac{\delta}{b}).

Lemma 2 gives the probability in one learning iteration. If an agent learns for t iterations, the probability becomes Pr(\overline{r}^{\prime}-\overline{r}>\delta)=[\frac{1}{2}\exp(-\frac{\delta}{b})]^{t}. To guarantee Pr(\overline{r}^{\prime}-\overline{r}>\delta)>1-\beta, we need [\frac{1}{2}\exp(-\frac{\delta}{b})]^{t}>1-\beta. After calculation, we have \delta<-b\cdot\frac{1}{t}\cdot\ln(2-2\beta). As b=\frac{\Delta Q}{\epsilon}, we have \delta<-\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta). ∎

In this proof, we use the average reward equation \overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a). This equation correctly estimates an agent's average reward only if the agent has a good estimation of the Q-table. Hence, this theorem holds in late learning stages and reveals the final results.

Remark 1: The upper bound of δ, -\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta), relies on t. Let \delta(t)=-\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta). Hence, δ(t) is monotonically decreasing as t increases. Specifically, when t\rightarrow 0, \delta(t)\rightarrow\infty, and when t\rightarrow\infty, \delta(t)\rightarrow 0. This means that in early stages, using differential advising can significantly increase agents' average rewards. However, as time progresses, the improvement in agents' average rewards decreases. This reflects the fact that in early stages, an agent is not knowledgeable, so other agents' knowledge can give it significant help. As time progresses, the agent accumulates enough knowledge, and other agents' knowledge becomes trivial to it.
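A small Python sketch of this monotone behavior, using illustrative values only (ΔQ=3, ϵ=1, β=0.9):

```python
import math

def delta_upper_bound(delta_Q, epsilon, t, beta):
    """Upper bound on delta in Theorem 2: -(Delta Q / epsilon) * (1/t) * ln(2 - 2*beta)."""
    return -(delta_Q / epsilon) * (1.0 / t) * math.log(2 - 2 * beta)

for t in (1, 10, 100):
    print(t, round(delta_upper_bound(3.0, 1.0, t, 0.9), 4))  # the bound shrinks as t grows
```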

Theorem 2 theoretically demonstrates the effectiveness of our method by analyzing the lower bound of an agent's average reward difference between using and not using our method. The analysis shows that the lower bound of the average reward difference is positive with a very high probability. This means that an agent using our method can, with a very high probability, receive a higher average reward than one not using our method. Given that the average rewards with and without the proposed method are \overline{r}^{\prime} and \overline{r}, respectively, the parameter β is used to measure the lower bound of the average reward difference \overline{r}^{\prime}-\overline{r}. In particular, when \beta\rightarrow 0, the lower bound is a small positive number; when \beta\rightarrow 1, the lower bound tends to +\infty. This means that the average reward difference \overline{r}^{\prime}-\overline{r} has a high probability, 1-\beta, of being a small positive number, and a low probability of being very large. Certainly, the average reward difference \overline{r}^{\prime}-\overline{r} cannot be +\infty. Thus, for the completeness of Theorem 2, Theorem 3 gives the upper bound of this difference as a supplement.

Theorem 3.

Let k be the number of actions in state \boldsymbol{s}. Let v\geq\sqrt{k}\frac{\Delta Q}{\epsilon} and 0<\lambda<\frac{2\sqrt{2}\epsilon v^{2}}{\Delta Q}. Then Pr(\overline{r}^{\prime}-\overline{r}>\lambda)\leq\exp(-\frac{t\lambda^{2}}{8v^{2}}), where t is the number of learning iterations.

Proof.

As shown in Theorem 2, we have \overline{r}^{\prime}-\overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)(Q(\boldsymbol{s},a)+Lap(b))-\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a)=\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b). Since each \pi(\boldsymbol{s},a) is in [0,1], we have \overline{r}^{\prime}-\overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)<\sum_{a\in A}Lap_{a}(b). Thus, we have Pr(\overline{r}^{\prime}-\overline{r}>\lambda)<Pr(\sum_{a\in A}Lap_{a}(b)>\lambda). The property of \sum_{a\in A}Lap_{a}(b) is given in Lemma 3.

Lemma 3.

(Sum of Laplace random variables) Let Lap_{1}(b),...,Lap_{k}(b) be k independent random variables with distribution Lap(b); then Pr(\sum_{a\in A}Lap_{a}(b)>\lambda)\leq\exp(-\frac{\lambda^{2}}{8v^{2}}).

Proof.

The proof of this lemma can be found in [23]. ∎

Lemma 3 gives the probability in one learning iteration. If an agent learns for t iterations, the probability becomes Pr(\overline{r}^{\prime}-\overline{r}>\lambda)\leq\exp(-\frac{t\lambda^{2}}{8v^{2}}). ∎

According to Theorems 2 and 3, the performance of the proposed method is affected by the advice sensitivity ΔQ, and a smaller ΔQ means better performance. Since the advice sensitivity ΔQ is based on the Q-values of states, we can conclude that the proposed method works well in environments where the Q-values of any pair of neighboring states have a small difference.

5.3 Convergence analysis

In Algorithm 1, let \alpha_{t} be the learning rate at time step t. Since \alpha_{t+1}=\frac{t}{t+1}\cdot\alpha_{t}=\frac{t}{t+1}\cdot\frac{t-1}{t}\cdot\alpha_{t-1}=..., we can conclude that \alpha_{t}=\frac{1}{t}\alpha_{1}. Let t_{i} be the index of the i-th time that action a is performed in state \boldsymbol{s}. For example, suppose that the 1st time action a is performed in state \boldsymbol{s} is at time step 6; then t_{1}=6. To prove the convergence of Algorithm 1, we need the results of Lemma 4 [25] and Lemma 5 [19].

Lemma 4.

A series, s_{1}+s_{2}+..., converges if and only if for any given small positive number \mu>0, there is always an integer N such that when n>N, for any positive integer m, the inequality |s_{n+1}+...+s_{n+m}|<\mu holds.

It should be noted that in Lemma 4, if |s_{n+1}+...+s_{n+m}|\geq\mu, the series s_{1}+s_{2}+... does not converge.

Lemma 5.

Given bounded rewards and learning rate 0\leq\alpha_{t_{i}}\leq 1, for any pair of state \boldsymbol{s} and action a, if the series \sum_{1\leq i}\alpha^{2}_{t_{i}} is bounded, i.e., \sum_{1\leq i}\alpha^{2}_{t_{i}}<+\infty, and the series \sum_{1\leq i}\alpha_{t_{i}} is unbounded, i.e., \sum_{1\leq i}\alpha_{t_{i}}\rightarrow+\infty, then, as i\rightarrow+\infty, Q(\boldsymbol{s},a) converges, with probability 1, to the optimal Q-value, i.e., the expected reward.

Theorem 4.

In Algorithm 1, for any pair of state \boldsymbol{s} and action a, as i\rightarrow+\infty, Q(\boldsymbol{s},a) converges, with probability 1, to the expected reward of an agent performing action a in state \boldsymbol{s}.

Proof.

Based on Lemma 5, to prove the convergence, we only need to prove that in Algorithm 1, \sum_{1\leq i}\alpha^{2}_{t_{i}}<+\infty and \sum_{1\leq i}\alpha_{t_{i}}\rightarrow+\infty always hold for any pair of state \boldsymbol{s} and action a.

First, we prove that \sum_{1\leq i}\alpha^{2}_{t_{i}}<+\infty always holds. It is known that \sum_{1\leq t}\frac{1}{t^{2}}=\frac{\pi^{2}}{6}, where t is a time step, a positive integer. Then, the following inequality can be concluded: \sum_{1\leq i}\alpha^{2}_{t_{i}}=\sum_{1\leq i}\frac{\alpha^{2}_{1}}{t^{2}_{i}}\leq\sum_{1\leq t}\frac{\alpha^{2}_{1}}{t^{2}}=\alpha^{2}_{1}\frac{\pi^{2}}{6}<+\infty.

Then, we prove that \sum_{1\leq i}\alpha_{t_{i}}\rightarrow+\infty always holds. Given a small positive number \mu and a positive integer N, let 0<\mu<\beta\cdot prob(\boldsymbol{s})\cdot\frac{\alpha_{1}}{2}. Here, \beta is the mapping lower bound, a small positive number, used to normalize the probability distribution \boldsymbol{\pi}(\boldsymbol{s}), and prob(\boldsymbol{s}) is the probability with which an agent observes state \boldsymbol{s}. As \pi(\boldsymbol{s},a)\geq\beta>0, there is at least one integer n>N such that in the (n+1)-th, (n+2)-th, ..., (n+n)-th time steps, the number of times that action a is performed in state \boldsymbol{s} is greater than or equal to \beta\cdot prob(\boldsymbol{s})\cdot n. Therefore, \sum_{n<t_{i}\leq 2n}\alpha_{t_{i}}=\sum_{n<t_{i}\leq 2n}\frac{\alpha_{1}}{t_{i}}\geq\sum_{n<t_{i}\leq 2n}\frac{\alpha_{1}}{2n}\geq\beta\cdot prob(\boldsymbol{s})\cdot n\cdot\frac{\alpha_{1}}{2n}=\beta\cdot prob(\boldsymbol{s})\cdot\frac{\alpha_{1}}{2}>\mu. Based on Lemma 4, as \sum_{n<t_{i}\leq 2n}\alpha_{t_{i}}>\mu, the series \sum_{i}\alpha_{t_{i}} does not converge, which means that \sum_{i}\alpha_{t_{i}}\rightarrow+\infty.

Because $\sum_{1\leq i}\alpha^2_{t_i}<+\infty$ and $\sum_i\alpha_{t_i}\rightarrow+\infty$, based on Lemma 5, it can be concluded that in Algorithm 1, as $i\rightarrow+\infty$, $Q(\boldsymbol{s},a)$ converges, with probability $1$, to the expected reward of the agent. ∎

Theorem 4 theoretically proves the convergence of our method, which means that agents using our method eventually obtain their expected rewards.
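To numerically illustrate the two conditions of Lemma 5 under the learning-rate schedule of Algorithm 1 ($\alpha_t=\frac{1}{t}\alpha_1$), the following short Python sketch is our own illustration and not part of Algorithm 1: the partial sums of $\alpha^2_t$ approach $\alpha^2_1\frac{\pi^2}{6}$, while the partial sums of $\alpha_t$ keep growing.

import math

# Learning-rate schedule from Algorithm 1: alpha_{t+1} = t/(t+1) * alpha_t, i.e., alpha_t = alpha_1 / t.
alpha_1 = 0.2
T = 1_000_000  # illustrative horizon

sum_alpha = 0.0     # partial sum of alpha_t   -> grows without bound
sum_alpha_sq = 0.0  # partial sum of alpha_t^2 -> approaches alpha_1^2 * pi^2 / 6
for t in range(1, T + 1):
    a_t = alpha_1 / t
    sum_alpha += a_t
    sum_alpha_sq += a_t ** 2

print(f"sum of alpha_t   up to T={T}: {sum_alpha:.4f}")
print(f"sum of alpha_t^2 up to T={T}: {sum_alpha_sq:.6f}")
print(f"alpha_1^2 * pi^2 / 6        : {alpha_1 ** 2 * math.pi ** 2 / 6:.6f}")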

6 Experiments

6.1 Experimental setup

Two experimental scenarios are used to evaluate the proposed differential advising method, denoted as DA-RL (differential advising reinforcement learning).

6.1.1 Scenario 1

The first scenario is the multi-robot problem [26] shown in Fig. 1. Each agent represents a robot which has four actions: up, down, left and right. An observation of an agent is interpreted as its state, which consists of 8 dimensions. The aim of the agents is to achieve the targets on the map, which could be victims in search and rescue or rubbish in office cleaning. The reward for achieving a target is set to 10. The reward for hitting an obstacle is set to -5. The reward for moving a step is 0. The reward in each situation is pre-defined as knowledge given to each agent. For example, when an agent achieves a target, its reward is automatically increased by 10, while when the agent hits an obstacle, its reward is automatically reduced by 5. Moreover, two agents cannot occupy the same cell at the same time step.
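As a minimal sketch of this reward structure (our own illustration; the function and argument names are hypothetical and not part of the cited environment), the per-step reward can be written as:

from typing import Tuple

# A state is an 8-dimensional observation of the surrounding cells: 0 = empty, 1 = obstacle.
State = Tuple[int, int, int, int, int, int, int, int]

def step_reward(achieved_target: bool, hit_obstacle: bool) -> int:
    """Per-step reward in Scenario 1: +10 for achieving a target,
    -5 for hitting an obstacle, and 0 for an ordinary move."""
    if achieved_target:
        return 10
    if hit_obstacle:
        return -5
    return 0

s: State = (0, 1, 0, 0, 1, 0, 0, 0)  # an arbitrary example observation
print(step_reward(achieved_target=True, hit_obstacle=False))   # 10
print(step_reward(achieved_target=False, hit_obstacle=True))   # -5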

The first scenario comprises three settings. Setting 1 (Static): agents achieve fixed targets. Setting 2 (Dynamic 1): agents achieve dynamically generated targets. Setting 3 (Dynamic 2): agents achieve dynamically moving targets.

The size of the environments varies from 12×8 cells to 24×16 cells. The number of agents varies from 2 to 6. The number of targets varies from 20 to 60. The number of obstacles varies from 15 to 35.

In this scenario, three evaluation metrics are used.

Metric 1 (Average time): the average number of time steps needed to achieve all targets in one learning round. This metric is specific to the first and third settings: Static and Dynamic 2. Here, a learning round is a period during which the agents achieve all the targets.

Metric 2 (Average time for one target): the average number of time steps needed to achieve one target in one learning round. This metric is specific to the second setting: Dynamic 1. The reason for using different metrics in different settings is given at the end of this section.

Metric 3 (Average hit): the average number of obstacle hits in one learning round.

6.1.2 Scenario 2

This scenario is the multi-agent load-balancing problem [27]. Each agent represents a factory. Each factory has a backlog which stores several types of items. Let the number of item types be $k$ and the maximum stock for each type $i$ be $M_i$. A state of an agent is the current number of items of each type in stock, so a state consists of $k$ dimensions.

At each time step, a random item from each agent is processed with a given probability $p_s$, and a new item arrives at each agent with another given probability $p_a$. The actions of an agent include whether or not to pass an arrived item to another agent. The reward of an agent $A$ is based on its backlog: $r_A=50-\sum_{1\leq i\leq k}w_i\cdot m_i$, where $w_i$ is the importance weight of items of type $i$ and $m_i$ is the current number of items of type $i$. In addition, if agent $A$ passes an item of type $i$ to agent $B$, agent $A$'s reward is reduced by $2\cdot w_i$ as a redistribution cost. The aim of the agents is to maximize their accumulated rewards.
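As a minimal sketch of this reward (our own illustration; the function name and arguments are hypothetical), the backlog-based reward with the redistribution cost can be computed as:

from typing import Sequence

def backlog_reward(weights: Sequence[float],
                   stock: Sequence[int],
                   passed_types: Sequence[int] = ()) -> float:
    """Reward of one agent in Scenario 2: r_A = 50 - sum_i w_i * m_i,
    reduced by 2 * w_i for every item of type i passed to another agent."""
    r = 50.0 - sum(w * m for w, m in zip(weights, stock))
    r -= sum(2.0 * weights[i] for i in passed_types)
    return r

# Example with k = 3 item types and weights 5, 4, 3 (as in the settings below):
print(backlog_reward([5, 4, 3], [1, 2, 0]))                    # 50 - 13 = 37.0
print(backlog_reward([5, 4, 3], [1, 2, 0], passed_types=[0]))  # 37 - 2*5 = 27.0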

The number of agents varies from 2 to 6. The number of item types varies from 2 to 6, where for two types, the weights are set to 5 and 4; for three types, the weights are set to 5, 4 and 3; and so on. The maximum stock for each type of item varies from 3 to 7. The probability of processing an item, $p_s$, varies from 0.2 to 0.6. The probability of an item arriving, $p_a$, is set to 0.4.

The evaluation metric is the average reward of all the agents in one learning round. Here, a learning round means a pre-defined number of learning iterations.

6.1.3 The methods for comparison

In this experiment, two methods are involved for comparison. The first method is a regular reinforcement learning method, denoted as RL, which does not include any advising. The RL method is used as a comparison baseline. The second method is from [2], denoted as SA-RL (simultaneous advising reinforcement learning). The SA-RL method is an Ad hoc TD advising method, where agents simultaneously and independently learn and share advice when needed. The SA-RL method is similar to our method. The differences are that 1) the SA-RL method allows agents to offer advice only in identical states; and 2) agents in the SA-RL method transfer recommended actions as advice. Moreover, the SA-RL method involves two parameters, $v_a$ and $v_g$, used to control the probability of asking for and giving advice, respectively. A higher $v_a$ value results in a lower probability of asking for advice, while a higher $v_g$ value results in a higher probability of giving advice. The SA-RL method is used to demonstrate the effectiveness of the differential privacy technique on agent advising.
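The exact probability expressions belong to the SA-RL method in [2] and are not restated here. As a hedged sketch only, the snippet below assumes a visit-count-based form that matches the qualitative behaviour described above (a higher $v_a$ lowers the asking probability; a higher $v_g$ raises the giving probability); it is an assumption for illustration, not the definitive SA-RL implementation.

import math

def prob_ask(v_a: float, visits: int) -> float:
    # Assumed visit-count-based form: higher v_a or more visits -> lower asking probability.
    return (1.0 + v_a) ** (-math.sqrt(visits))

def prob_give(v_g: float, visits: int) -> float:
    # Assumed visit-count-based form: higher v_g or more visits -> higher giving probability.
    return 1.0 - (1.0 + v_g) ** (-math.sqrt(visits))

# With v_a = 0.4 and v_g = 0.9 (the values used in the experiments below):
print(round(prob_ask(0.4, visits=4), 3))     # ~0.510: a rarely visited state still triggers asking
print(round(prob_give(0.9, visits=100), 3))  # ~0.998: a well-visited adviser almost always gives advice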

The values of the parameters used in the experiments are set as follows: $\alpha=0.2$, $\gamma=0.8$, $\zeta=0.1$, $\epsilon=1$, $v_a=0.4$ and $v_g=0.9$. The values of these parameters are experimentally chosen to yield good results. Moreover, $\Delta Q=3$ in Scenario 1, while $\Delta Q=1$ in Scenario 2. The experimental results are obtained by averaging the results of 500 runs.
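For reproducibility, these settings can be gathered in a single configuration, as in the hypothetical sketch below; the dictionary and key names are our own, and the comments on $\alpha$ and $\gamma$ follow standard reinforcement learning notation rather than a definition given in this section.

PARAMS = {
    "alpha": 0.2,     # learning rate (alpha_1 in the convergence analysis)
    "gamma": 0.8,     # discount factor (standard RL notation)
    "zeta": 0.1,      # zeta parameter, defined earlier in the paper
    "epsilon": 1.0,   # epsilon parameter of the differential advising mechanism
    "v_a": 0.4,       # SA-RL: controls the probability of asking for advice
    "v_g": 0.9,       # SA-RL: controls the probability of giving advice
    "delta_Q": {"scenario_1": 3, "scenario_2": 1},
    "runs": 500,      # results are averaged over 500 runs
}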

6.2 Experimental results of scenario 1

6.2.1 The first setting: agents achieve targets

Figure 3: Performance of the three methods in different sizes of environments in the first setting. (a) Average time steps; (b) Average hits.

Fig. 3 demonstrates the performance of the three methods in different sizes of environments in the first setting. It can be seen that as the environmental size increases, in all three methods, agents take more steps to achieve targets and hit more obstacles. This is because with the increase of size, there are more obstacles in the environment, which inevitably makes it more difficult for agents to achieve targets.

In Fig. 3(a), when the environmental size is smaller than 18×12, the performance difference among the three methods is small. However, when the environmental size is larger than 18×12, the difference in average time steps grows to up to 20% between DA-RL and RL, and 10% between DA-RL and SA-RL, respectively. In addition, in Fig. 3(b), the difference in average hits grows to up to 25% between DA-RL and RL, and 15% between DA-RL and SA-RL, respectively. This is because when the environmental size increases, the number of targets and obstacles also increases. Thus, the number of states of each agent may increase as well. In this situation, since DA-RL and SA-RL enable agent advising, their performance is better than that of RL. Moreover, as DA-RL adopts the differential privacy technique, agents using DA-RL can take more advice than agents using SA-RL. Thus, the performance of DA-RL is better than that of SA-RL.

This result shows that by using agent advising, agents can ask for advice about the positions of obstacles and targets. Thus, agents’ learning performance can be improved. Moreover, in the DA-RL method, an agent’s advice, created in one state, can also be used by other agents in neighboring states, which further improves those agents’ learning performance.

Figure 4: Performance of the three methods as time progresses in the first setting. (a) Average time steps; (b) Average hits.

Fig. 4 shows the performance of the three methods as time progresses in the first setting. The size of the environment is set to 18×12; the number of agents is set to 4; the number of targets is set to 40; and the number of obstacles is set to 25. In Fig. 4(a), the DA-RL and SA-RL methods continuously outperform the RL method. Moreover, the DA-RL and SA-RL methods converge faster than the RL method. This can be explained by the fact that agents using the DA-RL and SA-RL methods can ask for advice between each other, which can significantly improve the learning performance, especially in early stages. In Fig. 4(b), the average number of hits is almost the same among the three methods in late stages. This is because the positions of obstacles are fixed and agents using these methods have the learning ability. As time progresses, the positions of obstacles are gradually memorized by the agents. Thus, the number of hits decreases and finally converges to a stable point.

6.2.2 The second setting: agents achieve dynamically generated targets

Figure 5: Performance of the three methods in different sizes of environments in the second setting. (a) Average time steps; (b) Average hits.

Fig. 5 demonstrates the performance of the three methods in different sizes of environments in the second setting. It can be seen that the proposed DA-RL method outperforms the other two methods.

Compared to the first setting (Fig. 3(a)), all three methods need more time steps to achieve targets in the second setting (Fig. 5(a)). This is because in the second setting, new targets are dynamically introduced, and achieving these new targets inevitably takes more time steps.

The average number of hits in the three methods in the second setting (Fig. 5(b)) is almost the same as that in the first setting (Fig. 3(b)). In both the first and second settings, the number and positions of obstacles are fixed. Therefore, the positions of obstacles can be learned by the agents, which can then avoid the obstacles to some extent.

Figure 6: Performance of the three methods as time progresses in the second setting. (a) Average time steps; (b) Average hits.

Fig. 6 shows the performance of the three methods as time progresses in the second setting. The size of the environment is set to 18×12; the number of agents is set to 4; the number of targets is set to 40; and the number of obstacles is set to 25. In Fig. 6(a), as time progresses, the average number of time steps for one target in the three methods decreases with fluctuations. The fluctuations are caused by the introduction of new targets. The introduction of new targets implies that new knowledge is introduced into the environment. Thus, agents need time to learn the new knowledge, which results in temporary increases in the number of time steps. In Fig. 6(b), as time progresses, the average number of hits gradually decreases in the three methods, because the positions of obstacles can be learned and thus the obstacles can be avoided.

6.2.3 The third setting: agents achieve dynamically moving targets

Figure 7: Performance of the three methods in different sizes of environments in the third setting. (a) Average time steps; (b) Average hits.

Fig. 7 demonstrates the performance of the three methods in different sizes of environments in the third setting. Again, the proposed DA-RL method outperforms the other two methods. Comparing Fig. 7 and Fig. 5, the performance trend of the three methods in the third setting is similar to that in the second setting. This is because both the second and third settings are dynamic, which means that agents need more time steps to achieve targets than in the first setting.

Figure 8: Performance of the three methods as time progresses in the third setting. (a) Average time steps; (b) Average hits.

Fig. 8 shows the performance of the three methods as time progresses in the third setting. The size of the environment is set to 18×12; the number of agents is set to 4; the number of targets is set to 40; and the number of obstacles is set to 25. Unlike the second setting (Fig. 6(a)), there is no fluctuation in the three methods in the third setting (Fig. 8(a)). This is because in the third setting, although targets are moving, the number of targets does not increase. For example, suppose there are 3 targets in an environment. In the third setting, the experiment finishes when all 3 targets are achieved. In the second setting, however, new targets may be introduced, and the experiment does not finish until the 3 original targets and all the new targets are achieved. If new targets are generated constantly, the experiment is hard to finish. This means that in the second setting, the completion of an experiment heavily depends on the generation probability of new targets. Hence, there is more fluctuation in the second setting than in the third setting.

6.3 Experimental results of scenario 2

Figure 9: Performance of the three methods. (a) Average reward with different numbers of agents; (b) Average reward with different numbers of item types; (c) Average reward with different sizes of stock; (d) Average reward with different processing probabilities.

Fig. 9(a) demonstrates the performance of the three methods with different numbers of agents, where the number of item types is set to 3, the maximum stock of each item type is set to 5, and the probability of processing an item is set to 0.5. In Fig. 9(a), with the increase of the number of agents, the average reward of agents in the three methods also rises. When the number of agents increases, each agent can ask for advice from more agents. Thus, agents can learn faster and receive more reward. However, it should also be noted that when the number of agents is larger than 5, the average reward remains almost steady. This can be explained by the fact that in a given environment, the amount of knowledge is limited. Agents can share knowledge but cannot create new knowledge. When the number of agents is large enough to guarantee a good learning speed, further increasing the number of agents does not improve agents' learning speed or increase agents' reward.

Fig. 9(b) shows the performance of the three methods with different numbers of item types, where the number of agents is set to 4, the maximum stock of each item type is set to 5, and the probability of processing an item is set to 0.5. In Fig. 9(b), with the increase of the number of item types, the average reward of agents in the three methods reduces gradually. When the number of item types increases, the dimension of each agent's states also increases. This decreases the learning speed of the agents, which reduces their average reward, as agents have more opportunity to make non-optimal decisions.

Fig. 9(c) displays the performance of the three methods with different maximum stocks, where the number of agents is set to 4, the number of item types is set to 3, and the probability of processing an item is set to 0.5. In Fig. 9(c), with the increase of the maximum stock, the average reward of agents in the three methods increases gradually. When the stock capacity of agents increases, agents prefer to store new items instead of redistributing them. This preference increases agents' average reward, because, based on the experimental setting, storing an item yields more reward than redistributing it.

Fig. 9(d) exhibits the performance of the three methods with different probabilities of processing an item, where the number of agents is set to 4, the number of item types is set to 4, and the maximum stock of each item type is set to 5. In Fig. 9(d), with the increase of the processing probability, the average reward of agents in the three methods rises. When the processing probability increases, items are processed faster and hence, the stock amount of each agent reduces. This improves the average reward of agents, as the reward is based on the stock amount in the experimental setting. However, it should also be noted that when the processing probability is too large, e.g., larger than 0.6, the difference between the three methods shrinks. As mentioned before, a large processing probability implies a small stock amount. Since the state of an agent is based on the stock amount, i.e., the number of items of each item type, a small stock amount means a small number of states. Because each agent observes only a small number of states, after a short learning process, each agent is confident in these states and does not ask for advice any more. If agents do not ask for advice, the DA-RL and SA-RL methods are the same as the RL method. Hence, they have similar results.

In all four situations, the proposed DA-RL method outperforms the SA-RL and RL methods, though the performance of the three methods has a similar trend. According to the experimental results, the advantage of differential advising has been empirically demonstrated.

Figure 10: Performance of the three methods as learning progresses. (a) Average reward as learning progresses in a simple environment; (b) Average reward as learning progresses in a complex environment.

Fig. 10 demonstrates the performance of the three methods as learning progresses. Fig. 10(a) shows the results in a simple environment. The number of agents is set to 3, the number of item types is set to 2, the maximum stock of each item type is set to 3, and the probability of processing an item is set to 0.5. Fig. 10(b) shows the results in a complex environment. The number of agents is set to 3, the number of item types is set to 6, the maximum stock of each item type is set to 7, and the probability of processing an item is set to 0.5.

In Fig. 10(a), the three methods converge in very early stages (around 200 learning iterations). In comparison, in Fig. 10(b), the three methods converge much more slowly: DA-RL converges at around 500 learning iterations, SA-RL at around 600 iterations, and RL at around 900 iterations. This can be explained by the fact that the number of states in the complex environment is much larger than that in the simple environment. As more states typically imply more knowledge, agents have to take more learning iterations to learn that knowledge. Therefore, the three methods converge more slowly in the complex environment than in the simple environment. However, in the complex environment, the proposed DA-RL method converges faster than the SA-RL and RL methods. This result demonstrates that by using differential advising, agents in the proposed DA-RL method learn faster than in the other two methods.

6.4 Discussion and summary

6.4.1 The metric used in the second setting of the first scenario

In the second setting of the first scenario (Fig. 6(a)), we use the metric of the average number of time steps for one target, instead of the metric of the average number of time steps for all targets, which is used in the first and third settings.

The average number of time steps for all targets in the second setting does not converge as time progresses in the three methods. This is because agents in the three methods have learning ability. The aim of learning is to maximize agents' long-term accumulated rewards. In the experiments, an agent can receive a positive reward only when it achieves a target. In the second setting, new targets are introduced dynamically. This motivates agents to stay in the experiment to keep achieving targets so as to increase their accumulated rewards. Therefore, the average number of time steps for achieving all the targets in the second setting increases gradually. However, this does not mean that the three methods do not converge in other respects in the second setting. Although the number of time steps in the three methods increases, the number of achieved targets also increases. Since the average number of time steps for one target is the ratio between the total number of time steps and the total number of achieved targets, the increased number of achieved targets offsets the increased number of time steps. Thus, the average number of time steps for one target still converges (Fig. 6(a)).
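As a tiny numerical illustration of this ratio argument (the running totals below are made-up numbers, not experimental data), the per-target metric stabilises even though both totals keep growing:

# Average time for one target = cumulative time steps / cumulative achieved targets.
total_steps   = [100, 220, 330, 450, 560]   # hypothetical cumulative time steps
total_targets = [10, 21, 32, 44, 55]        # hypothetical cumulative achieved targets

for steps, targets in zip(total_steps, total_targets):
    print(f"{steps:4d} steps / {targets:3d} targets = {steps / targets:.2f} steps per target")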

6.4.2 Performance of the three methods

According to the experimental results in the first scenario, the average number of time steps used by the DA-RL method is smaller than that of the other two methods in all three settings: about 15% to 20% and 40% fewer than the SA-RL and RL methods, respectively. Moreover, DA-RL converges about 10% and 20% faster than the SA-RL and RL methods, respectively. In the second scenario, the average reward of agents in the DA-RL method is about 10% to 15% and 20% to 30% higher than in the SA-RL and RL methods, respectively. In addition, the DA-RL method converges about 15% and 40% faster than the SA-RL and RL methods, respectively. In summary, the proposed DA-RL method outperforms the other two methods in both scenarios.

7 Conclusion and Future Work

This paper proposed a differential advising method for multi-agent reinforcement learning. The method is the first to allow agents in one state to use advice created in a different state. It is also the first to take advantage of the differential privacy technique for agent advising to improve learning performance instead of preserving privacy. Experimental results demonstrate that our method outperforms the benchmark methods in various aspects.

In this paper, we assume that in two similar states, the applicability and adequacy of the actions to take are similar. This assumption, however, does not hold in some situations. For example, a good action to take on a rainy day may be a bad action on a sunny day. In the future, we will relax this assumption by assigning weights to the dimensions of states. Moreover, our method mainly focuses on discrete states. It would be interesting to extend our method to continuous environments; an intuitive way to do so is to discretize continuous states into discrete states. Finally, extending our method to multi-agent deep reinforcement learning is also interesting future work. A potential way is to allow agents to transfer experience samples as advice. However, if the number of transferred samples is large, how to select high-quality samples is a challenge for the adviser agent.

References

  • [1] F. L. D. Silva, R. Glatt, and A. H. R. Costa, “MOO-MDP: An Object-Oriented Representation for Cooperative Multiagent Reinforcement Learning,” IEEE Trans. on Cybernetics, vol. 49, pp. 567–579, 2019.
  • [2] F. L. D. Silva, R. Glatt, and A. H. R. Costa, “Simultaneously Learning and Advising in Multiagent Reinforcement Learning,” in Proc. of AAMAS, Brazil, May 2017, pp. 1100–1108.
  • [3] F. L. D. Silva, M. E. Taylor, and A. H. R. Costa, “Autonomously Reusing Knowledge in Multiagent Reinforcement Learning,” in Proc. of IJCAI 2018, Sweden, July 2018, pp. 5487–5493.
  • [4] F. L. D. Silva, “Integrating Agent Advice and Previous Task Solutions in Multiagent Reinforcement Learning,” in AAMAS, 2019, pp. 2447–2448.
  • [5] F. L. D. Silva and A. H. R. Costa, “A Survey on Transfer Learning for Multiagent Reinforcement Learning Systems,” Journal of Artificial Intelligence Research, vol. 64, pp. 645–703, 2019.
  • [6] D. Ye, T. Zhu, W. Zhou, and P. S. Yu, “Differentially Private Malicious Agent Avoidance in Multiagent Advising Learning,” IEEE Transactions on Cybernetics, vol. 50, no. 10, pp. 4214–4227, 2020.
  • [7] C. Dwork, “Differential privacy,” in Proc. of ICALP, 2006, pp. 1–12.
  • [8] T. Zhu, G. Li, W. Zhou, and P. S. Yu, “Differentially private data publishing and analysis: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1619–1638, 2017.
  • [9] L. Torrey and M. E. Taylor, “Teaching on A Budget: Agents Advising Agents in Reinforcement Learning,” in AAMAS, 2013, pp. 1053–1060.
  • [10] O. Amir, E. Kamar, A. Kolobov, and B. Grosz, “Interactive Teaching Strategies for Agent Training,” in Proc. of IJCAI, 2016, pp. 804–811.
  • [11] Y. Wang, W. Lu, J. Hao, J. Wei, and H. Leung, “Efficient Convention Emergence through Decoupled Reinforcement Social Learning with Teacher-Student Mechanism,” in Proc. of AAMAS, 2018, pp. 795–803.
  • [12] S. Omidshafiei, D. Kim, M. Liu, G. Tesauro, M. Riemer, C. Amato, M. Campbell, and J. P. How, “Learning to Teach in Cooperative Multiagent Reinforcement Learning,” in AAAI, 2019, pp. 6128–6136.
  • [13] C. Zhu, H. Leung, S. Hu, and Y. Cai, “A Q-values Sharing Framework for Multiple Independent Q-learners,” in AAMAS, 2019, pp. 2324–2326.
  • [14] E. Ilhan, J. Gow, and D. Perez-Liebana, “Teaching on a Budget in Multi-Agent Deep Reinforcement Learning,” in Proc. of IEEE Conference on Games, 2019.
  • [15] A. Fachantidis, M. E. Taylor, and I. Vlahavas, “Learning to Teach Reinforcement Learning Agents,” ML and Knowledge Extraction, vol. 1, pp. 21–42, 2018.
  • [16] V. Gupta, D. Anand, P. Paruchuri, and B. Ravindran, “Advice Replay Approach for Richer Knowledge Transfer in Teacher Student Framework,” in Proc. of AAMAS, 2019, pp. 1997–1999.
  • [17] L. Vanhee, L. Jeanpierre, and A.-I. Mouaddib, “Augmenting Markov Decision Processes with Advising,” in AAAI, 2019, pp. 2531–2538.
  • [18] P. S. Castro, “Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes,” in AAAI, 2020, pp. 10 069–10 076.
  • [19] C. J. Watkins, “Q-Learning,” Machine Learning, 1992.
  • [20] T. Zhu, D. Ye, W. Wang, W. Zhou, and P. S. Yu, “More Than Privacy: Applying Differential Privacy in Key Areas of Artificial Intelligence,” IEEE Transactions on Knowledge and Data Engineering, p. DOI: 10.1109/TKDE.2020.3014246, 2020.
  • [21] D. Ye, T. Zhu, S. Shen, W. Zhou, and P. S. Yu, “Differentially Private Multi-Agent Planning for Logistic-like Problems,” IEEE Transactions on Dependable and Secure Computing, p. DOI: 10.1109/TDSC.2020.3017497, 2020.
  • [22] D. Ye, T. Zhu, S. Shen, and W. Zhou, “A Differentially Private Game Theoretic Approach for Deceiving Cyber Adversaries,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 569–584, 2021.
  • [23] C. Dwork and A. Roth, “The Algorithmic Foundations of Differential Privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
  • [24] S. Kasiviswanathan, H. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?” in Proc. of Annual IEEE Symposium on Foundations of Computer Science, 2008, pp. 531–540.
  • [25] H. Anton, I. Bivens, and S. Davis, Calculus.   Wiley, 2012.
  • [26] Y. Wang and C. W. D. Silva, “A Machine-Learning Approach to Multi-Robot Coordination,” Engineering Applications of Artificial Intelligence, vol. 21, no. 3, pp. 470–484, 2008.
  • [27] J. Sakuma, S. Kobayashi, and R. N. Wright, “Privacy-Preserving Reinforcement Learning,” in Proc. of ICML, 2008.