
Differential Advising in Multi-Agent Reinforcement Learning

Dayong Ye, Tianqing Zhu*, Zishuo Cheng, Wanlei Zhou and Philip S. Yu *Tianqing Zhu is the corresponding author. T. Zhu is with the School of Computer Science, China University of Geosciences, Wuhan, P. R. China. D. Ye, Z. Cheng and W. Zhou are with the Centre for Cyber Security and Privacy and with the School of Computer Science, University of Technology, Sydney, Australia. Philip S. Yu is with Department of Computer Science, University of Illinois at Chicago, USA. Email: [email protected], {Dayong.Ye, 13367166, Wanlei.Zhou}@uts.edu.au, [email protected]. This work is supported by National Natural Science Foundation of China (No.61972366), ARC LP170100123 and DP190100981, Australia, and NSF under grants III-1763325, III-1909323, and SaTC-1930941, USA.
Abstract

Agent advising is one of the main approaches to improving agent learning performance by enabling agents to share advice. Existing advising methods have a common limitation: an adviser agent can offer advice to an advisee agent only if the advice was created in the same state as the advisee's concerned state. In complex environments, however, requiring two states to be identical is a very strong requirement, because a state may consist of multiple dimensions and two states are the same only if all of these dimensions are correspondingly identical. This requirement may therefore limit the applicability of existing advising methods to complex environments. In this paper, inspired by the differential privacy scheme, we propose a differential advising method which relaxes this requirement by enabling agents to use advice in a state even if the advice was created in a slightly different state. Compared with existing methods, agents using the proposed method have more opportunities to take advice from others. This paper is the first to adopt the concept of differential privacy in advising to improve agent learning performance rather than to address security issues. The experimental results demonstrate that the proposed method is more efficient in complex environments than existing methods.

Keywords - Multi-Agent Reinforcement Learning, Agent Advising, Differential Privacy

1 Introduction

Multi-agent reinforcement learning (MARL) is one of the fundamental research topics in artificial intelligence [1]. In regular MARL methods, agents typically need a large number of interactions with the environment and other agents to learn proper behaviors. To improve agent learning speed, the agent advising technique was introduced [2, 3, 4], which enables agents to ask each other for advice.

Existing advising methods share a common limitation that an adviser agent can offer advice to an advisee agent only if the advice is created in the same state as the advisee agent's concerned state [5]. However, in complex environments, requiring two states to be the same is usually a strong requirement, because a state may be composed of multiple dimensions and two states being the same implies that all these dimensions in the two states are correspondingly identical. This requirement may hinder the application of existing methods to complex environments. For example, in a multi-robot search and rescue problem (Fig. 1), the aim of each robot is to search and rescue victims as quickly as possible. Each robot is a learning agent which can observe only its surrounding area. To improve robot learning performance, robots can ask each other for advice. In this problem, an observation of a robot is interpreted as a state of the robot. An observation consists of eight dimensions, where each dimension stands for a small cell around the robot. The format of an observation could be s=\langle 0,0,1,0,0,0,1,0\rangle, where 0 means an empty cell and 1 means an obstacle in a cell. A slightly different observation could be s^{\prime}=\langle 0,0,1,0,1,0,1,0\rangle, as the two states, s and s^{\prime}, differ only in the fifth dimension (see Definition 4 for more detail). In existing advising methods, when a robot i in state s asks for advice from another robot j, robot j can offer advice to robot i only if robot j has visited state s a number of times.

Figure 1: A multi-robot search and rescue example

In this paper, we relax this requirement and develop a differential advising method which enables agents to use advice in a state even if the advice is created in a slightly different state. Thus, in the above example, by using our method, even if robot j has never visited state s, robot j can still offer advice to robot i as long as robot j has visited a neighboring state of state s, e.g., state s^{\prime}. However, developing this differential advising method is a difficult task due to the following two challenges. First, how to measure and delimit the difference between two states is a challenge, because the amount of difference between two states may significantly affect the quality of advice. Second, how to use advice in a concerned state when the advice is created in another state is also a challenge, because improper advice may harm the learning performance of agents [6]. To address these two challenges, the differential privacy technique is applied.

Differential privacy is a privacy model which guarantees that a query to two slightly different datasets yields almost the same results [7, 8]. This means that the query result yielded from one dataset can be considered approximately identical to the query result yielded from the other dataset. This property of differential privacy can be carried over to our method. Specifically, two slightly different states are analogous to two slightly different datasets, and advice generated from a state is analogous to a result yielded from a dataset. Since two results from two slightly different datasets can be considered approximately identical, two pieces of advice generated from two slightly different states can also be considered approximately identical. This property guarantees that advice created in one state can still be used in another, slightly different state.

In summary, this paper makes two contributions. First, we are the first to develop a differential advising method which allows agents to use advice created in different states. This method enables agents to receive more advice than existing methods. Second, we are the first to take advantage of the differential privacy technique in agent advising to improve agent learning performance rather than to address security issues.

2 Related Work

Torrey and Taylor [9] introduced a teacher-student framework, where a teacher agent provides advice to a student agent to accelerate the student’s learning. Their framework is teacher-initiated, where the teacher determines when to give advice to the student. They developed a set of heuristic approaches for the teacher agent to decide when to provide advice. By extending Torrey and Taylor’s framework [9], Amir et al. [10] proposed an interactive student-teacher framework, where the student agent and the teacher agent jointly determine when to ask for/provide advice. They also developed a set of heuristic methods for both the student agent and the teacher agent to decide when to ask for/provide advice.

Later, Silva et al. [2] extended Amir et al.’s [10] work by taking simultaneity into consideration, where multiple agents simultaneously learn in an environment and an agent can be both a teacher and a student at the same time. The decisions regarding advice request and provision are based on a set of heuristic methods.

Wang et al. [11] proposed a teacher-student framework to accelerate the lexicon convergence speed among agents. Their framework is similar to Silva et al.’s framework [2]. The major difference is that, in Wang et al.’s framework, instead of broadcasting a request to all the neighboring agents, a student agent uses a multicast manner to probabilistically ask each neighboring agent for advice. The asking probability is based on the distance between the student and the teacher.

Ye et al. [6] also proposed an agent advising approach in a simultaneous learning environment. Unlike other advising methods which consider only benign and cooperative agents, Ye et al. introduced malicious agents into the environment, which provide false advice to hurt the learning performance of other agents. Ye et al. then used the differential privacy technique to preserve the learning performance of benign agents and reduce the impact of malicious agents.

Omidshafiei et al. [12] presented an advising framework. Unlike existing methods which use only heuristic methods to decide when to advise, their framework enables agents to learn when to teach/request and what to teach. Specifically, an advisee agent has a student policy to decide when to request advice using advising-level observation. The advising-level observation is based on the advisee agent's task-level observation and action-value vectors. Similarly, an adviser agent has a teacher policy to decide when and what to advise using advising-level observation. This advising-level observation is based on 1) the adviser agent's own task-level knowledge in the advisee agent's state, and 2) the adviser agent's measure of the advisee agent's task-level state and knowledge.

Zhu et al. [13] developed a Q-value sharing framework named PSAF (partaker-sharer advising framework). Unlike other advising methods where agents transfer recommended actions as advice, in PSAF, agents transfer Q-values as advice. Using Q-values as advice is more flexible than using actions as advice, because the policies of agents may be continuously changing and thus an advisee agent's learning performance may be impaired by following an adviser's recommended action. By contrast, if the advice is a Q-value, the advisee can update its policy using this Q-value and then use its own action selection method to pick an action.

Ilhan et al. [14] developed an action advising method in multi-agent deep reinforcement learning (MADRL). Unlike regular MARL, in MADRL, the number of states may be very large or even infinite. Thus, it is infeasible to use the number of visits to a state to measure the confidence in that state as commonly used in existing methods. They, hence, adopt the random network distillation to measure the confidence in a state by measuring the mean squared error between two neural networks: the target network and the predictor network.

Unlike the above research which focuses on the advising process, some works have different focuses. Fachantidis et al. [15] studied the critical factors that affect advice quality. Gupta et al. [16] suggested that teacher agents should not only advise the action to take in a given state but also provide more informative advice using the synthesis of knowledge they have gained. They proposed an advising framework where a teacher augments a student’s knowledge by providing not only the advised action but also the expected long-term reward of following that action. Vanhee et al. [17] introduced Advice-MDPs which extend Markov decision processes for generating policies that take into account advising on the desirability, undesirability and prohibition of certain states and actions.

The above-mentioned works have a common limitation that advice can be offered only if it is created in the same state as the given state. This limitation may deter the application of these works to complex environments. State similarity has also been researched by Castro [18], who defined a bisimulation metric. However, Castro's research focuses on computing state similarity in deterministic MDPs rather than on agent advising. Also, the definition of bisimulation provided in [18] may not be applicable to our research, because that definition requires an agent to have full knowledge of the reward functions, the state space, and the state transition functions. In our research, the environments are partially observable and thus the full state space is unknown to any agent. Moreover, in our research, as multiple agents coexist in an environment, the state transition of an agent highly depends on the actions of other agents. Hence, state transition functions are hard to pre-define. In this paper, we propose a differential advising method which overcomes the common limitation of existing methods by taking advantage of differential privacy.

3 Preliminaries

3.1 Multi-agent reinforcement learning

Reinforcement learning is usually used to solve sequential decision-making problems. A sequential decision-making process can be formally modeled as a Markov decision process (MDP) [19]. An MDP is typically a tuple \langle S,A,T,R\rangle, where S is the set of states, A is a set of actions available to the agent, T is the transition function, and R is the reward function.

At each step, an agent observes the state s\in S of the environment, and selects an action a\in A based on its policy π, which is a probability distribution over the available actions. After performing the action, the agent receives a real-valued reward r and enters a new state s^{\prime}\in S. The agent then updates its policy π based on the reward r and the new state s^{\prime}. In this way, the agent gradually accumulates knowledge and improves its policy to maximize its accumulated long-term expected reward.

An MDP can be extended to model a multi-agent reinforcement learning process as a tuple \langle S,A_{1,...,n},T,R_{1,...,n}\rangle, where S is the Cartesian product of the sets of local states of each agent, S=S_{1}\times S_{2}\times\cdots\times S_{n}, A_{i} is the available action set of agent i, T is the transition function, and R_{i} is the reward function of agent i. In multi-agent reinforcement learning, state transitions are based on the joint actions of all the agents. Each agent's individual reward is also based on the joint actions of all the agents.

3.2 Differential privacy

Differential privacy is a prevalent privacy model and has been broadly applied to various applications [20, 21, 22]. Differential privacy can guarantee an individual's privacy independent of whether the individual is in or out of a dataset [7]. Two datasets D and D^{\prime} are neighboring datasets if they differ in at most one record. Let f be a query that maps dataset D to a k-dimensional vector in \mathbb{R}^{k}: f:D\rightarrow\mathbb{R}^{k}. The maximal difference in the results of query f is defined as the sensitivity Δf of the query, which determines how much perturbation is required for a privacy-preserving answer. The formal definition of sensitivity is given as follows.

Definition 1 (Sensitivity).

For a query f:D\rightarrow\mathbb{R}^{k}, the sensitivity of f is defined as

\Delta f=\max\limits_{||D-D^{\prime}||_{1}\leq 1}||f(D)-f(D^{\prime})||_{1}, (1)

where ||D-D^{\prime}||_{1}\leq 1 means that the two datasets D and D^{\prime} differ in at most one record, and ||f(D)-f(D^{\prime})||_{1}=\sum_{1\leq i\leq k}|f(D)_{i}-f(D^{\prime})_{i}|.
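As an illustration of Definition 1, the following minimal Python sketch (our own helper, not a library routine) evaluates the L1 change of a query over a supplied set of neighboring dataset pairs; in general the maximum is taken over all such pairs.

```python
import numpy as np

def l1_sensitivity(f, neighboring_pairs):
    """Largest L1 change of query f over the supplied neighboring dataset pairs (Equation 1)."""
    return max(
        float(np.sum(np.abs(np.asarray(f(D)) - np.asarray(f(D_prime)))))
        for D, D_prime in neighboring_pairs
    )

# A counting query changes by at most 1 when one record is added or removed.
count_query = lambda D: [len(D)]
print(l1_sensitivity(count_query, [([1, 2, 3], [1, 2])]))  # -> 1.0
```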

The aim of differential privacy is to mask the difference in the answer of query f between two neighboring datasets. To achieve this aim, differential privacy provides a randomized mechanism \mathcal{M} to access a dataset. In ϵ-differential privacy, the parameter ϵ is defined as the privacy budget, which controls the privacy guarantee level of mechanism \mathcal{M}. A smaller ϵ represents stronger privacy. The formal definition of differential privacy is presented as follows.

Definition 2 (ϵ-Differential Privacy).

A randomized mechanism \mathcal{M} gives ϵ-differential privacy if, for any pair of neighboring datasets D and D^{\prime}, and for every set of outcomes Ω, \mathcal{M} satisfies:

Pr[\mathcal{M}(D)\in\Omega]\leq\exp(\epsilon)\cdot Pr[\mathcal{M}(D^{\prime})\in\Omega] (2)

One of the prevalent differential privacy mechanisms is the Laplace mechanism. The Laplace mechanism adds Laplace noise to query results. We use Lap(b) to represent noise sampled from the Laplace distribution with scale b. The mechanism is described as follows.

Definition 3 (Laplace mechanism).

Given any function f:D\rightarrow\mathbb{R}^{k}, the Laplace mechanism \mathcal{M}_{L} is defined as

\mathcal{M}_{L}(D,f,\epsilon)=f(D)+(y_{1},...,y_{k}), (3)

where y_{1},...,y_{k} are random noise drawn from Lap(\frac{\Delta f}{\epsilon}).
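A minimal Python sketch of the Laplace mechanism, assuming NumPy; the function name and argument names are illustrative.

```python
import numpy as np

def laplace_mechanism(query_answer, sensitivity, epsilon, rng=None):
    """Perturb a k-dimensional query answer with Laplace noise of scale b = sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    answer = np.asarray(query_answer, dtype=float)
    scale = sensitivity / epsilon
    return answer + rng.laplace(loc=0.0, scale=scale, size=answer.shape)

# Example: a 4-dimensional answer released with sensitivity 1 and privacy budget 0.5.
noisy_answer = laplace_mechanism([3.0, 7.0, 2.0, 5.0], sensitivity=1.0, epsilon=0.5)
```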

3.3 Differential advising

Based on the definitions of differential privacy, we provide the definitions of differential advising. The aim of providing the definitions of differential advising is to link the properties of differential privacy to the properties of differential advising. This linkage can guarantee that the differential privacy mechanisms can be used as differential advising mechanisms, as later shown in Lemma 1 and Theorem 1 in Section 5.

The reason for applying differential privacy to advising is to ensure that the knowledge learned in a state can still be used in a different but similar state, i.e., a neighboring state (Definition 4). In previous work, knowledge can only be re-used in identical states. With the introduction of the ‘neighboring state’, a piece of knowledge has more chance of being applied by agents than in previous work. This idea is reasonable, as we humans can easily borrow knowledge from a similar environment.

To achieve this goal, we can consider acquiring knowledge, i.e., asking for advice, as querying a dataset, and consider the Q-values of state/action pairs as constituting a dataset, i.e., a Q-table. Then, the differential advising process can be simulated as a differentially private query process. To implement differential advising, we apply the Laplace mechanism to mask the difference between the knowledge learned in two neighboring states. Specifically, a Q-table, shown below, is a two-dimensional matrix which stores the knowledge learned by an agent. Differential advising operates on Q-tables.

States/Actions  a_1  a_2  a_3  a_4
s_1  Q(s_1,a_1)  Q(s_1,a_2)  Q(s_1,a_3)  Q(s_1,a_4)
s_2  Q(s_2,a_1)  Q(s_2,a_2)  Q(s_2,a_3)  Q(s_2,a_4)
s_3  Q(s_3,a_1)  Q(s_3,a_2)  Q(s_3,a_3)  Q(s_3,a_4)

Without loss of generality, we assume that a state \boldsymbol{s} consists of m dimensions: \boldsymbol{s}=(s_{1},...,s_{m}). Each dimension s_{i} could be either an integer or a real number, representing discrete or continuous states. The difference between two states \boldsymbol{s} and \boldsymbol{s}^{\prime} is defined as:

Definition 4 (Difference between states).
||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}=\sum_{1\leq i\leq m}|s_{i}-s^{\prime}_{i}|. (4)

Two states \boldsymbol{s} and \boldsymbol{s}^{\prime} are neighboring, i.e., slightly different, if their difference is at most 1, i.e., ||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1. The difference threshold is set to 1 because this is the prerequisite for using differential privacy. As described in Definition 2, differential privacy works when two datasets differ in at most one record. In fact, Definition 4 is slightly different from the prerequisite of differential privacy. This is because data records in a dataset are discrete, and thus ||D-D^{\prime}||_{1}\leq 1 means that one record is different in the two datasets. In Definition 4, if states are discrete and each dimension is represented as an integer, ||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1 also means that one dimension is different in the two states, as in the multi-robot example in Section 1. If states are continuous, ||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1 could mean that the difference spans multiple dimensions and the sum of the differences is less than 1. We leave further research on continuous states as one of our future works.
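As a concrete illustration of Definition 4 and the neighboring-state check (assuming discrete, integer-valued dimensions as in the multi-robot example):

```python
import numpy as np

def state_difference(s, s_prime):
    """L1 difference between two m-dimensional states (Definition 4)."""
    return float(np.sum(np.abs(np.asarray(s) - np.asarray(s_prime))))

def are_neighboring(s, s_prime):
    """Two states are neighboring if their difference is at most 1."""
    return state_difference(s, s_prime) <= 1

s = [0, 0, 1, 0, 0, 0, 1, 0]        # observation from the multi-robot example
s_prime = [0, 0, 1, 0, 1, 0, 1, 0]  # differs only in the fifth dimension
assert are_neighboring(s, s_prime)
```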

Similarly, the advice sensitivity is defined as follows.

Definition 5 (Advice sensitivity).

For an advice generation function Q:S\rightarrow\mathbb{R}^{k}, where S is the state set and k is the number of actions, the sensitivity of Q is defined as

\Delta Q=\max\limits_{\boldsymbol{s},\boldsymbol{s}^{\prime}\in S\wedge||\boldsymbol{s}-\boldsymbol{s}^{\prime}||_{1}\leq 1}||Q(\boldsymbol{s})-Q(\boldsymbol{s}^{\prime})||_{1}. (5)

As advice takes the form of a Q-vector, for simplicity, we use the letter "Q" to represent both the advice generation function and the Q-function. Details will be given in the next section.

Definition 6 (ϵ-differential advising).

An advising method \mathcal{M}_{A} satisfies ϵ-differential advising if, for any pair of neighboring states \boldsymbol{s} and \boldsymbol{s}^{\prime}, and for every set of advice Ad, \mathcal{M}_{A} satisfies:

Pr[\mathcal{M}_{A}(\boldsymbol{s})\in Ad]\leq\exp(\epsilon)\cdot Pr[\mathcal{M}_{A}(\boldsymbol{s}^{\prime})\in Ad] (6)

The essence of differential privacy is to guarantee that the query results of two neighboring datasets have a very high probability of being the same. Similarly, in differential advising, we require that the query results of two neighboring Q-tables have a very high probability of being the same, so that an agent querying two neighboring Q-tables receives almost the same knowledge. Here, two Q-tables are neighboring if 1) they differ in one record and 2) the two different records correspond to two neighboring states. The query in differential advising is known as differentially private selection [23], which produces the best answer from a space of outcomes. To implement differential advising, we follow the spirit of the Laplace mechanism by adding Laplace noise to query results to mask the difference between two neighboring Q-tables. The details will be given in the next section.

Table I describes the notations and terms used in this paper.

TABLE I: The meaning of each notation
S: a set of states of the environment
A: a set of actions available to an agent
r: an immediate reward obtained by an agent for taking an action under an observation
Q(s,a): a reinforcement value for an agent to take action a in state s
π: a probability distribution over the available actions of an agent
α, ζ: learning rates, both of which are in (0,1)
γ: a discount factor which is in (0,1)
Δf: the sensitivity of a query
ΔQ: the advice sensitivity, i.e., the sensitivity of the advice generation function (Definition 5)
ϵ: privacy budget
b: the scale parameter in the Laplace mechanism
δ, β: parameters of the (δ,β)-usefulness guarantee (Definition 7)
n_{\boldsymbol{s}}: the number of times that an agent has visited a state \boldsymbol{s}
C: the communication budget of an agent
P_{ask}: the probability that an agent asks for advice
P_{give}: the probability that an agent provides advice

4 The Differential Advising Method

Our method is developed in a simultaneous learning framework, where agents learn simultaneously and can take both the roles of adviser and advisee. Each agent i has a communication budget C_{i} to control its communication overhead. Every time an agent i asks for or provides advice from/to another agent, agent i's communication budget is reduced by 1 until the budget C_{i} is used up. Here, we use a combined communication budget C_{i} for asking for and giving advice, instead of two separate budgets, because 1) using a combined budget is more suitable for real-world applications than using two separate budgets, and 2) using a combined budget simplifies the description of our method. For example, in wireless sensor networks, the communication budget of a sensor is based only on its battery power, irrespective of the communication type.

Each agent i has the following knowledge:

  1. its available actions in each state;

  2. the Q-value of each available action;

  3. the existence of its neighboring agents.

Moreover, we assume that 1) a slight change in a state will not significantly change the reward function of an agent, and 2) an agent in two neighboring states has the same action set.

4.1 Overview of the method

In our method, during advising, agents have to address the following three sub-problems:

  • whether to ask for advice;

  • whether to give advice;

  • how to use advice.

The overview of our method is outlined in Fig. 2, and the detail of our method is formally given in Algorithm 1. In summary, Algorithm 1 describes the workflow of our method, which consists of two parts: the advising part (Lines 1-15) and the learning part (Lines 16-23). The advising part of Algorithm 1 depicts the process of an advisee agent asking for and making use of advice. The detail of making use of advice is given in Algorithm 2, where the differential privacy mechanism is introduced. The learning part of Algorithm 1 is a regular reinforcement learning process.

Figure 2: The overview of agent i asking for advice from agent j

1  /* Take agent i at time step t as an example */
2  Observes the environment and receives state \boldsymbol{s}^{t};
3  Decides whether to ask for advice;
4  if agent i decides to ask for advice then
5      if agent i itself has a neighboring state \boldsymbol{s}^{\prime} then
6          Uses Algorithm 2 to obtain \boldsymbol{\pi}(\boldsymbol{s}^{t});
7          /* Internal advice from itself */
8      else
9          broadcasts a message to all neighboring agents;
10         Receives advice: \boldsymbol{Q}(\boldsymbol{s}^{\prime});
11         if \boldsymbol{s}^{\prime}=\boldsymbol{s}^{t} then
12             \boldsymbol{\pi}(\boldsymbol{s}^{t})\leftarrow\boldsymbol{\pi}(\boldsymbol{s}^{\prime});
13         else
14             Uses Algorithm 2 to obtain \boldsymbol{\pi}(\boldsymbol{s}^{t});
15             /* External advice from another agent */
16 Selects an action a_{l} based on the probability distribution \boldsymbol{\pi}(\boldsymbol{s}^{t})=(\pi(\boldsymbol{s}^{t},a_{1}),...,\pi(\boldsymbol{s}^{t},a_{k}));
17 r\leftarrow the reward for performing action a_{l} in state \boldsymbol{s}^{t};
18 Q(\boldsymbol{s}^{t},a_{l})\leftarrow(1-\alpha)Q(\boldsymbol{s}^{t},a_{l})+\alpha[r+\gamma\max_{a\in A_{i}}Q(\boldsymbol{s}^{t+1},a)];
19 \overline{r}\leftarrow\sum_{a\in A_{i}}\pi(\boldsymbol{s}^{t},a)Q(\boldsymbol{s}^{t},a), where A_{i} is the set of available actions of agent i in state \boldsymbol{s}^{t};
20 for each action a\in A_{i} do
21     \pi(\boldsymbol{s}^{t},a)\leftarrow\pi(\boldsymbol{s}^{t},a)+\zeta(Q(\boldsymbol{s}^{t},a)-\overline{r});
22 \boldsymbol{\pi}(\boldsymbol{s}^{t})\leftarrow Normalize(\boldsymbol{\pi}(\boldsymbol{s}^{t}));
23 \alpha\leftarrow\frac{t}{t+1}\alpha;
Algorithm 1: The complete advising process

In Lines 1-3, at time step t, agent i observes a state \boldsymbol{s}^{t} and decides whether to ask for advice from other agents (Section 4.2). Specifically, an advice is the Q-value vector of a state: \boldsymbol{Q}(\boldsymbol{s})=(Q(\boldsymbol{s},a_{1}),...,Q(\boldsymbol{s},a_{k})). In Lines 4 and 5, if agent i decides to ask for advice, agent i checks whether it has a neighboring state \boldsymbol{s}^{\prime} itself. If so, agent i takes \boldsymbol{Q}(\boldsymbol{s}^{\prime}) as self-advice. Agent i thereafter uses Algorithm 2, a differentially private algorithm, to process this advice and selects an action based on it (Line 6). Here, allowing an agent to query itself conserves its communication budget; this is called "self-advising". In Lines 8 and 9, if agent i has not visited any neighboring states, agent i sends a broadcast message to all communication-reachable, i.e., neighboring, agents and its communication budget C_{i} is reduced by 1. If any neighboring agent j decides to offer advice, agent j sends the advice back to agent i and agent j's communication budget C_{j} is reduced by 1 (Section 4.3).

In Lines 10 and 11, once agent i receives the advice from agent j, agent i checks whether this advice was created based on state \boldsymbol{s}^{t} or a neighboring state of \boldsymbol{s}^{t}. If the advice was created based on \boldsymbol{s}^{t}, agent i directly uses this advice to choose an action (Line 12). Otherwise, in Lines 13 and 14, agent i uses Algorithm 2 to modify this advice and then chooses an action by following the modified advice (Section 4.4). In the case that agent i receives advice from multiple agents, it selects the advice whose state has the minimum difference to agent i's current state (recall Equation 4). In the case that no advice is received or agent i decides not to ask for advice, agent i simply chooses an action based on its own experience (Lines 16-23). Specifically, Lines 16-23 describe a regular reinforcement learning process. In Line 16, agent i randomly selects an action based on the probability distribution \boldsymbol{\pi}(\boldsymbol{s}^{t}) over its available actions in state \boldsymbol{s}^{t}. Then, in Line 17, agent i performs the selected action and receives a reward r. In Line 18, agent i uses reward r to update the Q-value of the selected action. In Line 19, agent i computes an average reward by multiplying the probability distribution in state \boldsymbol{s}^{t} with the Q-values of the actions in state \boldsymbol{s}^{t}. After that, in Lines 20 and 21, agent i uses the average reward to update the probability distribution \boldsymbol{\pi}(\boldsymbol{s}^{t}). In Line 22, \boldsymbol{\pi}(\boldsymbol{s}^{t}) is normalized to be a valid probability distribution, where each element of \boldsymbol{\pi}(\boldsymbol{s}^{t}) is in (0,1) and the sum of all the elements is 1. In Line 23, the learning rate α is decayed to guarantee the convergence of the algorithm.
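For concreteness, a minimal tabular Python sketch of the learning part (Lines 16-23); the dictionary-of-arrays data structures and the small positive floor applied before normalization are our own illustrative choices, the latter mirroring the mapping lower bound β mentioned in Section 5.3.

```python
import numpy as np

def learning_step(Q, pi, s, s_next, reward_fn, alpha, gamma, zeta, t, rng):
    """One pass of Lines 16-23 of Algorithm 1. Q[s] and pi[s] are arrays over the k actions."""
    a = rng.choice(len(pi[s]), p=pi[s])                  # Line 16: sample an action from pi(s^t)
    r = reward_fn(s, a)                                  # Line 17: reward for the selected action
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * np.max(Q[s_next]))  # Line 18
    r_bar = float(np.dot(pi[s], Q[s]))                   # Line 19: average reward
    pi[s] = pi[s] + zeta * (Q[s] - r_bar)                # Lines 20-21: policy update
    pi[s] = np.clip(pi[s], 1e-6, None)                   # keep probabilities positive
    pi[s] = pi[s] / pi[s].sum()                          # Line 22: normalize
    alpha = t / (t + 1) * alpha                          # Line 23: decay the learning rate
    return alpha
```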

It has to be noted that in our method, an advice is a Q-vector instead of an action suggestion or a Q-function. This is because our method is differential advising rather than direct advising, and using Q-vectors enables advisee agents to properly modify the advice before they take it; modifying Q-vectors is easier and more reliable than modifying action suggestions or Q-functions. Moreover, using Q-vectors as advice does not mean that adviser agents must have a complete Q-table. As long as an adviser agent has the Q-values of the concerned state s (or of a neighboring state s^{\prime}), the adviser can offer advice.

4.2 Deciding whether to ask for advice

At each time step, an agent i observes a state \boldsymbol{s}. Agent i decides whether to ask for advice based on probability P_{ask} (Equation 7). A random number between 0 and 1 is generated and compared with P_{ask}. If the random number is less than P_{ask}, agent i asks for advice.

The calculation of P_{ask} is based on 1) agent i's confidence in the current state \boldsymbol{s}, and 2) agent i's remaining communication budget C^{\prime}_{i}. Agent i's confidence in state \boldsymbol{s} depends on how many times agent i has visited state \boldsymbol{s}, denoted as n^{i}_{\boldsymbol{s}}. A higher n^{i}_{\boldsymbol{s}} value means a higher confidence and results in a lower probability of asking for advice. Agent i's remaining communication budget C^{\prime}_{i} depends on how many times agent i has asked for or provided advice. Once agent i asks for advice, its communication budget is reduced by 1: C^{\prime}_{i}\leftarrow C^{\prime}_{i}-1. A lower remaining communication budget results in a lower probability of asking for advice. Formally, the probability function P_{ask} takes n^{i}_{\boldsymbol{s}} and C^{\prime}_{i} as input and outputs a probability in [0,1). The calculation of P_{ask} is shown in Equation 7, where the square root is used to reduce the decay speed of P_{ask}, and the positive integer N is a threshold used to prevent agent i from using up its communication budget in very early stages.

P_{ask}=\begin{cases}\frac{1}{\sqrt{n^{i}_{\boldsymbol{s}}}}\cdot\sqrt{\frac{C^{\prime}_{i}}{C_{i}}},&n^{i}_{\boldsymbol{s}}\geq N\\0,&n^{i}_{\boldsymbol{s}}<N\end{cases} (7)
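A direct Python transcription of Equation 7 (the threshold value N=5 below is illustrative only):

```python
import math

def p_ask(n_visits, budget_left, budget_total, N=5):
    """Probability that an agent asks for advice (Equation 7)."""
    if n_visits < N:
        return 0.0
    return (1.0 / math.sqrt(n_visits)) * math.sqrt(budget_left / budget_total)
```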

If agent i decides to ask for advice, agent i checks whether it has any neighboring states. This check is based on Equation 4, which computes the difference between the new state \boldsymbol{s} and the stored states of agent i. If the difference between \boldsymbol{s} and a stored state is less than or equal to 1, that stored state is a neighboring state of \boldsymbol{s}.

On one hand, if agent i has a set of neighboring states, agent i selects the neighboring state \boldsymbol{s}^{\prime} which has the minimum difference to its current state \boldsymbol{s}. Agent i utilizes the proposed differential advising method to modify \boldsymbol{Q}(\boldsymbol{s}^{\prime}) and uses the modified \boldsymbol{Q}(\boldsymbol{s}^{\prime}) to take an action (Section 4.4). On the other hand, if agent i does not have any neighboring states, it asks for advice from its neighboring agents (Section 4.3).

4.3 Deciding whether to give advice

When an agent j receives an advice request message from agent i, agent j needs to decide whether to provide advice to agent i. This decision is based on probability P_{give} (Equation 8). The calculation of P_{give} is based on 1) agent j's confidence in agent i's concerned state \boldsymbol{s} or in neighboring states of \boldsymbol{s}, and 2) agent j's remaining communication budget C^{\prime}_{j}. If agent j has more than one neighboring state, agent j uses the one with the minimum difference to state \boldsymbol{s}. The higher the confidence of agent j, the higher the probability that agent j will provide advice. The lower the remaining communication budget of agent j, the lower the probability that agent j will provide advice.

Formally, the probability function P_{give} takes n^{j}_{\boldsymbol{s}} and C^{\prime}_{j} as input and outputs a probability in [0,1). Here, for simplicity, we use n^{j}_{\boldsymbol{s}} to denote the number of times that agent j has visited either state \boldsymbol{s} or a neighboring state of \boldsymbol{s}, and C^{\prime}_{j} to denote the remaining communication budget of agent j. Once agent j gives advice, its remaining communication budget is reduced by 1: C^{\prime}_{j}\leftarrow C^{\prime}_{j}-1. It should be noted that agent j does not need to know the full state space to find neighboring states, because in some environments, such as the multi-robot example in Section 1, it is infeasible for any individual agent to know the full state space. Thus, agent j simply checks the states which it has visited. If none of the visited states is neighboring to the concerned state, agent j does not have experience related to the concerned state and should not offer any advice to agent i. Formally, the calculation of P_{give} is given in Equation 8.

P_{give}=\begin{cases}(1-\frac{1}{\sqrt{n^{j}_{\boldsymbol{s}}}})\cdot\sqrt{\frac{C^{\prime}_{j}}{C_{j}}},&n^{j}_{\boldsymbol{s}}\geq n^{i}_{\boldsymbol{s}}\\0,&n^{j}_{\boldsymbol{s}}<n^{i}_{\boldsymbol{s}}\end{cases} (8)

In Equation 8, we set a threshold n^{i}_{\boldsymbol{s}} for adviser agents. If an adviser agent has visited a state fewer than n^{i}_{\boldsymbol{s}} times, the adviser agent is less experienced than the advisee agent in state \boldsymbol{s} and thus is not allowed to offer advice to the advisee agent for that state. This setting is based on the fact that, in theory, an adviser agent can generate advice for any possible state, even one it has never visited. For example, an adviser agent could generate advice for a never-visited state using the initialized Q-values of the actions in that state. However, since the Q-value of each action can be randomly initialized, such advice is useless and may even harm the learning performance of the advisee.
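Analogously, a direct Python transcription of Equation 8:

```python
import math

def p_give(n_adviser, n_advisee, budget_left, budget_total):
    """Probability that an adviser gives advice (Equation 8); n_adviser counts visits
    to the concerned state or to its closest neighboring state."""
    if n_adviser < n_advisee:
        return 0.0
    return (1.0 - 1.0 / math.sqrt(n_adviser)) * math.sqrt(budget_left / budget_total)
```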

4.4 How to use advice

When agent i receives agent j's advice \boldsymbol{Q}(\boldsymbol{s}^{\prime}), agent i checks whether its concerned state \boldsymbol{s} is the same as state \boldsymbol{s}^{\prime}. The aim of this check is to decide whether this advice was created based on the same state (\boldsymbol{s}=\boldsymbol{s}^{\prime}) or a neighboring state (\boldsymbol{s}\neq\boldsymbol{s}^{\prime}). If \boldsymbol{s}=\boldsymbol{s}^{\prime}, agent i directly takes agent j's advice and selects an action based on it. If \boldsymbol{s}\neq\boldsymbol{s}^{\prime}, agent i uses the Laplace mechanism to add noise to agent j's advice \boldsymbol{Q}(\boldsymbol{s}^{\prime}), as shown in Algorithm 2. Here is an example. Agent i arrives at state \boldsymbol{s} and asks for advice from agent j. Agent j has visited three states, \boldsymbol{s}_{0},\boldsymbol{s}_{1},\boldsymbol{s}_{2}, and is confident in these states. After calculation, agent j finds that none of the three states is the same as state \boldsymbol{s} and that only \boldsymbol{s}_{0} is a neighboring state of \boldsymbol{s}. Thus, agent j sends its advice, the Q-vector \boldsymbol{Q}(\boldsymbol{s}_{0}), to agent i. After agent i receives this advice, agent i identifies that it was created based on a neighboring state \boldsymbol{s}_{0}. Agent i then adopts Algorithm 2 to add Laplace noise to the advice \boldsymbol{Q}(\boldsymbol{s}_{0}) and uses the modified advice to guide its action selection.

1 Input: \boldsymbol{Q}(\boldsymbol{s}^{\prime})=(Q(\boldsymbol{s}^{\prime},a_{1}),...,Q(\boldsymbol{s}^{\prime},a_{k}));
2 for each action a\in A_{i} do
3     adjust the Q-value Q(\boldsymbol{s}^{\prime},a) as: Q(\boldsymbol{s}^{\prime},a)\leftarrow Q(\boldsymbol{s}^{\prime},a)+Lap(\frac{\Delta Q}{\epsilon});
4 \overline{r}\leftarrow\sum_{a\in A_{i}}\pi(\boldsymbol{s}^{t},a)Q(\boldsymbol{s}^{\prime},a);
5 for each action a\in A_{i} do
6     \pi(\boldsymbol{s}^{t},a)\leftarrow\pi(\boldsymbol{s}^{t},a)+\zeta(Q(\boldsymbol{s}^{\prime},a)-\overline{r});
7 \boldsymbol{\pi}(\boldsymbol{s}^{t})\leftarrow Normalize(\boldsymbol{\pi}(\boldsymbol{s}^{t}));
8 Output: \pi(\boldsymbol{s}^{t},a_{1}),...,\pi(\boldsymbol{s}^{t},a_{k});
Algorithm 2: Differential advising

In Lines 2 and 3, agent j's Q-values are adjusted by adding Laplace noise. Specifically, Laplace noise is added to each of the Q-values: Q(\boldsymbol{s}^{\prime},a_{1})+Lap_{1},...,Q(\boldsymbol{s}^{\prime},a_{k})+Lap_{k}, where each noise term Lap_{i} is a random number drawn from the Laplace distribution. Based on the adjusted Q-values, in Lines 4-6, agent i computes its average reward in state \boldsymbol{s}^{t} and updates its probability distribution \pi(\boldsymbol{s}^{t},a). Note that the computation of the average reward multiplies the probability distribution in state \boldsymbol{s}^{t} with the Q-values of the available actions in state \boldsymbol{s}^{\prime}. As we assume that an agent has the same action set in two neighboring states, this multiplication is valid. In Line 7, \pi(\boldsymbol{s}^{t},a) is normalized to be a valid probability distribution.
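The following minimal Python sketch mirrors Algorithm 2, assuming NumPy; the small positive floor applied before normalization is our own illustrative choice.

```python
import numpy as np

def differential_advising(Q_advice, pi_current, zeta, delta_Q, epsilon, rng=None):
    """Perturb the received Q-vector with Laplace noise (Lines 2-3) and update the
    advisee's policy in its current state (Lines 4-7)."""
    rng = rng or np.random.default_rng()
    Q_advice = np.asarray(Q_advice, dtype=float)
    noisy_Q = Q_advice + rng.laplace(scale=delta_Q / epsilon, size=Q_advice.shape)
    r_bar = float(np.dot(pi_current, noisy_Q))           # Line 4: average reward
    pi_new = pi_current + zeta * (noisy_Q - r_bar)       # Lines 5-6: policy update
    pi_new = np.clip(pi_new, 1e-6, None)                 # keep probabilities positive
    return pi_new / pi_new.sum()                         # Line 7: normalize
```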

In the Laplace mechanism in Line 3, ΔQ is the sensitivity of the Q-values (recall Definition 5). Typically, ΔQ is based on Q-functions, and Q-functions are based on rewards. Thus, essentially, ΔQ is based on rewards, and we can use rewards to approximately compute ΔQ. Note that it is infeasible to accurately compute ΔQ before the learning starts, because the accurate computation of ΔQ needs the final Q-values of all the states, and these final Q-values can only be obtained when the learning finishes.

We use an example to demonstrate how to approximately compute ΔQ. Definition 5 states that ΔQ is the maximum difference between the Q-vectors of any two neighboring states. In the multi-robot problem exemplified in Section 1, for any two neighboring states, the maximum reward difference is 10-(-5)=15, given that the maximum reward 10 is for finding a victim and the minimum reward -5 is for hitting an obstacle. This is because in any two neighboring states, only one dimension, i.e., one cell, is different. This different cell incurs the maximum reward difference if the cell contains a victim in one state and an obstacle in the other. Moreover, a Q-value is usually updated with a learning rate α. Thus, in the multi-robot problem, ΔQ can be approximately computed as α\cdot 15. When we set α=0.2, ΔQ=3.
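The corresponding back-of-the-envelope computation, as a short Python snippet:

```python
alpha = 0.2                       # learning rate used in the example
max_reward_gap = 10 - (-5)        # victim (+10) versus obstacle (-5) in the differing cell
delta_Q = alpha * max_reward_gap  # approximately 3.0
```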

The rationale of Algorithm 2 is discussed as follows. According to Definition 3 in Section 3.2, the Laplace mechanism adds Laplace noise to the numerical output of a query f to a dataset D, so that both the privacy and the utility of the output can be guaranteed. Typically, a query f maps a dataset D to k real numbers: f:D\rightarrow\mathbb{R}^{k}. In comparison, in advising, the Q-table of an adviser agent is interpreted as a dataset, where each entry in the Q-table is the Q-value of a state-action pair, Q(s,a). An advice request from an advisee agent is interpreted as a query to the Q-table of the adviser agent. The output of the query is the advice of the adviser agent, which consists of k real numbers, Q(\boldsymbol{s},a_{1}),...,Q(\boldsymbol{s},a_{k}), where k is the number of actions in state \boldsymbol{s}. As discussed in Section 3.3, we have connected the properties of differential privacy with the properties of differential advising. Hence, by adding Laplace noise to Q(\boldsymbol{s},a_{1}),...,Q(\boldsymbol{s},a_{k}), both the validity and the utility of the advice can be guaranteed, as shown in Theorems 1, 2 and 3 in Section 5. Thus, in state \boldsymbol{s}, agent i can still use the advice \boldsymbol{Q}(\boldsymbol{s}^{\prime}), which is created based on a neighboring state \boldsymbol{s}^{\prime}.

5 Theoretical Analysis

5.1 Differential advising analysis

Lemma 1.

Any method which satisfies ϵ-differential privacy and meets the input and output requirements of \mathcal{M}_{A} in Definition 6 also satisfies ϵ-differential advising.

Proof.

In Definition 2, the input of \mathcal{M} can be any type of dataset, while the output can be any valid result. In Definition 6, the input of \mathcal{M}_{A} is a state \boldsymbol{s}, which is a vector, while the output is also a vector \boldsymbol{Q}(\boldsymbol{s}). Thus, Definition 6 can be considered a special case of Definition 2. Formally, let D\in\mathbb{N}^{|\mathcal{X}|} and \boldsymbol{s}\in S, where \mathbb{N}^{|\mathcal{X}|} is a data universe with |\mathcal{X}| attributes and S is a state space. As \mathbb{N}^{|\mathcal{X}|} is a data universe and can be arbitrarily large, we have S\subset\mathbb{N}^{|\mathcal{X}|}. Similarly, we also have Ad\subset\Omega. Therefore, any method which satisfies ϵ-differential privacy and meets the input and output requirements of \mathcal{M}_{A} also satisfies ϵ-differential advising. ∎

Lemma 1 creates a link between differential privacy and differential advising. Then, differential privacy mechanisms, e.g., the Laplace mechanism, can be used to implement differential advising.

Theorem 1.

The proposed method satisfies ϵ-differential advising.

Proof.

According to Lemma 1, to prove this theorem, we only need to prove that the proposed method satisfies ϵ-differential privacy. In the proposed method, Algorithm 2 utilizes a differential privacy mechanism, i.e., the Laplace mechanism, in Line 3. We re-write Line 3 in Q-vector form as \mathcal{M}_{a}(\boldsymbol{s},\boldsymbol{Q},\epsilon)=\boldsymbol{Q}(\boldsymbol{s})+(Lap_{1},...,Lap_{k}), where Lap_{1},...,Lap_{k} are random noise sampled from Lap(\frac{\Delta Q}{\epsilon}) and k is the number of actions in state \boldsymbol{s}. Since the Laplace mechanism is differentially private, by comparing Equation 3 with the re-written form of Line 3, we can conclude that \mathcal{M}_{a}(\boldsymbol{s},\boldsymbol{Q},\epsilon) satisfies ϵ-differential privacy. ∎

Theorem 1 theoretically proves that the proposed method is a valid differential advising method which enables agents to use advice created based on similar states. Moreover, the parameter ϵ is used to control the amount of noise added to the advice. The tuning of ϵ is left to users.

5.2 Average reward analysis

As shown in Line 4 of Algorithm 2, the average reward of an agent is \overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a). We will demonstrate that, with a very high probability, the average reward of an agent using differential advising is greater than the average reward without using differential advising.

Definition 7 ((δ,β)-useful).

A differential advising multi-agent system is (δ,β)-useful if for each agent we have Pr(\overline{r}^{\prime}-\overline{r}>\delta)<1-\beta, where \overline{r}^{\prime} and \overline{r} are the average rewards of an agent using and not using differential advising, respectively, and δ>0 and 0<β<1.

Theorem 2.

For any 0<β<1, with probability at least 1-β, the average reward of an agent using differential advising is greater than the average reward without using it. The difference is bounded by δ, which satisfies 0<\delta<-\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta), where t is the number of learning iterations.

Proof.

The average reward of an agent is \overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a). After applying the Laplace mechanism, the average reward becomes \overline{r}^{\prime}=\sum_{a\in A}\pi(\boldsymbol{s},a)(Q(\boldsymbol{s},a)+Lap(b)). Now, we need to prove that Pr(\overline{r}^{\prime}-\overline{r}>\delta)>1-\beta.

By calculation, we have \overline{r}^{\prime}-\overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)(Q(\boldsymbol{s},a)+Lap_{a}(b))-\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a)=\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b). Thus, Pr(\overline{r}^{\prime}-\overline{r}>\delta)=Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\delta). Because \sum_{a\in A}\pi(\boldsymbol{s},a)=1, we have Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\delta)=Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\sum_{a\in A}\pi(\boldsymbol{s},a)\delta). The property of Lap(b) is presented in Lemma 2.

Lemma 2.

(Laplace Random Variables). For every δ>0,

Pr(Lap(b)>\delta)=\frac{1}{2}\exp(-\frac{\delta}{b}). (9)
Proof.

The proof of this lemma can be found in [24]. ∎

Based on this property, we have Pr(\overline{r}^{\prime}-\overline{r}>\delta)=Pr(\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)>\sum_{a\in A}\pi(\boldsymbol{s},a)\delta)=\frac{1}{2}\exp(-\frac{\delta}{b}).

Lemma 2 gives the probability in one learning iteration. If an agent learns for t iterations, the probability becomes Pr(\overline{r}^{\prime}-\overline{r}>\delta)=[\frac{1}{2}\exp(-\frac{\delta}{b})]^{t}. To guarantee Pr(\overline{r}^{\prime}-\overline{r}>\delta)>1-\beta, we need [\frac{1}{2}\exp(-\frac{\delta}{b})]^{t}>1-\beta. After calculation, we have \delta<-b\cdot\frac{1}{t}\cdot\ln(2-2\beta). As b=\frac{\Delta Q}{\epsilon}, we have \delta<-\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta). ∎

In this proof, we use the average reward equation \overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a). This equation correctly estimates an agent's average reward only if the agent has a good estimation of the Q-table. Hence, this theorem holds in late learning stages and reveals the final results.

Remark 1: The upper bound of δ, -\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta), relies on t. Let \delta(t)=-\frac{\Delta Q}{\epsilon}\cdot\frac{1}{t}\cdot\ln(2-2\beta). Hence, δ(t) is monotonically decreasing as t increases. Specifically, when t\rightarrow 0, \delta(t)\rightarrow\infty, and when t\rightarrow\infty, \delta(t)\rightarrow 0. This means that in early stages, using differential advising can significantly increase agents' average rewards. However, as time progresses, the improvement in agents' average rewards decreases. This reflects the fact that in early stages, an agent is not knowledgeable, so other agents' knowledge can give it significant help. As time progresses, the agent accumulates enough knowledge, and other agents' knowledge becomes trivial to it.
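A small Python sketch of this monotone behavior, using illustrative values only (ΔQ=3, ϵ=1, β=0.9):

```python
import math

def delta_upper_bound(delta_Q, epsilon, t, beta):
    """Upper bound on delta in Theorem 2: -(Delta Q / epsilon) * (1/t) * ln(2 - 2*beta)."""
    return -(delta_Q / epsilon) * (1.0 / t) * math.log(2 - 2 * beta)

for t in (1, 10, 100):
    print(t, round(delta_upper_bound(3.0, 1.0, t, 0.9), 4))  # the bound shrinks as t grows
```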

Theorem 2 theoretically demonstrates the effectiveness of our method by analyzing the lower bound of an agent's average reward difference between using and not using our method. The analysis shows that the lower bound of the average reward difference is positive with a very high probability. This means that an agent using our method can, with a very high probability, receive a higher average reward than one not using our method. Given that the average rewards with and without the proposed method are \overline{r}^{\prime} and \overline{r}, respectively, the parameter β is used to measure the lower bound of the average reward difference \overline{r}^{\prime}-\overline{r}. In particular, when \beta\rightarrow 0, the lower bound is a small positive number; when \beta\rightarrow 1, the lower bound tends to +\infty. This means that the average reward difference \overline{r}^{\prime}-\overline{r} has a high probability, 1-\beta, of being a small positive number, and a low probability of being very large. Certainly, the average reward difference \overline{r}^{\prime}-\overline{r} cannot be +\infty. Thus, for the completeness of Theorem 2, Theorem 3 gives the upper bound of this difference as a supplement.

Theorem 3.

Let k be the number of actions in state \boldsymbol{s}. Let v\geq\sqrt{k}\frac{\Delta Q}{\epsilon} and 0<\lambda<\frac{2\sqrt{2}\epsilon v^{2}}{\Delta Q}. Then Pr(\overline{r}^{\prime}-\overline{r}>\lambda)\leq\exp(-\frac{t\lambda^{2}}{8v^{2}}), where t is the number of learning iterations.

Proof.

As shown in Theorem 2, we have \overline{r}^{\prime}-\overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)(Q(\boldsymbol{s},a)+Lap(b))-\sum_{a\in A}\pi(\boldsymbol{s},a)Q(\boldsymbol{s},a)=\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b). Since each \pi(\boldsymbol{s},a) is in [0,1], we have \overline{r}^{\prime}-\overline{r}=\sum_{a\in A}\pi(\boldsymbol{s},a)Lap_{a}(b)<\sum_{a\in A}Lap_{a}(b). Thus, we have Pr(\overline{r}^{\prime}-\overline{r}>\lambda)<Pr(\sum_{a\in A}Lap_{a}(b)>\lambda). The property of \sum_{a\in A}Lap_{a}(b) is given in Lemma 3.

Lemma 3.

(Sum of Laplace random variables) Let Lap_{1}(b),...,Lap_{k}(b) be k independent random variables with distribution Lap(b); then Pr(\sum_{a\in A}Lap_{a}(b)>\lambda)\leq\exp(-\frac{\lambda^{2}}{8v^{2}}).

Proof.

The proof of this lemma can be found in [23]. ∎

Lemma 3 gives the probability in one learning iteration. If an agent learns for t iterations, the probability becomes Pr(\overline{r}^{\prime}-\overline{r}>\lambda)\leq\exp(-\frac{t\lambda^{2}}{8v^{2}}). ∎

According to Theorems 2 and 3, the performance of the proposed method is affected by the advice sensitivity ΔQ, and a smaller ΔQ means better performance. Since the advice sensitivity ΔQ is based on the Q-values of states, we can conclude that the proposed method works well in environments where the Q-values of any pair of neighboring states have a small difference.

5.3 Convergence analysis

In Algorithm 1, let \alpha_{t} be the learning rate at time step t. Since \alpha_{t+1}=\frac{t}{t+1}\cdot\alpha_{t}=\frac{t}{t+1}\cdot\frac{t-1}{t}\cdot\alpha_{t-1}=..., we can conclude that \alpha_{t}=\frac{1}{t}\alpha_{1}. Let t_{i} be the index of the i-th time that action a is performed in state \boldsymbol{s}. For example, suppose that the 1st time action a is performed in state \boldsymbol{s} is at time step 6; then t_{1}=6. To prove the convergence of Algorithm 1, we need the results of Lemma 4 [25] and Lemma 5 [19].

Lemma 4.

A series, s_{1}+s_{2}+..., converges if and only if for any given small positive number \mu>0, there is always an integer N such that when n>N, for any positive integer m, the inequality |s_{n+1}+...+s_{n+m}|<\mu holds.

It should be noted that in Lemma 4, if |s_{n+1}+...+s_{n+m}|\geq\mu, the series s_{1}+s_{2}+... does not converge.

Lemma 5.

Given bounded rewards and learning rate 0\leq\alpha_{t_{i}}\leq 1, for any pair of state \boldsymbol{s} and action a, if the series \sum_{1\leq i}\alpha^{2}_{t_{i}} is bounded, i.e., \sum_{1\leq i}\alpha^{2}_{t_{i}}<+\infty, and the series \sum_{1\leq i}\alpha_{t_{i}} is unbounded, i.e., \sum_{1\leq i}\alpha_{t_{i}}\rightarrow+\infty, then, as i\rightarrow+\infty, Q(\boldsymbol{s},a) converges, with probability 1, to the optimal Q-value, i.e., the expected reward.

Theorem 4.

In Algorithm 1, for any pair of state \boldsymbol{s} and action a, as i\rightarrow+\infty, Q(\boldsymbol{s},a) converges, with probability 1, to the expected reward of an agent performing action a in state \boldsymbol{s}.

Proof.

Based on Lemma 5, to prove the convergence, we only need to prove that in Algorithm 1, \sum_{1\leq i}\alpha^{2}_{t_{i}}<+\infty and \sum_{1\leq i}\alpha_{t_{i}}\rightarrow+\infty always hold for any pair of state \boldsymbol{s} and action a.

First, we prove that \sum_{1\leq i}\alpha^{2}_{t_{i}}<+\infty always holds. It is known that \sum_{1\leq t}\frac{1}{t^{2}}=\frac{\pi^{2}}{6}, where t is a time step, a positive integer. Then, the following inequality can be concluded: \sum_{1\leq i}\alpha^{2}_{t_{i}}=\sum_{1\leq i}\frac{\alpha^{2}_{1}}{t^{2}_{i}}\leq\sum_{1\leq t}\frac{\alpha^{2}_{1}}{t^{2}}=\alpha^{2}_{1}\frac{\pi^{2}}{6}<+\infty.

Then, we prove that \sum_{1\leq i}\alpha_{t_{i}}\rightarrow+\infty always holds. Given a small positive number \mu and a positive integer N, let 0<\mu<\beta\cdot prob(\boldsymbol{s})\cdot\frac{\alpha_{1}}{2}. Here, \beta is the mapping lower bound, a small positive number, used to normalize the probability distribution \boldsymbol{\pi}(\boldsymbol{s}), and prob(\boldsymbol{s}) is the probability with which an agent observes state \boldsymbol{s}. As \pi(\boldsymbol{s},a)\geq\beta>0, there is at least one integer n>N such that in the (n+1)-th, (n+2)-th, ..., (n+n)-th time steps, the number of times that action a is performed in state \boldsymbol{s} is greater than or equal to \beta\cdot prob(\boldsymbol{s})\cdot n. Therefore, \sum_{n<t_{i}\leq 2n}\alpha_{t_{i}}=\sum_{n<t_{i}\leq 2n}\frac{\alpha_{1}}{t_{i}}\geq\sum_{n<t_{i}\leq 2n}\frac{\alpha_{1}}{2n}\geq\beta\cdot prob(\boldsymbol{s})\cdot n\cdot\frac{\alpha_{1}}{2n}=\beta\cdot prob(\boldsymbol{s})\cdot\frac{\alpha_{1}}{2}>\mu. Based on Lemma 4, as \sum_{n<t_{i}\leq 2n}\alpha_{t_{i}}>\mu, the series \sum_{i}\alpha_{t_{i}} does not converge, which means that \sum_{i}\alpha_{t_{i}}\rightarrow+\infty.

Because $\sum_{1\leq i}\alpha^2_{t_i}<+\infty$ and $\sum_i\alpha_{t_i}\rightarrow+\infty$, based on Lemma 5, it can be concluded that in Algorithm 1, as $i\rightarrow+\infty$, $Q(\boldsymbol{s},a)$ converges, with probability $1$, to the expected reward of the agent. ∎

Theorem 4 theoretically proves the convergence of our method, which means that agents using our method eventually obtain their expected rewards.
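To numerically illustrate the two conditions of Lemma 5 under the learning-rate schedule of Algorithm 1 ($\alpha_t=\frac{1}{t}\alpha_1$), the following short Python sketch is our own illustration and not part of Algorithm 1: the partial sums of $\alpha^2_t$ approach $\alpha^2_1\frac{\pi^2}{6}$, while the partial sums of $\alpha_t$ keep growing.

import math

# Learning-rate schedule from Algorithm 1: alpha_{t+1} = t/(t+1) * alpha_t, i.e., alpha_t = alpha_1 / t.
alpha_1 = 0.2
T = 1_000_000  # illustrative horizon

sum_alpha = 0.0     # partial sum of alpha_t   -> grows without bound
sum_alpha_sq = 0.0  # partial sum of alpha_t^2 -> approaches alpha_1^2 * pi^2 / 6
for t in range(1, T + 1):
    a_t = alpha_1 / t
    sum_alpha += a_t
    sum_alpha_sq += a_t ** 2

print(f"sum of alpha_t   up to T={T}: {sum_alpha:.4f}")
print(f"sum of alpha_t^2 up to T={T}: {sum_alpha_sq:.6f}")
print(f"alpha_1^2 * pi^2 / 6        : {alpha_1 ** 2 * math.pi ** 2 / 6:.6f}")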

6 Experiments

6.1 Experimental setup

Two experimental scenarios are used to evaluate the proposed differential advising method, denoted as DA-RL (differential advising reinforcement learning).

6.1.1 Scenario 1

The first scenario is the multi-robot problem [26] shown in Fig. 1. Each agent represents a robot which has four actions: up, down, left and right. An observation of an agent is interpreted as its state, which consists of 8 dimensions. The aim of the agents is to achieve the targets on the map, which could be victims in search and rescue or rubbish in office cleaning. The reward for achieving a target is set to 10. The reward for hitting an obstacle is set to -5. The reward for moving a step is 0. The reward in each situation is pre-defined as knowledge given to each agent. For example, when an agent achieves a target, its reward is automatically increased by 10, while when the agent hits an obstacle, its reward is automatically reduced by 5. Moreover, two agents cannot occupy the same cell at the same time step.
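As a minimal sketch of this reward structure (our own illustration; the function and argument names are hypothetical and not part of the cited environment), the per-step reward can be written as:

from typing import Tuple

# A state is an 8-dimensional observation of the surrounding cells: 0 = empty, 1 = obstacle.
State = Tuple[int, int, int, int, int, int, int, int]

def step_reward(achieved_target: bool, hit_obstacle: bool) -> int:
    """Per-step reward in Scenario 1: +10 for achieving a target,
    -5 for hitting an obstacle, and 0 for an ordinary move."""
    if achieved_target:
        return 10
    if hit_obstacle:
        return -5
    return 0

s: State = (0, 1, 0, 0, 1, 0, 0, 0)  # an arbitrary example observation
print(step_reward(achieved_target=True, hit_obstacle=False))   # 10
print(step_reward(achieved_target=False, hit_obstacle=True))   # -5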

The first scenario comprises three settings. Setting 1 (Static): agents achieve fixed targets. Setting 2 (Dynamic 1): agents achieve dynamically generated targets. Setting 3 (Dynamic 2): agents achieve dynamically moving targets.

The size of the environments varies from 12×8 cells to 24×16 cells. The number of agents varies from 2 to 6. The number of targets varies from 20 to 60. The number of obstacles varies from 15 to 35.

In this scenario, three evaluation metrics are used.

Metric 1 (Average time): the average number of time steps needed to achieve all targets in one learning round. This metric is specific to the first and third settings: Static and Dynamic 2. Here, a learning round is a period during which the agents achieve all the targets.

Metric 2 (Average time for one target): the average number of time steps needed to achieve one target in one learning round. This metric is specific to the second setting: Dynamic 1. The reason for using different metrics in different settings is given at the end of this section.

Metric 3 (Average hit): the average number of obstacle hits in one learning round.

6.1.2 Scenario 2

This scenario is the multi-agent load-balancing problem [27]. Each agent represents a factory. Each factory has a backlog which stores several types of items. Let the number of item types be $k$ and the maximum stock for each type $i$ be $M_i$. A state of an agent is the current number of items of each type in stock, so a state consists of $k$ dimensions.

At each time step, a random item from each agent is processed with a given probability $p_s$, and a new item arrives at each agent with another given probability $p_a$. The actions of an agent include whether or not to pass an arrived item to another agent. The reward of an agent $A$ is based on its backlog: $r_A=50-\sum_{1\leq i\leq k}w_i\cdot m_i$, where $w_i$ is the importance weight of items of type $i$ and $m_i$ is the current number of items of type $i$. In addition, if agent $A$ passes an item of type $i$ to agent $B$, agent $A$'s reward is reduced by $2\cdot w_i$ as a redistribution cost. The aim of the agents is to maximize their accumulated rewards.
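As a minimal sketch of this reward (our own illustration; the function name and arguments are hypothetical), the backlog-based reward with the redistribution cost can be computed as:

from typing import Sequence

def backlog_reward(weights: Sequence[float],
                   stock: Sequence[int],
                   passed_types: Sequence[int] = ()) -> float:
    """Reward of one agent in Scenario 2: r_A = 50 - sum_i w_i * m_i,
    reduced by 2 * w_i for every item of type i passed to another agent."""
    r = 50.0 - sum(w * m for w, m in zip(weights, stock))
    r -= sum(2.0 * weights[i] for i in passed_types)
    return r

# Example with k = 3 item types and weights 5, 4, 3 (as in the settings below):
print(backlog_reward([5, 4, 3], [1, 2, 0]))                    # 50 - 13 = 37.0
print(backlog_reward([5, 4, 3], [1, 2, 0], passed_types=[0]))  # 37 - 2*5 = 27.0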

The number of agents varies from 2 to 6. The number of item types varies from 2 to 6, where for two types, the weights are set to 5 and 4; for three types, the weights are set to 5, 4 and 3; and so on. The maximum stock for each type of item varies from 3 to 7. The probability of processing an item, $p_s$, varies from 0.2 to 0.6. The probability of an item arriving, $p_a$, is set to 0.4.

The evaluation metric is the average reward of all the agents in one learning round. Here, a learning round means a pre-defined number of learning iterations.

6.1.3 The methods for comparison

In this experiment, two methods are involved for comparison. The first method is a regular reinforcement learning method, denoted as RL, which does not include any advising. The RL method is used as a comparison baseline. The second method is from [2], denoted as SA-RL (simultaneous advising reinforcement learning). The SA-RL method is an Ad hoc TD advising method, where agents simultaneously and independently learn and share advice when needed. The SA-RL method is similar to our method. The differences are that 1) the SA-RL method allows agents to offer advice only in identical states; and 2) agents in the SA-RL method transfer recommended actions as advice. Moreover, the SA-RL method involves two parameters, $v_a$ and $v_g$, used to control the probability of asking for and giving advice, respectively. A higher $v_a$ value results in a lower probability of asking for advice, while a higher $v_g$ value results in a higher probability of giving advice. The SA-RL method is used to demonstrate the effectiveness of the differential privacy technique on agent advising.
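The exact probability expressions belong to the SA-RL method in [2] and are not restated here. As a hedged sketch only, the snippet below assumes a visit-count-based form that matches the qualitative behaviour described above (a higher $v_a$ lowers the asking probability; a higher $v_g$ raises the giving probability); it is an assumption for illustration, not the definitive SA-RL implementation.

import math

def prob_ask(v_a: float, visits: int) -> float:
    # Assumed visit-count-based form: higher v_a or more visits -> lower asking probability.
    return (1.0 + v_a) ** (-math.sqrt(visits))

def prob_give(v_g: float, visits: int) -> float:
    # Assumed visit-count-based form: higher v_g or more visits -> higher giving probability.
    return 1.0 - (1.0 + v_g) ** (-math.sqrt(visits))

# With v_a = 0.4 and v_g = 0.9 (the values used in the experiments below):
print(round(prob_ask(0.4, visits=4), 3))     # ~0.510: a rarely visited state still triggers asking
print(round(prob_give(0.9, visits=100), 3))  # ~0.998: a well-visited adviser almost always gives advice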

The values of the parameters used in the experiments are set as follows: $\alpha=0.2$, $\gamma=0.8$, $\zeta=0.1$, $\epsilon=1$, $v_a=0.4$ and $v_g=0.9$. The values of these parameters are experimentally chosen to yield good results. Moreover, $\Delta Q=3$ in Scenario 1, while $\Delta Q=1$ in Scenario 2. The experimental results are obtained by averaging the results of 500 runs.
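For reproducibility, these settings can be gathered in a single configuration, as in the hypothetical sketch below; the dictionary and key names are our own, and the comments on $\alpha$ and $\gamma$ follow standard reinforcement learning notation rather than a definition given in this section.

PARAMS = {
    "alpha": 0.2,     # learning rate (alpha_1 in the convergence analysis)
    "gamma": 0.8,     # discount factor (standard RL notation)
    "zeta": 0.1,      # zeta parameter, defined earlier in the paper
    "epsilon": 1.0,   # epsilon parameter of the differential advising mechanism
    "v_a": 0.4,       # SA-RL: controls the probability of asking for advice
    "v_g": 0.9,       # SA-RL: controls the probability of giving advice
    "delta_Q": {"scenario_1": 3, "scenario_2": 1},
    "runs": 500,      # results are averaged over 500 runs
}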

6.2 Experimental results of scenario 1

6.2.1 The first setting: agents achieve targets

Figure 3: Performance of the three methods in different sizes of environments in the first setting. (a) Average time steps; (b) Average hits.

Fig. 3 demonstrates the performance of the three methods in different sizes of environments in the first setting. It can be seen that as the environmental size increases, in all three methods, agents take more steps to achieve targets and hit more obstacles. This is because with the increase of size, there are more obstacles in the environment, which inevitably makes it more difficult for agents to achieve targets.

In Fig. 3(a), when the environmental size is smaller than 18×12, the performance difference among the three methods is small. However, when the environmental size is larger than 18×12, the difference in average time steps grows to up to 20% between DA-RL and RL, and 10% between DA-RL and SA-RL, respectively. In addition, in Fig. 3(b), the difference in average hits grows to up to 25% between DA-RL and RL, and 15% between DA-RL and SA-RL, respectively. This is because when the environmental size increases, the number of targets and obstacles also increases. Thus, the number of states of each agent may increase as well. In this situation, since DA-RL and SA-RL enable agent advising, their performance is better than that of RL. Moreover, as DA-RL adopts the differential privacy technique, agents using DA-RL can take more advice than agents using SA-RL. Thus, the performance of DA-RL is better than that of SA-RL.

This result shows that by using agent advising, agents can ask for advice about the positions of obstacles and targets. Thus, agents’ learning performance can be improved. Moreover, in the DA-RL method, an agent’s advice, created in one state, can also be used by other agents in neighboring states, which further improves those agents’ learning performance.

Figure 4: Performance of the three methods as time progresses in the first setting. (a) Average time steps; (b) Average hits.

Fig. 4 shows the performance of the three methods as time progresses in the first setting. The size of the environment is set to 18×12; the number of agents is set to 4; the number of targets is set to 40; and the number of obstacles is set to 25. In Fig. 4(a), the DA-RL and SA-RL methods continuously outperform the RL method. Moreover, the DA-RL and SA-RL methods converge faster than the RL method. This can be explained by the fact that agents using the DA-RL and SA-RL methods can ask for advice between each other, which can significantly improve the learning performance, especially in early stages. In Fig. 4(b), the average number of hits is almost the same among the three methods in late stages. This is because the positions of obstacles are fixed and agents using these methods have the learning ability. As time progresses, the positions of obstacles are gradually memorized by the agents. Thus, the number of hits decreases and finally converges to a stable point.

6.2.2 The second setting: agents achieve dynamically generated targets

Figure 5: Performance of the three methods in different sizes of environments in the second setting. (a) Average time steps; (b) Average hits.

Fig. 5 demonstrates the performance of the three methods in different sizes of environments in the second setting. It can be seen that the proposed DA-RL method outperforms the other two methods.

Compared to the first setting (Fig. 3(a)), all three methods need more time steps to achieve targets in the second setting (Fig. 5(a)). This is because in the second setting, new targets are dynamically introduced, and achieving these new targets inevitably takes more time steps.

The average number of hits in the three methods in the second setting (Fig. 5(b)) is almost the same as that in the first setting (Fig. 3(b)). In both the first and second settings, the number and positions of obstacles are fixed. Therefore, the positions of obstacles can be learned by the agents, which can then avoid the obstacles to some extent.

Figure 6: Performance of the three methods as time progresses in the second setting. (a) Average time steps; (b) Average hits.

Fig. 6 shows the performance of the three methods as time progresses in the second setting. The size of the environment is set to 18×12; the number of agents is set to 4; the number of targets is set to 40; and the number of obstacles is set to 25. In Fig. 6(a), as time progresses, the average number of time steps for one target in the three methods decreases with fluctuations. The fluctuations are caused by the introduction of new targets. The introduction of new targets implies that new knowledge is introduced into the environment. Thus, agents need time to learn the new knowledge, which results in temporary increases in the number of time steps. In Fig. 6(b), as time progresses, the average number of hits gradually decreases in the three methods, because the positions of obstacles can be learned and thus the obstacles can be avoided.

6.2.3 The third setting: agents achieve dynamically moving targets

Figure 7: Performance of the three methods in different sizes of environments in the third setting. (a) Average time steps; (b) Average hits.

Fig. 7 demonstrates the performance of the three methods in different sizes of environments in the third setting. Again, the proposed DA-RL method outperforms the other two methods. Comparing Fig. 7 and Fig. 5, the performance trend of the three methods in the third setting is similar to that in the second setting. This is because both the second and third settings are dynamic, which means that agents need more time steps to achieve targets than in the first setting.

Figure 8: Performance of the three methods as time progresses in the third setting. (a) Average time steps; (b) Average hits.

Fig. 8 shows the performance of the three methods as time progresses in the third setting. The size of the environment is set to 18×12; the number of agents is set to 4; the number of targets is set to 40; and the number of obstacles is set to 25. Unlike the second setting (Fig. 6(a)), there is no fluctuation in the three methods in the third setting (Fig. 8(a)). This is because in the third setting, although targets are moving, the number of targets does not increase. For example, suppose there are 3 targets in an environment. In the third setting, the experiment finishes when all 3 targets are achieved. In the second setting, however, new targets may be introduced, and the experiment does not finish until the 3 original targets and all the new targets are achieved. If new targets are generated constantly, the experiment is hard to finish. This means that in the second setting, the completion of an experiment heavily depends on the generation probability of new targets. Hence, there is more fluctuation in the second setting than in the third setting.

6.3 Experimental results of scenario 2

Figure 9: Performance of the three methods. (a) Average reward with different numbers of agents; (b) Average reward with different numbers of item types; (c) Average reward with different sizes of stock; (d) Average reward with different processing probabilities.

Fig. 9(a) demonstrates the performance of the three methods with different numbers of agents, where the number of item types is set to 3, the maximum stock of each item type is set to 5, and the probability of processing an item is set to 0.5. In Fig. 9(a), with the increase of the number of agents, the average reward of agents in the three methods also rises. When the number of agents increases, each agent can ask for advice from more agents. Thus, agents can learn faster and receive more reward. However, it should also be noted that when the number of agents is larger than 5, the average reward remains almost steady. This can be explained by the fact that in a given environment, the amount of knowledge is limited. Agents can share knowledge but cannot create new knowledge. When the number of agents is large enough to guarantee a good learning speed, further increasing the number of agents does not improve agents' learning speed or increase agents' reward.

Fig. 9(b) shows the performance of the three methods with different numbers of item types, where the number of agents is set to 4, the maximum stock of each item type is set to 5, and the probability of processing an item is set to 0.5. In Fig. 9(b), with the increase of the number of item types, the average reward of agents in the three methods reduces gradually. When the number of item types increases, the dimension of each agent's states also increases. This decreases the learning speed of the agents, which reduces their average reward, as agents have more opportunity to make non-optimal decisions.

Fig. 9(c) displays the performance of the three methods with different maximum stocks, where the number of agents is set to 4, the number of item types is set to 3, and the probability of processing an item is set to 0.5. In Fig. 9(c), with the increase of the maximum stock, the average reward of agents in the three methods increases gradually. When the stock capacity of agents increases, agents prefer to store new items instead of redistributing them. This preference increases agents' average reward, because, based on the experimental setting, storing an item yields more reward than redistributing it.

Fig. 9(d) exhibits the performance of the three methods with different probabilities of processing an item, where the number of agents is set to 4, the number of item types is set to 4, and the maximum stock of each item type is set to 5. In Fig. 9(d), with the increase of the processing probability, the average reward of agents in the three methods rises. When the processing probability increases, items are processed faster and hence, the stock amount of each agent reduces. This improves the average reward of agents, as the reward is based on the stock amount in the experimental setting. However, it should also be noted that when the processing probability is too large, e.g., larger than 0.6, the difference between the three methods shrinks. As mentioned before, a large processing probability implies a small stock amount. Since the state of an agent is based on the stock amount, i.e., the number of items of each item type, a small stock amount means a small number of states. Because each agent observes only a small number of states, after a short learning process, each agent is confident in these states and does not ask for advice any more. If agents do not ask for advice, the DA-RL and SA-RL methods are the same as the RL method. Hence, they have similar results.

In all four situations, the proposed DA-RL method outperforms the SA-RL and RL methods, though the performance of the three methods has a similar trend. According to the experimental results, the advantage of differential advising has been empirically demonstrated.

Figure 10: Performance of the three methods as learning progresses. (a) Average reward as learning progresses in a simple environment; (b) Average reward as learning progresses in a complex environment.

Fig. 10 demonstrates the performance of the three methods as learning progresses. Fig. 10(a) shows the results in a simple environment. The number of agents is set to 3, the number of item types is set to 2, the maximum stock of each item type is set to 3, and the probability of processing an item is set to 0.5. Fig. 10(b) shows the results in a complex environment. The number of agents is set to 3, the number of item types is set to 6, the maximum stock of each item type is set to 7, and the probability of processing an item is set to 0.5.

In Fig. 10(a), the three methods converge in very early stages (around 200 learning iterations). In comparison, in Fig. 10(b), the three methods converge much more slowly: DA-RL converges at around 500 learning iterations, SA-RL at around 600 iterations, and RL at around 900 iterations. This can be explained by the fact that the number of states in the complex environment is much larger than that in the simple environment. As more states typically imply more knowledge, agents have to take more learning iterations to learn that knowledge. Therefore, the three methods converge more slowly in the complex environment than in the simple environment. However, in the complex environment, the proposed DA-RL method converges faster than the SA-RL and RL methods. This result demonstrates that by using differential advising, agents in the proposed DA-RL method learn faster than in the other two methods.

6.4 Discussion and summary

6.4.1 The metric used in the second setting of the first scenario

In the second setting of the first scenario (Fig. 6(a)), we use the metric of the average number of time steps for one target, instead of the metric of the average number of time steps for all targets, which is used in the first and third settings.

The average number of time steps for all targets in the second setting does not converge as time progresses in the three methods. This is because agents in the three methods have learning ability. The aim of learning is to maximize agents' long-term accumulated rewards. In the experiments, an agent can receive a positive reward only when it achieves a target. In the second setting, new targets are introduced dynamically. This motivates agents to stay in the experiment to keep achieving targets so as to increase their accumulated rewards. Therefore, the average number of time steps for achieving all the targets in the second setting increases gradually. However, this does not mean that the three methods do not converge in other respects in the second setting. Although the number of time steps in the three methods increases, the number of achieved targets also increases. Since the average number of time steps for one target is the ratio between the total number of time steps and the total number of achieved targets, the increased number of achieved targets offsets the increased number of time steps. Thus, the average number of time steps for one target still converges (Fig. 6(a)).
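As a tiny numerical illustration of this ratio argument (the running totals below are made-up numbers, not experimental data), the per-target metric stabilises even though both totals keep growing:

# Average time for one target = cumulative time steps / cumulative achieved targets.
total_steps   = [100, 220, 330, 450, 560]   # hypothetical cumulative time steps
total_targets = [10, 21, 32, 44, 55]        # hypothetical cumulative achieved targets

for steps, targets in zip(total_steps, total_targets):
    print(f"{steps:4d} steps / {targets:3d} targets = {steps / targets:.2f} steps per target")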

6.4.2 Performance of the three methods

According to the experimental results in the first scenario, the average number of time steps used by the DA-RL method is smaller than that of the other two methods in all three settings: about 15% to 20% and 40% fewer than the SA-RL and RL methods, respectively. Moreover, DA-RL converges about 10% and 20% faster than the SA-RL and RL methods, respectively. In the second scenario, the average reward of agents in the DA-RL method is about 10% to 15% and 20% to 30% higher than in the SA-RL and RL methods, respectively. In addition, the DA-RL method converges about 15% and 40% faster than the SA-RL and RL methods, respectively. In summary, the proposed DA-RL method outperforms the other two methods in both scenarios.

7 Conclusion and Future Work

This paper proposed a differential advising method for multi-agent reinforcement learning. The method is the first to allow agents in one state to use advice created in a different state. It is also the first to take advantage of the differential privacy technique for agent advising to improve learning performance instead of preserving privacy. Experimental results demonstrate that our method outperforms the benchmark methods in various aspects.

In this paper, we assume that in two similar states, the applicability and adequacy of the actions to take are similar. This assumption, however, does not hold in some situations. For example, a good action to take on a rainy day may be a bad action on a sunny day. In the future, we will relax this assumption by assigning weights to the dimensions of states. Moreover, our method mainly focuses on discrete states. It would be interesting to extend our method to continuous environments; an intuitive way to do so is to discretize continuous states into discrete states. Finally, extending our method to multi-agent deep reinforcement learning is also interesting future work. A potential way is to allow agents to transfer experience samples as advice. However, if the number of transferred samples is large, how to select high-quality samples is a challenge for the adviser agent.

References

  • [1] F. L. D. Silva, R. Glatt, and A. H. R. Costa, “MOO-MDP: An Object-Oriented Representation for Cooperative Multiagent Reinforcement Learning,” IEEE Trans. on Cybernetics, vol. 49, pp. 567–579, 2019.
  • [2] F. L. D. Silva, R. Glatt, and A. H. R. Costa, “Simultaneously Learning and Advising in Multiagent Reinforcement Learning,” in Proc. of AAMAS, Brazil, May 2017, pp. 1100–1108.
  • [3] F. L. D. Silva, M. E. Taylor, and A. H. R. Costa, “Autonomously Reusing Knowledge in Multiagent Reinforcement Learning,” in Proc. of IJCAI 2018, Sweden, July 2018, pp. 5487–5493.
  • [4] F. L. D. Silva, “Integrating Agent Advice and Previous Task Solutions in Multiagent Reinforcement Learning,” in AAMAS, 2019, pp. 2447–2448.
  • [5] F. L. D. Silva and A. H. R. Costa, “A Survey on Transfer Learning for Multiagent Reinforcement Learning Systems,” Journal of Artificial Intelligence Research, vol. 64, pp. 645–703, 2019.
  • [6] D. Ye, T. Zhu, W. Zhou, and P. S. Yu, “Differentially Private Malicious Agent Avoidance in Multiagent Advising Learning,” IEEE Transactions on Cybernetics, vol. 50, no. 10, pp. 4214–4227, 2020.
  • [7] C. Dwork, “Differential privacy,” in Proc. of ICALP, 2006, pp. 1–12.
  • [8] T. Zhu, G. Li, W. Zhou, and P. S. Yu, “Differentially private data publishing and analysis: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1619–1638, 2017.
  • [9] L. Torrey and M. E. Taylor, “Teaching on A Budget: Agents Advising Agents in Reinforcement Learning,” in AAMAS, 2013, pp. 1053–1060.
  • [10] O. Amir, E. Kamar, A. Kolobov, and B. Grosz, “Interactive Teaching Strategies for Agent Training,” in Proc. of IJCAI, 2016, pp. 804–811.
  • [11] Y. Wang, W. Lu, J. Hao, J. Wei, and H. Leung, “Efficient Convention Emergence through Decoupled Reinforcement Social Learning with Teacher-Student Mechanism,” in Proc. of AAMAS, 2018, pp. 795–803.
  • [12] S. Omidshafiei, D. Kim, M. Liu, G. Tesauro, M. Riemer, C. Amato, M. Campbell, and J. P. How, “Learning to Teach in Cooperative Multiagent Reinforcement Learning,” in AAAI, 2019, pp. 6128–6136.
  • [13] C. Zhu, H. Leung, S. Hu, and Y. Cai, “A Q-values Sharing Framework for Multiple Independent Q-learners,” in AAMAS, 2019, pp. 2324–2326.
  • [14] E. Ilhan, J. Gow, and D. Perez-Liebana, “Teaching on a Budget in Multi-Agent Deep Reinforcement Learning,” in Proc. of IEEE Conference on Games, 2019.
  • [15] A. Fachantidis, M. E. Taylor, and I. Vlahavas, “Learning to Teach Reinforcement Learning Agents,” ML and Knowledge Extraction, vol. 1, pp. 21–42, 2018.
  • [16] V. Gupta, D. Anand, P. Paruchuri, and B. Ravindran, “Advice Replay Approach for Richer Knowledge Transfer in Teacher Student Framework,” in Proc. of AAMAS, 2019, pp. 1997–1999.
  • [17] L. Vanhee, L. Jeanpierre, and A.-I. Mouaddib, “Augmenting Markov Decision Processes with Advising,” in AAAI, 2019, pp. 2531–2538.
  • [18] P. S. Castro, “Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes,” in AAAI, 2020, pp. 10 069–10 076.
  • [19] C. J. Watkins, “Q-Learning,” Machine Learning, 1992.
  • [20] T. Zhu, D. Ye, W. Wang, W. Zhou, and P. S. Yu, “More Than Privacy: Applying Differential Privacy in Key Areas of Artificial Intelligence,” IEEE Transactions on Knowledge and Data Engineering, p. DOI: 10.1109/TKDE.2020.3014246, 2020.
  • [21] D. Ye, T. Zhu, S. Shen, W. Zhou, and P. S. Yu, “Differentially Private Multi-Agent Planning for Logistic-like Problems,” IEEE Transactions on Dependable and Secure Computing, p. DOI: 10.1109/TDSC.2020.3017497, 2020.
  • [22] D. Ye, T. Zhu, S. Shen, and W. Zhou, “A Differentially Private Game Theoretic Approach for Deceiving Cyber Adversaries,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 569–584, 2021.
  • [23] C. Dwork and A. Roth, “The Algorithmic Foundations of Differential Privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
  • [24] S. Kasiviswanathan, H. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?” in Proc. of Annual IEEE Symposium on Foundations of Computer Science, 2008, pp. 531–540.
  • [25] H. Anton, I. Bivens, and S. Davis, Calculus.   Wiley, 2012.
  • [26] Y. Wang and C. W. D. Silva, “A Machine-Learning Approach to Multi-Robot Coordination,” Engineering Applications of Artificial Intelligence, vol. 21, no. 3, pp. 470–484, 2008.
  • [27] J. Sakuma, S. Kobayashi, and R. N. Wright, “Privacy-Preserving Reinforcement Learning,” in Proc. of ICML, 2008.