
Multi-agent reinforcement learning using echo-state network and its application to pedestrian dynamics

Hisato Komatsu Data Science and AI Innovation Research Promotion Center, Shiga University, 522-8522, Shiga, Japan [email protected]
Abstract

In recent years, simulations of pedestrians using multi-agent reinforcement learning (MARL) have been studied. This study considered roads in a grid-world environment and implemented pedestrians as MARL agents using an echo-state network and the least squares policy iteration method. Under this environment, the ability of these agents to learn to move forward while avoiding other agents was investigated. Specifically, we considered two types of tasks: the choice between a narrow direct route and a broad detour, and the bidirectional pedestrian flow in a corridor. The simulation results indicated that the learning was successful when the density of the agents was not too high.

1 Introduction

Motions of groups of people or animals have been studied in many fields, such as transportation engineering and active matter physics. Comprehensively understanding such motions through experiments and observations alone is challenging; thus, several studies have conducted computer simulations for a better understanding. Traditionally, animals (including humans) are assumed to obey certain mathematical rules in these simulations [1, 2, 3]. However, with the recent development of machine learning, simulation methods that model animals as reinforcement learning (RL) agents have been proposed [4, 5, 6, 7, 8]. RL in an environment where several agents exist is referred to as multi-agent reinforcement learning (MARL), and it has been studied intensively to realize competition or cooperation between agents.

Currently, deep learning is usually used to implement RL agents because it outperforms conventional machine learning methods [9, 10, 11]. However, training deep networks incurs a high computational cost even in the case of single-agent RL. Thus, algorithms with lower computational costs are useful, particularly when a large number of agents is considered.

This study proposed an RL algorithm using an echo-state network (ESN). ESN is a type of reservoir computing that uses a recurrent neural network (RNN) and trains only the output weight matrix. This restriction reduces the computational cost compared with deep learning, which trains all parameters of the neural network. Several methods have been proposed to implement RL using ESN [12, 13, 14, 15, 16]. Among these, this study adopted a method similar to, but simpler than, that of Ref. [15]. As a specific task, we considered pedestrians proceeding along a road. All pedestrians were RL agents that received a positive (negative) reward when they travelled in the same (opposite) direction as their targeted direction. Although the task itself was simple, other agents acted as obstacles blocking each agent's path; hence, the problem increased in complexity as the number of agents increased. We investigated whether the agents could proceed in their desired direction while avoiding other agents.

The remainder of this paper is organized as follows: Section 2 reviews related previous studies, Section 3 proposes our method, Section 4 presents the results of numerical simulations, and Section 5 summarizes the study.

2 Related work

2.1 Multi-agent reinforcement learning (MARL)

In environments with several agents, RL agents are required to cooperate or compete with other agents. Such environments appear in various fields; thus, MARL has been studied intensively [17]. In particular, deep reinforcement learning (DRL) methods have been applied to MARL, as well as to single-agent RL [18, 19, 20].

The simplest way to implement MARL is to let the agents learn their policies independently. However, independent learners cannot access the experiences of other agents and treat other agents as part of the environment. This results in non-stationarity of the environment from the perspective of each agent. To address this problem, many algorithms have been proposed. For example, the multi-agent deep deterministic policy gradient (MADDPG) algorithm adopts the actor-critic method and allows the critic to observe all agents' observations, actions, and target policies [27]. This algorithm was improved and applied to pedestrian dynamics by Zheng and Liu [6]. Gupta et al. introduced a parameter sharing method wherein agents share the parameters of their neural networks [22]. Their method exhibited high learning efficiency in environments with homogeneous agents.

This study adopted the parameter sharing method because it can easily be extended to a form suitable for ESN. Specifically, we considered one or two groups of agents that shared their experiences and policies but received rewards separately.

2.2 Echo-state network (ESN)

As mentioned earlier, ESN is a type of reservoir computing that uses an RNN [23, 24, 25]:

$\tilde{\boldsymbol{X}}^{\mathrm{res}}(t) = f\left(\boldsymbol{W}^{\mathrm{in}}_{s}\boldsymbol{X}^{\mathrm{in}}(t)+\boldsymbol{W}^{\mathrm{in}}_{b}+\boldsymbol{W}^{\mathrm{res}}\boldsymbol{X}^{\mathrm{res}}(t-1)\right)$,   (1)
$\boldsymbol{X}^{\mathrm{res}}(t) = \alpha\tilde{\boldsymbol{X}}^{\mathrm{res}}(t)+(1-\alpha)\boldsymbol{X}^{\mathrm{res}}(t-1)$,   (2)
$\boldsymbol{X}^{\mathrm{out}}(t) = \boldsymbol{W}^{\mathrm{out}}\begin{pmatrix}\boldsymbol{X}^{\mathrm{res}}(t)\\ 1\end{pmatrix}$,   (5)

where $\boldsymbol{X}^{\mathrm{in}}(t)$ and $\boldsymbol{X}^{\mathrm{out}}(t)$ are the input and output vectors at time $t$, respectively, $f$ is the activation function, and $\alpha$ is a constant called the leaking rate. In the case of ESN, the elements of the matrices $\boldsymbol{W}^{\mathrm{in}}_{s}$, $\boldsymbol{W}^{\mathrm{in}}_{b}$, and $\boldsymbol{W}^{\mathrm{res}}$ are randomly fixed, and only the output weight matrix $\boldsymbol{W}^{\mathrm{out}}$ is trained. Owing to this restriction of the trainable parameters, the computational cost is considerably lower than that of deep learning. In addition, the exploding/vanishing gradient problem does not occur, unlike in the training of an RNN by backpropagation. For example, in the case of supervised learning, training is typically conducted by ridge regression, which minimizes the mean square error between the output vector $\boldsymbol{X}^{\mathrm{out}}$ and the training data $\boldsymbol{X}^{\mathrm{train}}$. This calculation is considerably faster than the backpropagation used in deep learning.
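As a minimal illustration of Eqs. (1), (2), and (5), the following sketch (assuming NumPy; the function and variable names are ours, not the paper's, and $\boldsymbol{W}^{\mathrm{in}}_{b}$ is treated as a bias vector) updates a leaky reservoir state and fits the readout by ridge regression for the supervised case.

```python
import numpy as np

def esn_step(x_res, x_in, W_in_s, W_in_b, W_res, alpha, f=np.tanh):
    """One reservoir update following Eqs. (1)-(2)."""
    x_tilde = f(W_in_s @ x_in + W_in_b + W_res @ x_res)
    return alpha * x_tilde + (1.0 - alpha) * x_res

def ridge_readout(states, targets, beta=1e-4):
    """Fit W_out by ridge regression on collected reservoir states (Eq. (5)).
    states: (T, N_res) reservoir states, targets: (T, N_out) training data."""
    phi = np.hstack([states, np.ones((states.shape[0], 1))])  # append the bias "1"
    A = phi.T @ phi + beta * np.eye(phi.shape[1])
    B = phi.T @ targets
    return np.linalg.solve(A, B).T  # W_out has shape (N_out, N_res + 1)
```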

However, in the case of RL, the training data must be collected by the agents themselves. Hence, training is not as straightforward as in supervised learning. Many previous studies have proposed different algorithms to realize RL using ESN. For example, Szita et al. used the SARSA method [12], Oubbati et al. adopted ESN as the critic of an actor-critic method [13], Chang and Futagami used the covariance matrix adaptation evolution strategy (CMA-ES) [14], and Zhang et al. improved the least squares policy iteration (LSPI) method [15]. Previous studies have also applied ESN to MARL. For example, Chang et al. trained an ESN using a method similar to that of a deep Q-network (DQN) and applied it to the distribution of radio spectra among telecommunication equipment [16]. In this study, we adopted the LSPI method, as in Ref. [15]; however, we used a simpler algorithm. Note that the absence of the exploding/vanishing gradient problem in ESN stems from the fact that the weight matrices involved in this problem are fixed. Hence, although the training algorithms mentioned above, including LSPI, are more complex than those of supervised learning, this problem still does not appear.

Typically, the recurrent weight matrix $\boldsymbol{W}^{\mathrm{res}}$ is a sparse matrix with spectral radius $\rho<1$. To generate such a matrix, a random matrix $\boldsymbol{W}_{0}^{\mathrm{res}}$ is first created, whose components $W_{0,\mu\nu}^{\mathrm{res}}$ are expressed as:

$\left\{\begin{array}{ll} W_{0,\mu\nu}^{\mathrm{res}}=0 & \mathrm{with\ prob.}\ p_{s}^{\mathrm{res}}\\ W_{0,\mu\nu}^{\mathrm{res}}\sim\mathcal{P}^{\mathrm{res}} & \mathrm{with\ prob.}\ 1-p_{s}^{\mathrm{res}}\end{array}\right.$,   (6)

where $\mathcal{P}^{\mathrm{res}}$ is a predetermined probability distribution and $p_{s}^{\mathrm{res}}$ is a hyperparameter called the sparsity. Rescaling this matrix so that its spectral radius equals $\rho$, we obtain $\boldsymbol{W}^{\mathrm{res}}$:

$\boldsymbol{W}^{\mathrm{res}}=\frac{\rho}{\rho\left(\boldsymbol{W}_{0}^{\mathrm{res}}\right)}\boldsymbol{W}_{0}^{\mathrm{res}}$,   (7)

where $\rho\left(\boldsymbol{W}_{0}^{\mathrm{res}}\right)$ is the spectral radius of $\boldsymbol{W}_{0}^{\mathrm{res}}$. According to Ref. [24], the input weights $\boldsymbol{W}^{\mathrm{in}}_{s}$ and $\boldsymbol{W}^{\mathrm{in}}_{b}$ are usually generated as dense matrices. However, we also rendered them sparse, as explained later.
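A minimal sketch of this construction (Eqs. (6)-(7)), assuming NumPy and a Gaussian $\mathcal{P}^{\mathrm{res}}$ as used later in the paper (function and parameter names are ours):

```python
import numpy as np

def make_reservoir(n_res, sparsity, rho_target, sigma=1.0, rng=None):
    """Generate a sparse recurrent matrix and rescale its spectral radius (Eqs. (6)-(7))."""
    rng = np.random.default_rng() if rng is None else rng
    W0 = rng.normal(0.0, sigma, size=(n_res, n_res))
    mask = rng.random((n_res, n_res)) >= sparsity   # keep an entry with prob. 1 - p_s
    W0 *= mask
    rho0 = np.max(np.abs(np.linalg.eigvals(W0)))    # spectral radius of W0
    return (rho_target / rho0) * W0
```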

2.3 Least squares policy iteration (LSPI) method

The LSPI method is an RL algorithm proposed by Lagoudakis and Parr [26]. It approximates the state-action value function $Q(s,a)$ as a linear combination of given basis functions of $s$ and $a$, $\boldsymbol{\phi}(s,a)=\left(\phi_{1}(s,a),\dots,\phi_{N}(s,a)\right)^{T}$:

$Q(s,a)=\sum_{i=1}^{N}w_{i}\phi_{i}(s,a)=\boldsymbol{w}^{T}\boldsymbol{\phi}(s,a)$,   (8)

where $\boldsymbol{w}=(w_{1},\dots,w_{N})^{T}$ is the coefficient vector to be trained. Using this value, the action $a_{t}$ of the agent at each time step $t$ is decided by the $\epsilon$-greedy method:

$a_{t}=\left\{\begin{array}{ll}\mathrm{argmax}_{a}\,Q(s_{t},a) & \mathrm{with\ prob.}\ 1-\epsilon\\ \mathrm{randomly\ chosen\ action} & \mathrm{with\ prob.}\ \epsilon\end{array}\right.$,   (9)

where $s_{t}$ is the state at this step. Usually, $\epsilon$ is initially set to a relatively large value $\epsilon_{0}$ and gradually decreased.
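A minimal sketch of the $\epsilon$-greedy rule in Eq. (9), assuming the Q-values of all candidate actions have already been computed (our function name):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Eq. (9): pick argmax_a Q(s,a) with prob. 1-eps, a random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```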

To search for an appropriate policy, the LSPI method updates $\boldsymbol{w}$ iteratively. Let the $n$-th update be executed after the duration $\mathcal{E}_{n}$. Then the form of $\boldsymbol{w}$ after the $n_{l}$-th duration, $\mathcal{E}_{n_{l}}$, is expressed as:

$\boldsymbol{w}=A^{-1}B$,   (10)

where

$A = \sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{t\in\mathcal{E}_{n}}\boldsymbol{\phi}(s_{t},a_{t})\left(\boldsymbol{\phi}(s_{t},a_{t})-\gamma\boldsymbol{\phi}(s_{t+1},a_{t+1})\right)^{T}$,   (11)
$B = \sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{t\in\mathcal{E}_{n}}r_{t}\boldsymbol{\phi}(s_{t},a_{t})$,   (12)

$r_{t}$ is the reward at this step, and $\gamma\in(0,1)$ is the discount factor. We set $\boldsymbol{\phi}(s_{t+1},a_{t+1})=0$ if the episode ends at time $t$. The constant $\lambda\in(0,1)$ is a forgetting factor, introduced because old experiences collected under an old policy do not reflect the response of the environment under the present policy. This factor has also been adopted in previous studies such as Ref. [15].
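A minimal sketch of the LSPI update in Eqs. (10)-(12), assuming the transitions of one duration have been collected as arrays (variable names are ours, not the paper's):

```python
import numpy as np

def lspi_update(A, B, phi_t, phi_t1, rewards, lam, gamma):
    """Accumulate Eqs. (11)-(12) for one duration E_n and return w = A^{-1} B (Eq. (10)).
    phi_t, phi_t1: (T, N) feature vectors at t and t+1 (rows of phi_t1 are zero at episode ends),
    rewards: (T,) rewards r_t."""
    A = lam * A + phi_t.T @ (phi_t - gamma * phi_t1)  # Eq. (11) with the forgetting factor
    B = lam * B + phi_t.T @ rewards                   # Eq. (12)
    w = np.linalg.solve(A, B)                         # Eq. (10)
    return A, B, w
```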

3 Proposed method and settings of simulations

3.1 Application of ESN to the LSPI method

In this study, we considered a partially observable Markov decision process (POMDP), a process wherein each agent observes only a part of the state. As the input of the $i$-th agent, we set its observation $\boldsymbol{o}_{i;t}$, the candidate action $a$, and a bias term:

$\tilde{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a) = f\left(\boldsymbol{W}^{\mathrm{in}}_{o}\boldsymbol{o}_{i;t}+\boldsymbol{W}^{\mathrm{in}}_{a}\boldsymbol{\eta}_{a}+\boldsymbol{W}^{\mathrm{in}}_{b}+\boldsymbol{W}^{\mathrm{res}}\boldsymbol{X}_{i}^{\mathrm{res}}(t-1)\right)$,   (13)
$\bar{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a) = \alpha\tilde{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a)+(1-\alpha)\boldsymbol{X}_{i}^{\mathrm{res}}(t-1)$,   (14)
$Q_{i}(\boldsymbol{o}_{i;t},a) = \boldsymbol{W}_{i}^{\mathrm{out}}\begin{pmatrix}\bar{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a)\\ 1\end{pmatrix}$,   (17)

where $\boldsymbol{\eta}_{a}$ is the one-hot expression of $a$ and the activation function $f$ is ReLU. We denote the number of neurons of the reservoir as $N^{\mathrm{res}}$, i.e., $N^{\mathrm{res}}\equiv\mathrm{dim}\,\boldsymbol{X}_{i}^{\mathrm{res}}$. The initial condition of the reservoir was set as $\boldsymbol{X}_{i}^{\mathrm{res}}(-1)=0$. We assumed that all agents shared the common input and reservoir matrices $\boldsymbol{W}^{\mathrm{in}}_{o}$, $\boldsymbol{W}^{\mathrm{in}}_{a}$, $\boldsymbol{W}^{\mathrm{in}}_{b}$, and $\boldsymbol{W}^{\mathrm{res}}$. For later convenience, the parts of the input matrix, $\boldsymbol{W}^{\mathrm{in}}_{o}$, $\boldsymbol{W}^{\mathrm{in}}_{a}$, and $\boldsymbol{W}^{\mathrm{in}}_{b}$, are denoted separately. The action $a_{t}$ was decided by the $\epsilon$-greedy method, Eq. (9), under this $Q_{i}(\boldsymbol{o}_{i;t},a)$. Here, $Q_{i}$ should be calculated for all possible candidate actions $a$. Using the accepted action, $\boldsymbol{X}_{i}^{\mathrm{res}}$ is updated as follows:

$\boldsymbol{X}_{i}^{\mathrm{res}}(t)=\bar{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a_{t})$.   (18)

In the actual calculation of Eq. (13), it is convenient to compute $\tilde{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a)$ for all candidate actions $a$ at once using the following relation:

$\left(\tilde{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a=1),\cdots,\tilde{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a=|\mathcal{A}|)\right) = f\left(\left(\boldsymbol{X}_{t}^{\mathrm{in}}+\boldsymbol{W}^{\mathrm{in}}_{a}\boldsymbol{\eta}_{a=1},\cdots,\boldsymbol{X}_{t}^{\mathrm{in}}+\boldsymbol{W}^{\mathrm{in}}_{a}\boldsymbol{\eta}_{a=|\mathcal{A}|}\right)\right) = f\left(\mathrm{rep}_{|\mathcal{A}|}\left(\boldsymbol{X}_{t}^{\mathrm{in}}\right)+\boldsymbol{W}^{\mathrm{in}}_{a}\right)$,   (19)
$\mathrm{where}\quad \boldsymbol{X}_{t}^{\mathrm{in}}\equiv\boldsymbol{W}^{\mathrm{in}}_{o}\boldsymbol{o}_{i;t}+\boldsymbol{W}^{\mathrm{in}}_{b}+\boldsymbol{W}^{\mathrm{res}}\boldsymbol{X}_{i}^{\mathrm{res}}(t-1)$.   (20)

Here, $|\mathcal{A}|$ is the number of candidate actions, and $\mathrm{rep}_{|\mathcal{A}|}\left(\boldsymbol{X}_{t}^{\mathrm{in}}\right)$ is a matrix constructed by repeating the column vector $\boldsymbol{X}_{t}^{\mathrm{in}}$ $|\mathcal{A}|$ times. To derive the last line of Eq. (19), recall that $\boldsymbol{\eta}_{a}$ is a one-hot expression:

$\boldsymbol{W}^{\mathrm{in}}_{a}\left(\boldsymbol{\eta}_{a=1},\cdots,\boldsymbol{\eta}_{a=|\mathcal{A}|}\right)=\boldsymbol{W}^{\mathrm{in}}_{a}\begin{pmatrix}1&0&\cdots&0\\ 0&1&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&1\end{pmatrix}=\boldsymbol{W}^{\mathrm{in}}_{a}$.   (21)
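A minimal sketch of Eqs. (13)-(21), assuming NumPy; it evaluates the candidate reservoir states and the Q-values of all actions in one matrix operation (the function and variable names are ours):

```python
import numpy as np

def q_values_all_actions(x_res_prev, obs, W_in_o, W_in_a, W_in_b, W_res, W_out, alpha):
    """Return Q(o_t, a) for all actions and the candidate reservoir states (Eqs. (13)-(19))."""
    relu = lambda z: np.maximum(z, 0.0)
    x_in = W_in_o @ obs + W_in_b + W_res @ x_res_prev                # Eq. (20)
    x_tilde = relu(x_in[:, None] + W_in_a)                           # Eq. (19): one column per action
    x_bar = alpha * x_tilde + (1.0 - alpha) * x_res_prev[:, None]    # Eq. (14)
    q = W_out @ np.vstack([x_bar, np.ones((1, x_bar.shape[1]))])     # Eq. (17)
    return q.ravel(), x_bar   # after choosing a_t, set x_res = x_bar[:, a_t] (Eq. (18))
```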

Comparing Eqs. (8), (17), and (18), we can apply the LSPI method by substituting $\boldsymbol{w}^{T}=\boldsymbol{W}_{i}^{\mathrm{out}}$ and $\boldsymbol{\phi}=\left(\boldsymbol{X}_{i}^{\mathrm{res}}(t)^{T},1\right)^{T}$. Thus, taking the transpose of Eq. (10), $\boldsymbol{W}_{i}^{\mathrm{out}}$ can be calculated by the following relation:

$\boldsymbol{W}_{i}^{\mathrm{out}}=\tilde{B}_{G_{i}}\tilde{A}_{G_{i}}^{-1}$,   (22)

where

$\tilde{A}_{G_{i}} = \sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{j\in G_{i}}\sum_{t\in\mathcal{E}_{n}}\left(\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t)-\gamma\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t+1)\right)\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t)^{T}+\lambda^{n_{l}-1}\beta I$,   (23)
$\tilde{B}_{G_{i}} = \sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{j\in G_{i}}\sum_{t\in\mathcal{E}_{n}}r_{j;t}\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t)^{T}$,   (24)

and

$\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t)\equiv\begin{pmatrix}\boldsymbol{X}_{j}^{\mathrm{res}}(t)\\ 1\end{pmatrix}$.   (25)

Here, we divided the agents into one or several groups and let all members of the same group share their experiences when calculating $\boldsymbol{W}_{i}^{\mathrm{out}}$. In Eqs. (22), (23), and (24), $G_{i}$ indicates the group to which the $i$-th agent belongs. In Eq. (23), we set $\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t+1)=0$ if the episode ended at time $t$, as in Section 2.3. The second term on the right-hand side of Eq. (23) was introduced to avoid the divergence of the inverse matrix in Eq. (22); the constant $\beta$ is a small hyperparameter, and $I$ is the identity matrix. As the duration $\mathcal{E}_{n}$, we adopted one episode in this study. In other words, the output weight matrix $\boldsymbol{W}_{i}^{\mathrm{out}}$ was updated at the end of each episode. Agents of the same group shared a common output weight matrix through Eq. (22). Considering that all agents had the same input and reservoir matrices, as explained above, agents of the same group shared all parameters of their neural networks. Thus, this is a form of parameter sharing suitable for ESN. As mentioned in Section 2.2, Ref. [15] also applied the LSPI method to train ESN. They calculated the output weight matrix using the recursive least squares (RLS) method to avoid calculating the inverse matrix in Eq. (22), which incurs a computational cost of $O\left((N^{\mathrm{res}})^{3}\right)$. However, we did not adopt the RLS method because it requires an additional approximation called the mean-value approximation. In addition, when each duration $\mathcal{E}_{n}$ is long and the parameter sharing method is introduced, the effect of this computational cost on the total computation time is limited.

In the actual calculation, we constructed the matrices

$Y_{1}\equiv\begin{pmatrix}
\left(\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(0)-\gamma\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(1)\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(0)-\gamma\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(1)\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(t_{\mathrm{max}}-1)-\gamma\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(t_{\mathrm{max}})\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(t_{\mathrm{max}}-1)-\gamma\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(t_{\mathrm{max}})\right)^{T}\\
\left(\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(t_{\mathrm{max}})\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(t_{\mathrm{max}})\right)^{T}
\end{pmatrix},\qquad
Y_{2}\equiv\begin{pmatrix}
\left(\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(0)\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(0)\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(t_{\mathrm{max}}-1)\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(t_{\mathrm{max}}-1)\right)^{T}\\
\left(\hat{\boldsymbol{X}}_{j=1}^{\mathrm{res}}(t_{\mathrm{max}})\right)^{T}\\
\vdots\\
\left(\hat{\boldsymbol{X}}_{j=|G|}^{\mathrm{res}}(t_{\mathrm{max}})\right)^{T}
\end{pmatrix}$,   (47)

and

R(rj=1;t=0,,rj=|G|;t=0,,rj=|G|;t=tmax1,,rj=|G|;t=tmax1,0,,0)T,\displaystyle R\equiv\left(r_{j=1;t=0},\cdots,r_{j=|G|;t=0},\cdots,r_{j=|G|;t=t_{\mathrm{max}}-1},\cdots,r_{j=|G|;t=t_{\mathrm{max}}-1},0,\cdots,0\right)^{T},
(48)

and performed the operations

$\tilde{A}_{G} \leftarrow \tilde{A}_{G}+Y_{1}^{T}Y_{2}$,   (49)
$\tilde{B}_{G} \leftarrow \tilde{B}_{G}+R^{T}Y_{2}$,   (50)

at the end of each episode $\mathcal{E}_{n}$. Here, we used the fact that the reward after the end of the episode is zero, $r_{j;t_{\mathrm{max}}}=0$; however, the term $\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t_{\mathrm{max}})\neq 0$ should not be ignored. After this manipulation, we calculated $\boldsymbol{W}_{i}^{\mathrm{out}}$ using Eq. (22), multiplied the matrices $\tilde{A}_{G}$ and $\tilde{B}_{G}$ by the forgetting factor:

$\tilde{A}_{G} \leftarrow \lambda\tilde{A}_{G}$,   (51)
$\tilde{B}_{G} \leftarrow \lambda\tilde{B}_{G}$,   (52)

and updated the value of $\epsilon$ used for the $\epsilon$-greedy method as:

$\epsilon\leftarrow\delta_{\epsilon}\epsilon$,   (53)

if $\epsilon$ was larger than the predetermined threshold value $\epsilon_{\mathrm{min}}$. The training depends on the information of past states, actions, and rewards only through Eqs. (49) and (50). Thus, the matrices $Y_{1}$, $Y_{2}$, and $R$ can be deleted after $\tilde{A}_{G}$ and $\tilde{B}_{G}$ are updated using these equations. Note that, in principle, we can update $\tilde{A}_{G}$ and $\tilde{B}_{G}$ by adding the increments at every time step, even without preparing the matrices $Y_{1}$, $Y_{2}$, and $R$. However, the number of large matrix multiplications should be reduced to lower the computational cost, at least when the calculation is implemented in Python. The algorithm explained above is summarized in Algorithm 1.

Algorithm 1 LSPI algorithm used in this study
1: $\epsilon\leftarrow\epsilon_{0}$
2: for all $G$ : group do
3:     $\tilde{A}_{G}\leftarrow\beta I$, $\tilde{B}_{G}\leftarrow 0$
4: end for
5: for $n\leftarrow 1,...,n_{\mathrm{max}}$ do
6:     for $t\leftarrow 0,...,t_{\mathrm{max}}-1$ do
7:         decide each agent's action
8:         update the environment
9:         give agents rewards $r_{j;t}$
10:     end for
11:     for all $G$ : group do
12:         construct $Y_{1}$, $Y_{2}$, and $R$ from the episode data of group $G$
13:         $\tilde{A}_{G}\leftarrow\tilde{A}_{G}+Y_{1}^{T}Y_{2}$
14:         $\tilde{B}_{G}\leftarrow\tilde{B}_{G}+R^{T}Y_{2}$
15:         for all $j\in G$ do
16:             $\boldsymbol{W}_{j}^{\mathrm{out}}\leftarrow\tilde{B}_{G}\tilde{A}_{G}^{-1}$
17:         end for
18:         $\tilde{A}_{G}\leftarrow\lambda\tilde{A}_{G}$
19:         $\tilde{B}_{G}\leftarrow\lambda\tilde{B}_{G}$
20:     end for
21:     if $\epsilon>\epsilon_{\mathrm{min}}$ then
22:         $\epsilon\leftarrow\delta_{\epsilon}\epsilon$
23:     end if
24:     reset the environment
25: end for
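As a sketch of the episode-end update in Algorithm 1 (Eqs. (47)-(52) together with Eq. (22)), assuming the bias-appended reservoir states and rewards of one group have been stacked into arrays during the episode (variable names are ours, not the authors'):

```python
import numpy as np

def episode_update(A_G, B_G, X_hat, rewards, gamma, lam):
    """Episode-end update of one group (Eqs. (47)-(52)).
    X_hat: (t_max+1, |G|, N_res+1) bias-appended reservoir states for t = 0..t_max,
    rewards: (t_max, |G|) rewards r_{j;t} for t = 0..t_max-1."""
    n_agents = rewards.shape[1]
    dim = X_hat.shape[-1]
    # Rows for the transitions t = 0..t_max-1, followed by the terminal states at t_max
    Y1 = np.vstack([(X_hat[:-1] - gamma * X_hat[1:]).reshape(-1, dim),
                    X_hat[-1]])
    Y2 = X_hat.reshape(-1, dim)
    R = np.concatenate([rewards.reshape(-1), np.zeros(n_agents)])  # r_{j;t_max} = 0
    A_G += Y1.T @ Y2                       # Eq. (49)
    B_G += R @ Y2                          # Eq. (50)
    W_out = np.linalg.solve(A_G.T, B_G)    # Eq. (22): W_out = B_G A_G^{-1}, shared by the group
    A_G *= lam                             # Eq. (51)
    B_G *= lam                             # Eq. (52)
    return A_G, B_G, W_out
```

Before the first episode, $\tilde{A}_{G}$ would be initialized to $\beta I$ and $\tilde{B}_{G}$ to zero, as in Algorithm 1.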

3.2 Environment and observation of each agent

We constructed a grid-world environment wherein each cell could take one of three states: “vacant,” “wall,” and “occupied by an agent.” At each step, each agent chose one of the four adjacent cells (up, down, right, or left) and attempted to move into it. This move succeeded if and only if the candidate cell was vacant and no other agent attempted to move there. Each agent observed the $11\times 11$ cells centered on itself as 2-channel bitmap data. Here, agents, including the observer itself, were encoded as the pair of numbers $(1,0)$, walls as $(0,1)$, and vacant cells as $(0,0)$. Hence, the observation of the $i$-th agent, $\boldsymbol{o}_{i;t}$, was an $11\times 11\times 2=242$-dimensional vector. Although several previous studies have proposed methods that combine ESN with untrained convolutional neural networks for image recognition [14, 28, 29], we input the observation directly into the ESN. Instead, we adjusted the sparsity of the input weight matrix $\boldsymbol{W}^{\mathrm{in}}_{o}$ to reflect the spatial structure. Specifically, we first denoted the $3\times 3$, $7\times 7$, and $11\times 11$ cells centered on the agent as $\mathcal{A}_{1}$, $\mathcal{A}_{2}$, and $\mathcal{A}_{3}$, respectively, as in Fig. 1. Then, we changed the sparsity of the components of $\boldsymbol{W}^{\mathrm{in}}_{o}$ depending on which cells they were connected to. Namely, the components $W_{o,\mu\nu}^{\mathrm{in}}$ are generated as:

$\left\{\begin{array}{ll}W_{o,\mu\nu}^{\mathrm{in}}=0 & \mathrm{with\ prob.}\ p_{s}^{\mathrm{in}}(\nu)\\ W_{o,\mu\nu}^{\mathrm{in}}\sim\mathcal{P}^{\mathrm{in}}_{o} & \mathrm{with\ prob.}\ 1-p_{s}^{\mathrm{in}}(\nu)\end{array}\right.$,   (54)

and the sparsity $p_{s}^{\mathrm{in}}(\nu)$ is expressed as follows:

$p_{s}^{\mathrm{in}}(\nu)=\left\{\begin{array}{ll}p_{s1}^{\mathrm{in}} & \mathrm{if}\ \nu\in\mathcal{A}_{1}\\ p_{s2}^{\mathrm{in}} & \mathrm{if}\ \nu\in\mathcal{A}_{2}\backslash\mathcal{A}_{1}\\ p_{s3}^{\mathrm{in}} & \mathrm{if}\ \nu\in\mathcal{A}_{3}\backslash\mathcal{A}_{2}\end{array}\right.$,   (55)

where the hyperparameters $p_{s1}^{\mathrm{in}}$, $p_{s2}^{\mathrm{in}}$, and $p_{s3}^{\mathrm{in}}$ obey $p_{s1}^{\mathrm{in}}\leq p_{s2}^{\mathrm{in}}\leq p_{s3}^{\mathrm{in}}$. Thus, information on cells near the agent is connected to the reservoir more densely than that on cells far from the agent. Further, we let $\boldsymbol{W}^{\mathrm{in}}_{b}$ be a sparse matrix with sparsity $p_{sb}^{\mathrm{in}}$, but made $\boldsymbol{W}^{\mathrm{in}}_{a}$ dense. In addition, $\boldsymbol{W}^{\mathrm{res}}$ was generated as a sparse matrix with spectral radius $\rho<1$ by the process explained in Section 2.2. Here, $\mathcal{P}^{\mathrm{in}}_{o}$, $\mathcal{P}^{\mathrm{in}}_{a}$, $\mathcal{P}^{\mathrm{in}}_{b}$, and $\mathcal{P}^{\mathrm{res}}_{0}$, the distributions of the nonzero components of $\boldsymbol{W}^{\mathrm{in}}_{o}$, $\boldsymbol{W}^{\mathrm{in}}_{a}$, $\boldsymbol{W}^{\mathrm{in}}_{b}$, and $\boldsymbol{W}^{\mathrm{res}}_{0}$, were Gaussian distributions $\mathcal{N}\left(0,\left(\sigma^{\mathrm{in}}_{o}\right)^{2}\right)$, $\mathcal{N}\left(0,\left(\sigma^{\mathrm{in}}_{a}\right)^{2}\right)$, $\mathcal{N}\left(0,\left(\sigma^{\mathrm{in}}_{b}\right)^{2}\right)$, and $\mathcal{N}\left(0,\left(\sigma^{\mathrm{res}}_{0}\right)^{2}\right)$, respectively. The list of hyperparameters is presented in Table 1.
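A minimal sketch of Eqs. (54)-(55), assuming the 242 observation entries are ordered so that each column index can be mapped to its region; the mapping array region_of_column is hypothetical and would depend on the chosen flattening of the 11x11x2 bitmap (function and parameter names are ours):

```python
import numpy as np

def make_input_weight(n_res, region_of_column, sparsities, sigma=1.0, rng=None):
    """Generate W_o^in with column-dependent sparsity (Eqs. (54)-(55)).
    region_of_column: integer array of length 242 giving the region (0, 1, or 2) of each input entry,
    sparsities: (p_s1, p_s2, p_s3)."""
    rng = np.random.default_rng() if rng is None else rng
    n_in = len(region_of_column)
    W = rng.normal(0.0, sigma, size=(n_res, n_in))
    p_col = np.asarray(sparsities)[region_of_column]        # sparsity of each column
    mask = rng.random((n_res, n_in)) >= p_col[None, :]      # keep with prob. 1 - p_s^in(nu)
    return W * mask
```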

Figure 1: Eyesight of each agent. We denote the $3\times 3$, $7\times 7$, and $11\times 11$ cells centered on the agent itself (painted red in this figure) as $\mathcal{A}_{1}$, $\mathcal{A}_{2}$, and $\mathcal{A}_{3}$, respectively.
character | meaning | value
$N^{\mathrm{res}}$ | number of neurons of the reservoir | 1024
$\alpha$ | leaking rate | 0.8
$p_{s1}^{\mathrm{in}}$ | sparsity of $\boldsymbol{W}^{\mathrm{in}}_{o}$ corresponding to $\mathcal{A}_{1}$ | 0.6
$p_{s2}^{\mathrm{in}}$ | sparsity of $\boldsymbol{W}^{\mathrm{in}}_{o}$ corresponding to $\mathcal{A}_{2}\backslash\mathcal{A}_{1}$ | 0.8
$p_{s3}^{\mathrm{in}}$ | sparsity of $\boldsymbol{W}^{\mathrm{in}}_{o}$ corresponding to $\mathcal{A}_{3}\backslash\mathcal{A}_{2}$ | 0.9
$p_{sb}^{\mathrm{in}}$ | sparsity of $\boldsymbol{W}^{\mathrm{in}}_{b}$ | 0.9
$p_{s}^{\mathrm{res}}$ | sparsity of $\boldsymbol{W}^{\mathrm{res}}$ | 0.9
$\sigma^{\mathrm{in}}_{o}$ | standard deviation of $\mathcal{P}^{\mathrm{in}}_{o}$ | 1.0
$\sigma^{\mathrm{in}}_{a}$ | standard deviation of $\mathcal{P}^{\mathrm{in}}_{a}$ | 2.0
$\sigma^{\mathrm{in}}_{b}$ | standard deviation of $\mathcal{P}^{\mathrm{in}}_{b}$ | 1.0
$\sigma^{\mathrm{res}}_{0}$ | standard deviation of $\mathcal{P}^{\mathrm{res}}_{0}$ | 1.0
$\rho$ | spectral radius of $\boldsymbol{W}^{\mathrm{res}}$ | 0.95
$t_{\mathrm{max}}$ | number of time steps in one episode | 500
$\gamma$ | discount factor | 0.95
$\epsilon_{0}$ | initial value of $\epsilon$ | 1.0
$\delta_{\epsilon}$ | decay rate of $\epsilon$ | 0.95
$\epsilon_{\mathrm{min}}$ | threshold value below which $\epsilon$ stops decaying | 0.02
$\lambda$ | forgetting factor | 0.95
$\beta$ | coefficient used for the initial value of $\tilde{A}_{G}$ | $1\times 10^{-4}$
Table 1: Hyperparameters of this study

As the specific tasks, we considered the following two situations, varying the number of agents, $n_{\mathrm{agent}}$. Note that we did not execute any kind of pretraining in the simulations.

3.2.1 Task I: Choice between a narrow direct route and a broad detour

We first considered a forked road (Fig. 2), composed of a narrow direct route and a broad detour. The right and left edges were connected by a periodic boundary condition. In the initial state, agents were arranged in a checkerboard pattern on the part of the road that is not forked, as shown in Fig. 3.

Figure 2: Forked road considered in Task I. Cells indicating vacant areas and walls are painted black and green, respectively. White lines are drawn to emphasize the borders of the cells.
Figure 3: Initial placement of agents for (a) $n_{\mathrm{agent}}=12$, (b) $n_{\mathrm{agent}}=24$, (c) $n_{\mathrm{agent}}=32$, and (d) $n_{\mathrm{agent}}=40$ in Task I. The meanings of the black and green cells are the same as in Fig. 2, and agents are painted red.

Here, we considered the case wherein every agent attempted to go right. Thus, we let the reward of the $i$-th agent, $r_{i;t}$, be $+1$ ($-1$) when it moved right (left), and 0 otherwise. The number of groups was set to 1 because all agents had the same aim. Namely, all agents shared one policy, and all of their experiences were reflected in the updates of $\tilde{A}_{G}$, $\tilde{B}_{G}$, and $\boldsymbol{W}_{i}^{\mathrm{out}}$. The sharing of experiences was realized only through Eqs. (22), (23), and (24), and the reward of each agent was not affected by those of other agents. In this environment, agents were expected to take the direct route when not disturbed by other agents. However, if the number of agents increases and the road becomes crowded, only a limited number of agents can use the direct route, and the others should choose the detour. We investigated whether the agents could learn this choice between the two routes.
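As an illustration, a minimal sketch of this reward rule (assuming $x$ increases to the right and that the displacement already accounts for the periodic boundary; the function name and signature are ours), which also covers the left-proceeding agents of Task II below:

```python
def step_reward(dx, target_direction=+1):
    """+1 for a step in the targeted x-direction, -1 for the opposite step,
    0 otherwise (including blocked moves and vertical moves).
    dx: horizontal displacement of the agent in this step (-1, 0, or +1),
    target_direction: +1 for right-proceeding agents, -1 for left-proceeding ones."""
    if dx == target_direction:
        return 1
    if dx == -target_direction:
        return -1
    return 0
```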

3.2.2 Task II: Bidirectional pedestrian flow in a corridor

We next considered the case wherein two types of agents, those trying to go right and those trying to go left, existed in a corridor of width 8 and length 20. As in Task I, the right and left edges were connected by the periodic boundary condition, and the agents were arranged in a checkerboard pattern in the initial state, as shown in Fig. 4. Here, we divided the agents into two groups, the right-proceeding and the left-proceeding agents. The reward of a right-proceeding agent was the same as in the previous case, whereas that of a left-proceeding agent was $-1$ ($+1$) when it moved right (left), and 0 otherwise. Note that the parameter sharing through Eq. (22) was executed only among agents of the same group. In this case, agents should pass through the corridor while avoiding oppositely proceeding agents. We investigated whether and how the agents achieved this task.

Figure 4: Initial placement of agents for (a) $n_{\mathrm{agent}}=16$, (b) $n_{\mathrm{agent}}=32$, (c) $n_{\mathrm{agent}}=48$, and (d) $n_{\mathrm{agent}}=64$ in Task II. The meanings of the black and green cells are the same as in Fig. 2, and right (left)-proceeding agents are painted red (blue). In this figure, agents of different groups are painted in different colors so that they can be distinguished; however, from the viewpoint of the agents they have the same encoding, $(1,0)$.

4 Results

In this section, we present the results of the simulations explained in the previous section.

4.1 Performance in Task I

The learning curves of Task I are shown in Fig. 5. In these graphs, the green curve is the mean value of all agents' rewards, and the red and blue curves indicate the rewards of the best- and worst-performing agents. In each case, we executed 8 independent trials, averaged the values over these trials, and painted the standard errors taken from these trials in pale colors. As shown in Fig. 5, the performance of each agent worsened with increasing $n_{\mathrm{agent}}$, because each agent was prevented from proceeding by other agents. In the case of $n_{\mathrm{agent}}\geq 24$, the direct route was excessively crowded and some agents had no choice but to take the detour, as shown in the snapshots and density colormaps in Figs. 6 and 7 (b)-(d). Thus, the difference in performance between the best and worst agents increased compared with the case of $n_{\mathrm{agent}}=12$. In addition, the density colormaps of Fig. 7 show that some agents tended to get stuck in the bottom-right corner of the detour when $n_{\mathrm{agent}}$ was large. Considering that there is no benefit for agents in staying at this corner, this is thought to be a misjudgment resulting from the low information processing capacity of the ESN.

Figure 5: Learning curves of Task I for (a) $n_{\mathrm{agent}}=12$, (b) $n_{\mathrm{agent}}=24$, (c) $n_{\mathrm{agent}}=32$, and (d) $n_{\mathrm{agent}}=40$. The green curve is the mean value of all agents' rewards, and the red and blue curves indicate the rewards of the best- and worst-performing agents. Each value is averaged over 8 independent trials, and the standard errors taken from them are painted in pale colors.
Figure 6: Snapshots of Task I at $t=100$ for (a) $n_{\mathrm{agent}}=12$, (b) $n_{\mathrm{agent}}=24$, (c) $n_{\mathrm{agent}}=32$, and (d) $n_{\mathrm{agent}}=40$. In these figures, the agents have finished 250 episodes of training (i.e., these snapshots are taken in the 251st episode). The meanings of the cells are the same as in Fig. 3.
Figure 7: Density colormaps of Task I averaged over $100\leq t\leq t_{\mathrm{max}}-1\,(=499)$ of episodes 151-250 of 8 independent trials for (a) $n_{\mathrm{agent}}=12$, (b) $n_{\mathrm{agent}}=24$, (c) $n_{\mathrm{agent}}=32$, and (d) $n_{\mathrm{agent}}=40$.

From the definition of the reward in this study, the total reward that one agent obtains during one episode equals its displacement along the direction in which it tries to go. Hence, the average velocity of the agents, $\bar{v}$, can be calculated as

$\bar{v}=\frac{\left(\mathrm{mean\ value\ of\ all\ agents'\ rewards}\right)}{t_{\mathrm{max}}}$.   (56)

Note that the upper bound of this value is 1, which is attained when all agents always proceed in the direction they want to go. In addition, considering that the simulations of this study impose the periodic boundary condition and the number of agents $n_{\mathrm{agent}}$ does not change, the average density $\bar{\rho}$ is expressed as

$\bar{\rho}=\frac{n_{\mathrm{agent}}}{\left(\mathrm{number\ of\ cells\ that\ agents\ can\ walk\ into}\right)}$.   (57)

Here, note that the phrase “the number of cells that agents can walk into” means the area that is not separated from the agents by walls, not the number of cells that a certain agent can move into at a certain step. In the case of Task I, for example, this number is 192, as counted from Fig. 2. Using these equations, we can plot the fundamental diagram [30], the graph showing the relation between the velocity $\bar{v}$ and the density $\bar{\rho}$, as in Fig. 8. These data are averaged over episodes 151-250 of 8 independent trials. In this graph, we estimated the error bars by calculating the standard error of the mean over the 8 independent trials, but these error bars are smaller than the points. As seen in Fig. 8, $\bar{v}$ is close to its upper bound when $\bar{\rho}$ is small and gradually decreases with increasing $\bar{\rho}$. This decrease itself is observed in many situations of pedestrian dynamics. However, considering that agents get stuck in the corner in the present calculation, improvements of the algorithm and neural network are thought to be required before quantitative comparisons with previous studies.
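A minimal sketch of Eqs. (56)-(57), assuming the per-episode total rewards of all agents are stored in an array (function and variable names are ours):

```python
import numpy as np

def fundamental_diagram_point(episode_rewards, t_max, n_walkable_cells):
    """Compute one (density, velocity) point of the fundamental diagram (Eqs. (56)-(57)).
    episode_rewards: (n_agent,) total reward of each agent in one episode."""
    v_bar = np.mean(episode_rewards) / t_max            # Eq. (56)
    rho_bar = len(episode_rewards) / n_walkable_cells   # Eq. (57), e.g. 192 cells in Task I
    return rho_bar, v_bar
```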

Figure 8: Fundamental diagram of Task I averaged over episodes 151-250 of 8 independent trials. Dotted lines are guides for the eye.

4.2 Performance in Task II

The learning curves of Task II are shown in Fig. 9. Here, as in Section 4.1, we averaged the rewards over 8 independent trials and painted the standard errors among these trials in pale colors. According to Fig. 9, the training itself was successful except when $n_{\mathrm{agent}}=64$. The corresponding snapshots and density colormaps in Figs. 10 and 11 show that, in the case of $n_{\mathrm{agent}}\leq 48$, agents of the same group formed lanes and avoided collisions with agents moving in the opposite direction. Such lane formation has also been observed in various previous studies using mathematical models [2], experiments [21, 31, 32], and other types of RL agents [4, 8]. Note that the lanes had different shapes depending on the random seed; for example, the right-proceeding agents walked on the upper side of the corridor in some trials and on the lower side in others. Hence, the colormaps of Fig. 11 indicate the time average of one representative trial. To show the dependence on the random seed, we also drew the colormaps for the case of $n_{\mathrm{agent}}=32$ with different seeds in Fig. 12.

When $n_{\mathrm{agent}}=64$, the two groups of agents failed to learn how to avoid each other and collided. In this case, each agent could not move forward after the collision, and the reward it obtained was limited to that for the first few steps. Note that in this task, each agent cannot access the experiences of agents of the other group. Thus, from the perspective of MARL, it is possible that the non-stationarity of the environment (from the viewpoint of the agents) worsened the learning efficiency, as in the case of independent learners. From the perspective of soft matter physics, on the other hand, this stagnation of movement under high density resembles the jamming transition observed in a wide range of granular or active matter systems [3, 33, 34, 35]. Nevertheless, larger-scale simulations are required to investigate whether it is really related to the jamming transition.

Figure 9: Learning curves of Task II for (a) $n_{\mathrm{agent}}=16$, (b) $n_{\mathrm{agent}}=32$, (c) $n_{\mathrm{agent}}=48$, and (d) $n_{\mathrm{agent}}=64$. The meaning of the curves is the same as in Fig. 5. Each value is averaged over 8 independent trials, and the standard errors taken from them are painted in pale colors.
Figure 10: Snapshots of Task II at $t=100$ for (a) $n_{\mathrm{agent}}=16$, (b) $n_{\mathrm{agent}}=32$, (c) $n_{\mathrm{agent}}=48$, and (d) $n_{\mathrm{agent}}=64$. In these figures, the agents have finished 250 episodes of training. The meanings of the cells are the same as in Fig. 4.
Figure 11: Density colormaps of Task II averaged over $100\leq t\leq t_{\mathrm{max}}-1\,(=499)$ of episodes 151-250 for (a) $n_{\mathrm{agent}}=16$, (b) $n_{\mathrm{agent}}=32$, (c) $n_{\mathrm{agent}}=48$, and (d) $n_{\mathrm{agent}}=64$. The left (right) columns indicate the data of the right (left)-proceeding agents. These figures are taken from one trial because the lanes have different shapes between trials with different random seeds.
Figure 12: Examples of density colormaps of Task II for $n_{\mathrm{agent}}=32$ with different random seeds, drawn in the same way as Fig. 11.

As in Task I, we can plot the fundamental diagram shown in Fig. 13 using Eqs. (56) and (57). This graph shows a sudden decrease in the average velocity between $n_{\mathrm{agent}}=48$ and 64 (i.e., $\bar{\rho}=0.3$ and 0.4), reflecting the stagnation of movement discussed above. Similar behavior has also been reported in previous research that considered the jamming transition in mathematical models of pedestrians [3].

Figure 13: Fundamental diagram of Task II averaged over episodes 151-250 of 8 independent trials. Dotted lines are guides for the eye.

4.3 Comparison with independent learners

As explained in Section 3.1, we adopted the parameter sharing method in our calculations. In this section, we recalculate the two tasks discussed above for the case in which parameter sharing is not adopted, and compare the results with those of the previous sections to investigate the effect of this method. Specifically, the equations

$\boldsymbol{W}_{i}^{\mathrm{out}}=\tilde{B}_{i}\tilde{A}_{i}^{-1}$,   (58)
$\tilde{A}_{i} = \sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{t\in\mathcal{E}_{n}}\left(\hat{\boldsymbol{X}}_{i}^{\mathrm{res}}(t)-\gamma\hat{\boldsymbol{X}}_{i}^{\mathrm{res}}(t+1)\right)\hat{\boldsymbol{X}}_{i}^{\mathrm{res}}(t)^{T}+\lambda^{n_{l}-1}\beta I$,   (59)

and

$\tilde{B}_{i}=\sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{t\in\mathcal{E}_{n}}r_{i;t}\hat{\boldsymbol{X}}_{i}^{\mathrm{res}}(t)^{T}$,   (60)

were used to calculate $\boldsymbol{W}_{i}^{\mathrm{out}}$ in this section; thus, each agent learned independently. To isolate the effect of replacing Eq. (22) with Eq. (58), the input and reservoir weight matrices, $\boldsymbol{W}^{\mathrm{in}}_{o}$, $\boldsymbol{W}^{\mathrm{in}}_{a}$, $\boldsymbol{W}^{\mathrm{in}}_{b}$, and $\boldsymbol{W}^{\mathrm{res}}$, were still shared by all agents, and the hyperparameters were the same as those listed in Table 1.

Figure 14: Learning curves of independent learners in Task I for (a) $n_{\mathrm{agent}}=12$ and (b) $n_{\mathrm{agent}}=40$. The meaning of the curves is the same as in Fig. 5, and the dotted lines are the corresponding scores under the parameter sharing method averaged over episodes 151-250 of 8 independent trials.
Figure 15: Learning curves of independent learners in Task II for (a) $n_{\mathrm{agent}}=32$ and (b) $n_{\mathrm{agent}}=48$. The meaning of the curves and dotted lines is the same as in Fig. 14.

The learning curves of Tasks I and II are shown in Figs. 14 and 15, respectively. In these graphs, the scores obtained using the parameter sharing method, averaged over episodes 151-250 of 8 independent trials, are drawn as dotted lines. A comparison of the learning curves with these dotted lines reveals that the performance without parameter sharing was almost the same as, or slightly lower than, that under parameter sharing. Hence, the parameter sharing method did not result in a drastic improvement of the learning. The advantage of this method is rather the reduction in the number of inverse matrix calculations, at least in the case of our tasks. We also calculated the case of Task II with $n_{\mathrm{agent}}=64$, but did not plot the graph because the agents failed to form lanes, as in Section 4.2.

4.4 Comparison with the case that two groups share the parameters in Task II

As explained in Section 3.2.2, in Task II we divided the pedestrians into two groups according to the direction in which they want to proceed, and the training of each group was executed independently. However, we can also implement the MARL agents in the case that the parameters of the neural networks are shared between the two groups. Specifically, we first define a two-dimensional one-hot vector $\boldsymbol{\eta}_{g}=(1,0)^{T}$ or $(0,1)^{T}$, which represents the information on each agent's group. Adding this vector to the agents' observation and modifying Eq. (13) as follows:

$\tilde{\boldsymbol{X}}_{i}^{\mathrm{res}}(t;a)=f\left(\boldsymbol{W}^{\mathrm{in}}_{o}\boldsymbol{o}_{i;t}+\boldsymbol{W}^{\mathrm{in}}_{a}\boldsymbol{\eta}_{a}+\boldsymbol{W}^{\mathrm{in}}_{g}\boldsymbol{\eta}_{g}+\boldsymbol{W}^{\mathrm{in}}_{b}+\boldsymbol{W}^{\mathrm{res}}\boldsymbol{X}_{i}^{\mathrm{res}}(t-1)\right)$,   (61)

the learning process can be executed even if the parameters are shared between the two groups. To share the parameters, the equations

$\boldsymbol{W}_{i}^{\mathrm{out}}=\tilde{B}\tilde{A}^{-1}$,   (62)
$\tilde{A} = \sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{j:\,\mathrm{all\ agents}}\sum_{t\in\mathcal{E}_{n}}\left(\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t)-\gamma\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t+1)\right)\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t)^{T}+\lambda^{n_{l}-1}\beta I$,   (63)

and

$\tilde{B}=\sum_{n=1}^{n_{l}}\lambda^{n_{l}-n}\sum_{j:\,\mathrm{all\ agents}}\sum_{t\in\mathcal{E}_{n}}r_{j;t}\hat{\boldsymbol{X}}_{j}^{\mathrm{res}}(t)^{T}$,   (64)

are used to calculate $\boldsymbol{W}_{i}^{\mathrm{out}}$, instead of Eqs. (22), (23), and (24). Here, we let $\boldsymbol{W}^{\mathrm{in}}_{g}$, the input weight matrix corresponding to $\boldsymbol{\eta}_{g}$, be a dense matrix and used the same standard deviation as for $\boldsymbol{W}^{\mathrm{in}}_{a}$; namely, the distribution of each component of $\boldsymbol{W}^{\mathrm{in}}_{g}$ is $\mathcal{N}\left(0,\left(\sigma^{\mathrm{in}}_{a}\right)^{2}\right)$. In this section, we investigate whether this method works. The hyperparameters themselves were the same as those listed in Table 1.
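A minimal sketch of the modified input of Eq. (61), reusing the batched computation shown after Eq. (21) (the function and variable names are ours, not the paper's):

```python
import numpy as np

def q_values_with_group(x_res_prev, obs, group_onehot,
                        W_in_o, W_in_a, W_in_g, W_in_b, W_res, W_out, alpha):
    """Eq. (61): same as Eqs. (13)-(19), with the group one-hot term W_g^in eta_g added."""
    relu = lambda z: np.maximum(z, 0.0)
    x_in = W_in_o @ obs + W_in_g @ group_onehot + W_in_b + W_res @ x_res_prev
    x_tilde = relu(x_in[:, None] + W_in_a)                         # one column per action
    x_bar = alpha * x_tilde + (1.0 - alpha) * x_res_prev[:, None]
    q = W_out @ np.vstack([x_bar, np.ones((1, x_bar.shape[1]))])
    return q.ravel(), x_bar
```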

The results are shown in Fig. 16. In this figure, the dotted lines are the scores of Section 4.2, the case where the two groups do not share their parameters, averaged over episodes 151-250 of 8 independent trials. As seen in Fig. 16 (a) and (b), for $n_{\mathrm{agent}}=32$ and 48 the parameter sharing between the two groups slightly improved or hardly changed the mean and best scores compared with training without parameter sharing between different groups, whereas the worst score was lowered. The additional information, $\boldsymbol{\eta}_{g}$, is thought to have confused the neural network and led to wrong action choices, especially in subtle situations where $Q_{i}(\boldsymbol{o}_{i;t},a)$ is difficult to evaluate. The result for $n_{\mathrm{agent}}=64$ shown in Fig. 16 (c) is interesting. As we saw in Section 4.2, agents that did not share their parameters between different groups failed to learn to form lanes in all trials of this case. However, the agents of this section succeeded in forming lanes in some trials. We can think of two possible factors causing this difference. One is the number of agents choosing wrong actions. Blockage in this task occurs when the agents learn the wrong strategy of “going straight and earning rewards for the first few steps” before they learn to form lanes. As discussed above, the agents of this section are sometimes confused by the additional information $\boldsymbol{\eta}_{g}$. Even if the agents learn the wrong strategy, the blockage is eased compared with that of Section 4.2 because such “confused” agents do not simply go straight. The other possible factor is the sharing of experience. The agents of this section share their experience between different groups through parameter sharing, whereas those of Section 4.2 share it only within the same group. This experience sharing is thought to promote efficient learning of the correct strategy, namely that “agents of different groups should go to different lanes.” It is difficult to identify which of these two factors plays the dominant role from this task alone.

Figure 16: Learning curves of the case that the two groups share the parameters of the neural network in Task II for (a) $n_{\mathrm{agent}}=32$, (b) $n_{\mathrm{agent}}=48$, and (c) $n_{\mathrm{agent}}=64$. The meaning of the curves is the same as in Fig. 14, and the dotted lines are the corresponding scores of the case that the two groups do not share their parameters, averaged over episodes 151-250 of 8 independent trials.

4.5 Comparison with deep reinforcement learning (DRL) algorithms

In this section, we calculated the same tasks using representative DRL algorithms, the DQN [9], advantage actor-critic (A2C) [10], and proximal policy optimization (PPO) [11] methods, and compare their performance with that of our algorithm. In these calculations, we prepared a neural network with the same size as the reservoir of our ESN, i.e., the network has one hidden (perceptron) layer with 1024 neurons. Each agent observed the 2-channel bitmap data of the $11\times 11$ cells centered on itself, and agents of the same group shared their experience by parameter sharing, as in our algorithm. Parameter sharing between different groups, as in Section 4.4, was not introduced. Note that in the A2C and PPO methods, the actor and critic networks shared their parameters except for the output weight matrices.
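As an illustration of the baseline architecture described above, a minimal sketch in PyTorch (assuming the flattened 242-dimensional observation; this is our reading of the description, not the authors' code):

```python
import torch.nn as nn

class BaselineQNet(nn.Module):
    """One hidden layer with 1024 neurons, matching the reservoir size of the ESN.
    For DQN the output is one Q-value per action; A2C/PPO would add a value head
    that shares the hidden layer."""
    def __init__(self, obs_dim=11 * 11 * 2, hidden=1024, n_actions=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        return self.q_head(self.body(obs))
```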

As the optimizer, we adopted Adam [36]. The codes were implemented using PyTorch (torch 2.2.0), and we used the default values of this library for the initial distribution of the parameters of the neural network and for the hyperparameters of the optimizer, except for the learning rate. The parameters related to the environment or reward, such as $t_{\mathrm{max}}$ and $\gamma$, were the same as in our method. In addition, we used the same schedule of $\epsilon$ as in our method for DQN. The other hyperparameters used in this section are listed in Table 2. Here, the learning rate for PPO was set smaller than those of the other methods, except in the cases of Task II with $n_{\mathrm{agent}}=48$ and 64, to stabilize the learning. In addition, we made the minibatch size of DQN for Task II with $n_{\mathrm{agent}}=48$ and 64 larger than for the other tasks. The reason for these exceptions is explained later.

meaning | value
learning rate for DQN | $1\times 10^{-3}$ for Task II with $n_{\mathrm{agent}}=48, 64$; $2.5\times 10^{-4}$ otherwise
learning rate for A2C | $2.5\times 10^{-4}$
learning rate for PPO | $2.5\times 10^{-4}$ for Task II with $n_{\mathrm{agent}}=48$; $1\times 10^{-3}$ for Task II with $n_{\mathrm{agent}}=64$; $5\times 10^{-5}$ otherwise
replay-memory size for DQN | $10^{6}$
minibatch size for DQN | 1024 for Task II with $n_{\mathrm{agent}}=48, 64$; 64 otherwise
number of steps between updates of the network for A2C | 5
number of steps between updates of the network for PPO | 125
replay-memory size for PPO | $500\,n_{\mathrm{agent}}$
minibatch size for PPO | 125
number of epochs for PPO | 4
value-loss coefficient for A2C and PPO | 0.5
entropy coefficient for A2C and PPO | 0.01
clipping value for PPO | 0.2
$\lambda_{\mathrm{GAE}}$ for PPO | 0.95
maximum gradient norm for DQN and A2C | 50
maximum gradient norm for PPO | 1
Table 2: Hyperparameters for the DRL methods

The results are shown in Figs. 17 and 18. Here, we plotted the mean value of all agents' rewards for each algorithm, and all graphs are averaged over 8 independent trials. As seen in these figures, the performance of our method on the easy tasks, i.e., Task I with $n_{\mathrm{agent}}=12$ and Task II with $n_{\mathrm{agent}}=32$, resembles that of DQN and is slightly worse than that of the policy gradient methods. In addition, the learning speed of our method is slower than that of these methods. These points are thought to result from the $\epsilon$-greedy method, which decreases $\epsilon$ only gradually and keeps this constant above the threshold value $\epsilon_{\mathrm{min}}$. However, the performance on Task I with $n_{\mathrm{agent}}=40$ is clearly worse than that of all the DRL algorithms. As discussed in Section 4.1, the low information processing capacity of the ESN is thought to be the cause of this phenomenon.

Figure 17: Comparison of learning curves in Task I for (a) $n_{\mathrm{agent}}=12$ and (b) $n_{\mathrm{agent}}=40$. The red, green, blue, and black curves are the mean values of all agents' rewards for our method, DQN, A2C, and PPO, respectively.
Figure 18: Comparison of learning curves in Task II for (a) $n_{\mathrm{agent}}=32$, (b) $n_{\mathrm{agent}}=48$, and (c) $n_{\mathrm{agent}}=64$. The meaning of the curves is the same as in Fig. 17.

In Task II with $n_{\mathrm{agent}}=48$, the performance of the DRL methods was worse than that of our method. This is because the DRL methods failed to learn to form lanes and stagnated in some trials, whereas our method succeeded in learning in all trials. In this task, we investigated the performance of each DRL method for four values of the learning rate, $\eta=1\times 10^{-3}$, $2.5\times 10^{-4}$, $5\times 10^{-5}$, and $1\times 10^{-5}$, because it was possible that the choice of an appropriate hyperparameter would improve the learning. The result is shown in Fig. 19. As seen in this figure, stagnation occurred at almost all learning rates of every method. Note that PPO with $\eta=1\times 10^{-3}$ exceptionally succeeded in learning to form lanes in every trial; however, $\eta$ in this case was so large that the learning became unstable. In Fig. 18 (b), we chose the learning rate $\eta=1\times 10^{-3}$ for DQN (the red curve of Fig. 19 (a)) and $\eta=2.5\times 10^{-4}$ for A2C and PPO (the green curves of Fig. 19 (b) and (c)), because these values seemed to perform better than the other values in each method. We also executed a similar calculation for DQN with a minibatch size of 64, the size used for the other tasks; however, in that case, all trials for each of the four learning rates $\eta$ failed to learn to form lanes. Hence, we increased the minibatch size to improve the sampling efficiency. A similar investigation was also executed for the case of Task II with $n_{\mathrm{agent}}=64$, i.e., we calculated the performance of each DRL method for the four learning rates. However, in this case, only PPO with $\eta=1\times 10^{-3}$ succeeded or began to succeed in forming lanes in some trials, while the other methods and learning rates failed in all trials.

It is difficult to interpret the results of Task II because there are many unknown points about the algorithm combining ESN and the LSPI method. One possibility is that agents choosing wrong actions cause the unclogging, as discussed in Section 4.4. Indeed, ESN has a lower information processing capacity and chooses wrong actions more frequently than the DRL methods, as we saw in Task I with $n_{\mathrm{agent}}=40$. This point is thought to lead to the unclogging and the final success in learning under $n_{\mathrm{agent}}=48$. In addition, PPO with $\eta=1\times 10^{-3}$, which succeeded in forming lanes in all trials with $n_{\mathrm{agent}}=48$ and in some trials with $n_{\mathrm{agent}}=64$, has unstable learning curves because of its large learning rate, as explained above. Such unstable behavior of the agents is also thought to result in unclogging. The other possible factor in the success of our method under $n_{\mathrm{agent}}=48$ is that the sampling efficiency of the LSPI method is higher than that of gradient descent methods, because it fully utilizes all experiences gathered by the agents through linear regression. However, our method failed to form lanes under $n_{\mathrm{agent}}=64$, and its performance in this case was surpassed by that of PPO with $\eta=1\times 10^{-3}$. Considering this point, the effects of the two factors mentioned above on the final performance are thought to be complex. Hence, these effects should also be studied in other RL tasks that we did not consider here.

Figure 19: Learning curves of (a) DQN, (b) A2C, and (c) PPO in task II at $n_{\mathrm{agent}}=48$. The red, green, blue, and black curves are the mean values of all agents' rewards when the learning rate is $1\times 10^{-3}$, $2.5\times 10^{-4}$, $5\times 10^{-5}$, and $1\times 10^{-5}$, respectively.

We also measured the computational time of the whole simulation in task I at $n_{\mathrm{agent}}=1$ and 12 for each method, and summarized the results in Table 3 (a). The calculations were executed on a laptop equipped with a 12th Gen Intel(R) Core(TM) i9-12900H, and application software other than resident programs was closed during the measurements. The real computational time of each simulation shown in Table 3 (a) includes the time spent on calculations not related to the neural network, such as updating the environment. Hence, to evaluate the computational cost of the neural network itself, including both training and forward propagation, we also measured the computational time of the case in which agents choose their actions randomly, and show the difference of each simulation from this case in Table 3 (b); a minimal sketch of this baseline-subtraction measurement is given after Table 3. In the random-action case, the neural network itself was not implemented. According to this table, the computational cost of our method is less than half of that of each DRL method for both $n_{\mathrm{agent}}=1$ and 12. However, regarding the difference in time between $n_{\mathrm{agent}}=1$ and 12, which corresponds to the increase in computational time caused by adding agents, DQN with minibatch size 64 is better than the other methods, and ours is second. This is because the time spent on backpropagation for DQN under a fixed minibatch size does not depend on $n_{\mathrm{agent}}$, whereas the minibatch size for A2C, the replay-memory size for PPO, and the sizes of the matrices used in our method, $Y_{1}$, $Y_{2}$, and $R$ given by eqs. (47) and (48), are all proportional to $n_{\mathrm{agent}}$. Considering the possibility that fixing the minibatch size of DQN worsens its sampling efficiency, we cannot conclude from Table 3 (b) alone that DQN is a more efficient algorithm than ours for a large number of agents. Indeed, for difficult tasks such as task II with $n_{\mathrm{agent}}=48$, we had to increase the minibatch size for DQN at the cost of computational time, as explained above. Hence, the computational cost of our method under large $n_{\mathrm{agent}}$ is lower than those of DRL methods whose sampling efficiency does not decrease as $n_{\mathrm{agent}}$ increases.

(a) Real computational time
method  $n_{\mathrm{agent}}=1$  $n_{\mathrm{agent}}=12$  difference
ESN+LSPI 52.14(8) 420(7) 368(7)
DQN (minibatch size =64=64) 715(7) 898(5) 183(9)
DQN (minibatch size =1024=1024) 2645(4) 4215(9) 1570(10)
A2C 202(2) 1159(1) 958(2)
PPO 164(1) 1301(21) 1137(21)
random action 1.516(6) 11.1(2) 9.6(2)
(b) Difference from the random-action case
method  $n_{\mathrm{agent}}=1$  $n_{\mathrm{agent}}=12$  difference
ESN+LSPI 50.62(8) 409(7) 358(7)
DQN (minibatch size =64=64) 713(7) 886(5) 173(9)
DQN (minibatch size =1024=1024) 2643(4) 4204(9) 1560(10)
A2C 200(2) 1148(1) 948(2)
PPO 163(1) 1290(21) 1127(21)
Table 3: (a) Time spent on the simulation using each method in task I with $n_{\mathrm{agent}}=1$ and 12, averaged over 8 independent trials (measured in seconds). The right column shows the difference between these two times. (b) Difference of each time from the case in which the agents choose their actions randomly.
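As a concrete illustration of the measurement procedure behind Table 3 (b), the following is a minimal sketch of the baseline-subtraction timing described above. Here, `run_simulation` is a hypothetical placeholder for one full simulation run (method, number of agents, and all other settings fixed); it is not a function from our repository, and the number of trials is assumed.

```python
import time

def timed(run_simulation, **kwargs):
    """Return the wall-clock time (seconds) of one full simulation run."""
    start = time.perf_counter()
    run_simulation(**kwargs)
    return time.perf_counter() - start

def nn_cost(run_simulation, method, n_agent, n_trials=8):
    """Average over trials and subtract the random-action baseline, so that
    only the cost attributable to the neural network (training plus forward
    propagation) remains, as in Table 3 (b)."""
    full = sum(timed(run_simulation, method=method, n_agent=n_agent)
               for _ in range(n_trials)) / n_trials
    baseline = sum(timed(run_simulation, method="random", n_agent=n_agent)
                   for _ in range(n_trials)) / n_trials
    return full - baseline
```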

5 Summary

In this study, we implemented MARL using an ESN and applied it to a simulation of pedestrian dynamics. The LSPI method was utilized to calculate the output weight matrix of the reservoir. As the environment, a grid world composed of vacant spaces and walls was considered, and several agents were placed in it. Specifically, we investigated two types of tasks: I. a forked road composed of a narrow direct route and a broad detour, and II. a corridor in which two groups of agents proceed in opposite directions. The simulations confirmed that the agents could learn to move forward by avoiding other agents, provided that the density of the agents was not too high.

In this study, we considered cases in which the number of agents was at most 64. However, hundreds of agents are required for simulations of real roads, intersections, or evacuation routes. Furthermore, to investigate certain phenomena such as the jamming transition mentioned in Section 4.2, an even larger number of agents should be considered. Implementing such a large number of RL agents using deep learning may result in enormous computational complexity. Hence, reservoir computing, including ESN, which can reflect the agents' experiences efficiently at a comparatively low computational cost, is expected to be useful for such studies in the future. Note that RL-based methods, including ours, require time for training, compared with traditional pedestrian simulations in which agents obey predefined rules. However, these methods have the advantage that agents can automatically learn appropriate actions without complex rule settings. In this study, for example, agents learned to avoid obstacles such as other agents without the introduction of repulsive forces or additional rewards for avoidance itself. Hence, the proposed method is thought to be useful in complex tasks such as evacuation routes littered with obstacles. In addition, we believe that this method is applicable not only to pedestrian dynamics but also to other group behaviors of animals, such as competition for food and territories.

Along with such applied studies, improving the performance of reservoir computing itself is also important. At present, the performance of reservoir computing tends to be poor on difficult tasks because of its low information processing capacity. For example, in image recognition, reservoir computing-based approaches exhibit low accuracy on CIFAR-10, although they succeed on simpler tasks such as MNIST[25, 29]. Similarly, for video games that use images as input data, reservoir computing-based RL has yielded only a few successful examples[14]. In our study as well, we observed lower performance than the DRL methods in some tasks, as seen in Sec. 4.5. In addition, considering that the learning speed under the $\epsilon$-greedy method is limited by the value of $\epsilon$, effective exploration algorithms should be studied so that this value can be decreased quickly. However, as discussed in Sec. 4.5, our method may have advantages over DRL methods not only in computational cost but also in other respects such as sampling efficiency. Hence, if the disadvantages mentioned above are addressed in future studies, MARL-based simulations of the group behaviors of humans or animals will be further promoted.
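As a simple illustration of the exploration issue raised above, the following is a minimal sketch of $\epsilon$-greedy action selection with an exponentially decaying $\epsilon$. The decay schedule and parameter values are our own illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Choose a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(episode, eps_start=0.3, eps_min=0.01, decay=0.995):
    """Illustrative exponential decay: epsilon shrinks every episode but is kept
    above a floor so that some exploration always remains."""
    return max(eps_min, eps_start * decay ** episode)
```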

In addition, whether parameter sharing between different groups, as in Section 4.4, improves the sampling efficiency remains unknown. This point should also be studied in future work to improve the performance of MARL itself.

The source code for this work is available at https://github.com/Hisato-Komatsu/MARL_ESN_pedestrian.

Acknowledgements

We would like to thank Editage for English language editing.

References


  • [1] Vicsek T, Czirók A, Ben-Jacob E, Cohen I, and Shochet O, 1995 Phys. Rev. Lett. 75(6) 1226
  • [2] Helbing D and Molnar P, 1995 Phys. Rev. E 51(5) 4282
  • [3] Muramatsu M, Irie T, and Nagatani T, 1999 Physica A 267(3-4) 487-498
  • [4] Martinez-Gil F, Lozano M, and Fernández F, 2014 Simul. Model. Pract. Theory 47 259-275
  • [5] Martinez-Gil F, Lozano M, and Fernández F, 2017 Simul. Model. Pract. Theory 74 117-133
  • [6] Zheng S and Liu H, 2019 IEEE Access 7 147755–147770
  • [7] Bahamid A and Ibrahim A M, 2022 Neural Comput. Applic. 34 21641–21655
  • [8] Huang Z, Liang R, Xiao Y, Fang Z, Li X, and Ye R, 2023 Physica A 625 129011
  • [9] Mnih V, Kavukcuoglu K, Silver D, Rusu A A, Veness J, Bellemare M G, Graves A, Riedmiller M, Fidjeland A K, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, and Hassabis D, 2015 Nature 518 529-533
  • [10] Mnih V, Badia A P, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, and Kavukcuoglu K, 2016 In International conference on machine learning, vol. 48, (pp. 1928-1937) PMLR.
  • [11] Schulman J, Wolski F, Dhariwal P, Radford A, and Klimov O, 2017 arXiv preprint, arXiv:1707.06347
  • [12] Szita I, Gyenes V, and Lörincz A, 2006 In International Conference on Artificial Neural Networks, (pp.830-839)
  • [13] Oubbati M, Kächele M, Koprinkova-Hristova P, and Palm G, 2011 In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, (pp. 117-122)
  • [14] Chang H, and Futagami K, 2020 Appl. Intell. 50 2400-2410.
  • [15] Zhang C, Liu C, Song Q, and Zhao J, 2021 In 4th International Conference on Artificial Intelligence and Big Data (ICAIBD), (pp. 104-108) IEEE.
  • [16] Chang H H, Song H, Yi Y, Zhang J, He H, and Liu L, 2018 IEEE Internet Things J. 6(2) 1938-1948
  • [17] Buşoniu L, Babuška R, and De Schutter B, 2008 IEEE Trans. Syst. Man Cybern. Part C 38(2) 156-172
  • [18] Gronauer S and Diepold K, 2022 Artif. Intell. Rev. 55 895-943
  • [19] Du W, and Ding S, 2021 Artif. Intell. Rev. 54 3215–3238
  • [20] Oroojlooy A, and Hajinezhad D, 2023 Appl. Intell. 53 13677–13722
  • [21] Feliciani C and Nishinari K, 2016 Phys. Rev. E 94(3) 032304.
  • [22] Gupta J K, Egorov M, and Kochenderfer M, 2017 In Autonomous Agents and Multiagent Systems, (pp. 66-83)
  • [23] Jaeger H, 2001 GMD Technical Report 148
  • [24] Lukoševičius M, 2012 In Neural Networks: Tricks of the Trade, (pp. 659–686), Springer Berlin, Heidelberg.
  • [25] Zhang H and Vargas D V, 2023 IEEE Access 11 81033-81070.
  • [26] Lagoudakis M G, and Parr R, 2003 J. Mach. Learn. Res. 4 1107-1149.
  • [27] Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, and Mordatch I, 2017 In Advances in neural information processing systems, (pp. 6379-6390)
  • [28] Tong Z and Tanaka G, 2018 In 2018 24th International Conference on Pattern Recognition (ICPR), (pp. 1289–1294). IEEE
  • [29] Tanaka Y and Tamukoh H, 2022 Nonlinear Theory and Its Applications, IEICE, 13(2) 397–402
  • [30] Seyfried A, Steffen B, Klingsch W, and Boltes M, 2005 J. Stat. Mech. P10002
  • [31] Kretz T, Grünebohm A, Kaufman M, Mazur F, and Schreckenberg M, 2006 J. Stat. Mech. P10001
  • [32] Murakami H, Feliciani C, Nishiyama Y, and Nishinari K, 2021 Sci. Adv. 7(12) eabe7758
  • [33] O’Hern C S, Silbert L E, Liu A J, and Nagel S R, 2003 Phys. Rev. E 68(1) 011306
  • [34] Majmudar T S, Sperl M, Luding S, and Behringer R P, 2007 Phys. Rev. Lett. 98(5) 058001
  • [35] Henkes S, Fily Y, and Marchetti M C, 2011 Phys. Rev. E 84(4) 040301
  • [36] Kingma D P and Ba J, 2014 arXiv preprint, arXiv:1412.6980