Attention Loss Adjusted Prioritized Experience Replay

Zhuoying Chen¹, Huiping Li¹ , Rizhong Wang¹ ¹ School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China, 710072 Email:[email protected], [email protected], [email protected]

Abstract

Prioritized Experience Replay (PER) is a technical means of deep reinforcement learning by selecting experience samples with more knowledge quantity to improve the training rate of neural network. However, the non-uniform sampling used in PER inevitably shifts the state-action space distribution and brings the estimation error of Q-value function. In this paper, an Attention Loss Adjusted Prioritized (ALAP) Experience Replay algorithm is proposed, which integrates the improved Self-Attention network with Double-Sampling mechanism to fit the hyperparameter that can regulate the importance sampling weights to eliminate the estimation error caused by PER. In order to verify the effectiveness and generality of the algorithm, the ALAP is tested with value-function based, policy-gradient based and multi-agent reinforcement learning algorithms in OPENAI gym, and comparison studies verify the advantage and efficiency of the proposed training framework.

Index Terms:

PER, ALAP, Double-Sampling, Self-Attention.

I Introduction

In recent years, deep reinforcement learning has made great achievements in many fields such as go[1], natural language processing [2], robot control [3], etc. Experience Replay mechanism [4] is an important training means of deep reinforcement learning. It learns the neural network by randomly selecting a fixed number of experience samples from buffer, and breaks the relevance between training samples in reinforcement learning. Experience Replay puts the samples in the experience pool with the same sampled frequency, and this uniform sampling method is bound to cause too much invalid exploration in the early stage of training. Moreover, when the agent is in the sparse reward environment [5], this disadvantage is rather obvious, which makes the algorithm difficult to converge.

In order to break the bottleneck of the above algorithm, the Priority Experience Replay (PER) mechanism [6] is proposed; its core idea is to take the Temporal Difference (TD) error’s absolute value as the index to measure the importance of the sample. By changing the probability distribution of the collected samples and learning the relatively important experience samples more frequently, the training speed and efficiency of the network have been improved. It is noted that the non-uniform sampling of PER artificially adjusts the sampled frequency of the samples, and shifts the state-action space distribution to estimate the Q-value function [7, 8, 9, 10] at the same time [11], which leads to the estimation error. Aiming at error correction, a hyperparameter $\beta$ [6] [12] is introduced to regulate the weight of importance sampling, which can prevent the instability of the training process, but it only keeps unbiased in the converged phase. Note that $\beta$ is positively related with training progress [6], and it increases linearly to 1 from initial value $\beta_{0}$ as the number of training episodes increases. However, the training progress is not distributed uniformly over training episodes, and a linear change of $\beta$ would cause extra error. For example, when the episodes reach 50 $\%$ of the total number, the model may have already converged. In this case, $\beta$ is linearly set to be $\frac{1-\beta_{0}}{2}$ , much less than the desired value 1 for the converged model, which would introduce large estimation errors of Q-value function during training.

To resolve this issue, in this paper, an efficient and general training framework called Attention Loss Adjusted Prioritized (ALAP) Experience Replay is proposed. ALAP can greatly reduce the estimation error of the Q-value function from both active and passive aspects. On the one hand, the dynamic loss function is constructed by Huber equation [13, 14, 15], which has passive adaptability in the face of different sizes of TD error samples. On the other hand, we introduce the Self-Attention mechanism to add parallel attention network branches into the actor-critic (A-C) framework or Q network. The attention module can actively measure the similarity of sample distribution from experience pool, which represents the exact progress of the training process. In order to establish the mapping relationship between $\beta$ and the output of attention module, we normalize the output to obtain more accurate $\beta$ , so that the estimation error of Q-valued function can be actively reduced.

Furthermore, considering that PER applys the fixed criteria to screen out experience samples, which cannot restore the original sample distribution from experience pool, we propose a Double-Sampling mechanism which covers both priority based sampling (PS) and random uniform sampling (RUS). PS is only in charge of providing training samples for critic network (or Q network), while RUS is responsible for offering data input of attention module.

To evaluate the algorithm’s efficiency and generality, we compare ALAP with conventional PER and the state of art algorithm LAP in DQN [8], DDPG [16] and MADDPG [17], and the performance advantages are verified in three different environments of OPENAI gym[18].

II Related Work

The inspiration of priority sampling used in reinforcement learning comes from the prioritized sweeping for value function iterations, and its success in DQN [19, 20, 21] has attracted a lot of attention. DQN uses the same maximization operator to select and evaluate actions, which brings the problem of overestimation. The PER algorithm in the Double-DQN (DDQN) is developed in [22]. In addition, PER has generated favorable outcomes in many algorithms such as DDPG [23] [24], Rainbow [25], etc.

PER itself has also evolved many improved versions. Liu [26] and Vanseijen [27] studied the effect of buffer size on algorithm performance. In order to alleviate the waste of computing resources caused by the excessive experience pool, Shen et al. [28] proposed an experience classification method, which divides the TD error distribution into different segments, and the transition tuples in the same segment have the same priority. This clustering method can reduce the cache space, and break the relevance of experience samples. Aiming to screen out more valuable experience samples, Gao [29] et al. integrated the reward value with the TD error to form a new priority parameter, which replaces the TD error as screening basis.

The aforementioned works mainly focus on the adjustment of sample priority evaluation, while ignore the deviation caused by PER itself. In order to resolve this problem, a Loss Adjusted Prioritized (LAP) [14] algorithm is proposed, which proves that any loss function evaluated by non-uniform sampling can be transformed into another uniform sampling one with the same expected gradient. The loss function is segmentaly described by Huber function, and different sampling methods are adopted according to the variation of TD error. This can suppress the sensitivity of the root Mean Square Error (MSE) loss to outliers, and reduce the influence of Mean Absolute Error (MAE) loss on convergence rate. On the basis of LAP algorithm, the work [15] theoretically analyzes the reasons for the poor effect on the combination of PER and actor-critic algorithm in continuous action control environment, and claims that there is a big error between Q-function calculation and the actual value, so it cannot correctly guide the actor network to make reasonable actions.

The LAP algorithm improves the robustness aganist outliers by designing the segemental loss function. However, the non-uniform sampling of PER will change the state-action space of the model when the MAE loss is applied, resulting in the estimation deviation of Q-value function. Therefore, LAP does not solve the underlying problem of PER.

In this paper, we propose a new method to resolve this issue by accurately determining the relation between the training progress and $\beta$ . In particular, this paper first designs an improved Self-Attention network, which quantifies the training progress by calculating the similarity of samples in experience pool. Meanwhile, we propose a Double-Sampling mechanism to ensure the parallel execution of training and similarity calculation. In this way, the deviation of the Q-value estimation can be greatly reduced. The main innovation of this study is as follows:

•

An extended actor-critic framework (or DQN) is proposed, which embeds the attention module into the A-C strcture (or DQN) as a parallel branch of the critic network (or Q network), takes the sequence of state-action pairs collected from experience pool as the input of attention module to calculates the similarity of the experience samples, and constructs the nonlinear mapping relationship between the output of the attention module and $\beta$ through the fully connected network.
•

A parallel Double-Sampling Mechanism (DSM) for two sampling procedures is proposed. The priority based sampling (PS) provides training samples for the model, and the random uniform sampling (RUS) is responsible for providing the data input of attention module. In addition, in order to guarantee the stable parallel execution of the Double-Sampling mechanism, we build a mirror buffer with the same data distribution as the original experience pool for the data source of uniform sampling.
•

In addition, an imporved Huber loss function is proposed to describe the loss function in segments, which extremly suppress the model sensitivity to outliers by adaptive loss; meanwhile, the priority clipping part of the LAP is removed to ensure the algorithm will not sacrifice the training speed.

The structure of the rest paper is as follows: Section III is the problem statement and the introduction of the basic knowledge, and Section IV describes the design process of ALAP algorithm in detail. Section V provides comparative experimental results and analysis. Finally, Section VI draws the conclusions.

III PRELIMINARIES

III-A Markov Decision Process

A Markov model with five key elements $\langle S,A,S^{\prime},r,\gamma\rangle$ is defined to describe the interaction process between the agent and environment. Among them, the state $S$ represents the possibility configuration of the agent, while vector $A$ denotes the action space. At each time step $t$ , the agent follows the strategy $\pi_{\theta}$ to select the action $A_{t}$ , and obtains the next state $S^{\prime}$ according to the state transition equation. In this process, the agent received immediate reward $r_{t}$ according to a function of its action and state. We call the process of environment from the initial state to the terminated state as a trajectory $\zeta$ , and the agent’s goal is to maximize the expected return $R=\sum_{t=0}^{T}\gamma_{t}r_{t}$ to continuously optimize $\zeta$ , where $\gamma$ and $T$ represent the discount factor and time horizon, respectively.

III-B Prioritized Experience Replay

PER applys a non-uniform sampling strategy, which adjusts the sampled frequency of experience transitions depending on the magnitude of TD-error $\left|\delta\right|$ . It has two improvements in comparison with the uniform sampling experience replay [30]. Firstly, PER assigns each experience sample $i$ with a sampling probability proportional to its TD error, and improves the training efficiency by learning important experience samples more frequently as follow:

\displaystyle P(i)=\frac{p_{i}^{\alpha}}{\sum_{j}p_{j}^{\alpha}}

(1)

In the formular (1), $p_{i}=\left|\delta(i)\right|+\epsilon$ is the priority of transition $i$ . The hyperparameter $\alpha$ is used to smooth out extremes, and a minimal positive constant value $\epsilon$ makes sure that the sampled probability is greater than 0, which gives all transitions a chance to be selected.

Secondly, it introduces importance sampling weight ratios $\bar{w}(i)$ to correct the estimation deviation caused by shifted state-action distribution:

\displaystyle\bar{w}(i)=\frac{w(i)}{\max_{j}w(j)},w(i)=\left(\frac{1}{N}\cdot\frac{1}{P(i)}\right)^{\beta},

(2)

\displaystyle L_{PER}(\delta(i))=\bar{w}(i)L(\delta(i)).

(3)

Note that $\bar{w}(i)\delta(i)$ will be used instead of $\delta(i)$ when updating formular (3). The hyperparameter $\beta$ increases linearly from $\beta_{0}$ to 1 along the training process.

III-C Loss Adjust Prioritized Experienced Replay

The linear annealing of importance weight in PER cannot completely eliminate the deviation, and the sensitivity to outliers is easy to further magnify the deviation. Fujimoto et al. [14] proposed an LAP algorithm, which uses the segemental loss function

L_{Huber}(\delta(i))=\begin{cases}0.5\delta(i)^{2}&\left|\delta(i)\right|\leq 1,\\ \left|\delta(i)\right|&otherwise,\\ \end{cases}

(4)

combined with priority clipping scheme to further reduce the deviation:

\displaystyle P(i)=\frac{max(\left|\delta(i)\right|^{\alpha},1)}{\sum_{j}max(\left|\delta(j)\right|^{\alpha},1)}.

(5)

By combining (4) and (5), it can be seen that when the absolute value of TD error is less than and equal to 1, MSE is applied as the loss function according to (4), and the priority of transition $i$ is cut up to 1, so that the uniform sampling is performed for each transition $i$ with a sampled probability of $\frac{1}{N}$ according to (5). Similarly, if the TD error is greater than 1, MAE is used to suppress the sensitivity to outliers in (4), and the non-uniform sampling is performed according to (5). Although LAP avoids the interference of outliers on training, the uniform sampling part decreases the training speed. In addition, MAE can only suppress the sensitivity to outliers, but cannot correct the shifted distribution, and the deviation does exist. Saglam et al. [15] proposed an LA3P algorithm based on LAP, using inverse sampling to select samples with small TD error to train the network, which avoids the uncertainty caused by large TD error transition, but does not meet the requirement on sample knowledge of PER.

IV Attention Loss Adjusted Prioritized Experience Replay

In this section, the design of ALAP algorithm is described in detail from two parts: Self-Attention network and Double-Sampling mechanism.

IV-A Design of Self-Attention Network

In PER and its improved algorithms, the hyperparameter $\beta$ is an important index to regulate the importance sampling weight (IS), which determines the correction strength of the algorithm with respect to error caused by PER. $\beta$ is positively correlated with the training progress and reaches 1 to fully compensate for esitmation error as the training ends. The PER uses the linear annealing method to make $\beta$ increase linearly with the training episodes. However, the training progress is not uniformly distributed over the scale of episodes number, and the way of linear parameter tuning may exacerbate the error. Note that the size of $\beta$ depends on the specific progress of the training; in order to quantify the training progress, we build an improved Self-Attention network to measure the training progress by calculating the similarity of sample distribution in the experience pool. At the beginning of training, the transitions in experience pool generated by the stochastic exploration of agent is random and uniform. As the training proceeds, the agent will make more use of the current optimal strategy to select high-return actions, that is to say, given a state, the action selected by the agent will be roughly the same, and the similarity of transitions sequence generated by the interaction between agent and environment will be improved. We input the mini-batch sized state-action pair sequences $X=[(S_{1},A_{1}),(S_{2},A_{2}),...,(S_{m},A_{m})]$ into the Self-Attention network, whose output is the quantized value of the sample similarity in experience pool, and $\beta$ is obtained after normalization.

The attention function can usually be described as a nonlinear fitting network that maps a query and a set of key-value pairs to the output [31, 32]. Its output is essentially a weighted sum containing importance weights. In practice, we package queries, keys and values into the corresponding matrices Q, K and V, and the output of attention value is calculated as follows:

\displaystyle Attention(Q,K,V)=softmax(\frac{Q\cdot K^{T}}{\sqrt{d_{k}}})V,

(6)

where $d_{k}$ is the dimension of queries and keys vectors, dividing the dot-product $Q\cdot K^{T}$ by $\sqrt{d_{k}}$ to prevent the gradient from disappearing.

In order to meet the needs of similarity calculation on internal elements of the input sequence, this paper optimizes the Self-Attention network properly and removes the value vector, and the attention module only outputs the similarity of key-query pairs. Considering that X is a list constructed of m state-action pair vectors, referring to the physical meaning of vector projection, we measure the similarity between $Q$ and $K$ by the sum of the projections among the corresponding vectors, where $Q=XW_{Q}$ , $K=shuffle(Q)$ , and both of them maintain dimensions consistent with X; $W_{Q}$ represents the initial weight matrix of $Q$ . Since the similarity is calculated for the internal elements of $X$ , we need to do projection operation after randomly rearranging the order of the elements (by shuffle operation) in $Q$ for each $q_{i}$ and $k_{i}$ . The process for improved Self-Attention network to calculate the attention value is shown as follows:

\displaystyle Self-Attention(Q,K)=sigmoid(\frac{Q\bullet K^{T}}{\sqrt{d_{k}}}),

(7)

\displaystyle Q\bullet K^{T}=\sum_{i=1}^{m}\frac{(q_{i}\cdot k_{i})}{\left|k_{i}\right|},

(8)

where $\bullet$ stands for projection sum operation of the corresponding elements between $Q$ and $K$ . The input of the Self-Attention module is the state-action pair sequence, and the output attention value represents the similarity of the elements in $X$ . After passing through the fully connection layer, the attention value is normalized by the activation function to obtain $\beta$ . The schematic diagram of Self-Attention mechanism is shown in Fig. 1. Note that the schematic diagram in this paper only depicts the network structure of ALAP under A-C framework, and the combination of ALAP with the algorithm based on value-function is similar and it is omitted here.

Refer to caption — Figure 1: Improved Self-Attemtion network.

We can see that the Attention network’s input is a mini-batch sized sequence of experience transitions provided by PS in Fig.1. However, PS exploits the fixed standard to screen out samples, thus its sample distribution cannot restore the real situation in experience pool, which inspires us to design the Double-Sampling mechanism.

IV-B Design of Double-Sampling Mechanism

The core idea of ALAP is to quantitatively describe the training progress by calculating the similarity of the sample distribution in experience pool through the attention module. Therefore, the sample distribution from data source of the attention module should be consistent with the sample distribution in experience pool. However, the priority based sampling way artificially changes sample’s visited frequency, so that the data source obtained by PS cannot reflect the real situation of data distribution in whole experience pool. Hence, we need to rely on random uniform sampling to provide data input for the attention module. In addition, the model network still needs PS to provide high-quality training samples. To sum up, we propose a parallel sampling method called Double-Sampling mechanism, which covers both PS and RUS. For ease of description, we define the data source acquired by PS as $s$ and the data source obtained by RUS as $s^{*}$ .

In the training process, the priority based sampling provides data samples for training model, and the uniform sampling is only responsible for providing data as input for attention module to fit $\beta$ , which can correct the estimation error in real time. The key point of the Double-Sampling mechanism is not to interfere with the normal operation of training, and to collect data in the same experience pool $D$ using both sampling procedures simultaneously. Since the two sampling procedures are done simultaneously, even though the sampled batchsize is much smaller than the volume of the experience pool $D$ , there is still a certain probability that the same sample will be selected for both sampling methods, resulting in instability of the algorithm. As a result, we propose a concept of mirror buffer $D^{*}$ and construct a mirror experience pool with the same data distribution as the original one. At each iteration step, the mirror buffer $D^{*}$ adds and updates experience samples at the same time as the original experience pool $D$ to keep the data synchronized.

It is worth noting that, although from the physical level, the two sampling steps are carried out in two different buffers, but from the data level, they are actually operated in the same experience pool. The schematic diagram of Double-Sampling mechanism is shown in Fig.2.

Algorithm 1 ALAP Algorithm

0: mini-batch

m

, step-size

\sigma

, replay period

K

, buffer sotrage

N

, exponent

\alpha

and

\beta

, budget

T

, tiny positive contant

\epsilon

1: Initialize replay buffer

D=\emptyset

, mirror replay buffer

D^{*}=\emptyset

p_{1}=1

\Delta=0

2: Observe environment state

S_{0}

and choose action

A_{0}

3: for t=1 to

T

4: Observe

S_{t}

r_{t}

\gamma_{t}

5: Store transition

(S_{t-1},A_{t-1},r_{t},\gamma_{t},S_{t})

D

6: Store transition

(S_{t-1},A_{t-1},r_{t},\gamma_{t},S_{t})

D^{*}

7: if

t>K

then

8: for t=1 to

K

9: Sample transition

j\sim P(j)=p_{j}^{\alpha}/\sum_{i}p_{i}^{\alpha}

10: Compute TD-error

\delta_{j}

and update transition priority

p_{j}\leftarrow\left|\delta(j)\right|+\epsilon

11: Obtain transitions

s

with mini-batch

m

from

D

by PS for model training

12: Obtain transitions

s^{*}

with mini-batch

m

from

D^{*}

by RUS

13: Compute

\beta

through Self-Attention network according to

s^{*}

14: Compute importance sampling weight

\bar{w}(i)=(N\cdot P(j))^{-\beta}/\max_{i}w(i)

15: Compute weight-change

\Delta\leftarrow\Delta+\bar{w}(j)\cdot\delta(j)\cdot\nabla_{\theta}Q(S_{t-1},A_{t-1})

16: end for

17: Update weights

\theta\leftarrow\theta+\sigma\cdot\Delta

, reset

\Delta=0

18: Copy parametes to target network

\theta_{target}\leftarrow\theta

19: end if

20: Choose action

A(t)\sim\pi_{\theta}(S_{t})

21: end for

Finally, in order to reduce the sensitivity of the algorithm to outliers, ALAP follows the Huber loss function applied in LAP and LA3P; unlike them, we remove the sample priority clipping part and use PER-sampled data to train the model throughout the whole process to ensure unbiased Q-value function estimation without sacrificing training speed. The overall operation flow of ALAP is shown in Algorithm 1.

V Experiment

In order to verify the effectiveness and generality, we integrate ALAP with DQN and DDPG, and test them in the environments cartpole-v0 [33] and simple [17] from OPAIgym, respectively. To further demonstrate the versatility of the algorithm, we combine ALAP with multi-agent algorithm MADDPG, and carry out several groups of comparative experiments in simple $\_$ tag [17], and the typical multi-agent confrontation environment of MPE. In each environment, we use the same network structure, reward shaping and hyperparameter configuration to train the algorithms for comparison. Additionally, we test the algorithms under different mini-batch sizes in each environment to compare their performance.

V-A Results with DQN

The environment cartpole-v0 is utilized to test all the algorithms. In the comparison experiments, all algorithms execute different sampling manners to draw mini-batch samples from the corresponding experience pool of volume $2\times 10^{4}$ for training the model, which is consturcted by 3-layer ReLU MLP with 24 units per layer. During the training, the agent chooses the action based on the $\epsilon-greedy$ strategy, the decay rate of $\epsilon$ is 0.0002, and a non-zero lower bound $\epsilon\geq$ 0.0001 is set to maintain the exploration ability of the agent. The total training episodes and the maximum simulation steps of each episode are both 200. The Adam optimizer is used to optimize the network parameters with a learning rate of 0.001 and a discount factor of 0.99. In cartpole-v0, the cartpole keeps the inverted pendulum upright through the left and right displacement, and any behavior of the agent, including the termination action, can be rewarded with a return value of 1. Since the maximum step per episode is 200, the maximum bonus the cart can get is 200 as well.

To avoid unintentional outcomes, we investigate the effectiveness of ALAP against other algorithms under batchsize=32, 64, 128, and we ran each case 20 times.

Fig.3 shows the average reward comparison curves between ALAP + DQN, LAP + DQN, PER + DQN and DQN. To compare the algorithm variance and stability, we set a 50 $\%$ confidence interval, and drew the confidence bands of each curve, which is also applied in other two environments. It is clearly that the ALAP algorithm outperforms others in terms of training speed and average reward. The ALAP algorithm achieves steady state much faster and has a greater reward value. In terms of stability, the ALAP algorithm has a narrower confidence band during the whole training phase, indicating that it can efficiently minimize variation and is more resilient to outliers. The LAP algorithm is slightly inferior to ALAP in terms of convergence speed and quality because LAP partially utilizes uniformly sampled data to train the model at the expense of training speed, but it still provides some suppression to outliers. PER does have an improvement in convergence speed and reward peaks compared to DQN, but it has a large variance in the training process.

V-B Results with DDPG

The scene simple is applied to evaluate the algorithms. Both original and mirror buffers are set with volume of $10^{6}$ to provide transitions for different sampling methods. The network model contains 2-layer ReLU MLP, with 64 neurons in each hidden layer. The total number of training episodes is 2000, the maximum simulation step is 25, the learning rate is 0.001, and the discount factor is 0.95. In this environment, there is just one agent and one obstacle, and the agent’s reward depends inversely on how far away from the obstruction it is. We still trained 20 times under each batchsize to collect the comparison results.

Fig.4 shows the average reward curves of ALAP + DDPG, LAP + DDPG, PER + DDPG, and the conventional DDPG. It demonstrates that ALAP has the best performance with respect to convergence speed and training variance in the face of different batchsize configurations. PER performs poorly under A-C framework, converging slower than baseline when the mini-batch is too large or too small, and PER consistently has the largest variance throughout the training process that shows the extremely instability. Compared with PER, the training variance of LAP is alleviated, but its convergence speed is not significantly improved. Overall, ALAP solves the problem of the degraded performance of PER when embedded in the A-C framework.

V-C Results with MADDPG

The classical environment for Multi-to-Multi pursuit evasion game simple $\_$ tag is used to test all the algorithms. In simple $\_$ tag, both the pursuers and the evaders empoly a zero-sum reward structure, where the reward of the pursuers is inversely proportional to their relative distance from the evaders, while the evaders are rewarded in the opposite way. To visually compare the algorithms’ strengths and weaknesses; we use MADDPG as a baseline to train evaders, while we train the pursuers with the improved algorithm simultaneously, and record their average reward. During training, we trace the model structure and hyperparameter configurations of simple, each agent follows the parameterized policy $\pi_{i}$ , and 20 comparative tests were completed under each batchsize.

Fig.5 shows the average reward curves of ALAP + MADDPG, LAP + MADDPG, PER + MADDPG, and MADDPG under different batchsize configurations. From Fig.5, we can see that under any mini-batch condition, the convergence speed and the average reward of ALAP are much higher than those of other algorithms. Although the stability of LAP decreases after mini-bach $\geq$ 64, its average reward and convergence speed are still higher than those of PER and MADDPG. The performance of PER is even worse than that of baseline after mini-batch $\geq$ 64, and the huge variance in the early stage of training is the reason that restricts the quality of its training.

VI CONCLUSIONS

In this paper, a general reinforcement learning algorithm framework called ALAP is proposed, which significantly reduces the estimation deviation of Q-value function caused by PER class algorithms. Firstly, the loss function is described in segments by Huber equation, which suppresses the sensitivity of the algorithm to outliers. Furthermore, the specific progress of training is quantified by calculating the similarity of samples in experience pool through the improved Self-Attention mechanism, so as to fit the accurate value of the hyperparameter $\beta$ to regulates the importance sampling weight. In addition, we also design a Double-Sampling mechanism based on mirror buffer, and use two sampling methods simultaneously to provide data sources for network model and attention module to ensure algorithm’s stable operation. The comparison results verify that ALAP can greatly improve the speed and stability of training.

References

[1] D. Silver, A. Huang and C. J. Maddison, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[2] Z. Xiao, X. Li and L. Wang, “Using convolution control block for chinese sentiment analysis,” Journal of Parallel and Distributed Computing, vol. 116, pp. 18–26, 2018.
[3] T. Fan, P. Long, W, Liu, and J. Pan, “Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios,” The International Journal of Robotics Research, vol. 39, no. 7, pp. 856–892, 2020.
[4] L. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,“ Machine Learning, vol. 8, no. 3, pp. 293–321, 1992.
[5] Z. Bing, “Solving robotic manipulation with sparse reward reinforcement learning via graph-based diversity and proximity,” IEEE Transactions on Industrial Electronics, vol. 70, no. 3, pp. 2759–2769, 2023.
[6] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in International Conference on Learning Representations, 2016.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv:1312.5602, 2013.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[9] J. Sharma, P. A. Andersen, O. C. Granmo and M. Goodwin, “Deep q-learning with Q-matrix transfer learning for novel fire evacuation environment,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 12, pp. 7363–7381, 2021.
[10] G. Tesauro, “Extending Q-learning to general adaptive multi-agent systems,” in Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 871–878, 2004.
[11] R. Yang, D. Wang, and J. Qiao, “Policy gradient adaptive critic design with dynamic prioritized experience replay for wastewater treatment process control,” IEEE Transactions on Industrial Informatics, vol. 18, no. 5, pp. 3150–3158, 2022.
[12] F. Sovrano, A. Raymond, and A. Prorok, “Explanation-aware experience replay in rule-dense environments,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 898–905, 2022.
[13] G. Kaan, G. Hakan, “Generalized huber loss for robust learning and its dfficient minimization for a robust statistics,” arXiv:2108.12627, 2021.
[14] S. Fujimoto, D. Meger, and D. Precup “An equivalence between loss functions and non-uniform sampling in experience replay,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, vol. 33, pp. 14219–14230, 2020.
[15] B. Saglam, F. B. Mutlu, D. C. Cicek, and S. S. Kozat, “Actor prioritized experience replay,” arXiv:2209.00532, 2022.
[16] P. Timothy, J. Jonathan, “Continuous control with deep reinforcement learning,” arXiv:1509.02971, 2015.
[17] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proceedings of the 31th International Conference on Neural Information Processing Systems, pp. 6379–6390, 2017.
[18] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba“Openai gym,” arXiv:1606.01540, 2016.
[19] A. W. Moore and C. G. Atkeson, “ Prioritized sweeping: Reinforcement learning with less data and less time,” Machine learning, vol. 13, no. 1, pp. 103–130, 1993.
[20] A. David, F. Nir, and P. Ronald, “Generalized prioritized sweeping,” in Proceedings of the 10th International Conference on Neural Information Processing Systems, pp. 1001–1007, 1997.
[21] H. V. Seijen, R. S, Sutton, “ Planning by prioritized sweeping with small backups,” arXiv:1301.2343, 2013.
[22] X. Tao and A. S. Hafid, “DeepSensing: A Novel Mobile Crowdsensing Framework With Double Deep Q-Network and Prioritized Experience Replay,” IEEE Internet of Things Journal, vol. 7, no. 12, pp. 11547–11558, 2020.
[23] Y. Hou, L. Liu, Q. Wei, X. Xu and C. Chen, “A novel DDPG method with prioritized experience replay,” in IEEE International Conference on Systems, Man, and Cybernetics, pp. 316–321, 2017.
[24] J. Lu, Y. B. Zhao, Y. Kang, Y. Wang and Y. Deng, “Strategy generation based on DDPG with prioritized experience replay for UCAV,” in International Conference on Advanced Robotics and Mechatronics, pp. 157–162, 2022.
[25] M. Hessel, J. Modayil, V. H. Hasselt, and T. Schaul,“Rainbow: Combining improvements in deep reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, pp. 3215–3222, 2018.
[26] R. Liu, J. Zou, “The effects of memory replay in reinforcement learning,” in Proceedings of the Annual Allerton Conference on Communication, Control, and Computing, pp. 478–485, 2018.
[27] H. Vanseijen, R. S. Sutton, “A deeper look at planning as learning from replay,” in Proceedings of the International Conference on Machine Learning, pp. 2314–2322, 2015.
[28] K. H. Shen and P. Y. Tsai, “Memory reduction through experience classification for deep reinforcement learning with prioritized experience replay,” in IEEE International Workshop on Signal Processing Systems, pp. 166–171, 2019.
[29] J. Gao, X. Li, W. Liu and J. Zhao, “Prioritized experience replay method based on experience reward,” in International Conference on Machine Learning and Intelligent Systems Engineering, pp. 214–219, 2021.
[30] C. Kang, C. Rong, W. Ren, F. Huo, and P. Liu, “Deep deterministic policy gradient based on double network prioritized experience replay,” IEEE Access, vol. 9, pp. 60296–60308, 2021.
[31] S. Iqbal, F. Sha, “Actor-attention-critic for Multi-agent reinforcement learning,” in Proceedings of the 36th International Conference on Machine Learning Research, pp. 2961–2970, 2019.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv:1706.03762, 2017.
[33] K. Kanno, and A. Uchida, “Photonic reinforcement learning based on optoelectronic reservoir computing,” arXiv:2202.12896, 2022.