
Learning user-defined sub-goals using memory editing in reinforcement learning

GyeongTaek Lee ([email protected]), Department of Industrial and Systems Engineering, Rutgers University, The State University of New Jersey, 96 Frelinghuysen Road, Piscataway, NJ 08854, USA
Abstract

The aim of reinforcement learning (RL) is to enable the agent to achieve the final goal. Most RL studies have focused on improving the efficiency of learning so that the final goal is reached faster. However, it is very difficult to modify the intermediate route that a trained RL model takes on the way to the final goal. That is, in existing studies, the agent cannot be controlled to achieve other sub-goals. If the agent could pass through user-specified sub-goals on the way to its destination, RL could be applied and studied in a wider range of fields. In this study, I propose a methodology that achieves user-defined sub-goals as well as the final goal using memory editing. Memory editing is performed to generate various sub-goals and to give an additional reward to the agent. In addition, the sub-goals are learned separately from the final goal. I set up two simple environments and various scenarios in the test environments. As a result, the agent successfully passed the sub-goals as well as the final goal under control in most cases. Moreover, the agent could be indirectly induced to visit novel states in the environments. I expect that this methodology can be used in fields that need to control the agent in a variety of scenarios.

keywords:
Reinforcement learning, learning the sub-goals, memory editing, exploitation

The goal of reinforcement learning (RL) is to maximize the reward that an agent receives in a specific environment. For example, RL in robotics allows a robot arm to perform a desired task, such as picking up an object sinha2022s4rl . In a game environment such as Atari, RL allows the agent to obtain the maximum game score mnih2013playing . In board games such as Go, AlphaGo silver2017mastering aims to win the game. In a path planning problem bae2019multi , the agent learns to move the shortest distance from the starting point to the target point. In this study, in addition to learning the final goal, I propose a methodology to learn sub-goals that are defined by the user using memory editing. Let us assume that we want to reach a destination. The most important thing is to move the shortest distance from the starting point to the target point. However, we usually encounter situations such as traffic jams or the vagaries of the weather, so we must consider a number of variables that can actually arise. In reality, we can change an intermediate stop or re-plan our route from scratch depending on the situation. Along these lines, I focus on making the agent reach the destination along a user-defined path by editing the agent’s memory. That is, the main purpose of this study is to make the agent controllable while it executes its policy. While most RL studies have paid attention to maximizing a reward to attain the desired policy, I propose a methodology that can control the agent of a trained RL model by making the agent learn various sub-goals. As a result, the agent can achieve various sub-goals as well as the final goal in the test environment.
In various fields, to train the agent well in RL, it is indispensable to give enough rewards to the agent. In addition, it is important for the agent to explore novel states and to exploit previous experience. These are the main challenges of RL, and various studies have been proposed to address them. Regarding exploration, many researchers have developed exploration bonus methods ostrovski2017count ; bellemare2016unifying ; fox2018dora ; machado2018count ; silvia2012curiosity ; pathak2017curiosity ; burda2018exploration . Regarding exploitation, several studies have proposed using a replay memory so that the agent stores transitions and uses them efficiently mnih2013playing ; mnih2015human . Variants of the replay memory were also introduced to improve exploitation performance schaul2015prioritized ; andrychowicz2017hindsight ; nguyen2019hindsight . Actor-critic algorithms that utilize the replay memory were also proposed wang2016sample ; gruslys2017reactor ; oh2018self . In addition, some studies have recently been introduced to balance exploitation and exploration kim2020take ; luger2018dynamic ; kang2020balancing ; wilson2021balancing .
Meanwhile, several studies on the agent’s multi-goals have been proposed veeriah2018many ; bai2019guided ; lee2020weakly ; pitis2020maximum ; okudo2021subgoal ; kim2021landmark . These studies generated multi-goals or sub-goals to enhance the efficiency of learning toward the final goal. Hindsight experience replay (HER) andrychowicz2017hindsight ; nguyen2019hindsight improves exploration performance by utilizing a pseudo reward for unsuccessful trajectories. It uses a concatenation of the sub-goal and the state, similar to the present study, to improve the performance of RL. The existing RL methods for learning sub-goals have focused on improving exploration performance and achieving one single final goal. However, I focus not only on the achievement of the final goal but also on the achievement of user-defined sub-goals. The sub-goals are trained simultaneously with the final goal, and the sub-goals are customized by the users on their own. That is, an agent that has completed learning can be controlled by the users. To do this, I introduce learning the sub-goals using memory editing.
Memory distortion is common in our daily lives schacter2011memory . In essence, all memories are slightly distorted due to various factors such as sleep, retrieval conditions, and conceptual issues schacter2011memory ; bernstein2009tell ; fernandez2015benefits . When we recall the past, we reconstruct it based on what we know and what we have experienced. That is, we often make decisions based on distorted memory. Such distortion can cause mental illnesses such as trauma, but it can also preserve the information that we perceive and have experienced fernandez2015benefits . If we could artificially edit our memories, we would be able to move away from mental illnesses and concretely learn clearer and more essential information phelps2019memory . Also, if we could divide and edit a long memory and recall the divided memories intensively, we could obtain vivid and clear memories. Likewise, if we edit the memory of the agent, we can make the agent perceive the sub-state and learn various sub-goals.
In this study, using the concept of memory editing, I propose a methodology in which the agent learns the sub-goals. Through memory editing, the agent can learn and recognize the sub-goals and can be controlled by the users in the test environment. To this end, I generate various sub-goals in the memory editing and separate learning the original goal from learning the sub-goals by maintaining two replay memories: one for learning the original goal and the other for learning the sub-goals. As the episodes progress, the transitions of the agent are edited with a specific probability and stored in the second memory. When the experiences of the agent are randomly sampled from the second memory, the agent receives the state of the sub-goals as a state vector, and additional rewards are given on top of the original rewards. The transitions sampled from the second memory differ from those collected in the episodes because they have been edited for learning the sub-goals. The intention is to make the agent perceive real experiences and reminiscences differently while it is trained. In other words, when the agent recalls the states and the rewards are given, a sign of the sub-goals is given as a state and the sub-goals are learned with the rewards attached to that sign. The proposed methodology allows the agent to achieve not only the final goal but also the user-defined sub-goals. The agent can perform the learned policy under the user’s control. I applied the proposed methodology to simple environments and confirmed that the agent could reach the final goal while passing the sub-goals that are customized by the user. The main contributions of this article are as follows:

  • The proposed methodology is very simple and easy to implement. We utilize the replay memory widely used for the exploitation and learn the final goal and the sub-goals separately.

  • We propose a methodology to indirectly improve the exploration performance using the memory editing. Once the agent recognizes the states of the sub-goals during learning, giving those states to the agent within an episode induces it to explore novel states.

  • To the best of our knowledge, this study is the first RL methodology in which the agent is controllable so that it achieves the user-defined sub-goals as well as the original goal. By using this methodology, users could define many sub-goals in a real environment in addition to achieving the final goal.

1 Results

1.1 Environment

I conducted two simple experiments to see whether the agent can achieve both the sub-goals and the final goal. In these experiments, I assumed various scenarios and wanted to confirm that the agent can be controlled by the user and that the agent behaves differently with and without the given sub-goals. In the first experiment, I constructed a simple two-dimensional (2D) environment as shown in Figure 1.a. The final goal is for the agent to move from the starting point to the target point. I represented the coordinates of the agent as a state using an effective coordinate vector lee2020autonomous . The action of the agent was set as a simple movement: left, right, up, and down. The reward was set to zero except when the agent reaches the target point or moves out of the grid environment. I constructed the environment as a sparse-reward environment to show that the proposed methodology can improve the exploration performance. After training, I set up various scenarios. Figure 1.b shows examples of the test environments. Three sub-goals were imposed on the agent, and the agent should pass the sub-goals and reach the target point.
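For concreteness, the following is a minimal sketch of a sparse-reward grid world of the kind described above. The grid size, coordinate encoding, step limit, and the exact reward values for reaching the target or leaving the grid are assumptions for illustration, not the settings used in the experiment.

```python
import numpy as np

class GridWorld:
    """Minimal sparse-reward 2D grid environment (sketch; sizes and rewards are assumed)."""

    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # left, right, up, down

    def __init__(self, width=20, height=20, start=(0, 0), target=(19, 19), max_steps=500):
        self.width, self.height = width, height
        self.start, self.target = start, target
        self.max_steps = max_steps

    def reset(self):
        self.pos = self.start
        self.t = 0
        return np.array(self.pos, dtype=np.float32)  # state = agent coordinates

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        self.t += 1
        if not (0 <= x < self.width and 0 <= y < self.height):
            # Moving out of the grid ends the episode (assumed penalty of -1).
            return np.array(self.pos, dtype=np.float32), -1.0, True, {}
        self.pos = (x, y)
        if self.pos == self.target:
            # Reaching the target is the only positive reward (assumed value of +1).
            return np.array(self.pos, dtype=np.float32), 1.0, True, {}
        done = self.t >= self.max_steps
        return np.array(self.pos, dtype=np.float32), 0.0, done, {}
```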
In the second environment, I constructed a ‘key-door domain’ as a hard sparse-reward environment, as shown in Figure 1.c. The environment consists of a total of 4 stages. In each stage, even if the agent goes to the goal (door), it cannot move to the next stage unless it has passed the bonus point (key). To clear the current stage, the agent must pass the bonus point first. Furthermore, each stage has different positions of the walls, the starting point, the bonus point, the penalty point, and the goal point. Thus, it is very difficult for the agent to clear all stages. The reward was set to zero except when the agent reaches the bonus point (+10), the penalty point (-10), or the target point (+20).

Figure 1: Experimental environments. a, Simple 2D grid environment. In this environment, the goal of the agent is to reach the target point. b, Examples of scenarios in the 2D grid environment. Three sub-goals were given to the agent, and the goal of the agent was to pass the sub-goals and reach the target point. c, Key-door domain environment. The environment consists of a total of 4 stages, and the agent must pass the bonus point to clear each stage.

1.2 Experimental settings

In the proposed methodology, the memory editing and the exploitation are the most important parts for the agent to learn the sub-goals. Thus, it is essential to use a replay memory for the exploitation of the sub-goals. Self-imitation learning (SIL) only exploits past decisions that are valuable compared to the current value estimate oh2018self . This exploitation technique can help the exploitation needed to learn the sub-goals. Therefore, I utilized SIL as the base architecture of the RL model. Further, I used random network distillation (RND), which is widely used as an exploration bonus method burda2018exploration .
The probability of performing the memory editing and learning the sub-goals is adjusted using an ε-greedy-style schedule. I set the probability of performing the memory editing to 0.1. In addition, the memory editing is performed when the total reward of the current episode is in the top 1% of the last 1,000 episodes. The interval of the sub-goals was sampled between 5 and 100 for the first experiment and between 5 and 30 for the second experiment. The exploitation of the sub-goals is rarely performed in the early stage of learning: the probability of learning the sub-goals was set to 0.001 in the initial episode and was gradually increased to 0.5 in the final episode.
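A rough sketch of these settings is given below. The linear form of the annealing and the helper names (`p_learn_subgoals`, `sample_interval`) are my assumptions; only the numerical endpoints and ranges follow the text.

```python
import random

P_EDIT = 0.1                          # probability of editing a finished episode's memory
P_SUB_START, P_SUB_END = 0.001, 0.5   # probability of a sub-goal learning step (initial, final)
TOTAL_EPISODES = 50_000

def p_learn_subgoals(episode, total=TOTAL_EPISODES, start=P_SUB_START, end=P_SUB_END):
    """Probability of performing sub-goal exploitation at this episode.
    Linear annealing is an assumption; the text gives only the two endpoints."""
    frac = min(episode / total, 1.0)
    return start + frac * (end - start)

def sample_interval(experiment=1):
    """Sub-goal interval: sampled between 5 and 100 (first experiment)
    or between 5 and 30 (second experiment)."""
    return random.randint(5, 100) if experiment == 1 else random.randint(5, 30)
```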
I trained the agent for 50,000 episodes in each of the two experiments and repeated each experiment 10 times. Also, I assigned the coordinates of the location to be reached by the agent as the state of the sub-goals.

1.3 Experimental results

Figure 2.a plots the probability of reaching the target point with and without training the sub-goals in the first experiment. The result shows that the agent reached the target point within 5,000 episodes regardless of whether the sub-goals were trained. Figure 2.b visualizes the routes of the last 10,000 episodes with training the sub-goals (2) and without training the sub-goals (1). The color change from blue to red indicates the frequency of the agent’s visits. The agent trained with the sub-goals visited wider areas compared to the agent trained without them. Once the agent started to perceive the sub-goals during learning, being given a random sign directing it to another location made the agent visit novel states. In addition, without training the sub-goals, the agent moved along paths that varied little as learning progressed, as shown in Figure 2.b.(1). On the other hand, with training the sub-goals, the agent used a variety of routes, as shown in Figure 2.b.(2). We can see that learning the sub-goals indirectly drives the agent to explore novel states further.
Figure 2.c visualizes the route of the agent when it is given the sign of the sub-goals. The green dots are the sub-goal points. The agent receives the sign of the sub-goal nearest to its current location. If the agent reaches or passes a given sub-goal, it receives the sign of the next-nearest sub-goal, excluding the previously assigned sub-goals. I assumed various scenarios and set the sub-goal points between the starting point and the target point differently for each test. As a result, I confirmed that the agent moved to the customized sub-goal points successfully in most cases. It is very interesting that, after learning, the agent can pass user-set intermediate points and can be controlled by the users through the sub-goals.

Figure 2: The result of learning the sub-goals. a, The probability of reaching the target point with and without learning the sub-goals. b, The visualization of the path of the agent with (2) and without (1) learning the sub-goals. By learning the sub-goals, the agent used various paths to reach the target point in the last episode interval (2). c, The visualization of the path of the agent with the given sub-goals. I set various sub-goals to confirm the performance of the agent. The agent passed the sub-goals and reached the target point in most cases.

Figure 3.a shows cases where the agent fails to pass all of the sub-goals. These sub-goals were difficult for the agent to pass because, during learning, the agent’s task was to reach the goal point, and it was never trained to zig-zag or take long detours toward sub-goals. However, although the agent could not pass the first sub-goal, it tried to move to the next sub-goal and finally reached the target point. This result means that if the agent cannot achieve a sub-goal, it will try to achieve the next sub-goals and the final goal. This behavior can be helpful in a variety of test environments, as we can assume various scenarios using the sub-goals and the final goal. Based on such experimental scenarios and their results, we can decide the optimal policy of the agent in situations where there are a number of variables and the agent should be under control.
Figures 3.b and 3.c show a case that failed to learn the final goal. In this run, the agent never reached the target point during training (c). However, as shown in Figure 3.b, the agent nearly passed the sub-goals and attempted to reach the target point in the test environment. Of course, the agent could not reach the goal point as cleanly as in the previous experiment. The agent reached the closest sub-goal and then tried the next sub-goal step by step. It is a very impressive result that the agent was able to reach the destination through the sub-goals despite failing to learn the final goal in training. The implications of these results are significant. Although I used the sub-goals to lead the agent to the target point in the test environment, I never used this technique in the training environment. If we can utilize this ability to reach the agent’s sub-goals, we can more easily train the agent to achieve the final goal, as in previous studies andrychowicz2017hindsight ; nguyen2019hindsight ; pitis2020maximum ; okudo2021subgoal ; kim2021landmark .
Figure 3.d shows the probabilities of the actions at the starting point according to the state of the sub-goals. In the top panel, the further the sub-goal was set to the left of the starting point, the higher the probability of the ’Left’ action was. Likewise, in the bottom panel, the further the sub-goal was set above the starting point, the higher the probability of the ’Up’ action was. These patterns were usually found in the area visited by the agent. This means that if the agent is given a specific location as the sign of the sub-goals, it will try to reach that sub-goal. That is, by using several sub-goals, the users can control the agent in various scenarios.

Figure 3: a, Failure cases in which the agent did not pass all of the sub-goals. The sub-goals were set in a way that was difficult for the agent to pass. However, the agent passed the remaining sub-goals and reached the target point. b, The visualization of the paths by which the agent went through the sub-goals despite failing to learn the original goal. Although the agent was not able to reach the target point during learning, it reached the target point through the sub-goals. c, The visualization of the path of the agent that failed to learn the original goal. d, The probability plot of the actions when the sub-goals were given.

In the second experiment, baseline algorithms such as prioritized experience replay, sample-efficient actor-critic with experience replay, and SIL without RND never passed Stage 2 in any of the 10 runs schaul2015prioritized ; wang2016sample ; oh2018self . With SIL and RND (SIL + RND) but without training the sub-goals, the agent passed all stages in 1 out of 10 runs. In contrast, with training the sub-goals, the agent passed all stages in 5 out of 10 runs. Figure 4 confirms this result. Until the middle of learning, the agent without learning the sub-goals reached Stage 4 faster. However, as the episodes progressed, the agent learning the sub-goals started to clear the final stage. This result suggests that training the sub-goals indirectly affects the exploration of the agent so that it can clear the difficult stages. After learning was completed, the agent reached the target point through the bonus point along almost the shortest path for all stages without the sign of the sub-goals, as shown in Figure 5.a. In the test environment, I assumed the two scenarios shown in Figure 5.b and Figure 5.c. Two sub-goals were set differently at each stage, the bonus point was set as a third sub-goal, and the sub-goals were given to the agent as the state of the sub-goals, similar to the previous experiment. As a result, the agent passed most of the sub-goals and attempted to reach the target point. However, in this experiment, the agent could not move along the shortest path, whereas the agent that did not learn the sub-goals followed almost the shortest path. In addition, the agent occasionally passed the penalty point, as shown in Figure 5.c (Stage 3). I observed that this phenomenon occurred sometimes when the agent learned the sub-goals. This environment is a relatively small world compared to the first experiment, so it is hard for the agent to change direction within a short span. Furthermore, the agent was trained to pass the bonus point to clear each stage, but when the sub-goals were given, it was likely to be confused about whether to go to the sub-goal, the bonus point, or the target point within a short period. Future research is needed to make the agent follow the given sub-goals more strictly.

Figure 4: Average stage reached over 10 experimental runs with and without learning the sub-goals. In the early episodes, the original RL showed better performance. However, as the episodes progressed, the agent that learned the sub-goals cleared more stages.
Figure 5: a, The visualization of the agent in the second environment without learning the sub-goals. The agent cleared all stages along almost the shortest path. b, The visualization of the agent with learning the sub-goals in the first scenario in the test environment. c, The visualization of the agent with learning the sub-goals in the second scenario. The agent passed the sub-goals, the bonus point, and the target point for all stages. However, the agent sometimes ignored the sub-goals or passed the penalty point.

2 Discussion

In this article, I proposed a methodology for training sub-goals using memory editing so that the agent can reach the sub-goals as well as the final goal under control. The results of the two experiments were very encouraging and can be useful in real test environments. To learn the sub-goals, I used the memory editing and a separate exploitation mechanism, distinguishing learning the sub-goals from learning the final goal. The transitions collected from an episode are transformed into sub-goal transitions in the memory editing, and the intervals between the sub-goals and the additional rewards are assigned. At the beginning of learning, the final goal is learned first. As learning progresses, learning the sub-goals is performed gradually. Learning the sub-goals is separated from learning the final goal by using a separate replay memory and value network. As a result, the agent not only learned the final goal but also slowly started to learn the sub-goals. I conducted experiments on various scenarios for the sub-goals. The experimental results show that the agent could successfully reach various sub-goals customized by the user as well as the single final goal. Even when the agent failed to reach the final goal during learning, in the test environment the agent could arrive at the final destination through the sub-goals. This result is significant and can contribute to various domains and studies. In particular, this methodology will be helpful in fields that need to control the agent in various scenarios or situations.
However, this study has several limitations to be solved. First, enough episodes are needed to learn the sub-goals, because learning the sub-goals should be performed after the final policy has been learned sufficiently. Second, it is hard to balance learning the final goal and learning the sub-goals so that the agent is controlled completely. If the exploitation of the sub-goals is performed too often, the agent can fall into a local policy. Meanwhile, if the exploitation of the sub-goals is rarely performed, the agent cannot recognize the state of the sub-goals. Finally, the agent can reach almost only the sub-goals that it had previously visited. This is a natural consequence of the method, but it still needs to be solved for the agent to be fully controlled: learning the sub-goals is performed by exploiting the transitions collected from previous episodes, so the agent was able to move almost only to sub-goals it had already visited. However, in a real environment, unexpected situations can occur at any time, and future research is needed on a robust RL model that can learn unexpected sub-goals. I expect that this methodology will be developed further and applied to various problems and domains that need to control the agent.

3 Method

This study was motivated by the following question: how can we train the agent to perform various tasks under control as well as to achieve the original goal? In general, the agent is trained to achieve the user-defined goal by maximizing the total sum of rewards, or to maximize the rewards given by a specific environment such as a game. In the test environment, the agent that has completed learning merely acts to obtain the rewards based on its policy network. When we drive a car to reach a destination, we should consider a number of situations, such as ‘is there a pedestrian on the street?’, ‘where is the crosswalk?’, and ‘what direction does that sign indicate?’. In a complex driving environment, we generally receive directions for sub-goals and are aided by many operations such as ’deceleration’, ’stop’, and ’rotation’ via a navigator. Here, we can select and perform many sub-goals on our own to achieve the goal. My motivation is that humans can manipulate and perform sub-goals in order to achieve the final goal. Therefore, I propose a methodology to perform the user-defined sub-goals by using the concept of memory editing.

3.1 The memory editing of the agent

HER andrychowicz2017hindsight used sub-goals for the agent to achieve the goal faster. However, the purpose of the sub-goals and the training process of HER are different from the present study. HER focuses on training agents more efficiently and faster towards their goals, and it concatenates the sub-goal and the state throughout the whole training procedure. Meanwhile, I focus on training the sub-goals to allow the agent to perform the sub-goals and the final goal under control using memory editing, and the sub-goals can be customized by the users. In addition, I separate training the sub-goals from training the final goal by using two replay memories, because the transitions for the sub-goals could otherwise be confused with the transitions for the original goal after performing the memory editing, and because it is difficult to adjust to what extent each kind of learning is performed. If the sub-goals are not trained properly, the agent is more likely to fall into a local policy.
To train the sub-goals, training the final goal should be carried out first to some extent. As the episodes progress, training the sub-goals should be performed gradually. When an episode ends, the states in the episode are transformed into the states of the sub-goals with a specific probability and stored in the second memory. This process is called memory editing. At this point, it is very important how often and under what conditions the memory is edited. As mentioned previously, if the sub-goals are trained too often, the agent can fall into a local policy. Thus, it is reasonable to increase the frequency of the memory editing slowly as the episodes progress. Moreover, we need to select valuable sub-goals in the memory editing. It is preferable to choose the sub-goals within the subset of the trajectories from the initial state to the goal, but this is not necessarily a sufficient condition. I propose to generate the sub-goals when the total reward of the episode is higher than in previous episodes, or to generate them randomly with a low probability.
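The following sketch illustrates this editing condition. The bookkeeping of recent returns and the function name `should_edit` are assumptions; the probability of 0.1 and the top-1% cutoff come from the experimental settings reported above.

```python
import random
from collections import deque

def should_edit(episode_return, recent_returns, p_random=0.1, top_fraction=0.01):
    """Decide whether the finished episode is turned into sub-goal transitions.

    Editing happens either at random with a small probability, or when the
    episode return is among the top `top_fraction` of recent episodes.
    """
    if random.random() < p_random:
        return True
    if not recent_returns:
        return False
    ranked = sorted(recent_returns)
    cutoff = ranked[max(0, int((1.0 - top_fraction) * len(ranked)) - 1)]
    return episode_return >= cutoff

# A bounded history covering the last 1,000 episodes, for example:
recent_returns = deque(maxlen=1000)
```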

3.2 The exploitation of the sub-goals

After the episode stage, the states of the sub-goals and the additional rewards are given to the agent through the memory editing. The edited transitions should then be stored in the second memory and trained using the replay memory. However, there are two problems to solve. First, the state of a sub-goal has different characteristics from the original state. The value network is used to evaluate the original states, and the result of this evaluation determines the probability with which transitions are sampled. Because the state of a sub-goal is fundamentally different from the original state, it is not reasonable to use the same value network. Thus, I utilize an additional value network to evaluate the value of the states of the sub-goals. The value network for the sub-goals is only used to evaluate and decide whether the state of the sub-goals is exploited or not. In this study, this network will be referred to as the value network for the sub-goals ($VN_s$), and the original value network as $VN$. Second, as previously mentioned, using one replay memory can decrease the efficiency of learning. The first aim of the RL model is to learn to achieve the final goal. Also, the capacity of the memory is limited, and the states of the sub-goals need to be re-evaluated by $VN_s$. Therefore, after the memory editing, the transitions transformed for the sub-goals should be stored separately in a replay memory and trained gradually. Thus, I employ another replay memory to store the edited memories and learn the sub-goals. It is important in this study to manage and supervise the edited memories individually for the sub-goals. This memory will be referred to as the replay memory for the sub-goals ($RM_s$), and the original replay memory for the final goal as $RM$. By utilizing $RM_s$ for the exploitation, learning can become efficient.
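A minimal sketch of how $RM_s$ and $VN_s$ might interact is shown below. The SIL-style priority $\max(R - V(s\|g), 0)$ and the class interface are assumptions based on the base architecture, not the paper's exact implementation.

```python
import numpy as np

class SubgoalReplay:
    """Sketch of the replay memory for the sub-goals (RM_s).

    Sampling probabilities are computed from a dedicated value network VN_s,
    kept separate from the value network VN used for the final goal.
    """

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []  # tuples of (state_with_goal, action, edited_return)

    def add(self, state_with_goal, action, edited_return):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # drop the oldest edited transition
        self.storage.append((state_with_goal, action, edited_return))

    def sample(self, batch_size, value_net_s):
        """Sample transitions with probability proportional to how far their
        edited return exceeds the sub-goal value estimate from VN_s."""
        states = np.stack([s for s, _, _ in self.storage])
        returns = np.array([r for _, _, r in self.storage], dtype=np.float32)
        priorities = np.maximum(returns - value_net_s(states), 0.0) + 1e-6
        probs = priorities / priorities.sum()
        idx = np.random.choice(len(self.storage), size=batch_size, p=probs)
        return [self.storage[i] for i in idx]
```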
Figure 6.a and Algorithm 1 show the process of the proposed methodology. In the episode stage, almost the entire simulation process is similar to the existing RL method. The agent acts based on the current policy network ($\theta_p$) and obtains the next state ($s_t$) and the reward ($r_t$) until the end of the episode. One crucial point that differs from the original RL framework is the insertion of a sign for the agent to explore novel states. The agent first learns the final goal and is gradually taught the sub-goals. As the agent starts to recognize the sub-goals in training, if the agent receives the sub-goal state $s_t \| g$ as a state, just like a sign in traffic, the agent can explore a state that it has not visited in previous episodes. The sub-goal is given to the agent with a small probability.
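The following is a sketch of this simulation stage. The probability value, the goal sampler, and the zero padding used when no sub-goal sign is given are illustrative assumptions, not values from the paper.

```python
import random
import numpy as np

def run_episode(env, policy, sample_goal_state, p_sign=0.01):
    """Simulation-stage sketch: with a small probability the current state is
    concatenated with a sub-goal sign before being fed to the policy."""
    transitions = []
    state = env.reset()
    done = False
    while not done:
        if random.random() < p_sign:
            goal = sample_goal_state()      # e.g. a coordinate the agent should visit
        else:
            goal = np.zeros_like(state)     # placeholder meaning no sub-goal is given
        action = policy(np.concatenate([state, goal]))   # policy sees s_t || g
        next_state, reward, done, _ = env.step(action)
        transitions.append((state, action, reward))
        state = next_state
    return transitions
```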
When an episode finishes, whether the memory editing is performed is decided with a specific probability. Then, if the sub-goals are to be generated in the memory editing, it is necessary to decide which sub-goals to choose in the episode. The sub-goals can be closer goals for the short term or distant goals for the long term. If we want the agent to achieve a distant goal, the steps of the episode need to be divided into several large chunks according to the interval of the sub-goals. Figure 6.b shows various sub-goals being generated in one episode. Depending on the interval, the sub-goal of a specific point in the episode can be a close branch (1) or the goal itself (3). Figure 6.c shows the process of generating the sub-goals in the memory editing. Let us assume that the number of steps in the episode is 500 and we want the agent to achieve the sub-goals over the long term. Then, we can generate the sub-goals every 100 steps in the episode, and the 100th state is given to the previous 99 states as the state of the sub-goal, as shown in Figure 6.c. The sub-goals are represented as states and concatenated with the original state, $s_t \| g_{int}$. If we want the agent to reach sub-goals near the current step, we can impose the sub-goals with short intervals such as 5, 10, or 20.
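A minimal sketch of this chunk-based relabeling is given below, assuming the episode is stored as lists of states, actions, and rewards. The size and placement of the additional reward are assumptions; the text only states that additional rewards are given.

```python
import numpy as np

def edit_episode(states, actions, rewards, interval=100, bonus=1.0):
    """Memory-editing sketch: split the episode into chunks of `interval` steps
    and give every state in a chunk the chunk's final state as its sub-goal."""
    edited = []
    for start in range(0, len(states), interval):
        end = min(start + interval, len(states))
        goal = states[end - 1]                                    # e.g. the 100th state of the chunk
        for t in range(start, end):
            state_with_goal = np.concatenate([states[t], goal])   # s_t || g_int
            extra = bonus if t == end - 1 else 0.0                # assumed reward for reaching the sub-goal
            edited.append((state_with_goal, actions[t], rewards[t] + extra))
    return edited
```

The edited transitions would then be added to $RM_s$ (for example, with the `SubgoalReplay.add` sketch above) rather than to $RM$.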
In the learning stage, the transitions are trained using $RM$ to achieve the final goal. Then, as the episodes progress, the sub-goal transitions are sampled according to probabilities calculated by $VN_s$ and are gradually trained using $RM_s$. It is difficult to decide how often the sub-goals are trained. At the least, the sub-goals should be learned after the final goal has been learned sufficiently. If the sub-goals are learned frequently in the earlier episodes, the agent can fall into a local policy. Therefore, the exploitation of the sub-goals should be performed when the final goal is almost achieved.
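The learning stage could then look roughly like the sketch below, reusing the `SubgoalReplay` sketch above. The update counts, batch size, and `policy_update` interface are placeholders, not the paper's actual implementation.

```python
import random

def learning_stage(policy_update, rm, rm_s, value_net_s,
                   n_goal_updates=4, n_subgoal_updates=4,
                   p_subgoal=0.1, batch_size=32):
    """Learning-stage sketch: the final goal is always trained from RM, while
    sub-goal updates from RM_s are gated by the annealed probability p_subgoal."""
    for _ in range(n_goal_updates):
        batch = rm.sample(batch_size)                     # (s, a, R) transitions
        policy_update(batch, goal_conditioned=False)      # optimize theta_p for the final goal
    if random.random() < p_subgoal and len(rm_s.storage) >= batch_size:
        for _ in range(n_subgoal_updates):
            batch = rm_s.sample(batch_size, value_net_s)  # (s||g, a, r') transitions
            policy_update(batch, goal_conditioned=True)   # optimize theta_p for the sub-goals
```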

Refer to caption
Figure 6: a, An overview of the proposed methodology. In the simulation stage, the transitions are collected, and when an episode is completed, the memory editing is performed. In the learning stage, learning the sub-goals and learning the original goal proceed separately. b, An example of the memory editing. In one episode, depending on the interval of the sub-goals, various sub-goals can be generated. c, An example of generating the sub-goals. When the interval of the sub-goals is 100, the 100th state is given to the previous 99 states.
1: Initialize policy network parameters $\theta_p$
2: Initialize replay buffer for the original goal $\mathcal{RM} \leftarrow \emptyset$
3: Initialize replay buffer for the sub-goals $\mathcal{RM}_s \leftarrow \emptyset$
4: procedure Learning the sub-goals and the final goal
5:     for episode = 1, M do
6:         \\ Simulation stage.
7:         for each step do
8:             Generate a random sub-goal $s_t \leftarrow s_t \| g$ with small probability ($\|$ denotes concatenation)
9:             Execute an action $a_t \sim \pi_{\theta_p}(a_t \mid s_t)$ and observe $r_t$, $s_{t+1}$
10:             Store the transition $\mathcal{E} \leftarrow \mathcal{E} \cup \{(s_t, a_t, r_t)\}$
11:         end for
12:         if $s_{t+1}$ is terminal then
13:             Compute returns $R_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k$ in $\mathcal{E}$
14:             $\mathcal{RM} \leftarrow \mathcal{RM} \cup \{(s_t, a_t, R_t)\}$
15:             Clear episode buffer $\mathcal{E} \leftarrow \emptyset$
16:         end if
17:         if $R_t >$ memory editing threshold then
18:             Sample an interval $int$ of steps in the episode
19:             Generate sub-goals $g$ and states of the sub-goals $s \| g$
20:             Set additional rewards $r'_t$
21:             $\mathcal{RM}_s \leftarrow \mathcal{RM}_s \cup \{(s_t \| g_{int}, a_t, r'_t)\}$
22:         end if
23:         \\ Learning stage.
24:         for k = 1, N do
25:             Sample a minibatch $\{(s, a, R)\}$ from $\mathcal{RM}$ $\triangleright$ optimize policy network $\theta_p$ for the final goal
26:         end for
27:         for k = 1, P do
28:             Sample a minibatch $\{(s \| g, a, r')\}$ from $\mathcal{RM}_s$ $\triangleright$ optimize policy network $\theta_p$ for the sub-goals
29:         end for
30:     end for
31: end procedure
Algorithm 1 Learning the sub-goals using memory editing

References

  • (1) Sinha, S., Mandlekar, A., Garg, A.: S4rl: Surprisingly simple self-supervision for offline reinforcement learning in robotics. In: Conference on Robot Learning, pp. 907–917 (2022). PMLR
  • (2) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
  • (3) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
  • (4) Bae, H., Kim, G., Kim, J., Qian, D., Lee, S.: Multi-robot path planning method using reinforcement learning. Applied Sciences 9(15), 3057 (2019)
  • (5) Ostrovski, G., Bellemare, M.G., Oord, A.v.d., Munos, R.: Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310 (2017)
  • (6) Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems, pp. 1471–1479 (2016)
  • (7) Fox, L., Choshen, L., Loewenstein, Y.: Dora the explorer: Directed outreaching reinforcement action-selection (2018)
  • (8) Machado, M.C., Bellemare, M.G., Bowling, M.: Count-based exploration with the successor representation. arXiv preprint arXiv:1807.11622 (2018)
  • (9) Silvia, P.J.: Curiosity and motivation. The Oxford handbook of human motivation, 157–166 (2012)
  • (10) Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: International Conference on Machine Learning (ICML), vol. 2017 (2017)
  • (11) Burda, Y., Edwards, H., Storkey, A., Klimov, O.: Exploration by random network distillation. arXiv preprint arXiv:1810.12894 (2018)
  • (12) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529 (2015)
  • (13) Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)
  • (14) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., Zaremba, W.: Hindsight experience replay. Advances in neural information processing systems 30 (2017)
  • (15) Nguyen, H., La, H.M., Deans, M.: Hindsight experience replay with experience ranking. In: 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 1–6 (2019). IEEE
  • (16) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., de Freitas, N.: Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224 (2016)
  • (17) Gruslys, A., Dabney, W., Azar, M.G., Piot, B., Bellemare, M., Munos, R.: The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. arXiv preprint arXiv:1704.04651 (2017)
  • (18) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. arXiv preprint arXiv:1806.05635 (2018)
  • (19) Kim, S., Mai, T.-D., Khanh, T.N.D., Han, S., Park, S., Singh, K., Cha, M.: Take a chance: Managing the exploitation-exploration dilemma in customs fraud detection via online active learning. arXiv preprint arXiv:2010.14282 (2020)
  • (20) Luger, J., Raisch, S., Schimmer, M.: Dynamic balancing of exploration and exploitation: The contingent benefits of ambidexterity. Organization Science 29(3), 449–470 (2018)
  • (21) Kang, C.-Y., Chen, M.-S.: Balancing exploration and exploitation in self-imitation learning. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 274–285 (2020). Springer
  • (22) Wilson, R.C., Bonawitz, E., Costa, V.D., Ebitz, R.B.: Balancing exploration and exploitation with information and randomization. Current opinion in behavioral sciences 38, 49–56 (2021)
  • (23) Veeriah, V., Oh, J., Singh, S.: Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605 (2018)
  • (24) Bai, C., Liu, P., Zhao, W., Tang, X.: Guided goal generation for hindsight multi-goal reinforcement learning. Neurocomputing 359, 353–367 (2019)
  • (25) Lee, L., Eysenbach, B., Salakhutdinov, R.R., Gu, S.S., Finn, C.: Weakly-supervised reinforcement learning for controllable behavior. Advances in Neural Information Processing Systems 33, 2661–2673 (2020)
  • (26) Pitis, S., Chan, H., Zhao, S., Stadie, B., Ba, J.: Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. In: International Conference on Machine Learning, pp. 7750–7761 (2020). PMLR
  • (27) Okudo, T., Yamada, S.: Subgoal-based reward shaping to improve efficiency in reinforcement learning. IEEE Access 9, 97557–97568 (2021)
  • (28) Kim, J., Seo, Y., Shin, J.: Landmark-guided subgoal generation in hierarchical reinforcement learning. Advances in Neural Information Processing Systems 34 (2021)
  • (29) Schacter, D.L., Guerin, S.A., Jacques, P.L.S.: Memory distortion: An adaptive perspective. Trends in cognitive sciences 15(10), 467–474 (2011)
  • (30) Bernstein, D.M., Loftus, E.F.: How to tell if a particular memory is true or false. Perspectives on Psychological Science 4(4), 370–374 (2009)
  • (31) Fernández, J.: What are the benefits of memory distortion? (2015)
  • (32) Phelps, E.A., Hofmann, S.G.: Memory editing from science fiction to clinical practice. Nature 572(7767), 43–50 (2019)
  • (33) Lee, G.T., Kim, C.O.: Autonomous control of combat unmanned aerial vehicles to evade surface-to-air missiles using deep reinforcement learning. IEEE Access 8, 226724–226736 (2020)