Identifying Critical States by the Action-Based Variance of Expected Return
Abstract
The balance of exploration and exploitation plays a crucial role in accelerating reinforcement learning (RL). To deploy an RL agent in human society, its explainability is also essential. However, basic RL approaches have difficulties in deciding when to choose exploitation as well as in extracting useful points for a brief explanation of the agent's operation. One reason for these difficulties is that such approaches treat all states the same way. Here, we show that identifying critical states and treating them specially is beneficial to both problems. These critical states are the states at which the action selection dramatically changes the potential for success and failure. We propose to identify the critical states using the variance of the Q-function over the actions and to perform exploitation with high probability at the identified states. These simple methods accelerate RL in a grid world with cliffs and in two baseline tasks of deep RL. Our results also demonstrate that the identified critical states are intuitively interpretable with respect to the crucial nature of the action selection. Furthermore, our analysis of the relationship between the timing of the identification of especially critical states and the rapid progress of learning suggests that a few especially critical states carry important information for accelerating RL rapidly.
Keywords: Critical State · Reinforcement Learning · Exploration · Explainability

1 Introduction
Balancing exploration and exploitation is essential for accelerating reinforcement learning (RL) [22]. The explainability of an RL agent is also necessary when considering its application to human society. In both of these problems, identifying a few critical states at which action selection dramatically changes the potential of success or failure would be beneficial. These critical states are states that should be exploited and would be useful points for a brief explanation. Previous work demonstrated that critical states are both useful for designing humanoid motion and related to human recognition [11].
However, basic RL approaches do not try to detect specific states and treat all states in the same way. The basic approach to balancing exploration and exploitation is to add a stochastic perturbation to the policy in every state in the same manner, using $\epsilon$-greedy or a stochastic policy when considering application to continuous state spaces. Some methods give additional rewards to rarely experienced states to bias learning towards exploration [1, 15], but they also add a stochastic perturbation in every state in the same manner. This approach has difficulty exploring environments in which some state-action pairs dramatically change future success and failure, because it rarely chooses the specific actions required to transit to the successful region of the state space. For example, an environment might include a narrow path between cliffs through which the agent has to pass to reach a goal, or a state where the agent has to take a specific action to avoid falling into a region that is inescapable due to torque limits.
Critical states contain important information that helps humans understand success and failure [11]. To the best of our knowledge, no work has tried to detect such critical states autonomously. Some studies have translated the entire policy by decomposing it into simple interpretable policies [20, 12]. However, translated policies are expressed as complex combinations of simple policies and are difficult to understand. Therefore, instead of explaining everything, extracting a few critical states that have essential information regarding success and failure is useful for creating an easily understandable explanation.
Bottleneck detection is related to critical state detection in that it also captures special states. However, to the best of our knowledge, existing bottlenecks are not defined in terms of action selection [13, 16, 5], whereas critical states should be defined by action selection in order to detect the states where specific actions are essential for success.
In this paper, we propose a method to identify critical states by the variance of the Q-function over the actions, together with an exploration method that chooses exploitation with high probability at the identified critical states. We show that this simple approach accelerates RL in a toy problem and in two baseline tasks of deep RL. We also demonstrate that the identification method extracts states that are intuitively interpretable as critical states and can be useful for briefly explaining where the critical points for achieving a task are. Furthermore, an analysis of the relationship between the timing of the identification of especially critical states and the rapid progress of learning suggests that a few especially critical states carry important information that accelerates RL rapidly.
2 Identification of critical states
2.1 Related work: bottleneck detection
With respect to the identification of special states, critical state detection is related to the notion of bottleneck detection, which is used to locate subgoals in hierarchical RL. Many studies defined bottleneck states as, for example, states visited frequently when connecting different state spaces [17, 19], the border states of graph-theory-based clustering [18, 9], states visited frequently in successful trajectories but not in unsuccessful ones [13], or states at which the transition is easy to predict [8]. However, these methods can detect states that are not critical for the success of the task; for instance, they might detect states from which the agent can return to a successful region of the state space by retracing its path even after taking an incorrect action. Another method defined bottlenecks based on the difference in the optimal action at the same state across a set of tasks [5]. Unfortunately, this method depends strongly on the given set of tasks, cannot be applied to a single task, and might classify almost all states as bottlenecks if the given tasks have few optimal actions in common.
2.2 Formulation of a critical state
Success and failure in RL are defined by the return. The critical states at which the action selection dramatically changes the potential for future success and failure correspond to the states at which the variance of the return is dramatically decreased by the action selection. We therefore define the state importance (SI) as the reduction in the variance of the return caused by action selection, and define critical states as states whose SI exceeds a threshold. This formulation distinguishes the states that are sensitive to action selection and does not depend on a given set of tasks. We consider the standard Markov decision process. From this definition, the SI can be calculated as
$$
\begin{aligned}
SI(s) &= \mathrm{Var}\left[G \mid s\right] - \mathbb{E}_{a \sim \mu}\!\left[\mathrm{Var}\left[G \mid s, a\right]\right] && (1)\\
&= \mathrm{Var}_{a \sim \mu}\!\left[\mathbb{E}\left[G \mid s, a\right]\right] && (2)\\
&= \mathbb{E}_{a \sim \mu}\!\left[\mathbb{E}\left[G \mid s, a\right]^{2}\right] - \left(\mathbb{E}_{a \sim \mu}\!\left[\mathbb{E}\left[G \mid s, a\right]\right]\right)^{2} && (3)
\end{aligned}
$$
where $G$ denotes the return (the cumulative reward), and $s$ and $a$ denote the state and action, respectively; Eq. (2) follows from the law of total variance. We assume that the agent selects the action $a$ at state $s$ according to a distribution $\mu$ and selects the actions at the following states according to the current policy $\pi$. With this setup and the following definition of the Q-function,

$$Q^{\pi}(s, a) = \mathbb{E}\left[G \mid s, a\right], \qquad (4)$$

the SI reduces to the variance of the Q-function over the actions, $SI(s) = \mathrm{Var}_{a \sim \mu}\left[Q^{\pi}(s, a)\right]$. We used a uniform distribution as $\mu$ to account for the change in the return under every action available at the current state. In this way, the SI can be calculated from the Q-function alone, without estimating the complex distributions of the return given $s$ and given $(s, a)$.
We regard the states whose SI is greater than a threshold as critical states. We set this threshold by assuming that critical states make up a fixed fraction of all states. To calculate the threshold, we first calculate the SIs of all states, if possible. If the state space is too large to calculate the SIs of all states, we calculate the SIs of recently visited states instead. We then set the threshold that separates this upper fraction of SIs from the rest. We used the same ratio of critical states in all the subsequent experiments.
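To make the identification procedure concrete, the following sketch computes the SI of a state as the variance of its Q-values under a uniform $\mu$ and sets the threshold as the quantile separating the upper fraction of SIs. The function names and the default ratio (a value within the 0.05-0.5 range examined later) are illustrative, not the authors' implementation.

```python
import numpy as np

def state_importance(q_values: np.ndarray) -> float:
    """SI of one state: variance of its Q-values over actions under a uniform mu."""
    return float(np.mean(q_values ** 2) - np.mean(q_values) ** 2)

def critical_state_threshold(q_rows: np.ndarray, ratio: float = 0.1) -> float:
    """Threshold separating the upper `ratio` of SIs from the rest.

    `q_rows` has shape (n_states, n_actions); for a large state space the rows
    would be the Q-values of recently visited states instead of all states.
    """
    si = np.array([state_importance(q) for q in q_rows])
    return float(np.quantile(si, 1.0 - ratio))

def is_critical(q_values: np.ndarray, threshold: float) -> bool:
    """A state is critical when its SI exceeds the threshold."""
    return state_importance(q_values) > threshold
```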
2.3 Exploration with exploitation on critical states
We propose an exploration method that chooses exploitation with a high probability at the identified critical states and chooses an action based on any exploration method in other situations. This method increases exploration in the areas that the agent can reach only through a limited set of actions and decreases the likelihood of less promising exploration. We still choose exploration at the identified critical states with a small probability because the estimated exploitation during learning could still be wrong. We used the same exploitation probability in all the experiments.
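A minimal sketch of the resulting action-selection rule, built on the `is_critical` helper above with an $\epsilon$-greedy base explorer; `p_exploit` and the other names are placeholders rather than the paper's code.

```python
import numpy as np

def select_action(q_values, threshold, eps, p_exploit, rng):
    """Exploit with high probability at critical states; otherwise act epsilon-greedily."""
    n_actions = len(q_values)
    if is_critical(q_values, threshold):
        # Critical state: exploit with probability p_exploit, explore only rarely
        # because the estimated greedy action could still be wrong during learning.
        if rng.random() < p_exploit:
            return int(np.argmax(q_values))
        return int(rng.integers(n_actions))
    # Non-critical state: ordinary epsilon-greedy exploration.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values))
```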
3 Experiments
3.1 Cliff Maze Environment
We employed the cliff maze environment, shown on the left of Fig. 1, as the test environment. This environment includes a clear critical state at which the action selection divides future success and failure dramatically. A cliff occupies every cell of the center column of the maze except the center cell. The states are the 121 discrete cells of the map. The initial state is the upper-left corner, and the goal state is the lower-right corner. The actions consist of four discrete actions: up, down, left, and right. The agent receives a reward of +1 at the goal state and -1 at the cliff states. An episode terminates when the agent visits either a cliff or the goal. The state between the two cliffs should be the most critical state. We evaluated the proposed identification method for critical states and the exploration performance, measured by the number of steps required for the agent to learn the optimal shortest path.
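For reference, a minimal environment matching this description could look as follows; the 11×11 size is implied by the 121 states, and the class and method names are our own.

```python
class CliffMaze:
    """11x11 grid; cliffs fill the center column except its center cell."""
    SIZE = 11
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self):
        mid = self.SIZE // 2
        self.cliffs = {(r, mid) for r in range(self.SIZE) if r != mid}
        self.goal = (self.SIZE - 1, self.SIZE - 1)
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)                       # start at the upper-left corner
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.state[0] + dr, 0), self.SIZE - 1)
        c = min(max(self.state[1] + dc, 0), self.SIZE - 1)
        self.state = (r, c)
        if self.state in self.cliffs:
            return self.state, -1.0, True         # fell off a cliff: reward -1, terminate
        if self.state == self.goal:
            return self.state, +1.0, True         # reached the goal: reward +1, terminate
        return self.state, 0.0, False
```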
The agent learned the Q-function via Q-learning [21]. The Q-function was updated after every episode, sequentially from the end of the trajectory, with a fixed learning rate. We evaluated the exploration performance of the proposed method (built on $\epsilon$-greedy) by comparing it with the simple $\epsilon$-greedy method. At a critical state, the proposed method chooses exploitation with high probability $p$ and a random action with low probability $1 - p$. At a non-critical state, it selects an action as $\epsilon$-greedy does, with exploration ratio $\epsilon_{\mathrm{proposed}}$. We controlled the exploration ratio using the following equation so that the exploitation probabilities of the two methods were as similar as possible:
$$
\epsilon_{\mathrm{greedy}} = k\,(1 - p) + (1 - k)\,\epsilon_{\mathrm{proposed}}, \qquad (5)
$$

where $k$ denotes the ratio of critical states and $p$ denotes the exploitation probability at critical states.
This control ensures that any gain comes from exploiting specifically at critical states rather than from simply taking exploitation with a certain probability regardless of the state. The exploration ratio $\epsilon_{\mathrm{greedy}}$ was annealed linearly from 0.905 at the initial step to 0.005 at the final 100k-th step; the exploration ratio $\epsilon_{\mathrm{proposed}}$ of the proposed method was correspondingly decayed from 1.0 to 0.0. We conducted 100 runs of each method with different random seeds.
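Under the matching rule of Eq. (5) (our reconstruction), the baseline exploration schedule follows directly from the proposed method's schedule. A sketch follows; the example values of $k$ and $p$ are simply those that reproduce the reported 0.905 and 0.005 endpoints and are not stated explicitly in the text.

```python
def matched_epsilon_greedy(eps_proposed: float, k: float, p: float) -> float:
    """Exploration ratio of plain epsilon-greedy that matches the overall
    exploitation probability of the proposed method (assumed matching rule)."""
    return k * (1.0 - p) + (1.0 - k) * eps_proposed

# Example (illustrative values): with k = 0.1 and p = 0.95,
# eps_proposed = 1.0 gives 0.905 and eps_proposed = 0.0 gives 0.005,
# which matches the annealing range described above.
```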

The right part of Fig. 1 shows the SI calculated after a 100k-step exploration. The SI was largest at the state between the two cliffs. Intuitively, this is a critical state, where the action selection dramatically divides the potential for success and failure. Table 1 shows the median, mean, and standard deviation of the number of steps required to learn the shortest path. The proposed method reduced the median number of required steps by about 71% and the mean by about 68% compared with the simple $\epsilon$-greedy method. To analyze the results, we used the Wilcoxon rank-sum test ($p$ value in Table 1); its result shows that the proposed method significantly reduces the number of exploration steps required to learn the optimal shortest path in the cliff maze.
| Method | $p$ value | median | mean | std |
| --- | --- | --- | --- | --- |
| $\epsilon$-greedy | – | 22815 | 26789 | 19907 |
| Proposed | | 6450 | 8468 | 6443 |
3.2 Scalability to large state-action space
We evaluated the proposed method on two baseline tasks of deep RL to confirm whether it scales up to high-dimensional, complex tasks. The statistical significance of the mean acquired return was assessed with a Bayesian regression of a time-series model, as defined in the Appendix. We report the results for a single value of the ratio of critical states below, since varying the ratio from 0.05 to 0.5 did not significantly change the results.
3.2.1 Large discrete space.
We used the Arcade Learning Environment Atari 2600 Breakout [2] as an example of a high-dimensional discrete-space task. The agent learned the Q-function with the deep Q-network (DQN) [14]. The comparison procedure was the same as in the experiment of Section 3.1, except that we linearly annealed $\epsilon_{\mathrm{greedy}}$ from 0.905 at the initial step to 0.005 at the final 2,000k-th step. We recalculated the threshold of the SI every 1,000 steps from the 1,000 most recently visited states, because it is infeasible to calculate the SIs of all states in such a large state space. We evaluated the return after every 1,000 steps of exploration by choosing exploitation from the initial state to the end of the episode.
The hyperparameter settings were as follows. The environment setting was the same as that of [14], except that we did not use flickering reduction and we normalized each pixel to the range [0, 1]. We approximated the Q-function with a convolutional neural network (CNN) with three convolutional layers: 32 filters of size 8 with stride 4, 64 filters of size 4 with stride 2, and 64 filters of size 3 with stride 1, each followed by a rectified linear unit (ReLU) activation. A fully-connected layer of 256 units with ReLU activation followed the last convolutional layer, and a linear output layer produced one output per action. We used a target network, as in the prior work [14]; the target network was updated every 500 steps, and the Q-function was updated every four steps after the first 1,000 steps. The discount factor was 0.99, the batch size was 64, and experience was sampled from a replay buffer. We used the Adam optimizer [10] and clipped the gradient when the L2 norm of the gradient of the DQN loss exceeded 10. We initialized the weights using Xavier initialization [4] and the biases to zero.
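As a concrete reference, the Q-network described above might be written as follows in PyTorch, assuming the standard 4-frame 84×84 input of [14]; this is a sketch under those assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class BreakoutDQN(nn.Module):
    """CNN Q-function: three conv layers and a 256-unit fully-connected layer."""
    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(),  # 7x7 feature map for 84x84 inputs
            nn.Linear(256, n_actions),              # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))
```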


The left part of Fig. 2 shows the change in the acquired return during learning. The right part of Fig. 2 shows the mean and the 95% credibility interval of the difference in mean return between the proposed method and the $\epsilon$-greedy method. The difference in the mean return was greater than 0 from the 50k-th step to the final 2,000k-th step, so a statistical difference existed in the latter 97.5% of the steps, at least in the sense of the Bayesian regression under our modeling (the detailed model is given in the Appendix). This result shows that the proposed method acquired a higher return with fewer exploration steps than $\epsilon$-greedy over the latter 97.5% of the learning process. Fig. 3 shows part of a fully exploited trajectory and the identified critical states. The agent identified critical states at which the action selection is indeed critical: at these states, the agent must move the paddle toward the ball, otherwise the ball will fall. The fifth frame is not a critical state because it has already been determined that the agent can hit the ball back regardless of the action selection; this leaves the agent free to explore where to hit the ball.
3.2.2 Continuous space.
We tested our method in an environment with continuous state and action spaces. We used the Walker2d environment from the OpenAI Gym [3] baseline tasks. The RL algorithm used to train the Q-function was soft actor-critic [6] without the entropy bonus; the coefficient of the entropy bonus was set to 0 to evaluate how our method works on a standard RL objective function. We approximated the SI with the Monte Carlo method by sampling 1,000 actions from a uniform distribution. The threshold of the SI was calculated in the same manner as in the large discrete-space experiment. The proposed method chooses exploitation with high probability at the identified critical states and chooses actions based on the stochastic policy in other situations. We compared the exploration performance of the proposed method with that of two methods. The first, called Default, chooses actions based on the stochastic policy at all states. The second, EExploitation, chooses exploitation with a fixed probability at every state and otherwise follows the stochastic policy. We set this probability so that the overall exploitation probabilities of the proposed method and EExploitation were as similar as possible; this control ensures that any gain comes from exploiting specifically at critical states rather than from simply exploiting more often regardless of the state. We evaluated the return every 1,000 steps of exploration by choosing exploitation from the initial state to the end of the episode.
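A sketch of the Monte Carlo SI estimate for continuous actions, assuming a critic `q_fn(state, actions)` that evaluates a batch of candidate actions; the function and argument names are placeholders.

```python
import numpy as np

def monte_carlo_si(q_fn, state, action_low, action_high, n_samples=1000, rng=None):
    """Approximate SI(s) = Var_{a~Uniform}[Q(s, a)] by sampling actions uniformly."""
    rng = rng or np.random.default_rng()
    actions = rng.uniform(action_low, action_high,
                          size=(n_samples, len(action_low)))
    q = np.asarray(q_fn(state, actions))   # Q-values of the sampled actions, shape (n_samples,)
    return float(np.mean(q ** 2) - np.mean(q) ** 2)
```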
The hyperparameter settings were as follows. We used neural networks to approximate the Q-function, the V-function, and the policy. Each network had two fully-connected layers of 100 units with ReLU activations. The output layers of the Q-function and V-function networks were linear layers that output a scalar value. The policy was modeled as a Gaussian mixture model (GMM) with four components, and the output layer of the policy network produced the four means, log standard deviations, and log mixture weights. The log standard deviations and the log mixture weights were clipped in the ranges of [-5, 2] and [-10, ], respectively. To keep the action within its range, an action sampled from the GMM was passed through a tanh function and then multiplied by the action range. L2 regularization was applied to the policy parameters. We initialized all network weights using Xavier initialization [4] and all biases to zero. The discount factor was 0.99 and the batch size was 128; experience was sampled from a replay buffer, we used the Adam optimizer [10], and we updated the networks every 1,000 steps.
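The tanh-squashed GMM policy head might look roughly like the following PyTorch sketch; the layer sizes follow the description above, while the class structure, the handling of the mixture-weight upper clip, and all names are our own assumptions.

```python
import torch
import torch.nn as nn

class GMMPolicy(nn.Module):
    """Tanh-squashed Gaussian-mixture policy head (sketch)."""
    def __init__(self, obs_dim: int, act_dim: int, act_scale: float,
                 n_components: int = 4, hidden: int = 100):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Per component: a mean and a log-std per action dimension, plus one mixture logit.
        self.head = nn.Linear(hidden, n_components * (2 * act_dim + 1))
        self.n, self.act_dim, self.act_scale = n_components, act_dim, act_scale

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        p = self.head(self.body(obs))
        mu, log_std, logit = torch.split(
            p, [self.n * self.act_dim, self.n * self.act_dim, self.n], dim=-1)
        mu = mu.view(-1, self.n, self.act_dim)
        log_std = log_std.view(-1, self.n, self.act_dim).clamp(-5.0, 2.0)  # clip log-stds
        logit = logit.clamp(min=-10.0)  # clip log mixture weights from below (upper clip omitted here)
        # Sample a mixture component, then a Gaussian action from that component.
        comp = torch.distributions.Categorical(logits=logit).sample()
        idx = comp.view(-1, 1, 1).expand(-1, 1, self.act_dim)
        mean = mu.gather(1, idx).squeeze(1)
        std = log_std.gather(1, idx).squeeze(1).exp()
        raw = mean + std * torch.randn_like(std)
        return torch.tanh(raw) * self.act_scale   # squash and rescale to the action range
```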


The left part of Fig. 4 shows the trend of the return acquired through learning. The right part of Fig. 4 shows the mean and the 95% credibility interval of the difference in mean return between the proposed method and each of the other methods. The difference in the mean return was greater than 0 from the 720k-th step to the final 2,000k-th step for Default, and from the 740k-th step to the final 2,000k-th step for EExploitation, so a statistical difference existed in the latter 63% of the steps, at least in the sense of the Bayesian regression under our modeling (the detailed model is given in the Appendix). This result shows that the proposed method acquired a higher return with fewer exploration steps over the latter 63% of the learning process. Fig. 5 shows part of a fully exploited trajectory and the identified critical states. The agent identified critical states at which the action selection was indeed critical: it had to lift its body by pushing on the ground with one leg while lifting the other leg and moving it forward; otherwise, it would stumble and fall.
3.3 Relationship between the identification of critical states and the speed of learning

We analyzed the relationship between the timing of the identification of the especially critical states and the rapid progress of learning in the Walker2d task. The analysis proceeded as follows. We saved the parameters of the Q-function at each step and all states visited by the agent over the entire training, and calculated the SIs of all of the visited states with the saved Q-function parameters. We regarded the states whose SI values were in the top ten as especially critical states. We then calculated the minimum Euclidean distance $d_t$ between the top-ten critical states of the final step and those of the $t$-th step, assuming that the SIs at the final step were correct. We calculated $d_t$ from the 1k-th step to the 2000k-th step at 10k-step intervals and defined the match ratio at step $t$ as $1 - d_t / d_{\max}$, where $d_{\max}$ is the maximum value of $d_t$ over all steps. The match ratio approaches 1 as at least one of the top-ten critical states of step $t$ becomes closer to those of the final step.
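A sketch of the match-ratio computation as described above; `top10_by_step` maps an evaluation step to the array of its top-ten critical states (rows are state vectors), and all names are our own.

```python
import numpy as np

def min_pairwise_distance(states_a: np.ndarray, states_b: np.ndarray) -> float:
    """Minimum Euclidean distance between any pair drawn from two sets of states."""
    diffs = states_a[:, None, :] - states_b[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=-1)).min())

def match_ratios(top10_by_step: dict, final_step: int) -> dict:
    """Match ratio at step t: 1 - d_t / max_t d_t, where d_t is the minimum distance
    between the top-ten critical states at step t and those at the final step."""
    final = top10_by_step[final_step]
    d = {t: min_pairwise_distance(s, final) for t, s in top10_by_step.items()}
    d_max = max(d.values())
    return {t: 1.0 - dt / d_max for t, dt in d.items()}
```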
Fig. 6 shows two examples of the relationship between the timing of the identification of the top-ten critical states and the progress of learning. The return curves rose rapidly shortly after the agent began to identify the top-ten critical states. This phenomenon might be interpreted as the agent "getting the knack of" the task. The tendency was observed in some but not all agents. This result suggests that there are a few critical states that carry important information for accelerating RL rapidly.
4 Conclusion
In this paper, we assumed that the critical states at which the action selection dramatically changes the potential for future success and failure are useful both for exploring environments that include such states and for extracting information to explain the RL agent's operation. We identified the critical states using the variance of the Q-function over the actions and proposed to choose exploitation with high probability at the critical states for efficient exploration. The results show that the proposed method improves the performance of exploration and that the identified states are intuitively critical for success. An analysis of the relationship between the timing of the identification of especially critical states and the rapid progress of learning suggests that a few especially critical states carry important information for accelerating RL rapidly.
One limitation of the proposed method is that it does not improve exploration in environments with no critical states, although it does not significantly degrade performance there either: exploitation at falsely detected critical states is not critical for future success in such environments. In addition, the approach does not greatly increase the risk of being trapped in local optima, because it does not aim to reach critical states but merely chooses exploitation when the visited state is a critical state. We set the ratio of critical states used for the threshold and the exploitation probability in advance; these values could be decided automatically during learning, for example using the uncertainty of the Q-function estimated with a Bayesian method. The proposed identification method extracted content useful for a brief explanation, but its interpretation was constructed by a human. A method that constructs a brief explanation automatically from the identified critical states is future work.
4.0.1 Acknowledgments.
This paper is based partly on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and partly on results obtained from research activity in Chair for Frontier AI Education, School of Information Science and Technology, The University of Tokyo. The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-61609-0_29.
Appendix
Calculation of the statistical difference of two learning dynamics.
A statistical test to confirm the difference between algorithms is necessary for deep RL because large fluctuations in performance have been observed even across different random seeds [7]. We cannot evaluate the number of exploration steps required to learn the optimal solution because the optimal solution is difficult to define in a large state space. A statistical test has been performed on the difference in the mean return at a certain step [7], but this makes the choice of evaluation points arbitrary. In addition, treating each timestep independently does not handle the data properly, because the return curve is time-series data. In this paper, we applied Bayesian regression to the return curves, modeling them as random walks that depend on the previous evaluation step. We consider there to be a statistical difference when the 95% credibility interval of the time series of the mean difference in returns between the proposed method and a comparison method lies above 0. We model the return curves by
$$
\begin{aligned}
R^{\mathrm{prop}}_{t} &\sim \mathcal{N}\!\left(r_{t},\, \sigma_{\mathrm{obs}}^{2}\right) && (6)\\
R^{\mathrm{comp}}_{t} &\sim \mathcal{N}\!\left(r_{t} + \delta_{t},\, \sigma_{\mathrm{obs}}^{2}\right) && (7)\\
r_{t} &\sim \mathcal{N}\!\left(r_{t-1},\, \sigma_{\mathrm{rw}}^{2}\right) && (8)\\
\delta_{t} &\sim \mathrm{Cauchy}\!\left(\delta_{t-1},\, \gamma\right) && (9)
\end{aligned}
$$
where $R^{\mathrm{prop}}_{t}$ is the observed return of the proposed method at evaluation step $t$, $r_{t}$ is its true value, and $\sigma_{\mathrm{obs}}$ is the observation noise, which arises because we evaluate only a few episodes instead of all possible trajectories. We assume that the observed return $R^{\mathrm{comp}}_{t}$ of a comparison method fluctuates by $\delta_{t}$ from $r_{t}$, with the same observation noise added. Here, $r_{t}$ is modeled as a normal random walk depending on the previous evaluation step with variance $\sigma_{\mathrm{rw}}^{2}$, and $\delta_{t}$ is modeled as a Cauchy random walk depending on the previous evaluation step with scale parameter $\gamma$, so that large fluctuations can be handled robustly. The statistical difference was calculated by estimating the posterior with a Markov chain Monte Carlo (MCMC) method. We used non-informative uniform priors for all variables and initialized the first MCMC sample of $r_{t}$ with the mean return of the proposed method at each evaluation step $t$.
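For reference, the unnormalized log posterior of this model can be written down directly; the sketch below uses flat (improper) priors and SciPy densities, and any MCMC sampler could be applied to it. It follows our reconstruction in Eqs. (6)-(9), not the authors' implementation.

```python
import numpy as np
from scipy.stats import cauchy, norm

def log_posterior(r, delta, sigma_obs, sigma_rw, gamma, returns_prop, returns_comp):
    """Unnormalized log posterior of the random-walk return model.

    r, delta          : latent true return and fluctuation at each evaluation step
    sigma_obs         : observation-noise scale shared by both methods
    sigma_rw, gamma   : scales of the normal and Cauchy random walks
    returns_prop/comp : observed returns of the proposed and comparison methods
    """
    if sigma_obs <= 0 or sigma_rw <= 0 or gamma <= 0:
        return -np.inf                                                     # flat priors, positive scales
    lp = norm.logpdf(returns_prop, loc=r, scale=sigma_obs).sum()           # Eq. (6)
    lp += norm.logpdf(returns_comp, loc=r + delta, scale=sigma_obs).sum()  # Eq. (7)
    lp += norm.logpdf(r[1:], loc=r[:-1], scale=sigma_rw).sum()             # Eq. (8)
    lp += cauchy.logpdf(delta[1:], loc=delta[:-1], scale=gamma).sum()      # Eq. (9)
    return lp
```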
References
- [1] Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems. pp. 1471–1479 (2016)
- [2] Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253–279 (2013)
- [3] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym (2016)
- [4] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. vol. 9, pp. 249–256. PMLR (13–15 May 2010)
- [5] Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Larochelle, H., Botvinick, M., Levine, S., Bengio, Y.: Transfer and exploration via the information bottleneck. In: International Conference on Learning Representations (2019)
- [6] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning. vol. 80, pp. 1861–1870. PMLR (10–15 Jul 2018)
- [7] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D.: Deep reinforcement learning that matters. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
- [8] Jayaraman, D., Ebert, F., Efros, A., Levine, S.: Time-agnostic prediction: Predicting predictable video frames. In: International Conference on Learning Representations (2019)
- [9] Kazemitabar, S.J., Beigy, H.: Using strongly connected components as a basis for autonomous skill acquisition in reinforcement learning. In: International Symposium on Neural Networks. pp. 794–803. Springer (2009)
- [10] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
- [11] Kuniyoshi, Y., Ohmura, Y., Terada, K., Nagakubo, A., Eitoku, S., Yamamoto, T.: Embodied basis of invariant features in execution and perception of whole-body dynamic actions—knacks and focuses of roll-and-rise motion. Robotics and Autonomous Systems 48(4), 189–201 (2004)
- [12] Liu, G., Schulte, O., Zhu, W., Li, Q.: Toward interpretable deep reinforcement learning with linear model u-trees. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 414–429. Springer (2018)
- [13] McGovern, A., Barto, A.G.: Automatic discovery of subgoals in reinforcement learning using diverse density. In: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 361–368 (2001)
- [14] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
- [15] Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: International Conference on Machine Learning (ICML). vol. 2017 (2017)
- [16] Şimşek, Ö., Barto, A.G.: Using relative novelty to identify useful temporal abstractions in reinforcement learning. In: Proceedings of the twenty-first international conference on Machine learning. p. 95. ACM (2004)
- [17] Şimşek, Ö., Barto, A.G.: Skill characterization based on betweenness. In: Advances in neural information processing systems. pp. 1497–1504 (2009)
- [18] Şimşek, Ö., Wolfe, A.P., Barto, A.G.: Identifying useful subgoals in reinforcement learning by local graph partitioning. In: Proceedings of the 22nd international conference on Machine learning. pp. 816–823. ACM (2005)
- [19] Stolle, M., Precup, D.: Learning options in reinforcement learning. In: International Symposium on abstraction, reformulation, and approximation. pp. 212–223. Springer (2002)
- [20] Verma, A., Murali, V., Singh, R., Kohli, P., Chaudhuri, S.: Programmatically interpretable reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning. vol. 80, pp. 5045–5054. PMLR (10–15 Jul 2018)
- [21] Watkins, C.J., Dayan, P.: Q-learning. Machine learning 8(3-4), 279–292 (1992)
- [22] Witten, I.H.: The apparent conflict between estimation and control—a survey of the two-armed bandit problem. Journal of the Franklin Institute 301(1-2), 161–189 (1976)