Information Regarding Previous Submission to ICRA 2021
We previously submitted to ICRA under the title “XAI-N: Explainable Robot Navigation using Expert Policies and Decision Trees.”
Below, find our cover letter, original reviews, and the original paper. In the cover letter, reviewer quotes are in bold.
I Cover Letter
The cover letter below describes the critiques of the ICRA reviewers and our responses to them. In some cases, changes were made to the paper; in other cases, we saw an opportunity to clarify our presentation.
In what follows, note the following abbreviations: Associate Editor (AE), Reviewer 1 (R1), and Reviewer 2 (R2).
Some of the critiques regarded our focus on explainability:
AE: “The title does not reflect that the focus of the paper is not direct explainability and does not have any novelty related to explanability. Rather this approach is compatible with some existing methods for interpreting policies.”
R1: “The claim of explainability is not supported beyond stating that DTs are explainable. Why are the presented robot navigation choices ‘explainable’?”
R2: “The claimed contribution (stated at the end of Sec. I) is not entirely correct as the explainability of the policy (either the PPO or DT) is not demonstrated. While explainability is presented as a major theme of this work, the purpose of the explainable element, i.e., the decision tree representation of the opaque policy, for interpretability/explainability is unclear. In other words, it is not demonstrated that human-understandable insights can be gained (or how to) from the tree representation, or that it can provide some insightful diagnostics to why the PPO or the DT policies fail when they do, and how to improve the PPO. Rather than providing such explainability/interpretability the DT is used for the sole purpose of improving itself. … this Reviewer wonders how the information represented in the DT can help improve the PPO policy.”
Our understanding of the word “explainability” seemed to differ from the reviewers’.
For example, R2 writes that “Rather than providing such explainability/interpretability the DT is used for the sole purpose of improving itself,” but this self-improvement is exactly the process that we considered to be interpretable: insight was gleaned into the DT policy that could not have been gleaned from a neural net, and that insight was then used to improve the policy.
In our paper, we note that a decision tree can be analyzed and modified more easily than a neural net. We provide novel examples of such processes. We considered this to qualify as “explainability.” Although the reviewers greatly appreciated these contributions, they did not consider them to be “explainable.”
We were not aiming for human-understandable explanations. Rather, we intended the decision tree format to allow an engineer or a program to analyze the policy, which the reviewers elsewhere agree we demonstrated.
The goal of our paper was not to improve the PPO policy, but rather to improve upon the PPO policy, i.e., to create a policy with a specific error fixed. The end result is a DT policy. When the system is running, the user does not care about the format of the policy, only whether it works; PPO is used only to create the expert policy, and the deployed policy thereafter remains in DT format. (In principle one could use imitation learning to cast the DT back into a neural net if some application demanded it, but that is unrelated to our purposes here.) R2 feels we do not show why the policy fails. We do show where the policy fails with regard to our two error conditions, and this is the main purpose of our method: this is the specific aspect of the policy that was interpreted/analyzed using the DT format and then resolved, again taking advantage of the DT format. (To our knowledge, we are among the first to do this.)
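To make the pipeline concrete, here is a minimal sketch assuming a hypothetical stand-in expert and scikit-learn's DecisionTreeClassifier; it is not our implementation, and the paper's extraction follows a VIPER-style procedure rather than the plain behavior cloning shown here.

```python
# A minimal sketch (not our implementation) of the pipeline described above:
# an opaque expert policy is distilled into a decision tree, and from then on
# the tree -- not the expert -- is the policy that is deployed and modified.
# The "expert" below is a hypothetical stand-in rule; in the paper it is a
# trained PPO network, and the extraction follows a VIPER-style procedure
# rather than the plain behavior cloning shown here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

STATE_DIM, N_ACTIONS = 8, 4  # hypothetical state and action sizes

def expert_policy(state):
    # Stand-in for the opaque expert: pick the action whose corresponding
    # state feature is largest (purely illustrative).
    return int(np.argmax(state[:N_ACTIONS]))

def collect_demonstrations(n_samples, rng):
    # Sample states and record the expert's action in each (illustrative;
    # a real pipeline would roll out the expert in the environment).
    states = rng.uniform(-1.0, 1.0, size=(n_samples, STATE_DIM))
    actions = np.array([expert_policy(s) for s in states])
    return states, actions

rng = np.random.default_rng(0)
X, y = collect_demonstrations(5000, rng)

# Distill the expert into a decision tree policy; this tree is the artifact
# that is later inspected and edited to remove a specific error.
dt_policy = DecisionTreeClassifier(max_depth=8).fit(X, y)
```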
In our new version, we have reduced the focus on “explainability,” removing it as a central claim and explaining the above contributions using different terminology.
AE: “The authors need to provide a comparative evaluation to other methods.” R1: “Present some sort of benchmark comparison – how do we know this method is useful compared to other approaches to navigation?”
TODO
R1: “How do we know oscillation and freezing will be the only poor behaviors?”
It is not the goal of our method to guarantee that there are no errors. Indeed, R1’s question is asking us to prove a negative.
What our method does provide is a way for a human to address domain-specific errors, whether anticipated in advance or noticed in the expert policy, without retraining from scratch.
We have implemented fixes for freezing and oscillation; additional fixes or modifications can be created following our approach.
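Purely as an illustration (not the fix used in the paper), the sketch below continues the hypothetical setup above and shows one way a domain-specific error could be patched directly in the extracted tree: states exhibiting an assumed oscillation pattern are located, and the leaves they fall into are rewritten to a corrected action, with no retraining.

```python
# Illustrative only: edit a scikit-learn decision tree in place so that the
# states where an undesired behavior (e.g., oscillation) was observed map to
# a corrected action. This is a hypothetical stand-in for the paper's
# modification step, not a reproduction of it; `dt_policy` refers to the tree
# from the previous sketch.
import numpy as np

def looks_like_oscillation(action_history, period=2, repeats=3):
    # Hypothetical detector: the same short action pattern repeated
    # several times at the end of a trajectory.
    if len(action_history) < period * repeats:
        return False
    tail = action_history[-period * repeats:]
    pattern = tail[:period]
    return all(tail[i] == pattern[i % period] for i in range(len(tail)))

def patch_tree_leaves(dt_policy, offending_states, corrected_action):
    # Find the leaves the offending states fall into and force their
    # predicted class to the corrected action by rewriting the stored
    # class weights (predict() takes the argmax of these per leaf).
    leaves = dt_policy.apply(np.asarray(offending_states))
    for leaf in np.unique(leaves):
        dt_policy.tree_.value[leaf][0, :] = 0.0
        dt_policy.tree_.value[leaf][0, corrected_action] = 1.0
```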
How much training would be required, how do we judge the quality of results, and what domains would be well-suited to this ML pipeline approach?
How much training is required is a very domain-specific question; different tasks, even different navigation tasks, will vary in how much initial training they require. The modification stage, of course, requires no training. The quality of the results can be judged by the same metrics by which the policy would be judged pre-modification.
A domain is well-suited to this approach if it has a semantically meaningful state space and a discrete action space. In addition, we found in practice that some domains are so complex that policy extraction, although possible, takes too long to be practical.
R2: “Technically, the DRL problem at hand should be posed as a Partially Observable Markov Decision Process (POMDP), not as an MDP. This is because the sensor (as described in Sec. V) only observes part of the environment. Aside from this major point, the MPD/POMDP description provided should include the policy as a part of the tuple and the terminal conditions should be made explicit (such as collision). ”
We have updated the paper accordingly, with the exception that we do not include the policy as part of the tuple, as we find omitting it to be more common in the literature.
R2: “There is no evaluation provided measuring how representative the extracted DT policy is of the expert policy.”
Theoretically, the imitation methods can replicate the expert policy precisely given infinite samples. In practice, we have obtained trees that matched the performance of the expert policy as well as ones that did not. (We used some of the less well-performing M-VIPER trees in our examples to demonstrate the power of the modification step to fix errors that may arise. In practice, different domains will be easier or more difficult to imitate precisely.)
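Continuing the hypothetical setup from the first sketch, one simple way such a fidelity check could be computed (we do not claim this is the metric to use) is the fraction of held-out states on which the extracted tree selects the same action as the expert:

```python
# Illustrative fidelity check: how often does the extracted tree pick the
# same action as the expert on freshly sampled states? Names follow the
# hypothetical setup in the first sketch.
import numpy as np

def action_agreement(dt_policy, expert_policy, states):
    expert_actions = np.array([expert_policy(s) for s in states])
    tree_actions = dt_policy.predict(states)
    return float(np.mean(tree_actions == expert_actions))

# e.g.: held_out, _ = collect_demonstrations(1000, rng)
#       print(action_agreement(dt_policy, expert_policy, held_out))
```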
R2: “if a purpose of this paper is to propose the framework of mapping and improving an extracted policy, it would be sound to demonstrate and evaluate on another opaque policy.”
In the ICRA paper, we included only the crowd-based robot navigation environment. In our IROS submission, we add a warehouse agent-based, game-like environment.
Furthermore, in the new environment we demonstrate an additional fix, the blocking fix.
II ICRA Reviews
II-A Associate Editor
II-B Editor
II-C Reviewer 1 (Reviewer 11)
II-D Reviewer 2 (Reviewer 16)
III ICRA Paper
On the next page, find the ICRA version of our paper.