Information Regarding Previous Submission to ICRA 2021
We previously submitted to ICRA under the title “XAI-N: Explainable Robot Navigation using Expert Policies and Decision Trees.”
Below, find our cover letter, original reviews, and the original paper. In the cover letter, reviewer quotes are in bold.
I Cover Letter
The cover letter below describes the critiques of the ICRA reviewers and our responses to them. In some cases, changes were made to the paper; in other cases, we saw an opportunity to clarify our presentation.
In what follows, note the following abbreviations: Associate Editor (AE), Reviewer 1 (R1), and Reviewer 2 (R2).
Some of the critiques regarded our focus on explainability:
AE: “The title does not reflect that the focus of the paper is not direct explainability and does not have any novelty related to explanability. Rather this approach is compatible with some existing methods for interpreting policies.”
R1: “The claim of explainability is not supported beyond stating that DTs are explainable. Why are the presented robot navigation choices ‘explainable’?”
R2: “The claimed contribution (stated at the end of Sec. I) is not entirely correct as the explainability of the policy (either the PPO or DT) is not demonstrated. While explainability is presented as a major theme of this work, the purpose of the explainable element, i.e., the decision tree representation of the opaque policy, for interpretability/explainability is unclear. In other words, it is not demonstrated that human-understandable insights can be gained (or how to) from the tree representation, or that it can provide some insightful diagnostics to why the PPO or the DT policies fail when they do, and how to improve the PPO. Rather than providing such explainability/interpretability the DT is used for the sole purpose of improving itself. … this Reviewer wonders how the information represented in the DT can help improve the PPO policy.”
Our understanding of the word “explainability” seemed to differ from the reviewers’.
For example, R2 writes that “Rather than providing such explainability/interpretability the DT is used for the sole purpose of improving itself,” but this self-improvement is exactly the process that we considered to be interpretable: insight was gleaned into the DT policy that could not have been gleaned from a neural net, and that insight was then used to improve the policy.
In our paper, we note that a decision tree can be analyzed and modified more easily than a neural net. We provide novel examples of such processes. We considered this to qualify as “explainability.” Although the reviewers greatly appreciated these contributions, they did not consider them to be “explainable.”
We were not aiming for human-understandable explanations. Rather, we intended the decision tree format to allow an engineer or a program to analyze the policy, which the reviewers elsewhere agree we demonstrated.
The goal of our paper was not to improve the PPO policy, but rather to improve upon the PPO policy, i.e., to create a policy with a specific error fixed. The end result is a DT policy. When the system is running, the user does not care about the format of the policy, only whether it works; PPO is used only to create the expert policy, and the deployed policy thereafter remains in DT format. (In principle one could use imitation learning to cast the DT back into a neural net if some application demanded it, but that is unrelated to our purposes here.) R2 feels we do not show why the policy fails. We do show where the policy fails with regard to our two error conditions, and this is the main purpose of our method: this is the specific aspect of the policy that was interpreted/analyzed using the DT format and then resolved, again taking advantage of the DT format. (To our knowledge, we are among the first to do this.)
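To make the pipeline concrete, here is a minimal sketch assuming a hypothetical stand-in expert and scikit-learn's DecisionTreeClassifier; it is not our implementation, and the paper's extraction follows a VIPER-style procedure rather than the plain behavior cloning shown here.

```python
# A minimal sketch (not our implementation) of the pipeline described above:
# an opaque expert policy is distilled into a decision tree, and from then on
# the tree -- not the expert -- is the policy that is deployed and modified.
# The "expert" below is a hypothetical stand-in rule; in the paper it is a
# trained PPO network, and the extraction follows a VIPER-style procedure
# rather than the plain behavior cloning shown here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

STATE_DIM, N_ACTIONS = 8, 4  # hypothetical state and action sizes

def expert_policy(state):
    # Stand-in for the opaque expert: pick the action whose corresponding
    # state feature is largest (purely illustrative).
    return int(np.argmax(state[:N_ACTIONS]))

def collect_demonstrations(n_samples, rng):
    # Sample states and record the expert's action in each (illustrative;
    # a real pipeline would roll out the expert in the environment).
    states = rng.uniform(-1.0, 1.0, size=(n_samples, STATE_DIM))
    actions = np.array([expert_policy(s) for s in states])
    return states, actions

rng = np.random.default_rng(0)
X, y = collect_demonstrations(5000, rng)

# Distill the expert into a decision tree policy; this tree is the artifact
# that is later inspected and edited to remove a specific error.
dt_policy = DecisionTreeClassifier(max_depth=8).fit(X, y)
```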
In our new version, we have reduced the focus on “explainability,” removing it as a central claim and explaining the above contributions using different terminology.
AE: “The authors need to provide a comparative evaluation to other methods.” R1: “Present some sort of benchmark comparison – how do we know this method is useful compared to other approaches to navigation?”
TODO
R1: “How do we know oscillation and freezing will be the only poor behaviors?”
It is not the goal of our method to guarantee that there are no errors. Indeed, R1’s question is asking us to prove a negative.
What our method does provide is a way for a human to address domain-specific errors, whether anticipated in advance or noticed in the expert policy, without retraining from scratch.
We have implemented fixes for freezing and oscillation; additional fixes or modifications can be created following our approach.
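Purely as an illustration (not the fix used in the paper), the sketch below continues the hypothetical setup above and shows one way a domain-specific error could be patched directly in the extracted tree: states exhibiting an assumed oscillation pattern are located, and the leaves they fall into are rewritten to a corrected action, with no retraining.

```python
# Illustrative only: edit a scikit-learn decision tree in place so that the
# states where an undesired behavior (e.g., oscillation) was observed map to
# a corrected action. This is a hypothetical stand-in for the paper's
# modification step, not a reproduction of it; `dt_policy` refers to the tree
# from the previous sketch.
import numpy as np

def looks_like_oscillation(action_history, period=2, repeats=3):
    # Hypothetical detector: the same short action pattern repeated
    # several times at the end of a trajectory.
    if len(action_history) < period * repeats:
        return False
    tail = action_history[-period * repeats:]
    pattern = tail[:period]
    return all(tail[i] == pattern[i % period] for i in range(len(tail)))

def patch_tree_leaves(dt_policy, offending_states, corrected_action):
    # Find the leaves the offending states fall into and force their
    # predicted class to the corrected action by rewriting the stored
    # class weights (predict() takes the argmax of these per leaf).
    leaves = dt_policy.apply(np.asarray(offending_states))
    for leaf in np.unique(leaves):
        dt_policy.tree_.value[leaf][0, :] = 0.0
        dt_policy.tree_.value[leaf][0, corrected_action] = 1.0
```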
How much training would be required, how do we judge the quality of results, and what domains would be well-suited to this ML pipeline approach?
How much training is required is a very domain-specific question; different tasks, even different navigation tasks, will vary in how much initial training they require. The modification stage, of course, requires no training. The quality of the results can be judged by the same metrics by which the policy would be judged pre-modification.
A domain is well-suited to this approach if it has a semantically meaningful state space and a discrete action space. In addition, we found in practice that some domains are so complex that policy extraction, although possible, takes too long to be practical.
R2: “Technically, the DRL problem at hand should be posed as a Partially Observable Markov Decision Process (POMDP), not as an MDP. This is because the sensor (as described in Sec. V) only observes part of the environment. Aside from this major point, the MPD/POMDP description provided should include the policy as a part of the tuple and the terminal conditions should be made explicit (such as collision). ”
We have updated the paper accordingly, with the exception that we do not include the policy as part of the tuple, as we find omitting it to be more common in the literature.
R2: “There is no evaluation provided measuring how representative the extracted DT policy is of the expert policy.”
Theoretically, the imitation methods can replicate the expert policy precisely given infinite samples. In practice, we have obtained trees that matched the performance of the expert policy as well as ones that did not. (We used some of the less well-performing M-VIPER trees in our examples to demonstrate the power of the modification step to fix errors that may arise. In practice, different domains will be easier or more difficult to imitate precisely.)
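Continuing the hypothetical setup from the first sketch, one simple way such a fidelity check could be computed (we do not claim this is the metric to use) is the fraction of held-out states on which the extracted tree selects the same action as the expert:

```python
# Illustrative fidelity check: how often does the extracted tree pick the
# same action as the expert on freshly sampled states? Names follow the
# hypothetical setup in the first sketch.
import numpy as np

def action_agreement(dt_policy, expert_policy, states):
    expert_actions = np.array([expert_policy(s) for s in states])
    tree_actions = dt_policy.predict(states)
    return float(np.mean(tree_actions == expert_actions))

# e.g.: held_out, _ = collect_demonstrations(1000, rng)
#       print(action_agreement(dt_policy, expert_policy, held_out))
```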
R2: “if a purpose of this paper is to propose the framework of mapping and improving an extracted policy, it would be sound to demonstrate and evaluate on another opaque policy.”
In the ICRA paper, we included only the crowd-based robot navigation environment. In our IROS submission, we add a warehouse agent-based, game-like environment.
Furthermore, in the new environment we demonstrate an additional fix, the blocking fix.
II ICRA Reviews
II-A Associate Editor
II-B Editor
II-C Reviewer 1 (Reviewer 11)
II-D Reviewer 2 (Reviewer 16)
III ICRA Paper
On the next page, find the ICRA version of our paper.