Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization — Summary of Revisions
We would like to thank the associate editor and the reviewers for helping improve the quality of this manuscript. We have revised the work accordingly, taking all comments and suggestions into account. Below we provide a summary of changes that lists the main differences between the previous version and the current version and highlights the improvements we made.
The previous version (log number “TPAMI-2022-06-1098”), titled “Sample Dropout in Policy Optimization”, was submitted to TPAMI on Jun. 13, 2022, and received the first-round review comments on Aug. 17, 2022.
1 Summary of Changes
The main differences between the previous version and this revision are as follows:
• Based on the comments from the AC and Reviewer #1, we thoroughly revisit the theoretical foundation of this work and close the gap between the theory/motivation and the algorithm we propose. Specifically, we analyze the variance of the surrogate objective estimate and provide a bound on the estimation variance. In this revised version, we prove in a principled way that the variance of the importance sampling estimate can grow quadratically with the policy ratios. Based on this finding, the proposed sample dropout technique keeps the ratios small and thus bounds the variance of the surrogate objective estimate. We have rewritten the corresponding text in the paper accordingly.
• In the experimental section, we provide more results to support the theory we develop. Specifically, we show that our sample dropout technique substantially reduces the variance of the surrogate objective estimate and consequently boosts the sample efficiency of policy optimization methods. We also present additional ablation studies to better illustrate the effect of sample dropout.
2 To the Associate Editor and Reviewer #1
Comment 1: There exists a gap between the theoretical motivation of Proposition 1 and the empirical success.
Response 1: In the new version, we revisit the theoretical foundation of this work. We analyze the variance that importance sampling introduces into the objective estimate and prove that the variance of the importance sampling estimate can grow quadratically with the policy ratios, which can consequently jeopardize the optimization of the surrogate objective. In this case, the sample dropout technique is conducive to keeping the policy ratios small and bounding the variance of the surrogate objective estimate. We provide additional experimental results to support this finding and to better illustrate the effect of sample dropout.
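To summarize the core of the argument (a minimal sketch in standard PPO notation, writing $r(\theta) = \pi_\theta(a \mid s) / \pi_{\theta_\text{old}}(a \mid s)$ for the policy ratio and $\hat{A}$ for the advantage estimate; the full derivation and constants in the revised paper may differ in detail):

\[
\operatorname{Var}_{\pi_{\theta_\text{old}}}\!\left[ r(\theta)\,\hat{A} \right]
= \mathbb{E}_{\pi_{\theta_\text{old}}}\!\left[ r(\theta)^2 \hat{A}^2 \right]
- \left( \mathbb{E}_{\pi_{\theta_\text{old}}}\!\left[ r(\theta)\,\hat{A} \right] \right)^2
\;\le\; \mathbb{E}_{\pi_{\theta_\text{old}}}\!\left[ r(\theta)^2 \hat{A}^2 \right],
\]

so the variance of the per-sample surrogate estimate scales with the square of the policy ratio, and discarding samples with large $r(\theta)$ directly controls this dominant term.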
Comment 2: A popular way is to cap the importance ratio [Bottou et al., JMLR 2013]. The paper does not seem to compare with baselines that cap the importance ratio.
Response 2: Our sample dropout criterion is motivated by the theoretical analysis of the variance of the surrogate objective. We agree that, besides discarding samples, an alternative is to cap the importance ratios, as PPO does. However, as shown in the experiments (Section 5.3.2, Figure 4), even with ratio clipping, the ratios of PPO can still become unbounded. We empirically show that in this case the sample dropout technique can be combined with capping the importance ratios to better bound the ratios. In addition, we compare with baselines that cap the importance ratio (i.e., PPO) and show that the sample dropout technique can greatly improve the performance of PPO. We also include the work of [Bottou et al., JMLR 2013] in the related work section.
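For concreteness, the sketch below shows one way sample dropout can be layered on top of PPO-style ratio clipping (this is an illustrative sketch, not the exact implementation in the paper; the name ratio_threshold and the symmetric keep rule are assumptions made for the example):

```python
import torch

def clipped_surrogate_with_dropout(log_probs_new, log_probs_old, advantages,
                                    clip_eps=0.2, ratio_threshold=3.0):
    """PPO clipped surrogate combined with a simple sample-dropout mask.

    Samples whose importance ratio lies outside
    [1 / ratio_threshold, ratio_threshold] are dropped from the average,
    so only small ratios enter the estimate and its variance stays bounded.
    """
    ratios = torch.exp(log_probs_new - log_probs_old)

    # Standard PPO clipped objective, computed per sample.
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_sample = torch.min(unclipped, clipped)

    # Sample dropout: mask out samples with excessively large or small ratios.
    keep = ((ratios < ratio_threshold) & (ratios > 1.0 / ratio_threshold)).float()

    # Average only over the retained samples (guard against an empty mask).
    return (per_sample * keep).sum() / keep.sum().clamp(min=1.0)
```

In this sketch the dropout mask only affects which samples are averaged; the clipping itself is left unchanged, which is how the two mechanisms can be combined.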
- [Bottou et al., JMLR 2013] Counterfactual reasoning and learning systems: the example of computational advertising
3 To Reviewer #2
Comment 1: The presentation should be more friendly to non-RL researchers.
Response 1: We include the definition of the MDP setting in the preliminaries section and revise the writing (notation, captions, etc.) to make the manuscript more self-contained.