Counterfactual Explanation Policies in RL

Shripad V. Deshmukh Srivatsan R Supriti Vijay Jayakumar Subramanian Chirag Agarwal

Abstract

As Reinforcement Learning (RL) agents are increasingly employed in diverse decision-making problems using reward preferences, it becomes important to ensure that policies learned by these frameworks in mapping observations to a probability distribution of the possible actions are explainable. However, there is little to no work in the systematic understanding of these complex policies in a contrastive manner, i.e., what minimal changes to the policy would improve/worsen its performance to a desired level. In this work, we present Counterpol, the first framework to analyze RL policies using counterfactual explanations in the form of minimal changes to the policy that lead to the desired outcome. We do so by incorporating counterfactuals in supervised learning in RL with the target outcome regulated using desired return. We establish a theoretical connection between Counterpol and widely used trust region-based policy optimization methods in RL. Extensive empirical analysis shows the efficacy of Counterpol in generating explanations for (un)learning skills while keeping close to the original policy. Our results on five different RL environments with diverse state and action spaces demonstrate the utility of counterfactual explanations, paving the way for new frontiers in designing and developing counterfactual policies.

Machine Learning, ICML

1 Introduction

Reinforcement learning (RL) has been used successfully to train autonomous agents capable of achieving better than human-level performance in simulated environments like Go (Silver et al., 2016) and a suite of Atari games (Mnih et al., 2013). They are increasingly finding new applications across computational analysis (Fawzi et al., 2022), marketing (Theocharous et al., 2015), education (Koedinger et al., 2013), and biomedical research (Ding et al., 2022b) and. Recent breakthroughs in large-language models (LLMs) (OpenAI, 2023) are primarily attributed to key RL components that have improved the generative capability of state-of-the-art LLMs. With RL frameworks being deployed at scale as well as performing autonomously, it becomes imperative to incorporate explainability in them, resulting in increased user trust in autonomous decision-making. Explaining the decisions of black-box RL agents for a given environment state is non-trivial as it not only involves explaining the final agent action but also includes complex decision-making and planning behind the output action.

A myriad of RL explainability methods with various attribution techniques has recently been proposed (Greydanus et al., 2018; Deshmukh et al., ; Iyer et al., 2018; Puri et al., 2019). In particular, they focus on identifying input states and past experiences (trajectories) that led the RL agent to learn complex behaviors. While these methods output important input state features (agent’s observation) and trajectories, they fail to explain the minimal change in the trained policy leading to a desired outcome or (un)learning of a specific skill. Intuitively, this requires generating counterfactuals for a given desired outcome (i.e., identifying what and how much to change a given RL policy to obtain a target return for its current state). While some previous works have explored causal reinforcement learning (Gershman, 2017), there is little to no research on systematically explaining the mechanism of the complex policies learned by a given RL agent using counterfactual explanations.

Present Work. We propose Counterpol, a framework for counterfactual analysis of RL policies. In our framework, we generate explanations by asking the question: “What least change to the current policy would improve or worsen it to a new policy with a specified target return?” To estimate such counterfactual policies, we present an objective that aims to obtain a new policy with an average performance equal to that of a specified return while limiting its modifications with respect to the given policy. The generated policies provide direct insights into how a policy can be modified to achieve better results as well as what to avoid in order not to deteriorate the performance. Further, we theoretically prove the connection between popular trust region-based optimization methods in RL with Counterpol, bringing a new perspective of looking at RL optimization using a prominent explainability tool. Formally, the Counterpol learns minimal changes in the current policy without changing its general behavior. To optimize counterfactual explanation policies, we specify a novel objective function that can be solved using basic on-policy Monte Carlo policy gradients. In our experiments across diverse RL environments, we show how our algorithm reliably achieves counterfactual for any the set target return for a given policy.

Our Contributions. We present our contributions as follows: 1) We formalize the problem of counterfactual explanation policy for explaining RL policies. 2) We propose Counterpol, an explanatory framework for generating contrastive explanations for RL policies that identify minimal changes in the current policy, which would lead to improving/worsening the performance of a given policy. 3) We derive a theoretical equivalence between the Counterpol objective with the widely used trust region-based policy gradient methods. 4) We demonstrate the flexibility of Counterpol through empirical evaluations of explanations generated for five OpenAI gym environments. Qualitative and quantitative results show that Counterpol successfully generates a counterfactual policy for (un)learning skills while keeping close to the original policy.

2 Related Work

This work lies at the intersection of counterfactual explanations, explainability in RL, and proximal gradients. Next, we discuss the related work for each of these topics.

Counterfactual Explanations. Several techniques have been recently proposed to generate counterfactual explanations for providing recourses to individuals negatively impacted by complex black-box model decisions (Wachter et al., 2018; Ustun et al., 2019; Van Looveren & Klaise, 2019; Mahajan et al., 2019; Karimi et al., 2020). In particular, these techniques aim to provide model explanations in the form of minimal changes to an input instance that changes the original model. Broadly, these methods are categorized based on access to the predictive model (white- vs. black-box), enforce sparsity (more interpretable explanations), and whether the counterfactuals belong to the original data manifold (Verma et al., 2020a). To this end, Wachter et al. (2018) is one of the most widely used methods that use a gradient-based method to obtain counterfactual explanations for models using a distance-based penalty. In this work, we extend the counterfactual explanations to RL policies.

Explainability in RL. Explainable RL (XRL) (Puiutta & Veith, 2020) methods are a sub-field of explainable artificial intelligence (XAI) that aims to interpret and explain the complex behaviour of RL agents. Recent works in XRL are broadly categorized into i) gradient-based methods (Greydanus et al., 2018), which analyzed Atari RL agents that use raw visual input to make their decision and identify salient information in the image that the policy uses to select an action, ii) attribution-based methods (Deshmukh et al., ; Puri et al., 2019; Iyer et al., 2018), which explain an agent’s behaviour by attributing its actions to global or local observation data, and iii) surrogate-based methods, which distill complex RL policy into simpler models such as decision trees (Coppens et al., 2019) or to human understandable decision language (Verma et al., 2018). While all existing XRL methods generate explanations using state features or trajectories, none explore counterfactual explanations. To the best of our knowledge, for the first time, we explore counterfactual explanation policies for contrastive analysis of policies in terms of what minimal change to the policy would get us to desired output returns.

Proximal Gradient Methods in Optimization. General idea behind proximal gradient methods is to solve an optimization problem $\min_{x}f(x)$ in the neighborhood of certain $x_{k-1}$ at $k^{th}$ optimization iteration by modifying the objective as $\min_{x}f(x)+\lambda\cdot\left\lVert x-x_{k-1}\right\rVert_{p}$ (Parikh et al., 2014; Nitanda, 2014). This objective modification is termed as proximal operator. In machine learning, these operators find great significance in regularization, constraint-based optimization, convex structuring of objectives, etc. They have several applications in a multitude of fields ranging from risk-aware forecasting systems, and mathematical reasoning to reinforcement learning (Maggipinto et al., 2020; Asadi et al., 2023; Ding et al., 2022a; Hirano et al., 2022; Khan et al., 2020; Lu et al., 2023). One major advantage of proximal gradient methods is in their ability to handle noisy and incomplete data, which is common in real-world applications. Subsequently, in RL, they have been referred by state-of-the-art algorithms like PPO (Schulman et al., 2017) to develop ways to achieve monotonic gains. In this light and in contrast, we incorporate a proximal operator to ensure counterfactual RL policies are closer to the original input policy.

3 Preliminaries

Supervised Learning. Consider a standard classification setup with a given set of training examples,

\mathbb{X}=\{(\mathbf{x}^{(i)},y^{(i)})~{}|~{}\mathbf{x}^{(i)}\in\mathbb{R}^{d},y^{(i)}\in C\},

where $i\in\{1,2,\dots,N\}$ , $C=\{1,2,\dots,K\}\}$ , $\mathbf{x}^{(i)}$ is a $d$ -dimension vector, $N$ is the number of training examples, and $C$ denotes set of $K$ classes. Let a classifier $f$ be trained on $\mathbb{X}$ to predict the correct label for any unseen input x.

Counterfactual Explanations. XAI literature (Stepin et al., 2021; Verma et al., 2020b) defines counterfactual explanations as a technique to analyze “what if” scenarios for model predictions, e.g., a counterfactual explanation for a loan application model rejecting an individual’s loan application could be “if you increased your income by $500, your application would have been accepted”. Note that in the present work we work in a non-causal setup, where following the previous works (Wachter et al., 2018), we define a counterfactual explanation for a given model prediction of input $\mathbf{x}$ as the minimal variation in the features of $\mathbf{x}$ that changes the prediction of the underlying model from $y$ to $y_{\text{target}}$ .

Formally, in for the aforementioned supervised learning setup, counterfactual explanation of model prediction $\hat{y}_{0}=f(\mathbf{x}_{0})$ conditioned on a target class $y_{\text{target}}\in C$ is given by,

\mathbf{x}_{\text{cf}}=\operatorname*{arg\,min}_{\mathbf{x}}[\text{CE}(f(\mathbf{x}),~{}y_{\text{target}})+k\cdot\left\lVert\mathbf{x}_{\text{0}}-\mathbf{x}\right\rVert_{2}^{2}],

(1)

where cross-entropy (CE) loss guides explanation for achieving $y_{\text{target}}$ by modifiying $\mathbf{x}_{0}$ to $\mathbf{x}_{\text{cf}}$ and the mean-squared distance between $\mathbf{x}_{\text{cf}}$ and $\mathbf{x}_{0}$ ensures the proximity between the two, regulated via coefficient $k$ .

Reinforcement Learning. Consider a finite horizon Markov Decision Process (MDP) (Bellman, 1957) defined as $\mathcal{M}=(\mathcal{S},\mathcal{A},P,R,d_{0},\gamma)$ , where $\mathcal{S}$ denotes the state-space, $\mathcal{A}$ denotes action-space, $P:(\mathcal{S}\times\mathcal{A}\times\mathcal{S})\xrightarrow{}[0,1]$ denotes state transition function, $R:(\mathcal{S}\times\mathcal{A}\times\mathcal{S})\xrightarrow{}\mathbb{R}$ denotes the reward function, $d_{0}:\mathcal{S}\xrightarrow[]{}[0,1]$ represents distribution over starting states, and $\gamma\in(0,1]$ denotes the discount factor. Let $\pi:\mathcal{S}\times\mathcal{A}\xrightarrow[]{}[0,1]$ denote the learnt agent policy. Then, we measure the performance $J_{\pi}$ of the policy in terms of expected return as follows:

J_{\pi}=\mathbb{E}_{(s_{0},a_{0},s_{1},a_{1},\dots,s_{T})}\Big{[}\sum_{t=0}^{T-1}{\gamma^{t}r(s_{t},a_{t},s_{t+1})}\Big{]},

(2)

where $s_{0}\sim d_{0}$ , $a_{t}\sim\pi(a_{t}|s_{t})$ , and $T$ denotes the episode terminating time-step.

4 Counterfactual Explanation Policies

Reinforcement learning is a powerful decision-making tool that provides a systematic abstraction for defining environment dynamics with reward preferences and also provides the algorithms to come up with policies for maximizing returns in that environment. In the context of RL agents, we pose the counterfactual question as follows – given a policy $\pi$ performing at level $J_{\pi}$ in an MDP $\mathcal{M}$ , what infinitesimal change in $\pi$ would lead us to a target return of $R_{\text{target}}$ ? Here, we refer to such variation of $\pi$ as Counterfactual Explanation Policy conditioned on $R_{\text{target}}$ . In posing the counterfactual question in terms of target returns, we aim to get contrastive insights into what minimal changes to the current policy can result in improving/worsening its performance to a desired level. To estimate counterfactual explanation policy $\pi_{\text{cf}}$ , we first define the return penalty for achieving the target return as:

\displaystyle L_{\text{ret}}={}

\displaystyle\left\lVert J_{\pi}-R_{\text{target}}\right\rVert_{p},\text{where $\left\lVert\cdot\right\rVert_{p}$ is $\ell_{p}$-norm.}

(3)

In order to ensure that the counterfactual policy is similar to the original policy, we limit the changes while achieving the target return. We use the KL-divergence loss to measure the distance between the original policy $\pi_{0}$ and a given policy $\pi$ , i.e., ,

L_{\text{KL}}=D_{\text{KL}}(\pi_{0}||\pi)

(4)

Next, we can find the counterfactual policy $\pi_{\text{cf}}$ by minimizing the following objective:

	$\displaystyle\pi_{\text{cf}}={}$	$\displaystyle\operatorname*{arg\,min}_{\pi}~{}~{}L_{\text{ret}}+k\cdot L_{\text{KL}}$		(5)
	$\displaystyle={}$	$\displaystyle\operatorname*{arg\,min}_{\pi}~{}~{}\left\lVert J_{\pi}-R_{\text{target}}\right\rVert_{p}+k\cdot D_{\text{KL}}(\pi_{0}\|\|\pi),$		(5)

where $k$ is a regularization coefficient that ensures that the output counterfactual policy is close to $\pi_{0}$ .

4.1 Counterfactual Explanation Policy Optimization

In practice, RL policies are represented using function approximators (Sutton et al., 1999; Melo et al., 2008), e.g., a policy is represented using a neural network in deep RL. Let $\pi_{\theta}$ denote the policy approximated using a neural network function with parameters $\theta$ . We can rewrite Eqn. 5 as:

	$\displaystyle\pi_{\theta_{\text{cf}}}={}$	$\displaystyle\operatorname*{arg\,min}_{\theta}~{}~{}L_{\text{ret},\theta}+k\cdot L_{\text{KL},\theta}$		(6)
	$\displaystyle={}$	$\displaystyle\operatorname*{arg\,min}_{\theta}~{}~{}\left\lVert J_{\pi_{\theta}}-R_{\text{target}}\right\rVert_{p}+k\cdot D_{\text{KL}}(\pi_{\theta_{0}}\|\|\pi_{\theta})$		(6)

The above optimization requires the gradients of the counterfactual objectives w.r.t. $\theta$ , i.e., $\nabla_{\theta}(L_{\text{ret},\theta}+k\cdot L_{\text{KL},\theta})$ , which can be written as $\nabla_{\theta}L_{\text{ret},\theta}+k\cdot\nabla_{\theta}D_{\text{KL}}(\pi_{\theta_{0}}||\pi_{\theta})$ .

Computing the gradients of Eqn. 6 is not directly feasible using automated differentiation provided in standard deep learning libraries (Paszke et al., 2017; Abadi et al., 2016). Therefore, we detail the derivation of how to estimate the gradients of the two loss terms in Eqn. 6.

Estimating $\mathbf{\nabla_{\theta}L_{\text{ret},\theta}}$ . Without loss of generality, we calculate the gradients of the return penalty using $\ell_{1}$ -norm for the return loss objective $L_{\text{ret},\theta}$ , i.e.,

\displaystyle\nabla_{\theta}L_{\text{ret},\theta}=\nabla_{\theta}|J_{\pi_{\theta}}-R_{\text{target}}|={}

\displaystyle\text{sgn}(J_{\pi_{\theta}}-R_{\text{target}})\cdot\nabla_{\theta}J_{\pi_{\theta}},

(7)

where $\nabla_{\theta}J_{\pi_{\theta}}$ becomes the standard policy gradient (Kakade, 2001).

We approximate $J_{\pi_{\theta}}$ with the average return gathered through the on-policy rollout of $\pi_{\theta}$ in the given environment. Formally, for $N$ episodes sampled using $\pi_{\theta}$ , i.e., $\{\tau_{i}\}_{i=1}^{N}\sim\pi_{\theta}$ we have:

J_{\pi_{\theta}}\approx\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T_{i}-1}{\gamma^{t}r(s_{t},a_{t},s_{t+1})}

(8)

Further, using the policy gradient theorem (Kakade, 2001; Sutton & Barto, 2018), we can write $\nabla_{\theta}J_{\pi_{\theta}}=\mathbb{E}_{\pi_{\theta}}[Q_{\pi_{\theta}}(s,a)\nabla_{\theta}\text{log}(\pi_{\theta}(s,a))]$ . We approximate $Q_{\pi_{\theta}}(s,a)$ term by the return obtained starting from state $s$ and action $a~{}{\sim}~{}\pi_{\theta}(\cdot|s)$ for a sampled episode, i.e., $Q_{\pi_{\theta}}(s,a)~{}{\approx}~{}\sum_{j=t}^{T-1}~{}{\gamma^{j-t}r(s_{j},a_{j},s_{j+1}|s_{t}=s,a_{t}=a)}$ . Therefore, {fleqn}

		$\displaystyle\nabla_{\theta}J_{\pi_{\theta}}\approx\frac{1}{\sum_{i=1}^{N}{T_{i}}}\cdot$		(9)
		$\displaystyle\sum_{i=1}^{N}{\sum_{t=0}^{T_{i}-1}[\sum_{j=t}^{T_{i}-1}{\gamma^{j-t}r(s_{j},a_{j},s_{j+1}\|s_{t},a_{t})]\cdot\nabla_{\theta}\text{log}(\pi_{\theta}(s_{t},a_{t}))}}$		(9)

Note that the $J_{\pi_{\theta}}$ computation can be made more sample efficient by off-policy estimation (Munos et al., 2016; Jiang & Li, 2016) of $J_{\pi_{\theta}}$ on rollouts performed using $\pi_{\theta_{0}}$ (Degris et al., 2012; Schulman et al., 2015, 2017). However, for the sake of simplicity of exposition and implementation, we use On-policy Monte Carlo Policy Gradients, which can later be modified to their sample efficient versions depending on the environmental requirements.

Estimating $\mathbf{\nabla_{\theta}D_{\text{KL}}(\pi_{\theta_{0}}||\pi_{\theta})}$ . As discussed in Eqn. 4, we calculate KL-divergence between two policy distributions as the distance penalty for counterfactual policies. As estimating the exact KL-divergence between two policies (i.e., $\mathbb{E}_{s}[D_{\text{KL}}(\pi_{0}(\cdot|s)||\pi(\cdot|s))]$ ) is computationally expensive for large state-action spaces, we approximate it by averaging the KL-divergence calculated over states visited during sampling of $N$ episodes. Using states in $\{\tau_{i}\}_{i=1}^{N}\sim\pi_{\theta}$ , we write:

D_{\text{KL}}(\pi_{\theta_{0}}||\pi_{\theta})\approx\frac{\sum_{i=1}^{N}\sum_{t=0}^{T_{i}-1}D_{\text{KL}}(\pi_{\theta_{0}}(\cdot|s_{t})||\pi_{\theta}(\cdot|s_{t}))}{\sum_{i=1}^{N}{T_{i}}}

(10)

and finally approximate $\nabla_{\theta}D_{\text{KL}}(\pi_{\theta_{0}}||\pi_{\theta})$ as:

\nabla_{\theta}D_{\text{KL}}(\pi_{\theta_{0}}||\pi_{\theta})\approx\frac{\sum_{i=1}^{N}\sum_{t=0}^{T_{i}-1}\nabla_{\theta}D_{\text{KL}}(\pi_{\theta_{0}}(\cdot|s_{t})||\pi_{\theta}(\cdot|s_{t}))}{\sum_{i=1}^{N}{T_{i}}}

(11)

Additional considerations. Generating counterfactual explanation policies using the above gradients poses a significant practical challenge – the KL term conservatively restricts the policy to the neighborhood of the original policy $\pi_{\theta_{0}}$ hindering the actual aim of reaching $R_{\text{target}}$ .

Intuitively, we can tune the regularization hyperparameter $k$ on the KL-weight w.r.t. the $R_{\text{target}}$ at the start of the optimization. Nevertheless, we update the policy pivot $\pi_{\theta_{0}}$ iteratively to $[\pi_{\theta_{1}},\pi_{\theta_{2}},\dots]$ after every fixed $m$ number of gradient updates of $\pi_{\theta}$ to ensure it achieves the desired $R_{\text{target}}$ return. Finally, at the $i^{\text{th}}$ iteration, after $m$ steps of gradient updates of the objective:

\pi_{\theta_{\text{cf}}}=\operatorname*{arg\,min}_{\theta}(|J_{\pi_{\theta}}-R_{\text{target}}|+k\cdot D_{\text{KL}}(\pi_{\theta_{i-1}}||\pi_{\theta})),

(12)

we change the pivot policy from $\pi_{\theta_{i-1}}$ to $\pi_{\theta_{i}}$ using $\pi_{\theta}$ at that step. The above process is continued till $|J_{\pi_{\theta}}{-}R_{\text{target}}|$ is less than a certain threshold $\delta$ . In Algorithm 1, we describe the complete step-by-step algorithm of Counterpol, our proposed framework.

Given:

\pi_{\theta_{0}},R_{\text{target}},\delta,k,m,N,\eta

\pi_{\theta^{\prime}}\xleftarrow[]{}\pi_{\theta_{0}}.\text{copy}()

// Create copy of original policy, which will be updated

for i = $0,1,\dots$ do

for j = $0,1,\dots,m-1$ do

\{\tau\}_{i=1}^{N}\sim\pi_{\theta^{\prime}}

// Sample

N

episodes using policy under updation

J_{\pi_{\theta^{\prime}}}\xleftarrow[]{}\text{estimatePerformance}(\{\tau\}_{i=1}^{N})

// using Equation 8

/* Check if stopping criterion is met */

if $|J_{\pi_{\theta^{\prime}}}-R_{\text{target}}|<\delta$ then

\pi_{\theta_{\text{cf}}}\xleftarrow[]{}\pi_{\theta^{\prime}}

Exit both the loops.

end if

\nabla_{\theta}L_{\text{cf},\theta}\xleftarrow[]{}\nabla_{\theta}L_{\text{ret},\theta}\rvert_{\theta=\theta^{\prime}}+k\cdot\nabla_{\theta}L_{\text{KL}}(\pi_{\theta_{i}}||\pi_{\theta})\rvert_{\theta=\theta^{\prime}}

// Compute the objective gradient using Equations 9 and 11

\theta^{\prime}\xleftarrow[]{}\theta^{\prime}-\eta\cdot\nabla_{\theta}{L_{\text{cf},\theta}}\rvert_{\theta=\theta^{\prime}}

// Update parameters of

\pi_{\theta^{\prime}}

end for

\pi_{\theta_{i+1}}\xleftarrow{}\pi_{\theta^{\prime}}

// Update the KL-pivot policy to

\pi_{\theta^{\prime}}

end for

return counterfactual policy

\pi_{\theta_{\text{cf}}}

Algorithm 1 Counterpol: Counterfactual Explanation Policy Optimization

4.2 Connection between Counterfactual Explanation Policies and Trust Region Methods

Trust region-based RL policy optimization methods like TRPO (Schulman et al., 2015), ACKTR (Wu et al., 2017) and PPO (Schulman et al., 2017) have been foundational in the unprecedented success of deep RL (Berner et al., 2019; OpenAI, 2023). The primary aspect behind their optimization involves iteratively updating the policy within a trust region, leading to a monotonic improvement in the policy’s performance. Formally, the objective is defined as:

\pi_{\theta_{i+1}}=\operatorname*{arg\,max}_{\theta}~{}~{}J_{\pi_{\theta}}\text{~{}~{}such that~{}~{}}D_{\text{KL}}(\pi_{\theta_{i}}||\pi_{\theta})\leq\delta

(13)

The above objective can be written in its Lagrangian form using penalty instead of the constraint as:

	$\displaystyle\pi_{\theta_{i+1}}={}$	$\displaystyle\operatorname*{arg\,max}_{\theta}~{}~{}J_{\pi_{\theta}}-\lambda\cdot D_{\text{KL}}(\pi_{\theta_{i}}\|\|\pi_{\theta}),$		(14)
		where $\lambda$ is treated as a hyper-parameter.		(14)

Next, we show the equivalence between counterfactual explanation policy and the policy obtained using a trust region-based policy gradient method.

Theorem 4.1.

(Equivalence) Let $R_{\text{target}}$ be the maximum possible return in the MDP under consideration. Then, for any $L_{\text{ret}}$ estimated using $\ell_{1}$ -norm, the generated counterfactual explanation policy through iterative KL-pivoting is equivalent to optimizing the policy for best return using a trust region-based policy gradient method.

Proof.

For a given $R_{\text{target}}$ , let us assume that the desired return from a policy is equal to the maximum possible return, i.e., $R_{\text{target}}=R_{\text{max}}$ , and the return penalty calculated using $\ell_{1}$ -norm, we rewrite the counterfactual generation objective using KL-pivoting Eqn. 12 as:

\pi_{\theta_{\text{cf}}}=\operatorname*{arg\,min}_{\theta}(|J_{\pi_{\theta}}-R_{\text{max}}|+k\cdot D_{\text{kl}}(\pi_{\theta_{i}}||\pi_{\theta}))

(15)

We have $J_{\pi_{\theta}}=\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{T-1}\gamma^{t}r(s_{t},a_{t},s_{t+1})|s_{0}{\sim}d_{0},a_{t}\sim\pi_{\theta}(\cdot|s_{t}))]$ as defined in Eq. 2. Now, let $R{=}\sum_{t=0}^{T-1}\gamma^{t}r(s_{t},\\ a_{t},s_{t+1})$ denote the discounted return for the episode $(s_{0},a_{0},s_{1},a_{1},\dots,s_{T})$ , then $J_{\pi_{\theta}}=\mathbb{E}_{\pi_{\theta}}[R]$ . Further,

R\leq R_{\text{max}}\implies J_{\pi_{\theta}}=\mathbb{E}_{\pi_{\theta}}[R]\leq\mathbb{E}_{\pi_{\theta}}[R_{\text{max}}]=R_{\text{max}}

Hence, $J_{\pi_{\theta}}-R_{\text{max}}\leq 0$ , or $R_{\text{max}}-J_{\pi_{\theta}}\geq 0$ , which allows us to write $|J_{\pi_{\theta}}-R_{\text{max}}|$ simply as $(R_{\text{max}}-J_{\pi_{\theta}})$ . Rewriting Eqn. 15, we get:

$\displaystyle\pi_{\theta_{\text{cf}}}={}$	$\displaystyle\operatorname*{arg\,min}_{\theta}~{}~{}R_{\text{max}}-J_{\pi_{\theta}}+k\cdot D_{\text{KL}}(\pi_{\theta_{i}}\|\|\pi_{\theta})$	(16)
$\displaystyle={}$	$\displaystyle\operatorname*{arg\,min}_{\theta}~{}~{}-J_{\pi_{\theta}}+k\cdot D_{\text{KL}}(\pi_{\theta_{i}}\|\|\pi_{\theta})$
$\displaystyle={}$	$\displaystyle\operatorname*{arg\,max}_{\theta}~{}~{}J_{\pi_{\theta}}-k\cdot D_{\text{KL}}(\pi_{\theta_{i}}\|\|\pi_{\theta})$

Comparing the above equations with Eqn. 14 and treating the hyper-parameters $k$ and $\lambda$ interchangeably, we find that both these objectives of policy updation are the same, which completes our proof. ∎

Consequently, the trust region-based policy gradient methods can be interpreted as finding a counterfactual explanation policy of the original policy, which achieves the maximum possible return (i.e., the best performance). This also opens up possibility for using sophisticated ways to choose distance regulation parameter $k$ similar to $\lambda$ following (Kakade & Langford, 2002).

5 Experiments and Results

Here, we study the counterfactual explanations of policies trained using policy gradient methods, particularly the actor-critic methods (Sutton & Barto, 2018), as they converge more faithfully compared to vanilla policy gradients and also maintain the simplicity required to analyze them in a contrastive fashion rigorously.

Setup. We employ the widely used actor-critic algorithm of Advantage Actor-Critic (A2C) (Degris et al., 2012), and only in the case of complex environments, we shift to Proximal Policy Optimization (PPO) (Schulman et al., 2017) for training the original policies in our empirical analysis. We use the standard implementations provided in stable-baselines library ¹¹1Documentation for the library: https://stable-baselines3.readthedocs.io/en/master/ (Raffin et al., 2021) for RL training using the above-mentioned algorithms. We conduct our analysis in five OpenAI gym environments (Brockman et al., 2016), viz. i) CartPole-v1, ii) Acrobot-v1, iii) Pendulum-v1, iv) LunarLander-v2 and v) BipedalWalker-v3 ²²2Additional environment details can be found at https://gymnasium.farama.org/environments/.

Implementation details. We set the threshold return values of $\delta$ for the five environments to $\{10,2.5,37.5,5,10\}$ according to the order of magnitude of returns in their respective environment, the number of KL-pivoting policy iterations $m$ as 10 for all environments (except BipedalWalker, where $m=5$ ), and the number of trajectory rollouts $N$ to 10 for all four environments except BipedalWalker, where $N=2$ . Further, we chose the distance regularizer parameter $k$ values as $\{10,1,10^{5},10,1\}$ for the five environments in the same order as discussed above. For computation purposes, we use a single NVIDIA A100 (40GB) GPU. We share the code for generating counterfactual explanation policies using Counterpol in the supplementary material.

Table 1: Counterpol Optimization on Control Environments. Counterfactual explanations for a given policy

\pi_{0}

with performance

J_{\pi_{0}}

and three target returns. Shown are the number of outer policy updates (

n_{\pi}

) for generating counterfactuals and the resulting performance of the counterfactual policy

\pi_{\theta_{cf}}

. Counterpol faithfully estimates the counterfactuals across all the

\pi_{0}

R_{\text{target}}

pairs and diverse environments.

CartPole-v1
$J_{\pi_{\theta_{o}}}$	235.6			368.2			500.0
$R_{\text{target}}$	50	250	450	50	250	450	50	250	450
$n_{\pi}$	278.0 $\pm$ 22.9	10.0 $\pm$ 6.5	303.3 $\pm$ 67.1	108.0 $\pm$ 16.6	1340.0 $\pm$ 278.8	68.7 $\pm$ 5.0	833.3 $\pm$ 56.3	721.3 $\pm$ 14.7	557.3 $\pm$ 10.0
$J_{\pi_{\theta_{\text{cf}}}}$	48.6 $\pm$ 4.9	245.0 $\pm$ 4.4	450.6 $\pm$ 7.5	48.2 $\pm$ 3.4	245.5 $\pm$ 1.9	453.5 $\pm$ 0.5	56.4 $\pm$ 2.4	246.0 $\pm$ 4.7	452.1 $\pm$ 1.0
Acrobot-v1
$J_{\pi_{\theta_{o}}}$	-146.7			-89.0			-84.3
$R_{\text{target}}$	-120	-100	-80	-120	-100	-80	-120	-100	-80
$n_{\pi}$	44.0 $\pm$ 18.8	31.3 $\pm$ 1.9	18.0 $\pm$ 12.8	62.0 $\pm$ 34.8	38.0 $\pm$ 37.3	6.0 $\pm$ 2.8	117.3 $\pm$ 95.3	26.0 $\pm$ 36.8	8.7 $\pm$ 5.7
$J_{\pi_{\theta_{\text{cf}}}}$	-119.5 $\pm$ 1.3	-100.5 $\pm$ 2.1	-80.6 $\pm$ 1.6	-119.1 $\pm$ 0.5	-99.9 $\pm$ 0.2	-80.6 $\pm$ 0.5	-120.1 $\pm$ 1.5	-99.8 $\pm$ 1.8	-81.6 $\pm$ 0.3
Pendulum-v1
$J_{\pi_{\theta_{o}}}$	-853.5			-792.6			-568.0
$R_{\text{target}}$	-1000	-750	-500	-1000	-750	-500	-1000	-750	-500
$n_{\pi}$	1.3 $\pm$ 0.9	15.3 $\pm$ 14.8	174.7 $\pm$ 38.0	6.0 $\pm$ 3.3	10.0 $\pm$ 11.4	198.7 $\pm$ 157.4	113.3 $\pm$ 139.1	22.0 $\pm$ 25.7	6.0 $\pm$ 3.1
$J_{\pi_{\theta_{\text{cf}}}}$	-996.9 $\pm$ 19.6	-773.5 $\pm$ 6.4	-507.9 $\pm$ 19.7	-992.1 $\pm$ 25.9	-777.1 $\pm$ 6.2	-507.3 $\pm$ 7.1	-997.6 $\pm$ 15.1	-764.9 $\pm$ 19.8	-501.0 $\pm$ 8.8

Refer to caption — Figure 1: Counterpol on LunarLander-v2. We showcase rollouts of different policies trained to land the rover by sampling frames at every 5 ${}^{\text{th}}$ step and tracking the rover’s center in green. Slight modifications to the original policy can improve (columns 2-3) or worsen (columns 4-5) its overall returns.

5.1 Results on Counterpol Optimization

We conduct experiments to understand and verify the optimization of our proposed Counterpol framework. In doing so, we present the results of Counterpol optimization on A2C-trained agents of CartPole-v1, Acrobot-v1, and Pendulum-v1 environments. We sample three distinct policy checkpoints from the A2C training of each RL environment and generate counterfactual policies using Counterpol for these checkpoints and three different target returns $R_{\text{target}}$ (chosen with respect to the RL environment). We train each tuple of the RL environment, A2C checkpoint, and target return for three different seeds to reduce variance arising from stochasticity. Our results in Table 1 show that Counterpol faithfully achieves target returns for diverse starting checkpoints across all environments, i.e., Counterpol converges to the target return value generating $\pi_{\text{cf}}$ that obtains return very close to the given $R_{\text{target}}$ . Further, we find an intuitive trend in the number of outer policy (KL-pivoting) updates, where we observe a lesser number of outer policy updates when the $R_{\text{target}}$ value is closer to the performance of the original policy $J_{\pi_{\theta_{0}}}$ . Our results on these standard control environments form the basis for further investigation of more complex environments having a discrete or a continuous action space, which we explore in the next section.

5.2 Contrastive Insights into RL policies

Next, we present our analysis on generating counterfactual explanations using more complex environments.

1) Lunar Lander. We train an RL agent on LunarLander-v2 using A2C, and intervene the training to retrieve the policy $\pi_{0}$ for our contrastive analysis. The original policy, on average, achieves a return of 50 (refer Figure 2(a)). We then generate its counterfactuals for target returns of $\{100,150\}$ to understand how the policy can be improved further and also for target returns of $\{0,-50\}$ to understand how the policy can become worse. We present landing scenarios of the policy $\pi_{0}$ on three different surfaces in Figure 1 and their respective contrastive explanations for improvement and deterioration. Our results for $\pi_{0}$ rollouts show interesting agent characteristics like “slow start”, “a quick fall after covering half the distance,” and “landing near the right flag”. We find an improved version of the original policy in the generated counterfactuals with $R_{\text{target}}=100$ (Fig. 1; column 2), where we observe that a uniform descent and shifting the landing slightly to the left between the two flags can improve the original policy $\pi_{0}$ . Further, the counterfactual with $R_{\text{target}}=150$ (Fig. 1; column 3) shows uniform descent with decelerated landing can lead to even higher gains. In contrast, $R_{\text{target}}=0$ (Fig. 1; column 4) shows how by starting very fast and making free fall before landing outside the space between flags, $\pi_{0}$ can go worse. Similarly, $R_{\text{target}}=-50$ (Fig. 1; column 5) show how the policy could worsen/collapse by making the agent land further right. Notably, the generated counterfactuals for targeted deterioration of performance using Counterpol can be interpreted as a robust way to unlearn (Sekhari et al., 2021; Nguyen et al., 2022) RL skills as it might be required to forget certain aspects of learning(e.g., in Fig. 1, the slow start of $\pi_{0}$ ).

2) Bipedal Walker. For the BipedalWalker-v3 environment, we train a PPO agent that demonstrates early success in achieving walking behavior. We conduct experiments to analyze the trained Bipedal agent contrastively by estimating counterfactuals at target returns of 50 and 150 (refer Fig. 2(b)). We demonstrate the qualitative results in Figure 3, where we find that the given policy $\pi_{0}$ has a peculiar walk, kneeling on the right leg and taking stride with the left one. In contrast, our generated counterfactual policy using $R_{\text{target}}=150$ shows improved (upright and faster) walk of the Bipedal agent. Further, when we reduce the target return to $R_{\text{target}}=50$ , we observe that the kneeling gets intensified and the agent starts to drag itself to the finishing line, making the agent to walk slow and also fall (as shown in Scenario 1).

Across both environments, we observe that Counterpol generates counterfactual policies that look similar to the original policy, keeping the essential characteristics of the

policy the same. This enables us to easily contrast between different versions of the same policies.

6 Conclusion and Discussion

In this work, we present a systematic framework for contrastive analysis of reinforcement learning policies. In doing so, we propose Counterpol for generating counterfactual explanations for a given RL policy. We demonstrate the flexibility of Counterpol across multiple RL benchmarks and find insights into RL policy learning. We carry out a detailed theoretical analysis to connect the counterfactual estimation with eminent trust region optimization methods in RL. Further, results across five OpenAI gym environments show that Counterpol generates faithful counterfactuals for (un)learning skills while keeping close to the original policy. To the best of our knowledge, our work presents the first technique to generate counterfactual explanation policies. Overall, we believe our work paves the way toward a new direction in the methodical contrastive analysis of RL policies and brings more transparency to RL decision-making. In our present work, we mainly used on-policy policy gradient-based RL optimization techniques, which are not very sample efficient, and we assume the stochastic nature of the policies which might not be the case always. It would be interesting to explore the utility of Counterpol in unlearning options (skills) in hierarchical RL (Barto & Mahadevan, 2003; Pateria et al., 2021). Also, Counterpol gives a glimpse into return contours in RL which could be further explored for enhanced optimization.

References

Abadi et al. (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv, 2016.
Asadi et al. (2023) Asadi, K., Fakoor, R., Gottesman, O., Kim, T., Littman, M. L., and Smola, A. J. Faster deep reinforcement learning with slower online network, 2023.
Barto & Mahadevan (2003) Barto, A. G. and Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 2003.
Bellman (1957) Bellman, R. A markovian decision process. Journal of mathematics and mechanics, 1957.
Berner et al. (2019) Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv, 2019.
Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv, 2016.
Coppens et al. (2019) Coppens, Y., Efthymiadis, K., Lenaerts, T., Nowé, A., Miller, T., Weber, R., and Magazzeni, D. Distilling deep reinforcement learning policies in soft decision trees. In IJCAI workshop on Explainable Artificial Intelligence, 2019.
Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. Off-policy actor-critic. arXiv, 2012.
(9) Deshmukh, S. V., Dasgupta, A., Krishnamurthy, B., Jiang, N., Agarwal, C., Theocharous, G., and Subramanian, J. Explaining rl decisions with trajectories. In ICLR.
Ding et al. (2022a) Ding, D., Wei, C.-Y., Zhang, K., and Jovanovic, M. Independent policy gradient for large-scale Markov potential games: Sharper rates, function approximation, and game-agnostic convergence. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), ICML, volume 162 of PMLR. PMLR, 17–23 Jul 2022a. URL https://proceedings.mlr.press/v162/ding22b.html.
Ding et al. (2022b) Ding, W., Nakai, K., and Gong, H. Protein design via deep learning. Briefings in bioinformatics, 2022b.
Fawzi et al. (2022) Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 2022.
Gershman (2017) Gershman, S. J. Reinforcement learning and causal models. The Oxford handbook of causal reasoning, 2017.
Greydanus et al. (2018) Greydanus, S., Koul, A., Dodge, J., and Fern, A. Visualizing and understanding atari agents. In ICML, 2018.
Hirano et al. (2022) Hirano, M., Sakaji, H., and Izumi, K. Policy gradient stock gan for realistic discrete order data generation in financial markets, 2022.
Iyer et al. (2018) Iyer, R., Li, Y., Li, H., Lewis, M., Sundar, R., and Sycara, K. Transparency and explanation in deep reinforcement learning neural networks. In AIES, 2018.
Jiang & Li (2016) Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In ICML. PMLR, 2016.
Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, 2002.
Kakade (2001) Kakade, S. M. A natural policy gradient. NeurIPS, 2001.
Karimi et al. (2020) Karimi, A.-H., Barthe, G., Balle, B., and Valera, I. Model-agnostic counterfactual explanations for consequential decisions. In AISTATS, 2020.
Khan et al. (2020) Khan, A., Tolstaya, E., Ribeiro, A., and Kumar, V. Graph policy gradients for large scale robot control. In Kaelbling, L. P., Kragic, D., and Sugiura, K. (eds.), Proceedings of the Conference on Robot Learning, PMLR. PMLR, 30 Oct–01 Nov 2020. URL https://proceedings.mlr.press/v100/khan20a.html.
Koedinger et al. (2013) Koedinger, K. R., Brunskill, E., Baker, R. S., McLaughlin, E. A., and Stamper, J. New potentials for data-driven intelligent tutoring system development and optimization. AI Magazine, 2013.
Lu et al. (2023) Lu, P., Qiu, L., Chang, K.-W., Wu, Y. N., Zhu, S.-C., Rajpurohit, T., Clark, P., and Kalyan, A. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning, 2023.
Maggipinto et al. (2020) Maggipinto, M., Susto, G. A., and Chaudhari, P. Proximal deterministic policy gradient. In IROS, 2020. doi: 10.1109/IROS45743.2020.9341559.
Mahajan et al. (2019) Mahajan, D., Tan, C., and Sharma, A. Preserving causal constraints in counterfactual explanations for machine learning classifiers. arXiv, 2019.
Melo et al. (2008) Melo, F. S., Meyn, S. P., and Ribeiro, M. I. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning, 2008.
Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv, 2013.
Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. NeurIPS, 2016.
Nguyen et al. (2022) Nguyen, T. T., Huynh, T. T., Nguyen, P. L., Liew, A. W.-C., Yin, H., and Nguyen, Q. V. H. A survey of machine unlearning. arXiv, 2022.
Nitanda (2014) Nitanda, A. Stochastic proximal gradient descent with acceleration techniques. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), NeurIPS. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/d554f7bb7be44a7267068a7df88ddd20-Paper.pdf.
OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
Parikh et al. (2014) Parikh, N., Boyd, S., et al. Proximal algorithms. Foundations and trends® in Optimization, 2014.
Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Pateria et al. (2021) Pateria, S., Subagdja, B., Tan, A.-h., and Quek, C. Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR), 2021.
Puiutta & Veith (2020) Puiutta, E. and Veith, E. M. Explainable reinforcement learning: A survey. In Machine Learning and Knowledge Extraction: 4th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2020, Dublin, Ireland, August 25–28, 2020, Proceedings 4. Springer, 2020.
Puri et al. (2019) Puri, N., Verma, S., Gupta, P., Kayastha, D., Deshmukh, S., Krishnamurthy, B., and Singh, S. Explain your move: Understanding agent actions using specific and relevant feature attribution. arXiv, 2019.
Raffin et al. (2021) Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-baselines3: Reliable reinforcement learning implementations. JMLR, 2021. URL http://jmlr.org/papers/v22/20-1364.html.
Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML. PMLR, 2015.
Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv, 2017.
Sekhari et al. (2021) Sekhari, A., Acharya, J., Kamath, G., and Suresh, A. T. Remember what you want to forget: Algorithms for machine unlearning. NeurIPS, 2021.
Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 2016.
Stepin et al. (2021) Stepin, I., Alonso, J. M., Catalá, A., and Pereira-Fariña, M. A survey of contrastive and counterfactual explanation generation methods for explainable artificial intelligence. IEEE Access, 2021.
Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. NeurIPS, 1999.
Theocharous et al. (2015) Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. Ad recommendation systems for life-time value optimization. In WWW, 2015.
Ustun et al. (2019) Ustun, B., Spangher, A., and Liu, Y. Actionable recourse in linear classification. In FAacT, 2019.
Van Looveren & Klaise (2019) Van Looveren, A. and Klaise, J. Interpretable counterfactual explanations guided by prototypes. arXiv, 2019.
Verma et al. (2018) Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In ICML. PMLR, 2018.
Verma et al. (2020a) Verma, S., Dickerson, J., and Hines, K. Counterfactual explanations for machine learning: A review. arXiv, 2020a.
Verma et al. (2020b) Verma, S., Dickerson, J. P., and Hines, K. E. Counterfactual explanations for machine learning: A review. ArXiv, abs/2010.10596, 2020b.
Wachter et al. (2018) Wachter, S., Mittelstadt, B., and Russell, C. Counterfactual explanations without opening the black box: automated decisions and the gdpr. Harvard Journal of Law & Technology, 31(2), 2018.
Wu et al. (2017) Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. NeurIPS, 2017.