
Risk-Sensitive Reinforcement Learning: a Martingale Approach to Reward Uncertainty

Nelson Vadori, Sumitra Ganesh, Prashant Reddy, Manuela Veloso
JP Morgan AI Research
{nelson.n.vadori, sumitra.ganesh, prashant.reddy, manuela.veloso}@jpmorgan.com
Abstract

We introduce a novel framework to account for sensitivity to rewards uncertainty in sequential decision-making problems. While risk-sensitive formulations for Markov decision processes studied so far focus on the distribution of the cumulative reward as a whole, we aim at learning policies sensitive to the uncertain/stochastic nature of the rewards, which has the advantage of being conceptually more meaningful in some cases. To this end, we present a new decomposition of the randomness contained in the cumulative reward based on the Doob decomposition of a stochastic process, and introduce a new conceptual tool - the chaotic variation - which can rigorously be interpreted as the risk measure of the martingale component associated to the cumulative reward process. We innovate on the reinforcement learning side by incorporating this new risk-sensitive approach into model-free algorithms, both policy gradient and value function based, and illustrate its relevance on grid world and portfolio optimization problems.

1 Introduction

Classical reinforcement learning (RL) aims at deriving policies that maximize the expected value of the sum (or average) of future rewards, while so-called risk-sensitive RL aims at incorporating some measure of variability of the cumulative reward into the learned policy, usually through a risk criterion such as variance or a risk measure (e.g. entropic, CVaR). The common goal of the existing risk-sensitive RL literature (cf. section 1.3) is to take into account the distribution of the cumulative reward in order to learn a variety of policies, usually parametrized by a risk parameter such as the mean-variance trade-off, the CVaR percentile or an upper bound on variance. For example, in the mean-variance case, learned policies typically lead to a distribution of cumulative rewards with lower mean but also lower variance (hence we gain confidence in the outcome at the expense of its mean).

In the existing literature, the chosen risk criterion is applied to the cumulative reward as a whole, and hence does not distinguish between the different sources of randomness contained in it. The motivating example of section 1.2 and the Doob decomposition of section 2 show that the randomness contained in the cumulative reward actually consists of two components of different nature and with different practical interpretations. If we denote by $R_{\pi}(s_{t},s_{t+1}):=R(s_{t},\pi(s_{t}),s_{t+1})$ the reward obtained at time $t+1$ under policy $\pi$, the first, chaotic component exactly captures the non-deterministic nature of the reward, i.e. that it is uncertain (or stochastic) when action $a_{t}=\pi(s_{t})$ is chosen in state $s_{t}$ (as it depends on $s_{t+1}$). This component is zero if the reward is deterministic. The second, predictable component replaces the uncertain rewards by their predictable/deterministic projections $\mathbb{E}[R_{\pi}(s_{t},s_{t+1})|s_{t},\pi(s_{t})]$, and hence does not depend on their uncertainty/stochasticity but acts as if the reward received at time $t+1$ after choosing action $\pi(s_{t})$ were always what we predicted it to be based on information available at time $t$. This component possesses some variability due to the switching between states, but none due to the uncertain nature of the rewards. Interestingly, in the average reward case, these two components also appear in the limiting process of the central limit theorem for functionals of Markov chains, applied to the average reward $\frac{1}{n}\sum_{t=0}^{n}R_{\pi}(s_{t},s_{t+1})$. Indeed, this term, minus its long-range (ergodic) mean and rescaled by $\sqrt{n}$, converges in distribution as $n\to+\infty$ to the sum of two independent normal random variables with respective variances $\sigma_{deter.}^{2}$ and $\sigma_{chaotic}^{2}$, where the latter depends on the uncertainty of the rewards, whereas the former only depends on their deterministic projections $\mathbb{E}[R_{\pi}(s_{t},s_{t+1})|s_{t},\pi(s_{t})]$ as discussed above. We refer to the supplementary material for more details on this observation.

In this work, we develop a new approach, complementary to the existing literature, that learns policies sensitive to the variability contained in the chaotic component reflecting the uncertainty/stochasticity of the rewards discussed above. We work in a framework where the reward $R(s_{t},a_{t},s_{t+1},h_{t+1})$ received at $t+1$ after taking action $a_{t}$ is uncertain in the sense that it depends on the next state $s_{t+1}$ and/or possibly on extra "hidden" information $h_{t+1}$ (cf. the setting of Shen et al. (2014) for example).

We believe that this approach is of interest for the following reasons. First, as mentioned in Chow et al. (2018), deriving risk criteria that are conceptually meaningful is a topic of current research. In our case, sensitivity to the chaotic reward component can nicely be interpreted as sensitivity to the uncertainty of the future reward when an action $a_{t}$ is chosen, in other words to the difference between the actual reward received at $t+1$ and our best guess of it given information available at time $t$, which is a natural human behavior adopted in many real-world situations. For example, in section 5 we show on a portfolio optimization problem how classical mean-variance risk-sensitive RL counterintuitively leads to reducing investment in both the risky and the risk-free asset as the risk aversion parameter increases (so that we stop taking advantage of the risk-free asset), whereas our new chaotic mean-variance algorithm reduces investment in the risky asset only, which is intuitive since the evolution of the latter is unknown when the investment decision is made and hence can potentially yield significant losses.

Second, depending on the application in mind, mixing the above mentioned predictable and chaotic sources of randomness together and treating them as a single "noise" may lead to undesirable and counterintuitive learned policies, and a more subtle understanding of the randomness is needed in order to gain more interpretability in the learned risk-sensitive policies (cf. section 1.2). It was mentioned in Mannor & Tsitsiklis (2011) that variance as a risk criterion may sometimes lead to counterintuitive policies, in that an agent who has received unexpectedly large rewards may seek to incur losses in order to keep the variance small, since variance penalizes deviations of the cumulative reward from its expected value. Hence, Prashanth & Ghavamzadeh (2016) have considered per-period variance instead. An interesting observation is that when applied to the chaotic component only (which will prove to be a martingale), these two notions of overall and per-period variance are bridged by the concept of the quadratic variation $\left<M\right>$ of a martingale $M$, due to the equality $\mathbb{E}[M^{2}]=\mathbb{E}[\left<M\right>]$ (proposition 1): penalizing reward uncertainty/stochasticity will not force the cumulative reward towards some baseline level and hence will not lead to the counterintuitive policies mentioned above.

1.1 Main contributions

We provide in section 2 a novel, conceptually meaningful decomposition of the cumulative reward process that distinguishes between the different sources of randomness contained within it (chaotic and predictable, cf. discussion above). The key idea of the paper is to apply the Doob decomposition of a stochastic process to the cumulative reward process. We believe that this decomposition in itself is of interest in gaining a better understanding of the variability contained within the reward process.

We then introduce in section 3 a new definition of risk that exactly captures reward uncertainty risk, i.e. the risk related to its non-deterministic nature: the chaotic variation associated to the reward process (and to a risk measure).

We incorporate for the first time reward uncertainty risk into model-free value-function based and policy gradient algorithms: the related modified versions of some canonical RL algorithms, so-called Chaotic Mean-Variance algorithms (CMV), are presented in section 4. These algorithms are applied to grid world and portfolio optimization problems in section 5. Although the focus in the main text is on the chaotic mean-variance case, our analysis in section 3 allows for general risk measures and we discuss some of them in the supplementary, such as the chaotic CVaR and Sharpe ratio.

1.2 Motivating example

As mentioned above, we are interested in developing a general RL framework to account for reward uncertainty in learned policies. The example presented in this section illustrates what can go wrong when taking variance as a measure of risk, as is commonly the case in the risk-sensitive RL literature (cf. section 1.3). We consider an episodic toy example of $T$ timesteps with 2 states and 2 actions, where at each transition both states are equally likely, i.e. $P(s_{0}=2)=P(s_{0}=1)=P(s_{t+1}=1|s_{t},a_{t})=P(s_{t+1}=2|s_{t},a_{t})=\frac{1}{2}$. If $s_{t}=1$, the reward obtained at $t+1$ is 2 if $a_{t}=1$ and $4+\sigma h_{t+1}$ if $a_{t}=2$; if $s_{t}=2$, the rewards obtained at $t+1$ are 10 if $a_{t}=1$ and $8+\sigma h_{t+1}$ if $a_{t}=2$, where the $h_{t}$ are i.i.d. with zero mean and unit variance, and the noise level $\sigma\geq 0$. Think of action $a_{t}=2$ as investing in a risky stock at time $t$ and receiving the corresponding price change at time $t+1$. If $\pi_{i}$ is the policy that always selects action $i$, then $\pi_{1},\pi_{2}$ have the same expected cumulative reward of $6T$. We plot in figure 1 the rewards obtained by the 2 policies in each state.

It can be proved that if the noise $\sigma^{2}$ is small enough (cf. supplementary material for a precise quantitative statement), policy $\pi_{2}$ will be established as less risky than $\pi_{1}$ according to the variance criterion. Nevertheless, we know that the cumulative reward earned using $\pi_{1}$ over $T$ timesteps is bounded from below by $2T$ with probability 1. On the other hand, since $\pi_{2}$ contains the noise term $h_{t}$, there is always a positive probability that the reward hits any arbitrarily low value. That is, $\pi_{2}$ leaves the agent with a component that could constitute a true risk for him, whereas $\pi_{1}$ does not, yet $\pi_{1}$ is still established as the riskier policy according to the variance criterion. Hence, the risky noise component is not detected: section 2 will establish that this risky component is in fact the martingale associated with the Doob decomposition of the cumulative reward process. In contrast, the conceptual tool introduced in this paper, the chaotic variation, when applied to the mean-variance case, leads to a risk penalty term of 0 for $\pi_{1}$ and proportional to $\sigma^{2}$ for $\pi_{2}$ (cf. supplementary material), i.e. exactly the desired outcome that the penalty term is proportional to reward uncertainty only.

Another observation is the following: if $\sigma=0$, rewards are deterministic and the obvious optimal policy is to simply take the best reward in each state, i.e. $a_{t}=2$ if $s_{t}=1$ and $a_{t}=1$ if $s_{t}=2$. This optimal strategy generates a variance of $9T$ over $T$ timesteps, due to the switching between states. If one chooses the risk aversion coefficient in the mean-variance trade-off to be high enough, one will then choose not to execute that obvious optimal policy and rather execute policy $\pi_{2}$, as it generates a lower variance of $4T$ over $T$ timesteps. On the other hand, the chaotic variation, when applied to the mean-variance case, leads to a risk penalty term of 0 for all policies (since rewards are deterministic) and hence one will always choose the optimal policy, whatever the risk aversion coefficient is.
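To make the comparison concrete, the following minimal sketch (our own illustration, not part of the paper's experiments) simulates the toy example with numpy and compares the empirical variance of the cumulative reward under $\pi_{1}$ and $\pi_{2}$ with the realized chaotic penalty $\frac{\beta}{2}\sum_{t}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}$, which is 0 for $\pi_{1}$ and proportional to $\sigma^{2}$ for $\pi_{2}$; the function name `simulate` and all numerical settings are illustrative assumptions.

```python
import numpy as np

def simulate(policy_action, T=200, sigma=0.16, n_paths=20000, rng=None):
    """Simulate the 2-state toy MDP; return cumulative rewards and realized chaotic penalties."""
    rng = np.random.default_rng(rng)
    states = rng.integers(1, 3, size=(n_paths, T))           # states are 1 or 2, equally likely
    h = rng.standard_normal((n_paths, T))                     # zero-mean, unit-variance noise
    # conditional mean reward R_bar(s, a) for the two constant policies
    mean_reward = {1: {1: 2.0, 2: 4.0}, 2: {1: 10.0, 2: 8.0}}
    a = policy_action
    r_bar = np.where(states == 1, mean_reward[1][a], mean_reward[2][a])
    noise = sigma * h if a == 2 else 0.0                      # only action 2 has stochastic rewards
    rewards = r_bar + noise
    cumulative = rewards.sum(axis=1)
    chaotic_penalty = ((rewards - r_bar) ** 2).sum(axis=1)    # realized sum of squared surprises
    return cumulative, chaotic_penalty

for a in (1, 2):
    cum, pen = simulate(a)
    print(f"pi_{a}: mean={cum.mean():.1f}, variance={cum.var():.1f}, "
          f"chaotic penalty (beta=1)={0.5 * pen.mean():.2f}")
# Expected orders of magnitude for T=200: Var(pi_1) ~ 16*T > Var(pi_2) ~ (4+sigma^2)*T,
# while the chaotic penalty is 0 for pi_1 and ~ (sigma^2/2)*T for pi_2.
```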


Figure 1: Rewards for the 2 policies in state 1 (timesteps 1 to 200) and state 2 (timesteps 200 to 400). State transitions are random, but for readability we have grouped together on the plot the timesteps corresponding to each state. The noise level is $\sigma=0.16$.

1.3 Related work

There exists an interesting and growing body of literature on the topic of risk-sensitive RL. Our work contrasts with the latter in that we focus on learning policies sensitive to the uncertain nature of the rewards, whereas previous works apply a risk criterion to the cumulative reward as a whole and hence do not distinguish between the different sources of randomness contained in it. Previously studied algorithms can be split into those that (i) are value-function based (cf. value iteration and policy iteration), and (ii) employ policy gradient methods that update the policy parameters in the direction of the gradient of the risk-flavored performance metric. In category (i), we cite the work of Mihatsch & Neuneier (2002), in which TD increments are transformed with a piecewise linear function, and Borkar (2002), Borkar (2005), Borkar (2010), which focus on the exponential utility function in an ergodic average-reward type setting. Works in category (ii) are more numerous, and can be split into those that use (Actor-Critic) and do not use (Monte Carlo) a specific representation of the value function, the former usually requiring TD-type updates of the value function. For large action/state spaces, one typically employs function approximation of the value function in the actor-critic setting, which introduces bias but improves variance. On the RL front, Prashanth & Fu (2018) provides a thorough literature review. Some references are interested in the full distribution of the cumulative reward, cf. Morimura et al. (2010a), Morimura et al. (2010b), Bellemare et al. (2017), Defourny et al. (2008), while others focus on specific risk measures, e.g. CVaR (Tamar, A., Glassner, Y., Manor, S. (2015), Chow et al. (2018)). Some authors define their own notion of risk with specific applications in mind, such as Geibel & Wysotzki (2005), for which risk refers to the likelihood of reaching so-called "error states". Tamar et al. (2015) provides a general methodology to handle all coherent risk measures.

The most-studied framework remains the variance-related one. Variance can be understood as the variance of the cumulative reward, or as the sum of variances of step-by-step rewards, cf. the work of Prashanth & Ghavamzadeh (2016), Tamar et al. (2016), Tamar et al. (2012) (in a stochastic shortest path context), Prashanth & Ghavamzadeh (2013). Sobel, MJ (1982) was among the first to study the mean-variance case, and explains why using variance as a measure of risk may cause problems, more specifically the inability to use policy iteration type algorithms due to the lack of monotonicity of the variance operator, while Mannor & Tsitsiklis (2011) show that these problems can be hard to solve. Tamar et al. (2016) provide a methodology to estimate the variance of the cumulative reward via a TD-type approach.

2 Markov Decision Processes: Doob decomposition and Martingale formulation

We study a Markov Decision Process (MDP) $(\mathbb{S},\mathbb{A},P,R,\gamma,H,\mathbb{H})$, where we allow the reward function $R$ to be general of the form $R_{t+1}:=R(s_{t},a_{t},s_{t+1},h_{t+1})$, with a hidden process $h_{t}\in\mathbb{H}$ as in Shen et al. (2014). The kernels $P$ and $H$ are defined in assumption 1. We use the convention that the reward $R_{t+1}$ is received at time $t+1$ after taking action $a_{t}\in\mathbb{A}$ in state $s_{t}\in\mathbb{S}$ at time $t$. We denote the cumulative reward to go and the conditional average reward by:

\mathcal{R}_{t}:=\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}R_{t^{\prime}+1},\qquad\overline{R}(s_{t},a_{t}):=\mathbb{E}[R_{t+1}|s_{t},a_{t}]

To keep notation consistent between the episodic and discounted reward settings we impose as usual $\gamma<1$, with $\gamma=1$ allowed only in the episodic setting (in the latter case the process stays in the same absorbing state forever once it is reached, with zero associated rewards). The analysis of this section is amenable to the average-reward formulation, which is detailed in the supplementary material. We make the following assumption throughout the paper; it is needed in particular in the proof of theorem 1.

Assumption 1.

$R(\cdot,\cdot,\cdot,\cdot)$ is uniformly bounded and the probability kernels $P$ and $H$ satisfy $\mathbb{P}[s_{t+1}\in B|s_{t},a_{t},h_{t}]=P(s_{t},a_{t},B)$ and $\mathbb{P}[h_{t+1}\in B|s_{t},a_{t},h_{t},s_{t+1}]=H(s_{t},a_{t},s_{t+1},B)$.

We need the following notation:

\mathcal{R}_{n,t}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}R_{t^{\prime}+1},\qquad\mathcal{F}_{n}:=\sigma(s_{t},a_{t},h_{t},\,t\leq n)

$\mathcal{F}_{n}$ is the sigma-algebra generated by the MDP up to and including time $n$, i.e. the information available up to time $n$.

Definition 1.

We recall that a discrete-time process $X$ is adapted to the filtration $\mathbb{F}:=(\mathcal{F}_{n})_{n\geq 0}$ if $X_{n}$ is $\mathcal{F}_{n}$-measurable for all $n$; it is predictable if $X_{n}$ is $\mathcal{F}_{n-1}$-measurable for all $n$ (i.e. it is known one timestep in advance); it is an $\mathbb{F}$-martingale if, for all $n$, $X_{n}$ is $\mathcal{F}_{n}$-measurable and $\mathbb{E}[X_{n+1}-X_{n}|\mathcal{F}_{n}]=0$.

The observation below is the cornerstone of this paper, as it formalizes precisely the discussion carried out in section 1 about the decomposition of the reward variability into two conceptually meaningful components. The martingale component captures the uncertain/stochastic part of the reward in section 1.2. The key idea is to apply the Doob decomposition of a stochastic process (cf. supplementary material) to the reward process in order to gain a better understanding of its variability.

Theorem 1.

(Doob decomposition of the reward) Let $\pi$ be a policy. $\mathcal{R}_{n,t}$ can be decomposed in a unique way $\mathbb{P}_{\pi}$-a.s. into the sum of i) a $\mathcal{F}_{t}$-measurable random variable, ii) a zero-mean predictable process $\mathcal{R}_{n,t}^{\pi,pred}$ and iii) a zero-mean martingale $\mathcal{R}_{n,t}^{chaos}$, with respect to the filtration $\mathcal{F}_{n}$. The decomposition is given by:

\mathcal{R}_{n,t}=\mathbb{E}_{\pi}[\mathcal{R}_{n,t}|s_{t}]+\mathcal{R}_{n,t}^{\pi,pred}+\mathcal{R}_{n,t}^{chaos}
\mathcal{R}_{n,t}^{\pi,pred}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}(\overline{R}(s_{t^{\prime}},a_{t^{\prime}})-\mathbb{E}_{\pi}[R_{t^{\prime}+1}|s_{t}])
\mathcal{R}_{n,t}^{chaos}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}(R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))

Denoting $\mathcal{R}_{t}^{\pi,pred}:=\mathcal{R}_{\infty,t}^{\pi,pred}$ and $\mathcal{R}_{t}^{chaos}:=\mathcal{R}_{\infty,t}^{chaos}$ we get, taking the limit as $n\to+\infty$:

\mathcal{R}_{t}=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t}]+\mathcal{R}_{t}^{\pi,pred}+\mathcal{R}_{t}^{chaos}
Definition 2.

We call $\mathcal{R}_{t}^{chaos}$ (resp. $\mathcal{R}_{t}^{\pi,pred}$) the chaotic (resp. predictable) reward process associated with $\mathcal{R}_{t}$.

The Doob decomposition of theorem 1 consists of decomposing rewards into:

  • the predictable component $\overline{R}(s_{t^{\prime}},a_{t^{\prime}})-\mathbb{E}_{\pi}[R_{t^{\prime}+1}|s_{t}]$, which accounts for deviations of the predictable projection of the uncertain rewards (i.e. the best guess of the unknown reward $R_{t^{\prime}+1}$ given information at time $t^{\prime}$) from the overall average reward. This process possesses some variance due to the switching between states, but none due to reward uncertainty.

  • the chaotic component $R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}})$, which exactly captures the uncertainty of the reward by computing the difference between its actual realization and what we expected it to be based on the information available at time $t^{\prime}$. In other words, it captures the "surprise" part of the reward. $\mathcal{R}_{t}^{chaos}=0$ if and only if $R$ is a deterministic reward, in which case we have $R_{t+1}=\overline{R}(s_{t},a_{t})$ with probability 1.
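As an illustration, the sketch below (our own simplified example, not code from the paper) computes the three terms of theorem 1 from time $t=0$ for one sampled trajectory, assuming the conditional mean rewards $\overline{R}(s_{k},a_{k})$ and the initial-state predictions $\mathbb{E}_{\pi}[R_{k+1}|s_{0}]$ are supplied as inputs (in practice these would come from the model or from Monte Carlo estimates).

```python
import numpy as np

def doob_decomposition(rewards, r_bar, r_pred_s0, gamma=1.0):
    """
    Decompose the (discounted) reward-to-go from time t=0 of one trajectory, as in theorem 1.
      rewards[k]   = R_{k+1}, the reward received after step k
      r_bar[k]     = R_bar(s_k, a_k), the conditional mean reward given (s_k, a_k)
      r_pred_s0[k] = E_pi[R_{k+1} | s_0], the reward predicted from the initial state only
    Returns (mean_term, predictable_part, chaotic_part); their sum equals the realized return.
    """
    rewards, r_bar, r_pred_s0 = map(np.asarray, (rewards, r_bar, r_pred_s0))
    discount = gamma ** np.arange(len(rewards))
    chaotic = np.sum(discount * (rewards - r_bar))        # martingale ("surprise") component
    predictable = np.sum(discount * (r_bar - r_pred_s0))  # predictable, zero-mean component
    mean_term = np.sum(discount * r_pred_s0)              # E_pi[R_0 | s_0]
    assert np.isclose(mean_term + predictable + chaotic, np.sum(discount * rewards))
    return mean_term, predictable, chaotic
```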

3 Chaotic Variation of the reward process

We define below a new conceptual tool, the chaotic variation associated with a reward process (and with a conditional risk measure $\rho$), which addresses the issues raised in section 1.2 and captures reward uncertainty risk. The concept of risk that could be intuited from section 1.2 emerges rigorously as a martingale. We refer to Detlefsen & Scandolo (2005) (cf. supplementary material) for the definition of a conditional risk measure associated with a sigma-algebra $\mathcal{G}$, and we use the notation $\rho^{\pi}_{s_{t}}$ to denote the conditional risk measure associated with the sigma-algebra $\sigma(s_{t})$.

Definition 3.

(Chaotic variation) Let $\pi$ be a policy and $\bm{\rho^{\pi}}:=(\rho^{\pi}_{s_{t}})_{t\geq 0}$ a family of conditional risk measures associated with the process $(s_{t})_{t\geq 0}$. The chaotic variation associated with $(\mathcal{R}_{t},\bm{\rho^{\pi}})$ is defined as $\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{t}](s_{t}):=\rho^{\pi}_{s_{t}}(\mathcal{R}_{t}^{chaos})$. That is, the chaotic variation quantifies the risk related to the chaotic reward process.

In the case of the entropic risk measure, the predictable quadratic variation of the chaotic reward process emerges naturally as a measure of risk, as a consequence of martingale theory, as proposition 1 shows.

Proposition 1.

(chaotic variation in the entropic case). Let $\pi$ be a policy. Using definition 3, for $\beta>0$, let $\rho^{\beta,\pi}_{s_{t}}(X):=\beta^{-1}\ln\mathbb{E}_{\pi}[e^{-\beta X}|s_{t}]$ be the so-called (conditional) entropic risk measure, obtained as the certainty equivalent $CE_{U,\pi,s_{t}}(X):=U^{-1}(\mathbb{E}_{\pi}[U(X)|s_{t}])$ of the exponential utility function $U(x):=-e^{-\beta x}$. Then, the chaotic variation satisfies $\mathcal{C}_{\bm{\rho^{\beta,\pi}}}[\mathcal{R}_{t}](s_{t})\leq\beta^{-1}\ln\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}$, where $\left<\mathcal{R}_{t}^{chaos}\right>$ is the predictable quadratic variation of the martingale $\mathcal{R}_{t}^{chaos}$, given by:

\left<\mathcal{R}_{t}^{chaos}\right>=\sum_{t^{\prime}=t}^{\infty}\gamma^{2(t^{\prime}-t)}\mathbb{E}[(R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))^{2}|s_{t^{\prime}},a_{t^{\prime}}]

As $\beta\to 0$ we get:

\mathcal{C}_{\rho^{\beta,\pi}}[\mathcal{R}_{t}](s_{t})=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]+o(\beta)
Definition 4.

(Chaotic variance) The chaotic variance associated with $(\mathcal{R}_{t},\pi)$ is defined as $V^{\mathbb{V}(\beta)}_{\pi}(s_{t}):=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]$. By the martingale property of $\mathcal{R}_{t}^{chaos}$, the chaotic variance is equal to the variance of $\mathcal{R}_{t}^{chaos}$ conditional on $s_{t}$, scaled by $\frac{\beta}{2}$.
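Since $\mathbb{E}[(\mathcal{R}_{t}^{chaos})^{2}]=\mathbb{E}[\left<\mathcal{R}_{t}^{chaos}\right>]$, the chaotic variance can be estimated by Monte Carlo from rollouts given (an estimate of) $\overline{R}$. Below is a minimal sketch under the assumption that trajectories are available as lists of $(s,a,r)$ transitions starting from the same $s_{0}$; names and data layout are our own illustrative choices.

```python
import numpy as np

def chaotic_variance(trajectories, r_bar, beta, gamma=1.0):
    """
    Monte Carlo estimate of V^{V(beta)}(s_0) = (beta/2) E_pi[ <R_0^chaos> | s_0 ].
    trajectories: list of lists of (s, a, r) transitions, all starting from the same s_0.
    r_bar: callable (s, a) -> estimate of the conditional mean reward R_bar(s, a).
    """
    estimates = []
    for traj in trajectories:
        qv = 0.0  # realized sum of squared chaotic increments, gamma^{2t}-discounted
        for t, (s, a, r) in enumerate(traj):
            qv += gamma ** (2 * t) * (r - r_bar(s, a)) ** 2
        estimates.append(qv)
    return 0.5 * beta * float(np.mean(estimates))
```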

4 Reinforcement Learning: Risk-Sensitive Chaotic algorithms

In this section we assume for simplicity that the state and action spaces are finite. Given the conceptual tools introduced in section 3, and given a fixed initial state $s_{0}$ (the generalization to distributions over initial states is straightforward), we are interested in solving so-called chaotic problems of the form:

\max_{\pi}\mathbb{E}_{\pi}[\mathcal{R}_{0}|s_{0}]-\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{0}](s_{0})\qquad(1)

The extension to problems of the form "$\max_{\pi}\mathbb{E}_{\pi}[\mathcal{R}_{0}|s_{0}]$ subject to $\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{0}](s_{0})\leq\lambda$" is not discussed here, but can be handled using techniques similar to those developed in Prashanth & Fu (2018), Tamar et al. (2012) or Prashanth & Ghavamzadeh (2016).

4.1 Bellman equations: Chaotic Mean-Variance case

We present below the Bellman equation in the chaotic mean-variance case $\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{0}](s_{0})=V^{\mathbb{V}(\beta)}_{\pi}(s_{0})$, cf. definition 4. It will be used to formulate a chaotic mean-variance version of Q-Learning in the episodic case (section 4.2), as well as to study actor-critic algorithms and the average reward case (cf. supplementary material). The proof of theorem 2 follows directly from the definition of $\left<\mathcal{R}_{t}^{chaos}\right>$ in proposition 1.

Theorem 2.

(Bellman equation) Let $Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$ be the Q-function associated with $V^{\mathbb{V}(\beta)}_{\pi}(s_{t})$ of definition 4. Then we have the following Bellman equation:

Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+\gamma^{2}V^{\mathbb{V}(\beta)}_{\pi}(s_{t+1})|s_{t},a_{t}]

Consequently, if $R^{\beta}_{t+1}:=R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}$ and $Q_{\pi}(s_{t},a_{t}):=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t},a_{t}]$ with associated value function $V_{\pi}$, then $Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$ and its associated value function $V^{\beta}_{\pi}$ satisfy the following Bellman equation in the episodic case with $\gamma=1$:

Q^{\beta}_{\pi}(s_{t},a_{t})=\mathbb{E}[R^{\beta}_{t+1}+V^{\beta}_{\pi}(s_{t+1})|s_{t},a_{t}]

In particular, in the episodic case with $\gamma=1$, theorem 2 yields that the class of optimal policies associated with (1) coincides with that associated with the modified rewards $R^{\beta}$. In contrast to traditional TD-based methods such as Q-Learning, $R^{\beta}$ is not directly observable, as it involves the term $\overline{R}(s_{t},a_{t})$, which needs to be estimated over the course of the algorithm.

4.2 Chaotic Mean-Variance Q-Learning

In the episodic mean-variance case (with $\gamma=1$), theorem 2 allows us to derive a chaotic version of Q-learning, cf. algorithm 1, which is based on theorem 3. In the discounted reward setting, it is not possible to combine $Q_{\pi}$ and $Q^{\mathbb{V}(\beta)}_{\pi}$ (where $Q_{\pi}(s_{t},a_{t}):=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t},a_{t}]$), hence we cannot get a Q-Learning type algorithm. The convergence proof is discussed in the supplementary material and consists of a minor modification of the proof of Dayan & Watkins (1992), based on the fact that $\overline{R}_{t}(s,a)\to\overline{R}(s,a)$ as $t\to+\infty$ with probability 1 for every state-action pair $(s,a)$, where $\overline{R}_{t}(s,a)$ is defined in theorem 3. The average reward version of the algorithm is presented in the supplementary material.

Theorem 3.

(Chaotic Mean-Variance Q-Learning in the episodic case). Using theorem 2, denote $Q_{\pi}(s_{t},a_{t}):=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t},a_{t}]$ and $Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$. Let $(s_{t})$, $(a_{t})$ and $(R_{t+1})$ be the successive states, actions and rewards observed by the agent. Let $(\alpha_{t})$ be a sequence of learning rates satisfying the usual conditions for every state-action pair $(s,a)$:

\sum_{k=1}^{\infty}\alpha_{n_{k}(s,a)}=+\infty,\qquad\sum_{k=1}^{\infty}\alpha^{2}_{n_{k}(s,a)}<+\infty

where $n_{k}(s,a)$ is the index corresponding to the $k^{th}$ visit to $(s,a)$. Further define the following iterates if $s_{t}=s$ and $a_{t}=a$:

N_{t}(s,a)=N_{t-1}(s,a)+1
\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)+\frac{1}{N_{t}(s,a)}(R_{t+1}-\overline{R}_{t-1}(s,a))
Q^{\beta}_{t}(s,a)=(1-\alpha_{t})Q^{\beta}_{t-1}(s,a)+\alpha_{t}(R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}_{t}(s,a))^{2}+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime}))

and $Q^{\beta}_{t}(s,a)=Q^{\beta}_{t-1}(s,a)$, $N_{t}(s,a)=N_{t-1}(s,a)$, $\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)$ otherwise. Then $Q^{\beta}_{t}(s,a)\to Q^{\beta}_{*}(s,a)$ as $t\to+\infty$ with probability 1 for every state-action pair $(s,a)$, where $Q^{\beta}_{*}(s,a):=\sup_{\pi}Q^{\beta}_{\pi}(s,a)$.

Remark 1.

Note that in theorem 3, one could use a two-timescale stochastic approximation algorithm by using a specific timescale for the $\overline{R}_{t}$ process (instead of $\frac{1}{N_{t}(s,a)}$). Since the update rule of $\overline{R}_{t}$ does not depend on $Q^{\beta}_{t}(s,a)$ (uncoupled case), this is not required for convergence but could be used to improve the rate of convergence, cf. a remark in Konda & Tsitsiklis (2004), p. 4.

Algorithm 1 CMV-Q-Learning (episodic case)

Input: $Q^{\beta}$-table initialized arbitrarily, learning rate $(\alpha_{t})_{t\geq 0}$, $\overline{R}(s,a)$ and $N(s,a)$ initialized to 0.
Output: optimal policy $\pi_{*}(s)=\mbox{argmax}_{a}Q^{\beta}_{*}(s,a)$

1:for each episode do
2:     initialize $s_{t}=s_{0}$
3:     while $s_{t}$ is not terminal do
4:         Choose $a_{t}$ from $s_{t}$ using a policy derived from $Q^{\beta}$ (e.g. $\epsilon$-greedy).
5:         Take action $a_{t}$, observe $s_{t+1}$, $R_{t+1}$.
6:         $N(s_{t},a_{t})\leftarrow N(s_{t},a_{t})+1$; $\overline{R}(s_{t},a_{t})\leftarrow\overline{R}(s_{t},a_{t})+\frac{1}{N(s_{t},a_{t})}(R_{t+1}-\overline{R}(s_{t},a_{t}))$
7:         $Q^{\beta}(s_{t},a_{t})\leftarrow(1-\alpha_{t})Q^{\beta}(s_{t},a_{t})+\alpha_{t}(R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+\max_{a}Q^{\beta}(s_{t+1},a))$
8:         $s_{t}\leftarrow s_{t+1}$
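A minimal tabular Python sketch of algorithm 1 follows; it is our own illustration, assuming a standard episodic environment exposing `reset()` and `step(a)` that returns `(next_state, reward, done)`, with integer state and action indices (the names `env`, `n_states`, `n_actions` are hypothetical).

```python
import numpy as np

def cmv_q_learning(env, n_states, n_actions, beta, n_episodes=5000,
                   alpha=0.1, epsilon=0.1, seed=0):
    """Chaotic Mean-Variance Q-Learning (algorithm 1), episodic case with gamma = 1."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # Q^beta table
    R_bar = np.zeros((n_states, n_actions))   # running estimate of R_bar(s, a)
    N = np.zeros((n_states, n_actions))       # visit counts
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            N[s, a] += 1
            R_bar[s, a] += (r - R_bar[s, a]) / N[s, a]            # running mean of the reward
            target = r - 0.5 * beta * (r - R_bar[s, a]) ** 2      # modified reward R^beta
            target += 0.0 if done else np.max(Q[s_next])          # terminal states have value 0
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q, R_bar
```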

4.3 Monte Carlo Policy gradient algorithms - Episodic case

In this section we consider episodic Monte Carlo based algorithms that start with a parametric form $\pi_{\theta}$ for the policy and optimize equation (1) in the direction of the gradient with respect to $\theta$, hence aiming for local optima only. We make the following classical assumption in this section (Bhatnagar et al. (2009)):

Assumption 2.

For every policy $\pi_{\theta}$, the Markov chain induced by $P$ and $\pi_{\theta}$ is ergodic, i.e. irreducible, aperiodic and positive recurrent. Further, for every state-action pair $(s,a)$, $\pi_{\theta}(s,a)$ is continuously differentiable in $\theta$.

From equation (1), we need to compute unbiased estimates of $\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]$ and $\nabla_{\theta}\mathcal{C}_{\bm{\rho^{\pi_{\theta}}}}[\mathcal{R}_{0}](s_{0})$. The former is the classical expected reward gradient. By definition 3, $\mathcal{C}_{\bm{\rho^{\pi_{\theta}}}}[\mathcal{R}_{0}](s_{0})$ consists in applying a risk measure to a mean-zero martingale, which is usually simpler, as we will see below in the mean-variance case $\mathcal{C}_{\bm{\rho^{\pi_{\theta}}}}[\mathcal{R}_{0}](s_{0})=V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})$ (cf. definition 4); more importantly, the work done in the literature for specific risk measures can be applied straightforwardly, e.g. Chow et al. (2018) or Tamar, A., Glassner, Y., Manor, S. (2015) in the case of CVaR (however, the risk measure in our case is applied to the chaotic reward process $\mathcal{R}_{t}^{chaos}$ only). From now on we focus on the mean-variance case of definition 4, but we present in the supplementary material extensions to chaotic CVaR and Sharpe ratio.

Provided we know $\overline{R}(s,a)$, the gradient $\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})$ presents no specific difficulty and can be computed using the classical likelihood ratio technique by generating $B\geq 1$ episodes $s_{0}^{(b)}$, $a_{0}^{(b)}$, $R_{1}^{(b)}$, …, $s_{T_{b}-1}^{(b)}$, $a_{T_{b}-1}^{(b)}$, $R_{T_{b}}^{(b)}$, $b=1..B$, following $\pi_{\theta}$ and computing the unbiased Monte Carlo estimate:

\nabla_{\theta}V_{\pi_{\theta}}^{\mathbb{V}(\beta)}(s_{0})=\frac{\beta}{2}\frac{1}{B}\sum_{b=1}^{B}\sum_{t^{\prime}=0}^{T_{b}-1}\nabla\ln\pi_{\theta}(a^{(b)}_{t^{\prime}}|s^{(b)}_{t^{\prime}})V_{b,t^{\prime}}
V_{b,t^{\prime}}:=\sum_{t=t^{\prime}}^{T_{b}-1}\gamma^{2(t-t^{\prime})}(R^{(b)}_{t+1}-\overline{R}(s^{(b)}_{t},a^{(b)}_{t}))^{2}

The subtlety here is that we need to learn $\overline{R}(s,a)$ in the course of the algorithm. In the tabular case, using assumption 2, we are guaranteed that every state-action pair $(s,a)$ will be visited infinitely often, and we can proceed as in the Q-Learning case (theorem 3) by updating the (conditional) average estimate $\widehat{R}(s,a)$ as iterations on $\theta$ are performed. By the strong law of large numbers, convergence of $\widehat{R}(s,a)$ to $\overline{R}(s,a)$ occurs with probability 1. Remark 1 still holds here: since the iteration on the sample average $\widehat{R}(s,a)$ is uncoupled from the one on $\theta$, we do not need a two-timescale algorithm to guarantee convergence. For each $b=1..B$ and $t=0..T_{b}$, if $s=s_{t}^{(b)}$ and $a=a_{t}^{(b)}$ we perform the updates:

N(s,a)=N(s,a)+1\qquad(2)
\widehat{R}(s,a)=\widehat{R}(s,a)+\frac{1}{N(s,a)}(R^{(b)}_{t+1}-\widehat{R}(s,a))
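A minimal sketch of the tabular running-average update (2), using a dictionary keyed by state-action pairs (our own illustration; the class name is hypothetical):

```python
from collections import defaultdict

class TabularMeanReward:
    """Running estimate R_hat(s, a) of the conditional mean reward, as in eq. (2)."""
    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, s, a, r):
        self.count[(s, a)] += 1
        self.mean[(s, a)] += (r - self.mean[(s, a)]) / self.count[(s, a)]

    def __call__(self, s, a):
        return self.mean[(s, a)]
```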

Alternatively, if the state or action spaces are large, one may want to use a parametric approximation $\widehat{R}_{\psi}(s,a)$ of $\overline{R}(s,a)$ that is updated in the course of the algorithm: this approach is presented in algorithm 2, where we use an experience replay table and fit $\widehat{R}_{\psi}$ using SGD. Proposition 2 quantifies the gradient bias due to the use of this approximation (proof in the supplementary material). We present in algorithm 3 the Chaotic Mean-Variance version of REINFORCE, which uses either equation (2) or algorithm 2.

Proposition 2.

Let $\epsilon_{\psi}(s,a):=\widehat{R}_{\psi}(s,a)-\overline{R}(s,a)$ and $\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}$ the gradient obtained using the approximation $\widehat{R}_{\psi}$. The gradient bias $B_{\psi}^{\theta}(s_{0}):=\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(s_{0})-\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})$ satisfies:

B_{\psi}^{\theta}(s_{0})=\frac{\beta}{2}\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)b^{\psi}_{\theta}(S,A)]
b^{\psi}_{\theta}(s,a):=\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{\infty}\gamma^{2t}\epsilon_{\psi}^{2}(s_{t},a_{t})|s_{0}=s,a_{0}=a]

where $d^{\theta}_{\gamma^{2}}(s_{0},s,a):=\pi_{\theta}(a|s)\sum_{t=0}^{\infty}\gamma^{2t}\mathbb{P}_{\pi_{\theta}}[s_{t}=s|s_{0}]$ is the action-state $\gamma^{2}$-discounted visiting distribution.

Algorithm 2 Distributional Update

Input: initial distributional parameter $\psi$, experience replay table $\mathcal{E}$, number of distributional gradient steps $N_{\psi}$, number of SGD samples $M_{\psi}$.
Output: Approximation of the optimal distributional parameter $\psi^{*}$

1:Draw $M_{\psi}$ samples $(s_{j},a_{j},R_{j})$ randomly from $\mathcal{E}$.
2:Set $\widetilde{M}_{\psi}$ to be the number of unique pairs $(\widetilde{s}_{j},\widetilde{a}_{j})$ and for each such pair, set $\widetilde{R}_{j}$ to be the average of the corresponding $R_{j}$.
3:Perform $N_{\psi}$ steps of SGD on the loss $(\widehat{R}_{\psi}(\widetilde{s}_{j},\widetilde{a}_{j})-\widetilde{R}_{j})^{2}$ using the $\widetilde{M}_{\psi}$ samples.
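As an illustration of algorithm 2, the sketch below fits a linear-in-features approximation $\widehat{R}_{\psi}(s,a)=\psi^{\top}\phi(s,a)$ by plain SGD; the feature map `phi`, the replay-buffer layout (a list of `(s, a, r)` tuples) and all hyperparameters are assumptions of this illustration, not the paper's implementation.

```python
import random
import numpy as np

def distributional_update(psi, replay, phi, n_steps=50, n_samples=256, lr=1e-2):
    """One call of algorithm 2: fit R_hat_psi(s, a) = psi . phi(s, a) by SGD on squared loss."""
    batch = random.sample(replay, min(n_samples, len(replay)))      # draw M_psi samples
    averaged = {}                                                    # average R over unique (s, a)
    for s, a, r in batch:
        cnt, mean = averaged.get((s, a), (0, 0.0))
        averaged[(s, a)] = (cnt + 1, mean + (r - mean) / (cnt + 1))
    pairs = [(phi(s, a), mean) for (s, a), (_, mean) in averaged.items()]
    for _ in range(n_steps):                                         # N_psi SGD steps
        x, target = random.choice(pairs)
        err = float(np.dot(psi, x)) - target
        psi = psi - lr * 2.0 * err * x                               # gradient of (psi.x - target)^2
    return psi
```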
Algorithm 3 CMV-REINFORCE (episodic case)

Input: Initial policy parameter $\theta_{0}$, learning rate $(\alpha_{n})$, number of episodes per batch $B$, $\widehat{R}_{\psi}$ initialized to 0, optional: experience replay table $\mathcal{E}$.
Output: Approximation of the optimal policy parameter $\theta^{*}$

1:while $\theta_{n}$ not converged do
2:     Generate $B$ episodes $s_{0}^{(b)}$, $a_{0}^{(b)}$, $R_{1}^{(b)}$, …, $s_{T_{b}-1}^{(b)}$, $a_{T_{b}-1}^{(b)}$, $R_{T_{b}}^{(b)}$, $b=1..B$, following $\pi_{\theta_{n}}(\cdot|\cdot)$
3:     In the tabular case: use the episodes to update $\widehat{R}_{\psi}$ with eq. (2); else increment $\mathcal{E}$ with the tuples $(s_{t}^{(b)},a_{t}^{(b)},R_{t}^{(b)})$ and update $\widehat{R}_{\psi}$ using algorithm 2.
4:     for $b=1$ to $B$ do
5:         $v_{t,b}:=\sum_{t^{\prime}=t}^{T_{b}-1}[\gamma^{t^{\prime}-t}R^{(b)}_{t^{\prime}+1}-\frac{\beta}{2}\gamma^{2(t^{\prime}-t)}(R^{(b)}_{t^{\prime}+1}-\widehat{R}_{\psi}(s^{(b)}_{t^{\prime}},a^{(b)}_{t^{\prime}}))^{2}]$
6:         $V_{b}\leftarrow\sum_{t=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t}|s^{(b)}_{t})v_{t,b}$
7:     $\theta_{n+1}\leftarrow\theta_{n}+\alpha_{n}B^{-1}\sum_{b=1}^{B}V_{b}$
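For a softmax policy with tabular parameters $\theta(s,a)$, one gradient step of algorithm 3 can be sketched as follows; this is our own illustration, where the episode format, the learning rate and the `r_hat` estimator (e.g. the tabular class above or $\widehat{R}_{\psi}$) are assumptions.

```python
import numpy as np

def softmax_probs(theta, s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def cmv_reinforce_step(theta, episodes, r_hat, beta, gamma=1.0, lr=0.05):
    """One CMV-REINFORCE update (algorithm 3) from a batch of episodes [(s, a, r), ...]."""
    grad = np.zeros_like(theta)
    for episode in episodes:
        T = len(episode)
        for t, (s, a, _) in enumerate(episode):
            # v_{t,b}: reward-to-go minus the chaotic mean-variance penalty (step 5)
            v = 0.0
            for k in range(t, T):
                s_k, a_k, r_k = episode[k]
                v += gamma ** (k - t) * r_k
                v -= 0.5 * beta * gamma ** (2 * (k - t)) * (r_k - r_hat(s_k, a_k)) ** 2
            # grad log pi_theta(a|s) for a softmax policy over theta[s]: e_a - pi(.|s)
            glog = -softmax_probs(theta, s)
            glog[a] += 1.0
            grad[s] += glog * v
    return theta + lr * grad / len(episodes)
```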

4.4 Policy Gradient Actor-Critic algorithms

In order to derive Actor-Critic algorithms, the Bellman equations in theorem 2 can be used, as in the CMV-Q-Learning algorithm 1, to adapt the work of e.g. Prashanth & Ghavamzadeh (2016) (variance case) or Chow et al. (2018) (CVaR case) and derive chaotic versions of their actor-critic algorithms. This extension is discussed in the supplementary material.

5 Experiments: Chaotic Mean-Variance

5.1 Grid World

We consider the episodic problem of a robot on a grid aiming at a goal. The state space consists of the 16 grid squares, and the action space consists in choosing to go East, West, North or South. Reaching the goal (resp. taking a step) gives a +1 (resp. -1) reward, and negative rewards are positioned on the grid as seen in figure 2. When an action $a_{t}$ is chosen, there is a probability $p_{error}=50\%$ that the robot moves in a random direction. If the robot hits the boundary of the grid, it stays where it is. We train policies using the CMV-Q-Learning algorithm 1. In figure 2 we plot the path heatmap over $10^{5}$ rollout steps performed with the learned policy for various risk aversion coefficients $\beta$ (cf. details and additional experiments in the supplementary material). The higher $\beta$, the further away from the -20 reward the robot goes, as expected, since it prefers taking the -6 loss rather than walking next to the -20 reward and risking incurring the corresponding loss.

The second element that our algorithm provides is a risk heatmap, highlighting the most uncertain states, i.e. the states which yield the highest reward stochasticity/uncertainty (figure 2). We perform rollouts with the learned $\overline{R}(\cdot,\cdot)$ and policy $\pi^{*}_{\beta}$ and compute $\mathbb{E}_{\pi^{*}_{\beta}}[(R_{t+1}-\overline{R}(s,\pi^{*}_{\beta}(s)))^{2}|s_{t}=s]$ for every state $s$. In figure 2 we see that, as expected, the heatmap highlights the states next to the -20 reward, and to a lesser intensity the states around the -6 rewards. To the best of our knowledge, only two other works study risk-sensitive Q-Learning algorithms: Mihatsch & Neuneier (2002) and Shen et al. (2014). Both transform the TD increments by a non-linear function in order to obtain risk-sensitive behaviors. The former show that their algorithm converges to the worst-case optimality criterion as the risk parameter changes, hence in our specific example we could obtain paths similar to those in figure 2; but our algorithm, in addition to being the only one to focus on reward stochasticity only, provides the risk heatmap discussed above, which quantifies the extent to which a given state yields uncertain rewards.
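The risk-heatmap values can be estimated from rollout transitions by averaging squared chaotic increments per state; a minimal sketch (our own illustration, assuming transitions collected as `(s, a, r)` tuples under $\pi^{*}_{\beta}$ and a learned `r_bar` table indexed by state and action):

```python
from collections import defaultdict

def risk_heatmap(transitions, r_bar):
    """Estimate E[(R_{t+1} - R_bar(s, pi(s)))^2 | s_t = s] for every visited state s."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, a, r in transitions:
        sums[s] += (r - r_bar[s, a]) ** 2
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}
```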


Figure 2: Path and risk heatmaps over $10^{5}$ rollout steps with the learned policy, $p_{error}=50\%$ (cf. description in text).

5.2 Portfolio optimization

We consider the problem of investing in 2 financial assets. One is risk-free in the sense that investing a quantity $q^{RF}_{t}$ in the asset yields a deterministic reward $R^{RF}_{t+1}:=q^{RF}_{t}\mu(s_{t})$, where $\mu$ is the risk-free rate. The latter can be seen as the overnight rate from day $t$ to day $t+1$, which is known at the end of day $t$ when $q^{RF}_{t}$ is chosen. The other asset is risky (e.g. a stock) and yields, for a quantity $q^{R}_{t}$, an uncertain reward $R^{R}_{t+1}:=q^{R}_{t}(\mu(s_{t})+\sigma(s_{t})h_{t+1})$, where $(h_{t})$ are i.i.d. standard normal and $\sigma$ is referred to as the volatility. We restrict $q^{RF}_{t},q^{R}_{t}$ to be nonnegative integers whose sum is at most a total investment constraint $q_{max}$. The action is defined as $a_{t}:=(q^{RF}_{t},q^{R}_{t})$ and the reward as $R_{t+1}:=R^{RF}_{t+1}+R^{R}_{t+1}$. In our experiments we take $q_{max}=5$, yielding $21$ possible actions. The state space consists of 3 states, LowVol, HighVol and MediumVol, defined as low volatility $\sigma$ (and low risk-free rate $\mu$), high volatility (and high risk-free rate $\mu$), and an intermediate state, respectively. The state transition matrix is designed such that the more we trade in the risky asset (i.e. the higher $q^{R}_{t}$), the more likely we are to reach a higher volatility state. We train the policy using the CMV-REINFORCE algorithm 3 and its baseline (classical mean-variance), and use it to compute performance metrics over rollout episodes of $T$ timesteps. Details on the training and the numerical values used are presented in the supplementary material.
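A minimal sketch of the portfolio MDP's reward and transition logic as described above; the specific rates, volatilities and transition probabilities below are hypothetical placeholders (the paper's values are in the supplementary material).

```python
import numpy as np

# Hypothetical per-state risk-free rate mu and volatility sigma (LowVol, MediumVol, HighVol).
MU = np.array([0.01, 0.02, 0.04])
SIGMA = np.array([0.05, 0.15, 0.40])
Q_MAX = 5
ACTIONS = [(qf, qr) for qf in range(Q_MAX + 1) for qr in range(Q_MAX + 1 - qf)]  # 21 actions

def step(state, action, rng):
    """One transition: reward = risk-free part + risky part; more risky trading -> higher-vol state."""
    q_rf, q_r = action
    h = rng.standard_normal()
    reward = q_rf * MU[state] + q_r * (MU[state] + SIGMA[state] * h)
    # Illustrative transition: probability of moving up in volatility grows with q_r.
    p_up = 0.1 + 0.1 * q_r
    u = rng.random()
    if u < p_up:
        next_state = min(state + 1, 2)
    elif u < p_up + 0.2:
        next_state = max(state - 1, 0)
    else:
        next_state = state
    return next_state, reward
```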


Figure 3: (Top) Fraction of time spent per state. (Mid/Bottom) CMV and baseline (mean-variance) asset investment fraction per state as a function of $\beta$ - $T=20$ timesteps, $s_{0}=\mbox{LowVol}$.

We display in figure 3 the fraction of time the process spends in each state, as well as the investment fraction in each asset (over rollout episodes). With $\beta=0$ and by design of the state transition matrix, both assets give the same expected reward $\mu$, but the policy is incentivized to trade in the risky asset rather than in the risk-free asset in order to reach the HighVol state, which gives the highest $\mu$. As $\beta$ increases, in the CMV case, the policy shifts towards trajectories which contain less reward uncertainty, which means investing in the risk-free asset, whose associated rewards are deterministic. The classical mean-variance case (baseline) penalizes not only the variability related to reward uncertainty but also the variability related to the switching of states appearing in both assets via the deterministic term $\mu(s_{t})$; hence, as $\beta$ increases, the policy is incentivized to stop investing (cf. the green line in figure 3 representing the uninvested budget $q_{max}-q^{RF}_{t}-q^{R}_{t}$), i.e. it stops taking advantage of the risk-free asset. The latter can also be seen in figure 4, where we plot for rollout episodes the cumulative reward mean and standard deviation as a function of $\beta$. In the mean-variance case the reward mean eventually vanishes as the policy stops trading in both assets: by trying to make the standard deviation lower, it generates the counterintuitive behavior of no longer taking advantage of the risk-free asset.

6 Conclusion

We presented a novel, conceptually meaningful decomposition of the cumulative reward process based on the Doob decomposition that distinguishes between the different sources of randomness contained within it, introduced a new conceptual tool, the chaotic variation, that exactly captures reward uncertainty risk, and incorporated it into model-free value-function based and policy gradient algorithms. Potential real-world applications include all settings where one is subject to uncertain/stochastic rewards and is interested in deriving interpretable risk-sensitive policies, for example the recently studied RL financial market-making problems of Guéant & Manziuk (2019) and Ganesh et al. (2019), where reward uncertainty plays a major role: market-makers stream prices but do not know whether clients will decide to trade at those prices, and they are typically averse to uncertain fluctuations in the underlying financial asset price. Future work could include extending the framework to the case of delayed rewards.


Figure 4: Cumulative reward mean and standard deviation for CMV and baseline (mean-variance), as a function of $\beta$ - (Top) $T=1$, $s_{0}=\mbox{Random}$ and (Bottom) $T=20$ timesteps, $s_{0}=\mbox{LowVol}$.

Disclaimer

This paper was prepared for information purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co and its affiliates (“JP Morgan”), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful. © 2020 JPMorgan Chase & Co. All rights reserved.

References

  • Bellemare et al. , (2017) Bellemare, Marc G, Dabney, Will, & Munos, Rémi. 2017. A Distributional Perspective on Reinforcement Learning. In: ICML.
  • Bhatnagar et al. , (2009) Bhatnagar, Shalabh, Sutton, Richard S, Ghavamzadeh, Mohammad, & Lee, Mark. 2009. Natural actor–critic algorithms. Automatica, 45(11), 2471–2482.
  • Borkar, (2002) Borkar, Vivek S. 2002. Q-learning for Risk-Sensitive control. Mathematics of Operations Research, 27(1), 294–31.
  • Borkar, (2005) Borkar, Vivek S. 2005. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3), 207–213.
  • Borkar, (2010) Borkar, Vivek S. 2010. Learning Algorithms for Risk-Sensitive Control. Proceedings of the Nineteenth International Symposium on Mathematical Theory of Networks and Systems, 1327–1332.
  • Chow et al. , (2018) Chow, Yinlam, Ghavamzadeh, Mohammad, Janson, Lucas, & Pavone, Marco. 2018. Risk-Constrained Reinforcement Learning with Percentile Risk Criteria. J. Mach. Learn. Res., 18(167), 1–51.
  • Dayan & Watkins, (1992) Dayan, Peter, & Watkins, Christopher. 1992. Technical Note Q-Learning. Machine Learning, 8, 279–292.
  • Defourny et al. , (2008) Defourny, Boris, Ernst, Damien, & Wehenkel, Louis. 2008. Risk-Aware Decision Making and Dynamic Programming. In: NIPS workshop.
  • Detlefsen & Scandolo, (2005) Detlefsen, Kai, & Scandolo, Giacomo. 2005. Conditional and dynamic convex risk measures. Finance Stoch., 9(4), 539–561.
  • Ganesh et al. , (2019) Ganesh, S., Vadori, N., Xu, M., Zheng, H., Reddy, P., & Veloso, M. 2019. Reinforcement Learning for Market Making in a Multi-agent Dealer Market. In: NeurIPS Workshop on Robust AI in Financial Services.
  • Geibel & Wysotzki, (2005) Geibel, Peter, & Wysotzki, Fritz. 2005. Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res., 24, 81–108.
  • Guéant & Manziuk, (2019) Guéant, Olivier, & Manziuk, Iuliia. 2019. Deep reinforcement learning for market making in corporate bonds: beating the curse of dimensionality. arXiv.
  • Konda & Tsitsiklis, (2004) Konda, Vijay R., & Tsitsiklis, John. 2004. Convergence rate of linear two-timescale stochastic approximation. The Annals of Applied Probability, 14(2), 796–819.
  • Mahadevan, (1996) Mahadevan, Sridhar. 1996. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Mach. Learn., 22(1-3), 159–195.
  • Mannor & Tsitsiklis, (2011) Mannor, Shie, & Tsitsiklis, John. 2011. Mean-Variance Optimization in Markov Decision Processes. In: ICML.
  • Mihatsch & Neuneier, (2002) Mihatsch, Oliver, & Neuneier, Ralph. 2002. Risk-Sensitive Reinforcement Learning. Mach. Learn., 49(2), 267–290.
  • Morimura et al. , (2010a) Morimura, Tetsuro, Sugiyama, Masashi, Kashima, Hisashi, Hachiya, Hirotaka, & Tanaka, Toshiyuki. 2010a. Nonparametric Return Distribution Approximation for Reinforcement Learning. In: ICML.
  • Morimura et al. , (2010b) Morimura, Tetsuro, Sugiyama, Masashi, & Kashima, Hisashi. 2010b. Parametric Return Density Estimation for Reinforcement Learning. In: UAI.
  • Necchi, (2016) Necchi, Pierpaolo. 2016. Policy Gradient Algorithms for the Asset Allocation Problem. M.Phil. thesis, Politechnico di Milano.
  • Prashanth & Fu, (2018) Prashanth, L A, & Fu, Michael. 2018. Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint. arXiv, Oct.
  • Prashanth & Ghavamzadeh, (2016) Prashanth, L A, & Ghavamzadeh, Mohammad. 2016. Variance-Constrained Actor-Critic Algorithms for Discounted and Average Reward MDPs. Mach. Learn., 105(3), 367–417.
  • Prashanth & Ghavamzadeh, (2013) Prashanth, L.a., & Ghavamzadeh, Mohammad. 2013. Actor-Critic Algorithms for Risk-Sensitive MDPs. Advances in Neural Information Processing Systems 26, 252–260.
  • Shen et al. , (2014) Shen, Yun, Tobia, Michael J, Sommer, Tobias, & Obermayer, Klaus. 2014. Risk-sensitive Reinforcement Learning. Neural Computation, 26(7).
  • Sobel, MJ, (1982) Sobel, MJ. 1982. The Variance of discounted Markov Decision Processes. J. Appl. Probab.
  • Tamar et al. , (2012) Tamar, Aviv, Di Castro, Dotan, & Mannor, Shie. 2012. Policy Gradients with Variance Related Risk Criteria. In: ICML.
  • Tamar et al. , (2015) Tamar, Aviv, Chow, Yinla, Ghavamzadeh, Mohammad, & Mannor, Shie. 2015. Policy Gradient for Coherent Risk Measures. In: NIPS.
  • Tamar et al. , (2016) Tamar, Aviv, Castro, Dotan Di, & Mannor, Shie. 2016. Learning the Variance of the Reward-To-Go. Journal of Machine Learning Research, 17(13), 1–36.
  • Tamar, A., Glassner, Y., Manor, S., (2015) Tamar, A., Glassner, Y., Manor, S. 2015. Optimizing the CVaR via Sampling. In: AAAI.
  • Thomas, (2014) Thomas, Philip S. 2014. Bias in Natural Actor-Critic Algorithms. In: ICML.
  • Vadori & Swishchuk, (2015) Vadori, N., & Swishchuk, A. 2015. Strong Law of Large Numbers and Central Limit Theorems for Functionals of Inhomogeneous Semi-Markov Processes. Stochastic Analysis and Applications, 33(2), 213–243.

Appendix A Background

A.1 Central limit theorem for Markov chain functionals (section 1 - Introduction)

Assume for simplicity that the state space $\mathbb{S}$ is finite, and let $R_{\pi}(s_{t},s_{t+1}):=R(s_{t},\pi(s_{t}),s_{t+1})$ be the reward obtained at time $t+1$ associated with the deterministic policy $\pi$, so that $R_{\pi}$ is a function of $s_{t}$ and $s_{t+1}$ only, and $(s_{t})$ is a Markov chain satisfying $\mathbb{P}[s_{t+1}=s^{\prime}|s_{t}=s]=P(s,\pi(s),s^{\prime})=:P_{\pi}(s,s^{\prime})$. The central limit theorem for Markov chain functionals (Vadori & Swishchuk (2015), theorem 3.17) states that the limit in distribution as $n\to+\infty$ of:

\sqrt{n}\left(\frac{1}{n}\sum_{t=0}^{n}R_{\pi}(s_{t},s_{t+1})-\sum_{x,y\in\mathbb{S}}d_{\pi}(x)P_{\pi}(x,y)R_{\pi}(x,y)\right)

is normal with mean zero and variance $\sigma_{deter.}^{2}+\sigma_{chaotic}^{2}$, where $d_{\pi}$ is the stationary distribution of the Markov chain with transition kernel $P_{\pi}$. Equivalently, this limit is equal in distribution to the sum of two independent normal random variables with respective variances $\sigma_{deter.}^{2}$ and $\sigma_{chaotic}^{2}$, given by the following expressions:

\sigma_{chaotic}^{2}=\sum_{x\in\mathbb{S}}d_{\pi}(x)P_{\pi}^{var}R_{\pi}(x,\cdot)(x),\qquad\sigma_{deter.}^{2}=\sum_{x\in\mathbb{S}}d_{\pi}(x)P_{\pi}^{var}f_{Poisson}(x)

Denoting the conditional expected reward $\mu_{\pi}(x):=\mathbb{E}[R_{\pi}(s_{t},s_{t+1})|s_{t}=x]=\sum_{y\in\mathbb{S}}R_{\pi}(x,y)P_{\pi}(x,y)$, we see below that $\sigma_{deter.}^{2}$ only depends on the rewards via the deterministic term $\mu_{\pi}(x)$, whereas $\sigma_{chaotic}^{2}$ additionally depends on the term $\mathbb{E}[R_{\pi}(s_{t},s_{t+1})^{2}|s_{t}=x]$, which quantifies reward stochasticity/uncertainty. Indeed, the variance operators $P_{\pi}^{var}$ act as follows:

P_{\pi}^{var}R_{\pi}(x,\cdot)(x)=\sum_{y\in\mathbb{S}}R^{2}_{\pi}(x,y)P_{\pi}(x,y)-\mu_{\pi}(x)^{2}

and:

P_{\pi}^{var}f_{Poisson}(x)=\sum_{y\in\mathbb{S}}f^{2}_{Poisson}(y)P_{\pi}(x,y)-\Big(\sum_{y\in\mathbb{S}}f_{Poisson}(y)P_{\pi}(x,y)\Big)^{2}

where the function $f_{Poisson}$ is the solution of the Poisson equation:

\sum_{y\in\mathbb{S}}f_{Poisson}(y)P_{\pi}(x,y)-f_{Poisson}(x)=\sum_{x^{\prime}\in\mathbb{S}}d_{\pi}(x^{\prime})\mu_{\pi}(x^{\prime})-\mu_{\pi}(x)
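For a finite chain, $\sigma_{chaotic}^{2}$ and $\sigma_{deter.}^{2}$ can be computed numerically by solving the Poisson equation above in the least-squares sense; the sketch below is our own illustration under the assumptions of this appendix (finite state space, fixed policy, rewards $R(x,y)$ on transitions).

```python
import numpy as np

def clt_variances(P, R):
    """
    Compute sigma_chaotic^2 and sigma_deter^2 for a finite Markov chain with transition
    matrix P[x, y] and reward R[x, y] received on the transition x -> y (policy fixed).
    """
    n = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    d = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    d = d / d.sum()
    mu = (P * R).sum(axis=1)                      # conditional mean reward mu_pi(x)
    second = (P * R ** 2).sum(axis=1)             # conditional second moment of the reward
    sigma2_chaotic = float(d @ (second - mu ** 2))
    # Poisson equation: (P - I) f = mean(mu) - mu, solved in the least-squares sense
    # (f is only defined up to an additive constant, which does not affect the variance).
    rhs = float(d @ mu) - mu
    f, *_ = np.linalg.lstsq(P - np.eye(n), rhs, rcond=None)
    Pf, Pf2 = P @ f, P @ f ** 2
    sigma2_deter = float(d @ (Pf2 - Pf ** 2))
    return sigma2_chaotic, sigma2_deter
```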

A.2 Doob decomposition of a stochastic process

Let $X$ be a discrete-time process on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that (i) $\mathbb{E}[|X_{n}|]<\infty$ for all $n$, and (ii) $X$ is adapted to the filtration $\mathbb{F}:=(\mathcal{F}_{n})_{n\geq 0}$, i.e. $X_{n}$ is $\mathcal{F}_{n}$-measurable for all $n$. Then there exist an integrable martingale $M$ and an integrable predictable process $A$ such that $A_{0}=M_{0}=0$ and $X_{n}=X_{0}+A_{n}+M_{n}$ for all $n$. This decomposition is almost surely unique. Here, we recall that $A$ being predictable means that $A_{n}$ is $\mathcal{F}_{n-1}$-measurable for all $n$ (i.e. it is known one timestep in advance), and $M$ being a martingale means that $M_{n}$ is $\mathcal{F}_{n}$-measurable and $\mathbb{E}[M_{n+1}-M_{n}|\mathcal{F}_{n}]=0$. Further, $A$ and $M$ are given by:

A_{n}=\sum_{k=1}^{n}(\mathbb{E}[X_{k}|\mathcal{F}_{k-1}]-X_{k-1}),\qquad M_{n}=\sum_{k=1}^{n}(X_{k}-\mathbb{E}[X_{k}|\mathcal{F}_{k-1}])

A.3 Conditional risk measures

We recall here the definition of a conditional risk measure associated with a sigma-algebra $\mathcal{G}$, according to Detlefsen & Scandolo (2005). Let $L^{\infty}$, $L^{\infty}_{\mathcal{G}}$ be respectively the set of bounded random variables and the set of bounded, $\mathcal{G}$-measurable random variables. A conditional risk measure $\rho$ associated with a sigma-algebra $\mathcal{G}$ is a map $L^{\infty}\to L^{\infty}_{\mathcal{G}}$ such that:

  • (Normalization) $\rho(0)=0$.

  • (Conditional Translation Invariance) For any $X\in L^{\infty}$, $Z\in L^{\infty}_{\mathcal{G}}$ we have $\rho(X+Z)=\rho(X)-Z$.

  • (Monotonicity) For any $X,Y\in L^{\infty}$, if $X\leq Y$ with probability 1, then $\rho(X)\geq\rho(Y)$ with probability 1.

Appendix B Proofs

B.1 Proof of theorem 1

Define the process Yn=n+t,t𝔼π[n+t,t|st]Y_{n}=\mathcal{R}_{n+t,t}-\mathbb{E}_{\pi}[\mathcal{R}_{n+t,t}|s_{t}] for n1n\geq 1 and Y0:=0Y_{0}:=0, where we remind that:

n,t:=t=t(n1)tγttRt+1\mathcal{R}_{n,t}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}R_{t^{\prime}+1}

The process $Y_{n}$ is adapted to the filtration generated by the sigma-algebras $\mathcal{G}_{n}:=\mathcal{F}_{n+t}$, since by definition $\mathcal{F}_{n}:=\sigma(s_{k},a_{k},h_{k},k\leq n)$. Hence, we can apply the Doob decomposition (cf. section A.2) and obtain the following decomposition of the process $Y_{n}$, unique $\mathbb{P}_{\pi}$-a.s., for $n\geq 0$:

Yn=Y0+Ynpred+YnchaosY_{n}=Y_{0}+Y_{n}^{pred}+Y_{n}^{chaos}

where YnpredY_{n}^{pred}, YnchaosY_{n}^{chaos} are respectively a predictable process and a martingale with respect to the filtration (𝒢n)n0(\mathcal{G}_{n})_{n\geq 0}, satisfying Y0pred=Y0chaos=0Y_{0}^{pred}=Y_{0}^{chaos}=0, and are given by, for n0n\geq 0 (with the usual convention that k=10():=0\sum_{k=1}^{0}(\cdot):=0):

Ynpred=k=1n(𝔼π[Yk|𝒢k1]Yk1),Ynchaos=k=1n(Yk𝔼π[Yk|𝒢k1])Y_{n}^{pred}=\sum_{k=1}^{n}(\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}]-Y_{k-1}),\hskip 14.22636ptY_{n}^{chaos}=\sum_{k=1}^{n}(Y_{k}-\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}])

We have Y0=0Y_{0}=0 and if k2k\geq 2:

𝔼π[Yk|𝒢k1]Yk1=𝔼π[k+t,tk+t1,t|k+t1]𝔼π[k+t,tk+t1,t|st]\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}]-Y_{k-1}=\mathbb{E}_{\pi}[\mathcal{R}_{k+t,t}-\mathcal{R}_{k+t-1,t}|\mathcal{F}_{k+t-1}]-\mathbb{E}_{\pi}[\mathcal{R}_{k+t,t}-\mathcal{R}_{k+t-1,t}|s_{t}]
=γk1(𝔼π[Rk+t|k+t1]𝔼π[Rk+t|st])=\gamma^{k-1}(\mathbb{E}_{\pi}[R_{k+t}|\mathcal{F}_{k+t-1}]-\mathbb{E}_{\pi}[R_{k+t}|s_{t}])

If k=1k=1 we get:

𝔼π[Yk|𝒢k1]Yk1=𝔼π[Y1|𝒢0]=𝔼π[Rt+1|t]𝔼π[Rt+1|st]\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}]-Y_{k-1}=\mathbb{E}_{\pi}[Y_{1}|\mathcal{G}_{0}]=\mathbb{E}_{\pi}[R_{t+1}|\mathcal{F}_{t}]-\mathbb{E}_{\pi}[R_{t+1}|s_{t}]

Overall, this gives for n1n\geq 1:

Ynpred=k=1nγk1(𝔼π[Rk+t|k+t1]𝔼π[Rk+t|st])=k=tn1+tγkt(𝔼π[Rk+1|k]𝔼π[Rk+1|st])Y_{n}^{pred}=\sum_{k=1}^{n}\gamma^{k-1}(\mathbb{E}_{\pi}[R_{k+t}|\mathcal{F}_{k+t-1}]-\mathbb{E}_{\pi}[R_{k+t}|s_{t}])=\sum_{k=t}^{n-1+t}\gamma^{k-t}(\mathbb{E}_{\pi}[R_{k+1}|\mathcal{F}_{k}]-\mathbb{E}_{\pi}[R_{k+1}|s_{t}])

Now, we claim that 𝔼π[Rk+1|k]=R¯(sk,ak)\mathbb{E}_{\pi}[R_{k+1}|\mathcal{F}_{k}]=\overline{R}(s_{k},a_{k}). Indeed, by assumption 1 and since by definition n:=σ(sk,ak,hk,kn)\mathcal{F}_{n}:=\sigma(s_{k},a_{k},h_{k},k\leq n), we get:

𝔼π[Rk+1|k]=𝔼π[Rk+1|sm,am,hm:mk]\mathbb{E}_{\pi}[R_{k+1}|\mathcal{F}_{k}]=\mathbb{E}_{\pi}[R_{k+1}|s_{m},a_{m},h_{m}:m\leq k]
=𝕊R(sk,ak,s,h)H(sk,ak,s,dh)P(sk,ak,ds)=𝔼[Rk+1|sk,ak]=R¯(sk,ak)=\int_{\mathbb{S}}\int_{\mathbb{H}}R(s_{k},a_{k},s^{\prime},h^{\prime})H(s_{k},a_{k},s^{\prime},dh^{\prime})P(s_{k},a_{k},ds^{\prime})=\mathbb{E}[R_{k+1}|s_{k},a_{k}]=\overline{R}(s_{k},a_{k})

This yields for n1n\geq 1:

Ynpred=k=tn1+tγkt(R¯(sk,ak)𝔼π[Rk+1|st])=n+t,tπ,predY_{n}^{pred}=\sum_{k=t}^{n-1+t}\gamma^{k-t}(\overline{R}(s_{k},a_{k})-\mathbb{E}_{\pi}[R_{k+1}|s_{t}])=\mathcal{R}_{n+t,t}^{\pi,pred}

Similarly for n1n\geq 1:

Ynchaos=k=tn1+tγkt(Rk+1R¯(sk,ak))=n+t,tchaosY_{n}^{chaos}=\sum_{k=t}^{n-1+t}\gamma^{k-t}(R_{k+1}-\overline{R}(s_{k},a_{k}))=\mathcal{R}_{n+t,t}^{chaos}

Since YnpredY_{n}^{pred}, YnchaosY_{n}^{chaos} are respectively a predictable process and a martingale with respect to the filtration (𝒢n)n0(\mathcal{G}_{n})_{n\geq 0} by the Doob decomposition, and by definition 𝒢n:=n+t\mathcal{G}_{n}:=\mathcal{F}_{n+t}, we get that n,tπ,pred\mathcal{R}_{n,t}^{\pi,pred}, n,tchaos\mathcal{R}_{n,t}^{chaos} are respectively a predictable process and a martingale with respect to the filtration n\mathcal{F}_{n}, for nt+1n\geq t+1, and satisfy:

n,t𝔼π[n,t|st]=Ynt=Yntpred+Yntchaos=n,tπ,pred+n,tchaos\mathcal{R}_{n,t}-\mathbb{E}_{\pi}[\mathcal{R}_{n,t}|s_{t}]=Y_{n-t}=Y_{n-t}^{pred}+Y_{n-t}^{chaos}=\mathcal{R}_{n,t}^{\pi,pred}+\mathcal{R}_{n,t}^{chaos}

B.2 Proof of proposition 1

We have the Taylor expansion:

eβtchaos=1βtchaos+β22(tchaos)2+o(β2)e^{-\beta\mathcal{R}_{t}^{chaos}}=1-\beta\mathcal{R}_{t}^{chaos}+\frac{\beta^{2}}{2}(\mathcal{R}_{t}^{chaos})^{2}+o(\beta^{2})

We remind that the quadratic variation Mn\left<M_{n}\right> of a discrete-time, square-integrable martingale MM adapted to a filtration (n)n0(\mathcal{F}_{n})_{n\geq 0} satisfies 𝔼[Mn]=𝔼[Mn2]\mathbb{E}[\left<M_{n}\right>]=\mathbb{E}[M_{n}^{2}] and is given by:

Mn=k=1n𝔼[(MkMk1)2|k1]\left<M_{n}\right>=\sum_{k=1}^{n}\mathbb{E}[(M_{k}-M_{k-1})^{2}|\mathcal{F}_{k-1}]

By theorem 1, the process n,tchaos\mathcal{R}_{n,t}^{chaos} is a mean zero martingale, hence 𝔼π[tchaos|st]=0\mathbb{E}_{\pi}[\mathcal{R}_{t}^{chaos}|s_{t}]=0 by the dominated convergence theorem and the process Zn:=(n,tchaos)2n,tchaosZ_{n}:=(\mathcal{R}_{n,t}^{chaos})^{2}-\left<\mathcal{R}_{n,t}^{chaos}\right> is a mean zero martingale, where n,tchaos\left<\mathcal{R}_{n,t}^{chaos}\right> is the predictable quadratic variation of the martingale n,tchaos\mathcal{R}_{n,t}^{chaos}, given by:

n,tchaos=t=t(n1)tγ2(tt)𝔼[(Rt+1R¯(st,at))2|st,at]\left<\mathcal{R}_{n,t}^{chaos}\right>=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{2(t^{\prime}-t)}\mathbb{E}[(R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))^{2}|s_{t^{\prime}},a_{t^{\prime}}]

This yields 𝔼π[Z|st]=𝔼π[Z0|st]=0\mathbb{E}_{\pi}[Z_{\infty}|s_{t}]=\mathbb{E}_{\pi}[Z_{0}|s_{t}]=0, that is:

𝔼π[(tchaos)2|st]=𝔼π[tchaos|st]\mathbb{E}_{\pi}[(\mathcal{R}_{t}^{chaos})^{2}|s_{t}]=\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]

All terms put together we get:

ρstβ,π(tchaos)=β1ln(1+β22𝔼π[tchaos|st]+o(β2))=β2𝔼π[tchaos|st]+o(β)\rho^{\beta,\pi}_{s_{t}}(\mathcal{R}_{t}^{chaos})=\beta^{-1}\ln\left(1+\frac{\beta^{2}}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]+o(\beta^{2})\right)=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]+o(\beta)

We now proceed to proving that:

𝒞𝝆𝜷,𝝅[t](st)β1ln𝔼π[e2β2tchaos|st]\mathcal{C}_{\bm{\rho^{\beta,\pi}}}[\mathcal{R}_{t}](s_{t})\leq\beta^{-1}\ln\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}

Since the logarithm function is increasing, it is sufficient to prove that 𝔼π[eβtchaos|st]𝔼π[e2β2tchaos|st]\mathbb{E}_{\pi}[e^{-\beta\mathcal{R}_{t}^{chaos}}|s_{t}]\leq\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}. Since βn,tchaos-\beta\mathcal{R}_{n,t}^{chaos} is a (bounded) martingale, we get that Yn:=exp(2βn,tchaos2β2n,tchaos)Y_{n}:=\exp(-2\beta\mathcal{R}_{n,t}^{chaos}-2\beta^{2}\left<\mathcal{R}_{n,t}^{chaos}\right>) is a martingale, namely the exponential martingale associated to 2βn,tchaos-2\beta\mathcal{R}_{n,t}^{chaos}. We then have:

eβn,tchaos=eβn,tchaosβ2n,tchaoseβ2n,tchaose^{-\beta\mathcal{R}_{n,t}^{chaos}}=e^{-\beta\mathcal{R}_{n,t}^{chaos}-\beta^{2}\left<\mathcal{R}_{n,t}^{chaos}\right>}e^{\beta^{2}\left<\mathcal{R}_{n,t}^{chaos}\right>}

By Hölder’s inequality and taking the limit as nn\to\infty (using the dominated convergence theorem), we get:

𝔼π[eβtchaos|st]𝔼π[e2βtchaos2β2tchaosY|st]𝔼π[e2β2tchaos|st]\mathbb{E}_{\pi}[e^{-\beta\mathcal{R}_{t}^{chaos}}|s_{t}]\leq\sqrt{\mathbb{E}_{\pi}[\underbrace{e^{-2\beta\mathcal{R}_{t}^{chaos}-2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}}_{Y_{\infty}}|s_{t}]}\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}

Since YnY_{n} is a martingale, 𝔼π[Y|st]=𝔼π[Y0|st]=1\mathbb{E}_{\pi}[Y_{\infty}|s_{t}]=\mathbb{E}_{\pi}[Y_{0}|s_{t}]=1, which gives the result.

B.3 Proof of convergence of Chaotic Mean-Variance Q-Learning algorithm of theorem 3.

By theorem 2 we have:

Qπ𝕍(β)(st,at)=𝔼[β2(Rt+1R¯(st,at))2+Vπ𝕍(β)(st+1)|st,at]Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+V^{\mathbb{V}(\beta)}_{\pi}(s_{t+1})|s_{t},a_{t}]

We also have the classical Bellman equation:

Qπ(st,at)=𝔼[Rt+1+Vπ(st+1)|st,at]Q_{\pi}(s_{t},a_{t})=\mathbb{E}[R_{t+1}+V_{\pi}(s_{t+1})|s_{t},a_{t}]

Hence Qπβ(st,at):=Qπ(st,at)Qπ𝕍(β)(st,at)Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t}) (and its associated value function VπβV^{\beta}_{\pi}) satisfies the classical Bellman equation with modified rewards RβR^{\beta}:

Qπβ(st,at)=𝔼[Rt+1β+Vπβ(st+1)|st,at]Q^{\beta}_{\pi}(s_{t},a_{t})=\mathbb{E}[R^{\beta}_{t+1}+V^{\beta}_{\pi}(s_{t+1})|s_{t},a_{t}]

where:

Rt+1β:=Rt+1β2(Rt+1R¯(st,at))2R^{\beta}_{t+1}:=R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}

By assumption of the theorem, $\sum_{k=1}^{\infty}\alpha_{n_{k}(s,a)}=+\infty$, hence the index $n_{k}(s,a)$ of the $k^{th}$ visit to $(s,a)$ tends to $+\infty$ as $k\to+\infty$, and so every state-action pair is visited infinitely often. As a consequence, $N_{t}(s,a)\to+\infty$ as $t\to+\infty$ with probability 1 and, by the strong law of large numbers, $\overline{R}_{t}(s,a)\to\overline{R}(s,a)$ as $t\to+\infty$ with probability 1. This guarantees, in the proof of Dayan & Watkins (1992) and using their notations, that in step B.3 of Lemma B (Rewards and transition probabilities converge with probability 1) the expected rewards $\mathcal{R}_{s}^{(t)}(a)$ of the so-called action-replay process (ARP) tend as $t\to+\infty$ to the expected rewards $\mathbb{E}[R^{\beta}_{t+1}|s_{t}=s,a_{t}=a]$ of the real process. The rest of the convergence proof of Dayan & Watkins (1992) goes through unchanged. Note that in the latter reference the proof is given in the discounted reward case with $\gamma<1$, but their section 4 "Discussions and Conclusions" discusses the extension to the episodic case with $\gamma=1$.
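For concreteness, here is a minimal tabular sketch of the Q-learning update with modified rewards $R^{\beta}$ discussed above. The environment object env and its reset/step interface are hypothetical placeholders, and the learning-rate schedule is illustrative.

```python
import numpy as np

def chaotic_mv_q_learning(env, n_states, n_actions, beta=0.5, gamma=0.99,
                          episodes=500, eps=0.1, seed=0):
    """Sketch of tabular Q-learning on the modified reward
    R^beta = R - (beta/2) * (R - Rbar(s, a))**2, with Rbar a running mean.
    `env` is a hypothetical object with reset() -> s and step(a) -> (s', r, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    Rbar = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # running estimate of the conditional mean reward Rbar(s, a)
            N[s, a] += 1
            Rbar[s, a] += (r - Rbar[s, a]) / N[s, a]
            # modified reward penalizing reward uncertainty
            r_beta = r - 0.5 * beta * (r - Rbar[s, a]) ** 2
            alpha = 1.0 / np.sqrt(N[s, a])
            target = r_beta + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```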

B.4 Proof of proposition 2

We have by definition:

Qπθ𝕍(β)(s,a)=β2𝔼πθ[0chaos|s0=s,a0=a]Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s,a)=\frac{\beta}{2}\mathbb{E}_{\pi_{\theta}}[\left<\mathcal{R}_{0}^{chaos}\right>|s_{0}=s,a_{0}=a]

and $V^{\mathbb{V}(\beta)}_{\pi_{\theta}}$ the associated value function, as in definition 4. We have the equality below, whose proof is very similar to that of Prashanth & Ghavamzadeh (2016) (lemma 1); we postpone it to the end of the present proof for the reader's convenience:

θVπθ𝕍(β)(s0)=𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπθ𝕍(β)(S,A)]\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(S,A)]

where dγ2θ(s0,s,a):=πθ(a|s)t=0γ2tπθ[st=s|s0]d^{\theta}_{\gamma^{2}}(s_{0},s,a):=\pi_{\theta}(a|s)\sum_{t=0}^{\infty}\gamma^{2t}\mathbb{P}_{\pi_{\theta}}[s_{t}=s|s_{0}] is the action-state γ2\gamma^{2}-discounted visiting distribution. Similarly we have, with R¯\overline{R} replaced by R^\widehat{R}:

θVπθ,ψ𝕍(β)(s0)=𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπθ,ψ𝕍(β)(S,A)]\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(s_{0})=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(S,A)]

where:

Qπθ,ψ𝕍(β)(s,a):=β2𝔼πθ[0,ψchaos|s0=s,a0=a]Q^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(s,a):=\frac{\beta}{2}\mathbb{E}_{\pi_{\theta}}[\left<\mathcal{R}_{0,\psi}^{chaos}\right>|s_{0}=s,a_{0}=a]
0,ψchaos=t=0γ2t𝔼[(Rt+1R^ψ(st,at))2|st,at]\left<\mathcal{R}_{0,\psi}^{chaos}\right>=\sum_{t=0}^{\infty}\gamma^{2t}\mathbb{E}[(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}|s_{t},a_{t}]

By definition we have:

0,ψchaos0chaos=t=0γ2t(R^ψ(st,at)R¯(st,at))𝔼[R^ψ(st,at)+R¯(st,at)2Rt+1|st,at]\left<\mathcal{R}_{0,\psi}^{chaos}\right>-\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=0}^{\infty}\gamma^{2t}(\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t}))\mathbb{E}[\widehat{R}_{\psi}(s_{t},a_{t})+\overline{R}(s_{t},a_{t})-2R_{t+1}|s_{t},a_{t}]

But:

𝔼[R^ψ(st,at)+R¯(st,at)2Rt+1|st,at]=R^ψ(st,at)+R¯(st,at)2R¯(st,at)=R^ψ(st,at)R¯(st,at)\mathbb{E}[\widehat{R}_{\psi}(s_{t},a_{t})+\overline{R}(s_{t},a_{t})-2R_{t+1}|s_{t},a_{t}]=\widehat{R}_{\psi}(s_{t},a_{t})+\overline{R}(s_{t},a_{t})-2\overline{R}(s_{t},a_{t})=\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t})

Hence:

0,ψchaos0chaos=t=0γ2t(R^ψ(st,at)R¯(st,at))2\left<\mathcal{R}_{0,\psi}^{chaos}\right>-\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=0}^{\infty}\gamma^{2t}(\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t}))^{2}

which shows that:

Bψθ(s0)=β2𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)bπθψ(S,A)]B_{\psi}^{\theta}(s_{0})=\frac{\beta}{2}\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)b^{\psi}_{\pi_{\theta}}(S,A)]

with:

bπθψ(s,a):=𝔼πθ[t=0γ2t(R^ψ(st,at)R¯(st,at))2|s0=s,a0=a]b^{\psi}_{\pi_{\theta}}(s,a):=\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{\infty}\gamma^{2t}(\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t}))^{2}|s_{0}=s,a_{0}=a]

Finally, we prove as claimed earlier that:

θVπθ𝕍(β)(s0)=𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπθ𝕍(β)(S,A)]\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(S,A)]

The proof is very similar to that of Prashanth & Ghavamzadeh, (2016) (lemma 1). We first use Bellman equation for Qπθ𝕍(β)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}} (theorem 2):

Qπθ𝕍(β)(st,at)=𝔼[β2(Rt+1R¯(st,at))2+γ2Vπθ𝕍(β)(st+1)|st,at]Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+\gamma^{2}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t+1})|s_{t},a_{t}]

hence taking the gradient:

θQπθ𝕍(β)(st,at)=γ2𝔼[θVπθ𝕍(β)(st+1)|st,at]=γ2𝕊θVπθ𝕍(β)(s)P(st,at,s)𝑑s\nabla_{\theta}Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a_{t})=\gamma^{2}\mathbb{E}[\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t+1})|s_{t},a_{t}]=\gamma^{2}\int_{\mathbb{S}}\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s^{\prime})P(s_{t},a_{t},s^{\prime})ds^{\prime}

On the other hand by definition of the value function:

θVπθ𝕍(β)(st)=θ𝔸πθ(a|st)Qπθ𝕍(β)(st,a)𝑑a=𝔸θπθ(a|st)Qπθ𝕍(β)(st,a)𝑑a+𝔸πθ(a|st)θQπθ𝕍(β)(st,a)𝑑a\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t})=\nabla_{\theta}\int_{\mathbb{A}}\pi_{\theta}(a|s_{t})Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)da=\int_{\mathbb{A}}\nabla_{\theta}\pi_{\theta}(a|s_{t})Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)da+\int_{\mathbb{A}}\pi_{\theta}(a|s_{t})\nabla_{\theta}Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)da

Plugging in the expression obtained for θQπθ𝕍(β)(st,at)\nabla_{\theta}Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a_{t}) we get:

θVπθ𝕍(β)(st)=𝔸[θπθ(a|st)Qπθ𝕍(β)(st,a)+γ2πθ(a|st)𝕊θVπθ𝕍(β)(s)P(st,a,s)𝑑s]𝑑a\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t})=\int_{\mathbb{A}}\left[\nabla_{\theta}\pi_{\theta}(a|s_{t})Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)\right.\left.+\gamma^{2}\pi_{\theta}(a|s_{t})\int_{\mathbb{S}}\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s^{\prime})P(s_{t},a,s^{\prime})ds^{\prime}\right]da

After unrolling θVπθ𝕍(β)(s)\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s^{\prime}) infinitely many times we get:

θVπθ𝕍(β)(st)=𝔸𝕊t=tγ2(tt)θ[st=s|st]θπθ(a|s)Qπθ𝕍(β)(s,a)dsda\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t})=\int_{\mathbb{A}}\int_{\mathbb{S}}\sum_{t^{\prime}=t}^{\infty}\gamma^{2(t^{\prime}-t)}\mathbb{P}_{\theta}[s_{t^{\prime}}=s|s_{t}]\nabla_{\theta}\pi_{\theta}(a|s)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s,a)dsda

Since θπθ(a|s)=πθ(a|s)θlnπθ(a|s)\nabla_{\theta}\pi_{\theta}(a|s)=\pi_{\theta}(a|s)\nabla_{\theta}\ln\pi_{\theta}(a|s), and using the definition of dγ2θ(st,s,a)d^{\theta}_{\gamma^{2}}(s_{t},s,a), we get that θVπθ𝕍(β)(st)\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t}) is equal to:

𝔸𝕊θlnπθ(a|s)Qπθ𝕍(β)(s,a)dγ2θ(st,s,a)𝑑s𝑑a\int_{\mathbb{A}}\int_{\mathbb{S}}\nabla_{\theta}\ln\pi_{\theta}(a|s)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s,a)d^{\theta}_{\gamma^{2}}(s_{t},s,a)dsda

which completes the proof.

Appendix C Toy example of section 1.2: generalization and quantitative results

We slightly generalize the example in section 1.2 by considering the case where there are $N$ states, 2 actions, and at each state transition the probability of reaching state $n$ is given by $P(s_{t+1}=n|s_{t},a_{t})=P(s_{0}=n)=p_{n}\in[0,1]$ with $\sum_{n=1}^{N}p_{n}=1$, and the rewards are as follows:

  • if action 1 is chosen, the reward received at t+1t+1 if st=ns_{t}=n is the constant μn\mu_{n}\in\mathbb{R}.

  • if action 2 is chosen, the reward received at $t+1$ if $s_{t}=n$ is $\mu_{n}+\kappa_{n}+\sigma_{n}h_{t+1}$, where $(h_{t})$ are i.i.d. with zero mean and unit variance, and $\kappa_{n},\sigma_{n}\in\mathbb{R}$.

Hence the reward has the compact formulation:

R(st=n,at,st+1,ht+1)=μn+(κn+σnht+1)1{at=2}R(s_{t}=n,a_{t},s_{t+1},h_{t+1})=\mu_{n}+(\kappa_{n}+\sigma_{n}h_{t+1})1_{\{a_{t}=2\}}

where we denote the indicator function by $1_{\{\cdot\}}$. Example 1 below formulates quantitatively the informal discussion in section 1.2 by showing that, using cumulative reward variance as a measure of risk and provided the noise is small enough, a policy that leaves the agent exposed to a truly risky component $h_{t}$ (i.e. one that can hit any arbitrarily low value with positive probability) may be deemed less risky than a policy that does not; hence the "true risk" fails to be captured by this standard criterion.

Example 1.

(Risk fails to be captured using cumulative reward variance as a measure of risk). Let us consider the regime-switching example described above, denote by $\mathbb{V}$ the variance operator and introduce the classical variance penalty $v_{\pi}(\mathcal{R}_{0})$ (in this specific example the variance of the sum is equal to the sum of the variances, as the rewards $R_{t}$ are independent):

vπ(0):=𝕍π[t=1TRt]=t=1T𝕍π[Rt]v_{\pi}(\mathcal{R}_{0}):=\mathbb{V}_{\pi}\left[\sum_{t=1}^{T}R_{t}\right]=\sum_{t=1}^{T}\mathbb{V}_{\pi}[R_{t}]

If n=1Npnκn=0\sum_{n=1}^{N}p_{n}\kappa_{n}=0 and πi\pi_{i} is the policy that always selects action ii, we get:

vπ2(0)vπ1(0)=Tn=1Npn(κn2+2μnκn+σn2)v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})=T\sum_{n=1}^{N}p_{n}\left(\kappa_{n}^{2}+2\mu_{n}\kappa_{n}+\sigma_{n}^{2}\right)

In particular, $v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})$ can be negative. For example if $N=2$, $p_{n}=0.5$, $\mu_{2}=\mu_{1}+\delta$, $\kappa_{2}=-\delta\epsilon=-\kappa_{1}$, then $v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})=T\delta^{2}\left(\epsilon^{2}-\epsilon+\frac{1}{2\delta^{2}}(\sigma_{1}^{2}+\sigma_{2}^{2})\right)$, i.e. the latter is negative for suitably chosen $\epsilon$ provided the average noise is small enough, namely $\frac{1}{2}(\sigma_{1}^{2}+\sigma_{2}^{2})<\frac{1}{4}\delta^{2}$.

Proof.

We have, with π(n):=[at=2|st=n]\pi(n):=\mathbb{P}[a_{t}=2|s_{t}=n]:

t=1T𝕍π[Rt]=t=1Tn=1Npn𝔼π[Rt2|st1=n]t=1T(n=1Npn𝔼π[Rt|st1=n])2\sum_{t=1}^{T}\mathbb{V}_{\pi}[R_{t}]=\sum_{t=1}^{T}\sum_{n=1}^{N}p_{n}\mathbb{E}_{\pi}[R_{t}^{2}|s_{t-1}=n]-\sum_{t=1}^{T}(\sum_{n=1}^{N}p_{n}\mathbb{E}_{\pi}[R_{t}|s_{t-1}=n])^{2}
=Tn=1Npn(μn2+(κn2+σn2+2μnκn)π(n))T(n=1Npn(μn+π(n)κn))2=T\sum_{n=1}^{N}p_{n}(\mu_{n}^{2}+(\kappa_{n}^{2}+\sigma_{n}^{2}+2\mu_{n}\kappa_{n})\pi(n))-T(\sum_{n=1}^{N}p_{n}(\mu_{n}+\pi(n)\kappa_{n}))^{2}

By definition of π1\pi_{1} and π2\pi_{2}, we have π1(n)=0\pi_{1}(n)=0 and π2(n)=1\pi_{2}(n)=1 for all nn and hence:

vπ2(0)vπ1(0)=Tn=1Npn(κn2+σn2+2μnκn)v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})=T\sum_{n=1}^{N}p_{n}(\kappa_{n}^{2}+\sigma_{n}^{2}+2\mu_{n}\kappa_{n})
(n=1Npn(μn+κn))2+(n=1Npnμn)2-(\sum_{n=1}^{N}p_{n}(\mu_{n}+\kappa_{n}))^{2}+(\sum_{n=1}^{N}p_{n}\mu_{n})^{2}

Since by assumption n=1Npnκn=0\sum_{n=1}^{N}p_{n}\kappa_{n}=0, we get the desired result. If N=2N=2, pn=0.5p_{n}=0.5, μ2=μ1+δ\mu_{2}=\mu_{1}+\delta, κ2=δϵ=κ1\kappa_{2}=-\delta\epsilon=-\kappa_{1} then:

vπ2(0)vπ1(0)v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})
=T2(2δ2ϵ2+σ12+σ22+2μ1δϵ2δϵ(μ1+δ))=Tδ2(ϵ2+σ12+σ222δ2ϵ)=\frac{T}{2}(2\delta^{2}\epsilon^{2}+\sigma_{1}^{2}+\sigma_{2}^{2}+2\mu_{1}\delta\epsilon-2\delta\epsilon(\mu_{1}+\delta))=T\delta^{2}(\epsilon^{2}+\frac{\sigma_{1}^{2}+\sigma_{2}^{2}}{2\delta^{2}}-\epsilon)

The latter is a 2nd order polynomial in ϵ\epsilon with positive ϵ2\epsilon^{2} coefficient, hence it can take negative values if and only if it has two distinct real roots, i.e. if 12(σ12+σ22)<14δ2\frac{1}{2}(\sigma_{1}^{2}+\sigma_{2}^{2})<\frac{1}{4}\delta^{2}. ∎
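A quick Monte Carlo sanity check of example 1 is sketched below, with illustrative parameter values; the sampler simply re-implements the regime-switching rewards defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for the N = 2 case of example 1.
T, p = 20, np.array([0.5, 0.5])
delta, eps_ = 1.0, 0.5
mu = np.array([0.0, delta])
kappa = np.array([delta * eps_, -delta * eps_])   # sum_n p_n kappa_n = 0
sigma = np.array([0.2, 0.2])                      # small noise: (s1^2+s2^2)/2 < delta^2/4

def cumulative_reward(action, n_paths=200_000):
    states = rng.choice(2, size=(n_paths, T), p=p)
    r = mu[states].copy()
    if action == 2:
        r += kappa[states] + sigma[states] * rng.standard_normal((n_paths, T))
    return r.sum(axis=1)

v1 = cumulative_reward(1).var()
v2 = cumulative_reward(2).var()
analytic = T * (p * (kappa**2 + 2 * mu * kappa + sigma**2)).sum()
print(v2 - v1, analytic)
```

Both printed values are negative: the variance criterion ranks the policy exposed to the noise $h_{t}$ as the less risky one, in line with the statement of example 1.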

Proposition 3 below shows that the chaotic variance Vπ𝕍(β)V^{\mathbb{V}(\beta)}_{\pi} is proportional to the hidden noise, and in particular is zero in the absence of such noise. That is, chaotic variance captures the risky component hth_{t} contained in the rewards.

Proposition 3.

The chaotic variance (cf. definition 4) associated to the regime-switching example 1 is given by:

Vπ𝕍(β)=β2Tn=1Nσn2pnπ(n)V^{\mathbb{V}(\beta)}_{\pi}=\frac{\beta}{2}T\sum_{n=1}^{N}\sigma^{2}_{n}p_{n}\pi(n)

where π(n):=[at=2|st=n]\pi(n):=\mathbb{P}[a_{t}=2|s_{t}=n] and Vπ𝕍(β):=n=1NVπ𝕍(β)(n)pnV^{\mathbb{V}(\beta)}_{\pi}:=\sum_{n=1}^{N}V^{\mathbb{V}(\beta)}_{\pi}(n)p_{n}.

Proof.

By definition:

0chaos=t=1T𝔼π[(Rt𝔼π[Rt|st1,at1])2|st1,at1]\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=1}^{T}\mathbb{E}_{\pi}[(R_{t}-\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}])^{2}|s_{t-1},a_{t-1}]

By definition of $R_{t}$, and since $h_{t}$ has zero mean and is independent of $(s_{t-1},a_{t-1})$, we get:

\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}]=\mu_{s_{t-1}}+\kappa_{s_{t-1}}1_{\{a_{t-1}=2\}}

and hence:

Rt𝔼π[Rt|st1,at1]=1{at1=2}σst1htR_{t}-\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}]=1_{\{a_{t-1}=2\}}\sigma_{s_{t-1}}h_{t}

so that:

𝔼π[(Rt𝔼π[Rt|st1,at1])2|st1,at1]=1{at1=2}σst12\mathbb{E}_{\pi}[(R_{t}-\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}])^{2}|s_{t-1},a_{t-1}]=1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}

and therefore:

0chaos=t=1T1{at1=2}σst12\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=1}^{T}1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}

By definition the chaotic variance is given by Vπ𝕍(β)(n)=β2𝔼π[0chaos|s0=n]V^{\mathbb{V}(\beta)}_{\pi}(n)=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{0}^{chaos}\right>|s_{0}=n], and hence:

Vπ𝕍(β)(n)=β2𝔼π[1{a0=2}σs02|s0=n]+β2t=2T𝔼π[1{at1=2}σst12|s0=n]V^{\mathbb{V}(\beta)}_{\pi}(n)=\frac{\beta}{2}\mathbb{E}_{\pi}[1_{\{a_{0}=2\}}\sigma^{2}_{s_{0}}|s_{0}=n]+\frac{\beta}{2}\sum_{t=2}^{T}\mathbb{E}_{\pi}[1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}|s_{0}=n]

But 𝔼π[1{a0=2}σs02|s0=n]=π(n)σn2\mathbb{E}_{\pi}[1_{\{a_{0}=2\}}\sigma^{2}_{s_{0}}|s_{0}=n]=\pi(n)\sigma^{2}_{n} and for t2t\geq 2:

𝔼π[1{at1=2}σst12|s0=n]=k=1Nπ(k)σk2pk\mathbb{E}_{\pi}[1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}|s_{0}=n]=\sum_{k=1}^{N}\pi(k)\sigma^{2}_{k}p_{k}

Plugging in the latter expression we get all in all:

Vπ𝕍(β)(n)=β2π(n)σn2+β2(T1)k=1Nπ(k)σk2pkV^{\mathbb{V}(\beta)}_{\pi}(n)=\frac{\beta}{2}\pi(n)\sigma^{2}_{n}+\frac{\beta}{2}(T-1)\sum_{k=1}^{N}\pi(k)\sigma^{2}_{k}p_{k}

Since by definition Vπ𝕍(β)=n=1NVπ𝕍(β)(n)pnV^{\mathbb{V}(\beta)}_{\pi}=\sum_{n=1}^{N}V^{\mathbb{V}(\beta)}_{\pi}(n)p_{n} we get eventually:

Vπ𝕍(β)=β2n=1Nπ(n)σn2pn+β2(T1)n=1Npnk=1Nπ(k)σk2pk=Tβ2n=1Npnπ(n)σn2V^{\mathbb{V}(\beta)}_{\pi}=\frac{\beta}{2}\sum_{n=1}^{N}\pi(n)\sigma^{2}_{n}p_{n}+\frac{\beta}{2}(T-1)\sum_{n=1}^{N}p_{n}\sum_{k=1}^{N}\pi(k)\sigma^{2}_{k}p_{k}=T\frac{\beta}{2}\sum_{n=1}^{N}p_{n}\pi(n)\sigma^{2}_{n}
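The closed-form expression of proposition 3 can also be checked by simulation; the sketch below uses illustrative values of $p_{n}$, $\sigma_{n}$ and $\pi(n)$ chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters for the regime-switching setup of example 1.
T, N = 20, 3
p = np.array([0.2, 0.5, 0.3])
sigma = np.array([0.1, 0.4, 0.8])
pi2 = np.array([0.9, 0.5, 0.1])   # pi(n) = P[a_t = 2 | s_t = n]
beta = 1.0

def chaotic_quadratic_variation(n_paths=200_000):
    # sum_t 1{a_{t-1}=2} sigma^2_{s_{t-1}}, sampled over trajectories
    states = rng.choice(N, size=(n_paths, T), p=p)
    take_risky = rng.random((n_paths, T)) < pi2[states]
    return (take_risky * sigma[states] ** 2).sum(axis=1)

mc = 0.5 * beta * chaotic_quadratic_variation().mean()
analytic = 0.5 * beta * T * (sigma**2 * p * pi2).sum()
print(mc, analytic)   # the two values agree up to Monte Carlo error
```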

Appendix D Average reward version of the Chaotic Mean-Variance Q-Learning algorithm (theorem 3)

In the average reward framework, some modifications are required since $\sum_{t=0}^{\infty}R_{t+1}$ no longer necessarily converges. We follow the spirit of Prashanth & Ghavamzadeh (2016) and start by defining

ρπ:=limn1nt=0nRt+1,~t:=t=t(Rt+1ρπ)\rho_{\pi}:=\lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n}R_{t+1},\hskip 8.53581pt\widetilde{\mathcal{R}}_{t}:=\sum_{t^{\prime}=t}^{\infty}(R_{t^{\prime}+1}-\rho_{\pi})
Qπ(st,at)=𝔼π[~t|st,at]Q_{\pi}(s_{t},a_{t})=\mathbb{E}_{\pi}[\widetilde{\mathcal{R}}_{t}|s_{t},a_{t}]

The chaotic variance process needs to be modified similarly according to a recentering by σπ\sigma_{\pi}:

σπ:=limn1nt=0n(Rt+1R¯(st,at))2\sigma_{\pi}:=\lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}
\mathcal{R}^{\mathbb{V}(\beta)}_{t}:=\frac{\beta}{2}\sum_{t^{\prime}=t}^{\infty}((R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))^{2}-\sigma_{\pi})
Qπ𝕍(β)(st,at)=𝔼π[t𝕍(β)|st,at]Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}_{\pi}[\mathcal{R}^{\mathbb{V}(\beta)}_{t}|s_{t},a_{t}]

In that case we get the following Bellman equations, similar to theorem 2, which will be used in theorem 4:

Qπ(st,at)=𝔼[Rt+1ρπ+γVπ(st+1)|st,at]Q_{\pi}(s_{t},a_{t})=\mathbb{E}[R_{t+1}-\rho_{\pi}+\gamma V_{\pi}(s_{t+1})|s_{t},a_{t}]
Qπ𝕍(β)(st,at)=𝔼[β2(Rt+1R¯(st,at))2β2σπ+γ2Vπ𝕍(β)(st+1)|st,at]Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}-\frac{\beta}{2}\sigma_{\pi}+\gamma^{2}V^{\mathbb{V}(\beta)}_{\pi}(s_{t+1})|s_{t},a_{t}]

We present below the chaotic mean-variance version of the R-learning algorithm of Mahadevan (1996). The algorithm is similar to the one presented in the episodic case (theorem 3), except that the rewards and the chaotic variance have to be recentered by their ergodic mean values as specified above. The mean values $\rho_{\pi}$ and $\sigma_{\pi}$ have to be estimated during the course of the algorithm.

Theorem 4.

(Chaotic Mean-Variance R-Learning in the average reward case). Let $Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$ and let $(s_{t})$, $(a_{t})$ and $(R_{t+1})$ denote the successive states, actions and rewards observed by the agent. Let $(\alpha^{(1)}_{t})$, $(\alpha^{(2)}_{t})$ be two sequences of learning rates. The Chaotic Mean-Variance R-Learning algorithm is given by:

Nt(s,a)=Nt1(s,a)+1\displaystyle N_{t}(s,a)=N_{t-1}(s,a)+1
R¯t(s,a)=R¯t1(s,a)+1Nt(s,a)(Rt+1R¯t1(s,a))\displaystyle\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)+\frac{1}{N_{t}(s,a)}(R_{t+1}-\overline{R}_{t-1}(s,a))
Qtβ(s,a)=(1αt(1))Qt1β(s,a)+αt(1)(Rt+1ρt112β((Rt+1R¯t(s,a))2σt1)+maxaQt1β(st+1,a))\displaystyle Q^{\beta}_{t}(s,a)=(1-\alpha^{(1)}_{t})Q^{\beta}_{t-1}(s,a)+\alpha^{(1)}_{t}(R_{t+1}-\rho_{t-1}-\frac{1}{2}\beta((R_{t+1}-\overline{R}_{t}(s,a))^{2}-\sigma_{t-1})+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime}))
\displaystyle\rho_{t}=(1-\alpha^{(2)}_{t})\rho_{t-1}+\alpha^{(2)}_{t}(R_{t+1}+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime})-\max_{a^{\prime}}Q^{\beta}_{t-1}(s,a^{\prime}))
\displaystyle\sigma_{t}=(1-\alpha^{(2)}_{t})\sigma_{t-1}+\alpha^{(2)}_{t}((R_{t+1}-\overline{R}_{t}(s,a))^{2}+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime})-\max_{a^{\prime}}Q^{\beta}_{t-1}(s,a^{\prime}))

if $s_{t}=s$ and $a_{t}=a$; otherwise $Q^{\beta}_{t}(s,a)=Q^{\beta}_{t-1}(s,a)$, $N_{t}(s,a)=N_{t-1}(s,a)$ and $\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)$.
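A minimal sketch of these updates is given below; the environment interface env.reset/env.step is a hypothetical placeholder, the learning-rate schedules are illustrative, and, as in theorem 3, $\overline{R}_{t}$ is estimated by a running mean.

```python
import numpy as np

def chaotic_mv_r_learning(env, n_states, n_actions, beta=0.5,
                          steps=100_000, eps=0.1, seed=0):
    """Sketch of the Chaotic Mean-Variance R-Learning updates of theorem 4."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    Rbar = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    rho, sig = 0.0, 0.0
    s = env.reset()                       # hypothetical interface: reset() -> s
    for t in range(1, steps + 1):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = env.step(a)           # hypothetical interface: step(a) -> (s', r)
        alpha1, alpha2 = 1.0 / np.sqrt(t), 1.0 / t   # illustrative learning rates
        # running mean estimate of the conditional expected reward
        N[s, a] += 1
        Rbar[s, a] += (r - Rbar[s, a]) / N[s, a]
        chaotic = (r - Rbar[s, a]) ** 2
        max_next, max_cur = Q[s_next].max(), Q[s].max()
        # Q update on the recentered, chaotic-variance-penalized reward
        Q[s, a] += alpha1 * (r - rho - 0.5 * beta * (chaotic - sig)
                             + max_next - Q[s, a])
        # estimates of the ergodic means rho_pi and sigma_pi
        rho += alpha2 * (r + max_next - max_cur - rho)
        sig += alpha2 * (chaotic + max_next - max_cur - sig)
        s = s_next
    return Q, rho, sig
```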

Appendix E Episodic Monte Carlo Policy gradient algorithms for additional risk measures: the chaotic CVaR and Sharpe Ratio cases

Here we discuss the extension of CMV-REINFORCE algorithm 3 to the chaotic Sharpe ratio and CVaR cases.

E.1 The chaotic Sharpe Ratio case

The chaotic Sharpe ratio - which we seek to maximize - is defined as:

𝒞πSh[t](s0):=Vπ(s0)Vπ𝕍(s0)\mathcal{C}_{\pi}^{Sh}[\mathcal{R}_{t}](s_{0}):=\frac{V_{\pi}(s_{0})}{\sqrt{V^{\mathbb{V}}_{\pi}(s_{0})}}

where we remind that Vπ(st):=Eπ[t|st]V_{\pi}(s_{t}):=E_{\pi}[\mathcal{R}_{t}|s_{t}], Vπ𝕍(st):=Vπ𝕍(2)(st)=𝔼π[tchaos|st]V^{\mathbb{V}}_{\pi}(s_{t}):=V^{\mathbb{V}(2)}_{\pi}(s_{t})=\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}], and we assume that Vπ𝕍(s0)>0V^{\mathbb{V}}_{\pi}(s_{0})>0 π\mathbb{P}_{\pi}-a.s. Taking the gradient, we get:

θ𝒞πSh[t](s0)=θVπ(s0)Vπ𝕍(s0)12Vπ(s0)θVπ𝕍(s0)Vπ𝕍(s0)3/2\nabla_{\theta}\mathcal{C}_{\pi}^{Sh}[\mathcal{R}_{t}](s_{0})=\frac{\nabla_{\theta}V_{\pi}(s_{0})}{\sqrt{V^{\mathbb{V}}_{\pi}(s_{0})}}-\frac{1}{2}\frac{V_{\pi}(s_{0})\nabla_{\theta}V^{\mathbb{V}}_{\pi}(s_{0})}{V^{\mathbb{V}}_{\pi}(s_{0})^{3/2}}

Provided $V_{\pi}(s_{0})$ and $V^{\mathbb{V}}_{\pi}(s_{0})$ are known, we can compute unbiased estimates of the gradients $\nabla_{\theta}V_{\pi}(s_{0})$, $\nabla_{\theta}V^{\mathbb{V}}_{\pi}(s_{0})$ as done in algorithm 3, in particular updating $\widehat{R}_{\psi}$ the same way. In order to estimate $V_{\pi}(s_{0})$ and $V^{\mathbb{V}}_{\pi}(s_{0})$, we use the techniques developed in Tamar et al. (2012) (theorem 4.3), which rely on two timescales: $V_{\pi}(s_{0})$ and $V^{\mathbb{V}}_{\pi}(s_{0})$ are estimated on the faster timescale $\alpha^{(2)}_{n}$, so that they can be considered as converged when the $\theta$ update is performed on the slower timescale $\alpha^{(1)}_{n}$. We impose $\lim_{n\to\infty}\frac{\alpha^{(1)}_{n}}{\alpha^{(2)}_{n}}=0$ and we perform the fast timescale updates:

V_{n+1}=V_{n}+\frac{\alpha^{(2)}_{n}}{B}\sum_{b=1}^{B}\left(\sum_{t=0}^{T_{b}-1}\gamma^{t}R^{(b)}_{t+1}-V_{n}\right)
V^{\mathbb{V}}_{n+1}=V^{\mathbb{V}}_{n}+\frac{\alpha^{(2)}_{n}}{B}\sum_{b=1}^{B}\left(\sum_{t=0}^{T_{b}-1}\gamma^{2t}(R^{(b)}_{t+1}-\widehat{R}_{\psi}(s^{(b)}_{t},a^{(b)}_{t}))^{2}-V^{\mathbb{V}}_{n}\right)

The unbiased estimates of the gradients are given as in algorithm 3 by:

Vn=1Bb=1Bt=0Tb1lnπθn(at(b)|st(b))t=tTb1γtRt+1(b)\nabla V_{n}=\frac{1}{B}\sum_{b=1}^{B}\sum_{t^{\prime}=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t^{\prime}}|s^{(b)}_{t^{\prime}})\sum_{t=t^{\prime}}^{T_{b}-1}\gamma^{t}R^{(b)}_{t+1}
Vn𝕍=1Bb=1Bt=0Tb1lnπθn(at(b)|st(b))t=tTb1γ2t(Rt+1(b)R^ψ(st(b),at(b)))2\nabla V^{\mathbb{V}}_{n}=\frac{1}{B}\sum_{b=1}^{B}\sum_{t^{\prime}=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t^{\prime}}|s^{(b)}_{t^{\prime}})\sum_{t=t^{\prime}}^{T_{b}-1}\gamma^{2t}(R^{(b)}_{t+1}-\widehat{R}_{\psi}(s^{(b)}_{t},a^{(b)}_{t}))^{2}

Finally, θ\theta is updated on the slow timescale by:

θn+1=θn+αn(1)(VnVn𝕍12VnVn𝕍Vn𝕍,3/2)\theta_{n+1}=\theta_{n}+\alpha^{(1)}_{n}\left(\frac{\nabla V_{n}}{\sqrt{V^{\mathbb{V}}_{n}}}-\frac{1}{2}\frac{V_{n}\nabla V^{\mathbb{V}}_{n}}{V^{\mathbb{V},3/2}_{n}}\right)

For completeness we present the assumptions under which the above algorithm is guaranteed to converge, cf. theorem 4.3 in Tamar et al. (2012), whose proof uses two timescales and an ODE-based approach:

  • limnαn(1)αn(2)=0\lim_{n\to\infty}\frac{\alpha^{(1)}_{n}}{\alpha^{(2)}_{n}}=0, and for j=1,2j=1,2: n=0αn(j)=+\sum_{n=0}^{\infty}\alpha^{(j)}_{n}=+\infty, n=0αn(j)2<+\sum_{n=0}^{\infty}\alpha^{(j)2}_{n}<+\infty.

  • Assumptions 1, 2 hold true (which guarantee in particular that VπV_{\pi} and Vπ𝕍V^{\mathbb{V}}_{\pi} are uniformly bounded).

  • For all θ\theta, the objective function fθf_{\theta} has bounded second derivatives. Furthermore, the set of local optima of fθf_{\theta} is countable. Here the objective function fθf_{\theta} is defined as:

    fθ(s)=𝒞πθSh[t](s)f_{\theta}(s)=\mathcal{C}_{\pi_{\theta}}^{Sh}[\mathcal{R}_{t}](s)
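A compact sketch of one iteration of the two-timescale scheme above follows. The per-episode returns, chaotic sums and score sums are assumed to have been computed from a batch of rollouts as in algorithm 3; for brevity the gradient estimates use the full-trajectory score times the episode total (also unbiased) rather than the reward-to-go form, and all names and step sizes are illustrative.

```python
import numpy as np

def sharpe_update(theta, V, V_var, returns, chaotic_sums, scores,
                  alpha_slow=1e-3, alpha_fast=1e-2):
    """One iteration of the two-timescale chaotic Sharpe-ratio scheme.
    returns[b]      : discounted cumulative reward of episode b
    chaotic_sums[b] : sum_t gamma^(2t) (R_{t+1} - Rhat_psi(s_t, a_t))^2 of episode b
    scores[b]       : sum_t grad log pi_theta(a_t | s_t) of episode b, shape (B, dim theta)
    V, V_var        : running estimates of V_pi(s_0) and V^V_pi(s_0) (fast timescale)."""
    # fast timescale: value and chaotic-variance estimates
    V = V + alpha_fast * (returns.mean() - V)
    V_var = V_var + alpha_fast * (chaotic_sums.mean() - V_var)
    # REINFORCE-style gradient estimates of V_pi(s_0) and V^V_pi(s_0)
    grad_V = (scores * returns[:, None]).mean(axis=0)
    grad_V_var = (scores * chaotic_sums[:, None]).mean(axis=0)
    # slow timescale: ascend the chaotic Sharpe ratio
    theta = theta + alpha_slow * (grad_V / np.sqrt(V_var)
                                  - 0.5 * V * grad_V_var / V_var ** 1.5)
    return theta, V, V_var

# tiny synthetic call with made-up batch statistics, just to show the shapes
rng = np.random.default_rng(0)
theta, V, V_var = np.zeros(4), 1.0, 1.0
theta, V, V_var = sharpe_update(theta, V, V_var,
                                returns=rng.normal(1.0, 0.5, size=32),
                                chaotic_sums=rng.uniform(0.5, 1.5, size=32),
                                scores=rng.normal(size=(32, 4)))
```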

E.2 The chaotic CVaR case

We recall that the $CVaR_{\beta}$ of a random variable $X$ is defined as $CVaR_{\beta}(X):=\mathbb{E}[X|X\leq Var_{\beta}(X)]$, where the value-at-risk $Var_{\beta}(X)=F_{X}^{-1}(\beta)$ and $F_{X}$ is the cumulative distribution function of $X$. We adapt the work of Chow et al. (2018) (algorithm 1) to the chaotic case, i.e. the case where $X=\mathcal{R}_{t}^{chaos}$ is the chaotic reward process. In order to obtain an unbiased estimate of the gradient of

𝒞πθCVaR(β)[t](s0):=𝔼πθ[tchaos(s0)|st,tchaos(s0)Varβ(tchaos(s0))]\mathcal{C}_{\pi_{\theta}}^{CVaR(\beta)}[\mathcal{R}_{t}](s_{0}):=\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{t}^{chaos}(s_{0})|s_{t},\mathcal{R}_{t}^{chaos}(s_{0})\leq Var_{\beta}(\mathcal{R}_{t}^{chaos}(s_{0}))]

we first estimate $Var_{\beta}(\mathcal{R}_{t}^{chaos}(s_{t}))$, as in Chow et al. (2018) (algorithm 1), on a fast timescale $\alpha^{(2)}_{n}$ so that it can be considered as converged when performing the $\theta$ update:

Varn+1=Varnαn(2)(1(1β)11Bb=1B1{ZbVarn})Var_{n+1}=Var_{n}-\alpha^{(2)}_{n}(1-(1-\beta)^{-1}\frac{1}{B}\sum_{b=1}^{B}1\{Z_{b}\geq Var_{n}\})
Zb:=t=0Tb1γ2t(Rt+1(b)R^ψ(st(b),at(b)))Z_{b}:=\sum_{t=0}^{T_{b}-1}\gamma^{2t}(R^{(b)}_{t+1}-\widehat{R}_{\psi}(s^{(b)}_{t},a^{(b)}_{t}))

We can then compute an unbiased estimate of the gradient CVaR^\widehat{\nabla CVaR} of 𝒞πθCVaR(β)[t](s0)\nabla\mathcal{C}_{\pi_{\theta}}^{CVaR(\beta)}[\mathcal{R}_{t}](s_{0}) as:

CVaRn^=1Bb=1Blnπθn,b(ZbVarn)1{ZbVarn}\widehat{\nabla CVaR_{n}}=\frac{1}{B}\sum_{b=1}^{B}\nabla\ln\pi_{\theta_{n},b}(Z_{b}-Var_{n})1\{Z_{b}\geq Var_{n}\}
lnπθn,b:=t=0Tb1lnπθn(at(b)|st(b))\nabla\ln\pi_{\theta_{n},b}:=\sum_{t=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t}|s^{(b)}_{t})

Eventually, θ\theta is updated on the slow timescale αn(1)\alpha^{(1)}_{n} as:

θn+1=θn+αn(1)(1Bb=1BJbCVaRn^)\theta_{n+1}=\theta_{n}+\alpha^{(1)}_{n}(\frac{1}{B}\sum_{b=1}^{B}J_{b}-\widehat{\nabla CVaR_{n}})

where JbJ_{b} is as in algorithm 3.
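For completeness, here is a compact sketch of one iteration of the chaotic CVaR scheme above, operating on batch quantities assumed to have been computed from rollouts as in algorithm 3 (the per-episode gradient terms $J_{b}$, the chaotic sums $Z_{b}$ and the score sums); names and step sizes are illustrative.

```python
import numpy as np

def chaotic_cvar_update(theta, var_est, J_grads, Z, score_sums,
                        beta=0.95, alpha_slow=1e-3, alpha_fast=1e-2):
    """One iteration of the chaotic CVaR policy gradient sketch.
    J_grads[b]    : per-episode REINFORCE gradient term, as J_b in algorithm 3 (shape (B, d))
    Z[b]          : chaotic sum  sum_t gamma^(2t) (R_{t+1} - Rhat_psi(s_t, a_t))  of episode b
    score_sums[b] : sum_t grad log pi_theta(a_t | s_t) of episode b (shape (B, d))
    var_est       : running VaR estimate, updated on the fast timescale."""
    tail = (Z >= var_est).astype(float)
    # fast timescale: quantile (VaR) estimate of the chaotic sums
    var_est = var_est - alpha_fast * (1.0 - tail.mean() / (1.0 - beta))
    # estimate of the CVaR gradient
    grad_cvar = (score_sums * ((Z - var_est) * tail)[:, None]).mean(axis=0)
    # slow timescale: trade expected reward against the chaotic CVaR penalty
    theta = theta + alpha_slow * (J_grads.mean(axis=0) - grad_cvar)
    return theta, var_est

# tiny synthetic call with made-up batch data, just to show the shapes
rng = np.random.default_rng(0)
theta, var_est = np.zeros(4), 0.0
theta, var_est = chaotic_cvar_update(theta, var_est,
                                     J_grads=rng.normal(size=(32, 4)),
                                     Z=rng.normal(size=32),
                                     score_sums=rng.normal(size=(32, 4)))
```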

Appendix F Policy Gradient Actor-Critic algorithms

In the episodic, discounted or average reward settings, we can use the Bellman equations in theorem 2 to adapt the work of e.g. Prashanth & Ghavamzadeh (2016) (variance case) or Chow et al. (2018) (CVaR case) in order to derive chaotic versions of their actor-critic algorithms.

In the chaotic mean-variance discounted reward case, and under assumption 2, the policy gradient theorem gives us for the expected reward and risk-sensitive gradients:

θ𝔼π[0|s0]=𝔼(S,A)dγθ(s0)[lnπθ(A|S)Qπ(S,A)]\displaystyle\nabla_{\theta}\mathbb{E}_{\pi}[\mathcal{R}_{0}|s_{0}]=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q_{\pi}(S,A)] (3)
θVπθ𝕍(β)(s0)=β2𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπ𝕍(β)(S,A)]\displaystyle\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})=\frac{\beta}{2}\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi}(S,A)]

where $d^{\theta}_{\gamma^{n}}(s_{0},s,a):=\sum_{t=0}^{\infty}\gamma^{nt}\pi_{\theta}(a|s)\mathbb{P}_{\pi_{\theta}}[s_{t}=s|s_{0}]$ is the action-state $\gamma^{n}$-discounted visiting distribution. The proof of the above is very similar to that of Prashanth & Ghavamzadeh (2016) (lemma 1), and we have derived it in section B.4 for the convenience of the reader. In the average reward case, under assumption 2 and using the functions $Q_{\pi}$ and chaotic variance $V^{\mathbb{V}(\beta)}_{\pi}$ defined in section D, we get the same equalities except that $(S,A)\sim(d^{\theta},\pi_{\theta})$, where $d^{\theta}$ is the stationary distribution of the underlying Markov chain generated by having actions follow $\pi_{\theta}$. The proof of this result is very similar to that of equations (3) and is given in e.g. Prashanth & Ghavamzadeh (2016), lemma 3.

Note that in equations (3), the $Q$ functions can, as usual, be replaced by the corresponding advantage functions by subtracting the value functions.

In order to derive online Actor-Critic algorithms, we follow the approach in Prashanth & Ghavamzadeh (2016) by defining $V^{\phi}_{\pi}$, $V^{\phi,\mathbb{V}(\beta)}_{\pi}$ as parametric approximations of the value function and chaotic variance, where $\phi$ is a feature vector. We then apply theorem 2 to compute Temporal Differences that are used to estimate the advantage functions in equations (3), and to update the critic parameters.

Remark 2.

In the average reward setting, the policy gradient equations (3) can be used to derive an online Actor-Critic algorithm as in Prashanth & Ghavamzadeh (2016), because as tuples ($s_{t}$, $a_{t}$, $R_{t+1}$, $s_{t+1}$) are observed following $\pi_{\theta}$, we are guaranteed to end up in the limit in the stationary distribution $d^{\theta^{*}}$ of a local optimum $\theta^{*}$. In the episodic setting with $\gamma=1$, the same remark holds. Nevertheless, it is much less clear to the author how, in the infinite horizon discounted reward setting (with $\gamma<1$), the policy gradient equations (3) could be used rigorously, as they involve the $\gamma$- and $\gamma^{2}$-discounted visiting distributions. This subtlety is the focus of Thomas (2014) and was pointed out as well in Prashanth & Ghavamzadeh (2016), where the authors mention difficulties linked to the two policy gradient equations involving two distinct distributions, the $\gamma$- and $\gamma^{2}$-discounted visiting distributions, leading them to introduce new SF-based and SPSA-based algorithms.

In the average reward case we use the notations of section D to get algorithm 5, which is a modification of algorithm 2 in Prashanth & Ghavamzadeh (2016). The three differences are i) the use of the chaotic TD error $\delta_{2,t}$ based on the Bellman equation in theorem 2, ii) the update of the distributional function $\widehat{R}_{\psi}$ as it is done in algorithm 3 and iii) the use of the policy gradient equation (3) for the chaotic variance. In that case the value functions $V_{\pi}(s)$ and $V^{\mathbb{V}(\beta)}_{\pi}(s)$ are approximated by linear functions on lower dimensional spaces $\lambda_{1}^{T}\phi_{1}(s)$ and $\lambda_{2}^{T}\phi_{2}(s)$, respectively. The algorithm uses three timescales, so that parameters updated on the faster timescales can be considered as converged to their limiting value when considering updates on the slower timescales.

In the episodic setting with γ=1\gamma=1, we obtain algorithm 4, which is the episodic counterpart of algorithm 5.

It is to be noted that we can easily extend algorithms 4 and 5 to the chaotic Sharpe ratio case introduced in section E.1. To do so, we use the $\theta$ update adapted from section E.1:

\theta_{t+1}=\theta_{t}+\alpha^{(1)}_{t}\left(\frac{\varphi_{t}\delta_{1,t}}{\sqrt{\delta_{2,t}}}-\frac{1}{2}\frac{\delta_{1,t}\varphi_{t}\delta_{2,t}}{\delta_{2,t}^{3/2}}\right)

with φt:=lnπθ(at|st)\varphi_{t}:=\nabla\ln\pi_{\theta}(a_{t}|s_{t}).

We do not discuss here the Actor-Critic extensions of the chaotic risk framework to the infinite horizon discounted reward setting in the CVaR case (Chow et al. (2018)) and the mean-variance case (Prashanth & Ghavamzadeh (2016)), but they can be performed using techniques similar to those used in the present section and in sections E.1, E.2. In particular, the SF-based and SPSA-based techniques used in the above mentioned literature in the discounted reward framework go through with minor modifications, as in algorithm 5: i) computing the chaotic TD error $\delta_{2,t}$ based on the Bellman equation in theorem 2 and ii) computing the update of the distributional function $\widehat{R}_{\psi}$ as it is done in algorithm 3.

Algorithm 4 Online CMV-Actor-Critic (episodic case)

Input: Initial policy parameter θ0\theta_{0}, initial critic parameters λj,0\lambda_{j,0} (j=1..2j=1..2), learning rates αt(j)\alpha^{(j)}_{t} (j=1..3j=1..3), value function feature vectors ϕj\phi_{j} (j=1..2j=1..2), R^ψ\widehat{R}_{\psi} initialized to 0, Optional: experience replay table \mathcal{E}
Output: Approximation of the optimal policy parameter θ\theta^{*}

1:while θ\theta not converged do
2:     observe tuple (sts_{t}, ata_{t}, Rt+1R_{t+1}, st+1)s_{t+1}) following πθ(|)\pi_{\theta}(\cdot|\cdot)
3:     In tabular case: use tuple to update R^ψ\widehat{R}_{\psi} with eq. (2); else: increment \mathcal{E} with the tuple and update R^ψ\widehat{R}_{\psi} using algorithm 2.
4:     TD errors:
5:     δ1,t=Rt+1+λ1,tTϕ1(st+1)λ1,tTϕ1(st)\delta_{1,t}=R_{t+1}+\lambda_{1,t}^{T}\phi_{1}(s_{t+1})-\lambda_{1,t}^{T}\phi_{1}(s_{t})
6:     δ2,t=(Rt+1R^ψ(st,at))2+λ2,tTϕ2(st+1)λ2,tTϕ2(st)\delta_{2,t}=(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}+\lambda_{2,t}^{T}\phi_{2}(s_{t+1})-\lambda_{2,t}^{T}\phi_{2}(s_{t})
7:     Critic Updates:
8:     λ1,t+1=λ1,t+αt(2)δ1,tϕ1(st)\lambda_{1,t+1}=\lambda_{1,t}+\alpha^{(2)}_{t}\delta_{1,t}\phi_{1}(s_{t})
9:     λ2,t+1=λ2,t+αt(2)δ2,tϕ2(st)\lambda_{2,t+1}=\lambda_{2,t}+\alpha^{(2)}_{t}\delta_{2,t}\phi_{2}(s_{t})
10:     Policy Update:
11:     θt+1=θt+αt(1)lnπθ(at|st)(δ1,tβ2δ2,t)\theta_{t+1}=\theta_{t}+\alpha^{(1)}_{t}\nabla\ln\pi_{\theta}(a_{t}|s_{t})(\delta_{1,t}-\frac{\beta}{2}\delta_{2,t})
Algorithm 5 Online CMV-Actor-Critic (average reward case)

Input: Initial policy parameter θ0\theta_{0}, initial critic parameters λj,0\lambda_{j,0} (j=1..2j=1..2), learning rates αt(j)\alpha^{(j)}_{t} (j=1..3j=1..3), value function feature vectors ϕj\phi_{j} (j=1..2j=1..2), R^ψ\widehat{R}_{\psi} initialized to 0, Optional: experience replay table \mathcal{E}.
Output: Approximation of the optimal policy parameter θ\theta^{*}

1:while θ\theta not converged do
2:     observe tuple (sts_{t}, ata_{t}, Rt+1R_{t+1}, st+1)s_{t+1}) following πθ(|)\pi_{\theta}(\cdot|\cdot)
3:     In tabular case: use tuple to update R^ψ\widehat{R}_{\psi} with eq. (2); else: increment \mathcal{E} with the tuple and update R^ψ\widehat{R}_{\psi} using algorithm 2.
4:     Average rewards:
5:     ρt+1=(1αt(3))ρt+αt(3)Rt+1\rho_{t+1}=(1-\alpha^{(3)}_{t})\rho_{t}+\alpha^{(3)}_{t}R_{t+1}
6:     σt+1=(1αt(3))σt+αt(3)(Rt+1R^ψ(st,at))2\sigma_{t+1}=(1-\alpha^{(3)}_{t})\sigma_{t}+\alpha^{(3)}_{t}(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}
7:     TD errors:
8:     δ1,t=Rt+1ρt+1+λ1,tTϕ1(st+1)λ1,tTϕ1(st)\delta_{1,t}=R_{t+1}-\rho_{t+1}+\lambda_{1,t}^{T}\phi_{1}(s_{t+1})-\lambda_{1,t}^{T}\phi_{1}(s_{t})
9:     δ2,t=(Rt+1R^ψ(st,at))2σt+1+λ2,tTϕ2(st+1)λ2,tTϕ2(st)\delta_{2,t}=(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}-\sigma_{t+1}+\lambda_{2,t}^{T}\phi_{2}(s_{t+1})-\lambda_{2,t}^{T}\phi_{2}(s_{t})
10:     Critic Updates:
11:     λ1,t+1=λ1,t+αt(2)δ1,tϕ1(st)\lambda_{1,t+1}=\lambda_{1,t}+\alpha^{(2)}_{t}\delta_{1,t}\phi_{1}(s_{t})
12:     λ2,t+1=λ2,t+αt(2)δ2,tϕ2(st)\lambda_{2,t+1}=\lambda_{2,t}+\alpha^{(2)}_{t}\delta_{2,t}\phi_{2}(s_{t})
13:     Policy Update:
14:     θt+1=θt+αt(1)lnπθ(at|st)(δ1,tβ2δ2,t)\theta_{t+1}=\theta_{t}+\alpha^{(1)}_{t}\nabla\ln\pi_{\theta}(a_{t}|s_{t})(\delta_{1,t}-\frac{\beta}{2}\delta_{2,t})

Appendix G Numerical values used in experiments of section 5 and additional experiments

G.1 Grid World - section 5.1

We train the policy using $5\cdot 10^{5}$ timesteps for each value of $\beta$, using an $\epsilon$-greedy exploration policy with probability $\epsilon=0.1$ of taking a random action, a $Q$ table initialized to zero, a learning rate $\alpha_{t}=N(s_{t},a_{t})^{-0.5}$ (where $N(s_{t},a_{t})$ counts the number of visits to the state-action pair $(s_{t},a_{t})$), and we set the probability of the robot going in a random direction to $p_{error}=0.5$. We then run the trained policy for $10^{5}$ rollout steps in order to obtain the path heatmap and risk penalty heatmap displayed in figure 2. In the latter, the results are averaged over 25 NumPy RNG seeds (1001, 1003, 1006, 1008, 1009, 1010, 1015, 1018, 1021, 1022, 1025, 1028, 1029, 1031, 1033, 1035, 1037, 1038, 1039, 1040, 1041, 1043, 1047, 1048, 1049): for each such seed, we perform the training and obtain the rollouts, which are then averaged over seeds.

In figure 5 we display path heatmaps similar to those of figure 2, but for various values of $p_{error}$, the probability that the robot goes in a random direction. Figure 6 is similar to figure 5 but with a more severe negative reward of -50 instead of -20. As expected, in figure 5, the higher $p_{error}$, the more reward uncertainty/stochasticity there is and the further away from the -20 reward the robot goes. The same observation holds for the more severe reward of -50 (figure 6).


Figure 5: Path heatmaps over 10510^{5} rollout steps with the learned policy for various β\beta and perrorp_{error}


Figure 6: Path heatmaps over 10510^{5} rollout steps with the learned policy for various β\beta and perrorp_{error}, and more severe negative reward of -50 instead of -20.

G.2 Portfolio optimization - section 5.2

As described in section 5.2, the action is defined as the quantities invested in the risk-free and risky assets at:=(qtRF,qtR)a_{t}:=(q^{RF}_{t},q^{R}_{t}) with a total budget constraint qtRF+qtRqmax=5q^{RF}_{t}+q^{R}_{t}\leq q_{max}=5 and qtRF,qtR0q^{RF}_{t},q^{R}_{t}\geq 0. This yields a total of 21 possible actions (0,0)(0,0), (1,0)(1,0), (0,1)(0,1),…, (4,1)(4,1), (1,4)(1,4), (0,5)(0,5), (5,0)(5,0).

In order to train the risk-sensitive policy in section 5.2 according to the CMV-REINFORCE algorithm 3, we use a learning rate α=0.1\alpha=0.1, batch size B=104B=10^{4} and perform M=5103M=5\cdot 10^{3} iterations on the policy parameter θ\theta. We use the Boltzmann exploration policy:

πθ(a|s)=eθTϕ(s,a)a𝔸eθTϕ(s,a)\pi_{\theta}(a|s)=\frac{e^{\theta^{T}\phi(s,a)}}{\sum_{a^{\prime}\in\mathbb{A}}e^{\theta^{T}\phi(s,a^{\prime})}}

where $\theta$ is of size $|\mathbb{S}|\cdot|\mathbb{A}|=3\cdot 21=63$ and $\phi(s,a)$ is the vector with all entries equal to zero except the entry corresponding to $(s,a)$, which is set to 1. For each value of $\beta$, we train the policy as described above and plot in figures 3 and 4 the corresponding metrics over $5\cdot 10^{4}$ rollout episodes of $T=20$ timesteps.
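A minimal sketch of this tabular softmax (Boltzmann) parametrization with one-hot features, and of its score function $\nabla_{\theta}\ln\pi_{\theta}(a|s)$ used by the policy gradient, is given below; variable names are illustrative.

```python
import numpy as np

n_states, n_actions = 3, 21
rng = np.random.default_rng(0)
theta = np.zeros(n_states * n_actions)   # one parameter per (state, action) pair

def policy(theta, s):
    """Boltzmann policy with one-hot features phi(s, a): the logits are simply
    the theta entries associated to state s."""
    logits = theta.reshape(n_states, n_actions)[s]
    z = np.exp(logits - logits.max())    # subtract the max for numerical stability
    return z / z.sum()

def grad_log_policy(theta, s, a):
    """grad_theta log pi_theta(a | s) for the tabular softmax parametrization."""
    g = np.zeros_like(theta).reshape(n_states, n_actions)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g.ravel()

a = rng.choice(n_actions, p=policy(theta, s=1))   # sample an action in state 1
```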

The values of the risk-free rate μ\mu and volatility σ\sigma in each state are reported in table 1. The state transition matrices P(s,a,s)P(s,a,s^{\prime}) depending on the action aa (via the quantity invested in the risky asset qtRq^{R}_{t}) are displayed in table 2. They have been designed such that the more we invest in the risky asset, the more likely we are to reach a higher volatility state.

Table 1: Section 5.2 Portfolio Optimization: risk-free rate $\mu$ and volatility $\sigma$ in each state
Parameter \ State | LowVol | MediumVol | HighVol
$\mu$             | 0.2    | 0.6       | 1.0
$\sigma$          | 0.5    | 1.0       | 1.5
Table 2: Section 5.2 Portfolio Optimization: state transition matrix $P(s,a,s^{\prime})$ as a function of the chosen action (quantity $q^{R}_{t}$ invested in the risky asset); rows are the current state $s$, columns the next state $s^{\prime}$
$q^{R}_{t}=5$       | LowVol | MediumVol | HighVol
LowVol              | 0.05   | 0.25      | 0.7
MediumVol           | 0.05   | 0.25      | 0.7
HighVol             | 0.05   | 0.25      | 0.7
$2<q^{R}_{t}<5$     | LowVol | MediumVol | HighVol
LowVol              | 0.1    | 0.45      | 0.45
MediumVol           | 0.1    | 0.45      | 0.45
HighVol             | 0.1    | 0.45      | 0.45
$0<q^{R}_{t}\leq 2$ | LowVol | MediumVol | HighVol
LowVol              | 1/3    | 1/3       | 1/3
MediumVol           | 1/3    | 1/3       | 1/3
HighVol             | 1/3    | 1/3       | 1/3
$q^{R}_{t}=0$       | LowVol | MediumVol | HighVol
LowVol              | 0.5    | 0.45      | 0.05
MediumVol           | 0.5    | 0.45      | 0.05
HighVol             | 0.5    | 0.45      | 0.05

In figure 3 and 4, we compare CMV-REINFORCE (algorithm 3) with the mean-variance baseline. Denoting 𝕍πθ\mathbb{V}_{\pi_{\theta}} the variance operator, the gradient of the variance θ𝕍πθ[0|s0]\nabla_{\theta}\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}] needed in the mean-variance method is computed using the classical likelihood ratio technique. Indeed, 𝕍πθ[0|s0]=𝔼πθ[02|s0](𝔼πθ[0|s0])2\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]=\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}^{2}|s_{0}]-\left(\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]\right)^{2}, and hence:

θ𝕍πθ[0|s0]=θ𝔼πθ[02|s0]2𝔼πθ[0|s0]θ𝔼πθ[0|s0]\nabla_{\theta}\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]=\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}^{2}|s_{0}]-2\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]

The likelihood ratio technique then yields (Tamar et al. (2012), lemma 4.2):

θ𝕍πθ[0|s0]=1Bb=1B(Vb2μJJb)t=0Tb1lnπθ(at(b)|st(b))\displaystyle\nabla_{\theta}\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]=\frac{1}{B}\sum_{b=1}^{B}(V_{b}-2\mu_{J}J_{b})\sum_{t=0}^{T_{b}-1}\nabla\ln\pi_{\theta}(a^{(b)}_{t}|s^{(b)}_{t}) (4)

with:

Jb:=t=0Tb1γtRt+1(b),Vb:=Jb2,μJ:=1Bb=1BJbJ_{b}:=\sum_{t=0}^{T_{b}-1}\gamma^{t}R^{(b)}_{t+1},\hskip 5.69054ptV_{b}:=J_{b}^{2},\hskip 5.69054pt\mu_{J}:=\frac{1}{B}\sum_{b=1}^{B}J_{b}

Note that we additionally apply an optimal baseline $\ell^{*}$ (Necchi (2016), section 4.4.1.1) by replacing $V_{b}-2\mu_{J}J_{b}$ by $V_{b}-2\mu_{J}J_{b}-\ell^{*}$ in equation (4), with $\ell^{*}$ the vector whose $k^{th}$ component is given by:

k=𝔼πθ[(022μJ0)(θklnπθ)2]𝔼πθ[(θklnπθ)2]\ell^{*}_{k}=\frac{\mathbb{E}_{\pi_{\theta}}[(\mathcal{R}_{0}^{2}-2\mu_{J}\mathcal{R}_{0})(\partial_{\theta_{k}}\ln\pi_{\theta})^{2}]}{\mathbb{E}_{\pi_{\theta}}[(\partial_{\theta_{k}}\ln\pi_{\theta})^{2}]}
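A short sketch of this baselined likelihood-ratio estimator (4) is given below; the batch quantities $J_{b}$ and the per-episode score sums are assumed to have been precomputed from rollouts, $\ell^{*}$ is estimated componentwise from the same batch, and the small constant added to the denominator is only there to avoid division by zero in the sketch.

```python
import numpy as np

def variance_gradient(J, score_sums):
    """Likelihood-ratio estimate of grad_theta Var[R_0 | s_0] (equation (4)),
    with the componentwise optimal baseline l* estimated from the same batch.
    J[b]          : discounted cumulative reward of episode b
    score_sums[b] : sum_t grad log pi_theta(a_t | s_t) of episode b (shape (B, d))."""
    mu_J = J.mean()
    V = J ** 2
    target = V - 2.0 * mu_J * J                      # V_b - 2 mu_J J_b
    w = score_sums ** 2                              # (d_theta_k log pi)^2, per component
    l_star = (w * target[:, None]).mean(axis=0) / (w.mean(axis=0) + 1e-12)
    return (score_sums * (target[:, None] - l_star)).mean(axis=0)

# tiny synthetic call with made-up batch data, just to show the shapes
rng = np.random.default_rng(0)
g = variance_gradient(J=rng.normal(1.0, 0.5, size=64),
                      score_sums=rng.normal(size=(64, 10)))
```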