
Risk-Sensitive Reinforcement Learning: a Martingale Approach to Reward Uncertainty

Nelson Vadori, Sumitra Ganesh, Prashant Reddy, Manuela Veloso
JP Morgan AI Research
{nelson.n.vadori, sumitra.ganesh, prashant.reddy, manuela.veloso}@jpmorgan.com
Abstract

We introduce a novel framework to account for sensitivity to rewards uncertainty in sequential decision-making problems. While risk-sensitive formulations for Markov decision processes studied so far focus on the distribution of the cumulative reward as a whole, we aim at learning policies sensitive to the uncertain/stochastic nature of the rewards, which has the advantage of being conceptually more meaningful in some cases. To this end, we present a new decomposition of the randomness contained in the cumulative reward based on the Doob decomposition of a stochastic process, and introduce a new conceptual tool - the chaotic variation - which can rigorously be interpreted as the risk measure of the martingale component associated to the cumulative reward process. We innovate on the reinforcement learning side by incorporating this new risk-sensitive approach into model-free algorithms, both policy gradient and value function based, and illustrate its relevance on grid world and portfolio optimization problems.

1 Introduction

Classical reinforcement learning (RL) aims at deriving policies that maximize the expected value of the sum (or average) of future rewards, while so-called risk-sensitive RL aims at incorporating some measure of variability of the cumulative reward into the learned policy, usually through a risk criterion such as variance or a risk measure (e.g. entropic, CVaR). The common goal of the existing risk-sensitive RL literature (cf. section 1.3) is to take into account the distribution of the cumulative reward in order to learn a variety of policies, usually parametrized by a risk parameter such as the mean-variance trade-off, the CVaR percentile or an upper bound on variance. For example, in the mean-variance case, learned policies typically lead to a distribution of cumulative rewards with lower mean but also lower variance (hence we gain confidence in the outcome at the expense of its mean).

In the existing literature, the chosen risk criterion is applied to the cumulative reward as a whole, and hence does not distinguish between the different sources of randomness contained in it. The motivating example of section 1.2 and the Doob decomposition of section 2 show that the randomness contained in the cumulative reward actually consists of two components of different nature and with different practical interpretations. If we denote by $R_{\pi}(s_{t},s_{t+1}):=R(s_{t},\pi(s_{t}),s_{t+1})$ the reward obtained at time $t+1$ under policy $\pi$, the first, chaotic component exactly captures the non-deterministic nature of the reward, i.e. that it is uncertain (or stochastic) when action $a_{t}=\pi(s_{t})$ is chosen in state $s_{t}$ (as it depends on $s_{t+1}$). This component is zero if the reward is deterministic. The second, predictable component replaces the uncertain rewards by their predictable/deterministic projections $\mathbb{E}[R_{\pi}(s_{t},s_{t+1})|s_{t},\pi(s_{t})]$, and hence does not depend on their uncertainty/stochasticity but acts as if the reward received at time $t+1$ after choosing action $\pi(s_{t})$ were always what we predicted it to be based on information available at time $t$. This component possesses some variability due to the switching between states, but none due to the uncertain nature of the rewards. Interestingly, in the average reward case, these two components also appear in the limiting process of the central limit theorem for functionals of Markov chains, applied to the average reward $\frac{1}{n}\sum_{t=0}^{n}R_{\pi}(s_{t},s_{t+1})$. Indeed, this term, minus its long-range (ergodic) mean and rescaled by $\sqrt{n}$, converges in distribution as $n\to+\infty$ to the sum of two independent normal random variables with respective variances $\sigma_{deter.}^{2}$ and $\sigma_{chaotic}^{2}$, where the latter depends on the uncertainty of the rewards, whereas the former only depends on their deterministic projections $\mathbb{E}[R_{\pi}(s_{t},s_{t+1})|s_{t},\pi(s_{t})]$ as discussed above. We refer to the supplementary material for more details on this observation.

In this work, we develop a new approach, complementary to the existing literature, that learns policies sensitive to the variability contained in the chaotic component reflecting the uncertainty/stochasticity of the rewards discussed above. We work in a framework where the reward $R(s_{t},a_{t},s_{t+1},h_{t+1})$ received at $t+1$ after taking action $a_{t}$ is uncertain in the sense that it depends on the next state $s_{t+1}$ and/or possibly on extra "hidden" information $h_{t+1}$ (cf. the setting of Shen et al. (2014) for example).

We believe that this approach is of interest for the following reasons. First, as mentioned in Chow et al. (2018), deriving risk criteria that are conceptually meaningful is a topic of current research. In our case, sensitivity to the chaotic reward component can nicely be interpreted as sensitivity to the uncertainty of the future reward when an action $a_{t}$ is chosen, in other words to the difference between the actual reward received at $t+1$ and our best guess of it given information available at time $t$, which is a natural human behavior adopted in many real-world situations. For example, in section 5 we show on a portfolio optimization problem how classical mean-variance risk-sensitive RL counterintuitively leads to reducing investment in both the risky and the risk-free asset as the risk aversion parameter increases (so that we stop taking advantage of the risk-free asset), whereas our new chaotic mean-variance algorithm reduces investment in the risky asset only, which is intuitive since the evolution of the latter is unknown when the investment decision is made and hence can potentially yield significant losses.

Second, depending on the application in mind, mixing the above mentioned predictable and chaotic sources of randomness together and treating them as a single "noise" may lead to undesirable and counterintuitive learned policies, and a more subtle understanding of the randomness is needed in order to gain more interpretability in the learned risk-sensitive policies (cf. section 1.2). It was mentioned in Mannor & Tsitsiklis (2011) that variance as a risk criterion may sometimes lead to counterintuitive policies, in that an agent who has received unexpectedly large rewards may seek to incur losses in order to keep the variance small, since variance penalizes deviations of the cumulative reward from its expected value. Hence, Prashanth & Ghavamzadeh (2016) have considered per-period variance instead. An interesting observation is that when applied to the chaotic component only (which will prove to be a martingale), these two notions of overall and per-period variance are bridged by the concept of the quadratic variation $\left<M\right>$ of a martingale $M$, due to the equality $\mathbb{E}[M^{2}]=\mathbb{E}[\left<M\right>]$ (proposition 1): penalizing reward uncertainty/stochasticity will not force the cumulative reward towards some baseline level and hence will not lead to the counterintuitive policies mentioned above.

1.1 Main contributions

We provide in section 2 a novel, conceptually meaningful decomposition of the cumulative reward process that distinguishes between the different sources of randomness contained within it (chaotic and predictable, cf. discussion above). The key idea of the paper is to apply the Doob decomposition of a stochastic process to the cumulative reward process. We believe that this decomposition in itself is of interest in gaining a better understanding of the variability contained within the reward process.

We then introduce in section 3 a new definition of risk that exactly captures reward uncertainty risk, i.e. the risk related to its non-deterministic nature: the chaotic variation associated to the reward process (and to a risk measure).

We incorporate for the first time reward uncertainty risk into model-free value-function based and policy gradient algorithms: the related modified versions of some canonical RL algorithms, so-called Chaotic Mean-Variance algorithms (CMV), are presented in section 4. These algorithms are applied to grid world and portfolio optimization problems in section 5. Although the focus in the main text is on the chaotic mean-variance case, our analysis in section 3 allows for general risk measures and we discuss some of them in the supplementary, such as the chaotic CVaR and Sharpe ratio.

1.2 Motivating example

As mentioned above, we are interested in developing a general RL framework to account for reward uncertainty in learned policies. The example presented in this section illustrates what can go wrong when taking variance as a measure of risk, as is commonly the case in the risk-sensitive RL literature (cf. section 1.3). We consider an episodic toy example of $T$ timesteps with 2 states and 2 actions, where at each transition both states are equally likely, i.e. $P(s_{0}=2)=P(s_{0}=1)=P(s_{t+1}=1|s_{t},a_{t})=P(s_{t+1}=2|s_{t},a_{t})=\frac{1}{2}$. If $s_{t}=1$, the reward obtained at $t+1$ is 2 if $a_{t}=1$ and $4+\sigma h_{t+1}$ if $a_{t}=2$; if $s_{t}=2$, the rewards obtained at $t+1$ are 10 if $a_{t}=1$ and $8+\sigma h_{t+1}$ if $a_{t}=2$, where the $h_{t}$ are i.i.d. with zero mean and unit variance, and the noise level $\sigma\geq 0$. Think of action $a_{t}=2$ as investing in a risky stock at time $t$ and receiving the corresponding price change at time $t+1$. If $\pi_{i}$ is the policy that always selects action $i$, then $\pi_{1},\pi_{2}$ have the same expected cumulative reward of $6T$. We plot in figure 1 the rewards obtained by the 2 policies in each state.

It can be proved that if the noise $\sigma^{2}$ is small enough (cf. supplementary material for a precise quantitative statement), policy $\pi_{2}$ will be established as less risky than $\pi_{1}$ according to the variance criterion. Nevertheless, we know that the cumulative reward earned using $\pi_{1}$ over $T$ timesteps is bounded from below by $2T$ with probability 1. On the other hand, since $\pi_{2}$ contains the noise term $h_{t}$, there is always a positive probability that the reward hits any arbitrarily low value. That is, $\pi_{2}$ leaves the agent with a component that could constitute a true risk for him, whereas $\pi_{1}$ does not, yet $\pi_{1}$ is still established as the riskier policy according to the variance criterion. Hence, the risky noise component is not detected: section 2 will establish that this risky component is in fact the martingale associated with the Doob decomposition of the cumulative reward process. In contrast, the conceptual tool introduced in this paper, the chaotic variation, when applied to the mean-variance case, leads to a risk penalty term of 0 for $\pi_{1}$ and proportional to $\sigma^{2}$ for $\pi_{2}$ (cf. supplementary material), i.e. exactly the desired outcome that the penalty term is proportional to reward uncertainty only.

Another observation is the following: if $\sigma=0$, rewards are deterministic and the obvious optimal policy is to simply take the best reward in each state, i.e. $a_{t}=2$ if $s_{t}=1$ and $a_{t}=1$ if $s_{t}=2$. This optimal strategy generates a variance of $9T$ over $T$ timesteps, due to the switching between states. If one chooses the risk aversion coefficient in the mean-variance trade-off to be high enough, one will then choose not to execute that obvious optimal policy and rather execute policy $\pi_{2}$, as it generates a lower variance of $4T$ over $T$ timesteps. On the other hand, the chaotic variation, when applied to the mean-variance case, leads to a risk penalty term of 0 for all policies (since rewards are deterministic) and hence one will always choose the optimal policy, whatever the risk aversion coefficient is.
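To make the comparison concrete, the following minimal sketch (our own illustration, not part of the paper's experiments) simulates the toy example with numpy and compares the empirical variance of the cumulative reward under $\pi_{1}$ and $\pi_{2}$ with the realized chaotic penalty $\frac{\beta}{2}\sum_{t}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}$, which is 0 for $\pi_{1}$ and proportional to $\sigma^{2}$ for $\pi_{2}$; the function name `simulate` and all numerical settings are illustrative assumptions.

```python
import numpy as np

def simulate(policy_action, T=200, sigma=0.16, n_paths=20000, rng=None):
    """Simulate the 2-state toy MDP; return cumulative rewards and realized chaotic penalties."""
    rng = np.random.default_rng(rng)
    states = rng.integers(1, 3, size=(n_paths, T))           # states are 1 or 2, equally likely
    h = rng.standard_normal((n_paths, T))                     # zero-mean, unit-variance noise
    # conditional mean reward R_bar(s, a) for the two constant policies
    mean_reward = {1: {1: 2.0, 2: 4.0}, 2: {1: 10.0, 2: 8.0}}
    a = policy_action
    r_bar = np.where(states == 1, mean_reward[1][a], mean_reward[2][a])
    noise = sigma * h if a == 2 else 0.0                      # only action 2 has stochastic rewards
    rewards = r_bar + noise
    cumulative = rewards.sum(axis=1)
    chaotic_penalty = ((rewards - r_bar) ** 2).sum(axis=1)    # realized sum of squared surprises
    return cumulative, chaotic_penalty

for a in (1, 2):
    cum, pen = simulate(a)
    print(f"pi_{a}: mean={cum.mean():.1f}, variance={cum.var():.1f}, "
          f"chaotic penalty (beta=1)={0.5 * pen.mean():.2f}")
# Expected orders of magnitude for T=200: Var(pi_1) ~ 16*T > Var(pi_2) ~ (4+sigma^2)*T,
# while the chaotic penalty is 0 for pi_1 and ~ (sigma^2/2)*T for pi_2.
```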


Figure 1: Rewards for the 2 policies in state 1 (timesteps 1 to 200) and state 2 (timesteps 200 to 400). State transitions are random, but for readability we have grouped together on the plot the timesteps corresponding to each state. The noise level is $\sigma=0.16$.

1.3 Related work

There exists an interesting and growing body of literature on the topic of risk-sensitive RL. Our work contrasts with the latter in that we focus on learning policies sensitive to the uncertain nature of the rewards, whereas previous works apply a risk criterion to the cumulative reward as a whole and hence do not distinguish between the different sources of randomness contained in it. Previously studied algorithms can be split into those that (i) are value-function based (cf. value iteration and policy iteration), and (ii) employ policy gradient methods that update the policy parameters in the direction of the gradient of the risk-flavored performance metric. In category (i), we cite the work of Mihatsch & Neuneier (2002), in which TD increments are transformed with a piecewise linear function, and Borkar (2002), Borkar (2005), Borkar (2010), which focus on the exponential utility function in an ergodic average-reward type setting. Works in category (ii) are more numerous, and can be split into those that use (Actor-Critic) and do not use (Monte Carlo) a specific representation of the value function, the former usually requiring TD-type updates of the value function. For large action/state spaces, one typically employs function approximation of the value function in the actor-critic setting, which introduces bias but improves variance. On the RL front, Prashanth & Fu (2018) provides a thorough literature review. Some references are interested in the full distribution of the cumulative reward, cf. Morimura et al. (2010a), Morimura et al. (2010b), Bellemare et al. (2017), Defourny et al. (2008), while others focus on specific risk measures, e.g. CVaR (Tamar, A., Glassner, Y., Manor, S. (2015), Chow et al. (2018)). Some authors define their own notion of risk with specific applications in mind, such as Geibel & Wysotzki (2005), for which risk refers to the likelihood of reaching so-called "error states". Tamar et al. (2015) provides a general methodology to handle all coherent risk measures.

The most-studied framework remains the variance-related one. Variance can be understood as the variance of the cumulative reward, or as the sum of variances of step-by-step rewards, cf. the work of Prashanth & Ghavamzadeh (2016), Tamar et al. (2016), Tamar et al. (2012) (in a stochastic shortest path context), Prashanth & Ghavamzadeh (2013). Sobel, MJ (1982) was among the first to study the mean-variance case, and explains why using variance as a measure of risk may cause problems, more specifically the inability to use policy iteration type algorithms due to the lack of monotonicity of the variance operator, while Mannor & Tsitsiklis (2011) show that these problems can be hard to solve. Tamar et al. (2016) provide a methodology to estimate the variance of the cumulative reward via a TD-type approach.

2 Markov Decision Processes: Doob decomposition and Martingale formulation

We study a Markov Decision Process (MDP) $(\mathbb{S},\mathbb{A},P,R,\gamma,H,\mathbb{H})$, where we allow the reward function $R$ to be general of the form $R_{t+1}:=R(s_{t},a_{t},s_{t+1},h_{t+1})$, with a hidden process $h_{t}\in\mathbb{H}$ as in Shen et al. (2014). The kernels $P$ and $H$ are defined in assumption 1. We use the convention that the reward $R_{t+1}$ is received at time $t+1$ after taking action $a_{t}\in\mathbb{A}$ in state $s_{t}\in\mathbb{S}$ at time $t$. We denote the cumulative reward to go and the conditional average reward by:

\mathcal{R}_{t}:=\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}R_{t^{\prime}+1},\qquad\overline{R}(s_{t},a_{t}):=\mathbb{E}[R_{t+1}|s_{t},a_{t}]

To keep notation consistent between the episodic and discounted reward settings we impose as usual $\gamma<1$, with $\gamma=1$ allowed only in the episodic setting (in the latter case the process stays in the same absorbing state forever once it is reached, with zero associated rewards). The analysis of this section is amenable to the average-reward formulation, which is detailed in the supplementary material. We make the following assumption throughout the paper; it is needed in particular in the proof of theorem 1.

Assumption 1.

$R(\cdot,\cdot,\cdot,\cdot)$ is uniformly bounded and the probability kernels $P$ and $H$ satisfy $\mathbb{P}[s_{t+1}\in B|s_{t},a_{t},h_{t}]=P(s_{t},a_{t},B)$ and $\mathbb{P}[h_{t+1}\in B|s_{t},a_{t},h_{t},s_{t+1}]=H(s_{t},a_{t},s_{t+1},B)$.

We need the following notation:

\mathcal{R}_{n,t}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}R_{t^{\prime}+1},\qquad\mathcal{F}_{n}:=\sigma(s_{t},a_{t},h_{t},\,t\leq n)

$\mathcal{F}_{n}$ is the sigma-algebra generated by the MDP up to and including time $n$, i.e. the information available up to time $n$.

Definition 1.

We recall that a discrete-time process $X$ is adapted to the filtration $\mathbb{F}:=(\mathcal{F}_{n})_{n\geq 0}$ if $X_{n}$ is $\mathcal{F}_{n}$-measurable for all $n$; it is predictable if $X_{n}$ is $\mathcal{F}_{n-1}$-measurable for all $n$ (i.e. it is known one timestep in advance); it is an $\mathbb{F}$-martingale if, for all $n$, $X_{n}$ is $\mathcal{F}_{n}$-measurable and $\mathbb{E}[X_{n+1}-X_{n}|\mathcal{F}_{n}]=0$.

The observation below is the cornerstone of this paper, as it formalizes precisely the discussion carried out in section 1 about the decomposition of the reward variability into two conceptually meaningful components. The martingale component captures the uncertain/stochastic part of the reward in section 1.2. The key idea is to apply the Doob decomposition of a stochastic process (cf. supplementary material) to the reward process in order to gain a better understanding of its variability.

Theorem 1.

(Doob decomposition of the reward) Let $\pi$ be a policy. $\mathcal{R}_{n,t}$ can be decomposed in a unique way $\mathbb{P}_{\pi}$-a.s. into the sum of i) a $\mathcal{F}_{t}$-measurable random variable, ii) a zero-mean predictable process $\mathcal{R}_{n,t}^{\pi,pred}$ and iii) a zero-mean martingale $\mathcal{R}_{n,t}^{chaos}$, with respect to the filtration $\mathcal{F}_{n}$. The decomposition is given by:

\mathcal{R}_{n,t}=\mathbb{E}_{\pi}[\mathcal{R}_{n,t}|s_{t}]+\mathcal{R}_{n,t}^{\pi,pred}+\mathcal{R}_{n,t}^{chaos}
\mathcal{R}_{n,t}^{\pi,pred}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}(\overline{R}(s_{t^{\prime}},a_{t^{\prime}})-\mathbb{E}_{\pi}[R_{t^{\prime}+1}|s_{t}])
\mathcal{R}_{n,t}^{chaos}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}(R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))

Denoting $\mathcal{R}_{t}^{\pi,pred}:=\mathcal{R}_{\infty,t}^{\pi,pred}$ and $\mathcal{R}_{t}^{chaos}:=\mathcal{R}_{\infty,t}^{chaos}$ we get, taking the limit as $n\to+\infty$:

\mathcal{R}_{t}=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t}]+\mathcal{R}_{t}^{\pi,pred}+\mathcal{R}_{t}^{chaos}
Definition 2.

We call $\mathcal{R}_{t}^{chaos}$ (resp. $\mathcal{R}_{t}^{\pi,pred}$) the chaotic (resp. predictable) reward process associated with $\mathcal{R}_{t}$.

The Doob decomposition of theorem 1 consists of decomposing rewards into:

  • the predictable component $\overline{R}(s_{t^{\prime}},a_{t^{\prime}})-\mathbb{E}_{\pi}[R_{t^{\prime}+1}|s_{t}]$, which accounts for deviations of the predictable projection of the uncertain rewards (i.e. the best guess of the unknown reward $R_{t^{\prime}+1}$ given information at time $t^{\prime}$) from the overall average reward. This process possesses some variance due to the switching between states, but none due to reward uncertainty.

  • the chaotic component $R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}})$, which exactly captures the uncertainty of the reward by computing the difference between its actual realization and what we expected it to be based on the information available at time $t^{\prime}$. In other words, it captures the "surprise" part of the reward. $\mathcal{R}_{t}^{chaos}=0$ if and only if $R$ is a deterministic reward, in which case we have $R_{t+1}=\overline{R}(s_{t},a_{t})$ with probability 1.
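As an illustration, the sketch below (our own simplified example, not code from the paper) computes the three terms of theorem 1 from time $t=0$ for one sampled trajectory, assuming the conditional mean rewards $\overline{R}(s_{k},a_{k})$ and the initial-state predictions $\mathbb{E}_{\pi}[R_{k+1}|s_{0}]$ are supplied as inputs (in practice these would come from the model or from Monte Carlo estimates).

```python
import numpy as np

def doob_decomposition(rewards, r_bar, r_pred_s0, gamma=1.0):
    """
    Decompose the (discounted) reward-to-go from time t=0 of one trajectory, as in theorem 1.
      rewards[k]   = R_{k+1}, the reward received after step k
      r_bar[k]     = R_bar(s_k, a_k), the conditional mean reward given (s_k, a_k)
      r_pred_s0[k] = E_pi[R_{k+1} | s_0], the reward predicted from the initial state only
    Returns (mean_term, predictable_part, chaotic_part); their sum equals the realized return.
    """
    rewards, r_bar, r_pred_s0 = map(np.asarray, (rewards, r_bar, r_pred_s0))
    discount = gamma ** np.arange(len(rewards))
    chaotic = np.sum(discount * (rewards - r_bar))        # martingale ("surprise") component
    predictable = np.sum(discount * (r_bar - r_pred_s0))  # predictable, zero-mean component
    mean_term = np.sum(discount * r_pred_s0)              # E_pi[R_0 | s_0]
    assert np.isclose(mean_term + predictable + chaotic, np.sum(discount * rewards))
    return mean_term, predictable, chaotic
```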

3 Chaotic Variation of the reward process

We define below a new conceptual tool, the chaotic variation associated with a reward process (and with a conditional risk measure $\rho$), which addresses the issues raised in section 1.2 and captures reward uncertainty risk. The concept of risk that could be intuited from section 1.2 emerges rigorously as a martingale. We refer to Detlefsen & Scandolo (2005) (cf. supplementary material) for the definition of a conditional risk measure associated with a sigma-algebra $\mathcal{G}$, and we use the notation $\rho^{\pi}_{s_{t}}$ to denote the conditional risk measure associated with the sigma-algebra $\sigma(s_{t})$.

Definition 3.

(Chaotic variation) Let $\pi$ be a policy and $\bm{\rho^{\pi}}:=(\rho^{\pi}_{s_{t}})_{t\geq 0}$ a family of conditional risk measures associated with the process $(s_{t})_{t\geq 0}$. The chaotic variation associated with $(\mathcal{R}_{t},\bm{\rho^{\pi}})$ is defined as $\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{t}](s_{t}):=\rho^{\pi}_{s_{t}}(\mathcal{R}_{t}^{chaos})$. That is, the chaotic variation quantifies the risk related to the chaotic reward process.

In the case of the entropic risk measure, the predictable quadratic variation of the chaotic reward process emerges naturally as a measure of risk, as a consequence of martingale theory, as proposition 1 shows.

Proposition 1.

(chaotic variation in the entropic case). Let $\pi$ be a policy. Using definition 3, for $\beta>0$, let $\rho^{\beta,\pi}_{s_{t}}(X):=\beta^{-1}\ln\mathbb{E}_{\pi}[e^{-\beta X}|s_{t}]$ be the so-called (conditional) entropic risk measure, obtained as the certainty equivalent $CE_{U,\pi,s_{t}}(X):=U^{-1}(\mathbb{E}_{\pi}[U(X)|s_{t}])$ of the exponential utility function $U(x):=-e^{-\beta x}$. Then, the chaotic variation satisfies $\mathcal{C}_{\bm{\rho^{\beta,\pi}}}[\mathcal{R}_{t}](s_{t})\leq\beta^{-1}\ln\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}$, where $\left<\mathcal{R}_{t}^{chaos}\right>$ is the predictable quadratic variation of the martingale $\mathcal{R}_{t}^{chaos}$, given by:

\left<\mathcal{R}_{t}^{chaos}\right>=\sum_{t^{\prime}=t}^{\infty}\gamma^{2(t^{\prime}-t)}\mathbb{E}[(R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))^{2}|s_{t^{\prime}},a_{t^{\prime}}]

As $\beta\to 0$ we get:

\mathcal{C}_{\rho^{\beta,\pi}}[\mathcal{R}_{t}](s_{t})=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]+o(\beta)
Definition 4.

(Chaotic variance) The chaotic variance associated with $(\mathcal{R}_{t},\pi)$ is defined as $V^{\mathbb{V}(\beta)}_{\pi}(s_{t}):=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]$. By the martingale property of $\mathcal{R}_{t}^{chaos}$, the chaotic variance is equal to the variance of $\mathcal{R}_{t}^{chaos}$ conditional on $s_{t}$, scaled by $\frac{\beta}{2}$.
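Since $\mathbb{E}[(\mathcal{R}_{t}^{chaos})^{2}]=\mathbb{E}[\left<\mathcal{R}_{t}^{chaos}\right>]$, the chaotic variance can be estimated by Monte Carlo from rollouts given (an estimate of) $\overline{R}$. Below is a minimal sketch under the assumption that trajectories are available as lists of $(s,a,r)$ transitions starting from the same $s_{0}$; names and data layout are our own illustrative choices.

```python
import numpy as np

def chaotic_variance(trajectories, r_bar, beta, gamma=1.0):
    """
    Monte Carlo estimate of V^{V(beta)}(s_0) = (beta/2) E_pi[ <R_0^chaos> | s_0 ].
    trajectories: list of lists of (s, a, r) transitions, all starting from the same s_0.
    r_bar: callable (s, a) -> estimate of the conditional mean reward R_bar(s, a).
    """
    estimates = []
    for traj in trajectories:
        qv = 0.0  # realized sum of squared chaotic increments, gamma^{2t}-discounted
        for t, (s, a, r) in enumerate(traj):
            qv += gamma ** (2 * t) * (r - r_bar(s, a)) ** 2
        estimates.append(qv)
    return 0.5 * beta * float(np.mean(estimates))
```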

4 Reinforcement Learning: Risk-Sensitive Chaotic algorithms

In this section we assume for simplicity that the state and action spaces are finite. Given the conceptual tools introduced in section 3, and given a fixed initial state $s_{0}$ (the generalization to distributions over initial states is straightforward), we are interested in solving so-called chaotic problems of the form:

\max_{\pi}\mathbb{E}_{\pi}[\mathcal{R}_{0}|s_{0}]-\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{0}](s_{0})\qquad(1)

The extension to problems of the form "$\max_{\pi}\mathbb{E}_{\pi}[\mathcal{R}_{0}|s_{0}]$ subject to $\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{0}](s_{0})\leq\lambda$" is not discussed here, but can be handled using techniques similar to those developed in Prashanth & Fu (2018), Tamar et al. (2012) or Prashanth & Ghavamzadeh (2016).

4.1 Bellman equations: Chaotic Mean-Variance case

We present below the Bellman equation in the chaotic mean-variance case $\mathcal{C}_{\bm{\rho^{\pi}}}[\mathcal{R}_{0}](s_{0})=V^{\mathbb{V}(\beta)}_{\pi}(s_{0})$, cf. definition 4. It will be used to formulate a chaotic mean-variance version of Q-Learning in the episodic case (section 4.2), as well as to study actor-critic algorithms and the average reward case (cf. supplementary material). The proof of theorem 2 follows directly from the definition of $\left<\mathcal{R}_{t}^{chaos}\right>$ in proposition 1.

Theorem 2.

(Bellman equation) Let $Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$ be the Q-function associated with $V^{\mathbb{V}(\beta)}_{\pi}(s_{t})$ of definition 4. Then we have the following Bellman equation:

Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+\gamma^{2}V^{\mathbb{V}(\beta)}_{\pi}(s_{t+1})|s_{t},a_{t}]

Consequently, if $R^{\beta}_{t+1}:=R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}$ and $Q_{\pi}(s_{t},a_{t}):=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t},a_{t}]$ with associated value function $V_{\pi}$, then $Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$ and its associated value function $V^{\beta}_{\pi}$ satisfy the following Bellman equation in the episodic case with $\gamma=1$:

Q^{\beta}_{\pi}(s_{t},a_{t})=\mathbb{E}[R^{\beta}_{t+1}+V^{\beta}_{\pi}(s_{t+1})|s_{t},a_{t}]

In particular, in the episodic case with $\gamma=1$, theorem 2 yields that the class of optimal policies associated with (1) coincides with that associated with the modified rewards $R^{\beta}$. In contrast to traditional TD-based methods such as Q-Learning, $R^{\beta}$ is not directly observable, as it involves the term $\overline{R}(s_{t},a_{t})$, which needs to be estimated over the course of the algorithm.

4.2 Chaotic Mean-Variance Q-Learning

In the episodic mean-variance case (with $\gamma=1$), theorem 2 allows us to derive a chaotic version of Q-learning, cf. algorithm 1, which is based on theorem 3. In the discounted reward setting, it is not possible to combine $Q_{\pi}$ and $Q^{\mathbb{V}(\beta)}_{\pi}$ (where $Q_{\pi}(s_{t},a_{t}):=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t},a_{t}]$), hence we cannot get a Q-Learning type algorithm. The convergence proof is discussed in the supplementary material and consists of a minor modification of the proof of Dayan & Watkins (1992), based on the fact that $\overline{R}_{t}(s,a)\to\overline{R}(s,a)$ as $t\to+\infty$ with probability 1 for every state-action pair $(s,a)$, where $\overline{R}_{t}(s,a)$ is defined in theorem 3. The average reward version of the algorithm is presented in the supplementary material.

Theorem 3.

(Chaotic Mean-Variance Q-Learning in the episodic case). Using theorem 2, denote $Q_{\pi}(s_{t},a_{t}):=\mathbb{E}_{\pi}[\mathcal{R}_{t}|s_{t},a_{t}]$ and $Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$. Let $(s_{t})$, $(a_{t})$ and $(R_{t+1})$ be the successive states, actions and rewards observed by the agent. Let $(\alpha_{t})$ be a sequence of learning rates satisfying the usual conditions for every state-action pair $(s,a)$:

\sum_{k=1}^{\infty}\alpha_{n_{k}(s,a)}=+\infty,\qquad\sum_{k=1}^{\infty}\alpha^{2}_{n_{k}(s,a)}<+\infty

where $n_{k}(s,a)$ is the index corresponding to the $k^{th}$ visit to $(s,a)$. Further define the following iterates if $s_{t}=s$ and $a_{t}=a$:

N_{t}(s,a)=N_{t-1}(s,a)+1
\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)+\frac{1}{N_{t}(s,a)}(R_{t+1}-\overline{R}_{t-1}(s,a))
Q^{\beta}_{t}(s,a)=(1-\alpha_{t})Q^{\beta}_{t-1}(s,a)+\alpha_{t}(R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}_{t}(s,a))^{2}+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime}))

and $Q^{\beta}_{t}(s,a)=Q^{\beta}_{t-1}(s,a)$, $N_{t}(s,a)=N_{t-1}(s,a)$, $\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)$ otherwise. Then $Q^{\beta}_{t}(s,a)\to Q^{\beta}_{*}(s,a)$ as $t\to+\infty$ with probability 1 for every state-action pair $(s,a)$, where $Q^{\beta}_{*}(s,a):=\sup_{\pi}Q^{\beta}_{\pi}(s,a)$.

Remark 1.

Note that in theorem 3, one could use a two-timescale stochastic approximation algorithm by using a specific timescale for the $\overline{R}_{t}$ process (instead of $\frac{1}{N_{t}(s,a)}$). Since the update rule of $\overline{R}_{t}$ does not depend on $Q^{\beta}_{t}(s,a)$ (uncoupled case), this is not required for convergence but could be used to improve the rate of convergence, cf. a remark in Konda & Tsitsiklis (2004), p. 4.

Algorithm 1 CMV-Q-Learning (episodic case)

Input: $Q^{\beta}$-table initialized arbitrarily, learning rate $(\alpha_{t})_{t\geq 0}$, $\overline{R}(s,a)$ and $N(s,a)$ initialized to 0.
Output: optimal policy $\pi_{*}(s)=\mbox{argmax}_{a}Q^{\beta}_{*}(s,a)$

1:for each episode do
2:     initialize $s_{t}=s_{0}$
3:     while $s_{t}$ is not terminal do
4:         Choose $a_{t}$ from $s_{t}$ using a policy derived from $Q^{\beta}$ (e.g. $\epsilon$-greedy).
5:         Take action $a_{t}$, observe $s_{t+1}$, $R_{t+1}$.
6:         $N(s_{t},a_{t})\leftarrow N(s_{t},a_{t})+1$; $\overline{R}(s_{t},a_{t})\leftarrow\overline{R}(s_{t},a_{t})+\frac{1}{N(s_{t},a_{t})}(R_{t+1}-\overline{R}(s_{t},a_{t}))$
7:         $Q^{\beta}(s_{t},a_{t})\leftarrow(1-\alpha_{t})Q^{\beta}(s_{t},a_{t})+\alpha_{t}(R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+\max_{a}Q^{\beta}(s_{t+1},a))$
8:         $s_{t}\leftarrow s_{t+1}$
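A minimal tabular Python sketch of algorithm 1 follows; it is our own illustration, assuming a standard episodic environment exposing `reset()` and `step(a)` that returns `(next_state, reward, done)`, with integer state and action indices (the names `env`, `n_states`, `n_actions` are hypothetical).

```python
import numpy as np

def cmv_q_learning(env, n_states, n_actions, beta, n_episodes=5000,
                   alpha=0.1, epsilon=0.1, seed=0):
    """Chaotic Mean-Variance Q-Learning (algorithm 1), episodic case with gamma = 1."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # Q^beta table
    R_bar = np.zeros((n_states, n_actions))   # running estimate of R_bar(s, a)
    N = np.zeros((n_states, n_actions))       # visit counts
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            N[s, a] += 1
            R_bar[s, a] += (r - R_bar[s, a]) / N[s, a]            # running mean of the reward
            target = r - 0.5 * beta * (r - R_bar[s, a]) ** 2      # modified reward R^beta
            target += 0.0 if done else np.max(Q[s_next])          # terminal states have value 0
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q, R_bar
```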

4.3 Monte Carlo Policy gradient algorithms - Episodic case

In this section we consider episodic Monte Carlo based algorithms that start with a parametric form $\pi_{\theta}$ for the policy and optimize equation (1) in the direction of the gradient with respect to $\theta$, hence aiming for local optima only. We make the following classical assumption in this section (Bhatnagar et al. (2009)):

Assumption 2.

For every policy $\pi_{\theta}$, the Markov chain induced by $P$ and $\pi_{\theta}$ is ergodic, i.e. irreducible, aperiodic and positive recurrent. Further, for every state-action pair $(s,a)$, $\pi_{\theta}(s,a)$ is continuously differentiable in $\theta$.

From equation (1), we need to compute unbiased estimates of $\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]$ and $\nabla_{\theta}\mathcal{C}_{\bm{\rho^{\pi_{\theta}}}}[\mathcal{R}_{0}](s_{0})$. The former is the classical expected reward gradient. By definition 3, $\mathcal{C}_{\bm{\rho^{\pi_{\theta}}}}[\mathcal{R}_{0}](s_{0})$ consists in applying a risk measure to a mean-zero martingale, which is usually simpler, as we will see below in the mean-variance case $\mathcal{C}_{\bm{\rho^{\pi_{\theta}}}}[\mathcal{R}_{0}](s_{0})=V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})$ (cf. definition 4); more importantly, the work done in the literature for specific risk measures can be applied straightforwardly, e.g. Chow et al. (2018) or Tamar, A., Glassner, Y., Manor, S. (2015) in the case of CVaR (however, the risk measure in our case is applied to the chaotic reward process $\mathcal{R}_{t}^{chaos}$ only). From now on we focus on the mean-variance case of definition 4, but we present in the supplementary material extensions to chaotic CVaR and Sharpe ratio.

Provided we know $\overline{R}(s,a)$, the gradient $\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})$ presents no specific difficulty and can be computed using the classical likelihood ratio technique by generating $B\geq 1$ episodes $s_{0}^{(b)}$, $a_{0}^{(b)}$, $R_{1}^{(b)}$, …, $s_{T_{b}-1}^{(b)}$, $a_{T_{b}-1}^{(b)}$, $R_{T_{b}}^{(b)}$, $b=1..B$, following $\pi_{\theta}$ and computing the unbiased Monte Carlo estimate:

\nabla_{\theta}V_{\pi_{\theta}}^{\mathbb{V}(\beta)}(s_{0})=\frac{\beta}{2}\frac{1}{B}\sum_{b=1}^{B}\sum_{t^{\prime}=0}^{T_{b}-1}\nabla\ln\pi_{\theta}(a^{(b)}_{t^{\prime}}|s^{(b)}_{t^{\prime}})V_{b,t^{\prime}}
V_{b,t^{\prime}}:=\sum_{t=t^{\prime}}^{T_{b}-1}\gamma^{2(t-t^{\prime})}(R^{(b)}_{t+1}-\overline{R}(s^{(b)}_{t},a^{(b)}_{t}))^{2}

The subtlety here is that we need to learn $\overline{R}(s,a)$ in the course of the algorithm. In the tabular case, using assumption 2, we are guaranteed that every state-action pair $(s,a)$ will be visited infinitely often, and we can proceed as in the Q-Learning case (theorem 3) by updating the (conditional) average estimate $\widehat{R}(s,a)$ as iterations on $\theta$ are performed. By the strong law of large numbers, convergence of $\widehat{R}(s,a)$ to $\overline{R}(s,a)$ occurs with probability 1. Remark 1 still holds here: since the iteration on the sample average $\widehat{R}(s,a)$ is uncoupled from the one on $\theta$, we do not need a two-timescale algorithm to guarantee convergence. For each $b=1..B$ and $t=0..T_{b}$, if $s=s_{t}^{(b)}$ and $a=a_{t}^{(b)}$ we perform the updates:

N(s,a)=N(s,a)+1\qquad(2)
\widehat{R}(s,a)=\widehat{R}(s,a)+\frac{1}{N(s,a)}(R^{(b)}_{t+1}-\widehat{R}(s,a))
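A minimal sketch of the tabular running-average update (2), using a dictionary keyed by state-action pairs (our own illustration; the class name is hypothetical):

```python
from collections import defaultdict

class TabularMeanReward:
    """Running estimate R_hat(s, a) of the conditional mean reward, as in eq. (2)."""
    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, s, a, r):
        self.count[(s, a)] += 1
        self.mean[(s, a)] += (r - self.mean[(s, a)]) / self.count[(s, a)]

    def __call__(self, s, a):
        return self.mean[(s, a)]
```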

Alternatively, if the state or action spaces are large, one may want to use a parametric approximation $\widehat{R}_{\psi}(s,a)$ of $\overline{R}(s,a)$ that is updated in the course of the algorithm: this approach is presented in algorithm 2, where we use an experience replay table and fit $\widehat{R}_{\psi}$ using SGD. Proposition 2 quantifies the gradient bias due to the use of this approximation (proof in the supplementary material). We present in algorithm 3 the Chaotic Mean-Variance version of REINFORCE, which uses either equation (2) or algorithm 2.

Proposition 2.

Let $\epsilon_{\psi}(s,a):=\widehat{R}_{\psi}(s,a)-\overline{R}(s,a)$ and $\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}$ the gradient obtained using the approximation $\widehat{R}_{\psi}$. The gradient bias $B_{\psi}^{\theta}(s_{0}):=\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(s_{0})-\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})$ satisfies:

B_{\psi}^{\theta}(s_{0})=\frac{\beta}{2}\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)b^{\psi}_{\theta}(S,A)]
b^{\psi}_{\theta}(s,a):=\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{\infty}\gamma^{2t}\epsilon_{\psi}^{2}(s_{t},a_{t})|s_{0}=s,a_{0}=a]

where $d^{\theta}_{\gamma^{2}}(s_{0},s,a):=\pi_{\theta}(a|s)\sum_{t=0}^{\infty}\gamma^{2t}\mathbb{P}_{\pi_{\theta}}[s_{t}=s|s_{0}]$ is the action-state $\gamma^{2}$-discounted visiting distribution.

Algorithm 2 Distributional Update

Input: initial distributional parameter $\psi$, experience replay table $\mathcal{E}$, number of distributional gradient steps $N_{\psi}$, number of SGD samples $M_{\psi}$.
Output: Approximation of the optimal distributional parameter $\psi^{*}$

1:Draw $M_{\psi}$ samples $(s_{j},a_{j},R_{j})$ randomly from $\mathcal{E}$.
2:Set $\widetilde{M}_{\psi}$ to be the number of unique pairs $(\widetilde{s}_{j},\widetilde{a}_{j})$ and for each such pair, set $\widetilde{R}_{j}$ to be the average of the corresponding $R_{j}$.
3:Perform $N_{\psi}$ steps of SGD on the loss $(\widehat{R}_{\psi}(\widetilde{s}_{j},\widetilde{a}_{j})-\widetilde{R}_{j})^{2}$ using the $\widetilde{M}_{\psi}$ samples.
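As an illustration of algorithm 2, the sketch below fits a linear-in-features approximation $\widehat{R}_{\psi}(s,a)=\psi^{\top}\phi(s,a)$ by plain SGD; the feature map `phi`, the replay-buffer layout (a list of `(s, a, r)` tuples) and all hyperparameters are assumptions of this illustration, not the paper's implementation.

```python
import random
import numpy as np

def distributional_update(psi, replay, phi, n_steps=50, n_samples=256, lr=1e-2):
    """One call of algorithm 2: fit R_hat_psi(s, a) = psi . phi(s, a) by SGD on squared loss."""
    batch = random.sample(replay, min(n_samples, len(replay)))      # draw M_psi samples
    averaged = {}                                                    # average R over unique (s, a)
    for s, a, r in batch:
        cnt, mean = averaged.get((s, a), (0, 0.0))
        averaged[(s, a)] = (cnt + 1, mean + (r - mean) / (cnt + 1))
    pairs = [(phi(s, a), mean) for (s, a), (_, mean) in averaged.items()]
    for _ in range(n_steps):                                         # N_psi SGD steps
        x, target = random.choice(pairs)
        err = float(np.dot(psi, x)) - target
        psi = psi - lr * 2.0 * err * x                               # gradient of (psi.x - target)^2
    return psi
```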
Algorithm 3 CMV-REINFORCE (episodic case)

Input: Initial policy parameter $\theta_{0}$, learning rate $(\alpha_{n})$, number of episodes per batch $B$, $\widehat{R}_{\psi}$ initialized to 0, optional: experience replay table $\mathcal{E}$.
Output: Approximation of the optimal policy parameter $\theta^{*}$

1:while $\theta_{n}$ not converged do
2:     Generate $B$ episodes $s_{0}^{(b)}$, $a_{0}^{(b)}$, $R_{1}^{(b)}$, …, $s_{T_{b}-1}^{(b)}$, $a_{T_{b}-1}^{(b)}$, $R_{T_{b}}^{(b)}$, $b=1..B$, following $\pi_{\theta_{n}}(\cdot|\cdot)$
3:     In the tabular case: use the episodes to update $\widehat{R}_{\psi}$ with eq. (2); else increment $\mathcal{E}$ with the tuples $(s_{t}^{(b)},a_{t}^{(b)},R_{t}^{(b)})$ and update $\widehat{R}_{\psi}$ using algorithm 2.
4:     for $b=1$ to $B$ do
5:         $v_{t,b}:=\sum_{t^{\prime}=t}^{T_{b}-1}[\gamma^{t^{\prime}-t}R^{(b)}_{t^{\prime}+1}-\frac{\beta}{2}\gamma^{2(t^{\prime}-t)}(R^{(b)}_{t^{\prime}+1}-\widehat{R}_{\psi}(s^{(b)}_{t^{\prime}},a^{(b)}_{t^{\prime}}))^{2}]$
6:         $V_{b}\leftarrow\sum_{t=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t}|s^{(b)}_{t})v_{t,b}$
7:     $\theta_{n+1}\leftarrow\theta_{n}+\alpha_{n}B^{-1}\sum_{b=1}^{B}V_{b}$
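For a softmax policy with tabular parameters $\theta(s,a)$, one gradient step of algorithm 3 can be sketched as follows; this is our own illustration, where the episode format, the learning rate and the `r_hat` estimator (e.g. the tabular class above or $\widehat{R}_{\psi}$) are assumptions.

```python
import numpy as np

def softmax_probs(theta, s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def cmv_reinforce_step(theta, episodes, r_hat, beta, gamma=1.0, lr=0.05):
    """One CMV-REINFORCE update (algorithm 3) from a batch of episodes [(s, a, r), ...]."""
    grad = np.zeros_like(theta)
    for episode in episodes:
        T = len(episode)
        for t, (s, a, _) in enumerate(episode):
            # v_{t,b}: reward-to-go minus the chaotic mean-variance penalty (step 5)
            v = 0.0
            for k in range(t, T):
                s_k, a_k, r_k = episode[k]
                v += gamma ** (k - t) * r_k
                v -= 0.5 * beta * gamma ** (2 * (k - t)) * (r_k - r_hat(s_k, a_k)) ** 2
            # grad log pi_theta(a|s) for a softmax policy over theta[s]: e_a - pi(.|s)
            glog = -softmax_probs(theta, s)
            glog[a] += 1.0
            grad[s] += glog * v
    return theta + lr * grad / len(episodes)
```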

4.4 Policy Gradient Actor-Critic algorithms

In order to derive Actor-Critic algorithms, the Bellman equations in theorem 2 can be used, as in the CMV-Q-Learning algorithm 1, to adapt the work of e.g. Prashanth & Ghavamzadeh (2016) (variance case) or Chow et al. (2018) (CVaR case) and derive chaotic versions of their actor-critic algorithms. This extension is discussed in the supplementary material.

5 Experiments: Chaotic Mean-Variance

5.1 Grid World

We consider the episodic problem of a robot on a grid aiming at a goal. The state space consists of the 16 grid squares, and the action space consists in choosing to go East, West, North or South. Reaching the goal (resp. taking a step) gives a +1 (resp. -1) reward, and negative rewards are positioned on the grid as seen in figure 2. When an action $a_{t}$ is chosen, there is a probability $p_{error}=50\%$ that the robot moves in a random direction. If the robot hits the boundary of the grid, it stays where it is. We train policies using the CMV-Q-Learning algorithm 1. In figure 2 we plot the path heatmap over $10^{5}$ rollout steps performed with the learned policy for various risk aversion coefficients $\beta$ (cf. details and additional experiments in the supplementary material). The higher $\beta$, the further away from the -20 reward the robot goes, as expected, since it prefers taking the -6 loss rather than walking next to the -20 reward and risking incurring the corresponding loss.

The second element that our algorithm provides is a risk heatmap, highlighting the most uncertain states, i.e. the states which yield the highest reward stochasticity/uncertainty (figure 2). We perform rollouts with the learned $\overline{R}(\cdot,\cdot)$ and policy $\pi^{*}_{\beta}$ and compute $\mathbb{E}_{\pi^{*}_{\beta}}[(R_{t+1}-\overline{R}(s,\pi^{*}_{\beta}(s)))^{2}|s_{t}=s]$ for every state $s$. In figure 2 we see that, as expected, the heatmap highlights the states next to the -20 reward, and to a lesser intensity the states around the -6 rewards. To the best of our knowledge, only two other works study risk-sensitive Q-Learning algorithms: Mihatsch & Neuneier (2002) and Shen et al. (2014). Both transform the TD increments by a non-linear function in order to obtain risk-sensitive behaviors. The former show that their algorithm converges to the worst-case optimality criterion as the risk parameter changes, hence in our specific example we could obtain paths similar to those in figure 2; but our algorithm, in addition to being the only one to focus on reward stochasticity only, provides the risk heatmap discussed above, which quantifies the extent to which a given state yields uncertain rewards.
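The risk-heatmap values can be estimated from rollout transitions by averaging squared chaotic increments per state; a minimal sketch (our own illustration, assuming transitions collected as `(s, a, r)` tuples under $\pi^{*}_{\beta}$ and a learned `r_bar` table indexed by state and action):

```python
from collections import defaultdict

def risk_heatmap(transitions, r_bar):
    """Estimate E[(R_{t+1} - R_bar(s, pi(s)))^2 | s_t = s] for every visited state s."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, a, r in transitions:
        sums[s] += (r - r_bar[s, a]) ** 2
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}
```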


Figure 2: Path and risk heatmaps over $10^{5}$ rollout steps with the learned policy, $p_{error}=50\%$ (cf. description in text).

5.2 Portfolio optimization

We consider the problem of investing in 2 financial assets. One is risk-free in the sense that investing a quantity $q^{RF}_{t}$ in the asset yields a deterministic reward $R^{RF}_{t+1}:=q^{RF}_{t}\mu(s_{t})$, where $\mu$ is the risk-free rate. The latter can be seen as the overnight rate from day $t$ to day $t+1$, which is known at the end of day $t$ when $q^{RF}_{t}$ is chosen. The other asset is risky (e.g. a stock) and yields, for a quantity $q^{R}_{t}$, an uncertain reward $R^{R}_{t+1}:=q^{R}_{t}(\mu(s_{t})+\sigma(s_{t})h_{t+1})$, where $(h_{t})$ are i.i.d. standard normal and $\sigma$ is referred to as the volatility. We restrict $q^{RF}_{t},q^{R}_{t}$ to be nonnegative integers whose sum is at most a total investment constraint $q_{max}$. The action is defined as $a_{t}:=(q^{RF}_{t},q^{R}_{t})$ and the reward as $R_{t+1}:=R^{RF}_{t+1}+R^{R}_{t+1}$. In our experiments we take $q_{max}=5$, yielding $21$ possible actions. The state space consists of 3 states, LowVol, HighVol and MediumVol, defined as low volatility $\sigma$ (and low risk-free rate $\mu$), high volatility (and high risk-free rate $\mu$), and an intermediate state, respectively. The state transition matrix is designed such that the more we trade in the risky asset (i.e. the higher $q^{R}_{t}$), the more likely we are to reach a higher volatility state. We train the policy using the CMV-REINFORCE algorithm 3 and its baseline (classical mean-variance), and use it to compute performance metrics over rollout episodes of $T$ timesteps. Details on the training and the numerical values used are presented in the supplementary material.
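A minimal sketch of the portfolio MDP's reward and transition logic as described above; the specific rates, volatilities and transition probabilities below are hypothetical placeholders (the paper's values are in the supplementary material).

```python
import numpy as np

# Hypothetical per-state risk-free rate mu and volatility sigma (LowVol, MediumVol, HighVol).
MU = np.array([0.01, 0.02, 0.04])
SIGMA = np.array([0.05, 0.15, 0.40])
Q_MAX = 5
ACTIONS = [(qf, qr) for qf in range(Q_MAX + 1) for qr in range(Q_MAX + 1 - qf)]  # 21 actions

def step(state, action, rng):
    """One transition: reward = risk-free part + risky part; more risky trading -> higher-vol state."""
    q_rf, q_r = action
    h = rng.standard_normal()
    reward = q_rf * MU[state] + q_r * (MU[state] + SIGMA[state] * h)
    # Illustrative transition: probability of moving up in volatility grows with q_r.
    p_up = 0.1 + 0.1 * q_r
    u = rng.random()
    if u < p_up:
        next_state = min(state + 1, 2)
    elif u < p_up + 0.2:
        next_state = max(state - 1, 0)
    else:
        next_state = state
    return next_state, reward
```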


Figure 3: (Top) Fraction of time spent per state. (Mid/Bottom) CMV and baseline (mean-variance) asset investment fraction per state as a function of $\beta$ - $T=20$ timesteps, $s_{0}=\mbox{LowVol}$.

We display in figure 3 the fraction of time the process spends in each state, as well as the investment fraction in each asset (over rollout episodes). With $\beta=0$ and by design of the state transition matrix, both assets give the same expected reward $\mu$, but the policy is incentivized to trade in the risky asset rather than in the risk-free asset in order to reach the HighVol state, which gives the highest $\mu$. As $\beta$ increases, in the CMV case, the policy shifts towards trajectories which contain less reward uncertainty, which means investing in the risk-free asset, whose associated rewards are deterministic. The classical mean-variance case (baseline) penalizes not only the variability related to reward uncertainty but also the variability related to the switching of states appearing in both assets via the deterministic term $\mu(s_{t})$; hence, as $\beta$ increases, the policy is incentivized to stop investing (cf. the green line in figure 3 representing the uninvested budget $q_{max}-q^{RF}_{t}-q^{R}_{t}$), i.e. it stops taking advantage of the risk-free asset. The latter can also be seen in figure 4, where we plot for rollout episodes the cumulative reward mean and standard deviation as a function of $\beta$. In the mean-variance case the reward mean eventually vanishes as the policy stops trading in both assets: by trying to make the standard deviation lower, it generates the counterintuitive behavior of no longer taking advantage of the risk-free asset.

6 Conclusion

We presented a novel, conceptually meaningful decomposition of the cumulative reward process based on the Doob decomposition that distinguishes between the different sources of randomness contained within it, introduced a new conceptual tool, the chaotic variation, that exactly captures reward uncertainty risk, and incorporated it into model-free value-function based and policy gradient algorithms. Potential real-world applications include all settings where one is subject to uncertain/stochastic rewards and is interested in deriving interpretable risk-sensitive policies, for example the recently studied RL financial market-making problems of Guéant & Manziuk (2019) and Ganesh et al. (2019), where reward uncertainty plays a major role: market-makers stream prices but do not know whether clients will decide to trade at those prices, and they are typically averse to uncertain fluctuations in the underlying financial asset price. Future work could include extending the framework to the case of delayed rewards.


Figure 4: Cumulative reward mean and standard deviation for CMV and baseline (mean-variance), as a function of $\beta$ - (Top) $T=1$, $s_{0}=\mbox{Random}$ and (Bottom) $T=20$ timesteps, $s_{0}=\mbox{LowVol}$.

Disclaimer

This paper was prepared for information purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co and its affiliates (“JP Morgan”), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful. © 2020 JPMorgan Chase & Co. All rights reserved.

References

  • Bellemare et al. , (2017) Bellemare, Marc G, Dabney, Will, & Munos, Rémi. 2017. A Distributional Perspective on Reinforcement Learning. In: ICML.
  • Bhatnagar et al. , (2009) Bhatnagar, Shalabh, Sutton, Richard S, Ghavamzadeh, Mohammad, & Lee, Mark. 2009. Natural actor–critic algorithms. Automatica, 45(11), 2471–2482.
  • Borkar, (2002) Borkar, Vivek S. 2002. Q-learning for Risk-Sensitive control. Mathematics of Operations Research, 27(1), 294–31.
  • Borkar, (2005) Borkar, Vivek S. 2005. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3), 207–213.
  • Borkar, (2010) Borkar, Vivek S. 2010. Learning Algorithms for Risk-Sensitive Control. Proceedings of the Nineteenth International Symposium on Mathematical Theory of Networks and Systems, 1327–1332.
  • Chow et al. , (2018) Chow, Yinlam, Ghavamzadeh, Mohammad, Janson, Lucas, & Pavone, Marco. 2018. Risk-Constrained Reinforcement Learning with Percentile Risk Criteria. J. Mach. Learn. Res., 18(167), 1–51.
  • Dayan & Watkins, (1992) Dayan, Peter, & Watkins, Christopher. 1992. Technical Note Q-Learning. Machine Learning, 8, 279–292.
  • Defourny et al. , (2008) Defourny, Boris, Ernst, Damien, & Wehenkel, Louis. 2008. Risk-Aware Decision Making and Dynamic Programming. In: NIPS workshop.
  • Detlefsen & Scandolo, (2005) Detlefsen, Kai, & Scandolo, Giacomo. 2005. Conditional and dynamic convex risk measures. Finance Stoch., 9(4), 539–561.
  • Ganesh et al. , (2019) Ganesh, S., Vadori, N., Xu, M., Zheng, H., Reddy, P., & Veloso, M. 2019. Reinforcement Learning for Market Making in a Multi-agent Dealer Market. In: NeurIPS Workshop on Robust AI in Financial Services.
  • Geibel & Wysotzki, (2005) Geibel, Peter, & Wysotzki, Fritz. 2005. Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res., 24, 81–108.
  • Guéant & Manziuk, (2019) Guéant, Olivier, & Manziuk, Iuliia. 2019. Deep reinforcement learning for market making in corporate bonds: beating the curse of dimensionality. arXiv.
  • Konda & Tsitsiklis, (2004) Konda, Vijay R., & Tsitsiklis, John. 2004. Convergence rate of linear two-timescale stochastic approximation. The Annals of Applied Probability, 14(2), 796–819.
  • Mahadevan, (1996) Mahadevan, Sridhar. 1996. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Mach. Learn., 22(1-3), 159–195.
  • Mannor & Tsitsiklis, (2011) Mannor, Shie, & Tsitsiklis, John. 2011. Mean-Variance Optimization in Markov Decision Processes. In: ICML.
  • Mihatsch & Neuneier, (2002) Mihatsch, Oliver, & Neuneier, Ralph. 2002. Risk-Sensitive Reinforcement Learning. Mach. Learn., 49(2), 267–290.
  • Morimura et al. , (2010a) Morimura, Tetsuro, Sugiyama, Masashi, Kashima, Hisashi, Hachiya, Hirotaka, & Tanaka, Toshiyuki. 2010a. Nonparametric Return Distribution Approximation for Reinforcement Learning. In: ICML.
  • Morimura et al. , (2010b) Morimura, Tetsuro, Sugiyama, Masashi, & Kashima, Hisashi. 2010b. Parametric Return Density Estimation for Reinforcement Learning. In: UAI.
  • Necchi, (2016) Necchi, Pierpaolo. 2016. Policy Gradient Algorithms for the Asset Allocation Problem. M.Phil. thesis, Politechnico di Milano.
  • Prashanth & Fu, (2018) Prashanth, L A, & Fu, Michael. 2018. Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint. arXiv, Oct.
  • Prashanth & Ghavamzadeh, (2016) Prashanth, L A, & Ghavamzadeh, Mohammad. 2016. Variance-Constrained Actor-Critic Algorithms for Discounted and Average Reward MDPs. Mach. Learn., 105(3), 367–417.
  • Prashanth & Ghavamzadeh, (2013) Prashanth, L.a., & Ghavamzadeh, Mohammad. 2013. Actor-Critic Algorithms for Risk-Sensitive MDPs. Advances in Neural Information Processing Systems 26, 252–260.
  • Shen et al. , (2014) Shen, Yun, Tobia, Michael J, Sommer, Tobias, & Obermayer, Klaus. 2014. Risk-sensitive Reinforcement Learning. Neural Computation, 26(7).
  • Sobel, MJ, (1982) Sobel, MJ. 1982. The Variance of discounted Markov Decision Processes. J. Appl. Probab.
  • Tamar et al. , (2012) Tamar, Aviv, Di Castro, Dotan, & Mannor, Shie. 2012. Policy Gradients with Variance Related Risk Criteria. In: ICML.
  • Tamar et al. , (2015) Tamar, Aviv, Chow, Yinla, Ghavamzadeh, Mohammad, & Mannor, Shie. 2015. Policy Gradient for Coherent Risk Measures. In: NIPS.
  • Tamar et al. , (2016) Tamar, Aviv, Castro, Dotan Di, & Mannor, Shie. 2016. Learning the Variance of the Reward-To-Go. Journal of Machine Learning Research, 17(13), 1–36.
  • Tamar, A., Glassner, Y., Manor, S., (2015) Tamar, A., Glassner, Y., Manor, S. 2015. Optimizing the CVaR via Sampling. In: AAAI.
  • Thomas, (2014) Thomas, Philip S. 2014. Bias in Natural Actor-Critic Algorithms. In: ICML.
  • Vadori & Swishchuk, (2015) Vadori, N., & Swishchuk, A. 2015. Strong Law of Large Numbers and Central Limit Theorems for Functionals of Inhomogeneous Semi-Markov Processes. Stochastic Analysis and Applications, 33(2), 213–243.

Appendix A Background

A.1 Central limit theorem for Markov chain functionals (section 1 - Introduction)

Assume for simplicity that the state space $\mathbb{S}$ is finite, and let $R_{\pi}(s_{t},s_{t+1}):=R(s_{t},\pi(s_{t}),s_{t+1})$ be the reward obtained at time $t+1$ associated with the deterministic policy $\pi$, so that $R_{\pi}$ is a function of $s_{t}$ and $s_{t+1}$ only, and $(s_{t})$ is a Markov chain satisfying $\mathbb{P}[s_{t+1}=s^{\prime}|s_{t}=s]=P(s,\pi(s),s^{\prime})=:P_{\pi}(s,s^{\prime})$. The central limit theorem for Markov chain functionals (Vadori & Swishchuk (2015), theorem 3.17) states that the limit in distribution as $n\to+\infty$ of:

\sqrt{n}\left(\frac{1}{n}\sum_{t=0}^{n}R_{\pi}(s_{t},s_{t+1})-\sum_{x,y\in\mathbb{S}}d_{\pi}(x)P_{\pi}(x,y)R_{\pi}(x,y)\right)

is normal with mean zero and variance $\sigma_{deter.}^{2}+\sigma_{chaotic}^{2}$, where $d_{\pi}$ is the stationary distribution of the Markov chain with transition kernel $P_{\pi}$. Equivalently, this limit is equal in distribution to the sum of two independent normal random variables with respective variances $\sigma_{deter.}^{2}$ and $\sigma_{chaotic}^{2}$, given by the following expressions:

\sigma_{chaotic}^{2}=\sum_{x\in\mathbb{S}}d_{\pi}(x)P_{\pi}^{var}R_{\pi}(x,\cdot)(x),\qquad\sigma_{deter.}^{2}=\sum_{x\in\mathbb{S}}d_{\pi}(x)P_{\pi}^{var}f_{Poisson}(x)

Denoting the conditional expected reward $\mu_{\pi}(x):=\mathbb{E}[R_{\pi}(s_{t},s_{t+1})|s_{t}=x]=\sum_{y\in\mathbb{S}}R_{\pi}(x,y)P_{\pi}(x,y)$, we see below that $\sigma_{deter.}^{2}$ only depends on the rewards via the deterministic term $\mu_{\pi}(x)$, whereas $\sigma_{chaotic}^{2}$ additionally depends on the term $\mathbb{E}[R_{\pi}(s_{t},s_{t+1})^{2}|s_{t}=x]$, which quantifies reward stochasticity/uncertainty. Indeed, the variance operators $P_{\pi}^{var}$ act as follows:

P_{\pi}^{var}R_{\pi}(x,\cdot)(x)=\sum_{y\in\mathbb{S}}R^{2}_{\pi}(x,y)P_{\pi}(x,y)-\mu_{\pi}(x)^{2}

and:

P_{\pi}^{var}f_{Poisson}(x)=\sum_{y\in\mathbb{S}}f^{2}_{Poisson}(y)P_{\pi}(x,y)-\Big(\sum_{y\in\mathbb{S}}f_{Poisson}(y)P_{\pi}(x,y)\Big)^{2}

where the function $f_{Poisson}$ is the solution of the Poisson equation:

\sum_{y\in\mathbb{S}}f_{Poisson}(y)P_{\pi}(x,y)-f_{Poisson}(x)=\sum_{x^{\prime}\in\mathbb{S}}d_{\pi}(x^{\prime})\mu_{\pi}(x^{\prime})-\mu_{\pi}(x)
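For a finite chain, $\sigma_{chaotic}^{2}$ and $\sigma_{deter.}^{2}$ can be computed numerically by solving the Poisson equation above in the least-squares sense; the sketch below is our own illustration under the assumptions of this appendix (finite state space, fixed policy, rewards $R(x,y)$ on transitions).

```python
import numpy as np

def clt_variances(P, R):
    """
    Compute sigma_chaotic^2 and sigma_deter^2 for a finite Markov chain with transition
    matrix P[x, y] and reward R[x, y] received on the transition x -> y (policy fixed).
    """
    n = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    d = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    d = d / d.sum()
    mu = (P * R).sum(axis=1)                      # conditional mean reward mu_pi(x)
    second = (P * R ** 2).sum(axis=1)             # conditional second moment of the reward
    sigma2_chaotic = float(d @ (second - mu ** 2))
    # Poisson equation: (P - I) f = mean(mu) - mu, solved in the least-squares sense
    # (f is only defined up to an additive constant, which does not affect the variance).
    rhs = float(d @ mu) - mu
    f, *_ = np.linalg.lstsq(P - np.eye(n), rhs, rcond=None)
    Pf, Pf2 = P @ f, P @ f ** 2
    sigma2_deter = float(d @ (Pf2 - Pf ** 2))
    return sigma2_chaotic, sigma2_deter
```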

A.2 Doob decomposition of a stochastic process

Let $X$ be a discrete-time process on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that (i) $\mathbb{E}[|X_{n}|]<\infty$ for all $n$, and (ii) $X$ is adapted to the filtration $\mathbb{F}:=(\mathcal{F}_{n})_{n\geq 0}$, i.e. $X_{n}$ is $\mathcal{F}_{n}$-measurable for all $n$. Then there exist an integrable martingale $M$ and an integrable predictable process $A$ such that $A_{0}=M_{0}=0$ and $X_{n}=X_{0}+A_{n}+M_{n}$ for all $n$. This decomposition is almost surely unique. Here, we recall that $A$ being predictable means that $A_{n}$ is $\mathcal{F}_{n-1}$-measurable for all $n$ (i.e. it is known one timestep in advance), and $M$ being a martingale means that $M_{n}$ is $\mathcal{F}_{n}$-measurable and $\mathbb{E}[M_{n+1}-M_{n}|\mathcal{F}_{n}]=0$. Further, $A$ and $M$ are given by:

A_{n}=\sum_{k=1}^{n}(\mathbb{E}[X_{k}|\mathcal{F}_{k-1}]-X_{k-1}),\qquad M_{n}=\sum_{k=1}^{n}(X_{k}-\mathbb{E}[X_{k}|\mathcal{F}_{k-1}])

A.3 Conditional risk measures

We recall here the definition of a conditional risk measure associated with a sigma-algebra $\mathcal{G}$, according to Detlefsen & Scandolo (2005). Let $L^{\infty}$, $L^{\infty}_{\mathcal{G}}$ be respectively the set of bounded random variables and the set of bounded, $\mathcal{G}$-measurable random variables. A conditional risk measure $\rho$ associated with a sigma-algebra $\mathcal{G}$ is a map $L^{\infty}\to L^{\infty}_{\mathcal{G}}$ such that:

  • (Normalization) $\rho(0)=0$.

  • (Conditional Translation Invariance) For any $X\in L^{\infty}$, $Z\in L^{\infty}_{\mathcal{G}}$ we have $\rho(X+Z)=\rho(X)-Z$.

  • (Monotonicity) For any $X,Y\in L^{\infty}$, if $X\leq Y$ with probability 1, then $\rho(X)\geq\rho(Y)$ with probability 1.

Appendix B Proofs

B.1 Proof of theorem 1

Define the process Yn=n+t,t𝔼π[n+t,t|st]Y_{n}=\mathcal{R}_{n+t,t}-\mathbb{E}_{\pi}[\mathcal{R}_{n+t,t}|s_{t}] for n1n\geq 1 and Y0:=0Y_{0}:=0, where we remind that:

n,t:=t=t(n1)tγttRt+1\mathcal{R}_{n,t}:=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{t^{\prime}-t}R_{t^{\prime}+1}

The process $Y_{n}$ is adapted to the filtration generated by the sigma-algebras $\mathcal{G}_{n}:=\mathcal{F}_{n+t}$, since by definition $\mathcal{F}_{n}:=\sigma(s_{k},a_{k},h_{k},k\leq n)$. Hence, we can apply the Doob decomposition (cf. section A.2) and obtain the following decomposition of the process $Y_{n}$, unique $\mathbb{P}_{\pi}$-a.s., for $n\geq 0$:

Yn=Y0+Ynpred+YnchaosY_{n}=Y_{0}+Y_{n}^{pred}+Y_{n}^{chaos}

where YnpredY_{n}^{pred}, YnchaosY_{n}^{chaos} are respectively a predictable process and a martingale with respect to the filtration (𝒢n)n0(\mathcal{G}_{n})_{n\geq 0}, satisfying Y0pred=Y0chaos=0Y_{0}^{pred}=Y_{0}^{chaos}=0, and are given by, for n0n\geq 0 (with the usual convention that k=10():=0\sum_{k=1}^{0}(\cdot):=0):

Ynpred=k=1n(𝔼π[Yk|𝒢k1]Yk1),Ynchaos=k=1n(Yk𝔼π[Yk|𝒢k1])Y_{n}^{pred}=\sum_{k=1}^{n}(\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}]-Y_{k-1}),\hskip 14.22636ptY_{n}^{chaos}=\sum_{k=1}^{n}(Y_{k}-\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}])

We have Y0=0Y_{0}=0 and if k2k\geq 2:

𝔼π[Yk|𝒢k1]Yk1=𝔼π[k+t,tk+t1,t|k+t1]𝔼π[k+t,tk+t1,t|st]\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}]-Y_{k-1}=\mathbb{E}_{\pi}[\mathcal{R}_{k+t,t}-\mathcal{R}_{k+t-1,t}|\mathcal{F}_{k+t-1}]-\mathbb{E}_{\pi}[\mathcal{R}_{k+t,t}-\mathcal{R}_{k+t-1,t}|s_{t}]
=γk1(𝔼π[Rk+t|k+t1]𝔼π[Rk+t|st])=\gamma^{k-1}(\mathbb{E}_{\pi}[R_{k+t}|\mathcal{F}_{k+t-1}]-\mathbb{E}_{\pi}[R_{k+t}|s_{t}])

If k=1k=1 we get:

𝔼π[Yk|𝒢k1]Yk1=𝔼π[Y1|𝒢0]=𝔼π[Rt+1|t]𝔼π[Rt+1|st]\mathbb{E}_{\pi}[Y_{k}|\mathcal{G}_{k-1}]-Y_{k-1}=\mathbb{E}_{\pi}[Y_{1}|\mathcal{G}_{0}]=\mathbb{E}_{\pi}[R_{t+1}|\mathcal{F}_{t}]-\mathbb{E}_{\pi}[R_{t+1}|s_{t}]

Overall, this gives for n1n\geq 1:

Ynpred=k=1nγk1(𝔼π[Rk+t|k+t1]𝔼π[Rk+t|st])=k=tn1+tγkt(𝔼π[Rk+1|k]𝔼π[Rk+1|st])Y_{n}^{pred}=\sum_{k=1}^{n}\gamma^{k-1}(\mathbb{E}_{\pi}[R_{k+t}|\mathcal{F}_{k+t-1}]-\mathbb{E}_{\pi}[R_{k+t}|s_{t}])=\sum_{k=t}^{n-1+t}\gamma^{k-t}(\mathbb{E}_{\pi}[R_{k+1}|\mathcal{F}_{k}]-\mathbb{E}_{\pi}[R_{k+1}|s_{t}])

Now, we claim that 𝔼π[Rk+1|k]=R¯(sk,ak)\mathbb{E}_{\pi}[R_{k+1}|\mathcal{F}_{k}]=\overline{R}(s_{k},a_{k}). Indeed, by assumption 1 and since by definition n:=σ(sk,ak,hk,kn)\mathcal{F}_{n}:=\sigma(s_{k},a_{k},h_{k},k\leq n), we get:

𝔼π[Rk+1|k]=𝔼π[Rk+1|sm,am,hm:mk]\mathbb{E}_{\pi}[R_{k+1}|\mathcal{F}_{k}]=\mathbb{E}_{\pi}[R_{k+1}|s_{m},a_{m},h_{m}:m\leq k]
=𝕊R(sk,ak,s,h)H(sk,ak,s,dh)P(sk,ak,ds)=𝔼[Rk+1|sk,ak]=R¯(sk,ak)=\int_{\mathbb{S}}\int_{\mathbb{H}}R(s_{k},a_{k},s^{\prime},h^{\prime})H(s_{k},a_{k},s^{\prime},dh^{\prime})P(s_{k},a_{k},ds^{\prime})=\mathbb{E}[R_{k+1}|s_{k},a_{k}]=\overline{R}(s_{k},a_{k})

This yields for n1n\geq 1:

Ynpred=k=tn1+tγkt(R¯(sk,ak)𝔼π[Rk+1|st])=n+t,tπ,predY_{n}^{pred}=\sum_{k=t}^{n-1+t}\gamma^{k-t}(\overline{R}(s_{k},a_{k})-\mathbb{E}_{\pi}[R_{k+1}|s_{t}])=\mathcal{R}_{n+t,t}^{\pi,pred}

Similarly for n1n\geq 1:

Ynchaos=k=tn1+tγkt(Rk+1R¯(sk,ak))=n+t,tchaosY_{n}^{chaos}=\sum_{k=t}^{n-1+t}\gamma^{k-t}(R_{k+1}-\overline{R}(s_{k},a_{k}))=\mathcal{R}_{n+t,t}^{chaos}

Since YnpredY_{n}^{pred}, YnchaosY_{n}^{chaos} are respectively a predictable process and a martingale with respect to the filtration (𝒢n)n0(\mathcal{G}_{n})_{n\geq 0} by the Doob decomposition, and by definition 𝒢n:=n+t\mathcal{G}_{n}:=\mathcal{F}_{n+t}, we get that n,tπ,pred\mathcal{R}_{n,t}^{\pi,pred}, n,tchaos\mathcal{R}_{n,t}^{chaos} are respectively a predictable process and a martingale with respect to the filtration n\mathcal{F}_{n}, for nt+1n\geq t+1, and satisfy:

n,t𝔼π[n,t|st]=Ynt=Yntpred+Yntchaos=n,tπ,pred+n,tchaos\mathcal{R}_{n,t}-\mathbb{E}_{\pi}[\mathcal{R}_{n,t}|s_{t}]=Y_{n-t}=Y_{n-t}^{pred}+Y_{n-t}^{chaos}=\mathcal{R}_{n,t}^{\pi,pred}+\mathcal{R}_{n,t}^{chaos}

B.2 Proof of proposition 1

We have the Taylor expansion:

eβtchaos=1βtchaos+β22(tchaos)2+o(β2)e^{-\beta\mathcal{R}_{t}^{chaos}}=1-\beta\mathcal{R}_{t}^{chaos}+\frac{\beta^{2}}{2}(\mathcal{R}_{t}^{chaos})^{2}+o(\beta^{2})

We remind that the quadratic variation Mn\left<M_{n}\right> of a discrete-time, square-integrable martingale MM adapted to a filtration (n)n0(\mathcal{F}_{n})_{n\geq 0} satisfies 𝔼[Mn]=𝔼[Mn2]\mathbb{E}[\left<M_{n}\right>]=\mathbb{E}[M_{n}^{2}] and is given by:

Mn=k=1n𝔼[(MkMk1)2|k1]\left<M_{n}\right>=\sum_{k=1}^{n}\mathbb{E}[(M_{k}-M_{k-1})^{2}|\mathcal{F}_{k-1}]

By theorem 1, the process n,tchaos\mathcal{R}_{n,t}^{chaos} is a mean zero martingale, hence 𝔼π[tchaos|st]=0\mathbb{E}_{\pi}[\mathcal{R}_{t}^{chaos}|s_{t}]=0 by the dominated convergence theorem and the process Zn:=(n,tchaos)2n,tchaosZ_{n}:=(\mathcal{R}_{n,t}^{chaos})^{2}-\left<\mathcal{R}_{n,t}^{chaos}\right> is a mean zero martingale, where n,tchaos\left<\mathcal{R}_{n,t}^{chaos}\right> is the predictable quadratic variation of the martingale n,tchaos\mathcal{R}_{n,t}^{chaos}, given by:

n,tchaos=t=t(n1)tγ2(tt)𝔼[(Rt+1R¯(st,at))2|st,at]\left<\mathcal{R}_{n,t}^{chaos}\right>=\sum_{t^{\prime}=t}^{(n-1)\vee t}\gamma^{2(t^{\prime}-t)}\mathbb{E}[(R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))^{2}|s_{t^{\prime}},a_{t^{\prime}}]

This yields 𝔼π[Z|st]=𝔼π[Z0|st]=0\mathbb{E}_{\pi}[Z_{\infty}|s_{t}]=\mathbb{E}_{\pi}[Z_{0}|s_{t}]=0, that is:

𝔼π[(tchaos)2|st]=𝔼π[tchaos|st]\mathbb{E}_{\pi}[(\mathcal{R}_{t}^{chaos})^{2}|s_{t}]=\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]

All terms put together we get:

ρstβ,π(tchaos)=β1ln(1+β22𝔼π[tchaos|st]+o(β2))=β2𝔼π[tchaos|st]+o(β)\rho^{\beta,\pi}_{s_{t}}(\mathcal{R}_{t}^{chaos})=\beta^{-1}\ln\left(1+\frac{\beta^{2}}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]+o(\beta^{2})\right)=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}]+o(\beta)

We now proceed to proving that:

𝒞𝝆𝜷,𝝅[t](st)β1ln𝔼π[e2β2tchaos|st]\mathcal{C}_{\bm{\rho^{\beta,\pi}}}[\mathcal{R}_{t}](s_{t})\leq\beta^{-1}\ln\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}

Since the logarithm function is increasing, it is sufficient to prove that 𝔼π[eβtchaos|st]𝔼π[e2β2tchaos|st]\mathbb{E}_{\pi}[e^{-\beta\mathcal{R}_{t}^{chaos}}|s_{t}]\leq\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}. Since βn,tchaos-\beta\mathcal{R}_{n,t}^{chaos} is a (bounded) martingale, we get that Yn:=exp(2βn,tchaos2β2n,tchaos)Y_{n}:=\exp(-2\beta\mathcal{R}_{n,t}^{chaos}-2\beta^{2}\left<\mathcal{R}_{n,t}^{chaos}\right>) is a martingale, namely the exponential martingale associated to 2βn,tchaos-2\beta\mathcal{R}_{n,t}^{chaos}. We then have:

eβn,tchaos=eβn,tchaosβ2n,tchaoseβ2n,tchaose^{-\beta\mathcal{R}_{n,t}^{chaos}}=e^{-\beta\mathcal{R}_{n,t}^{chaos}-\beta^{2}\left<\mathcal{R}_{n,t}^{chaos}\right>}e^{\beta^{2}\left<\mathcal{R}_{n,t}^{chaos}\right>}

By Hölder’s inequality and taking the limit as nn\to\infty (using the dominated convergence theorem), we get:

𝔼π[eβtchaos|st]𝔼π[e2βtchaos2β2tchaosY|st]𝔼π[e2β2tchaos|st]\mathbb{E}_{\pi}[e^{-\beta\mathcal{R}_{t}^{chaos}}|s_{t}]\leq\sqrt{\mathbb{E}_{\pi}[\underbrace{e^{-2\beta\mathcal{R}_{t}^{chaos}-2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}}_{Y_{\infty}}|s_{t}]}\sqrt{\mathbb{E}_{\pi}[e^{2\beta^{2}\left<\mathcal{R}_{t}^{chaos}\right>}|s_{t}]}

Since YnY_{n} is a martingale, 𝔼π[Y|st]=𝔼π[Y0|st]=1\mathbb{E}_{\pi}[Y_{\infty}|s_{t}]=\mathbb{E}_{\pi}[Y_{0}|s_{t}]=1, which gives the result.

B.3 Proof of convergence of Chaotic Mean-Variance Q-Learning algorithm of theorem 3.

By theorem 2 we have:

Qπ𝕍(β)(st,at)=𝔼[β2(Rt+1R¯(st,at))2+Vπ𝕍(β)(st+1)|st,at]Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+V^{\mathbb{V}(\beta)}_{\pi}(s_{t+1})|s_{t},a_{t}]

We also have the classical Bellman equation:

Qπ(st,at)=𝔼[Rt+1+Vπ(st+1)|st,at]Q_{\pi}(s_{t},a_{t})=\mathbb{E}[R_{t+1}+V_{\pi}(s_{t+1})|s_{t},a_{t}]

Hence Qπβ(st,at):=Qπ(st,at)Qπ𝕍(β)(st,at)Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t}) (and its associated value function VπβV^{\beta}_{\pi}) satisfies the classical Bellman equation with modified rewards RβR^{\beta}:

Qπβ(st,at)=𝔼[Rt+1β+Vπβ(st+1)|st,at]Q^{\beta}_{\pi}(s_{t},a_{t})=\mathbb{E}[R^{\beta}_{t+1}+V^{\beta}_{\pi}(s_{t+1})|s_{t},a_{t}]

where:

Rt+1β:=Rt+1β2(Rt+1R¯(st,at))2R^{\beta}_{t+1}:=R_{t+1}-\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}

By assumption of the theorem, $\sum_{k=1}^{\infty}\alpha_{n_{k}(s,a)}=+\infty$, hence the index $n_{k}(s,a)$ of the $k^{th}$ visit to $(s,a)$ tends to $+\infty$ as $k\to+\infty$, and so every state-action pair is visited infinitely often. As a consequence, $N_{t}(s,a)\to+\infty$ as $t\to+\infty$ with probability 1 and, by the strong law of large numbers, $\overline{R}_{t}(s,a)\to\overline{R}(s,a)$ as $t\to+\infty$ with probability 1. This guarantees, in the proof of Dayan & Watkins (1992) and using their notations, that in step B.3 of Lemma B (Rewards and transition probabilities converge with probability 1) the expected rewards $\mathcal{R}_{s}^{(t)}(a)$ of the so-called action-replay process (ARP) tend as $t\to+\infty$ to the expected rewards $\mathbb{E}[R^{\beta}_{t+1}|s_{t}=s,a_{t}=a]$ of the real process. The rest of the convergence proof of Dayan & Watkins (1992) goes through unchanged. Note that in the latter reference the proof is given in the discounted reward case with $\gamma<1$, but their section 4 "Discussions and Conclusions" discusses the extension to the episodic case with $\gamma=1$.
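For concreteness, here is a minimal tabular sketch of the Q-learning update with modified rewards $R^{\beta}$ discussed above. The environment object env and its reset/step interface are hypothetical placeholders, and the learning-rate schedule is illustrative.

```python
import numpy as np

def chaotic_mv_q_learning(env, n_states, n_actions, beta=0.5, gamma=0.99,
                          episodes=500, eps=0.1, seed=0):
    """Sketch of tabular Q-learning on the modified reward
    R^beta = R - (beta/2) * (R - Rbar(s, a))**2, with Rbar a running mean.
    `env` is a hypothetical object with reset() -> s and step(a) -> (s', r, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    Rbar = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # running estimate of the conditional mean reward Rbar(s, a)
            N[s, a] += 1
            Rbar[s, a] += (r - Rbar[s, a]) / N[s, a]
            # modified reward penalizing reward uncertainty
            r_beta = r - 0.5 * beta * (r - Rbar[s, a]) ** 2
            alpha = 1.0 / np.sqrt(N[s, a])
            target = r_beta + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```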

B.4 Proof of proposition 2

We have by definition:

Qπθ𝕍(β)(s,a)=β2𝔼πθ[0chaos|s0=s,a0=a]Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s,a)=\frac{\beta}{2}\mathbb{E}_{\pi_{\theta}}[\left<\mathcal{R}_{0}^{chaos}\right>|s_{0}=s,a_{0}=a]

and $V^{\mathbb{V}(\beta)}_{\pi_{\theta}}$ the associated value function, as in definition 4. We have the equality below, whose proof is very similar to that of Prashanth & Ghavamzadeh (2016) (lemma 1); we postpone it to the end of the present proof for the reader's convenience:

θVπθ𝕍(β)(s0)=𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπθ𝕍(β)(S,A)]\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(S,A)]

where dγ2θ(s0,s,a):=πθ(a|s)t=0γ2tπθ[st=s|s0]d^{\theta}_{\gamma^{2}}(s_{0},s,a):=\pi_{\theta}(a|s)\sum_{t=0}^{\infty}\gamma^{2t}\mathbb{P}_{\pi_{\theta}}[s_{t}=s|s_{0}] is the action-state γ2\gamma^{2}-discounted visiting distribution. Similarly we have, with R¯\overline{R} replaced by R^\widehat{R}:

θVπθ,ψ𝕍(β)(s0)=𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπθ,ψ𝕍(β)(S,A)]\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(s_{0})=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(S,A)]

where:

Qπθ,ψ𝕍(β)(s,a):=β2𝔼πθ[0,ψchaos|s0=s,a0=a]Q^{\mathbb{V}(\beta)}_{\pi_{\theta},\psi}(s,a):=\frac{\beta}{2}\mathbb{E}_{\pi_{\theta}}[\left<\mathcal{R}_{0,\psi}^{chaos}\right>|s_{0}=s,a_{0}=a]
0,ψchaos=t=0γ2t𝔼[(Rt+1R^ψ(st,at))2|st,at]\left<\mathcal{R}_{0,\psi}^{chaos}\right>=\sum_{t=0}^{\infty}\gamma^{2t}\mathbb{E}[(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}|s_{t},a_{t}]

By definition we have:

0,ψchaos0chaos=t=0γ2t(R^ψ(st,at)R¯(st,at))𝔼[R^ψ(st,at)+R¯(st,at)2Rt+1|st,at]\left<\mathcal{R}_{0,\psi}^{chaos}\right>-\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=0}^{\infty}\gamma^{2t}(\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t}))\mathbb{E}[\widehat{R}_{\psi}(s_{t},a_{t})+\overline{R}(s_{t},a_{t})-2R_{t+1}|s_{t},a_{t}]

But:

𝔼[R^ψ(st,at)+R¯(st,at)2Rt+1|st,at]=R^ψ(st,at)+R¯(st,at)2R¯(st,at)=R^ψ(st,at)R¯(st,at)\mathbb{E}[\widehat{R}_{\psi}(s_{t},a_{t})+\overline{R}(s_{t},a_{t})-2R_{t+1}|s_{t},a_{t}]=\widehat{R}_{\psi}(s_{t},a_{t})+\overline{R}(s_{t},a_{t})-2\overline{R}(s_{t},a_{t})=\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t})

Hence:

0,ψchaos0chaos=t=0γ2t(R^ψ(st,at)R¯(st,at))2\left<\mathcal{R}_{0,\psi}^{chaos}\right>-\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=0}^{\infty}\gamma^{2t}(\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t}))^{2}

which shows that:

Bψθ(s0)=β2𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)bπθψ(S,A)]B_{\psi}^{\theta}(s_{0})=\frac{\beta}{2}\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)b^{\psi}_{\pi_{\theta}}(S,A)]

with:

bπθψ(s,a):=𝔼πθ[t=0γ2t(R^ψ(st,at)R¯(st,at))2|s0=s,a0=a]b^{\psi}_{\pi_{\theta}}(s,a):=\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{\infty}\gamma^{2t}(\widehat{R}_{\psi}(s_{t},a_{t})-\overline{R}(s_{t},a_{t}))^{2}|s_{0}=s,a_{0}=a]

Finally, we prove as claimed earlier that:

θVπθ𝕍(β)(s0)=𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπθ𝕍(β)(S,A)]\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(S,A)]

The proof is very similar to that of Prashanth & Ghavamzadeh, (2016) (lemma 1). We first use Bellman equation for Qπθ𝕍(β)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}} (theorem 2):

Qπθ𝕍(β)(st,at)=𝔼[β2(Rt+1R¯(st,at))2+γ2Vπθ𝕍(β)(st+1)|st,at]Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}+\gamma^{2}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t+1})|s_{t},a_{t}]

hence taking the gradient:

θQπθ𝕍(β)(st,at)=γ2𝔼[θVπθ𝕍(β)(st+1)|st,at]=γ2𝕊θVπθ𝕍(β)(s)P(st,at,s)𝑑s\nabla_{\theta}Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a_{t})=\gamma^{2}\mathbb{E}[\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t+1})|s_{t},a_{t}]=\gamma^{2}\int_{\mathbb{S}}\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s^{\prime})P(s_{t},a_{t},s^{\prime})ds^{\prime}

On the other hand by definition of the value function:

θVπθ𝕍(β)(st)=θ𝔸πθ(a|st)Qπθ𝕍(β)(st,a)𝑑a=𝔸θπθ(a|st)Qπθ𝕍(β)(st,a)𝑑a+𝔸πθ(a|st)θQπθ𝕍(β)(st,a)𝑑a\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t})=\nabla_{\theta}\int_{\mathbb{A}}\pi_{\theta}(a|s_{t})Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)da=\int_{\mathbb{A}}\nabla_{\theta}\pi_{\theta}(a|s_{t})Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)da+\int_{\mathbb{A}}\pi_{\theta}(a|s_{t})\nabla_{\theta}Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)da

Plugging in the expression obtained for θQπθ𝕍(β)(st,at)\nabla_{\theta}Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a_{t}) we get:

θVπθ𝕍(β)(st)=𝔸[θπθ(a|st)Qπθ𝕍(β)(st,a)+γ2πθ(a|st)𝕊θVπθ𝕍(β)(s)P(st,a,s)𝑑s]𝑑a\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t})=\int_{\mathbb{A}}\left[\nabla_{\theta}\pi_{\theta}(a|s_{t})Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t},a)\right.\left.+\gamma^{2}\pi_{\theta}(a|s_{t})\int_{\mathbb{S}}\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s^{\prime})P(s_{t},a,s^{\prime})ds^{\prime}\right]da

After unrolling θVπθ𝕍(β)(s)\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s^{\prime}) infinitely many times we get:

θVπθ𝕍(β)(st)=𝔸𝕊t=tγ2(tt)θ[st=s|st]θπθ(a|s)Qπθ𝕍(β)(s,a)dsda\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t})=\int_{\mathbb{A}}\int_{\mathbb{S}}\sum_{t^{\prime}=t}^{\infty}\gamma^{2(t^{\prime}-t)}\mathbb{P}_{\theta}[s_{t^{\prime}}=s|s_{t}]\nabla_{\theta}\pi_{\theta}(a|s)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s,a)dsda

Since θπθ(a|s)=πθ(a|s)θlnπθ(a|s)\nabla_{\theta}\pi_{\theta}(a|s)=\pi_{\theta}(a|s)\nabla_{\theta}\ln\pi_{\theta}(a|s), and using the definition of dγ2θ(st,s,a)d^{\theta}_{\gamma^{2}}(s_{t},s,a), we get that θVπθ𝕍(β)(st)\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{t}) is equal to:

𝔸𝕊θlnπθ(a|s)Qπθ𝕍(β)(s,a)dγ2θ(st,s,a)𝑑s𝑑a\int_{\mathbb{A}}\int_{\mathbb{S}}\nabla_{\theta}\ln\pi_{\theta}(a|s)Q^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s,a)d^{\theta}_{\gamma^{2}}(s_{t},s,a)dsda

which completes the proof.

Appendix C Toy example of section 1.2: generalization and quantitative results

We slightly generalize the example in section 1.2 by considering the case where there are $N$ states, 2 actions, and at each state transition the probability of reaching state $n$ is given by $P(s_{t+1}=n|s_{t},a_{t})=P(s_{0}=n)=p_{n}\in[0,1]$ with $\sum_{n=1}^{N}p_{n}=1$, and the rewards are as follows:

  • if action 1 is chosen, the reward received at t+1t+1 if st=ns_{t}=n is the constant μn\mu_{n}\in\mathbb{R}.

  • if action 2 is chosen, the reward received at $t+1$ if $s_{t}=n$ is $\mu_{n}+\kappa_{n}+\sigma_{n}h_{t+1}$, where $(h_{t})$ are i.i.d. with zero mean and unit variance, and $\kappa_{n},\sigma_{n}\in\mathbb{R}$.

Hence the reward has the compact formulation:

R(st=n,at,st+1,ht+1)=μn+(κn+σnht+1)1{at=2}R(s_{t}=n,a_{t},s_{t+1},h_{t+1})=\mu_{n}+(\kappa_{n}+\sigma_{n}h_{t+1})1_{\{a_{t}=2\}}

where we denote the indicator function by $1_{\{\cdot\}}$. Example 1 below formulates quantitatively the informal discussion in section 1.2 by showing that, using cumulative reward variance as a measure of risk and provided the noise is small enough, a policy that leaves the agent exposed to a truly risky component $h_{t}$ (i.e. one that can hit any arbitrarily low value with positive probability) may be deemed less risky than a policy that does not; hence the "true risk" fails to be captured by this standard criterion.

Example 1.

(Risk fails to be captured using cumulative reward variance as a measure of risk). Let us consider the regime-switching example described above, denote by $\mathbb{V}$ the variance operator and introduce the classical variance penalty $v_{\pi}(\mathcal{R}_{0})$ (in this specific example the variance of the sum is equal to the sum of the variances, as the rewards $R_{t}$ are independent):

vπ(0):=𝕍π[t=1TRt]=t=1T𝕍π[Rt]v_{\pi}(\mathcal{R}_{0}):=\mathbb{V}_{\pi}\left[\sum_{t=1}^{T}R_{t}\right]=\sum_{t=1}^{T}\mathbb{V}_{\pi}[R_{t}]

If n=1Npnκn=0\sum_{n=1}^{N}p_{n}\kappa_{n}=0 and πi\pi_{i} is the policy that always selects action ii, we get:

vπ2(0)vπ1(0)=Tn=1Npn(κn2+2μnκn+σn2)v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})=T\sum_{n=1}^{N}p_{n}\left(\kappa_{n}^{2}+2\mu_{n}\kappa_{n}+\sigma_{n}^{2}\right)

In particular, $v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})$ can be negative. For example if $N=2$, $p_{n}=0.5$, $\mu_{2}=\mu_{1}+\delta$, $\kappa_{2}=-\delta\epsilon=-\kappa_{1}$, then $v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})=T\delta^{2}\left(\epsilon^{2}-\epsilon+\frac{1}{2\delta^{2}}(\sigma_{1}^{2}+\sigma_{2}^{2})\right)$, i.e. the latter is negative for suitably chosen $\epsilon$ provided the average noise is small enough, namely $\frac{1}{2}(\sigma_{1}^{2}+\sigma_{2}^{2})<\frac{1}{4}\delta^{2}$.

Proof.

We have, with π(n):=[at=2|st=n]\pi(n):=\mathbb{P}[a_{t}=2|s_{t}=n]:

t=1T𝕍π[Rt]=t=1Tn=1Npn𝔼π[Rt2|st1=n]t=1T(n=1Npn𝔼π[Rt|st1=n])2\sum_{t=1}^{T}\mathbb{V}_{\pi}[R_{t}]=\sum_{t=1}^{T}\sum_{n=1}^{N}p_{n}\mathbb{E}_{\pi}[R_{t}^{2}|s_{t-1}=n]-\sum_{t=1}^{T}(\sum_{n=1}^{N}p_{n}\mathbb{E}_{\pi}[R_{t}|s_{t-1}=n])^{2}
=Tn=1Npn(μn2+(κn2+σn2+2μnκn)π(n))T(n=1Npn(μn+π(n)κn))2=T\sum_{n=1}^{N}p_{n}(\mu_{n}^{2}+(\kappa_{n}^{2}+\sigma_{n}^{2}+2\mu_{n}\kappa_{n})\pi(n))-T(\sum_{n=1}^{N}p_{n}(\mu_{n}+\pi(n)\kappa_{n}))^{2}

By definition of π1\pi_{1} and π2\pi_{2}, we have π1(n)=0\pi_{1}(n)=0 and π2(n)=1\pi_{2}(n)=1 for all nn and hence:

vπ2(0)vπ1(0)=Tn=1Npn(κn2+σn2+2μnκn)v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})=T\sum_{n=1}^{N}p_{n}(\kappa_{n}^{2}+\sigma_{n}^{2}+2\mu_{n}\kappa_{n})
(n=1Npn(μn+κn))2+(n=1Npnμn)2-(\sum_{n=1}^{N}p_{n}(\mu_{n}+\kappa_{n}))^{2}+(\sum_{n=1}^{N}p_{n}\mu_{n})^{2}

Since by assumption n=1Npnκn=0\sum_{n=1}^{N}p_{n}\kappa_{n}=0, we get the desired result. If N=2N=2, pn=0.5p_{n}=0.5, μ2=μ1+δ\mu_{2}=\mu_{1}+\delta, κ2=δϵ=κ1\kappa_{2}=-\delta\epsilon=-\kappa_{1} then:

vπ2(0)vπ1(0)v_{\pi_{2}}(\mathcal{R}_{0})-v_{\pi_{1}}(\mathcal{R}_{0})
=T2(2δ2ϵ2+σ12+σ22+2μ1δϵ2δϵ(μ1+δ))=Tδ2(ϵ2+σ12+σ222δ2ϵ)=\frac{T}{2}(2\delta^{2}\epsilon^{2}+\sigma_{1}^{2}+\sigma_{2}^{2}+2\mu_{1}\delta\epsilon-2\delta\epsilon(\mu_{1}+\delta))=T\delta^{2}(\epsilon^{2}+\frac{\sigma_{1}^{2}+\sigma_{2}^{2}}{2\delta^{2}}-\epsilon)

The latter is a 2nd order polynomial in ϵ\epsilon with positive ϵ2\epsilon^{2} coefficient, hence it can take negative values if and only if it has two distinct real roots, i.e. if 12(σ12+σ22)<14δ2\frac{1}{2}(\sigma_{1}^{2}+\sigma_{2}^{2})<\frac{1}{4}\delta^{2}. ∎
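A quick Monte Carlo sanity check of example 1 is sketched below, with illustrative parameter values; the sampler simply re-implements the regime-switching rewards defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for the N = 2 case of example 1.
T, p = 20, np.array([0.5, 0.5])
delta, eps_ = 1.0, 0.5
mu = np.array([0.0, delta])
kappa = np.array([delta * eps_, -delta * eps_])   # sum_n p_n kappa_n = 0
sigma = np.array([0.2, 0.2])                      # small noise: (s1^2+s2^2)/2 < delta^2/4

def cumulative_reward(action, n_paths=200_000):
    states = rng.choice(2, size=(n_paths, T), p=p)
    r = mu[states].copy()
    if action == 2:
        r += kappa[states] + sigma[states] * rng.standard_normal((n_paths, T))
    return r.sum(axis=1)

v1 = cumulative_reward(1).var()
v2 = cumulative_reward(2).var()
analytic = T * (p * (kappa**2 + 2 * mu * kappa + sigma**2)).sum()
print(v2 - v1, analytic)
```

Both printed values are negative: the variance criterion ranks the policy exposed to the noise $h_{t}$ as the less risky one, in line with the statement of example 1.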

Proposition 3 below shows that the chaotic variance Vπ𝕍(β)V^{\mathbb{V}(\beta)}_{\pi} is proportional to the hidden noise, and in particular is zero in the absence of such noise. That is, chaotic variance captures the risky component hth_{t} contained in the rewards.

Proposition 3.

The chaotic variance (cf. definition 4) associated to the regime-switching example 1 is given by:

Vπ𝕍(β)=β2Tn=1Nσn2pnπ(n)V^{\mathbb{V}(\beta)}_{\pi}=\frac{\beta}{2}T\sum_{n=1}^{N}\sigma^{2}_{n}p_{n}\pi(n)

where π(n):=[at=2|st=n]\pi(n):=\mathbb{P}[a_{t}=2|s_{t}=n] and Vπ𝕍(β):=n=1NVπ𝕍(β)(n)pnV^{\mathbb{V}(\beta)}_{\pi}:=\sum_{n=1}^{N}V^{\mathbb{V}(\beta)}_{\pi}(n)p_{n}.

Proof.

By definition:

0chaos=t=1T𝔼π[(Rt𝔼π[Rt|st1,at1])2|st1,at1]\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=1}^{T}\mathbb{E}_{\pi}[(R_{t}-\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}])^{2}|s_{t-1},a_{t-1}]

By definition of $R_{t}$, and since $h_{t}$ has zero mean and is independent of $(s_{t-1},a_{t-1})$, we get:

\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}]=\mu_{s_{t-1}}+\kappa_{s_{t-1}}1_{\{a_{t-1}=2\}}

and hence:

Rt𝔼π[Rt|st1,at1]=1{at1=2}σst1htR_{t}-\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}]=1_{\{a_{t-1}=2\}}\sigma_{s_{t-1}}h_{t}

so that:

𝔼π[(Rt𝔼π[Rt|st1,at1])2|st1,at1]=1{at1=2}σst12\mathbb{E}_{\pi}[(R_{t}-\mathbb{E}_{\pi}[R_{t}|s_{t-1},a_{t-1}])^{2}|s_{t-1},a_{t-1}]=1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}

and therefore:

0chaos=t=1T1{at1=2}σst12\left<\mathcal{R}_{0}^{chaos}\right>=\sum_{t=1}^{T}1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}

By definition the chaotic variance is given by Vπ𝕍(β)(n)=β2𝔼π[0chaos|s0=n]V^{\mathbb{V}(\beta)}_{\pi}(n)=\frac{\beta}{2}\mathbb{E}_{\pi}[\left<\mathcal{R}_{0}^{chaos}\right>|s_{0}=n], and hence:

Vπ𝕍(β)(n)=β2𝔼π[1{a0=2}σs02|s0=n]+β2t=2T𝔼π[1{at1=2}σst12|s0=n]V^{\mathbb{V}(\beta)}_{\pi}(n)=\frac{\beta}{2}\mathbb{E}_{\pi}[1_{\{a_{0}=2\}}\sigma^{2}_{s_{0}}|s_{0}=n]+\frac{\beta}{2}\sum_{t=2}^{T}\mathbb{E}_{\pi}[1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}|s_{0}=n]

But 𝔼π[1{a0=2}σs02|s0=n]=π(n)σn2\mathbb{E}_{\pi}[1_{\{a_{0}=2\}}\sigma^{2}_{s_{0}}|s_{0}=n]=\pi(n)\sigma^{2}_{n} and for t2t\geq 2:

𝔼π[1{at1=2}σst12|s0=n]=k=1Nπ(k)σk2pk\mathbb{E}_{\pi}[1_{\{a_{t-1}=2\}}\sigma^{2}_{s_{t-1}}|s_{0}=n]=\sum_{k=1}^{N}\pi(k)\sigma^{2}_{k}p_{k}

Plugging in the latter expression we get all in all:

Vπ𝕍(β)(n)=β2π(n)σn2+β2(T1)k=1Nπ(k)σk2pkV^{\mathbb{V}(\beta)}_{\pi}(n)=\frac{\beta}{2}\pi(n)\sigma^{2}_{n}+\frac{\beta}{2}(T-1)\sum_{k=1}^{N}\pi(k)\sigma^{2}_{k}p_{k}

Since by definition Vπ𝕍(β)=n=1NVπ𝕍(β)(n)pnV^{\mathbb{V}(\beta)}_{\pi}=\sum_{n=1}^{N}V^{\mathbb{V}(\beta)}_{\pi}(n)p_{n} we get eventually:

Vπ𝕍(β)=β2n=1Nπ(n)σn2pn+β2(T1)n=1Npnk=1Nπ(k)σk2pk=Tβ2n=1Npnπ(n)σn2V^{\mathbb{V}(\beta)}_{\pi}=\frac{\beta}{2}\sum_{n=1}^{N}\pi(n)\sigma^{2}_{n}p_{n}+\frac{\beta}{2}(T-1)\sum_{n=1}^{N}p_{n}\sum_{k=1}^{N}\pi(k)\sigma^{2}_{k}p_{k}=T\frac{\beta}{2}\sum_{n=1}^{N}p_{n}\pi(n)\sigma^{2}_{n}
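The closed-form expression of proposition 3 can also be checked by simulation; the sketch below uses illustrative values of $p_{n}$, $\sigma_{n}$ and $\pi(n)$ chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters for the regime-switching setup of example 1.
T, N = 20, 3
p = np.array([0.2, 0.5, 0.3])
sigma = np.array([0.1, 0.4, 0.8])
pi2 = np.array([0.9, 0.5, 0.1])   # pi(n) = P[a_t = 2 | s_t = n]
beta = 1.0

def chaotic_quadratic_variation(n_paths=200_000):
    # sum_t 1{a_{t-1}=2} sigma^2_{s_{t-1}}, sampled over trajectories
    states = rng.choice(N, size=(n_paths, T), p=p)
    take_risky = rng.random((n_paths, T)) < pi2[states]
    return (take_risky * sigma[states] ** 2).sum(axis=1)

mc = 0.5 * beta * chaotic_quadratic_variation().mean()
analytic = 0.5 * beta * T * (sigma**2 * p * pi2).sum()
print(mc, analytic)   # the two values agree up to Monte Carlo error
```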

Appendix D Average reward version of the Chaotic Mean-Variance Q-Learning algorithm (theorem 3)

In the average reward framework, some modifications are required since $\sum_{t=0}^{\infty}R_{t+1}$ no longer necessarily converges. We follow the spirit of Prashanth & Ghavamzadeh (2016) and start by defining

ρπ:=limn1nt=0nRt+1,~t:=t=t(Rt+1ρπ)\rho_{\pi}:=\lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n}R_{t+1},\hskip 8.53581pt\widetilde{\mathcal{R}}_{t}:=\sum_{t^{\prime}=t}^{\infty}(R_{t^{\prime}+1}-\rho_{\pi})
Qπ(st,at)=𝔼π[~t|st,at]Q_{\pi}(s_{t},a_{t})=\mathbb{E}_{\pi}[\widetilde{\mathcal{R}}_{t}|s_{t},a_{t}]

The chaotic variance process needs to be modified similarly according to a recentering by σπ\sigma_{\pi}:

σπ:=limn1nt=0n(Rt+1R¯(st,at))2\sigma_{\pi}:=\lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}
\mathcal{R}^{\mathbb{V}(\beta)}_{t}:=\frac{\beta}{2}\sum_{t^{\prime}=t}^{\infty}((R_{t^{\prime}+1}-\overline{R}(s_{t^{\prime}},a_{t^{\prime}}))^{2}-\sigma_{\pi})
Qπ𝕍(β)(st,at)=𝔼π[t𝕍(β)|st,at]Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}_{\pi}[\mathcal{R}^{\mathbb{V}(\beta)}_{t}|s_{t},a_{t}]

In that case we get the following Bellman equations, similar to theorem 2, which will be used in theorem 4:

Qπ(st,at)=𝔼[Rt+1ρπ+γVπ(st+1)|st,at]Q_{\pi}(s_{t},a_{t})=\mathbb{E}[R_{t+1}-\rho_{\pi}+\gamma V_{\pi}(s_{t+1})|s_{t},a_{t}]
Qπ𝕍(β)(st,at)=𝔼[β2(Rt+1R¯(st,at))2β2σπ+γ2Vπ𝕍(β)(st+1)|st,at]Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})=\mathbb{E}[\frac{\beta}{2}(R_{t+1}-\overline{R}(s_{t},a_{t}))^{2}-\frac{\beta}{2}\sigma_{\pi}+\gamma^{2}V^{\mathbb{V}(\beta)}_{\pi}(s_{t+1})|s_{t},a_{t}]

We present below the chaotic mean-variance version of the R-learning algorithm of Mahadevan (1996). The algorithm is similar to the one presented in the episodic case (theorem 3), except that the rewards and the chaotic variance have to be recentered by their ergodic mean values as specified above. The mean values $\rho_{\pi}$ and $\sigma_{\pi}$ have to be estimated during the course of the algorithm.

Theorem 4.

(Chaotic Mean-Variance R-Learning in the average reward case). Let $Q^{\beta}_{\pi}(s_{t},a_{t}):=Q_{\pi}(s_{t},a_{t})-Q^{\mathbb{V}(\beta)}_{\pi}(s_{t},a_{t})$ and let $(s_{t})$, $(a_{t})$ and $(R_{t+1})$ denote the successive states, actions and rewards observed by the agent. Let $(\alpha^{(1)}_{t})$, $(\alpha^{(2)}_{t})$ be two sequences of learning rates. The Chaotic Mean-Variance R-Learning algorithm is given by:

Nt(s,a)=Nt1(s,a)+1\displaystyle N_{t}(s,a)=N_{t-1}(s,a)+1
R¯t(s,a)=R¯t1(s,a)+1Nt(s,a)(Rt+1R¯t1(s,a))\displaystyle\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)+\frac{1}{N_{t}(s,a)}(R_{t+1}-\overline{R}_{t-1}(s,a))
Qtβ(s,a)=(1αt(1))Qt1β(s,a)+αt(1)(Rt+1ρt112β((Rt+1R¯t(s,a))2σt1)+maxaQt1β(st+1,a))\displaystyle Q^{\beta}_{t}(s,a)=(1-\alpha^{(1)}_{t})Q^{\beta}_{t-1}(s,a)+\alpha^{(1)}_{t}(R_{t+1}-\rho_{t-1}-\frac{1}{2}\beta((R_{t+1}-\overline{R}_{t}(s,a))^{2}-\sigma_{t-1})+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime}))
\displaystyle\rho_{t}=(1-\alpha^{(2)}_{t})\rho_{t-1}+\alpha^{(2)}_{t}(R_{t+1}+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime})-\max_{a^{\prime}}Q^{\beta}_{t-1}(s,a^{\prime}))
\displaystyle\sigma_{t}=(1-\alpha^{(2)}_{t})\sigma_{t-1}+\alpha^{(2)}_{t}((R_{t+1}-\overline{R}_{t}(s,a))^{2}+\max_{a^{\prime}}Q^{\beta}_{t-1}(s_{t+1},a^{\prime})-\max_{a^{\prime}}Q^{\beta}_{t-1}(s,a^{\prime}))

if $s_{t}=s$ and $a_{t}=a$; otherwise $Q^{\beta}_{t}(s,a)=Q^{\beta}_{t-1}(s,a)$, $N_{t}(s,a)=N_{t-1}(s,a)$ and $\overline{R}_{t}(s,a)=\overline{R}_{t-1}(s,a)$.
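A minimal sketch of these updates is given below; the environment interface env.reset/env.step is a hypothetical placeholder, the learning-rate schedules are illustrative, and, as in theorem 3, $\overline{R}_{t}$ is estimated by a running mean.

```python
import numpy as np

def chaotic_mv_r_learning(env, n_states, n_actions, beta=0.5,
                          steps=100_000, eps=0.1, seed=0):
    """Sketch of the Chaotic Mean-Variance R-Learning updates of theorem 4."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    Rbar = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    rho, sig = 0.0, 0.0
    s = env.reset()                       # hypothetical interface: reset() -> s
    for t in range(1, steps + 1):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = env.step(a)           # hypothetical interface: step(a) -> (s', r)
        alpha1, alpha2 = 1.0 / np.sqrt(t), 1.0 / t   # illustrative learning rates
        # running mean estimate of the conditional expected reward
        N[s, a] += 1
        Rbar[s, a] += (r - Rbar[s, a]) / N[s, a]
        chaotic = (r - Rbar[s, a]) ** 2
        max_next, max_cur = Q[s_next].max(), Q[s].max()
        # Q update on the recentered, chaotic-variance-penalized reward
        Q[s, a] += alpha1 * (r - rho - 0.5 * beta * (chaotic - sig)
                             + max_next - Q[s, a])
        # estimates of the ergodic means rho_pi and sigma_pi
        rho += alpha2 * (r + max_next - max_cur - rho)
        sig += alpha2 * (chaotic + max_next - max_cur - sig)
        s = s_next
    return Q, rho, sig
```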

Appendix E Episodic Monte Carlo Policy gradient algorithms for additional risk measures: the chaotic CVaR and Sharpe Ratio cases

Here we discuss the extension of CMV-REINFORCE algorithm 3 to the chaotic Sharpe ratio and CVaR cases.

E.1 The chaotic Sharpe Ratio case

The chaotic Sharpe ratio - which we seek to maximize - is defined as:

𝒞πSh[t](s0):=Vπ(s0)Vπ𝕍(s0)\mathcal{C}_{\pi}^{Sh}[\mathcal{R}_{t}](s_{0}):=\frac{V_{\pi}(s_{0})}{\sqrt{V^{\mathbb{V}}_{\pi}(s_{0})}}

where we remind that Vπ(st):=Eπ[t|st]V_{\pi}(s_{t}):=E_{\pi}[\mathcal{R}_{t}|s_{t}], Vπ𝕍(st):=Vπ𝕍(2)(st)=𝔼π[tchaos|st]V^{\mathbb{V}}_{\pi}(s_{t}):=V^{\mathbb{V}(2)}_{\pi}(s_{t})=\mathbb{E}_{\pi}[\left<\mathcal{R}_{t}^{chaos}\right>|s_{t}], and we assume that Vπ𝕍(s0)>0V^{\mathbb{V}}_{\pi}(s_{0})>0 π\mathbb{P}_{\pi}-a.s. Taking the gradient, we get:

θ𝒞πSh[t](s0)=θVπ(s0)Vπ𝕍(s0)12Vπ(s0)θVπ𝕍(s0)Vπ𝕍(s0)3/2\nabla_{\theta}\mathcal{C}_{\pi}^{Sh}[\mathcal{R}_{t}](s_{0})=\frac{\nabla_{\theta}V_{\pi}(s_{0})}{\sqrt{V^{\mathbb{V}}_{\pi}(s_{0})}}-\frac{1}{2}\frac{V_{\pi}(s_{0})\nabla_{\theta}V^{\mathbb{V}}_{\pi}(s_{0})}{V^{\mathbb{V}}_{\pi}(s_{0})^{3/2}}

Provided $V_{\pi}(s_{0})$ and $V^{\mathbb{V}}_{\pi}(s_{0})$ are known, we can compute unbiased estimates of the gradients $\nabla_{\theta}V_{\pi}(s_{0})$, $\nabla_{\theta}V^{\mathbb{V}}_{\pi}(s_{0})$ as done in algorithm 3, in particular updating $\widehat{R}_{\psi}$ the same way. In order to estimate $V_{\pi}(s_{0})$ and $V^{\mathbb{V}}_{\pi}(s_{0})$, we use the techniques developed in Tamar et al. (2012) (theorem 4.3), which rely on two timescales: $V_{\pi}(s_{0})$ and $V^{\mathbb{V}}_{\pi}(s_{0})$ are estimated on the faster timescale $\alpha^{(2)}_{n}$, so that they can be considered as converged when the $\theta$ update is performed on the slower timescale $\alpha^{(1)}_{n}$. We impose $\lim_{n\to\infty}\frac{\alpha^{(1)}_{n}}{\alpha^{(2)}_{n}}=0$ and we perform the fast timescale updates:

V_{n+1}=V_{n}+\frac{\alpha^{(2)}_{n}}{B}\sum_{b=1}^{B}\left(\sum_{t=0}^{T_{b}-1}\gamma^{t}R^{(b)}_{t+1}-V_{n}\right)
V^{\mathbb{V}}_{n+1}=V^{\mathbb{V}}_{n}+\frac{\alpha^{(2)}_{n}}{B}\sum_{b=1}^{B}\left(\sum_{t=0}^{T_{b}-1}\gamma^{2t}(R^{(b)}_{t+1}-\widehat{R}_{\psi}(s^{(b)}_{t},a^{(b)}_{t}))^{2}-V^{\mathbb{V}}_{n}\right)

The unbiased estimates of the gradients are given as in algorithm 3 by:

Vn=1Bb=1Bt=0Tb1lnπθn(at(b)|st(b))t=tTb1γtRt+1(b)\nabla V_{n}=\frac{1}{B}\sum_{b=1}^{B}\sum_{t^{\prime}=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t^{\prime}}|s^{(b)}_{t^{\prime}})\sum_{t=t^{\prime}}^{T_{b}-1}\gamma^{t}R^{(b)}_{t+1}
Vn𝕍=1Bb=1Bt=0Tb1lnπθn(at(b)|st(b))t=tTb1γ2t(Rt+1(b)R^ψ(st(b),at(b)))2\nabla V^{\mathbb{V}}_{n}=\frac{1}{B}\sum_{b=1}^{B}\sum_{t^{\prime}=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t^{\prime}}|s^{(b)}_{t^{\prime}})\sum_{t=t^{\prime}}^{T_{b}-1}\gamma^{2t}(R^{(b)}_{t+1}-\widehat{R}_{\psi}(s^{(b)}_{t},a^{(b)}_{t}))^{2}

Finally, θ\theta is updated on the slow timescale by:

θn+1=θn+αn(1)(VnVn𝕍12VnVn𝕍Vn𝕍,3/2)\theta_{n+1}=\theta_{n}+\alpha^{(1)}_{n}\left(\frac{\nabla V_{n}}{\sqrt{V^{\mathbb{V}}_{n}}}-\frac{1}{2}\frac{V_{n}\nabla V^{\mathbb{V}}_{n}}{V^{\mathbb{V},3/2}_{n}}\right)

For completeness we present the assumptions under which the above algorithm is guaranteed to converge, cf. theorem 4.3 in Tamar et al. (2012), whose proof uses two timescales and an ODE-based approach:

  • limnαn(1)αn(2)=0\lim_{n\to\infty}\frac{\alpha^{(1)}_{n}}{\alpha^{(2)}_{n}}=0, and for j=1,2j=1,2: n=0αn(j)=+\sum_{n=0}^{\infty}\alpha^{(j)}_{n}=+\infty, n=0αn(j)2<+\sum_{n=0}^{\infty}\alpha^{(j)2}_{n}<+\infty.

  • Assumptions 1, 2 hold true (which guarantee in particular that VπV_{\pi} and Vπ𝕍V^{\mathbb{V}}_{\pi} are uniformly bounded).

  • For all θ\theta, the objective function fθf_{\theta} has bounded second derivatives. Furthermore, the set of local optima of fθf_{\theta} is countable. Here the objective function fθf_{\theta} is defined as:

    fθ(s)=𝒞πθSh[t](s)f_{\theta}(s)=\mathcal{C}_{\pi_{\theta}}^{Sh}[\mathcal{R}_{t}](s)
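A compact sketch of one iteration of the two-timescale scheme above follows. The per-episode returns, chaotic sums and score sums are assumed to have been computed from a batch of rollouts as in algorithm 3; for brevity the gradient estimates use the full-trajectory score times the episode total (also unbiased) rather than the reward-to-go form, and all names and step sizes are illustrative.

```python
import numpy as np

def sharpe_update(theta, V, V_var, returns, chaotic_sums, scores,
                  alpha_slow=1e-3, alpha_fast=1e-2):
    """One iteration of the two-timescale chaotic Sharpe-ratio scheme.
    returns[b]      : discounted cumulative reward of episode b
    chaotic_sums[b] : sum_t gamma^(2t) (R_{t+1} - Rhat_psi(s_t, a_t))^2 of episode b
    scores[b]       : sum_t grad log pi_theta(a_t | s_t) of episode b, shape (B, dim theta)
    V, V_var        : running estimates of V_pi(s_0) and V^V_pi(s_0) (fast timescale)."""
    # fast timescale: value and chaotic-variance estimates
    V = V + alpha_fast * (returns.mean() - V)
    V_var = V_var + alpha_fast * (chaotic_sums.mean() - V_var)
    # REINFORCE-style gradient estimates of V_pi(s_0) and V^V_pi(s_0)
    grad_V = (scores * returns[:, None]).mean(axis=0)
    grad_V_var = (scores * chaotic_sums[:, None]).mean(axis=0)
    # slow timescale: ascend the chaotic Sharpe ratio
    theta = theta + alpha_slow * (grad_V / np.sqrt(V_var)
                                  - 0.5 * V * grad_V_var / V_var ** 1.5)
    return theta, V, V_var

# tiny synthetic call with made-up batch statistics, just to show the shapes
rng = np.random.default_rng(0)
theta, V, V_var = np.zeros(4), 1.0, 1.0
theta, V, V_var = sharpe_update(theta, V, V_var,
                                returns=rng.normal(1.0, 0.5, size=32),
                                chaotic_sums=rng.uniform(0.5, 1.5, size=32),
                                scores=rng.normal(size=(32, 4)))
```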

E.2 The chaotic CVaR case

We recall that the $CVaR_{\beta}$ of a random variable $X$ is defined as $CVaR_{\beta}(X):=\mathbb{E}[X|X\leq Var_{\beta}(X)]$, where the value-at-risk $Var_{\beta}(X)=F_{X}^{-1}(\beta)$ and $F_{X}$ is the cumulative distribution function of $X$. We adapt the work of Chow et al. (2018) (algorithm 1) to the chaotic case, i.e. the case where $X=\mathcal{R}_{t}^{chaos}$ is the chaotic reward process. In order to obtain an unbiased estimate of the gradient of

𝒞πθCVaR(β)[t](s0):=𝔼πθ[tchaos(s0)|st,tchaos(s0)Varβ(tchaos(s0))]\mathcal{C}_{\pi_{\theta}}^{CVaR(\beta)}[\mathcal{R}_{t}](s_{0}):=\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{t}^{chaos}(s_{0})|s_{t},\mathcal{R}_{t}^{chaos}(s_{0})\leq Var_{\beta}(\mathcal{R}_{t}^{chaos}(s_{0}))]

we first estimate $Var_{\beta}(\mathcal{R}_{t}^{chaos}(s_{t}))$, as in Chow et al. (2018) (algorithm 1), on a fast timescale $\alpha^{(2)}_{n}$ so that it can be considered as converged when performing the $\theta$ update:

Varn+1=Varnαn(2)(1(1β)11Bb=1B1{ZbVarn})Var_{n+1}=Var_{n}-\alpha^{(2)}_{n}(1-(1-\beta)^{-1}\frac{1}{B}\sum_{b=1}^{B}1\{Z_{b}\geq Var_{n}\})
Zb:=t=0Tb1γ2t(Rt+1(b)R^ψ(st(b),at(b)))Z_{b}:=\sum_{t=0}^{T_{b}-1}\gamma^{2t}(R^{(b)}_{t+1}-\widehat{R}_{\psi}(s^{(b)}_{t},a^{(b)}_{t}))

We can then compute an unbiased estimate of the gradient CVaR^\widehat{\nabla CVaR} of 𝒞πθCVaR(β)[t](s0)\nabla\mathcal{C}_{\pi_{\theta}}^{CVaR(\beta)}[\mathcal{R}_{t}](s_{0}) as:

CVaRn^=1Bb=1Blnπθn,b(ZbVarn)1{ZbVarn}\widehat{\nabla CVaR_{n}}=\frac{1}{B}\sum_{b=1}^{B}\nabla\ln\pi_{\theta_{n},b}(Z_{b}-Var_{n})1\{Z_{b}\geq Var_{n}\}
lnπθn,b:=t=0Tb1lnπθn(at(b)|st(b))\nabla\ln\pi_{\theta_{n},b}:=\sum_{t=0}^{T_{b}-1}\nabla\ln\pi_{\theta_{n}}(a^{(b)}_{t}|s^{(b)}_{t})

Eventually, θ\theta is updated on the slow timescale αn(1)\alpha^{(1)}_{n} as:

θn+1=θn+αn(1)(1Bb=1BJbCVaRn^)\theta_{n+1}=\theta_{n}+\alpha^{(1)}_{n}(\frac{1}{B}\sum_{b=1}^{B}J_{b}-\widehat{\nabla CVaR_{n}})

where JbJ_{b} is as in algorithm 3.
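For completeness, here is a compact sketch of one iteration of the chaotic CVaR scheme above, operating on batch quantities assumed to have been computed from rollouts as in algorithm 3 (the per-episode gradient terms $J_{b}$, the chaotic sums $Z_{b}$ and the score sums); names and step sizes are illustrative.

```python
import numpy as np

def chaotic_cvar_update(theta, var_est, J_grads, Z, score_sums,
                        beta=0.95, alpha_slow=1e-3, alpha_fast=1e-2):
    """One iteration of the chaotic CVaR policy gradient sketch.
    J_grads[b]    : per-episode REINFORCE gradient term, as J_b in algorithm 3 (shape (B, d))
    Z[b]          : chaotic sum  sum_t gamma^(2t) (R_{t+1} - Rhat_psi(s_t, a_t))  of episode b
    score_sums[b] : sum_t grad log pi_theta(a_t | s_t) of episode b (shape (B, d))
    var_est       : running VaR estimate, updated on the fast timescale."""
    tail = (Z >= var_est).astype(float)
    # fast timescale: quantile (VaR) estimate of the chaotic sums
    var_est = var_est - alpha_fast * (1.0 - tail.mean() / (1.0 - beta))
    # estimate of the CVaR gradient
    grad_cvar = (score_sums * ((Z - var_est) * tail)[:, None]).mean(axis=0)
    # slow timescale: trade expected reward against the chaotic CVaR penalty
    theta = theta + alpha_slow * (J_grads.mean(axis=0) - grad_cvar)
    return theta, var_est

# tiny synthetic call with made-up batch data, just to show the shapes
rng = np.random.default_rng(0)
theta, var_est = np.zeros(4), 0.0
theta, var_est = chaotic_cvar_update(theta, var_est,
                                     J_grads=rng.normal(size=(32, 4)),
                                     Z=rng.normal(size=32),
                                     score_sums=rng.normal(size=(32, 4)))
```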

Appendix F Policy Gradient Actor-Critic algorithms

In the episodic, discounted or average reward settings, we can use the Bellman equations in theorem 2 to adapt the work of e.g. Prashanth & Ghavamzadeh (2016) (variance case) or Chow et al. (2018) (CVaR case) in order to derive chaotic versions of their actor-critic algorithms.

In the chaotic mean-variance discounted reward case, and under assumption 2, the policy gradient theorem gives us for the expected reward and risk-sensitive gradients:

θ𝔼π[0|s0]=𝔼(S,A)dγθ(s0)[lnπθ(A|S)Qπ(S,A)]\displaystyle\nabla_{\theta}\mathbb{E}_{\pi}[\mathcal{R}_{0}|s_{0}]=\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q_{\pi}(S,A)] (3)
θVπθ𝕍(β)(s0)=β2𝔼(S,A)dγ2θ(s0)[lnπθ(A|S)Qπ𝕍(β)(S,A)]\displaystyle\nabla_{\theta}V^{\mathbb{V}(\beta)}_{\pi_{\theta}}(s_{0})=\frac{\beta}{2}\mathbb{E}_{(S,A)\sim d^{\theta}_{\gamma^{2}}(s_{0})}[\nabla\ln\pi_{\theta}(A|S)Q^{\mathbb{V}(\beta)}_{\pi}(S,A)]

where $d^{\theta}_{\gamma^{n}}(s_{0},s,a):=\sum_{t=0}^{\infty}\gamma^{nt}\pi_{\theta}(a|s)\mathbb{P}_{\pi_{\theta}}[s_{t}=s|s_{0}]$ is the action-state $\gamma^{n}$-discounted visiting distribution. The proof of the above is very similar to that of Prashanth & Ghavamzadeh (2016) (lemma 1), and we have derived it in section B.4 for the convenience of the reader. In the average reward case, under assumption 2 and using the functions $Q_{\pi}$ and chaotic variance $V^{\mathbb{V}(\beta)}_{\pi}$ defined in section D, we get the same equalities except that $(S,A)\sim(d^{\theta},\pi_{\theta})$, where $d^{\theta}$ is the stationary distribution of the underlying Markov chain generated by having actions follow $\pi_{\theta}$. The proof of this result is very similar to that of equations (3) and is given in e.g. Prashanth & Ghavamzadeh (2016), lemma 3.

Note that in equations (3), the $Q$ functions can, as usual, be replaced by the corresponding advantage functions by subtracting the value functions.

In order to derive online Actor-Critic algorithms, we follow the approach in Prashanth & Ghavamzadeh (2016) by defining $V^{\phi}_{\pi}$, $V^{\phi,\mathbb{V}(\beta)}_{\pi}$ as parametric approximations of the value function and chaotic variance, where $\phi$ is a feature vector. We then apply theorem 2 to compute Temporal Differences that are used to estimate the advantage functions in equations (3), and to update the critic parameters.

Remark 2.

In the average reward setting, the policy gradient equations (3) can be used to derive an online Actor-Critic algorithm as in Prashanth & Ghavamzadeh (2016), because as tuples ($s_{t}$, $a_{t}$, $R_{t+1}$, $s_{t+1}$) are observed following $\pi_{\theta}$, we are guaranteed to end up in the limit in the stationary distribution $d^{\theta^{*}}$ of a local optimum $\theta^{*}$. In the episodic setting with $\gamma=1$, the same remark holds. Nevertheless, it is much less clear to the author how, in the infinite horizon discounted reward setting (with $\gamma<1$), the policy gradient equations (3) could be used rigorously, as they involve the $\gamma$- and $\gamma^{2}$-discounted visiting distributions. This subtlety is the focus of Thomas (2014) and was pointed out as well in Prashanth & Ghavamzadeh (2016), where the authors mention difficulties linked to the two policy gradient equations involving two distinct distributions, the $\gamma$- and $\gamma^{2}$-discounted visiting distributions, leading them to introduce new SF-based and SPSA-based algorithms.

In the average reward case we use the notations of section D to get algorithm 5, which is a modification of algorithm 2 in Prashanth & Ghavamzadeh (2016). The three differences are i) the use of the chaotic TD error $\delta_{2,t}$ based on the Bellman equation in theorem 2, ii) the update of the distributional function $\widehat{R}_{\psi}$ as it is done in algorithm 3 and iii) the use of the policy gradient equation (3) for the chaotic variance. In that case the value functions $V_{\pi}(s)$ and $V^{\mathbb{V}(\beta)}_{\pi}(s)$ are approximated by linear functions on lower dimensional spaces $\lambda_{1}^{T}\phi_{1}(s)$ and $\lambda_{2}^{T}\phi_{2}(s)$, respectively. The algorithm uses three timescales, so that parameters updated on the faster timescales can be considered as converged to their limiting value when considering updates on the slower timescales.

In the episodic setting with γ=1\gamma=1, we obtain algorithm 4, which is the episodic counterpart of algorithm 5.

It is to be noted that we can easily extend algorithms 4 and 5 to the chaotic Sharpe ratio case introduced in section E.1. To do so, we use the $\theta$ update adapted from section E.1:

\theta_{t+1}=\theta_{t}+\alpha^{(1)}_{t}\left(\frac{\varphi_{t}\delta_{1,t}}{\sqrt{\delta_{2,t}}}-\frac{1}{2}\frac{\delta_{1,t}\varphi_{t}\delta_{2,t}}{\delta_{2,t}^{3/2}}\right)

with φt:=lnπθ(at|st)\varphi_{t}:=\nabla\ln\pi_{\theta}(a_{t}|s_{t}).

We do not discuss here the Actor-Critic extensions of the chaotic risk framework to the infinite horizon discounted reward setting in the CVaR case (Chow et al. (2018)) and the mean-variance case (Prashanth & Ghavamzadeh (2016)), but they can be performed using techniques similar to those used in the present section and in sections E.1, E.2. In particular, the SF-based and SPSA-based techniques used in the above mentioned literature in the discounted reward framework go through with minor modifications, as in algorithm 5: i) computing the chaotic TD error $\delta_{2,t}$ based on the Bellman equation in theorem 2 and ii) computing the update of the distributional function $\widehat{R}_{\psi}$ as it is done in algorithm 3.

Algorithm 4 Online CMV-Actor-Critic (episodic case)

Input: Initial policy parameter θ0\theta_{0}, initial critic parameters λj,0\lambda_{j,0} (j=1..2j=1..2), learning rates αt(j)\alpha^{(j)}_{t} (j=1..3j=1..3), value function feature vectors ϕj\phi_{j} (j=1..2j=1..2), R^ψ\widehat{R}_{\psi} initialized to 0, Optional: experience replay table \mathcal{E}
Output: Approximation of the optimal policy parameter θ\theta^{*}

1:while θ\theta not converged do
2:     observe tuple (sts_{t}, ata_{t}, Rt+1R_{t+1}, st+1)s_{t+1}) following πθ(|)\pi_{\theta}(\cdot|\cdot)
3:     In tabular case: use tuple to update R^ψ\widehat{R}_{\psi} with eq. (2); else: increment \mathcal{E} with the tuple and update R^ψ\widehat{R}_{\psi} using algorithm 2.
4:     TD errors:
5:     δ1,t=Rt+1+λ1,tTϕ1(st+1)λ1,tTϕ1(st)\delta_{1,t}=R_{t+1}+\lambda_{1,t}^{T}\phi_{1}(s_{t+1})-\lambda_{1,t}^{T}\phi_{1}(s_{t})
6:     δ2,t=(Rt+1R^ψ(st,at))2+λ2,tTϕ2(st+1)λ2,tTϕ2(st)\delta_{2,t}=(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}+\lambda_{2,t}^{T}\phi_{2}(s_{t+1})-\lambda_{2,t}^{T}\phi_{2}(s_{t})
7:     Critic Updates:
8:     λ1,t+1=λ1,t+αt(2)δ1,tϕ1(st)\lambda_{1,t+1}=\lambda_{1,t}+\alpha^{(2)}_{t}\delta_{1,t}\phi_{1}(s_{t})
9:     λ2,t+1=λ2,t+αt(2)δ2,tϕ2(st)\lambda_{2,t+1}=\lambda_{2,t}+\alpha^{(2)}_{t}\delta_{2,t}\phi_{2}(s_{t})
10:     Policy Update:
11:     θt+1=θt+αt(1)lnπθ(at|st)(δ1,tβ2δ2,t)\theta_{t+1}=\theta_{t}+\alpha^{(1)}_{t}\nabla\ln\pi_{\theta}(a_{t}|s_{t})(\delta_{1,t}-\frac{\beta}{2}\delta_{2,t})
Algorithm 5 Online CMV-Actor-Critic (average reward case)

Input: Initial policy parameter θ0\theta_{0}, initial critic parameters λj,0\lambda_{j,0} (j=1..2j=1..2), learning rates αt(j)\alpha^{(j)}_{t} (j=1..3j=1..3), value function feature vectors ϕj\phi_{j} (j=1..2j=1..2), R^ψ\widehat{R}_{\psi} initialized to 0, Optional: experience replay table \mathcal{E}.
Output: Approximation of the optimal policy parameter θ\theta^{*}

1:while θ\theta not converged do
2:     observe tuple (sts_{t}, ata_{t}, Rt+1R_{t+1}, st+1)s_{t+1}) following πθ(|)\pi_{\theta}(\cdot|\cdot)
3:     In tabular case: use tuple to update R^ψ\widehat{R}_{\psi} with eq. (2); else: increment \mathcal{E} with the tuple and update R^ψ\widehat{R}_{\psi} using algorithm 2.
4:     Average rewards:
5:     ρt+1=(1αt(3))ρt+αt(3)Rt+1\rho_{t+1}=(1-\alpha^{(3)}_{t})\rho_{t}+\alpha^{(3)}_{t}R_{t+1}
6:     σt+1=(1αt(3))σt+αt(3)(Rt+1R^ψ(st,at))2\sigma_{t+1}=(1-\alpha^{(3)}_{t})\sigma_{t}+\alpha^{(3)}_{t}(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}
7:     TD errors:
8:     δ1,t=Rt+1ρt+1+λ1,tTϕ1(st+1)λ1,tTϕ1(st)\delta_{1,t}=R_{t+1}-\rho_{t+1}+\lambda_{1,t}^{T}\phi_{1}(s_{t+1})-\lambda_{1,t}^{T}\phi_{1}(s_{t})
9:     δ2,t=(Rt+1R^ψ(st,at))2σt+1+λ2,tTϕ2(st+1)λ2,tTϕ2(st)\delta_{2,t}=(R_{t+1}-\widehat{R}_{\psi}(s_{t},a_{t}))^{2}-\sigma_{t+1}+\lambda_{2,t}^{T}\phi_{2}(s_{t+1})-\lambda_{2,t}^{T}\phi_{2}(s_{t})
10:     Critic Updates:
11:     λ1,t+1=λ1,t+αt(2)δ1,tϕ1(st)\lambda_{1,t+1}=\lambda_{1,t}+\alpha^{(2)}_{t}\delta_{1,t}\phi_{1}(s_{t})
12:     λ2,t+1=λ2,t+αt(2)δ2,tϕ2(st)\lambda_{2,t+1}=\lambda_{2,t}+\alpha^{(2)}_{t}\delta_{2,t}\phi_{2}(s_{t})
13:     Policy Update:
14:     θt+1=θt+αt(1)lnπθ(at|st)(δ1,tβ2δ2,t)\theta_{t+1}=\theta_{t}+\alpha^{(1)}_{t}\nabla\ln\pi_{\theta}(a_{t}|s_{t})(\delta_{1,t}-\frac{\beta}{2}\delta_{2,t})

Appendix G Numerical values used in experiments of section 5 and additional experiments

G.1 Grid World - section 5.1

We train the policy using $5\cdot 10^{5}$ timesteps for each value of $\beta$, using an $\epsilon$-greedy exploration policy with probability $\epsilon=0.1$ of taking a random action, a $Q$ table initialized to zero, a learning rate $\alpha_{t}=N(s_{t},a_{t})^{-0.5}$ (where $N(s_{t},a_{t})$ counts the number of visits to the state-action pair $(s_{t},a_{t})$), and we set the probability of the robot going in a random direction to $p_{error}=0.5$. We then run the trained policy for $10^{5}$ rollout steps in order to obtain the path heatmap and risk penalty heatmap displayed in figure 2. In the latter, the results are averaged over 25 NumPy RNG seeds (1001, 1003, 1006, 1008, 1009, 1010, 1015, 1018, 1021, 1022, 1025, 1028, 1029, 1031, 1033, 1035, 1037, 1038, 1039, 1040, 1041, 1043, 1047, 1048, 1049): for each such seed, we perform the training and obtain the rollouts, which are then averaged over seeds.

In figure 5 we display path heatmaps similar to those of figure 2, but for various values of $p_{error}$, the probability that the robot goes in a random direction. Figure 6 is similar to figure 5 but with a more severe negative reward of -50 instead of -20. As expected, in figure 5, the higher $p_{error}$, the more reward uncertainty/stochasticity there is and the further away from the -20 reward the robot goes. The same observation holds for the more severe reward of -50 (figure 6).


Figure 5: Path heatmaps over 10510^{5} rollout steps with the learned policy for various β\beta and perrorp_{error}


Figure 6: Path heatmaps over 10510^{5} rollout steps with the learned policy for various β\beta and perrorp_{error}, and more severe negative reward of -50 instead of -20.

G.2 Portfolio optimization - section 5.2

As described in section 5.2, the action is defined as the quantities invested in the risk-free and risky assets at:=(qtRF,qtR)a_{t}:=(q^{RF}_{t},q^{R}_{t}) with a total budget constraint qtRF+qtRqmax=5q^{RF}_{t}+q^{R}_{t}\leq q_{max}=5 and qtRF,qtR0q^{RF}_{t},q^{R}_{t}\geq 0. This yields a total of 21 possible actions (0,0)(0,0), (1,0)(1,0), (0,1)(0,1),…, (4,1)(4,1), (1,4)(1,4), (0,5)(0,5), (5,0)(5,0).

In order to train the risk-sensitive policy in section 5.2 according to the CMV-REINFORCE algorithm 3, we use a learning rate α=0.1\alpha=0.1, batch size B=104B=10^{4} and perform M=5103M=5\cdot 10^{3} iterations on the policy parameter θ\theta. We use the Boltzmann exploration policy:

πθ(a|s)=eθTϕ(s,a)a𝔸eθTϕ(s,a)\pi_{\theta}(a|s)=\frac{e^{\theta^{T}\phi(s,a)}}{\sum_{a^{\prime}\in\mathbb{A}}e^{\theta^{T}\phi(s,a^{\prime})}}

where $\theta$ is of size $|\mathbb{S}|\cdot|\mathbb{A}|=3\cdot 21=63$ and $\phi(s,a)$ is the vector with all entries equal to zero except the entry corresponding to $(s,a)$, which is set to 1. For each value of $\beta$, we train the policy as described above and plot in figures 3 and 4 the corresponding metrics over $5\cdot 10^{4}$ rollout episodes of $T=20$ timesteps.
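A minimal sketch of this tabular softmax (Boltzmann) parametrization with one-hot features, and of its score function $\nabla_{\theta}\ln\pi_{\theta}(a|s)$ used by the policy gradient, is given below; variable names are illustrative.

```python
import numpy as np

n_states, n_actions = 3, 21
rng = np.random.default_rng(0)
theta = np.zeros(n_states * n_actions)   # one parameter per (state, action) pair

def policy(theta, s):
    """Boltzmann policy with one-hot features phi(s, a): the logits are simply
    the theta entries associated to state s."""
    logits = theta.reshape(n_states, n_actions)[s]
    z = np.exp(logits - logits.max())    # subtract the max for numerical stability
    return z / z.sum()

def grad_log_policy(theta, s, a):
    """grad_theta log pi_theta(a | s) for the tabular softmax parametrization."""
    g = np.zeros_like(theta).reshape(n_states, n_actions)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g.ravel()

a = rng.choice(n_actions, p=policy(theta, s=1))   # sample an action in state 1
```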

The values of the risk-free rate μ\mu and volatility σ\sigma in each state are reported in table 1. The state transition matrices P(s,a,s)P(s,a,s^{\prime}) depending on the action aa (via the quantity invested in the risky asset qtRq^{R}_{t}) are displayed in table 2. They have been designed such that the more we invest in the risky asset, the more likely we are to reach a higher volatility state.

Table 1: Section 5.2 Portfolio Optimization: risk-free rate $\mu$ and volatility $\sigma$ in each state
Parameter \ State | LowVol | MediumVol | HighVol
$\mu$             | 0.2    | 0.6       | 1.0
$\sigma$          | 0.5    | 1.0       | 1.5
Table 2: Section 5.2 Portfolio Optimization: state transition matrix $P(s,a,s^{\prime})$ as a function of the chosen action (quantity $q^{R}_{t}$ invested in the risky asset); rows are the current state $s$, columns the next state $s^{\prime}$
$q^{R}_{t}=5$       | LowVol | MediumVol | HighVol
LowVol              | 0.05   | 0.25      | 0.7
MediumVol           | 0.05   | 0.25      | 0.7
HighVol             | 0.05   | 0.25      | 0.7
$2<q^{R}_{t}<5$     | LowVol | MediumVol | HighVol
LowVol              | 0.1    | 0.45      | 0.45
MediumVol           | 0.1    | 0.45      | 0.45
HighVol             | 0.1    | 0.45      | 0.45
$0<q^{R}_{t}\leq 2$ | LowVol | MediumVol | HighVol
LowVol              | 1/3    | 1/3       | 1/3
MediumVol           | 1/3    | 1/3       | 1/3
HighVol             | 1/3    | 1/3       | 1/3
$q^{R}_{t}=0$       | LowVol | MediumVol | HighVol
LowVol              | 0.5    | 0.45      | 0.05
MediumVol           | 0.5    | 0.45      | 0.05
HighVol             | 0.5    | 0.45      | 0.05

In figure 3 and 4, we compare CMV-REINFORCE (algorithm 3) with the mean-variance baseline. Denoting 𝕍πθ\mathbb{V}_{\pi_{\theta}} the variance operator, the gradient of the variance θ𝕍πθ[0|s0]\nabla_{\theta}\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}] needed in the mean-variance method is computed using the classical likelihood ratio technique. Indeed, 𝕍πθ[0|s0]=𝔼πθ[02|s0](𝔼πθ[0|s0])2\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]=\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}^{2}|s_{0}]-\left(\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]\right)^{2}, and hence:

θ𝕍πθ[0|s0]=θ𝔼πθ[02|s0]2𝔼πθ[0|s0]θ𝔼πθ[0|s0]\nabla_{\theta}\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]=\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}^{2}|s_{0}]-2\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]

The likelihood ratio technique then yields (Tamar et al. (2012), lemma 4.2):

θ𝕍πθ[0|s0]=1Bb=1B(Vb2μJJb)t=0Tb1lnπθ(at(b)|st(b))\displaystyle\nabla_{\theta}\mathbb{V}_{\pi_{\theta}}[\mathcal{R}_{0}|s_{0}]=\frac{1}{B}\sum_{b=1}^{B}(V_{b}-2\mu_{J}J_{b})\sum_{t=0}^{T_{b}-1}\nabla\ln\pi_{\theta}(a^{(b)}_{t}|s^{(b)}_{t}) (4)

with:

Jb:=t=0Tb1γtRt+1(b),Vb:=Jb2,μJ:=1Bb=1BJbJ_{b}:=\sum_{t=0}^{T_{b}-1}\gamma^{t}R^{(b)}_{t+1},\hskip 5.69054ptV_{b}:=J_{b}^{2},\hskip 5.69054pt\mu_{J}:=\frac{1}{B}\sum_{b=1}^{B}J_{b}

Note that we additionally apply an optimal baseline $\ell^{*}$ (Necchi (2016), section 4.4.1.1) by replacing $V_{b}-2\mu_{J}J_{b}$ by $V_{b}-2\mu_{J}J_{b}-\ell^{*}$ in equation (4), with $\ell^{*}$ the vector whose $k^{th}$ component is given by:

k=𝔼πθ[(022μJ0)(θklnπθ)2]𝔼πθ[(θklnπθ)2]\ell^{*}_{k}=\frac{\mathbb{E}_{\pi_{\theta}}[(\mathcal{R}_{0}^{2}-2\mu_{J}\mathcal{R}_{0})(\partial_{\theta_{k}}\ln\pi_{\theta})^{2}]}{\mathbb{E}_{\pi_{\theta}}[(\partial_{\theta_{k}}\ln\pi_{\theta})^{2}]}
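A short sketch of this baselined likelihood-ratio estimator (4) is given below; the batch quantities $J_{b}$ and the per-episode score sums are assumed to have been precomputed from rollouts, $\ell^{*}$ is estimated componentwise from the same batch, and the small constant added to the denominator is only there to avoid division by zero in the sketch.

```python
import numpy as np

def variance_gradient(J, score_sums):
    """Likelihood-ratio estimate of grad_theta Var[R_0 | s_0] (equation (4)),
    with the componentwise optimal baseline l* estimated from the same batch.
    J[b]          : discounted cumulative reward of episode b
    score_sums[b] : sum_t grad log pi_theta(a_t | s_t) of episode b (shape (B, d))."""
    mu_J = J.mean()
    V = J ** 2
    target = V - 2.0 * mu_J * J                      # V_b - 2 mu_J J_b
    w = score_sums ** 2                              # (d_theta_k log pi)^2, per component
    l_star = (w * target[:, None]).mean(axis=0) / (w.mean(axis=0) + 1e-12)
    return (score_sums * (target[:, None] - l_star)).mean(axis=0)

# tiny synthetic call with made-up batch data, just to show the shapes
rng = np.random.default_rng(0)
g = variance_gradient(J=rng.normal(1.0, 0.5, size=64),
                      score_sums=rng.normal(size=(64, 10)))
```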