
Reinforcement Learning for Jump-Diffusions, with Financial Applications

Xuefeng Gao Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China. E-mail: [email protected]    Lingfei Li Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China. E-mail: [email protected]    Xun Yu Zhou Department of Industrial Engineering and Operations Research and The Data Science Institute, Columbia University, New York, NY 10027, USA. Email: [email protected]
Abstract

We study continuous-time reinforcement learning (RL) for stochastic control in which system dynamics are governed by jump-diffusion processes. We formulate an entropy-regularized exploratory control problem with stochastic policies to capture the exploration–exploitation balance essential for RL. Unlike the pure diffusion case initially studied by Wang et al. (2020), the derivation of the exploratory dynamics under jump-diffusions calls for a careful formulation of the jump part. Through a theoretical analysis, we find that one can simply use the same policy evaluation and $q$-learning algorithms in Jia and Zhou (2022a), Jia and Zhou (2023), originally developed for controlled diffusions, without needing to check a priori whether the underlying data come from a pure diffusion or a jump-diffusion. However, we show that the presence of jumps ought to affect parameterizations of actors and critics in general. We investigate as an application the mean–variance portfolio selection problem with stock price modelled as a jump-diffusion, and show that both RL algorithms and parameterizations are invariant with respect to jumps. Finally, we present a detailed study on applying the general theory to option hedging.

Keywords. Reinforcement learning, continuous time, jump-diffusions, exploratory formulation, well-posedness, Hamiltonian, martingale, $q$-learning.

1 Introduction

Recently there has been an upsurge of interest in continuous-time reinforcement learning (RL) with continuous state spaces and possibly continuous action spaces. Continuous RL problems are important because: 1) many if not most practical problems are naturally continuous in time (and in space), such as autonomous driving, robot navigation, video game play and high-frequency trading; 2) while one can discretize time upfront and turn a continuous-time problem into a discrete-time MDP, it has been known, and indeed shown experimentally (e.g., Munos 2006, Tallec et al. 2019, and Park et al. 2021), that this approach is very sensitive to time discretization and performs poorly with small time steps; 3) there are more analytical tools available in the continuous setting that enable a rigorous and thorough analysis leading to interpretable (instead of black-box) and general (instead of ad hoc) RL algorithms.

Compared with the vast literature of RL for MDPs, continuous-time RL research is still in its infancy with the latest study focusing on establishing a rigorous mathematical theory and devising resulting RL algorithms. This strand of research starts with Wang et al. (2020) that introduces a mathematical formulation to capture the essence of RL – the exploration–exploitation tradeoff – in the continuous setting, followed by a “trilogy” (Jia and Zhou 2022a, b, Jia and Zhou 2023) that develops intertwining theories on policy evaluation, policy gradient and $q$-learning respectively. The common underpinning of the entire theory is the martingale property of certain stochastic processes, the enforcement of which naturally leads to various temporal difference algorithms to train and learn $q$-functions, value functions and optimal (stochastic) policies. The research is characterized by carrying out all the analysis in the continuous setting, and discretizing time only at the final, implementation stage for approximating the integrated rewards and the temporal difference. The theory has been adapted and extended in different directions; see e.g. Reisinger and Zhang (2021), Guo et al. (2022), Dai et al. (2023), as well as employed for applications; see e.g. Wang and Zhou (2020), Huang et al. (2022), Gao et al. (2022), Wang et al. (2023), and Wu and Li (2024).

The study so far has been predominantly on pure diffusion processes, namely the state processes are governed by controlled stochastic differential equations (SDEs) with a drift part and a diffusion one. While it is reasonable to model the underlying data generating processes as diffusions within a short period of time, sudden and drastic changes can and do happen over time. An example is a stock price process: while it is approximately a diffusion over a sufficiently short period, it may respond dramatically to a surprisingly good or bad earnings report. Other examples include neuron dynamics (Giraudo and Sacerdote 1997), stochastic resonance (Gammaitoni et al. 1998) and climate data (Goswami et al. 2018). It is therefore natural and necessary to extend the continuous RL theory and algorithms to the case when jumps are present. This is particularly important for decision making in financial markets, where it has been well recognized that using jumps to capture large sudden movements provides a more realistic way to model market dynamics; see the discussions in Chapter 1 of Cont and Tankov (2004). The financial modeling literature with jumps dates back to the seminal work of Merton (1976), who extends the classical Black–Scholes model by introducing a compound Poisson process with normally distributed jumps in the log returns. Since then alternative jump size distributions have been proposed in e.g. Kou (2002) and Cai and Kou (2011). Empirical success of jump-diffusion models has been documented for many asset classes; see Bates (1991), Andersen et al. (2002), and Aït-Sahalia et al. (2012) for stocks and stock indices, Bates (1996) for exchange rates, Das (2002) for interest rates, and Li and Linetsky (2014) for commodities, among many others.

This paper makes two major contributions. The first is mathematical in terms of setting up the suitable exploratory formulation and proving the well-posedness of the resulting exploratory SDE, which form the foundation of the RL theory for jump-diffusions. Wang et al. (2020) apply the classical stochastic relaxed control to model the exploration or randomization prevalent in RL, and derive an exploratory state equation that dictates the dynamics of the “average” of infinitely many state processes generated by repeatedly sampling from the same exploratory, stochastic policy. The drift and variance coefficients of the exploratory SDE are the means of those coefficients against the given stochastic policy (which is a probability distribution) respectively. The derivation therein is based on a law of large numbers argument applied to the first two moments of the diffusion process. That argument fails for jump-diffusions, which are not uniquely determined by the first two moments. We overcome this difficulty by analyzing instead the infinitesimal generator of the sample state process, from which we identify the dynamics of the exploratory state process. Inspired by Kushner (2000) who studies relaxed control for jump-diffusions, we formulate the exploratory SDE by extending the original Poisson random measures for jumps to capture the effect of random exploration. It should be noted that, like almost all the earlier works on relaxed control, Kushner (2000) is motivated by answering the theoretical question of whether an optimal control exists, as randomization convexifies the universe of control strategies. In comparison, our formulation is guided by the practical motivation of exploration for learning. There is also another subtle but important difference. We consider stochastic feedback policies while Kushner (2000) does not. This in turn creates technical issues in studying the well-posedness of the exploratory SDE in our framework.

The second main contribution is several implications regarding the impact of jumps on RL algorithm design. Thanks to the established exploratory formulation, we can define the Hamiltonian that, compared with the pure diffusion counterpart, has to include an additional term corresponding to the jumps. The resulting HJB equation – called the exploratory HJB – is now a partial integro-differential equation (PIDE) instead of a PDE due to that additional term. However, when expressed in terms of the Hamiltonian, the exploratory HJB equation has exactly the same form as that in the diffusion case. This leads to several completely identical statements of important results, including the optimality of the Gibbs exploration, the definition of a $q$-function, and martingale characterizations of value functions and $q$-functions. Here by “identical” we mean in terms of the Hamiltonian; in other words, these statements differ between diffusions and jump-diffusions entirely because the Hamiltonian is defined differently (which also causes some differences in the proofs of the results concerned). Most important of all, in the resulting RL algorithms, the Hamiltonian (or equivalently the $q$-function) can be computed using temporal difference of the value function by virtue of the Itô lemma; as a result the algorithms are completely identical no matter whether or not there are jumps. This has a significant practical implication: we can just use the same RL algorithms without needing to check in advance whether the underlying data come from a pure diffusion or a jump-diffusion. It is significant for the following reason. In practice, data are always observed or sampled at discrete times, no matter how frequently they arrive. Thus we encounter successive discontinuities along the sample trajectory even when the data actually come from a diffusion process. There are some criteria that can be used to check whether the underlying process is a diffusion or a jump-diffusion, e.g. Aït-Sahalia et al. (2012), Wang and Zheng (2022). But these methods typically require data with very high frequency to be effective, which may not always be available. In addition, noises must be properly handled for them to work.

Even though we can apply the same RL algorithms irrespective of the presence of jumps, the parametrization of the policy and value function may still depend on it, if we try to exploit certain special structure of the problem instead of using general neural networks for parameterization. Indeed, we give an example in which the optimal exploratory policy is Gaussian when there are no jumps, whereas an optimal policy either does not exist or becomes non-Gaussian when there are jumps. However, in the mean–variance portfolio selection we present as a concrete application, the optimal Gibbs exploration measure again reduces to Gaussian and the value function is quadratic as in Wang and Zhou (2020), both owing to the inherent linear–quadratic (LQ) structure of the problem. Hence in this particular case jumps do not even affect the parametrization of the policy and value function/$q$-function for learning.

We also consider mean–variance hedging of options as another application. This is a non-LQ problem and hence more difficult to solve than mean–variance portfolio selection. The mean–variance hedging problem has been studied in various early works, such as Schweizer (1996, 2001) and Lim (2005). Here, we introduce the entropy-regularized mean–variance hedging objective for an asset following a jump-diffusion and derive analytical representations for the optimal stochastic policy, which is again Gaussian, as well as the optimal value function. We use these representations to devise an actor–critic algorithm to learn the optimal hedging policy from data.

We compare our work with four recent related papers. (1) Bender and Thuan (2023) consider the continuous-time mean–variance portfolio selection problem with exploration under a jump-diffusion setting. Our paper differs from theirs in several aspects. First, they consider a specific application problem while we study RL for general controlled jump-diffusions. Second, they obtain the SDE for the exploratory state by taking the limit of the discrete-time exploratory dynamics, whereas our approach first derives the form of the infinitesimal generator of the sample state process and then infers the exploratory SDE from it. It is unclear how their approach works when dealing with general control problems. Finally, they do not consider how to develop RL algorithms based on their solution of the exploratory mean–variance portfolio selection, which we do in this paper. (2) Guo et al. (2023) consider continuous-time RL for linear–convex models with jumps. The scope and motivation are different from ours: They focus on the Lipschitz stability of feedback controls for this special class of control problems where the diffusion and jump terms are not controlled, and propose a least-square model-based algorithm and obtain sublinear regret guarantees in the episodic setting. By contrast, we consider RL for general jump-diffusions and develop model-free algorithms without considering regret bounds. (3) Denkert et al. (2024) aim to unify certain types of stochastic control problems by considering the so-called randomized control formulation which leads to the same optimal value functions as those of the original problems. They develop a policy gradient representation and actor–critic algorithms for RL. The randomized control formulation is fundamentally different from the framework we are considering: therein the control is applied at a set of random time points generated by a random point process instead of at every time point as in our framework. (4) Bo et al. (2024) develop $q$-learning for jump-diffusions by using Tsallis’ entropy for regularization instead of Shannon’s entropy considered in our paper and in Jia and Zhou (2022b), Jia and Zhou (2023). (The paper Bo et al. (2024) came to our attention after a previous version of our paper was completed and posted.) While this entropy presents an interesting alternative for developing RL algorithms, it may make the exploratory control problem less tractable to solve and lead to policy distributions that are inefficient to sample for exploration in certain applications.

The remainder of the paper is organized as follows. In Section 2, we discuss the problem formulation. In Section 3, we present the theory of $q$-learning for jump-diffusions, followed by the discussion of $q$-learning algorithms in Section 4. In Section 5, we apply the general theory and algorithms to a mean–variance portfolio selection problem, and discuss the impact of jumps. Section 6 presents the application to a mean–variance option hedging problem. Finally, Section 7 concludes. All the proofs are given in an appendix.

2 Problem Formulation and Preliminaries

For readers’ convenience, we first recall some basic concepts for one-dimensional (1D) Lévy processes, which can be found in standard references such as Sato (1999) and Applebaum (2009). A 1D process $L=\{L_{t}:t\geq 0\}$ is a Lévy process if it is continuous in probability, has stationary and independent increments, and $L_{0}=0$ almost surely. Denote the jump of $L$ at time $t$ by $\Delta L_{t}=L_{t}-L_{t-}$, and let $\boldsymbol{\text{B}}_{0}$ be the collection of Borel sets of $\mathbb{R}$ whose closure does not contain 0. The Poisson random measure (or jump measure) of $L$ is defined as

N(t,B)=\sum_{s:0<s\leq t}1_{B}(\Delta L_{s}),\quad B\in\boldsymbol{\text{B}}_{0}, \qquad (1)

which gives the number of jumps up to time $t$ with jump size in a Borel set $B$ away from 0. The Lévy measure $\nu$ of $L$ is defined by $\nu(B)=\mathbb{E}[N(1,B)]$ for $B\in\boldsymbol{\text{B}}_{0}$, which shows the expected number of jumps in $B$ in unit time, and $\nu(B)$ is finite. For any $B\in\boldsymbol{\text{B}}_{0}$, $\{N(t,B):t\geq 0\}$ is a Poisson process with intensity given by $\nu(B)$. The differential forms of these two measures are written as $N(dt,dz)$ and $\nu(dz)$, respectively. If $\nu$ is absolutely continuous, we write $\nu(dz)=\nu(z)dz$ by using the same letter for the measure and its density function. The Lévy measure $\nu$ must satisfy the integrability condition

\int_{\mathbb{R}}\min\{z^{2},1\}\nu(dz)<\infty. \qquad (2)

However, it is not necessarily a finite measure on $\mathbb{R}\setminus\{0\}$, but it is always a $\sigma$-finite measure. The Lévy process $L$ is said to have jumps of finite activity (resp. infinite activity) if $\int_{\mathbb{R}}\nu(dz)<\infty$ (resp. $=\infty$). The number of jumps on any finite time interval is finite in the former case but infinite in the latter. For any Borel set $B$ with its closure including 0, $\nu(B)$ is finite in the finite activity case but infinite otherwise. Finally, the compensated Poisson random measure is defined as $\widetilde{N}(dt,dz)=N(dt,dz)-\nu(dz)dt$. For any $B\in\boldsymbol{\text{B}}_{0}$, the process $\{\widetilde{N}(t,B):t\geq 0\}$ is a martingale.
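To make these objects concrete, here is a minimal simulation sketch (ours, not from the paper) for a finite-activity Lévy process, namely a compound Poisson process with intensity $\lambda$ and standard normal jump sizes; it checks empirically that $\mathbb{E}[N(1,B)]$ matches $\nu(B)=\lambda\,\mathbb{P}(Z\in B)$ for $B=[0.5,\infty)$. All numerical choices below are illustrative assumptions.

```python
# Sketch (ours): verify E[N(1,B)] = nu(B) for a compound Poisson process with
# jump intensity lam and N(0,1) jump marks, taking B = [0.5, infinity).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
lam, T, n_paths = 2.0, 1.0, 100_000
B_lower = 0.5

counts = np.empty(n_paths)
for i in range(n_paths):
    n_jumps = rng.poisson(lam * T)              # number of jumps on [0, T]
    jump_sizes = rng.standard_normal(n_jumps)   # the marks Y_i
    counts[i] = np.sum(jump_sizes >= B_lower)   # N(T, B) for this path

empirical = counts.mean()                       # Monte Carlo estimate of E[N(1, B)]
theoretical = lam * (1.0 - norm.cdf(B_lower))   # nu(B) = lam * P(Z in B)
print(f"E[N(1,B)] ~ {empirical:.4f},  nu(B) = {theoretical:.4f}")
```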

Throughout the paper, we use the following notations. By convention, all vectors are column vectors unless otherwise specified. We use $\mathbb{R}^{k}$ and $\mathbb{R}^{k\times\ell}$ to denote the space of $k$-dimensional vectors and $k\times\ell$ matrices, respectively. For a matrix $A$, we use $A^{\top}$ for its transpose, $|A|$ for its Euclidean/Frobenius norm, and write $A^{2}\coloneqq AA^{\top}$. Given two matrices $A$ and $B$ of the same size, we denote by $A\circ B$ the inner product between $A$ and $B$, which is given by $\text{tr}(AB^{\top})$. For a positive semidefinite matrix $A$, we write $\sqrt{A}=UD^{1/2}V^{\top}$, where $A=UDV^{\top}$ is its singular value decomposition with $U,V$ two orthogonal matrices and $D$ a diagonal matrix, and $D^{1/2}$ is the diagonal matrix whose entries are the square roots of those of $D$. We use $f=f(\cdot)$ to denote the function $f$, and $f(x)$ to denote the value of $f$ at $x$. We use both $f_{x},f_{xx}$ and $\frac{\partial f}{\partial x},\frac{\partial^{2}f}{\partial x^{2}}$ for the first and second (partial) derivatives of a function $f$ with respect to $x$. We write the minimum of two values $a$ and $b$ as $a\wedge b$. The notation $\mathcal{U}(B)$ denotes the uniform distribution over a set $B$, while $\mathcal{N}(\mu,\Sigma)$ refers to the Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
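As a quick numerical illustration of these conventions (ours, not part of the paper), the following snippet computes $A\circ B=\text{tr}(AB^{\top})$ and $\sqrt{A}$ via the SVD for a randomly generated positive semidefinite matrix.

```python
# Sketch (ours): the matrix conventions A o B = tr(A B^T) and sqrt(A) = U D^{1/2} V^T.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); A = A @ A.T   # a positive semidefinite matrix
B = rng.standard_normal((3, 3))

inner = np.trace(A @ B.T)                      # A o B
U, d, Vt = np.linalg.svd(A)                    # A = U diag(d) V^T
sqrtA = U @ np.diag(np.sqrt(d)) @ Vt           # U D^{1/2} V^T
print(inner, np.allclose(sqrtA @ sqrtA.T, A))  # second value is True: sqrt(A) sqrt(A)^T = A
```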

2.1 Classical stochastic control of jump-diffusions

Consider a filtered probability space $(\Omega,\mathcal{F},\{\mathcal{F}_{t}\},\mathbb{P})$ satisfying the usual hypothesis. Assume that this space is rich enough to support $W=\{W_{t}:t\geq 0\}$, a standard Brownian motion in $\mathbb{R}^{m}$, and $\ell$ independent one-dimensional (1D) Lévy processes $L_{1},\cdots,L_{\ell}$, which are also independent of $W$. Let $N(dt,dz)=(N_{1}(dt,dz_{1}),\cdots,N_{\ell}(dt,dz_{\ell}))^{\top}$ be the vector of their Poisson random measures, and similarly define $\nu(dz)$ and $\widetilde{N}(dt,dz)$. The controlled system dynamics are governed by the following Lévy SDE (Øksendal and Sulem 2007, Chapter 3):

dX_{s}^{a}=b(s,X_{s-}^{a},a_{s})ds+\sigma(s,X_{s-}^{a},a_{s})dW_{s}+\int_{\mathbb{R}^{\ell}}\gamma(s,X_{s-}^{a},a_{s},z)\widetilde{N}(ds,dz),\quad s\in[0,T], \qquad (3)

where

b:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R}^{d},\quad \sigma:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R}^{d\times m},\quad \gamma:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\times\mathbb{R}^{\ell}\rightarrow\mathbb{R}^{d\times\ell}, \qquad (4)

$a_{s}$ is the control or action at time $s$, $\mathcal{A}\subseteq\mathbb{R}^{n}$ is the control space, and $a=\{a_{s}:s\in[0,T]\}$ is the control process, assumed to be predictable with respect to $\{\mathcal{F}_{s}:s\in[0,T]\}$. We denote the $k$-th column of the matrix $\gamma$ by $\gamma_{k}$. The goal of stochastic control is, for each initial time-state pair $(t,x)$ of (3), to find the optimal control process $a$ that maximizes the expected total reward:

\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}r(s,X_{s}^{a},a_{s})ds+e^{-\beta(T-t)}h(X_{T}^{a})\Big{|}X_{t}^{a}=x\right], \qquad (5)

where $\beta\geq 0$ is a discount factor that measures the time value of the payoff.
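As a concrete (and entirely illustrative) sketch of how (3) and (5) can be simulated, the following Python code runs an Euler-type scheme for a toy one-dimensional controlled jump-diffusion under a constant action and estimates the expected total reward by Monte Carlo. The particular choices of $b$, $\sigma$, $\gamma$, $r$, $h$, the compound Poisson jumps, and the parameter values are our own assumptions, not the paper's specification.

```python
# Sketch (ours): Euler scheme for a toy 1D controlled jump-diffusion and a Monte
# Carlo estimate of the objective (5) under a constant action a.  Jumps are
# compound Poisson with intensity lam and N(0, 0.1^2) marks; since gamma below is
# linear in z and the marks have zero mean, the compensator term vanishes here.
import numpy as np

rng = np.random.default_rng(2)
T, n_steps, n_paths = 1.0, 200, 20_000
dt = T / n_steps
beta, lam, a = 0.1, 3.0, 0.5                   # discount, jump intensity, action

b     = lambda t, x, a: 0.05 * x * a           # drift coefficient
sigma = lambda t, x, a: 0.2 * x * a            # diffusion coefficient
gamma = lambda t, x, a, z: a * x * z           # jump coefficient (linear in z)
r     = lambda t, x, a: -0.5 * a ** 2          # running reward
h     = lambda x: np.log(np.maximum(x, 1e-12)) # terminal reward

x = np.ones(n_paths)
reward = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    reward += np.exp(-beta * t) * r(t, x, a) * dt
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    n_jumps = rng.poisson(lam * dt, n_paths)                       # jump counts on [t, t+dt]
    z_sum = 0.1 * np.sqrt(n_jumps) * rng.standard_normal(n_paths)  # sum of the Gaussian marks
    x = x + b(t, x, a) * dt + sigma(t, x, a) * dW + gamma(t, x, a, z_sum)
reward += np.exp(-beta * T) * h(x)
print("Monte Carlo estimate of the objective (5):", reward.mean())
```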

The stochastic control problem (3)–(5) is very general; in particular, control processes affect the drift, diffusion and jump coefficients. We now make the following assumption to ensure well-posedness of the problem. Define $\mathbb{R}^{d}_{K}\coloneqq\{x\in\mathbb{R}^{d}:|x|\leq K\}$.

Assumption 1.

Suppose the following conditions are satisfied by the state dynamics and reward functions:

  1. (i)

    $b,\sigma,\gamma,r,h$ are all continuous functions in their respective arguments;

  2. (ii)

    (local Lipschitz continuity) for any $K>0$ and any $p\geq 2$, there exist positive constants $C_{K}$ and $C_{K,p}$ such that $\forall(t,a)\in[0,T]\times\mathcal{A}$, $(x,x^{\prime})\in\mathbb{R}^{d}_{K}$,

    |b(t,x,a)-b(t,x^{\prime},a)|^{2}+|\sigma(t,x,a)-\sigma(t,x^{\prime},a)|^{2}\leq C_{K}|x-x^{\prime}|^{2}, \qquad (6)
    \sum_{k=1}^{\ell}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z_{k})-\gamma_{k}(t,x^{\prime},a,z_{k})|^{p}\nu_{k}(dz)\leq C_{K,p}|x-x^{\prime}|^{p}; \qquad (7)

  3. (iii)

    (linear growth in $x$) for any $p\geq 1$, there exist positive constants $C$ and $C_{p}$ such that $\forall(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}$,

    |b(t,x,a)|^{2}+|\sigma(t,x,a)|^{2}\leq C(1+|x|^{2}), \qquad (8)
    \sum_{k=1}^{\ell}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{p}\nu_{k}(dz)\leq C_{p}(1+|x|^{p}); \qquad (9)

  4. (iv)

    there exists a constant $C>0$ such that

    |r(t,x,a)|\leq C\left(1+|x|^{p}+|a|^{q}\right),\ |h(x)|\leq C\left(1+|x|^{p}\right),\quad \forall(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A} \qquad (10)

    for some $p\geq 2$ and some $q\geq 1$; moreover, $\mathbb{E}[\int_{0}^{T}|a_{s}|^{q}ds]$ is finite.

Conditions (i)-(iii) guarantee the existence of a unique strong solution to the Lévy SDE (3) with initial condition $X_{t}^{a}=x\in\mathbb{R}^{d}$. Furthermore, for any $p\geq 2$, there exists a constant $C_{p}>0$ such that

\mathbb{E}_{t,x}\left[\sup_{t\leq s\leq T}|X_{s}^{a}|^{p}\right]\leq C_{p}(1+|x|^{p}); \qquad (11)

see (Kunita 2004, Theorem 3.2) and (Situ 2006, Theorem 119). With the moment estimate (11), it follows that condition (iv) implies that the expected value in (5) is finite.

Let $\mathcal{L}^{a}$ be the infinitesimal generator associated with the Lévy SDE (3). Under condition (iii), we have $\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|\nu_{k}(dz)<\infty$ for $k=1,\cdots,\ell$. Thus, we can write $\mathcal{L}^{a}$ in the following form:

\mathcal{L}^{a}f(t,x)=\frac{\partial f}{\partial t}(t,x)+b(t,x,a)\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\sigma^{2}(t,x,a)\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz), \qquad (12)

where $\frac{\partial f}{\partial x}\in\mathbb{R}^{d}$ is the gradient and $\frac{\partial^{2}f}{\partial x^{2}}\in\mathbb{R}^{d\times d}$ is the Hessian matrix.

We recall Itô’s formula, which will be frequently used in our analysis; see e.g. (Øksendal and Sulem 2007, Theorem 1.16). Let $X^{a}$ be the unique strong solution to (3). For any $f\in C^{1,2}(\mathbb{R}^{+}\times\mathbb{R}^{d})$, we have

df(t,X^{a}_{t})=\frac{\partial f}{\partial t}(t,X^{a}_{t-})dt+b(t,X^{a}_{t-},a_{t})\circ\frac{\partial f}{\partial x}(t,X^{a}_{t-})dt+\frac{1}{2}\sigma^{2}(t,X^{a}_{t-},a_{t})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,X^{a}_{t-})dt \qquad (13)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,X^{a}_{t-}+\gamma_{k}(t,X^{a}_{t-},a_{t},z))-f(t,X^{a}_{t-})-\gamma_{k}(t,X^{a}_{t-},a_{t},z)\circ\frac{\partial f}{\partial x}(t,X^{a}_{t-})\right)\nu_{k}(dz)dt \qquad (14)
+\frac{\partial f}{\partial x}(t,X^{a}_{t-})\circ\sigma(t,X^{a}_{t-},a_{t})dW_{t}+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,X^{a}_{t-}+\gamma_{k}(t,X^{a}_{t-},a_{t},z))-f(t,X^{a}_{t-})\right)\widetilde{N}_{k}(dt,dz). \qquad (15)

It is known that the Hamilton–Jacobi–Bellman (HJB) equation for the control problem (3)–(5) is given by

\sup_{a\in\mathcal{A}}\{r(t,x,a)+\mathcal{L}^{a}V(t,x)\}-\beta V(t,x)=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d}, \qquad (16)
V(T,x)=h(x),

where $\mathcal{L}^{a}$ is given in (12). Under proper conditions, the solution to the above equation is the optimal value function $V^{*}$ for the control problem (5). Moreover, the following function, which maps a time-state pair to an action:

a^{*}(t,x)=\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\{r(t,x,a)+\mathcal{L}^{a}V^{*}(t,x)\right\} \qquad (17)

is the optimal feedback control policy of the problem.

Given a smooth function $V(t,x)\in C^{1,2}([0,T]\times\mathbb{R}^{d})$, we define the Hamiltonian $H$ by

H(t,x,a,V_{x},V_{xx},V)=r(t,x,a)+b(t,x,a)\circ V_{x}(t,x)+\frac{1}{2}\sigma^{2}(t,x,a)\circ V_{xx}(t,x) \qquad (18)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(V(t,x+\gamma_{k}(t,x,a,z))-V(t,x)-\gamma_{k}(t,x,a,z)\circ V_{x}(t,x)\right)\nu_{k}(dz). \qquad (19)

Then, the HJB equation (16) can be recast as

\frac{\partial V(t,x)}{\partial t}+\sup_{a\in\mathcal{A}}\{H(t,x,a,V_{x},V_{xx},V)\}-\beta V(t,x)=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d}, \qquad (20)
V(T,x)=h(x).
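For intuition, here is a small numerical sketch (ours, under illustrative modelling assumptions) of evaluating the Hamiltonian (18)–(19) at a given $(t,x,a)$ for a one-dimensional state and a single finite-activity jump component, with the Lévy integral computed by quadrature; the test value function $V$ and all model functions below are hypothetical.

```python
# Sketch (ours): evaluate the Hamiltonian (18)-(19) numerically for a toy 1D model
# with Levy density nu(z) = lam * phi(z), phi being the standard normal pdf.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

lam = 3.0
nu = lambda z: lam * norm.pdf(z)               # Levy density (finite activity)

b     = lambda t, x, a: 0.05 * x * a
sigma = lambda t, x, a: 0.2 * x * a
gamma = lambda t, x, a, z: a * x * z
r     = lambda t, x, a: -0.5 * a ** 2

V   = lambda t, x: np.exp(-0.1 * (1.0 - t)) * x ** 2   # a hypothetical test value function
Vx  = lambda t, x: np.exp(-0.1 * (1.0 - t)) * 2 * x
Vxx = lambda t, x: np.exp(-0.1 * (1.0 - t)) * 2.0

def hamiltonian(t, x, a):
    # local (drift + diffusion + reward) part of (18)
    local = r(t, x, a) + b(t, x, a) * Vx(t, x) + 0.5 * sigma(t, x, a) ** 2 * Vxx(t, x)
    # nonlocal jump part of (19), integrated against the Levy density
    integrand = lambda z: (V(t, x + gamma(t, x, a, z)) - V(t, x)
                           - gamma(t, x, a, z) * Vx(t, x)) * nu(z)
    jump_term, _ = quad(integrand, -8.0, 8.0)  # truncate the Levy integral
    return local + jump_term

print(hamiltonian(0.5, 1.0, 0.3))
```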

2.2 Randomized control and exploratory formulation

A key idea of RL is to explore the unknown environment by randomizing the actions. Let $\boldsymbol{\pi}:(t,x)\in[0,T]\times\mathbb{R}^{d}\rightarrow\boldsymbol{\pi}(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$ be a given stochastic feedback policy, where $\mathcal{P}(\mathcal{A})$ is the set of probability density functions defined on $\mathcal{A}$. Let ${\bf a}:(t,x)\in[0,T]\times\mathbb{R}^{d}\rightarrow{\bf a}(t,x)\in\mathcal{A}$ be sampled from $\boldsymbol{\pi}$ (i.e. ${\bf a}$ is a copy of $\boldsymbol{\pi}$), which is a deterministic feedback policy. Applying this policy to (3), we get for $s\in[0,T]$,

dX_{s}^{\bf a}=b(s,X_{s-}^{\bf a},{\bf a}(s,X_{s-}^{\bf a}))ds+\sigma(s,X_{s-}^{\bf a},{\bf a}(s,X_{s-}^{\bf a}))dW_{s}+\int_{\mathbb{R}^{\ell}}\gamma(s,X_{s-}^{\bf a},{\bf a}(s,X_{s-}^{\bf a}),z)\widetilde{N}(ds,dz). \qquad (21)

Assuming the solution to the above SDE exists uniquely, we say the action process $a^{\boldsymbol{\pi}}=\{a^{\boldsymbol{\pi}}_{s}={\bf a}(s,X_{s-}^{\bf a}):t\leq s\leq T\}$ is generated from $\boldsymbol{\pi}$. Note that $a^{\boldsymbol{\pi}}$ depends on the specific sample ${\bf a}\sim\boldsymbol{\pi}$, which we omit to write out for notational simplicity. In the following, we will also write $\boldsymbol{\pi}(\cdot|t,x)$ as $\boldsymbol{\pi}_{t,x}(\cdot)$.

We need to enlarge the original filtered probability space to include the additional randomness from sampling actions. Following Jia and Zhou (2022b), Jia and Zhou (2023), we assume that the probability space is rich enough to support independent copies of an $n$-dimensional random vector uniformly distributed over $[0,1]^{n}$, where $n$ is the dimension of the control space. These copies are also independent of $W$ and $L_{1},\cdots,L_{\ell}$. Let $\mathcal{G}_{s}$ be the new sigma-algebra generated by $\mathcal{F}_{s}$ and the copies of the uniform random vector up to time $s$. The new filtered probability space is $(\Omega,\mathcal{F},\{\mathcal{G}_{t}\},\bar{\mathbb{P}})$, where $\bar{\mathbb{P}}$ is the product extension from $\mathbb{P}$ and they coincide when restricted to $\mathcal{F}_{T}$.

Fix a stochastic feedback policy $\boldsymbol{\pi}$ and an initial time-state pair $(t,x)$. An action process $a^{\boldsymbol{\pi}}=\{a^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\}$ generated from $\boldsymbol{\pi}$ is a $\mathcal{G}_{s}$-progressively measurable process that is also predictable. Consider the sample state process $X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\}$ that follows the SDE

dX^{\boldsymbol{\pi}}_{s}=b(s,X^{\boldsymbol{\pi}}_{s-},a^{\boldsymbol{\pi}}_{s})ds+\sigma(s,X^{\boldsymbol{\pi}}_{s-},a^{\boldsymbol{\pi}}_{s})dW_{s}+\int_{\mathbb{R}^{\ell}}\gamma(s,X^{\boldsymbol{\pi}}_{s-},a^{\boldsymbol{\pi}}_{s},z)\widetilde{N}(ds,dz),\quad s\in[t,T]. \qquad (22)

Once again, bear in mind that the above equation depends on a specific sample ${\bf a}\sim\boldsymbol{\pi}$; so there are in fact infinitely many similar equations, each corresponding to a sample of $\boldsymbol{\pi}$.

To encourage exploration, we add an entropy regularizer to the running reward, leading to

J(t,x;\boldsymbol{\pi})=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-\beta(s-t)}\left(r(s,X_{s}^{\boldsymbol{\pi}},a_{s}^{\boldsymbol{\pi}})-\theta\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})\right)ds+e^{-\beta(T-t)}h(X_{T}^{\boldsymbol{\pi}})\right], \qquad (23)

where $\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}$ is the expectation conditioned on $X_{t}^{\boldsymbol{\pi}}=x$, taken with respect to the randomness in the Brownian motion, the Poisson random measures, and the action randomization. Here $\theta>0$ is the temperature parameter that controls the level of exploration. The function $J(\cdot,\cdot;\boldsymbol{\pi})$ is called the value function of the policy $\boldsymbol{\pi}$. The goal of RL is to find the policy that maximizes the value function among admissible policies that are to be specified in Definition 1 below.

For theoretical analysis, we consider the exploratory dynamics of $X^{\boldsymbol{\pi}}$, which represent the key averaged characteristics of the sample state process over infinitely many randomized actions. In the case of diffusions, Wang et al. (2020) derive such exploratory dynamics by applying a law of large numbers argument to the first two moments of the diffusion process. Their approach, however, cannot be applied to jump-diffusions. Here, we get around this by studying the infinitesimal generator of the sample state process, from which we identify the dynamics of the exploratory state process.

To this end, let $f\in C_{0}^{1,2}([0,T)\times\mathbb{R}^{d})$, which is continuously differentiable in $t$ and twice continuously differentiable in $x$ with compact support, and we need to analyze $\lim_{s\to 0}\frac{\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}[f(t+s,X_{t+s}^{\boldsymbol{\pi}})]-f(t,x)}{s}$. Fixing $(t,x)$, consider the SDE (22) starting from $X^{\boldsymbol{\pi}}_{t}=x$ with $N$ independent copies ${\bf a}_{1},\cdots,{\bf a}_{N}$ of $\boldsymbol{\pi}$. Let $s>0$ be very small and assume the corresponding actions $a_{1},\cdots,a_{N}$ are fixed from $t$ to $t+s$. Denote by $X_{t+s}^{a_{i}}$ the value of the state process corresponding to ${\bf a}_{i}$ at $t+s$. Then

\lim_{s\to 0}\frac{\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}[f(t+s,X_{t+s}^{\boldsymbol{\pi}})]-f(t,x)}{s} \qquad (24)
=\lim_{s\to 0}\frac{\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{t,x}^{\mathbb{P}}[f(t+s,X_{t+s}^{a_{i}})]-f(t,x)}{s} \qquad (25)
=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\lim_{s\to 0}\frac{\mathbb{E}_{t,x}^{\mathbb{P}}[f(t+s,X_{t+s}^{a_{i}})]-f(t,x)}{s} \qquad (26)
=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\partial f}{\partial t}(t,x)+b(t,x,a_{i})\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\sigma^{2}(t,x,a_{i})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x)\right) \qquad (27)
+\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a_{i},z))-f(t,x)-\gamma_{k}(t,x,a_{i},z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz). \qquad (28)

Using the law of large numbers, we obtain

(27)=\frac{\partial f}{\partial t}(t,x)+\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\tilde{\sigma}^{2}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x), \qquad (29)

where

\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})\coloneqq\int_{\mathcal{A}}b(t,x,a)\boldsymbol{\pi}(a|t,x)da,\quad\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})\coloneqq\left(\int_{\mathcal{A}}\sigma^{2}(t,x,a)\boldsymbol{\pi}(a|t,x)da\right)^{1/2}. \qquad (30)

These “exploratory” drift and diffusion coefficients are consistent with those in Wang et al. (2020). It is tempting to think that the exploratory jump coefficient $\tilde{\gamma}$ is similarly the average of $\gamma$ with respect to $\boldsymbol{\pi}$; but unfortunately this is generally not true. This in turn is one of the main distinctive features in studying RL for jump-diffusions.
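As a quick illustration of (30) (ours; the toy coefficients $b$ and $\sigma$ below are assumptions made only for this example), the exploratory drift and diffusion coefficients under a Gaussian policy can be approximated by averaging over sampled actions:

```python
# Sketch (ours): Monte Carlo estimates of the exploratory coefficients (30)
# under a Gaussian policy pi(.|t,x) = N(mu, std^2), for toy b and sigma.
import numpy as np

rng = np.random.default_rng(3)
b     = lambda t, x, a: 0.05 * x * a
sigma = lambda t, x, a: 0.2 * x * a

def exploratory_coefficients(t, x, mu, std, n_samples=100_000):
    """Estimate b_tilde(t,x,pi) and sigma_tilde(t,x,pi) by averaging over actions."""
    a = rng.normal(mu, std, n_samples)             # actions sampled from the policy
    b_tilde = b(t, x, a).mean()                    # average of b
    sigma_tilde = np.sqrt((sigma(t, x, a) ** 2).mean())  # root of the averaged sigma^2
    return b_tilde, sigma_tilde

print(exploratory_coefficients(0.5, 1.0, mu=0.3, std=0.2))
```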

We approach the problem by analyzing the integrals in (28). Using the second-order Taylor expansion, the boundedness of $\frac{\partial^{2}f}{\partial x^{2}}(t,x)$ for $x\in\mathbb{R}^{d}$ and condition (iii) of Assumption 1, we obtain that for fixed $(t,x)$ and each $k$,

\left|\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)\right| \qquad (31)
\leq C\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{2}\nu_{k}(dz)\leq C(1+|x|^{2}) \qquad (32)

for some constant $C>0$, which is independent of $a$. It follows that

(28)=\sum_{k=1}^{\ell}\int_{\mathcal{A}}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)\boldsymbol{\pi}(a|t,x)da. \qquad (33)

Combining (29) and (33), the infinitesimal generator $\mathcal{L}^{\boldsymbol{\pi}}$ of the sample state process is given by the probability weighted average of the generator $\mathcal{L}^{a}$ of the classical controlled jump-diffusion, i.e.,

\mathcal{L}^{\boldsymbol{\pi}}f(t,x)=\int_{\mathcal{A}}\mathcal{L}^{a}f(t,x)\boldsymbol{\pi}(a|t,x)da. \qquad (34)

Next, we reformulate the integrals in (33) to convert them to the same form as in (12), from which we can infer the SDE for the exploratory state process.

Recall that the Poisson random measure $N_{k}(dt,dz)$ with intensity measure $dt\nu_{k}(dz)$ ($k=1,\cdots,\ell$) is defined over the product space $[0,T]\times\mathbb{R}$. We can also interpret $N_{k}$ as a counting measure associated with a random configuration of points $(T_{i},Y_{i})\in[0,T]\times\mathbb{R}$ (Cont and Tankov 2004, Section 2.6.3), i.e.,

N_{k}=\sum_{i\geq 1}\delta_{(T_{i},Y_{i})}, \qquad (35)

where $\delta_{x}$ is the Dirac measure with mass one at point $x$, $T_{i}$ is the arrival time of the $i$th jump of the Lévy process $L_{k}$, and $Y_{i}=L_{k}(T_{i})-L_{k}(T_{i}-)$ is the size of this jump. We can interpret $Y_{i}$ as the mark of the $i$th event.

At $T_{i}$, the size of the jump in the controlled state $X$ under policy $\boldsymbol{\pi}$ is given by $\gamma(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}},a_{T_{i}}^{\boldsymbol{\pi}},Y_{i})$, where $X_{T_{i}-}^{\boldsymbol{\pi}}$ is the state right before the jump occurs and $a_{T_{i}}^{\boldsymbol{\pi}}$ is the action generated from the feedback policy $\boldsymbol{\pi}$. When the policy $\boldsymbol{\pi}$ is deterministic, the generated action is determined by $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}})$ and thus the size of the jump in $X^{\boldsymbol{\pi}}$ is a function of $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}},Y_{i})$. By contrast, when $\boldsymbol{\pi}$ becomes stochastic, an additional random noise is introduced at $T_{i}$ that determines the generated action together with $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}})$. Consequently, the size of the jump in $X^{\boldsymbol{\pi}}$ is a function of $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}},Y_{i})$ plus the random noise for exploration at $T_{i}$.

This motivates us to construct new Poisson random measures on an extended space to capture the effect of random noise on jumps for stochastic policies. Specifically, for each $k=1,\cdots,\ell$, we construct a new Poisson random measure, denoted by $N^{\prime}_{k}(dt,dz,du)$, on the product space $[0,T]\times\mathbb{R}\times[0,1]^{n}$, with its intensity measure given by $dt\nu_{k}(dz)du$. Here, $u$ is the realized value of the $n$-dimensional random vector $U$ that follows $\mathcal{U}([0,1]^{n})$, which is the random noise introduced in the probability space for exploration. The new Poisson random measure $N^{\prime}_{k}$ is also a counting measure associated with a random configuration of points $(T_{i},Y_{i},U_{i})\in[0,T]\times\mathbb{R}\times[0,1]^{n}$:

N^{\prime}_{k}=\sum_{i\geq 1}\delta_{(T_{i},Y_{i},U_{i})}, \qquad (36)

where $T_{i}$ and $Y_{i}$ are the same as above, and $U_{i}$ is the $\mathcal{U}([0,1]^{n})$ random vector that generates random exploration at $T_{i}$. Hence, the $i$th event is marked by both $Y_{i}$ and $U_{i}$ under $N^{\prime}_{k}$. We let $N^{\prime}(dt,dz,du)=(N^{\prime}_{1}(dt,dz_{1},du),\cdots,N^{\prime}_{\ell}(dt,dz_{\ell},du))^{\top}$.

In general, for any $n$-dimensional random vector $\xi$ that follows distribution $\boldsymbol{\pi}$, we can find a measurable function $G^{\boldsymbol{\pi}}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ such that $\xi=G^{\boldsymbol{\pi}}(U)$, where $U\sim\mathcal{U}([0,1]^{n})$. As an example, consider $\xi\sim\mathcal{N}(\mu,AA^{\top})$. We can represent it as $\xi=\mu+A\Phi^{-1}(U)$, where $\Phi$ is the cumulative distribution function of the univariate standard normal distribution and $\Phi^{-1}(U)$ is a vector obtained by applying $\Phi^{-1}$ to each component of $U$.
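The following short sketch (ours; the specific $\mu$ and $A$ are arbitrary choices for illustration) implements this map $G^{\boldsymbol{\pi}}$ for a Gaussian policy and checks the sample mean and covariance:

```python
# Sketch (ours): G^pi for a Gaussian policy.  With U ~ Uniform([0,1]^n),
# the action xi = mu + A * Phi^{-1}(U) has law N(mu, A A^T).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu = np.array([0.1, -0.2])
A = np.array([[0.3, 0.0],
              [0.1, 0.2]])

U = rng.random((100_000, 2))            # i.i.d. Uniform([0,1]^2) samples
actions = mu + norm.ppf(U) @ A.T        # G^pi applied row-wise

print(actions.mean(axis=0))             # close to mu
print(np.cov(actions.T))                # close to A A^T
```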

For the stochastic feedback policy $\boldsymbol{\pi}_{t,x}$, using $a=G^{\boldsymbol{\pi}_{t,x}}(u)$ we obtain

\int_{\mathcal{A}}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)\boldsymbol{\pi}(a|t,x)da \qquad (37)
=\int_{\mathbb{R}\times[0,1]^{n}}\left(f\left(t,x+\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right)-f(t,x)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)du. \qquad (38)

It follows that the infinitesimal generator of the sample state process can be written as

\mathcal{L}^{\boldsymbol{\pi}}f(t,x)=\frac{\partial f}{\partial t}(t,x)+\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\tilde{\sigma}^{2}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x) \qquad (39)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left(f\left(t,x+\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right)-f(t,x)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)du. \qquad (40)

Comparing (40) with (12) in terms of the integral part that characterizes the behavior of jumps, we observe that the new measure $\nu_{k}(dz)du$ replaces the Lévy measure $\nu_{k}(dz)$ and integration is done over an extended space to capture the effect of random exploration on jumps. The jump coefficient function that generates the jump size in the controlled state process $X$, given the Lévy jump with size $z$ and control variable $a$, is still the same. However, in (40) the control $a$ is generated from $u$ as $G^{\boldsymbol{\pi}_{t,x}}(u)$, where $u$ is the realized value of the random noise introduced for exploration. In the following, we will also write $G^{\boldsymbol{\pi}_{t,x}}(u)$ as $G^{\boldsymbol{\pi}}(t,x,u)$ whenever using the latter simplifies notations.

Based on (40), we see that the exploratory state should be the solution to the following Lévy SDE:

dX^{\boldsymbol{\pi}}_{s}=\tilde{b}(s,X^{\boldsymbol{\pi}}_{s-},\boldsymbol{\pi}(\cdot|s,X^{\boldsymbol{\pi}}_{s-}))ds+\tilde{\sigma}(s,X^{\boldsymbol{\pi}}_{s-},\boldsymbol{\pi}(\cdot|s,X^{\boldsymbol{\pi}}_{s-}))dW_{s} \qquad (41)
+\int_{\mathbb{R}\times[0,1]^{n}}\gamma\left(s,X^{\boldsymbol{\pi}}_{s-},G^{\boldsymbol{\pi}}(s,X^{\boldsymbol{\pi}}_{s-},u),z\right)\widetilde{N}^{\prime}(ds,dz,du),\quad X^{\boldsymbol{\pi}}_{t}=x,\ s\in[t,T], \qquad (42)

which we call the exploratory Lévy SDE. The solution process, if it exists, is denoted by $\tilde{X}^{\boldsymbol{\pi}}$ and called the exploratory (state) process. As we explain below, this process informs us of the behavior of the key characteristics of the sample state process after averaging over infinitely many actions sampled from the stochastic policy $\boldsymbol{\pi}$.

In general, the sample state process $X^{\boldsymbol{\pi}}$ defined by (22) is a semimartingale, as it is the sum of three processes: the drift process that has finite variation (the first term in (22)), the continuous (local) martingale driven by the Brownian motion (the second term in (22)), and the discontinuous (local) martingale driven by the compensated Poisson random measure (the third term in (22)). Any semimartingale is fully determined by three characteristics: the drift, the quadratic variation of the continuous local martingale, and the compensator of the random measure associated with the process’s jumps (the compensator gives the jump intensity); see Jacod and Shiryaev (2013) for detailed discussions of semimartingales and their characteristics.

For the sample state process, given that $X_{s-}^{\boldsymbol{\pi}}=x$ and the action sampled from $\boldsymbol{\pi}_{s,x}$ is $a\in\mathcal{A}$, the characteristics over an infinitesimally small time interval $[s,s+ds]$ are given by the triplet $\left(b(s,x,a)ds,\ \sigma^{2}(s,x,a)ds,\ \sum_{k=1}^{\ell}\gamma_{k}(s,x,a,z)\nu_{k}(dz)ds\right)$.

Now consider the exploratory state process $\tilde{X}^{\boldsymbol{\pi}}$, which is also a semimartingale by (42). Its characteristics over an infinitesimally small time interval $[s,s+ds]$ with $\tilde{X}_{s-}^{\boldsymbol{\pi}}=x$ are given by the triplet $\left(\tilde{b}(s,x,\boldsymbol{\pi}_{s,x})ds,\ \tilde{\sigma}^{2}(s,x,\boldsymbol{\pi}_{s,x})ds,\ \sum_{k=1}^{\ell}\int_{[0,1]^{n}}\gamma_{k}(s,x,G^{\boldsymbol{\pi}}(s,x,u),z)du\cdot\nu_{k}(dz)ds\right)$, where the third characteristic is obtained by calculating $\mathbb{E}\left[\sum_{k=1}^{\ell}\gamma_{k}\left(s,x,G^{\boldsymbol{\pi}}(s,x,u),z\right)N^{\prime}_{k}(ds,dz,du)\right]$ for Lévy jumps with size from $[z,z+dz]$. Using (30), we have

\tilde{b}(s,x,\boldsymbol{\pi}_{s,x})ds=\int_{\mathcal{A}}b(s,x,a)\boldsymbol{\pi}(a|s,x)da\,ds,\quad \tilde{\sigma}^{2}(s,x,\boldsymbol{\pi}_{s,x})ds=\int_{\mathcal{A}}\sigma^{2}(s,x,a)\boldsymbol{\pi}(a|s,x)da\,ds, \qquad (43)
\sum_{k=1}^{\ell}\int_{[0,1]^{n}}\gamma_{k}(s,x,G^{\boldsymbol{\pi}}(s,x,u),z)du\cdot\nu_{k}(dz)ds=\sum_{k=1}^{\ell}\int_{\mathcal{A}}\gamma_{k}\left(s,x,a,z\right)\nu_{k}(dz)ds\cdot\boldsymbol{\pi}(a|s,x)da. \qquad (44)

Thus, the semimartingale characteristics of the exploratory state process are the averages of those of the sample state process over action randomization.

Remark 1.

In general, there may be other ways to formulate the exploratory SDE in the jump-diffusion case, as we may be able to obtain alternative representations for the infinitesimal generator $\mathcal{L}^{\boldsymbol{\pi}}$ based on (34). However, the law of the exploratory state would not change because its generator stays the same.

A technical yet foundational question is the well-posedness (i.e. existence and uniqueness of solution) of the exploratory SDE (42), which we address below. For that we first specify the class of admissible strategies, which is the same as that considered in Jia and Zhou (2023) for pure diffusions.

Definition 1.

A policy $\boldsymbol{\pi}=\boldsymbol{\pi}(\cdot|\cdot,\cdot)$ is called admissible, if

  1. (i)

    $\boldsymbol{\pi}(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$, $\text{supp}\,\boldsymbol{\pi}(\cdot|t,x)=\mathcal{A}$ for every $(t,x)\in[0,T]\times\mathbb{R}^{d}$, and $\boldsymbol{\pi}(a|t,x):(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R}$ is measurable;

  2. (ii)

    $\boldsymbol{\pi}(a|t,x)$ is continuous in $(t,x)$, i.e., $\int_{\mathcal{A}}\left|\boldsymbol{\pi}(a|t,x)-\boldsymbol{\pi}(a|t^{\prime},x^{\prime})\right|da\to 0$ as $(t^{\prime},x^{\prime})\to(t,x)$. Furthermore, for any $K>0$, there is a constant $C_{K}>0$ independent of $(t,a)$ such that

    \int_{\mathcal{A}}\left|\boldsymbol{\pi}(a|t,x)-\boldsymbol{\pi}(a|t,x^{\prime})\right|da\leq C_{K}|x-x^{\prime}|,\quad \forall x,x^{\prime}\in\mathbb{R}^{d}_{K}; \qquad (45)

  3. (iii)

    $\forall(t,x)$, $\int_{\mathcal{A}}|\log\boldsymbol{\pi}(a|t,x)|\boldsymbol{\pi}(a|t,x)da\leq C(1+|x|^{p})$ for some $p\geq 2$ and a positive constant $C$; for any $q\geq 1$, $\int_{\mathcal{A}}|a|^{q}\boldsymbol{\pi}(a|t,x)da\leq C_{q}(1+|x|^{p})$ for some $p\geq 2$ and a positive constant $C_{q}$ that can depend on $q$.

Next, we establish the well-posedness of (42) under any admissible policy. The result of the next lemma regarding $\tilde{b}$ and $\tilde{\sigma}$ is provided in the proof of Lemma 2 in Jia and Zhou (2022b), which uses property (ii) of admissibility.

Lemma 1.

Under Assumption 1, for any admissible policy $\boldsymbol{\pi}$, the functions $\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})$ and $\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})$ have the following properties:

  1. (i)

    (local Lipschitz continuity) for any $K>0$, there exists a constant $C_{K}>0$ such that $\forall t\in[0,T]$, $(x,x^{\prime})\in\mathbb{R}^{d}_{K}$,

    |\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})-\tilde{b}(t,x^{\prime},\boldsymbol{\pi}_{t,x^{\prime}})|^{2}+|\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})-\tilde{\sigma}(t,x^{\prime},\boldsymbol{\pi}_{t,x^{\prime}})|^{2}\leq C_{K}|x-x^{\prime}|^{2}; \qquad (46)

  2. (ii)

    (linear growth in $x$) there exists a constant $C>0$ such that $\forall(t,x)\in[0,T]\times\mathbb{R}^{d}$,

    |\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})|^{2}+|\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})|^{2}\leq C(1+|x|^{2}). \qquad (47)

We now establish similar properties for $\gamma(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z)$ in the following lemmas, whose proofs are relegated to the appendix.

Lemma 2 (linear growth in $x$).

Under Assumption 1, for any admissible $\boldsymbol{\pi}$ and any $p\geq 2$, there exists a constant $C_{p}>0$ that can depend on $p$ such that $\forall(t,x)\in[0,T]\times\mathbb{R}^{d}$,

\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right|^{p}\nu_{k}(dz)du\leq C_{p}(1+|x|^{p}). \qquad (48)

For the local Lipschitz continuity of $\gamma_{k}(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z)$, we make an additional assumption.

Assumption 2.

For $k=1,\cdots,\ell$, the following conditions hold.

  1. (i)

    For any $K>0$ and any $p\geq 2$, there exists a constant $C_{K,p}>0$ that can depend on $K$ and $p$ such that

    \int_{\mathbb{R}}\left|\gamma_{k}(t,x,a,z)-\gamma_{k}(t,x,a^{\prime},z)\right|^{p}\nu_{k}(dz)\leq C_{K,p}|a-a^{\prime}|^{p},\quad\forall t\in[0,T],\ a,a^{\prime}\in\mathcal{A},\ x\in\mathbb{R}^{d}_{K},\ z\in\mathbb{R}. \qquad (49)

  2. (ii)

    For any $K>0$ and any $p\geq 2$, there exists a constant $C_{K,p}>0$ that can depend on $K$ and $p$ such that

    \int_{[0,1]^{n}}\left|G^{\boldsymbol{\pi}}(t,x,u)-G^{\boldsymbol{\pi}}(t,x^{\prime},u)\right|^{p}du\leq C_{K,p}|x-x^{\prime}|^{p}. \qquad (50)

For a stochastic feedback policy $\boldsymbol{\pi}_{t,x}\sim\mathcal{N}(\mu(t,x),A(t,x)A(t,x)^{\top})$, we have $G^{\boldsymbol{\pi}}(t,x,u)=\mu(t,x)+A(t,x)\Phi^{-1}(u)$. Clearly, Assumption 2-(ii) holds provided that $\mu(t,x)$ and $A(t,x)$ are locally Lipschitz continuous in $x$.

Lemma 3 (local Lipschitz continuity).

Under Assumptions 1 and 2, for any admissible policy $\boldsymbol{\pi}$, any $K>0$, and any $p\geq 2$, there exists a constant $C_{K,p}>0$ that can depend on $K$ and $p$ such that $\forall t\in[0,T]$, $(x,x^{\prime})\in\mathbb{R}^{d}_{K}$,

\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du\leq C_{K,p}|x-x^{\prime}|^{p}. \qquad (51)

With Lemmas 1 to 3, we can now apply (Kunita 2004, Theorem 3.2) and (Situ 2006, Theorem 119) to obtain the well-posedness of (42) along with the moment estimate of its solution.

Proposition 1.

Under Assumptions 1 and 2, for any admissible policy $\boldsymbol{\pi}$, there exists a unique strong solution $\{\tilde{X}_{t}^{\boldsymbol{\pi}},0\leq t\leq T\}$ to the exploratory Lévy SDE (42). Furthermore, for any $p\geq 2$, there exists a constant $C_{p}>0$ such that

\mathbb{E}_{t,x}^{\bar{\mathbb{P}}}\left[\sup_{t\leq s\leq T}|\tilde{X}_{s}^{\boldsymbol{\pi}}|^{p}\right]\leq C_{p}(1+|x|^{p}). \qquad (52)

It should be noted that the conditions imposed in Assumptions 1 and 2 are sufficient but not necessary for obtaining the well-posedness and moment estimate of the exploratory Lévy SDE (42). For a specific problem, weaker conditions may suffice for these results if we exploit special structures of the problem.

From the previous discussion, we see that for a given admissible stochastic feedback policy $\boldsymbol{\pi}$, the sample state process $\{X^{\boldsymbol{\pi}}_{t},t\in[0,T]\}$ and the exploratory state process $\{\tilde{X}^{\boldsymbol{\pi}}_{t},t\in[0,T]\}$ associated with $\boldsymbol{\pi}$ share the same infinitesimal generator and hence the same probability law. This is justified by (Ethier and Kurtz 1986, Chapter 4, Theorem 4.1) on the condition that the function space $C_{0}^{1,2}([0,T)\times\mathbb{R}^{d})$ is a core of the generator, which we assume to hold. It follows that

\mathbb{E}_{t,x}^{\bar{\mathbb{P}}}\left[\sup_{t\leq s\leq T}|X_{s}^{\boldsymbol{\pi}}|^{p}\right]=\mathbb{E}_{t,x}^{\bar{\mathbb{P}}}\left[\sup_{t\leq s\leq T}|\tilde{X}_{s}^{\boldsymbol{\pi}}|^{p}\right]\leq C_{p}(1+|x|^{p}) \qquad (53)

if (52) holds.

2.3 Exploratory HJB equation

With the exploratory dynamics (42), for any admissible stochastic policy $\boldsymbol{\pi}$ the value function $J(t,x;\boldsymbol{\pi})$ given by (23) can be rewritten as

J(t,x;\boldsymbol{\pi})=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-\beta(s-t)}\int_{\mathcal{A}}\left(r(s,\tilde{X}_{s}^{\boldsymbol{\pi}},a)-\theta\log\boldsymbol{\pi}(a|s,\tilde{X}_{s-}^{\boldsymbol{\pi}})\right)\boldsymbol{\pi}(a|s,\tilde{X}_{s-}^{\boldsymbol{\pi}})da\,ds \qquad (54)
+e^{-\beta(T-t)}h(\tilde{X}_{T}^{\boldsymbol{\pi}})\right]. \qquad (55)

Under Assumption 1 and using the admissibility of $\boldsymbol{\pi}$ and (52), it is easy to see that $J$ has polynomial growth in $x$. We provide the Feynman–Kac formula for this function in Lemma 4 by working with the representation (55). In the proof, we consider the finite and infinite jump activity cases separately because special care is needed in the latter. We revise Assumption 1 by adding one more condition for this case.

Assumption 1′.

Conditions (i) to (iv) in Assumption 1 hold. We further assume condition (v): if $\int_{\mathbb{R}}\nu_{k}(dz)=\infty$, then $|\gamma_{k}(t,x,a,z)|$ is bounded for any $|z|\leq 1$, $t\in[0,T]$, $x\in\mathbb{R}_{K}^{d}$, and $a\in\mathcal{A}$.

For Lemma 4, Lemma 5 and Theorem 1, we impose Assumption 1′ and assume that the exploratory SDE (42) is well-posed with the moment estimate (52). For simplicity, we do not explicitly mention these assumptions in the statement of the results.

Lemma 4.

Given an admissible stochastic policy $\boldsymbol{\pi}$, suppose there exists a solution $\phi\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d})$ to the following partial integro-differential equation (PIDE):

\frac{\partial\phi}{\partial t}(t,x)+\int_{\mathcal{A}}\left[H(t,x,a,\phi_{x},\phi_{xx},\phi)-\theta\log\boldsymbol{\pi}(a|t,x)\right]\boldsymbol{\pi}(a|t,x)da-\beta\phi(t,x)=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d}, \qquad (56)

with terminal condition $\phi(T,x)=h(x)$, $x\in\mathbb{R}^{d}$. Moreover, for some $p\geq 2$, $\phi$ satisfies

|\phi(t,x)|\leq C(1+|x|^{p}),\quad \forall(t,x)\in[0,T]\times\mathbb{R}^{d}. \qquad (57)

Then $\phi$ is the value function of the policy $\boldsymbol{\pi}$, i.e. $J(t,x;\boldsymbol{\pi})=\phi(t,x)$.

For ease of presentation, we henceforth assume the value function $J(\cdot,\cdot;\boldsymbol{\pi})\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d})$ for any admissible stochastic policy $\boldsymbol{\pi}$.

Remark 2.

The conclusion in Lemma 4 still holds if we assume that $\phi_{x}(t,x)$ has polynomial growth in $x$ instead of imposing Assumption 1′-(v).

Next, we consider the optimal value function defined by

J(t,x)=sup𝝅𝚷J(t,x;𝝅),\displaystyle J^{*}(t,x)=\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}J(t,x;\boldsymbol{\pi}), (58)

where 𝚷\boldsymbol{\Pi} is the class of admissible strategies. The following result characterizes JJ^{*} and the optimal stochastic policy through the so-called exploratory HJB equation.

Lemma 5.

Suppose there exists a solution ψC1,2([0,T)×d)C([0,T]×d)\psi\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) to the exploratory HJB equation:

ψt(t,x)+sup𝝅𝒫(𝒜)𝒜{H(t,x,a,ψx,ψxx,ψ)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑aβψ(t,x)=0,\displaystyle\frac{\partial\psi}{\partial t}(t,x)+\sup_{\boldsymbol{\pi}\in\mathcal{P}(\mathcal{A})}\int_{\mathcal{A}}\{H(t,x,a,\psi_{x},\psi_{xx},\psi)-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da-\beta\psi(t,x)=0, (59)

with the terminal condition ψ(T,x)=h(x)\psi(T,x)=h(x), where HH is the Hamiltonian defined in (19). Moreover, for some p2p\geq 2, ψ\psi satisfies

|ψ(t,x)|C(1+|x|p),(t,x)[0,T]×d,|\psi(t,x)|\leq C(1+|x|^{p}),\ \forall(t,x)\in[0,T]\times\mathbb{R}^{d}, (60)

and it holds that

𝒜exp(1θH(t,x,a,ψx,ψxx,ψ))𝑑a<.\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}H(t,x,a,\psi_{x},\psi_{xx},\psi)\right)da<\infty. (61)

Then, the Gibbs measure or Boltzmann distribution

𝝅(a|t,x)exp(1θH(t,x,a,ψx,ψxx,ψ))\displaystyle\boldsymbol{\pi}^{*}(a|t,x)\propto\exp\left(\frac{1}{\theta}H(t,x,a,\psi_{x},\psi_{xx},\psi)\right) (62)

is the optimal stochastic policy and J(t,x)=ψ(t,x)J^{*}(t,x)=\psi(t,x) provided that 𝛑\boldsymbol{\pi}^{*} is admissible.

Plugging the optimal stochastic policy (62) into Eq. (59) to remove the supremum operator, we obtain the following nonlinear PIDE for the optimal value function JJ^{*}:

Jt(t,x)+θlog[𝒜exp(1θH(t,x,a,Jx,Jxx,J))𝑑a]βJ(t,x)=0;J(T,x)=h(x).\displaystyle\frac{\partial J^{*}}{\partial t}(t,x)+\theta\log\left[\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right)da\right]-\beta J^{*}(t,x)=0;\;\;J^{*}(T,x)=h(x). (63)
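To illustrate the soft-max structure of (63) numerically, the following sketch (ours, purely for illustration) approximates the log-partition term θlog∫_𝒜 exp(H/θ)da on a uniform action grid via a log-sum-exp; the Hamiltonian callable H, the action interval and the toy quadratic example are assumptions of the sketch, not objects defined in the paper.

```python
import numpy as np

def soft_hamiltonian(H, a_grid, theta):
    """Approximate theta * log( int_A exp(H(a)/theta) da ) on a uniform
    action grid, using a log-sum-exp shift for numerical stability.
    H: callable a -> H(t, x, a, ...) at a fixed (t, x); a_grid: 1-D grid."""
    da = a_grid[1] - a_grid[0]
    vals = np.array([H(a) for a in a_grid]) / theta
    m = vals.max()                                   # avoid overflow in exp
    return theta * (m + np.log(np.exp(vals - m).sum() * da))

# Toy check with a concave quadratic Hamiltonian H(a) = -c2*a^2/2 + c1*a,
# for which the exact value is c1^2/(2*c2) + (theta/2)*log(2*pi*theta/c2).
c1, c2, theta = 1.0, 2.0, 0.1
a_grid = np.linspace(-10.0, 10.0, 4001)
approx = soft_hamiltonian(lambda a: -0.5 * c2 * a**2 + c1 * a, a_grid, theta)
exact = c1**2 / (2 * c2) + 0.5 * theta * np.log(2 * np.pi * theta / c2)
print(approx, exact)
```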

3 qq-Learning Theory

3.1 qq-function and policy improvement

Now that the exploratory formulation has been set up, we present the qq-learning theory for jump-diffusions, covering both policy evaluation and policy improvement. The theory can be developed similarly to Jia and Zhou (2023); so we only highlight the main differences in the analysis and skip the parts that are similar.

Definition 2.

The qq-function of the problem (3)–(5) associated with a given policy 𝛑𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi} is defined by

q(t,x,a;𝝅)\displaystyle q(t,x,a;\boldsymbol{\pi}) =J(t,x;𝝅)t+H(t,x,a,Jx,Jxx,J)βJ(t,x;𝝅),(t,x,a)[0,T]×d×𝒜,\displaystyle=\frac{\partial J(t,x;\boldsymbol{\pi})}{\partial t}+H(t,x,a,J_{x},J_{xx},J)-\beta J(t,x;\boldsymbol{\pi}),\;\;(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}, (64)

where JJ is given in (55) and the Hamiltonian function HH is defined in (19).

It is an immediate consequence of Lemma 4 that the qq-function satisfies

𝒜[q(t,x,a;𝝅)θlog𝝅(a|t,x)]𝝅(a|t,x)𝑑a=0,(t,x)[0,T]×d.\displaystyle\int_{\mathcal{A}}[q(t,x,a;\boldsymbol{\pi})-\theta\log\boldsymbol{\pi}(a|t,x)]\boldsymbol{\pi}(a|t,x)da=0,\;\;(t,x)\in[0,T]\times\mathbb{R}^{d}. (65)

The following policy improvement theorem can be proved similarly to (Jia and Zhou 2023, Theorem 2) by using the arguments in the proof of Lemma 4.

Theorem 1 (Policy Improvement).

For any given 𝛑𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}, define

𝝅(|t,x)exp(1θH(t,x,,Jx(t,x;𝝅),Jxx(t,x;𝝅),J(t,;𝝅)))exp(1θq(t,x,;𝝅)).\boldsymbol{\pi}^{\prime}(\cdot|t,x)\propto\exp\left(\frac{1}{\theta}H(t,x,\cdot,J_{x}(t,x;\boldsymbol{\pi}),J_{xx}(t,x;\boldsymbol{\pi}),J(t,\cdot;\boldsymbol{\pi}))\right)\propto\exp\left(\frac{1}{\theta}q(t,x,\cdot;\boldsymbol{\pi})\right).

If 𝛑𝚷,\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi}, then

J(t,x;𝝅)J(t,x;𝝅).\displaystyle J(t,x;\boldsymbol{\pi}^{\prime})\geq J(t,x;\boldsymbol{\pi}). (66)

Moreover, if the following map

(𝝅)\displaystyle\mathcal{I}(\boldsymbol{\pi}) =exp(1θH(t,x,,Jx(t,x;𝝅),Jxx(t,x;𝝅),J(t,;𝝅)))𝒜exp(1θH(t,x,a,Jx(t,x;𝝅),Jxx(t,x;𝝅),J(t,;𝝅)))𝑑a,𝝅𝚷\displaystyle=\frac{\exp\left(\frac{1}{\theta}H(t,x,\cdot,J_{x}(t,x;\boldsymbol{\pi}),J_{xx}(t,x;\boldsymbol{\pi}),J(t,\cdot;\boldsymbol{\pi}))\right)}{\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}H(t,x,a,J_{x}(t,x;\boldsymbol{\pi}),J_{xx}(t,x;\boldsymbol{\pi}),J(t,\cdot;\boldsymbol{\pi}))\right)da},\quad\boldsymbol{\pi}\in\boldsymbol{\Pi} (67)
=exp(1θq(t,x,;𝝅))𝒜exp(1θq(t,x,a;𝝅))𝑑a\displaystyle=\frac{\exp\left(\frac{1}{\theta}q(t,x,\cdot;\boldsymbol{\pi})\right)}{\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}q(t,x,a;\boldsymbol{\pi})\right)da} (68)

has a fixed point 𝛑\boldsymbol{\pi}^{*}, then 𝛑\boldsymbol{\pi}^{*} is an optimal policy.
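As an illustration of the policy improvement map (67)–(68), the sketch below computes the Gibbs (softmax) density generated by a qq-function evaluated on a discretized action grid at a fixed (t,x); the grid, the toy qq-values and all function names are hypothetical and serve only to show the mechanics on a bounded action set.

```python
import numpy as np

def improved_policy(q_vals, a_grid, theta):
    """Discretized policy-improvement map (67)-(68): given q(t, x, .) on an
    action grid at a fixed (t, x), return the Gibbs (softmax) density
    proportional to exp(q/theta), normalized on the grid."""
    da = a_grid[1] - a_grid[0]
    logits = q_vals / theta
    w = np.exp(logits - logits.max())        # stabilized exponentials
    return w / (w.sum() * da)                # approximate density values

# Toy usage with an assumed concave quadratic q in a:
a_grid = np.linspace(-5.0, 5.0, 1001)
density = improved_policy(-(a_grid - 1.0) ** 2, a_grid, theta=0.1)
print(np.trapz(density, a_grid))             # ~ 1.0
```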

3.2 Martingale characterization of the qq-function

Next we derive the martingale characterization of the qq-function associated with a policy 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}, assuming that its value function has already been learned and known. We will highlight the major differences in the proof, provided in the appendix, compared with the pure diffusion setting. For Theorem 2, Theorem 3, and Theorem 4, we impose Assumption 1 and assume the moment estimate (53) for the sample state process holds without explicitly mentioning them in the theorem statements.

Theorem 2.

Let a policy 𝛑𝚷,\boldsymbol{\pi}\in\boldsymbol{\Pi}, its value function J(,;𝛑)C1,2([0,T)×d)C([0,T]×d)J(\cdot,\cdot;\boldsymbol{\pi})\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) and a continuous function q^:[0,T]×d×𝒜\hat{q}:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R} be given. Assume that J(t,x;𝛑)J(t,x;\boldsymbol{\pi}) and Jx(t,x;𝛑)J_{x}(t,x;\boldsymbol{\pi}) both have polynomial growth in xx. Then the following results hold.

  1. (i)

    q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a) if and only if for any (t,x),(t,x), the following process

    eβsJ(s,Xs𝝅;𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}J(s,X_{s}^{\boldsymbol{\pi}};\boldsymbol{\pi})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (69)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where X𝝅={Xs𝝅:tsT}X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\} is the sample state process defined in (22) with Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x.

  2. (ii)

    If q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a), then for any 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} and any (t,x)(t,x), the following process

    eβsJ(s,Xs𝝅;𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}J(s,X_{s}^{\boldsymbol{\pi}^{\prime}};\boldsymbol{\pi})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})]d\tau (70)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where {Xs𝝅:tsT}\{X^{\boldsymbol{\pi}^{\prime}}_{s}:t\leq s\leq T\} is the solution to (22) under 𝝅\boldsymbol{\pi}^{\prime} with initial condition Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x.

  3. (iii)

    If there exists 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} such that for all (t,x)(t,x), the process (70) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale where Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x, then we have q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a).

Moreover, in any of the three cases above, the qq-function satisfies

𝒜{q(t,x,a;𝝅)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑a=0,for all (t,x)[0,T]×d.\displaystyle\int_{\mathcal{A}}\{q(t,x,a;\boldsymbol{\pi})-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da=0,\quad\text{for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$}. (71)
Remark 3.

Similar to Jia and Zhou (2023), Theorem 2-(i) facilitates on-policy learning, where the qq-function of the given target policy 𝛑\boldsymbol{\pi} is learned from data {(s,Xs𝛑,as𝛑),tsT}\{(s,X_{s}^{\boldsymbol{\pi}},a_{s}^{\boldsymbol{\pi}}),t\leq s\leq T\} generated by 𝛑\boldsymbol{\pi} itself. On the other hand, Theorem 2-(ii) and -(iii) are for off-policy learning, where the qq-function of 𝛑\boldsymbol{\pi} is learned from data generated by a different policy 𝛑\boldsymbol{\pi}^{\prime}, called the behavior policy.

Next, we extend Theorem 7 in Jia and Zhou (2023) and obtain a martingale characterization of the value function and the qq-function simultaneously. The proof is essentially the same and hence omitted.

Theorem 3.

Let a policy 𝛑𝚷,\boldsymbol{\pi}\in\boldsymbol{\Pi}, a function J^C1,2([0,T)×d)C([0,T]×d)\hat{J}\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) with polynomial growth and a continuous function q^:[0,T]×d×𝒜\hat{q}:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R} be given satisfying

J^(T,x)=h(x),𝒜{q^(t,x,a)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑a=0,for all (t,x)[0,T]×d.\displaystyle\hat{J}(T,x)=h(x),\quad\int_{\mathcal{A}}\{\hat{q}(t,x,a)-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da=0,\quad\text{for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$}. (72)

Assume that J^\hat{J} and J^x\hat{J}_{x} both have polynomial growth. Then

  1. (i)

    J^\hat{J} and q^\hat{q} are respectively the value function and the q-function associated with 𝝅\boldsymbol{\pi} if and only if for all (t,x)[0,T]×d(t,x)\in[0,T]\times\mathbb{R}^{d}, the following process

    eβsJ^(s,Xs𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}\hat{J}(s,X_{s}^{\boldsymbol{\pi}})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (73)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where X𝝅={Xs𝝅:tsT}X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\} satisfies (22) with Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x.

  2. (ii)

    If J^\hat{J} and q^\hat{q} are respectively the value function and the q-function associated with 𝝅\boldsymbol{\pi}, then for any 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} and for all (t,x)[0,T]×d(t,x)\in[0,T]\times\mathbb{R}^{d}, the following process

    eβsJ^(s,Xs𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}\hat{J}(s,X_{s}^{\boldsymbol{\pi}^{\prime}})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})]d\tau (74)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where {Xs𝝅:tsT}\{X^{\boldsymbol{\pi}^{\prime}}_{s}:t\leq s\leq T\} satisfies (22) with Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x.

  3. (iii)

    If there exists 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} such that for all (t,x)(t,x), the process (74) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale where Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x, then we have J^(t,x)=J(t,x;𝝅)\hat{J}(t,x)=J(t,x;\boldsymbol{\pi}) and q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a).

In any of the three cases above, if it holds that 𝛑(a|t,x)=exp(1θq^(t,x,a))𝒜exp(1θq^(t,x,a))𝑑a\boldsymbol{\pi}(a|t,x)=\frac{\exp(\frac{1}{\theta}\hat{q}(t,x,a))}{\int_{\mathcal{A}}{\exp(\frac{1}{\theta}\hat{q}(t,x,a))da}}, then 𝛑\boldsymbol{\pi} is the optimal policy and J^\hat{J} is the optimal value function.

3.3 Optimal qq-function

We consider in this section the optimal qq-function, i.e., the qq-function associated with the optimal policy 𝝅\boldsymbol{\pi}^{*} in (62). Based on Definition 2, we can define it by

q(t,x,a)\displaystyle q^{*}(t,x,a) =J(t,x)t+H(t,x,a,Jx,Jxx,J)βJ(t,x),\displaystyle=\frac{\partial J^{*}(t,x)}{\partial t}+H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})-\beta J^{*}(t,x), (75)

where JJ^{*} is the optimal value function that solves the exploratory HJB equation in (59).

The following is the martingale condition that characterizes the optimal value function JJ^{*} and the optimal qq-function; it can be proved analogously to Theorem 9 in Jia and Zhou (2023).

Theorem 4.

Let a function J^C1,2([0,T)×d)C([0,T]×d)\hat{J}^{*}\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) and a continuous function q^:[0,T]×d×𝒜\hat{q}^{*}:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R} be given satisfying

J^(T,x)=h(x),𝒜exp(1θq^(t,x,a))𝑑a=1,for all (t,x)[0,T]×d.\displaystyle\hat{J}^{*}(T,x)=h(x),\quad\int_{\mathcal{A}}{\exp\left(\frac{1}{\theta}\hat{q}^{*}(t,x,a)\right)da}=1,\quad\text{for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$}. (76)

Assume that J^(t,x)\hat{J}^{*}(t,x) and J^x(t,x)\hat{J}^{*}_{x}(t,x) both have polynomial growth in xx. Then

  1. (i)

    If J^\hat{J}^{*} and q^\hat{q}^{*} are respectively the optimal value function and the optimal qq-function, then for any 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi} and for all (t,x)[0,T]×d(t,x)\in[0,T]\times\mathbb{R}^{d}, the following process

    eβsJ^(s,Xs𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}\hat{J}^{*}(s,X_{s}^{\boldsymbol{\pi}})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}^{*}(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (77)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where X𝝅={Xs𝝅:tsT}X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\} satisfies (22) with Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x. Moreover, in this case, 𝝅^(a|t,x)=exp(1θq^(t,x,a))\boldsymbol{\hat{\pi}}^{*}(a|t,x)=\exp\left(\frac{1}{\theta}\hat{q}^{*}(t,x,a)\right) is the optimal stochastic policy.

  2. (ii)

    If there exists 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi} such that for all (t,x)(t,x), the process (77) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale where Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x, then J^\hat{J}^{*} and q^\hat{q}^{*} are respectively the optimal value function and the optimal qq-function.

4 qq-Learning Algorithms

In this section we present learning algorithms based on the martingale characterizations of the qq-function discussed in the previous section. We need to distinguish two cases, depending on whether or not the normalizing constant of the Gibbs measure generated from the qq-function can be computed explicitly.

We first discuss the case when the normalizing constant in the Gibbs measure can be computed explicitly. We denote by JψJ^{\psi} and qϕ{q}^{\phi} the parameterized function approximators for the optimal value function and optimal qq-function, respectively. In view of Theorem 4, these approximators are chosen to satisfy

Jψ(T,x)=h(x),𝒜exp(1θqϕ(t,x,a))𝑑a=1.{J}^{\psi}(T,x)=h(x),\ \int_{\mathcal{A}}\exp\left(\frac{1}{\theta}q^{\phi}(t,x,a)\right)da=1. (78)

We can then update (ψ,ϕ)(\psi,\phi) by enforcing the martingale condition discussed in Theorem 4 and applying the techniques developed in Jia and Zhou (2022a). This procedure is discussed in detail in Section 4.1 of Jia and Zhou (2023), and hence is not repeated here. For the reader’s convenience, we present Algorithms 1 and 2, which summarize the offline and online qq-learning algorithms respectively. These algorithms are based on the so-called martingale orthogonality conditions in Jia and Zhou (2022a), with the typical choices of test functions being ξt=Jψψ(t,Xt𝝅ϕ)\xi_{t}=\frac{\partial J^{\psi}}{\partial\psi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}}) and ζt=qϕϕ(t,Xt𝝅ϕ,at𝝅ϕ)\zeta_{t}=\frac{\partial q^{\phi}}{\partial\phi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}}), where 𝝅ϕ\boldsymbol{\pi}^{\phi} is the policy generated by qϕq^{\phi}. Note that these two algorithms are identical to Algorithms 2 and 3 in Jia and Zhou (2023).

Algorithm 1 Offline–Episodic qq-Learning Algorithm

Inputs: initial state x0x_{0}, horizon TT, time step Δt\Delta t, number of episodes NN, number of mesh grids KK, initial learning rates αψ,αϕ\alpha_{\psi},\alpha_{\phi} and a learning rate schedule function l()l(\cdot) (a function of the number of episodes), functional forms of parameterized value function Jψ(,)J^{\psi}(\cdot,\cdot) and qq-function qϕ(,,)q^{\phi}(\cdot,\cdot,\cdot) satisfying (78), functional forms of test functions 𝝃(t,xt,at)\boldsymbol{\xi}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}) and 𝜻(t,xt,at)\boldsymbol{\zeta}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}), and temperature parameter θ\theta.

Required program (on-policy): environment simulator (x,r)=EnvironmentΔt(t,x,a)(x^{\prime},r)=\textit{Environment}_{\Delta t}(t,x,a) that takes current time–state pair (t,x)(t,x) and action aa as inputs and generates state xx^{\prime} at time t+Δtt+\Delta t and instantaneous reward rr at time tt as outputs. Policy 𝝅ϕ(a|t,x)=exp(1θqϕ(t,x,a))\boldsymbol{\pi}^{\phi}(a|t,x)=\exp\left(\frac{1}{\theta}q^{\phi}(t,x,a)\right).

Required program (off-policy): observations {atk,rtk,xtk+1}k=0,,K1{xtK,h(xtK)}=Observation(Δt)\{a_{t_{k}},r_{t_{k}},x_{t_{k+1}}\}_{k=0,\cdots,K-1}\cup\{x_{t_{K}},h(x_{t_{K}})\}=\textit{Observation}(\Delta t) including the observed actions, rewards, and state trajectories under the given behavior policy at the sampling time grid with step size Δt\Delta t.

Learning procedure:

  Initialize ψ,ϕ\psi,\phi.
  for episode j=1j=1 to NN do
     Initialize k=0k=0. Observe initial state x0x_{0} and store xtkx0x_{t_{k}}\leftarrow x_{0}.
     On-policy case:
     while k<Kk<K do
         Generate action atk𝝅ϕ(|tk,xtk)a_{t_{k}}\sim\boldsymbol{\pi}^{\phi}(\cdot|t_{k},x_{t_{k}}). Apply atka_{t_{k}} to environment simulator (x,r)=EnvironmentΔt(tk,xtk,atk)(x,r)=Environment_{\Delta t}(t_{k},x_{t_{k}},a_{t_{k}}), and observe new state xx and reward rr as outputs. Store xtk+1xx_{t_{k+1}}\leftarrow x and rtkrr_{t_{k}}\leftarrow r. Update kk+1k\leftarrow k+1.
     end while
     Off-policy case: Obtain one observation {atk,rtk,xtk+1}k=0,,K1{xtK,h(xtK)}=Observation(Δt)\{a_{t_{k}},r_{t_{k}},x_{t_{k+1}}\}_{k=0,\cdots,K-1}\cup\{x_{t_{K}},h(x_{t_{K}})\}=\textit{Observation}(\Delta t).
     For every k=0,1,,K1k=0,1,\cdots,K-1, compute and store test functions ξtk=𝝃(tk,xt0,,xtk,at0,,atk)\xi_{t_{k}}=\boldsymbol{\xi}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}), ζtk=𝜻(tk,xt0,,xtk,at0,,atk)\zeta_{t_{k}}=\boldsymbol{\zeta}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}). Compute
Δψ=i=0K1ξti[Jψ(ti+1,xti+1)Jψ(ti,xti)+rtiΔtqϕ(ti,xti,ati)ΔtβJψ(ti,xti)Δt],\Delta\psi=\sum_{i=0}^{K-1}\xi_{t_{i}}\big{[}J^{\psi}(t_{i+1},x_{t_{i+1}})-J^{\psi}(t_{i},x_{t_{i}})+r_{t_{i}}\Delta t-q^{\phi}(t_{i},x_{t_{i}},a_{t_{i}})\Delta t-\beta J^{\psi}(t_{i},x_{t_{i}})\Delta t\big{]},
Δϕ=i=0K1ζti[Jψ(ti+1,xti+1)Jψ(ti,xti)+rtiΔtqϕ(ti,xti,ati)ΔtβJψ(ti,xti)Δt].\Delta\phi=\sum_{i=0}^{K-1}\zeta_{t_{i}}\big{[}J^{\psi}(t_{i+1},x_{t_{i+1}})-J^{\psi}(t_{i},x_{t_{i}})+r_{t_{i}}\Delta t-q^{\phi}(t_{i},x_{t_{i}},a_{t_{i}})\Delta t-\beta J^{\psi}(t_{i},x_{t_{i}})\Delta t\big{]}.
Update ψ\psi and ϕ\phi by
ψψ+l(j)αψΔψ.\psi\leftarrow\psi+l(j)\alpha_{\psi}\Delta\psi.
ϕϕ+l(j)αϕΔϕ.\phi\leftarrow\phi+l(j)\alpha_{\phi}\Delta\phi.
  end for
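A minimal Python sketch of the end-of-episode updates Δψ and Δϕ in Algorithm 1 is given below, with the test functions taken as ξ=∂Jψ/∂ψ and ζ=∂qϕ/∂ϕ as mentioned above; the callables J, q and their parameter gradients, the trajectory format and the learning-rate handling are assumptions of this sketch rather than prescriptions of the paper.

```python
import numpy as np

def offline_update(traj, J, dJ_dpsi, q, dq_dphi, psi, phi, beta, dt,
                   lr_psi, lr_phi):
    """One offline update of (psi, phi) from a single episode, forming the
    sums Delta psi and Delta phi of Algorithm 1 with the test functions
    xi = dJ/dpsi and zeta = dq/dphi. traj is a list of (t, x, a, r) tuples
    followed by a terminal (T, x_T, None, None); all callables are assumed
    differentiable parameterizations supplied by the user."""
    d_psi, d_phi = np.zeros_like(psi), np.zeros_like(phi)
    for (t, x, a, r), (t_next, x_next, _, _) in zip(traj[:-1], traj[1:]):
        # temporal difference of the value function along the data
        td = (J(t_next, x_next, psi) - J(t, x, psi)
              + r * dt - q(t, x, a, phi) * dt - beta * J(t, x, psi) * dt)
        d_psi += dJ_dpsi(t, x, psi) * td
        d_phi += dq_dphi(t, x, a, phi) * td
    return psi + lr_psi * d_psi, phi + lr_phi * d_phi
```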
Algorithm 2 Online-Incremental qq-Learning Algorithm

Inputs: initial state x0x_{0}, horizon TT, time step Δt\Delta t, number of mesh grids KK, initial learning rates αψ,αϕ\alpha_{\psi},\alpha_{\phi} and learning rate schedule function l()l(\cdot) (a function of the number of episodes), functional forms of parameterized value function Jψ(,)J^{\psi}(\cdot,\cdot) and qq-function qϕ(,,)q^{\phi}(\cdot,\cdot,\cdot) satisfying (78), functional forms of test functions 𝝃(t,xt,at)\boldsymbol{\xi}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}) and 𝜻(t,xt,at)\boldsymbol{\zeta}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}), and temperature parameter θ\theta.

Required program (on-policy): environment simulator (x,r)=EnvironmentΔt(t,x,a)(x^{\prime},r)=\textit{Environment}_{\Delta t}(t,x,a) that takes current time–state pair (t,x)(t,x) and action aa as inputs and generates state xx^{\prime} at time t+Δtt+\Delta t and instantaneous reward rr at time tt as outputs. Policy 𝝅ϕ(a|t,x)=exp(1θqϕ(t,x,a))\boldsymbol{\pi}^{\phi}(a|t,x)=\exp\left(\frac{1}{\theta}q^{\phi}(t,x,a)\right).

Required program (off-policy): observations {a,r,x}=Observation(t,x;Δt)\{a,r,x^{\prime}\}=\textit{Observation}(t,x;\Delta t) including the observed actions, rewards, and state when the current time-state pair is (t,x)(t,x) under the given behavior policy at the sampling time grid with step size Δt\Delta t.

Learning procedure:

  Initialize ψ,ϕ\psi,\phi.
  for episode j=1j=1 to \infty do
     Initialize k=0k=0. Observe initial state x0x_{0} and store xtkx0x_{t_{k}}\leftarrow x_{0}.
     while k<Kk<K do
         On-policy case: Generate action atk𝝅ϕ(|tk,xtk)a_{t_{k}}\sim\boldsymbol{\pi}^{\phi}(\cdot|t_{k},x_{t_{k}}). Apply atka_{t_{k}} to environment simulator (x,r)=EnvironmentΔt(tk,xtk,atk)(x,r)=Environment_{\Delta t}(t_{k},x_{t_{k}},a_{t_{k}}), and observe new state xx and reward rr as outputs. Store xtk+1xx_{t_{k+1}}\leftarrow x and rtkrr_{t_{k}}\leftarrow r.
         Off-policy case: Obtain one observation atk,rtk,xtk+1=Observation(tk,xtk;Δt)a_{t_{k}},r_{t_{k}},x_{t_{k+1}}=\textit{Observation}(t_{k},x_{t_{k}};\Delta t).
         Compute test functions ξtk=𝝃(tk,xt0,,xtk,at0,,atk)\xi_{t_{k}}=\boldsymbol{\xi}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}), ζtk=𝜻(tk,xt0,,xtk,at0,,atk)\zeta_{t_{k}}=\boldsymbol{\zeta}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}). Compute
δ=Jψ(tk+1,xtk+1)Jψ(tk,xtk)+rtkΔtqϕ(tk,xtk,atk)ΔtβJψ(tk,xtk)Δt,\displaystyle\delta=J^{\psi}(t_{k+1},x_{t_{k+1}})-J^{\psi}(t_{k},x_{t_{k}})+r_{t_{k}}\Delta t-q^{\phi}(t_{k},x_{t_{k}},a_{t_{k}})\Delta t-\beta J^{\psi}(t_{k},x_{t_{k}})\Delta t,
Δψ=ξtkδ,\displaystyle\Delta\psi=\xi_{t_{k}}\delta,
Δϕ=ζtkδ.\displaystyle\Delta\phi=\zeta_{t_{k}}\delta.
Update ψ\psi and ϕ\phi by
ψψ+l(j)αψΔψ.\psi\leftarrow\psi+l(j)\alpha_{\psi}\Delta\psi.
ϕϕ+l(j)αϕΔϕ.\phi\leftarrow\phi+l(j)\alpha_{\phi}\Delta\phi.
Update kk+1k\leftarrow k+1
     end while
  end for
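For comparison, here is a sketch of the single-transition update in Algorithm 2, where the temporal difference δ is formed from one observed step and applied immediately; all callables, parameters and learning rates are again placeholders.

```python
def online_step(t, x, a, r, x_next, J, dJ_dpsi, q, dq_dphi,
                psi, phi, beta, dt, lr_psi, lr_phi):
    """One online-incremental update of (psi, phi) from a single observed
    transition, mirroring the per-step delta of Algorithm 2."""
    delta = (J(t + dt, x_next, psi) - J(t, x, psi)
             + r * dt - q(t, x, a, phi) * dt - beta * J(t, x, psi) * dt)
    return (psi + lr_psi * dJ_dpsi(t, x, psi) * delta,
            phi + lr_phi * dq_dphi(t, x, a, phi) * delta)
```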

When the normalizing constant in the Gibbs measure is not available, we take the same approach as in Jia and Zhou (2023) to develop learning algorithms. Specifically, we consider {𝝅ϕ(|t,x)}ϕΦ\{\boldsymbol{\pi}^{\phi}(\cdot|t,x)\}_{\phi\in\Phi}, which is a family of density functions of some tractable distributions, e.g. multivariate normal distributions. Starting from a stochastic policy 𝝅ϕ\boldsymbol{\pi}^{\phi} in this family, we update the policy by considering the optimization problem

minϕΦKL(𝝅ϕ(|t,x)|exp(1θq(t,x,;𝝅ϕ))).\displaystyle\min_{\phi^{\prime}\in\Phi}\text{KL}\left(\boldsymbol{\pi}^{\phi^{\prime}}(\cdot|t,x)\Big{|}\exp\left(\frac{1}{\theta}q(t,x,\cdot;\boldsymbol{\pi}^{\phi})\right)\right).

Specifically, using gradient descent, we can update ϕ\phi as in Jia and Zhou (2023), by

ϕϕθαϕdt[log𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ)1θq(t,Xt𝝅ϕ,at𝝅ϕ;𝝅ϕ)]ϕlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ).\displaystyle\phi\leftarrow\phi-\theta\alpha_{\phi}dt\left[\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}})-\frac{1}{\theta}q(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})\right]\frac{\partial}{\partial\phi}\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}}). (79)

In the above updating rule, we need only the values of the qq-function along the trajectory – the “data” – {(t,Xt𝝅ϕ,at𝝅ϕ);0tT}\{(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}});0\leq t\leq T\}, instead of its full functional form. These values can be learned through the “temporal difference” of the value function along the data. To see this, applying Itô’s formula (15) to J(,;𝝅ϕ)J(\cdot,\cdot;\boldsymbol{\pi}^{\phi}), we have

q(t,Xt𝝅ϕ,at𝝅ϕ;𝝅ϕ)dt\displaystyle q(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})dt =dJ(t,Xt𝝅ϕ;𝝅ϕ)+[r(t,Xt𝝅ϕ,at𝝅ϕ)βJ(t,Xt𝝅ϕ;𝝅ϕ)]dt\displaystyle=dJ(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})+[r(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})-\beta J(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})]dt (80)
Jx(t,Xt𝝅ϕ;𝝅ϕ)σ(t,Xt𝝅ϕ,at𝝅ϕ)dWt\displaystyle\quad-J_{x}(t,X_{t-}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi}){\sigma}(t,X_{t-}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})dW_{t}
d(J(t,Xt𝝅ϕ+γ(t,Xt𝝅ϕ,at𝝅ϕ,z);𝝅ϕ)J(t,Xt𝝅ϕ;𝝅ϕ))N~(dt,dz).\displaystyle\quad-\int_{\mathbb{R}^{d}}\left(J(t,X^{\boldsymbol{\pi}^{\phi}}_{t-}+{\gamma}(t,X^{\boldsymbol{\pi}^{\phi}}_{t-},a_{t}^{\boldsymbol{\pi}^{\phi}},z);\boldsymbol{\pi}^{\phi})-J(t,X^{\boldsymbol{\pi}^{\phi}}_{t-};\boldsymbol{\pi}^{\phi})\right)\widetilde{N}(dt,dz). (81)

We may ignore the dWtdW_{t} and N~(dt,dz)\widetilde{N}(dt,dz) terms which are martingale differences with mean zero, and then the updating rule in (79) becomes

ϕϕ+αϕ[θlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ)dt+dJ(t,Xt𝝅ϕ;𝝅ϕ)+(r(t,Xt𝝅ϕ,at𝝅ϕ)βJ(t,Xt𝝅ϕ;𝝅ϕ))dt]\displaystyle\phi\leftarrow\phi+\alpha_{\phi}\left[-\theta\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}})dt+dJ(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})+\left(r(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})-\beta J(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})\right)dt\right] (82)
ϕlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ).\displaystyle\qquad\qquad\qquad\cdot\frac{\partial}{\partial\phi}\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}}). (83)

Using Jψ(,)J^{\psi}(\cdot,\cdot) as the parameterized function approximator for J(,;𝝅ϕ)J(\cdot,\cdot;\boldsymbol{\pi}^{\phi}), we arrive at the updating rule for the policy parameter ϕ\phi:

ϕϕ+αϕ[θlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ)dt+dJψ(t,Xt𝝅ϕ)+(r(t,Xt𝝅ϕ,at𝝅ϕ)βJψ(t,Xt𝝅ϕ))dt]\displaystyle\phi\leftarrow\phi+\alpha_{\phi}\left[-\theta\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}})dt+dJ^{\psi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}})+\left(r(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})-\beta J^{\psi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}})\right)dt\right] (84)
ϕlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ).\displaystyle\qquad\qquad\qquad\cdot\frac{\partial}{\partial\phi}\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}}). (85)

Therefore, we can update ψ\psi using the PE methods in Jia and Zhou (2022a), and update ϕ\phi using the above rule, leading to actor–critic-type algorithms.
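A minimal sketch of one time-discretized actor step implementing (84)–(85) follows; the parameterized log-density log πϕ, its gradient in ϕ, and the critic Jψ are assumed to be supplied by the user, and dJψ is approximated by the one-step increment of Jψ along the observed data.

```python
def actor_step(t, x_prev, x, a, r, log_pi, dlogpi_dphi, J_psi,
               phi, theta, beta, dt, lr_phi):
    """One time-discretized actor step following (84)-(85): dJ^psi is
    replaced by the one-step increment of the critic along the data, and the
    running reward and entropy terms are scaled by dt. The callables log_pi,
    dlogpi_dphi and J_psi are assumed user-supplied parameterizations."""
    dJ = J_psi(t + dt, x) - J_psi(t, x_prev)
    weight = (-theta * log_pi(a, t, x_prev, phi) * dt
              + dJ + (r - beta * J_psi(t, x_prev)) * dt)
    return phi + lr_phi * weight * dlogpi_dphi(a, t, x_prev, phi)
```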

To conclude, we are able to use the same RL algorithms to learn the optimal policy and optimal value function, without having to know a priori whether the unknown environment entails a pure diffusion or a jump-diffusion. This important conclusion is based on the theoretical analysis carried out in the previous sections.

5 Application: Mean–Variance Portfolio Selection

We now present an applied example of the general theory and algorithms derived. Consider investing in a market where there are a risk-free asset and a risky asset (e.g. a stock or an index). The risk-free rate is given by a constant rfr_{f} and the risky asset price process follows

dSt=St[μdt+σdWt+(exp(z)1)N~(dt,dz)].dS_{t}=S_{t-}\left[\mu dt+\sigma dW_{t}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(dt,dz)\right]. (86)

Let XtX_{t} be the discounted wealth value at time tt, and let ata_{t} be the discounted dollar value of the investment in the risky asset. The self-financing discounted wealth process follows

dXta=atσρdt+atσdWt+at(exp(z)1)N~(dt,dz),dX^{a}_{t}=a_{t}\sigma\rho dt+a_{t}\sigma dW_{t}+a_{t}\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(dt,dz), (87)

where ρ\rho is the Sharpe ratio of the risky asset, given by

ρ=μrfσ.\rho=\frac{\mu-r_{f}}{\sigma}. (88)

We assume

|z|>1exp(z)ν(dz)<,|z|>1exp(2z)ν(dz)<.\displaystyle\int_{|z|>1}\exp(z)\nu(dz)<\infty,\quad\int_{|z|>1}\exp(2z)\nu(dz)<\infty. (89)

Condition (89) implies that 𝔼[St]\mathbb{E}[S_{t}] and 𝔼[St2]\mathbb{E}[S_{t}^{2}] are finite for every t0t\geq 0; see (Cont and Tankov 2004, Proposition 3.14). We set

σJ2(exp(z)1)2ν(dz),\sigma_{J}^{2}\coloneqq\int_{\mathbb{R}}(\exp(z)-1)^{2}\nu(dz), (90)

which is finite by conditions (2) and (89).

Fix the investment horizon as [0,T][0,T]. The mean-variance (MV) portfolio selection problem considers

minaVar[XTa] subject to 𝔼[XTa]=z.\min_{a}\operatorname{Var}\left[X_{T}^{a}\right]\quad\text{ subject to }\mathbb{E}\left[X_{T}^{a}\right]=z. (91)

We seek the optimal pre-committed strategy for the MV problem as in Zhou and Li (2000). We can transform the above constrained problem into an unconstrained one by introducing a Lagrange multiplier, which yields

mina𝔼[(XTa)2]z22ω(𝔼[XTa]z)=mina𝔼[(XTaω)2](ωz)2.\min_{a}\mathbb{E}\left[\left(X_{T}^{a}\right)^{2}\right]-z^{2}-2\omega\left(\mathbb{E}\left[X_{T}^{a}\right]-z\right)=\min_{a}\mathbb{E}\left[\left(X_{T}^{a}-\omega\right)^{2}\right]-(\omega-z)^{2}. (92)

Note that the optimal solution to the unconstrained minimization problem depends on ω\omega, and we can obtain the optimal multiplier ω\omega^{*} by solving 𝔼[XTa(ω)]=z\mathbb{E}\left[X_{T}^{a^{*}}(\omega)\right]=z.

The exploratory formulation of the problem is

min𝝅𝔼t,x¯[(XT𝝅ω)2+θtTlog𝝅(as𝝅|s,Xs𝝅)𝑑s](ωz)2,\displaystyle\min_{\boldsymbol{\pi}}\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\left(X_{T}^{\boldsymbol{\pi}}-\omega\right)^{2}+\theta\int_{t}^{T}\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})ds\right]-(\omega-z)^{2}, (93)

where the discounted wealth under a stochastic policy 𝝅\boldsymbol{\pi} follows

dXs𝝅=as𝝅σρds+as𝝅σdWs+as𝝅(exp(z)1)N~(ds,dz),s[t,T],Xt𝝅=x.dX^{\boldsymbol{\pi}}_{s}=a_{s}^{\boldsymbol{\pi}}\sigma\rho ds+a_{s}^{\boldsymbol{\pi}}\sigma dW_{s}+a_{s}^{\boldsymbol{\pi}}\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(ds,dz),\quad s\in[t,T],\ X^{\boldsymbol{\pi}}_{t}=x. (94)

5.1 Solution of the exploratory control problem

We consider the HJB equation for problem (93):

Vt(t,x)+inf𝝅𝒫(){H(t,x,a,Vx,Vxx,V)+θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑a=0,\displaystyle\frac{\partial V}{\partial t}(t,x)+\inf_{\boldsymbol{\pi}\in\mathcal{P}(\mathbb{R})}\int_{\mathbb{R}}\{H(t,x,a,V_{x},V_{xx},V)+\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da=0, (95)

with the terminal condition V(T,x)=(xω)2(ωz)2V(T,x)=(x-\omega)^{2}-(\omega-z)^{2}. Note that supremum becomes infimum and the sign before θlog𝝅(a|t,x)\theta\log\boldsymbol{\pi}(a|t,x) flips compared with (59) because we consider minimization here. The Hamiltonian of the problem is given by

H(t,x,a,Vx,Vxx,V)\displaystyle H(t,x,a,V_{x},V_{xx},V) =aσρVx(t,x)+12a2σ2Vxx(t,x)\displaystyle=a\sigma\rho V_{x}(t,x)+\frac{1}{2}a^{2}\sigma^{2}V_{xx}(t,x) (96)
+(V(t,x+γ(a,z))V(t,x)γ(a,z)Vx(t,x))ν(dz),\displaystyle+\int_{\mathbb{R}}\left(V(t,x+\gamma(a,z))-V(t,x)-\gamma(a,z)V_{x}(t,x)\right)\nu(dz), (97)

where γ(a,z)=a(ez1).\gamma(a,z)=a(e^{z}-1). We take the following ansatz for the solution of the HJB equation (95):

V(t,x)=(xω)2f(t)+g(t)(ωz)2.V(t,x)=(x-\omega)^{2}f(t)+g(t)-(\omega-z)^{2}. (98)

As V(t,x)V(t,x) is quadratic in xx, we can easily calculate the integral term in the Hamiltonian and obtain

H(t,x,a,Vx,Vxx,V)=aσρVx(t,x)+12a2(σ2+σJ2)Vxx(t,x).H(t,x,a,V_{x},V_{xx},V)=a\sigma\rho V_{x}(t,x)+\frac{1}{2}a^{2}(\sigma^{2}+\sigma_{J}^{2})V_{xx}(t,x). (99)

The probability density function that minimizes the integral in (95) is given by

𝝅c(|t,x)exp(1θH(t,x,a,Vx,Vxx,V)),\boldsymbol{\pi}_{c}(\cdot|t,x)\propto\exp\left(-\frac{1}{\theta}H(t,x,a,V_{x},V_{xx},V)\right), (100)

which is a candidate for the optimal stochastic policy. From (99), we obtain

𝝅c(|t,x)𝒩(σρVx(σ2+σJ2)Vxx,θ(σ2+σJ2)Vxx).\boldsymbol{\pi}_{c}(\cdot|t,x)\sim\mathcal{N}\left(\cdot\mid-\frac{\sigma\rho V_{x}}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}},\frac{\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}\right). (101)

Substituting it back into the HJB equation (95), we obtain the following nonlinear PDE:

Vtρ2σ22(σ2+σJ2)(Vx)2Vxxθ2ln2πθ(σ2+σJ2)Vxx=0,(t,x)[0,T)×,\displaystyle V_{t}-\frac{\rho^{2}\sigma^{2}}{2(\sigma^{2}+\sigma_{J}^{2})}\frac{(V_{x})^{2}}{V_{xx}}-\frac{\theta}{2}\ln\frac{2\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}=0,\quad(t,x)\in[0,T)\times\mathbb{R}, (102)
V(T,x)=(xω)2(ωz)2.\displaystyle V(T,x)=(x-\omega)^{2}-(\omega-z)^{2}. (103)

We plug in the ansatz (98) to the above PDE and obtain that f(t)f(t) satisfies

f(t)ρ2σ2σ2+σJ2f(t)=0,f(T)=1,\displaystyle f^{\prime}(t)-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}f(t)=0,\ f(T)=1, (104)

and g(t)g(t) satisfies

g(t)θ2lnπθ(σ2+σJ2)f(t)=0,g(T)=0.\displaystyle g^{\prime}(t)-\frac{\theta}{2}\ln\frac{\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})f(t)}=0,\ g(T)=0. (105)

These two ordinary differential equations can be solved analytically, and we obtain

V(t,x)=\displaystyle V(t,x)= (xω)2exp(ρ2σ2σ2+σJ2(Tt))+θρ2σ24(σ2+σJ2)(T2t2)\displaystyle(x-\omega)^{2}\exp\left(-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right)+\frac{\theta\rho^{2}\sigma^{2}}{4(\sigma^{2}+\sigma_{J}^{2})}(T^{2}-t^{2}) (106)
θ2(ρ2σ2σ2+σJ2T+lnπθσ2+σJ2)(Tt)(ωz)2.\displaystyle-\frac{\theta}{2}\left(\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}T+\ln\frac{\pi\theta}{\sigma^{2}+\sigma_{J}^{2}}\right)(T-t)-(\omega-z)^{2}. (107)

It follows that

𝝅c(|t,x)𝒩(|σρσ2+σJ2(xω),θ2(σ2+σJ2)exp(ρ2σ2σ2+σJ2(Tt))).\boldsymbol{\pi}_{c}(\cdot|t,x)\sim\mathcal{N}\left(\cdot\ \Big{|}-\frac{\sigma\rho}{\sigma^{2}+\sigma_{J}^{2}}(x-\omega),\frac{\theta}{2(\sigma^{2}+\sigma_{J}^{2})}\exp\left(\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right)\right). (108)

It is straightforward to verify that 𝝅c\boldsymbol{\pi}_{c} is admissible by checking the four conditions in Definition 1. Furthermore, VV solves the HJB equation (95) and has quadratic growth. Therefore, by Lemma 5, we have the following conclusion.

Proposition 2.

For the unconstrained MV problem (92), the optimal value function J(t,x)=V(t,x)J^{*}(t,x)=V(t,x) and the optimal stochastic policy 𝛑=𝛑c\boldsymbol{\pi}^{*}=\boldsymbol{\pi}_{c}.

When there are no jumps, we have σJ2=0\sigma_{J}^{2}=0 and thus recover the expressions of the optimal value function and optimal policy derived in Wang and Zhou (2020) for the unconstrained MV problem in the pure diffusion setting.
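For concreteness, the sketch below evaluates and samples from the optimal stochastic policy (108), assuming the model primitives are known; setting σJ²=0 recovers the pure-diffusion policy. In the RL setting these primitives are unknown, which is why the policy is parameterized in the next subsection.

```python
import numpy as np

def mv_optimal_action(t, x, T, mu, r_f, sigma, sigma_J2, theta, omega,
                      rng=None):
    """Sample an action from the optimal stochastic policy (108) of the
    exploratory MV problem, given (assumed known) model primitives; with
    sigma_J2 = 0 this reduces to the pure-diffusion policy of Wang and
    Zhou (2020)."""
    rng = rng or np.random.default_rng()
    rho = (mu - r_f) / sigma                      # Sharpe ratio (88)
    total = sigma**2 + sigma_J2
    mean = -sigma * rho / total * (x - omega)
    var = theta / (2.0 * total) * np.exp(rho**2 * sigma**2 / total * (T - t))
    return rng.normal(mean, np.sqrt(var))
```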

5.2 Parametrizations for qq-learning

It is important to observe that the optimal value function, the optimal policy and the Hamiltonian given by (99) take the same structural forms regardless of the presence of jumps; the only differences lie in the constant coefficients of those functions. However, those coefficients are unknown anyway and will be parameterized in the implementation of our RL algorithms. Consequently, we can use the same parameterizations of the optimal value function and optimal qq-function for learning as in the diffusion setting of Jia and Zhou (2023). This important insight, reached only after a rigorous theoretical analysis, shows that the continuous-time RL algorithms are robust to the presence of jumps and essentially model-free, at least for the MV portfolio selection problem.

Following Jia and Zhou (2023), we parametrize the value function as

Jψ(t,x;ω)=(xω)2eψ3(Tt)+ψ2(t2T2)+ψ1(tT)(ωz)2,J^{\psi}(t,x;\omega)=(x-\omega)^{2}e^{-\psi_{3}(T-t)}+\psi_{2}\left(t^{2}-T^{2}\right)+\psi_{1}(t-T)-(\omega-z)^{2}, (109)

and the qq-function as

qϕ(t,x,a;w)=eϕ2ϕ3(Tt)2(a+ϕ1(xw))2θ2[log2πθ+ϕ2+ϕ3(Tt)].q^{\phi}(t,x,a;w)=-\frac{e^{-\phi_{2}-\phi_{3}(T-t)}}{2}\left(a+\phi_{1}(x-w)\right)^{2}-\frac{\theta}{2}\left[\log 2\pi\theta+\phi_{2}+\phi_{3}(T-t)\right]. (110)

Let ψ=(ψ1,ψ2,ψ3)\psi=\left(\psi_{1},\psi_{2},\psi_{3}\right)^{\top} and ϕ=(ϕ1,ϕ2,ϕ3)\phi=\left(\phi_{1},\phi_{2},\phi_{3}\right)^{\top}. The policy associated with the parametric qq-function is 𝝅ϕ(t,x;w)=𝒩(ϕ1(xw),θeϕ2+ϕ3(Tt)){\boldsymbol{\pi}}^{\phi}(\cdot\mid t,x;w)=\mathcal{N}\left(-\phi_{1}(x-w),\theta e^{\phi_{2}+\phi_{3}(T-t)}\right). In addition to ψ\psi and ϕ\phi, we learn the Lagrange multiplier ω\omega in the same way as in Jia and Zhou (2023) by the stochastic approximation algorithm that updates ω\omega with a learning rate after a fixed number of iterations.
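A minimal sketch of these parameterizations is given below: Jψ as in (109), qϕ as in (110) (which satisfies the normalization in (78) by construction), and the induced Gaussian policy sampler. We write omega for the Lagrange multiplier; all function and variable names are ours and merely illustrative.

```python
import numpy as np

def J_psi(t, x, psi, omega, z, T):
    """Parameterized value function (109); psi = (psi1, psi2, psi3); z is the
    target mean of terminal wealth."""
    psi1, psi2, psi3 = psi
    return ((x - omega)**2 * np.exp(-psi3 * (T - t))
            + psi2 * (t**2 - T**2) + psi1 * (t - T) - (omega - z)**2)

def q_phi(t, x, a, phi, omega, theta, T):
    """Parameterized q-function (110); phi = (phi1, phi2, phi3). It satisfies
    the normalization constraint in (78) by construction."""
    phi1, phi2, phi3 = phi
    return (-0.5 * np.exp(-phi2 - phi3 * (T - t)) * (a + phi1 * (x - omega))**2
            - 0.5 * theta * (np.log(2 * np.pi * theta) + phi2 + phi3 * (T - t)))

def sample_action(t, x, phi, omega, theta, T, rng=None):
    """Sample from the Gaussian policy generated by q_phi."""
    phi1, phi2, phi3 = phi
    rng = rng or np.random.default_rng()
    return rng.normal(-phi1 * (x - omega),
                      np.sqrt(theta * np.exp(phi2 + phi3 * (T - t))))
```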

5.3 Simulation study

We assess the effect of jumps on the convergence behavior of our algorithms via a simulation study. We use the same basic setting as in Jia and Zhou (2023): x0=1x_{0}=1, z=1.4z=1.4, T=1T=1 year, Δt=1/252\Delta t=1/252 years (corresponding to one trading day), and a chosen temperature parameter θ=0.1\theta=0.1. We consider two market simulators: one is given by the Black–Scholes (BS) model and the other is Merton’s jump-diffusion (MJD) model in which the Lévy density is a scaled Gaussian density, i.e.,

ν(z)=λ12πδ2exp((zm)22δ2),λ>0,δ>0,m,\nu(z)=\lambda\frac{1}{\sqrt{2\pi\delta^{2}}}\exp\left(-\frac{(z-m)^{2}}{2\delta^{2}}\right),\ \lambda>0,\delta>0,m\in\mathbb{R}, (111)

where λ\lambda is the arrival rate of the Poisson jumps. The Gaussian assumption is a common choice in the finance literature for modeling the jump-size distribution (see e.g. Merton 1976, Bates 1991, Das 2002), partly due to its tractability for statistical estimation and partly because heavy-tailed distributions may not be easily identified from real data when the number of jumps is limited (see Heyde and Kou 2004).

Under the latter model, we have

σJ2=λ[exp(2m+2δ2)2exp(m+12δ2)+1].\sigma_{J}^{2}=\lambda\left[\exp\left(2m+2\delta^{2}\right)-2\exp\left(m+\frac{1}{2}\delta^{2}\right)+1\right]. (112)

To mimic the real market, we set the parameters of these two simulators by estimating them from daily data of the S&P 500 index using maximum likelihood estimation (MLE). Our estimation data cover a long period from the beginning of 2000 to the end of 2023. In Table 1, we summarize the estimated parameter values (used for the simulators) and the corresponding value of ϕ1\phi_{1}^{*} in the optimal policy. Note that although we use a stochastic policy to interact with the environment during training to update the policy parameters, for actual execution of portfolio selection we apply a deterministic policy, namely the mean part of the optimal stochastic policy after it has been learned; in this sense the learning here is off-policy. One advantage of doing so, among others, is to reduce the variance of the final wealth; see Huang et al. (2022) for a discussion of this approach. As a result, here we only display the values of ϕ1\phi_{1}^{*} in these two environments and use them as benchmarks to check the convergence of our algorithm (see Figure 1).

Simulator Parameters Optimal
BS μ=0.0690,σ=0.1965\mu=0.0690,\sigma=0.1965 ϕ1=1.5940\phi_{1}^{*}=1.5940
MJD μ=0.0636,σ=0.1347,λ=28.4910,m=0.0039,δ=0.0275\mu=0.0636,\sigma=0.1347,\lambda=28.4910,m=-0.0039,\delta=0.0275 ϕ1=1.7869\phi_{1}^{*}=1.7869
Table 1: Parameters used in the two simulators. The column “Optimal” reports the values of ϕ1\phi_{1} in the optimal policies calculated using the respective simulators.

For offline learning, the Lagrange multiplier ω\omega is updated after every m=10m=10 iterations, and the parameter vectors ψ\psi and ϕ\phi are initialized as zero vectors. The learning rates are set to be αw=0.05\alpha_{w}=0.05, αψ=0.001\alpha_{\psi}=0.001, and αϕ=0.1\alpha_{\phi}=0.1 with decay rate l(j)=j0.51l(j)=j^{-0.51}, where jj is the iteration index. In each iteration, we generate 3232 independent TT-year trajectories to update the parameters. We train the model for N=20000N=20000 iterations.

We also consider online learning with Δt\Delta t equal to one trading day. We select a batch size of 128128 trading days and update the parameters once this number of observations has arrived. We set m=1m=1 for updating ω\omega and initialize ψ\psi and ϕ\phi as zero vectors. The learning rates are set as αw=0.01\alpha_{w}=0.01, αψ=0.001\alpha_{\psi}=0.001, and αϕ=0.05\alpha_{\phi}=0.05 with decay rate l(j)=j0.51l(j)=j^{-0.51}. Some of the rates are notably smaller than in the offline case because we now update with fewer observations and thus must be more cautious. The model is again trained for N=20000N=20000 iterations.

Figure 1 plots the convergence behavior of offline and online learning under both simulators or market environments (one with jumps and one without). The algorithms have converged after a sufficient number of iterations, whether jumps are present or not. This demonstrates that convergence of the offline and online qq-learning algorithms proposed in Jia and Zhou (2023) under the diffusion setting is robust to the presence of jumps for mean–variance portfolio selection. However, jumps in the environment can introduce more variability in the convergence process as seen from the plots.

Figure 1: Convergence of the offline and online qq-Learning algorithms under two market simulators for the policy parameter ϕ1\phi_{1} (iteration index on the xx-axis)

5.4 Effects of jumps

The theoretical analysis so far in this section shows that, for the mean–variance problem, one does not need to know in advance whether or not the stock prices have jumps in order to carry out the RL task, because the optimal stochastic policy is Gaussian and the corresponding value function and qq-function have the same structures for parameterization irrespective of the presence of jumps. However, we stress that this is the exception rather than the rule. Here we give a counterexample.

Consider a modification of the mean–variance problem where the controlled system dynamics is

dXta=atσρdt+atσdWt+γ(at,z)N~(dt,dz),\displaystyle dX_{t}^{a}=a_{t}\sigma\rho dt+a_{t}\sigma dW_{t}+\int_{\mathbb{R}}\gamma(a_{t},z)\widetilde{N}(dt,dz), (113)

with

γ(a,z)=a2,\displaystyle\gamma(a,z)=a^{2}, (114)

and the exploratory objective is

J(t,x;ω)=min𝝅𝔼t,x¯[(XT𝝅ω)2+θtTlog𝝅(as𝝅|s,Xs𝝅)𝑑s](ωz)2.\displaystyle J^{*}(t,x;\omega)=\min_{\boldsymbol{\pi}}\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[(X_{T}^{\boldsymbol{\pi}}-\omega)^{2}+\theta\int_{t}^{T}\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})ds\right]-(\omega-z)^{2}. (115)

Note that this is not a mean–variance portfolio selection problem because (113) does not correspond to a self-financed wealth equation with a reasonably modelled stock price process.

The Hamiltonian is given by

H(t,x,a,vx,vxx,v)\displaystyle H(t,x,a,v_{x},v_{xx},v) =aσρvx(t,x)+12a2σ2vxx(t,x)\displaystyle=a\sigma\rho v_{x}(t,x)+\frac{1}{2}a^{2}\sigma^{2}v_{xx}(t,x) (116)
+(v(t,x+a2)v(t,x)a2vx(t,x))ν(dz).\displaystyle\quad+\int_{\mathbb{R}}\left(v(t,x+a^{2})-v(t,x)-a^{2}\cdot v_{x}(t,x)\right)\nu(dz). (117)

If an optimal stochastic policy exists, then it must be

𝝅(a|t,x)exp(1θH(t,x,a,Jx,Jxx,J)).\displaystyle\boldsymbol{\pi}^{*}(a|t,x)\propto\exp\left(-\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right). (118)

We now show by contradiction that the optimal stochastic policy cannot be Gaussian in this case. Note that if there is no optimal stochastic policy, then this would already demonstrate that jumps matter, because the optimal stochastic policy for the case of no jumps exists and is Gaussian.

Remark 4.

The existence of an optimal stochastic policy in (118) is equivalent to the integrability of the quantity exp(1θH(t,x,a,Jx,Jxx,J))\exp\left(-\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right) over a𝒜=(,)a\in\mathcal{A}=(-\infty,\infty). This integrability depends on the tail behavior of the Hamiltonian and, in particular, the behavior of J(t,x+a2)J^{*}(t,x+a^{2}) when a2a^{2} is large.

Suppose the optimal stochastic policy 𝝅(|t,x)\boldsymbol{\pi}^{*}(\cdot|t,x) is Gaussian for all (t,x)(t,x), implying that the Hamiltonian H(t,x,a,Jx,Jxx,J)H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*}) is a quadratic function of aa. It then follows from (116) that there exist functions h1(t,x)h_{1}(t,x) and h2(t,x)h_{2}(t,x) such that

J(t,x+a2)J(t,x)a2Jx(t,x)=a2h1(t,x)+ah2(t,x),for all (t,x,a).\displaystyle J^{*}(t,x+a^{2})-J^{*}(t,x)-a^{2}J^{*}_{x}(t,x)=a^{2}\cdot h_{1}(t,x)+a\cdot h_{2}(t,x),\quad\text{for all $(t,x,a)$}. (119)

We do not include a term independent of aa on the right-hand side because the left-hand side is zero when a=0a=0. Taking the derivative with respect to aa, we obtain

Jx(t,x+a2)2a2aJx(t,x)=2ah1(t,x)+h2(t,x),for all (t,x,a).\displaystyle J^{*}_{x}(t,x+a^{2})\cdot 2a-2aJ^{*}_{x}(t,x)=2a\cdot h_{1}(t,x)+h_{2}(t,x),\quad\text{for all $(t,x,a)$}. (120)

Setting a=0a=0, we get h2=0.h_{2}=0. It follows that

a[Jx(t,x+a2)Jx(t,x)h1(t,x)]=0for all (t,x,a).\displaystyle a\cdot\left[J^{*}_{x}(t,x+a^{2})-J^{*}_{x}(t,x)-h_{1}(t,x)\right]=0\quad\text{for all $(t,x,a)$}. (121)

Hence we have h1(t,x)=Jx(t,x+a2)Jx(t,x)h_{1}(t,x)=J^{*}_{x}(t,x+a^{2})-J^{*}_{x}(t,x) for any a0.a\neq 0. Sending aa to zero yields h1(t,x)=0h_{1}(t,x)=0 for all (t,x)(t,x). Therefore, we obtain from (119) that

J(t,x+a2)J(t,x)a2Jx(t,x)=0,for all (t,x,a).\displaystyle J^{*}(t,x+a^{2})-J^{*}(t,x)-a^{2}J^{*}_{x}(t,x)=0,\quad\text{for all $(t,x,a)$}. (122)

Taking the derivative in aa in the above, we have

Jx(t,x+a2)Jx(t,x)=0,for all (t,x,a).\displaystyle J^{*}_{x}(t,x+a^{2})-J^{*}_{x}(t,x)=0,\quad\text{for all $(t,x,a)$}. (123)

Thus JxJ^{*}_{x} is constant in xx, i.e., JJ^{*} is affine in xx, leading to J(t,x)=g1(t)x+g2(t)J^{*}(t,x)=g_{1}(t)x+g_{2}(t) for some functions g1(t)g_{1}(t) and g2(t)g_{2}(t). The resulting Hamiltonian becomes

H(t,x,a,Jx,Jxx,J)\displaystyle H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*}) =aσρg1(t).\displaystyle=a\sigma\rho g_{1}(t). (124)

This is linear in a𝒜=(,)a\in\mathcal{A}=(-\infty,\infty) and hence the integral 𝒜exp(1θH(t,x,a,Jx,Jxx,J))𝑑a\int_{\mathcal{A}}\exp\left(-\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right)da does not exist. It follows that 𝝅(|t,x)\boldsymbol{\pi}^{*}(\cdot|t,x) does not exist, which is a contradiction. Therefore, we have shown that under (114), the optimal stochastic policy either does not exist or is not Gaussian when it exists.

Remark 5.

The argument above works for γ(a,z)=am\gamma(a,z)=a^{m} for any m>1m>1.

6 Application: Mean–Variance Hedging of Options

The MV portfolio selection problem considered in Section 5 is an LQ problem. In this section, we present another application that is non-LQ. Consider an option seller who needs to hedge a short position in a European-style option that expires at time TT. The option is written on a risky asset whose price process SS is described by the SDE (86) with condition (89) satisfied. At the terminal time TT, the seller pays G(ST)G(S_{T}) to the option holder. We assume that the seller’s hedging activity will not affect the risky asset price.

To hedge the random payoff, the seller constructs a portfolio consisting of the underlying risky asset and cash. We consider discounted quantities in the problem: the discounted risky asset price S^terftSt\hat{S}_{t}\coloneqq e^{-r_{f}t}S_{t} and discounted payoff G^(S^T):=erfTG(ST)\hat{G}(\hat{S}_{T}):=e^{-r_{f}T}G(S_{T}), where rfr_{f} is again the constant risk-free interest rate. As an example, for a put option with strike price 𝒦\mathcal{K}, G(ST)=(𝒦ST)+G(S_{T})=(\mathcal{K}-S_{T})^{+} (x+:=max(x,0)x^{+}:=\max(x,0)), and G^(S^T)=(𝒦^S^T)+\hat{G}(\hat{S}_{T})=(\hat{\mathcal{K}}-\hat{S}_{T})^{+}, where 𝒦^:=erfT𝒦\hat{\mathcal{K}}:=e^{-r_{f}T}\mathcal{K}. Here S^\hat{S} follows the SDE

dS^t=S^t[ρσdt+σdWt+(exp(z)1)N~(dt,dz)],d\hat{S}_{t}=\hat{S}_{t-}\left[\rho\sigma dt+\sigma dW_{t}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(dt,dz)\right], (125)

where ρ\rho is defined in (88).

We denote the discounted dollar value in the risky asset and the discounted value of the hedging portfolio at time tt by ata_{t} and XtX_{t}, respectively. As in the MV portfolio selection problem, ata_{t} is the control variable and XX follows the SDE (87). The seller seeks a hedging policy to minimize the deviation from the terminal payoff. A popular formulation is mean–variance hedging (also known as quadratic hedging), which considers the objective

mina𝔼[(XTaG^(S^T))2].\min_{a}\mathbb{E}\left[\left(X_{T}^{a}-\hat{G}(\hat{S}_{T})\right)^{2}\right]. (126)

Note that this expectation is taken under the real-world probability measure, rather than under any martingale measure for option pricing. An advantage of this formulation is that the relevant data are observable in the real world (but not necessarily in a risk-neutral world). Although objective (126) looks similar to that of the MV portfolio selection problem, the target the hedging portfolio tries to meet is now random instead of constant. Furthermore, the payoff is generally nonlinear in S^T\hat{S}_{T}. Thus, MV hedging is not an LQ problem. Also, both S^t\hat{S}_{t} and XtaX_{t}^{a} matter for the hedging decision at time tt, whereas S^t\hat{S}_{t} is irrelevant for the decision in the MV portfolio selection problem. This increase in the state dimension makes the hedging problem much more difficult to solve.

We consider the following exploratory formulation to encourage exploration for learning:

min𝝅𝔼t,x¯[(XT𝝅G^(S^T))2+θtTlog𝝅(as𝝅|s,Xs𝝅)𝑑s],\displaystyle\min_{\boldsymbol{\pi}}\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\left(X_{T}^{\boldsymbol{\pi}}-\hat{G}(\hat{S}_{T})\right)^{2}+\theta\int_{t}^{T}\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})ds\right], (127)

where the discounted value of the hedging portfolio under a stochastic policy 𝝅\boldsymbol{\pi} follows (94).

6.1 Solution of the exploratory control problem

We consider the HJB equation for problem (127):

Vt+inf𝝅𝒫(){H(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)+θlog𝝅(a|t,S,x)}𝝅(a|t,S,x)𝑑a=0,\frac{\partial V}{\partial t}+\inf_{\boldsymbol{\pi}\in\mathcal{P}(\mathbb{R})}\int_{\mathbb{R}}\left\{H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V)+\theta\log\boldsymbol{\pi}(a|t,S,x)\right\}\boldsymbol{\pi}(a|t,S,x)da=0, (128)

with terminal condition V(T,S,x)=(xG^(S))2V(T,S,x)=(x-\hat{G}(S))^{2}. The Hamiltonian of the problem is given by

H(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)\displaystyle H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V) (129)
=\displaystyle= ρσSVS(t,S,x)+12σ2S2VSS(t,S,x)+aρσVx(t,S,x)+12a2σ2Vxx(t,S,x)+aσ2SVSx(t,S,x)\displaystyle\rho\sigma SV_{S}(t,S,x)+\frac{1}{2}\sigma^{2}S^{2}V_{SS}(t,S,x)+a\rho\sigma V_{x}(t,S,x)+\frac{1}{2}a^{2}\sigma^{2}V_{xx}(t,S,x)+a\sigma^{2}SV_{Sx}(t,S,x) (130)
+\displaystyle+ (V(t,S+γ~(S,z),x+γ(a,z))V(t,S,x)γ~(S,z)VS(t,S,x)γ(a,z)Vx(t,S,x))ν(dz),\displaystyle\int_{\mathbb{R}}\left(V(t,S+\tilde{\gamma}(S,z),x+\gamma(a,z))-V(t,S,x)-\tilde{\gamma}(S,z)V_{S}(t,S,x)-\gamma(a,z)V_{x}(t,S,x)\right)\nu(dz), (131)

where γ(a,z)=a(exp(z)1)\gamma(a,z)=a(\exp(z)-1) and γ~(S,z)=S(exp(z)1)\tilde{\gamma}(S,z)=S(\exp(z)-1).

We make the following ansatz for the solution of the HJB equation (128):

V(t,S,x)=(xh(t,S))2f(t)+g(t,S).V(t,S,x)=(x-h(t,S))^{2}f(t)+g(t,S). (132)

With this, we can simplify the integral term in the Hamiltonian and obtain

H(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)\displaystyle H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V) (133)
=\displaystyle= ρσSVS(t,S,x)+12σ2S2VSS(t,S,x)+12a2(σ2+σJ2)Vxx(t,S,x)\displaystyle\rho\sigma SV_{S}(t,S,x)+\frac{1}{2}\sigma^{2}S^{2}V_{SS}(t,S,x)+\frac{1}{2}a^{2}(\sigma^{2}+\sigma_{J}^{2})V_{xx}(t,S,x) (134)
+a(ρσVx(t,S,x)+σ2SVSxVxx(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz))\displaystyle+a\left(\rho\sigma V_{x}(t,S,x)+\sigma^{2}SV_{Sx}-V_{xx}\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)\right) (135)
+(V(t,Sexp(z),x)V(t,S,x)S(exp(z)1)VS(t,S,x))ν(dz).\displaystyle+\int_{\mathbb{R}}\left(V(t,S\exp(z),x)-V(t,S,x)-S(\exp(z)-1)V_{S}(t,S,x)\right)\nu(dz). (136)

The probability density function that minimizes the integral in (128) is given by

𝝅c(|t,S,x)exp(1θH(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)),\boldsymbol{\pi}_{c}(\cdot|t,S,x)\propto\exp\left(-\frac{1}{\theta}H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V)\right), (137)

which is a candidate for the optimal stochastic policy. From (136), we obtain that 𝝅c(|t,S,x)\boldsymbol{\pi}_{c}(\cdot|t,S,x) is given by

𝒩(|ρσVx+σ2SVSxVxx(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz)(σ2+σJ2)Vxx,θ(σ2+σJ2)Vxx).\mathcal{N}\left(\cdot\ \Big{|}-\frac{\rho\sigma V_{x}+\sigma^{2}SV_{Sx}-V_{xx}\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}},\frac{\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}\right). (138)

Substituting it back into the HJB equation (128), we obtain the nonlinear PIDE

Vt+ρσSVS+12σ2S2VSS+(V(t,Sexp(z),x)V(t,S,x)S(exp(z)1)VS(t,S,x))ν(dz)\displaystyle V_{t}+\rho\sigma SV_{S}+\frac{1}{2}\sigma^{2}S^{2}V_{SS}+\int_{\mathbb{R}}\left(V(t,S\exp(z),x)-V(t,S,x)-S(\exp(z)-1)V_{S}(t,S,x)\right)\nu(dz) (139)
(ρσVx+σ2SVSxVxx(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz))22(σ2+σJ2)Vxx\displaystyle-\frac{\left(\rho\sigma V_{x}+\sigma^{2}SV_{Sx}-V_{xx}\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)\right)^{2}}{2(\sigma^{2}+\sigma_{J}^{2})V_{xx}} (140)
θ2ln2πθ(σ2+σJ2)Vxx=0,(t,S,x)[0,T)×+×,V(T,S,x)=(xG^(S))2.\displaystyle-\frac{\theta}{2}\ln\frac{2\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}=0,\quad(t,S,x)\in[0,T)\times\mathbb{R}_{+}\times\mathbb{R},\ V(T,S,x)=(x-\hat{G}(S))^{2}. (141)

We plug in the ansatz (132) to the above PIDE. After some lengthy calculations, we can collect similar terms and obtain the following equations satisfied by ff, hh and gg:

f(t)ρ2σ2σ2+σJ2f(t)=0,f(T)=1,\displaystyle f^{\prime}(t)-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}f(t)=0,\ f(T)=1, (142)
ht+(h(t,Sexp(z))h(t,S)S(exp(z)1)hS(t,S))[1ρσσ2+σJ2(exp(z)1)]ν(dz)\displaystyle h_{t}+\int_{\mathbb{R}}\left(h(t,S\exp(z))-h(t,S)-S(\exp(z)-1)h_{S}(t,S)\right)\left[1-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}(\exp(z)-1)\right]\nu(dz) (143)
+12σ2S2hSS=0,(t,S)[0,T)×+,h(T,S)=G^(S),\displaystyle+\frac{1}{2}\sigma^{2}S^{2}h_{SS}=0,\quad(t,S)\in[0,T)\times\mathbb{R}_{+},\ h(T,S)=\hat{G}(S), (144)

and

gt+ρσSgS+12σ2S2gSS+(g(t,Sexp(z))g(t,S)S(exp(z)1)gS(t,S))ν(dz)\displaystyle g_{t}+\rho\sigma Sg_{S}+\frac{1}{2}\sigma^{2}S^{2}g_{SS}+\int_{\mathbb{R}}\left(g(t,S\exp(z))-g(t,S)-S(\exp(z)-1)g_{S}(t,S)\right)\nu(dz) (145)
+σ2S2(hS)2f(t)+(h(t,Sexp(z))h(t,S))2f(t)ν(dz)\displaystyle+\sigma^{2}S^{2}(h_{S})^{2}f(t)+\int_{\mathbb{R}}\left(h(t,S\exp(z))-h(t,S)\right)^{2}f(t)\nu(dz) (146)
f(t)σ2+σJ2(σ2ShS+(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz))2\displaystyle-\frac{f(t)}{\sigma^{2}+\sigma_{J}^{2}}\left(\sigma^{2}Sh_{S}+\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)\right)^{2} (147)
θ2lnπθ(σ2+σJ2)f(t)=0,(t,S)[0,T)×+,g(T,S)=0.\displaystyle-\frac{\theta}{2}\ln\frac{\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})f(t)}=0,\quad(t,S)\in[0,T)\times\mathbb{R}_{+},\ g(T,S)=0. (148)

The function ff is given by

f(t)=exp(ρ2σ2σ2+σJ2(Tt)).f(t)=\exp\left(-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right). (149)

However, the two PIDEs cannot be solved in closed form in general. Below we first present some properties of hh and gg.

Lemma 6.

Assume σ>0\sigma>0 and G^(S)\hat{G}(S) is Lipschitz continuous in SS. The PIDE (144) has a unique solution in C1,2([0,T)×+)C([0,T]×+)C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) that is Lipschitz continuous in SS. The PIDE (148) also has a unique solution in C1,2([0,T)×+)C([0,T]×+)C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}). Moreover, there exists a constant C>0C>0 such that

|h(t,S)|C(1+S),|g(t,S)|C(1+S2)for any(t,S)[0,T]×+.|h(t,S)|\leq C(1+S),\quad|g(t,S)|\leq C(1+S^{2})\ \ \text{for any}\ (t,S)\in[0,T]\times\mathbb{R}_{+}. (150)

Next, we provide stochastic representations for the solutions to PIDEs (144) and (148), which will be subsequently exploited for our RL study. Construct a new probability measure \mathbb{Q} from the given probability measure \mathbb{P} with the Radon–Nikodym density process Λs:=dd|s\Lambda_{s}:=\frac{d\mathbb{Q}}{d\mathbb{P}}|_{\mathcal{F}_{s}} defined by

dΛs=ρσσ2+σJ2Λs[σdWs+(exp(z)1)N~(ds,dz)],Λt=1.d\Lambda_{s}=-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}\Lambda_{s-}\left[\sigma dW_{s}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(ds,dz)\right],\ \Lambda_{t}=1. (151)

Condition (89) guarantees that {Λs:tsT}\{\Lambda_{s}:t\leq s\leq T\} is a true martingale with unit expectation. The standard measure change results yield that, under \mathbb{Q},

Ws:=Ws+ρσ2σ2+σJ2(st),N~(ds,dz):=N(ds,dz)[1ρσσ2+σJ2(exp(z)1)]ν(dz)dsW^{\prime}_{s}:=W_{s}+\frac{\rho\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(s-t),\ \widetilde{N}^{\prime}(ds,dz):=N(ds,dz)-\left[1-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}(\exp(z)-1)\right]\nu(dz)ds (152)

are standard Brownian motion and compensated Poisson random measure, respectively. Using them we can rewrite the SDE for S^\hat{S} as

dS^s=S^s[σdWs+(exp(z)1)N~(ds,dz)].d\hat{S}_{s}=\hat{S}_{s-}\left[\sigma dW^{\prime}_{s}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}^{\prime}(ds,dz)\right]. (153)

Let Yt:=1+σWt+0t(exp(z)1)N~(ds,dz)Y_{t}:=1+\sigma W^{\prime}_{t}+\int_{0}^{t}\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}^{\prime}(ds,dz). Clearly, it is a Lévy process under \mathbb{Q} and, given condition (89), also a \mathbb{Q}-martingale. Process S^\hat{S} is the stochastic exponential of YY and it follows from (Cont and Tankov 2004, Proposition 8.23) that S^\hat{S} is also a martingale under \mathbb{Q}. The Feynman–Kac theorem gives the representation of the solution hh to PIDE (144) as

h(t,S)=𝔼[G^(S^T)|S^t=S],t[0,T].h(t,S)=\mathbb{E}^{\mathbb{Q}}\left[\hat{G}(\hat{S}_{T})\big{|}\hat{S}_{t}=S\right],\quad t\in[0,T]. (154)

We can view \mathbb{Q} as the martingale measure for option valuation and interpret h(t,S) as the option price at time t when the underlying price is S. For PIDE (148), again by the Feynman–Kac theorem, its solution g is given by

g(t,S)\displaystyle g(t,S) =ge(t,S)tTθ2lnπθ(σ2+σJ2)f(s)ds\displaystyle=g_{e}(t,S)-\int_{t}^{T}\frac{\theta}{2}\ln\frac{\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})f(s)}ds (155)
=ge(t,S)θ2[ln(πθσ2+σJ2)(Tt)+ρ2σ22(σ2+σJ2)(Tt)2],\displaystyle=g_{e}(t,S)-\frac{\theta}{2}\left[\ln\left(\frac{\pi\theta}{\sigma^{2}+\sigma_{J}^{2}}\right)(T-t)+\frac{\rho^{2}\sigma^{2}}{2(\sigma^{2}+\sigma_{J}^{2})}(T-t)^{2}\right], (156)

where

ge(t,S):=𝔼[tTf(s)(σ2S^s2(hS(s,S^s))2+(h(s,S^sexp(z))h(s,S^s))2ν(dz))ds\displaystyle g_{e}(t,S):=\mathbb{E}^{\mathbb{P}}\left[\int_{t}^{T}f(s)\left(\sigma^{2}\hat{S}_{s}^{2}(h_{S}(s,\hat{S}_{s}))^{2}+\int_{\mathbb{R}}\left(h(s,\hat{S}_{s}\exp(z))-h(s,\hat{S}_{s})\right)^{2}\nu(dz)\right)ds\right. (157)
tTf(s)σ2+σJ2(σ2S^shS(s,S^s)+(exp(z)1)(h(s,S^sexp(z))h(s,S^s))ν(dz))2ds|S^t=S].\displaystyle\left.-\int_{t}^{T}\frac{f(s)}{\sigma^{2}+\sigma_{J}^{2}}\left(\sigma^{2}\hat{S}_{s}h_{S}(s,\hat{S}_{s})+\int_{\mathbb{R}}(\exp(z)-1)\left(h(s,\hat{S}_{s}\exp(z))-h(s,\hat{S}_{s})\right)\nu(dz)\right)^{2}ds\big{|}\hat{S}_{t}=S\right]. (158)

Using the expression for V(t,S,x)V(t,S,x), we obtain 𝝅c(|t,S,x)\boldsymbol{\pi}_{c}(\cdot|t,S,x) as

𝝅c(|t,S,x)𝒩(|(t,S,x),θ2(σ2+σJ2)exp(ρ2σ2σ2+σJ2(Tt))),\boldsymbol{\pi}_{c}(\cdot|t,S,x)\sim\mathcal{N}\left(\cdot\ \Big{|}\ \mathcal{M}^{*}(t,S,x),\frac{\theta}{2(\sigma^{2}+\sigma_{J}^{2})}\exp\left(\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right)\right), (159)

where

(t,S,x)=σ2ShS(t,S)+(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz)ρσ(xh(t,S))σ2+σJ2.\mathcal{M}^{*}(t,S,x)=\frac{\sigma^{2}Sh_{S}(t,S)+\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)-\rho\sigma(x-h(t,S))}{\sigma^{2}+\sigma_{J}^{2}}. (160)

Using Lemma 6, we can directly verify that \boldsymbol{\pi}_{c} is admissible. It is also easy to see that V(t,S,x) given by (132) has quadratic growth in S and x. Therefore, by Lemma 5, we have the following conclusion.

Proposition 3.

Assume that \sigma>0 and \hat{G}(S) is Lipschitz continuous in S. Then, for the MV hedging problem (126), the optimal value function is J^{*}(t,S,x)=V(t,S,x) and the optimal stochastic policy is \boldsymbol{\pi}^{*}=\boldsymbol{\pi}_{c}.

The mean part of 𝝅\boldsymbol{\pi}^{*}, (t,S,x)\mathcal{M}^{*}(t,S,x), comprises three terms. The quantity hS(t,S)h_{S}(t,S) is the delta of the option and hence ShS(t,S)Sh_{S}(t,S) gives the dollar amount of the risky asset as required by delta hedging, which is used to hedge continuous price movements. The integral term shows the dollar amount of the risky asset that should be held to hedge discontinuous changes. The two hedges are combined using weights σ2/(σ2+σJ2)\sigma^{2}/(\sigma^{2}+\sigma_{J}^{2}) and 1/(σ2+σJ2)1/(\sigma^{2}+\sigma_{J}^{2}). The last term reflects the adjustment that needs to be made due to the discrepancy between the hedging portfolio value and option price. It is easy to show that (t,S,x)\mathcal{M}^{*}(t,S,x) is the optimal deterministic policy of problem (126).

The variance of \boldsymbol{\pi}^{*} is the same as that in the MV portfolio selection problem (cf. (108)); it decreases as t approaches the terminal time T, implying that exploration is gradually reduced over time.

6.2 Parametrizations and actor–critic learning

We use the previously derived solution of the exploratory problem as knowledge for RL. As we will see, the exploratory solution lends itself to natural parametrizations for the policy (“actor”) and value function (“critic”), but less so for parametrizing the qq-function.

To simplify the integral in (t,S,x)\mathcal{M}^{*}(t,S,x) defined in (160), we first parametrize the Lévy density ν(z)\nu(z) as in (111) where we have three parameters: λ\lambda is the jump arrival rate, and mm and δ\delta are the mean and standard deviation of the normal distribution for the jump size. We then use the Fourier-cosine method developed in Fang and Oosterlee (2009) to obtain an expression for h(t,S)h(t,S). Let Yt:=log(S^t/S^0)Y_{t}:=\log(\hat{S}_{t}/\hat{S}_{0}). Applying Itô’s formula, under \mathbb{P} we obtain

dlogS^t=[ρσ12σ2λ(exp(m+12δ2)1)]dt+σdWt+zN(dt,dz).d\log\hat{S}_{t}=\left[\rho\sigma-\frac{1}{2}\sigma^{2}-\lambda\left(\exp\left(m+\frac{1}{2}\delta^{2}\right)-1\right)\right]dt+\sigma dW_{t}+\int_{\mathbb{R}}zN(dt,dz). (161)

Noting (152), under \mathbb{Q} we have

dlogS^t=[ρσσJ2σ2+σJ212σ2λ(exp(m+12δ2)1)]dt+σdWt+zN(dt,dz),d\log\hat{S}_{t}=\left[\rho\sigma\frac{\sigma_{J}^{2}}{\sigma^{2}+\sigma_{J}^{2}}-\frac{1}{2}\sigma^{2}-\lambda\left(\exp\left(m+\frac{1}{2}\delta^{2}\right)-1\right)\right]dt+\sigma dW^{\prime}_{t}+\int_{\mathbb{R}}zN(dt,dz), (162)

where the Lévy measure of N(dt,dz)N(dt,dz) becomes

ν(dz)=[1ρσσ2+σJ2(exp(z)1)]λ2πδ2exp((zm)22δ2)dz.\nu^{\prime}(dz)=\left[1-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}(\exp(z)-1)\right]\frac{\lambda}{\sqrt{2\pi\delta^{2}}}\exp\left(-\frac{(z-m)^{2}}{2\delta^{2}}\right)dz. (163)

The characteristic function of YtY_{t} under \mathbb{Q} is given by

ΦYt(u)=exp[iut(ρσσJ2σ2+σJ212σ2λ(exp(m+12δ2)1))12σ2u2t+tA(u)],\Phi_{Y_{t}}^{\mathbb{Q}}(u)=\exp\left[iut\left(\rho\sigma\frac{\sigma_{J}^{2}}{\sigma^{2}+\sigma_{J}^{2}}-\frac{1}{2}\sigma^{2}-\lambda\left(\exp\left(m+\frac{1}{2}\delta^{2}\right)-1\right)\right)-\frac{1}{2}\sigma^{2}u^{2}t+tA(u)\right], (164)

where

A(u)=\displaystyle A(u)= (exp(iuz)1)ν(dz)\displaystyle\int_{\mathbb{R}}\left(\exp(iuz)-1\right)\nu^{\prime}(dz) (165)
=\displaystyle= λ(exp(ium12δ2u2)1)λρσσ2+σJ2×\displaystyle\lambda\left(\exp\left(ium-\frac{1}{2}\delta^{2}u^{2}\right)-1\right)-\frac{\lambda\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}\times (166)
(exp(m+12δ2+iu(m+δ2)12δ2u2)exp(ium12δ2u2)exp(m+12δ2)+1).\displaystyle\left(\exp\left(m+\frac{1}{2}\delta^{2}+iu(m+\delta^{2})-\frac{1}{2}\delta^{2}u^{2}\right)-\exp\left(ium-\frac{1}{2}\delta^{2}u^{2}\right)-\exp\left(m+\frac{1}{2}\delta^{2}\right)+1\right). (167)

The Fourier-cosine method then calculates h(t,S)h(t,S) as

h^(t,S):=k=0N1(ΦYt(kπba)exp(ikπln(S/𝒦^)aba))Vk.\hat{h}(t,S):=\sideset{}{{}^{\prime}}{\sum}_{k=0}^{N-1}\Re\left(\Phi_{Y_{t}}^{\mathbb{Q}}\left(\frac{k\pi}{b-a}\right)\exp\left(ik\pi\frac{\ln(S/\hat{\mathcal{K}})-a}{b-a}\right)\right)V_{k}. (168)

Here, \Re(x) denotes the real part of a complex number x, \Sigma^{\prime} indicates that the first term in the sum is halved, \hat{\mathcal{K}} is the discounted strike price of the call or put option under consideration, and [a,b] is the truncation interval for Y_{T}, chosen wide enough that the probability under \mathbb{Q} of Y_{T} falling outside it can be neglected. The expression for V_{k} is given by Eqs. (24) and (25) in Fang and Oosterlee (2009) for call and put options, respectively, and it depends only on a, b, and \hat{\mathcal{K}}. We can also calculate h_{S}(t,S) by differentiating \hat{h}(t,S) w.r.t. S, which yields

h^S(t,S):=k=0N1(ΦYt(kπba)ikπ(ba)Sexp(ikπln(S/𝒦^)aba))Vk.\hat{h}_{S}(t,S):=\sideset{}{{}^{\prime}}{\sum}_{k=0}^{N-1}\Re\left(\Phi_{Y_{t}}^{\mathbb{Q}}\left(\frac{k\pi}{b-a}\right)\frac{ik\pi}{(b-a)S}\exp\left(ik\pi\frac{\ln(S/\hat{\mathcal{K}})-a}{b-a}\right)\right)V_{k}. (169)
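
To make the above concrete, the following is a minimal Python sketch of (164)–(169) for a put option. It assumes a zero risk-free rate (so the discounted strike equals the strike), that the characteristic function is evaluated at the time to maturity T-t, and that the put coefficients V_k are the standard cosine coefficients built from \chi_k and \psi_k in Fang and Oosterlee (2009). The function names, the truncation interval [a,b]=[-1,1], the number of terms N=256, and the parameter values are our own illustrative choices, not the paper's.

```python
import numpy as np

def cf_Y_Q(u, tau, rho, sigma, sigma_J, lam, m, delta):
    # Characteristic function of the log-price over a horizon tau under Q,
    # following (164)-(167); tau is taken to be the time to maturity T - t.
    drift = (rho * sigma * sigma_J**2 / (sigma**2 + sigma_J**2)
             - 0.5 * sigma**2 - lam * (np.exp(m + 0.5 * delta**2) - 1.0))
    c = rho * sigma / (sigma**2 + sigma_J**2)
    A = (lam * (np.exp(1j*u*m - 0.5*delta**2*u**2) - 1.0)
         - lam * c * (np.exp(m + 0.5*delta**2 + 1j*u*(m + delta**2) - 0.5*delta**2*u**2)
                      - np.exp(1j*u*m - 0.5*delta**2*u**2)
                      - np.exp(m + 0.5*delta**2) + 1.0))
    return np.exp(1j*u*tau*drift - 0.5*sigma**2*u**2*tau + tau*A)

def put_coeffs(a, b, K_hat, N):
    # Cosine coefficients V_k of the put payoff on [a, b], built from
    # chi_k(a, 0) and psi_k(a, 0) as in Fang and Oosterlee (2009).
    k = np.arange(N)
    w = k * np.pi / (b - a)
    chi = (np.cos(-w * a) - np.exp(a) + w * np.sin(-w * a)) / (1.0 + w**2)
    psi = np.empty(N)
    psi[0] = -a
    psi[1:] = np.sin(-w[1:] * a) / w[1:]
    return 2.0 / (b - a) * K_hat * (psi - chi)

def cos_put_price_delta(t, S, T, K_hat, params, a=-1.0, b=1.0, N=256):
    # COS approximations (168)-(169) of h(t, S) and h_S(t, S) for a put.
    k = np.arange(N)
    u = k * np.pi / (b - a)
    cf = cf_Y_Q(u, T - t, *params)
    phase = np.exp(1j * u * (np.log(S / K_hat) - a))
    Vk = put_coeffs(a, b, K_hat, N)
    terms = np.real(cf * phase) * Vk
    terms_S = np.real(cf * 1j * u / S * phase) * Vk
    terms[0] *= 0.5
    terms_S[0] *= 0.5          # the first term of the cosine sum is halved
    return terms.sum(), terms_S.sum()

# illustrative parameters (rho, sigma, sigma_J, lam, m, delta), rho = (mu - r_f)/sigma
params = (0.46, 0.13, 0.16, 28.0, -0.004, 0.03)
price, delta_hedge = cos_put_price_delta(t=0.0, S=100.0, T=1/3, K_hat=100.0, params=params)
```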

Actor parametrization. We follow the form of the optimal stochastic policy given in (159) to parametrize our policy. To obtain a parsimonious parametrization, we introduce \phi_{1},\phi_{2},\phi_{3},\phi_{4},\phi_{5}, which respectively represent the agent's current understanding, to be updated as she learns, of the drift \mu, the volatility \sigma, the jump arrival rate \lambda, and the mean m and standard deviation \delta of the jump size of the risky asset. We emphasize that in our learning algorithm these parameters are not estimated by any statistical method (and hence they do not necessarily correspond to the true dynamics of the risky asset); rather, they are updated based on the hedging experiences (via exploration and exploitation) gained from the real environment. To implement (159), we also need two dependent parameters \phi_{6} and \phi_{7}, calculated by

ϕ6=ϕ1rfϕ2,ϕ7=ϕ3[exp(2ϕ4+2ϕ52)2exp(ϕ4+12ϕ52)+1].\phi_{6}=\frac{\phi_{1}-r_{f}}{\phi_{2}},\ \phi_{7}=\sqrt{\phi_{3}\left[\exp\left(2\phi_{4}+2\phi_{5}^{2}\right)-2\exp\left(\phi_{4}+\frac{1}{2}\phi_{5}^{2}\right)+1\right]}. (170)

These two parameters show the agent’s understanding of ρ\rho and σJ\sigma_{J}, and they will be updated in the learning process by the above equations with the update of the independent parameters ϕ1\phi_{1} to ϕ5\phi_{5}. Denote ϕ=(ϕi)i=17\phi=(\phi_{i})_{i=1}^{7}.

We use ϕ1,ϕ2,ϕ3,ϕ4,ϕ5,ϕ6,ϕ7\phi_{1},\phi_{2},\phi_{3},\phi_{4},\phi_{5},\phi_{6},\phi_{7} to replace their counterparts μ,σ,λ,m,δ,ρ,σJ\mu,\sigma,\lambda,m,\delta,\rho,\sigma_{J} in (159) to obtain a parametrized policy. For the mean part, we calculate the option price and its delta by plugging ϕ\phi into (168) and (169), with the results denoted by h^ϕ(t,S)\hat{h}^{\phi}(t,S) and h^Sϕ(t,S)\hat{h}_{S}^{\phi}(t,S) respectively, and approximate the integral by Gauss-Hermite (G-H) quadrature, which yields

ϕ(t,S,x)=ϕ22Sh^Sϕ(t,S)+ϕ32πq=1Qwq(exp(zq)1)(h^ϕ(t,Sexp(zq))h^ϕ(t,S))ϕ2ϕ6(xh^ϕ(t,S))ϕ22+ϕ72,\mathcal{M}^{\phi}(t,S,x)=\frac{\phi_{2}^{2}S\hat{h}_{S}^{\phi}(t,S)+\frac{\phi_{3}}{\sqrt{2\pi}}\sum\limits_{q=1}^{Q}w_{q}(\exp(z_{q})-1)\left(\hat{h}^{\phi}(t,S\exp(z_{q}))-\hat{h}^{\phi}(t,S)\right)-\phi_{2}\phi_{6}(x-\hat{h}^{\phi}(t,S))}{\phi_{2}^{2}+\phi_{7}^{2}}, (171)

where z_{q}=\phi_{4}+\phi_{5}u_{q} and \{(w_{q},u_{q}),q=1,\cdots,Q\} are the weights and abscissas of the Q-point G-H rule, which are predetermined and independent of \phi. For the variance part, we plug \phi into (149) to obtain f^{\phi}(t). Thus, our parametrized policy is given by

𝝅ϕ(|t,S,x)𝒩(|ϕ(t,S,x),θ2(ϕ22+ϕ72)fϕ(t)).\boldsymbol{\pi}^{\phi}(\cdot|t,S,x)\sim\mathcal{N}\left(\cdot\ \Big{|}\ \mathcal{M}^{\phi}(t,S,x),\frac{\theta}{2(\phi_{2}^{2}+\phi_{7}^{2})f^{\phi}(t)}\right). (172)
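
The sketch below illustrates how (170)–(172) might be implemented, with h(t,S) and h_S(t,S) supplied as callables (for instance, wrappers around the COS approximations above) and the probabilists' Gauss–Hermite rule from NumPy, whose nodes and weights satisfy (1/\sqrt{2\pi})\sum_q w_q f(\phi_4+\phi_5 u_q)\approx\mathbb{E}f(Z) for Z normal with mean \phi_4 and standard deviation \phi_5. Function names and the defaults Q=20, \theta=0.1, and r_f=0 are illustrative assumptions.

```python
import numpy as np

def dependent_params(phi, r_f=0.0):
    # phi = (phi1,...,phi5): the agent's mu, sigma, lambda, m, delta; returns (phi6, phi7) as in (170).
    mu, sig, lam, m, d = phi
    phi6 = (mu - r_f) / sig
    phi7 = np.sqrt(lam * (np.exp(2*m + 2*d**2) - 2*np.exp(m + 0.5*d**2) + 1.0))
    return phi6, phi7

def policy_mean(t, S, x, phi, h, h_S, Q=20):
    # Gauss-Hermite approximation (171) of M^phi(t, S, x); h(t, S) and h_S(t, S)
    # are callables, e.g. wrappers around the COS approximations sketched above.
    mu, sig, lam, m, d = phi
    phi6, phi7 = dependent_params(phi)
    u, w = np.polynomial.hermite_e.hermegauss(Q)        # rule for weight exp(-u^2/2)
    z = m + d * u
    hz = np.array([h(t, S * np.exp(zq)) for zq in z])
    jump = lam / np.sqrt(2.0 * np.pi) * np.sum(w * (np.exp(z) - 1.0) * (hz - h(t, S)))
    num = sig**2 * S * h_S(t, S) + jump - sig * phi6 * (x - h(t, S))
    return num / (sig**2 + phi7**2)

def sample_action(t, S, x, phi, h, h_S, T, theta=0.1, rng=None):
    # Draw one action from the Gaussian policy (172).
    rng = rng or np.random.default_rng(0)
    mu, sig, lam, m, d = phi
    phi6, phi7 = dependent_params(phi)
    f = np.exp(-phi6**2 * sig**2 / (sig**2 + phi7**2) * (T - t))   # f^phi(t), cf. (149)
    var = theta / (2.0 * (sig**2 + phi7**2) * f)
    return rng.normal(policy_mean(t, S, x, phi, h, h_S), np.sqrt(var))
```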

Critic parametrization. For the value function of 𝝅ϕ\boldsymbol{\pi}^{\phi}, following (132) we parametrize it as

Jψ,ϕ(t,S,x)=(xh^ϕ(t,S))2fϕ(t)+geψ(t,S)θ2[ln(πθϕ22+ϕ72)(Tt)+ϕ22ϕ622(ϕ22+ϕ72)(Tt)2].J^{\psi,\phi}(t,S,x)=(x-\hat{h}^{\phi}(t,S))^{2}f^{\phi}(t)+g_{e}^{\psi}(t,S)-\frac{\theta}{2}\left[\ln\left(\frac{\pi\theta}{\phi_{2}^{2}+\phi_{7}^{2}}\right)(T-t)+\frac{\phi_{2}^{2}\phi_{6}^{2}}{2(\phi_{2}^{2}+\phi_{7}^{2})}(T-t)^{2}\right]. (173)

We do not specify any parametric form for g_{e}^{\psi}(t,S), but will learn it using Gaussian process regression (Williams and Rasmussen 2006), where \psi denotes its parameters. We choose a Gaussian process for our problem because it can generate flexible shapes and often achieves a reasonably good fit with a moderate amount of data.
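
As a sketch, the critic (173) can be assembled from the agent's option-price function h^\phi (e.g. the COS approximation above) and any fitted regressor for g_e^\psi, such as the Gaussian process just mentioned. The helper below is a minimal illustration with hypothetical function arguments, again assuming r_f = 0.

```python
import numpy as np

def critic_value(t, S, x, phi, h, g_e, T, theta=0.1, r_f=0.0):
    # Assemble J^{psi,phi}(t, S, x) in (173) from callables h(t, S) and g_e(t, S).
    mu, sig, lam, m, d = phi
    phi6 = (mu - r_f) / sig
    phi7 = np.sqrt(lam * (np.exp(2*m + 2*d**2) - 2*np.exp(m + 0.5*d**2) + 1.0))
    f = np.exp(-phi6**2 * sig**2 / (sig**2 + phi7**2) * (T - t))       # f^phi(t)
    entropy_term = 0.5 * theta * (
        np.log(np.pi * theta / (sig**2 + phi7**2)) * (T - t)
        + phi6**2 * sig**2 / (2.0 * (sig**2 + phi7**2)) * (T - t)**2)
    return (x - h(t, S))**2 * f + g_e(t, S) - entropy_term
```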

Actor–critic learning. Consider a time grid {tk}k=0K\{t_{k}\}_{k=0}^{K} with t0=0t_{0}=0, tK=Tt_{K}=T, and time step Δt\Delta t. In an iteration, we start with a given policy 𝝅ϕ\boldsymbol{\pi}^{\phi} and use it to collect MM episodes from the environment, where the mm-th episode is given by {(tk,S^tk(m),Xtk𝝅ϕ,(m),atk𝝅ϕ,(m))}k=0K\{(t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)},a_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})\}_{k=0}^{K}. We then perform two steps.

In the critic step, the main task is to learn geψ(tk,S)g_{e}^{\psi}(t_{k},S) as a function of SS for each tkt_{k}. The stochastic representation (158) shows that geg_{e} can be viewed as the value function of a stream of rewards that depend on time and the risky asset price; so the task is a policy evaluation problem. We first obtain the running reward path in every episode, where the integral involving the Lévy measure is again calculated by the QQ-point G-H rule. Specifically, the reward at time tkt_{k} for k=0,,K1k=0,\cdots,K-1 is given by

Rtkg:=fϕ(tk)[ϕ22S^tk2(h^Sϕ(tk,S^tk))2+ϕ32πq=1Qwq(h^ϕ(tk,S^tkexp(zq))h^ϕ(tk,S^tk))2]Δt\displaystyle R_{t_{k}}^{g}:=f^{\phi}(t_{k})\left[\phi_{2}^{2}\hat{S}_{t_{k}}^{2}(\hat{h}^{\phi}_{S}(t_{k},\hat{S}_{t_{k}}))^{2}+\frac{\phi_{3}}{\sqrt{2\pi}}\sum_{q=1}^{Q}w_{q}\left(\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}}\exp(z_{q}))-\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}})\right)^{2}\right]\Delta t (174)
fϕ(tk)ϕ22+ϕ72[ϕ22S^tkh^Sϕ(tk,S^tk)+ϕ32πq=1Qwq(exp(zq)1)(h^ϕ(tk,S^tkexp(zq))h^ϕ(tk,S^tk))]2Δt,\displaystyle-\frac{f^{\phi}(t_{k})}{\phi_{2}^{2}+\phi_{7}^{2}}\left[\phi_{2}^{2}\hat{S}_{t_{k}}\hat{h}^{\phi}_{S}(t_{k},\hat{S}_{t_{k}})+\frac{\phi_{3}}{\sqrt{2\pi}}\sum_{q=1}^{Q}w_{q}(\exp(z_{q})-1)\left(\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}}\exp(z_{q}))-\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}})\right)\right]^{2}\Delta t, (175)

where the risky asset price path {(S^tk)}k=0K1\{(\hat{S}_{t_{k}})\}_{k=0}^{K-1} is collected from the real environment. We stress again that the above calculation uses the agent’s parameters ϕ\phi and does not involve any parameters of the true dynamics. We then fit a Gaussian process to the sample of the cumulative rewards {=kK1Rtg,(m)}m=1M\{\sum_{\ell=k}^{K-1}R_{t_{\ell}}^{g,(m)}\}_{m=1}^{M} to estimate geψ(tk,S)g_{e}^{\psi}(t_{k},S). Figure 2 illustrates the fit at three time points with the radial basis kernel. The Gaussian process is able to provide a good fit to the shape of ge(t,S)g_{e}(t,S) as a function of SS for each given tt.
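
A minimal sketch of the critic step is given below: the first function evaluates the running reward (174)–(175) (before multiplication by \Delta t) using the Q-point Gauss–Hermite rule, and the second fits a Gaussian process with an RBF kernel to the episode-wise cumulative rewards at a fixed grid time. The kernel hyperparameters and the scikit-learn interface are our illustrative choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def reward_ge(t, S, phi, h, h_S, T, Q=20, r_f=0.0):
    # Running reward in (174)-(175) before multiplication by Delta t, with the
    # Levy integrals replaced by the Q-point Gauss-Hermite rule and all model
    # quantities evaluated at the agent's parameters phi.
    mu, sig, lam, m, d = phi
    phi6 = (mu - r_f) / sig
    phi7 = np.sqrt(lam * (np.exp(2*m + 2*d**2) - 2*np.exp(m + 0.5*d**2) + 1.0))
    f = np.exp(-phi6**2 * sig**2 / (sig**2 + phi7**2) * (T - t))
    u, w = np.polynomial.hermite_e.hermegauss(Q)
    z = m + d * u
    dh = np.array([h(t, S * np.exp(zq)) for zq in z]) - h(t, S)
    quad_sq = lam / np.sqrt(2.0 * np.pi) * np.sum(w * dh**2)
    quad_lin = lam / np.sqrt(2.0 * np.pi) * np.sum(w * (np.exp(z) - 1.0) * dh)
    first = sig**2 * S**2 * h_S(t, S)**2 + quad_sq
    second = (sig**2 * S * h_S(t, S) + quad_lin)**2 / (sig**2 + phi7**2)
    return f * (first - second)

def fit_ge_gp(S_at_tk, cum_rewards):
    # Critic step at one grid time t_k: regress the episode-wise cumulative rewards
    # sum_{l >= k} R^g_{t_l} on S_{t_k} with a Gaussian process (RBF kernel).
    kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1e-2)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    return gp.fit(np.asarray(S_at_tk).reshape(-1, 1), np.asarray(cum_rewards))
```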

In the actor step, we update the policy parameter ϕi\phi_{i} for i=1,,5i=1,\cdots,5 by

ϕi\displaystyle\phi_{i}\leftarrow ϕiαϕiMm=1Mk=0K1[θlog𝝅ϕ(atk𝝅ϕ,(m)|tk,S^tk(m),Xtk𝝅ϕ,(m))Δt\displaystyle~{}\phi_{i}-\frac{\alpha_{\phi_{i}}}{M}\sum_{m=1}^{M}\sum_{k=0}^{K-1}\left.\Big{[}\theta\log\boldsymbol{\pi}^{\phi}(a_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)}|~{}t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})\Delta t\right. (176)
\displaystyle\left.+J^{\psi,\phi}(t_{k+1},\hat{S}_{t_{k+1}}^{(m)},X_{t_{k+1}}^{\boldsymbol{\pi}^{\phi},(m)})-J^{\psi,\phi}(t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})-\beta J^{\psi,\phi}(t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})\Delta t\Big{]}\right. (177)
ϕilog𝝅ϕ(atk𝝅ϕ,(m)|tk,S^tk(m),Xtk𝝅ϕ,(m)).\displaystyle\ \cdot\frac{\partial}{\partial\phi_{i}}\log\boldsymbol{\pi}^{\phi}(a_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)}|~{}t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)}). (178)

This update essentially follows from (85); the difference is that we do batch processing and update the parameters only after observing all the episodes in a batch. Furthermore, we flip the signs in front of \alpha_{\phi_{i}} and \theta because we consider a minimization problem here, whereas the problem in the general discussion is a maximization.
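
A sketch of this batch update is given below. The paper does not specify how \partial/\partial\phi_{i}\log\boldsymbol{\pi}^{\phi} is computed, so the sketch approximates the score by central finite differences through a user-supplied log_pi(a, t, S, x, phi) that re-evaluates the policy (172), including h^\phi and f^\phi, at the perturbed \phi; the critic J(t, S, x, phi) is also supplied as a callable. The default learning rates are those quoted in the simulation study, and \beta=0 is used as an illustrative default; both are assumptions of this sketch.

```python
import numpy as np

def actor_step(phi, episodes, J, log_pi, theta=0.1, beta=0.0,
               lr=(5e-3, 2e-5, 15.0, 1e-6, 1e-6), eps=1e-6):
    # One batch update of (phi_1,...,phi_5) following (176)-(178).
    #   episodes : list of trajectories [(t_k, S_k, X_k, a_k), ...] on the time grid;
    #   J(t, S, x, phi)        : the critic J^{psi,phi};
    #   log_pi(a, t, S, x, phi): log-density of the policy (172) at the given phi.
    phi = np.asarray(phi, dtype=float)
    grad = np.zeros(phi.size)
    M = len(episodes)
    for ep in episodes:
        for (t, S, x, a), (t1, S1, x1, _) in zip(ep[:-1], ep[1:]):
            dt = t1 - t
            td = (theta * log_pi(a, t, S, x, phi) * dt
                  + J(t1, S1, x1, phi) - J(t, S, x, phi)
                  - beta * J(t, S, x, phi) * dt)
            for i in range(phi.size):
                e = np.zeros(phi.size); e[i] = eps
                score_i = (log_pi(a, t, S, x, phi + e)
                           - log_pi(a, t, S, x, phi - e)) / (2.0 * eps)
                grad[i] += td * score_i / M
    # minimization problem: move against the gradient (the sign flip noted above)
    return phi - np.asarray(lr) * grad
```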

Figure 2: Fit of Gaussian process to the cumulative reward data for geg_{e} at tkt_{k} (k=0,10,20k=0,10,20) for a one-month put option with strike 𝒦=100\mathcal{K}=100. The variable SS is on the xx-axis. We use the MLE estimates of the MJD model shown in Table 2 to calculate the rewards for geg_{e} from the sampled index price paths.

6.3 Simulation study

We check the convergence behavior of our actor–critic algorithm in a simulation study. We assume the environment is described by an MJD model with its true parameters shown in the upper row of Table 2 (these values are chosen close to the estimates obtained from the market data of the S&P 500 in Table 1, so that the MJD model better resembles reality in the simulation study). We consider hedging a 4-month put option with strike price \mathcal{K}=100, because short-term options with maturities of a few months are typically traded much more actively than longer-term options. We assume the risk-free rate is zero.
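
For reference, a simulator of this MJD environment might look as follows. We assume the standard Merton specification in which the log-return over each step is Gaussian plus a Poisson number of N(m,\delta^2) jumps, with the drift compensated so that the expected price grows at rate \mu. The grid of 84 daily steps for a 4-month option and the batch size M=32 follow the setup described here; the function itself is an illustrative sketch.

```python
import numpy as np

def simulate_mjd_paths(S0, mu, sigma, lam, m, delta, T, n_steps, n_paths, rng=None):
    # Sample price paths of a Merton jump-diffusion by exact simulation of the
    # log-return over each step: a Gaussian diffusion part plus a Poisson number
    # of N(m, delta^2) jumps, with the jump compensator subtracted from the drift.
    rng = rng or np.random.default_rng(0)
    dt = T / n_steps
    kappa = np.exp(m + 0.5 * delta**2) - 1.0          # mean relative jump size
    drift = (mu - 0.5 * sigma**2 - lam * kappa) * dt
    logS = np.full((n_paths, n_steps + 1), np.log(S0))
    for k in range(n_steps):
        n_jumps = rng.poisson(lam * dt, size=n_paths)
        jumps = rng.normal(m * n_jumps, delta * np.sqrt(n_jumps))   # sum of jump sizes
        diff = sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
        logS[:, k + 1] = logS[:, k] + drift + diff + jumps
    return np.exp(logS)

# true parameters of Table 2; a 4-month horizon on a daily grid, M = 32 episodes
paths = simulate_mjd_paths(S0=100.0, mu=0.06, sigma=0.13, lam=28.0,
                           m=-0.004, delta=0.03, T=4/12, n_steps=84, n_paths=32)
```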

We use 20 quadrature points for the G-H rule in our implementation, which suffices for high accuracy. In each iteration, we sample M=32 episodes from the environment simulated by the MJD model. The learning rates are chosen to be \alpha_{\phi_{1}}=5\mathrm{e}{-3}, \alpha_{\phi_{2}}=2\mathrm{e}{-5}, \alpha_{\phi_{3}}=15, \alpha_{\phi_{4}}=1\mathrm{e}{-6}, \alpha_{\phi_{5}}=1\mathrm{e}{-6}, and the temperature parameter is \theta=0.1. We keep the learning rates constant in the first 30 iterations and then let them decay at rate l(j)=j^{-0.5}, where j is the iteration index. Before our actor–critic learning, we sample a 15-year path from the MJD model as data for MLE and obtain the estimates of the parameters shown in Table 2. MLE estimates all the parameters quite well except the drift \mu, whose estimate is more than twice the true value (this is expected due to the well-known mean-blur problem: 15 years of data are far from sufficient to obtain a reasonably accurate estimate of the mean by statistical methods). We then use these estimates as initial values of \phi_{1} to \phi_{5} for learning. Figure 3 clearly shows that our RL algorithm converges to near the true values. In particular, the estimate of \mu eventually moves closely around the true level despite being distant from it initially.

Parameters
True μ=0.06,σ=0.13,λ=28,m=0.004,δ=0.03\mu=0.06,\sigma=0.13,\lambda=28,m=-0.004,\delta=0.03
MLE μ=0.1289,σ=0.1331,λ=26.1091,m=0.0033,δ=0.0317\mu=0.1289,\sigma=0.1331,\lambda=26.1091,m=-0.0033,\delta=0.0317
Table 2: Parameters of the true MJD model and MLE estimates from 15 years of data
Figure 3: Training paths of five parameters (iteration index on the xx-axis)
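
The MLE initialization referred to above can be sketched as follows, writing the daily log-return density under the MJD model as a Poisson mixture of normals truncated at n_max jumps per day. The truncation level, the optimizer, and the starting point are our own illustrative choices and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, poisson

def mjd_negloglik(params, returns, dt=1/252, n_max=5):
    # Negative log-likelihood of daily log-returns under the Merton jump-diffusion,
    # written as a Poisson mixture of normals and truncated at n_max jumps per day.
    mu, sigma, lam, m, delta = params
    if sigma <= 0 or lam <= 0 or delta <= 0:
        return np.inf
    kappa = np.exp(m + 0.5 * delta**2) - 1.0
    loc0 = (mu - 0.5 * sigma**2 - lam * kappa) * dt
    dens = np.zeros_like(returns, dtype=float)
    for n in range(n_max + 1):
        dens += poisson.pmf(n, lam * dt) * norm.pdf(
            returns, loc=loc0 + n * m, scale=np.sqrt(sigma**2 * dt + n * delta**2))
    return -np.sum(np.log(dens + 1e-300))

def fit_mjd(returns, start=(0.1, 0.15, 20.0, 0.0, 0.02)):
    res = minimize(mjd_negloglik, start, args=(np.asarray(returns),),
                   method="Nelder-Mead", options={"maxiter": 20000})
    return res.x          # used to initialize phi_1, ..., phi_5
```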

6.4 Empirical study

In the empirical study we consider hedging a European put option written on the S&P 500 index. This type of option is popular in the financial market as a tool for investors to protect their high-beta investments. We obtain daily observations of the index from the beginning of 2000 to the end of 2023 and split them into two periods: the training period (the first 20 years) and the test period (the last 4 years). The test period includes the U.S. stock market meltdown at the onset of the pandemic, the 2022 bear market, and various bull periods.

We use the MLE estimates from the first 20 years of observations to initialize ϕ1\phi_{1} to ϕ5\phi_{5} for learning. To sample an episode from the environment for training our RL algorithm, we bootstrap index returns from the training period. Similarly, we bootstrap index returns from the test period to generate index price paths to test the learned policy. We use the 20-point G-H rule with M=32M=32 episodes in each training iteration. The learning rates are set as αϕ1=5e3\alpha_{\phi_{1}}=5\mathrm{e}{-3}, αϕ2=5e4\alpha_{\phi_{2}}=5\mathrm{e}{-4}, αϕ3=50\alpha_{\phi_{3}}=50, αϕ4=5e6\alpha_{\phi_{4}}=5\mathrm{e}{-6}, αϕ5=1e5\alpha_{\phi_{5}}=1\mathrm{e}{-5} with the same learning rate decay schedule as in the simulation study. We set the temperature parameter θ=0.1\theta=0.1.
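
A minimal sketch of the bootstrap used to generate episodes: daily log-returns are resampled i.i.d. with replacement from the relevant period and compounded from an initial price drawn uniformly from [0.7\mathcal{K},1.3\mathcal{K}] (with \mathcal{K}_{p}=1, as discussed below). The synthetic return sample in the usage example is a placeholder for the actual index returns.

```python
import numpy as np

def bootstrap_episode(log_returns, S0, n_steps, rng=None):
    # One index-price episode of n_steps periods obtained by resampling
    # historical daily log-returns i.i.d. with replacement.
    rng = rng or np.random.default_rng(0)
    sampled = rng.choice(log_returns, size=n_steps, replace=True)
    return S0 * np.exp(np.concatenate(([0.0], np.cumsum(sampled))))

# illustrative usage; in practice log_returns = np.diff(np.log(index_levels)) of the
# training (or test) period, and the strike is normalized so that K_p = 1
rng = np.random.default_rng(2)
log_returns = rng.normal(2e-4, 0.012, size=5000)       # placeholder return sample
S0 = rng.uniform(0.7, 1.3)                             # initial price in [0.7K, 1.3K], K = 1
path = bootstrap_episode(log_returns, S0, n_steps=21)  # a one-month daily episode
```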

We consider four maturities ranging from 1 to 4 months for the put option. For a standard S&P 500 index option traded on the CBOE, the strike is in practice quoted in points commensurate with the scale of the index level and converted to a dollar amount by a multiplier of 100. That is, the dollar strike is \mathcal{K}=100\mathcal{K}_{p}, where \mathcal{K}_{p} is the strike in points (for example, for an at-the-money contract when the current index is 5000 points, we have \mathcal{K}_{p}=5000 and the contract's notional size is 500,000 dollars). Let S_{t} be the index level at time t converted to dollars by the multiplier of 100. We sample S_{0} uniformly from the range [0.7\mathcal{K},1.3\mathcal{K}] to generate training and test episodes, which allows us to include scenarios where the option is deep out of the money or in the money. We train our algorithm for 100 iterations and then test it on 5000 index price episodes. Finally, although we use a stochastic policy to interact with the environment for learning/updating the parameters during training, in the test stage we apply the deterministic policy given by the mean of the learned stochastic policy, which reduces the variance of the final hedging error.

To measure the test performance, we consider the mean squared hedging error (denominated in squared dollars) divided by \mathcal{K}_{p}^{2}. This effectively normalizes the contract's scale, so we can simply set \mathcal{K}_{p}=1 in our implementation. The test result can then be interpreted as the mean squared hedging error, in squared dollars, per point of strike price.

We compare the performance of the policy learned by our RL algorithm with that of the policy obtained by plugging the MLE estimates from the training-period data into (160). Both policies are tested on the same index price episodes. We check the statistical significance of the difference in the squared hedging errors of the two policies with a t-test. Table 3 reports the mean squared hedging error of each policy as well as the p-value of the t-test. Not surprisingly, the hedging error of each policy increases as the maturity becomes longer, due to the greater uncertainty in the problem. In every case, the RL policy achieves a notable reduction in hedging error compared with the MLE-based policy, and the p-value shows that the outperformance is statistically significant.

Maturity RL policy MLE policy p-value
1 month 0.3513 (0.034) 0.4006 (0.043) 3.2e-7
2 months 0.5668 (0.029) 0.6796 (0.050) 1.2e-5
3 months 0.8283 (0.031) 1.0080 (0.060) 1.6e-5
4 months 0.8649 (0.047) 1.1544 (0.068) 4.2e-13
Table 3: Mean squared hedging errors of the two policies in the empirical test with their standard errors shown in parenthesis
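
For concreteness, the test metric and the significance check above might be computed as follows. Since the exact variant of the t-test is not specified in the text, the sketch uses a paired t-test, which is one natural choice given that both policies are evaluated on the same episodes; the hedging errors below are placeholders.

```python
import numpy as np
from scipy import stats

# Squared terminal hedging errors of the two policies on the same test episodes,
# normalized by the squared strike in points (K_p = 1 in the implementation).
rng = np.random.default_rng(3)
err_rl = rng.normal(0.0, 0.6, size=5000) ** 2          # placeholder hedging errors
err_mle = rng.normal(0.0, 0.65, size=5000) ** 2

mse_rl, mse_mle = err_rl.mean(), err_mle.mean()
se_rl = err_rl.std(ddof=1) / np.sqrt(err_rl.size)       # standard error, as in Table 3
t_stat, p_value = stats.ttest_rel(err_rl, err_mle)      # paired test on common episodes
print(mse_rl, mse_mle, se_rl, p_value)
```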

7 Conclusions

Fluctuations in data or time series coming from nonlinear, complex dynamic systems are of two types: slow changes and sudden jumps, the latter occurring much more rarely than the former. Hence jump-diffusions capture the key structural characteristics of many data generating processes in areas such as physics, astrophysics, earth science, engineering, finance, and medicine. As a result, RL for jump-diffusions is important both theoretically and practically. This paper endeavors to lay a theoretical foundation for this line of study. A key insight from this research is that temporal-difference algorithms designed for diffusions can work seamlessly for jump-diffusions. However, unless one uses general neural network approximators, policy parameterization does need to respond to the presence of jumps if one is to take advantage of any special structure of the underlying problem.

There are plenty of open questions going forward, including the convergence of the algorithms, regret bounds, decay rates of the temperature parameters, and learning rates of the gradient descent and/or stochastic approximation procedures involved.

Acknowledgements

Gao acknowledges support from the Hong Kong Research Grants Council [GRF 14201424, 14212522, 14200123]. Li is supported by the Hong Kong Research Grant Council [GRF 14216322]. Zhou is supported by the Nie Center for Intelligent Asset Management at Columbia University.

References

  • Aït-Sahalia et al. (2012) Aït-Sahalia, Y., J. Jacod, and J. Li (2012). Testing for jumps in noisy high frequency data. Journal of Econometrics 168(2), 207–222.
  • Andersen et al. (2002) Andersen, T. G., L. Benzoni, and J. Lund (2002). An empirical investigation of continuous-time equity return models. The Journal of Finance 57(3), 1239–1284.
  • Applebaum (2009) Applebaum, D. (2009). Lévy Processes and Stochastic Calculus. Cambridge University Press.
  • Bates (1991) Bates, D. S. (1991). The crash of 87: was it expected? The evidence from options markets. The Journal of Finance 46(3), 1009–1044.
  • Bates (1996) Bates, D. S. (1996). Jumps and stochastic volatility: Exchange rate processes implicit in Deutsche Mark options. The Review of Financial Studies 9(1), 69–107.
  • Bender and Thuan (2023) Bender, C. and N. T. Thuan (2023). Entropy-regularized mean-variance portfolio optimization with jumps. arXiv preprint arXiv:2312.13409.
  • Bo et al. (2024) Bo, L., Y. Huang, X. Yu, and T. Zhang (2024). Continuous-time q-learning for jump-diffusion models under Tsallis entropy. arXiv preprint arXiv:2407.03888.
  • Cai and Kou (2011) Cai, N. and S. G. Kou (2011). Option pricing under a mixed-exponential jump diffusion model. Management Science 57(11), 2067–2081.
  • Cont and Tankov (2004) Cont, R. and P. Tankov (2004). Financial Modelling with Jump Processes. Chapman and Hall/CRC.
  • Dai et al. (2023) Dai, M., Y. Dong, and Y. Jia (2023). Learning equilibrium mean–variance strategy. Mathematical Finance 33(4), 1166–1212.
  • Das (2002) Das, S. R. (2002). The surprise element: Jumps in interest rates. Journal of Econometrics 106(1), 27–65.
  • Denkert et al. (2024) Denkert, R., H. Pham, and X. Warin (2024). Control randomisation approach for policy gradient and application to reinforcement learning in optimal switching. arXiv preprint arXiv:2404.17939.
  • Ethier and Kurtz (1986) Ethier, S. N. and T. G. Kurtz (1986). Markov Processes: Characterization and Convergence. John Wiley & Sons.
  • Fang and Oosterlee (2009) Fang, F. and C. W. Oosterlee (2009). A novel pricing method for European options based on Fourier-cosine series expansions. SIAM Journal on Scientific Computing 31(2), 826–848.
  • Gammaitoni et al. (1998) Gammaitoni, L., P. Hänggi, P. Jung, and F. Marchesoni (1998). Stochastic resonance. Reviews of Modern Physics 70(1), 223–287.
  • Gao et al. (2022) Gao, X., Z. Q. Xu, and X. Y. Zhou (2022). State-dependent temperature control for langevin diffusions. SIAM Journal on Control and Optimization 60(3), 1250–1268.
  • Giraudo and Sacerdote (1997) Giraudo, M. T. and L. Sacerdote (1997). Jump-diffusion processes as models for neuronal activity. Biosystems 40(1-2), 75–82.
  • Goswami et al. (2018) Goswami, B., N. Boers, A. Rheinwalt, N. Marwan, J. Heitzig, S. Breitenbach, and J. Kurths (2018). Abrupt transitions in time series with uncertainties. Nature Communications 9(1), 48–57.
  • Guo et al. (2023) Guo, X., A. Hu, and Y. Zhang (2023). Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. SIAM Journal on Control and Optimization 61(2), 755–787.
  • Guo et al. (2022) Guo, X., R. Xu, and T. Zariphopoulou (2022). Entropy regularization for mean field games with learning. Mathematics of Operations Research 47(4), 3239–3260.
  • Heyde and Kou (2004) Heyde, C. C. and S. G. Kou (2004). On the controversy over tailweight of distributions. Operations Research Letters 32(5), 399–408.
  • Huang et al. (2022) Huang, Y., Y. Jia, and X. Zhou (2022). Achieving mean–variance efficiency by continuous-time reinforcement learning. In Proceedings of the Third ACM International Conference on AI in Finance, pp.  377–385.
  • Jacod and Shiryaev (2013) Jacod, J. and A. Shiryaev (2013). Limit Theorems for Stochastic Processes. Springer.
  • Jia and Zhou (2022a) Jia, Y. and X. Y. Zhou (2022a). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research 23(154), 1–55.
  • Jia and Zhou (2022b) Jia, Y. and X. Y. Zhou (2022b). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research 23(154), 1–55.
  • Jia and Zhou (2023) Jia, Y. and X. Y. Zhou (2023). q-learning in continuous time. Journal of Machine Learning Research 24(161), 1–61.
  • Kou (2002) Kou, S. G. (2002). A jump-diffusion model for option pricing. Management Science 48(8), 1086–1101.
  • Kunita (2004) Kunita, H. (2004). Stochastic differential equations based on Lévy processes and stochastic flows of diffeomorphisms. In M. M. Rao (Ed.), Real and Stochastic Analysis: New Perspectives, pp.  305–373. Birkhäuser.
  • Kushner (2000) Kushner, H. J. (2000). Jump-diffusions with controlled jumps: Existence and numerical methods. Journal of Mathematical Analysis and Applications 249(1), 179–198.
  • Li and Linetsky (2014) Li, L. and V. Linetsky (2014). Time-changed Ornstein–Uhlenbeck processes and their applications in commodity derivative models. Mathematical Finance 24(2), 289–330.
  • Lim (2005) Lim, A. E. (2005). Mean-variance hedging when there are jumps. SIAM Journal on Control and Optimization 44(5), 1893–1922.
  • Merton (1976) Merton, R. C. (1976). Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics 3(1-2), 125–144.
  • Munos (2006) Munos, R. (2006). Policy gradient in continuous time. Journal of Machine Learning Research 7, 771–791.
  • Øksendal and Sulem (2007) Øksendal, B. K. and A. Sulem (2007). Applied Stochastic Control of Jump Diffusions, Volume 498. Springer.
  • Park et al. (2021) Park, S., J. Kim, and G. Kim (2021). Time discretization-invariant safe action repetition for policy gradient methods. Advances in Neural Information Processing Systems 34, 267–279.
  • Reisinger and Zhang (2021) Reisinger, C. and Y. Zhang (2021). Regularity and stability of feedback relaxed controls. SIAM Journal on Control and Optimization 59(5), 3118–3151.
  • Sato (1999) Sato, K.-I. (1999). Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press.
  • Schweizer (1996) Schweizer, M. (1996). Approximation pricing and the variance-optimal martingale measure. The Annals of Probability 24(1), 206–236.
  • Schweizer (2001) Schweizer, M. (2001). A guided tour through quadratic hedging approaches. In E. Jouini, J. Cvitanic, and M. Musiela (Eds.), Handbooks in Mathematical Finance: Option Pricing, Interest Rates and Risk Management, pp.  538–74. Cambridge University Press.
  • Situ (2006) Situ, R. (2006). Theory of Stochastic Differential Equations with Jumps and Applications. Springer.
  • Tallec et al. (2019) Tallec, C., L. Blier, and Y. Ollivier (2019). Making deep q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732.
  • Wang et al. (2023) Wang, B., X. Gao, and L. Li (2023). Reinforcement learning for continuous-time optimal execution: actor-critic algorithm and error analysis. Available at SSRN 4378950.
  • Wang and Zheng (2022) Wang, B. and X. Zheng (2022). Testing for the presence of jump components in jump diffusion models. Journal of Econometrics 230(2), 483–509.
  • Wang et al. (2020) Wang, H., T. Zariphopoulou, and X. Y. Zhou (2020). Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research 21(198), 1–34.
  • Wang and Zhou (2020) Wang, H. and X. Y. Zhou (2020). Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance 30(4), 1273–1308.
  • Williams and Rasmussen (2006) Williams, C. K. and C. E. Rasmussen (2006). Gaussian Processes for Machine Learning. MIT Press.
  • Wu and Li (2024) Wu, B. and L. Li (2024). Reinforcement learning for continuous-time mean–variance portfolio selection in a regime-switching market. Journal of Economic Dynamics and Control 158, 104787.
  • Zhou and Li (2000) Zhou, X. Y. and D. Li (2000). Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization 42(1), 19–33.
  • Zhu et al. (2015) Zhu, C., G. Yin, and N. A. Baran (2015). Feynman–Kac formulas for regime-switching jump diffusions and their applications. Stochastics 87(6), 1000–1032.

Appendix A Proofs

Proof of Lemma 2.

We observe that for each k=1,,k=1,\cdots,\ell,

×[0,1]n|γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u=𝒜|γk(t,x,a,z)|pνk(dz)𝝅(a|t,x)𝑑a.\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right|^{p}\nu_{k}(dz)du=\int_{\mathcal{A}}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{p}\nu_{k}(dz)\boldsymbol{\pi}(a|t,x)da. (179)

From Assumption 1-(iii), we have k=1|γk(t,x,a,z)|pνk(dz)Cp(1+|x|p)\sum_{k=1}^{\ell}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{p}\nu_{k}(dz)\leq C_{p}(1+|x|^{p}) for any (t,x,a)[0,T]×d×𝒜(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}, where CpC_{p} does not depend on aa. Thus, integrating over aa with the measure 𝝅(a|t,x)da\boldsymbol{\pi}(a|t,x)da preserves the linear growth property. ∎

Proof of Lemma 3.

We consider

×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u,\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du, (180)

which is bounded by

Cp\displaystyle C_{p} (×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)du\displaystyle\left(\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du\right. (181)
+×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)du)\displaystyle\left.+\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du\right) (182)

for some constant Cp>0C_{p}>0.

For the first integral, using Assumption 2 we obtain

×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u\displaystyle\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du (183)
\displaystyle\leq CK,p[0,1]n|G𝝅(t,x,u)G𝝅(t,x,u)|p𝑑uCK,p|xx|p.\displaystyle C_{K,p}\int_{[0,1]^{n}}\left|G^{\boldsymbol{\pi}}(t,x,u)-G^{\boldsymbol{\pi}}(t,x^{\prime},u)\right|^{p}du\leq C^{\prime}_{K,p}|x-x^{\prime}|^{p}. (184)

For the second integral, we have

×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u\displaystyle\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du (185)
=\displaystyle= 𝒜|γk(t,x,a,z)γk(t,x,a,z)|pνk(dz)𝝅(a|t,x)𝑑a\displaystyle\int_{\mathcal{A}}\int_{\mathbb{R}}\left|\gamma_{k}\left(t,x,a,z\right)-\gamma_{k}\left(t,x^{\prime},a,z\right)\right|^{p}\nu_{k}(dz)\boldsymbol{\pi}(a|t,x^{\prime})da (186)
\displaystyle\leq 𝒜CK,p′′|xx|p𝝅(a|t,x)𝑑a\displaystyle\int_{\mathcal{A}}C^{\prime\prime}_{K,p}|x-x^{\prime}|^{p}\boldsymbol{\pi}(a|t,x^{\prime})da (187)
=\displaystyle= CK,p′′|xx|p,\displaystyle C^{\prime\prime}_{K,p}|x-x^{\prime}|^{p}, (188)

where we used Assumption 1-(ii) and CK,p′′C^{\prime\prime}_{K,p} is the constant there. The desired claim is obtained by combining these results. ∎

Proof of Lemma 4.

Fix t[0,T)t\in[0,T) and suppose X~tπ=x\tilde{X}^{\pi}_{t}=x. Define a sequence of stopping times τn=inf{st:|X~sπ|n}\tau_{n}=\inf\{s\geq t:|\tilde{X}^{\pi}_{s}|\geq n\} for nn\in\mathbb{N}. Applying Itô’s formula (15) to the process eβsϕ(s,X~sπ)e^{-\beta s}\phi(s,\tilde{X}^{\pi}_{s}), where X~sπ\tilde{X}^{\pi}_{s} follows the exploratory SDE (42), we obtain t[t,T]\forall t^{\prime}\in[t,T],

\displaystyle e^{-\beta(t^{\prime}\wedge\tau_{n})}\phi(t^{\prime}\wedge\tau_{n},\tilde{X}^{\pi}_{t^{\prime}\wedge\tau_{n}})-e^{-\beta t}\phi(t,x) (189)
=\displaystyle= ttτneβs(𝝅ϕ(s,X~sπ)βϕ(s,X~sπ))𝑑s\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\left(\mathcal{L}^{\boldsymbol{\pi}}\phi(s,\tilde{X}^{\pi}_{s-})-\beta\phi(s,\tilde{X}^{\pi}_{s-})\right)ds (190)
+\displaystyle+ ttτneβsϕx(s,X~sπ)σ~(s,X~sπ,π(|s,X~sπ))dWs\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\phi_{x}(s,\tilde{X}^{\pi}_{s-})\circ\tilde{\sigma}(s,\tilde{X}^{\pi}_{s-},\pi(\cdot|s,\tilde{X}^{\pi}_{s-}))dW_{s} (191)
+\displaystyle+ ttτneβsk=1×[0,1]n(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du),\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}^{\prime}_{k}(ds,dz,du), (192)

where \mathcal{L}^{\boldsymbol{\pi}} is the generator of the exploratory process (\tilde{X}^{\pi}_{t}) given in (40). We next show that the expectations of (191) and (192) are zero.

Note that for s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}], |X~sπ|n|\tilde{X}^{\pi}_{s-}|\leq n. Thus, ϕx(s,X~sπ)\phi_{x}(s,\tilde{X}^{\pi}_{s-}) is also bounded. Using the linear growth of σ~(s,X~sπ,π(|s,X~sπ))\tilde{\sigma}(s,\tilde{X}^{\pi}_{s-},\pi(\cdot|s,\tilde{X}^{\pi}_{s-})) in Lemma 1 and the moment estimate (52), we can see that (191) is a square-integrable martingale and hence its expectation is zero.

Next, we analyze the following stochastic integral:

ttτneβs×[0,1]n(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du).\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}_{k}(ds,dz,du). (194)

We consider the finite and infinite jump activity cases separately.

Case 1: νk(dz)<\int_{\mathbb{R}}\nu_{k}(dz)<\infty. In this case, both of the processes

ttτneβs×[0,1]nϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))N~k(ds,dz,du),\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\widetilde{N}_{k}(ds,dz,du), (195)
ttτneβs×[0,1]nϕ(s,X~sπ)N~k(ds,dz,du),\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\phi(s,\tilde{X}^{\pi}_{s-})\widetilde{N}_{k}(ds,dz,du), (196)

are square-integrable martingales and hence have zero expectations. We prove this claim for (195); the analysis of (196) is entirely analogous.

Using the polynomial growth of ϕ\phi, Lemma 2, and |X~sπ|n|\tilde{X}^{\pi}_{s-}|\leq n for s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}], we obtain

𝔼t,x¯[ttτne2βs×[0,1]n|ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))|2νk(dz)𝑑u𝑑s]\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-2\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\left|\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\right|^{2}\nu_{k}(dz)duds\right] (197)
\displaystyle\leq Cp𝔼t,x¯[ttτn×[0,1]n(1+|X~sπ|p+|γk(s,X~sπ,G𝝅(s,X~sπ,u),z))|p)νk(dz)duds]\displaystyle C_{p}\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}\int_{\mathbb{R}\times[0,1]^{n}}\left(1+|\tilde{X}^{\pi}_{s-}|^{p}+|\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))|^{p}\right)\nu_{k}(dz)duds\right] (198)
\displaystyle\leq Cp𝔼t,x¯[ttτn(1+|X~sπ|p)𝑑s]<.\displaystyle C_{p}^{\prime}\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}(1+|\tilde{X}^{\pi}_{s-}|^{p})ds\right]<\infty. (199)

This implies the process (195) is a square-integrable martingale; see e.g. (Situ, 2006, Section 1.9) for square-integrability of stochastic integrals with respect to compensated Poisson random measures.

Case 2: νk(dz)=\int_{\mathbb{R}}\nu_{k}(dz)=\infty. Let B1={|z|1}×[0,1]nB_{1}=\{|z|\leq 1\}\times[0,1]^{n} and B1c={|z|>1}×[0,1]nB_{1}^{c}=\{|z|>1\}\times[0,1]^{n}. The stochastic integral (194) can be written as the sum of two integrals:

ttτneβsB1(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du)\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}_{k}(ds,dz,du) (200)
+\displaystyle+ ttτneβsB1c(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du).\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}^{c}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}_{k}(ds,dz,du). (201)

Using the mean-value theorem, the stochastic integral (200) is equal to

ttτneβsB1ϕx(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))γk(s,X~sπ,G𝝅(s,X~sπ,u),z)N~k(ds,dz,du)\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}}\phi_{x}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\circ\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\widetilde{N}_{k}(ds,dz,du) (202)

for some αs[0,1]\alpha_{s}\in[0,1]. For s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}], |X~sπ|n|\tilde{X}^{\pi}_{s-}|\leq n. By Assumption 1-(v), |γk(s,X~sπ,G𝝅(s,X~sπ,u),z)||\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)| is bounded for any |z|1|z|\leq 1, s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}] and u[0,1]nu\in[0,1]^{n}, which further implies the boundedness of |ϕx(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))||\phi_{x}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))|. Now, for each j=1,,dj=1,\cdots,d,

ttτneβsB1ϕxj(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))γjk(s,X~sπ,G𝝅(s,X~sπ,u),z)N~k(ds,dz,du)\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}}\phi_{x_{j}}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\gamma_{jk}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\widetilde{N}_{k}(ds,dz,du) (203)

is a square-integrable martingale because

𝔼t,x¯[ttτne2βsB1ϕxj2(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))γjk2(s,X~sπ,G𝝅(s,X~sπ,u),z)νk(dz)𝑑u𝑑s]\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-2\beta s}\int_{B_{1}}\phi_{x_{j}}^{2}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\gamma_{jk}^{2}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\nu_{k}(dz)duds\right] (204)
\displaystyle\leq C𝔼t,x¯[ttτnB1γjk2(s,X~sπ,G𝝅(s,X~sπ,u),z)νk(dz)𝑑u𝑑s]\displaystyle C\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}\int_{B_{1}}\gamma_{jk}^{2}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\nu_{k}(dz)duds\right] (205)
\displaystyle\leq C𝔼t,x¯[ttτn(1+|X~sπ|2)𝑑s]<,\displaystyle C^{\prime}\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}(1+|\tilde{X}^{\pi}_{s-}|^{2})ds\right]<\infty, (206)

where we used Lemma 2 and boundedness of |X~sπ||\tilde{X}^{\pi}_{s-}| in the above. Thus, (200) is a square-integrable martingale with mean zero.

For (201), we can use the same arguments as in the finite activity case by observing |z|>1νk(dz)<\int_{|z|>1}\nu_{k}(dz)<\infty to show that each of the two processes in (201) is a square-integrable martingale with mean zero.

Combining the results above, setting t=Tt^{\prime}=T, and taking expectation, we obtain

𝔼t,x¯[eβ(Tτn)ϕ(X~Tτn𝝅)]eβtϕ(t,x)\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[e^{-\beta(T\wedge\tau_{n})}\phi(\tilde{X}^{\boldsymbol{\pi}}_{T\wedge\tau_{n}})\right]-e^{-\beta t}\phi(t,x) =𝔼t,x¯[tTτneβs(𝝅ϕ(s,X~sπ)βϕ(s,X~sπ))𝑑s].\displaystyle=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T\wedge\tau_{n}}e^{-\beta s}\left(\mathcal{L}^{\boldsymbol{\pi}}\phi(s,\tilde{X}^{\pi}_{s-})-\beta\phi(s,\tilde{X}^{\pi}_{s-})\right)ds\right].

As ϕ(s,x)\phi(s,x) satisfies Eq. (56), it follows from (19) that

ϕ(t,x)\displaystyle\phi(t,x) =𝔼t,x¯[tTτneβ(st)𝒜(r(s,X~s𝝅,a)θlog𝝅(a|s,X~sπ))𝝅(a|s,X~s𝝅)dads\displaystyle=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T\wedge\tau_{n}}e^{-\beta(s-t)}\int_{\mathcal{A}}\left(r(s,\tilde{X}^{\boldsymbol{\pi}}_{s-},a)-\theta\log{\boldsymbol{\pi}}(a|s,\tilde{X}^{\pi}_{s-})\right)\boldsymbol{\pi}(a|s,\tilde{X}^{\boldsymbol{\pi}}_{s-})dads\right. (207)
+eβ(Tτnt)ϕ(X~Tτn𝝅)].\displaystyle\quad\quad\quad\left.+e^{-\beta(T\wedge\tau_{n}-t)}\phi(\tilde{X}^{\boldsymbol{\pi}}_{T\wedge\tau_{n}})\right]. (208)

It follows from Assumption 1-(iv), Definition 1-(iii), and Eq. (57) that the term on the right-hand side of (208) is dominated by C(1+suptsT|X~s𝝅|p)C(1+\sup_{t\leq s\leq T}|\tilde{X}_{s}^{\boldsymbol{\pi}}|^{p}) for some p2p\geq 2, which has finite expectation from the moment estimate (52). Therefore, sending nn to infinity in (208) and applying the dominated convergence theorem, we obtain ϕ(t,x)=J(t,x,𝝅)\phi(t,x)=J(t,x,\boldsymbol{\pi}).∎

Proof of Lemma 5.

We first maximize the integral in (59). Applying (Jia and Zhou, 2023, Lemma 13), we deduce that 𝝅\boldsymbol{\pi}^{*} given by (62) is the unique maximizer. Next, we show that ψ(t,x)\psi(t,x) is the optimal value function.

On one hand, given any admissible stochastic policy 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}, from (59) we have

ψ(t,x)t+𝒜{H(t,x,a,ψx,ψxx,ψ)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑aβψ(t,x)0.\displaystyle\frac{\partial\psi(t,x)}{\partial t}+\int_{\mathcal{A}}\{H(t,x,a,\psi_{x},\psi_{xx},\psi)-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da-\beta\psi(t,x)\leq 0. (209)

Using similar arguments as in the proof of Lemma 4, we obtain J(t,x,𝝅)ψ(t,x)J(t,x,\boldsymbol{\pi})\leq\psi(t,x) for any 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}. Thus, J(t,x)ψ(t,x)J^{*}(t,x)\leq\psi(t,x).

On the other hand, with \boldsymbol{\pi}=\boldsymbol{\pi}^{*}, Eq. (59) becomes

ψ(t,x)t+𝒜{H(t,x,a,ψx,ψxx,ψ)θlog𝝅(a)}𝝅(a)𝑑aβψ(t,x)\displaystyle\frac{\partial\psi(t,x)}{\partial t}+\int_{\mathcal{A}}\{H(t,x,a,\psi_{x},\psi_{xx},\psi)-\theta\log\boldsymbol{\pi}^{*}(a)\}\boldsymbol{\pi}^{*}(a)da-\beta\psi(t,x) =0,\displaystyle=0, (210)

with ψ(T,x)=h(x).\psi(T,x)=h(x). By Lemma 4, we obtain that J(t,x,𝝅)=ψ(t,x)J(t,x,\boldsymbol{\pi}^{*})=\psi(t,x). It follows that J(t,x)ψ(t,x)J^{*}(t,x)\geq\psi(t,x).

Combining these results, we conclude that J(t,x)=ψ(t,x)J^{*}(t,x)=\psi(t,x) and 𝝅\boldsymbol{\pi}^{*} is the optimal stochastic policy. ∎

Proof of Theorem 2.

Consider part (i). Applying Itô’s formula to the value function of policy 𝝅\boldsymbol{\pi} over the sample state process defined by (22) and using the definition of qq-function, we obtain that for 0t<sT0\leq t<s\leq T,

eβsJ(s,Xs𝝅;𝝅)eβtJ(t,x;𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}J(s,X_{s}^{\boldsymbol{\pi}};\boldsymbol{\pi})-e^{-\beta t}J(t,x;\boldsymbol{\pi})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau
=tseβτ[q(τ,Xτ𝝅,aτ𝝅;𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ+tseβτJx(τ,Xτ𝝅;𝝅)σ(τ,Xτπ,aτπ)𝑑Wτ\displaystyle=\int_{t}^{s}e^{-\beta\tau}[q(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}};\boldsymbol{\pi})-\hat{q}(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau+\int_{t}^{s}e^{-\beta\tau}J_{x}(\tau,X_{\tau-}^{\boldsymbol{\pi}};\boldsymbol{\pi})\circ{\sigma}(\tau,X^{\pi}_{\tau-},a^{\pi}_{\tau})dW_{\tau}
+k=1tseβτ(J(τ,Xτ𝝅+γk(τ,Xτ𝝅,aτ𝝅,z))J(τ,Xτ𝝅;𝝅))N~k(dτ,dz).\displaystyle\quad+\sum_{k=1}^{\ell}\int_{t}^{s}e^{-\beta\tau}\int_{\mathbb{R}}\left(J(\tau,X^{\boldsymbol{\pi}}_{\tau-}+{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z))-J(\tau,X^{\boldsymbol{\pi}}_{\tau-};\boldsymbol{\pi})\right)\widetilde{N}_{k}(d\tau,dz). (211)

Suppose q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a). Hence the first term on the right-hand side of (211) is zero. We verify the following two conditions:

𝔼t,x¯[tTe2βτ|Jx(τ,Xτ𝝅;𝝅)σ(τ,Xτπ,aτπ)|2𝑑τ]<,\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-2\beta\tau}|J_{x}(\tau,X_{\tau-}^{\boldsymbol{\pi}};\boldsymbol{\pi})\circ{\sigma}(\tau,X^{\pi}_{\tau-},a^{\pi}_{\tau})|^{2}d\tau\right]<\infty, (212)
𝔼t,x¯[tTe2βτ|J(τ,Xτ𝝅+γk(τ,Xτ𝝅,aτ𝝅,z);𝝅)J(τ,Xτ𝝅;𝝅)|2νk(dz)𝑑τ]<.\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left|J(\tau,X^{\boldsymbol{\pi}}_{\tau-}+{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z);\boldsymbol{\pi})-J(\tau,X^{\boldsymbol{\pi}}_{\tau-};\boldsymbol{\pi})\right|^{2}\nu_{k}(dz)d\tau\right]<\infty. (213)

Eq. (212) follows from Assumption 1-(iii), the polynomial growth of JxJ_{x} in xx, and the moment estimate (53). For (213), we apply the mean-value theorem and the integral becomes

tTe2βτ|Jx(τ,Xτ𝝅+ατγk(τ,Xτ𝝅,aτ𝝅,z);𝝅)γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left|J_{x}(\tau,X^{\boldsymbol{\pi}}_{\tau-}+\alpha_{\tau}{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z);\boldsymbol{\pi})\circ{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (214)

for some ατ[0,1]\alpha_{\tau}\in[0,1]. Using the polynomial growth of JxJ_{x} in xx, we can bound the integral by

tTe2βτ|Jx(τ,Xτ𝝅+ατγk(τ,Xτ𝝅,aτ𝝅,z);𝝅)|2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\displaystyle\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left|J_{x}(\tau,X^{\boldsymbol{\pi}}_{\tau-}+\alpha_{\tau}{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z);\boldsymbol{\pi})\right|^{2}\cdot\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (215)
\displaystyle\leq CtTe2βτ(1+|Xτ𝝅+ατγk(τ,Xτ𝝅,aτ𝝅,z)|p)2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\displaystyle C\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left(1+|X^{\boldsymbol{\pi}}_{\tau-}+\alpha_{\tau}{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)|^{p}\right)^{2}\cdot\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (216)
\displaystyle\leq CtTe2βτ(1+|Xτ𝝅|p+|γk(τ,Xτ𝝅,aτ𝝅,z)|p)2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\displaystyle C^{\prime}\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left(1+\left|X^{\boldsymbol{\pi}}_{\tau-}\right|^{p}+\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{p}\right)^{2}\cdot\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (217)
\displaystyle\leq CtT((1+|Xτ𝝅|p)2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)\displaystyle C^{\prime}\int_{t}^{T}\left((1+\left|X^{\boldsymbol{\pi}}_{\tau-}\right|^{p})^{2}\int_{\mathbb{R}}\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)\right. (218)
+2(1+|Xτ𝝅|p)|γk(τ,Xτ𝝅,aτ𝝅,z)|p+2νk(dz)+|γk(τ,Xτ𝝅,aτ𝝅,z)|2p+2νk(dz))dτ\displaystyle\left.+2(1+\left|X^{\boldsymbol{\pi}}_{\tau-}\right|^{p})\int_{\mathbb{R}}\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{p+2}\nu_{k}(dz)+\int_{\mathbb{R}}\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2p+2}\nu_{k}(dz)\right)d\tau (219)

Using Assumption 1-(iii) and the moment estimate (53), we obtain (213). It follows that the second and third processes on the right-hand side of (211) are ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingales and thus we have the martingale property of the process given by (69).

Conversely, if (69) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, we see from (211) that the process

tseβτ[q(τ,Xτ𝝅,aτ𝝅;𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\int_{t}^{s}e^{-\beta\tau}[q(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}};\boldsymbol{\pi})-\hat{q}(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (220)

is also a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale. Furthermore, it has continuous sample paths and finite variation and thus is equal to zero ¯\bar{\mathbb{P}}-almost surely. We can then follow the argument in the proof of Theorem 6 in Jia and Zhou (2023) to show that q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a). There is only one step in their proof that we need to modify due to the presence of jumps.

Specifically, consider the sample state process X𝝅X^{\boldsymbol{\pi}} starting from some time-state-action (t,x,a)(t^{*},x^{*},a^{*}). Fix δ>0\delta>0 and define

Tδ=inf{tt:|Xt𝝅x|>δ}(t+δ).\displaystyle T_{\delta}=\inf\{t^{\prime}\geq t^{*}:|X_{t^{\prime}}^{\boldsymbol{\pi}}-x^{*}|>\delta\}\wedge(t^{*}+\delta). (221)

In the pure diffusion case, Jia and Zhou (2023) uses the continuity of the sample paths of X^{\boldsymbol{\pi}} to argue that T_{\delta}>t^{*}, \bar{\mathbb{P}}-almost surely. This result still holds in the presence of jumps, because the Lévy processes that drive our controlled state X^{\boldsymbol{\pi}} are stochastically continuous, i.e., the probability of a jump occurring at the fixed time t^{*} is zero.

To prove parts (ii) and (iii), we can apply the arguments used in proving part (i) together with those arguments from the proof of Theorem 6 in Jia and Zhou (2023). The details are omitted. ∎

Proof of Lemma 6.

(1) Consider the function h(t,S)h^{\prime}(t,S) defined by the RHS of (154). Proposition 12.1 in Cont and Tankov (2004) shows that hC1,2([0,T)×+)C([0,T]×+)h^{\prime}\in C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and it satisfies the PIDE (144). Furthermore, h(t,S)h^{\prime}(t,S) is Lipschitz continuous in SS. The Lipschitz continuity of function G^\hat{G} also implies |G^(S^T)|C(1+S^T)|\hat{G}(\hat{S}_{T})|\leq C(1+\hat{S}_{T}) for some constant C>0C>0. It follows that

|h(t,S)|C(1+𝔼[S^T|S^t=S])C(1+S),|h^{\prime}(t,S)|\leq C\left(1+\mathbb{E}^{\mathbb{Q}}[\hat{S}_{T}|\hat{S}_{t}=S]\right)\leq C(1+S), (222)

where we used the martingale property of S^\hat{S} under \mathbb{Q}. The Feynman-Kac Theorem (see Zhu et al. (2015)) implies uniqueness of classical solutions satisfying the linear growth condition.

(2) To study the PIDE (148), we consider the function g^{\prime}(t,S) defined by the RHS of (156). Under the assumptions of Lemma 6, the arguments of Proposition 12.1 in Cont and Tankov (2004) can be used to show that g^{\prime}\in C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and that it satisfies the PIDE (148). The Lipschitz continuity of h(t,S) in S implies that h_{S}(t,S) is bounded and |h(t,S\exp(z))-h(t,S)|\leq CS|\exp(z)-1|. We also have \mathbb{E}^{\mathbb{P}}[\sup_{t\leq s\leq T}\hat{S}_{s}^{2}|\hat{S}_{t}=S]\leq C(1+S^{2}) (Kunita (2004), Theorem 3.2). Combining these results, we see that g^{\prime}(t,S) has quadratic growth in S. Again, the Feynman–Kac theorem implies uniqueness of classical solutions satisfying the quadratic growth condition. ∎