
Reinforcement Learning for Jump-Diffusions, with Financial Applications

Xuefeng Gao Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China. E-mail: [email protected]    Lingfei Li Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China. E-mail: [email protected]    Xun Yu Zhou Department of Industrial Engineering and Operations Research and The Data Science Institute, Columbia University, New York, NY 10027, USA. Email: [email protected]
Abstract

We study continuous-time reinforcement learning (RL) for stochastic control in which system dynamics are governed by jump-diffusion processes. We formulate an entropy-regularized exploratory control problem with stochastic policies to capture the exploration–exploitation balance essential for RL. Unlike the pure diffusion case initially studied by Wang et al. (2020), the derivation of the exploratory dynamics under jump-diffusions calls for a careful formulation of the jump part. Through a theoretical analysis, we find that one can simply use the same policy evaluation and $q$-learning algorithms in Jia and Zhou (2022a), Jia and Zhou (2023), originally developed for controlled diffusions, without needing to check a priori whether the underlying data come from a pure diffusion or a jump-diffusion. However, we show that the presence of jumps ought to affect parameterizations of actors and critics in general. We investigate as an application the mean–variance portfolio selection problem with stock price modelled as a jump-diffusion, and show that both RL algorithms and parameterizations are invariant with respect to jumps. Finally, we present a detailed study on applying the general theory to option hedging.

Keywords. Reinforcement learning, continuous time, jump-diffusions, exploratory formulation, well-posedness, Hamiltonian, martingale, $q$-learning.

1 Introduction

Recently there has been an upsurge of interest in continuous-time reinforcement learning (RL) with continuous state spaces and possibly continuous action spaces. Continuous RL problems are important because: 1) many if not most practical problems are naturally continuous in time (and in space), such as autonomous driving, robot navigation, video game play and high-frequency trading; 2) while one can discretize time upfront and turn a continuous-time problem into a discrete-time MDP, it has been known, and indeed shown experimentally (e.g., Munos 2006, Tallec et al. 2019, and Park et al. 2021), that this approach is very sensitive to time discretization and performs poorly with small time steps; 3) there are more analytical tools available in the continuous setting that enable a rigorous and thorough analysis leading to interpretable (instead of black-box) and general (instead of ad hoc) RL algorithms.

Compared with the vast literature of RL for MDPs, continuous-time RL research is still in its infancy with the latest study focusing on establishing a rigorous mathematical theory and devising resulting RL algorithms. This strand of research starts with Wang et al. (2020) that introduces a mathematical formulation to capture the essence of RL – the exploration–exploitation tradeoff – in the continuous setting, followed by a “trilogy” (Jia and Zhou 2022a, b, Jia and Zhou 2023) that develops intertwining theories on policy evaluation, policy gradient and $q$-learning respectively. The common underpinning of the entire theory is the martingale property of certain stochastic processes, the enforcement of which naturally leads to various temporal difference algorithms to train and learn $q$-functions, value functions and optimal (stochastic) policies. The research is characterized by carrying out all the analysis in the continuous setting, and discretizing time only at the final, implementation stage for approximating the integrated rewards and the temporal difference. The theory has been adapted and extended in different directions; see e.g. Reisinger and Zhang (2021), Guo et al. (2022), Dai et al. (2023), as well as employed for applications; see e.g. Wang and Zhou (2020), Huang et al. (2022), Gao et al. (2022), Wang et al. (2023), and Wu and Li (2024).

The study so far has been predominantly on pure diffusion processes, namely the state processes are governed by controlled stochastic differential equations (SDEs) with a drift part and a diffusion one. While it is reasonable to model the underlying data generating processes as diffusions within a short period of time, sudden and drastic changes can and do happen over time. An example is a stock price process: while it is approximately a diffusion over a sufficiently short period, it may respond dramatically to a surprisingly good or bad earnings report. Other examples include neuron dynamics (Giraudo and Sacerdote 1997), stochastic resonance (Gammaitoni et al. 1998) and climate data (Goswami et al. 2018). It is therefore natural and necessary to extend the continuous RL theory and algorithms to the case when jumps are present. This is particularly important for decision making in financial markets, where it has been well recognized that using jumps to capture large sudden movements provides a more realistic way to model market dynamics; see the discussions in Chapter 1 of Cont and Tankov (2004). The financial modeling literature with jumps dates back to the seminal work of Merton (1976), who extends the classical Black–Scholes model by introducing a compound Poisson process with normally distributed jumps in the log returns. Since then alternative jump size distributions have been proposed in e.g. Kou (2002) and Cai and Kou (2011). Empirical success of jump-diffusion models has been documented for many asset classes; see Bates (1991), Andersen et al. (2002), and Aït-Sahalia et al. (2012) for stocks and stock indices, Bates (1996) for exchange rates, Das (2002) for interest rates, and Li and Linetsky (2014) for commodities, among many others.

This paper makes two major contributions. The first is mathematical in terms of setting up the suitable exploratory formulation and proving the well-posedness of the resulting exploratory SDE, which form the foundation of the RL theory for jump-diffusions. Wang et al. (2020) apply the classical stochastic relaxed control to model the exploration or randomization prevalent in RL, and derive an exploratory state equation that dictates the dynamics of the “average” of infinitely many state processes generated by repeatedly sampling from the same exploratory, stochastic policy. The drift and variance coefficients of the exploratory SDE are the means of those coefficients against the given stochastic policy (which is a probability distribution) respectively. The derivation therein is based on a law of large numbers argument applied to the first two moments of the diffusion process. That argument fails for jump-diffusions, which are not uniquely determined by the first two moments. We overcome this difficulty by analyzing instead the infinitesimal generator of the sample state process, from which we identify the dynamics of the exploratory state process. Inspired by Kushner (2000) who studies relaxed control for jump-diffusions, we formulate the exploratory SDE by extending the original Poisson random measures for jumps to capture the effect of random exploration. It should be noted that, like almost all the earlier works on relaxed control, Kushner (2000) is motivated by answering the theoretical question of whether an optimal control exists, as randomization convexifies the universe of control strategies. In comparison, our formulation is guided by the practical motivation of exploration for learning. There is also another subtle but important difference. We consider stochastic feedback policies while Kushner (2000) does not. This in turn creates technical issues in studying the well-posedness of the exploratory SDE in our framework.

The second main contribution is several implications regarding the impact of jumps on RL algorithm design. Thanks to the established exploratory formulation, we can define the Hamiltonian that, compared with the pure diffusion counterpart, has to include an additional term corresponding to the jumps. The resulting HJB equation – called the exploratory HJB – is now a partial integro-differential equation (PIDE) instead of a PDE due to that additional term. However, when expressed in terms of the Hamiltonian, the exploratory HJB equation has exactly the same form as that in the diffusion case. This leads to several completely identical statements of important results, including the optimality of the Gibbs exploration, the definition of a $q$-function, and martingale characterizations of value functions and $q$-functions. Here by “identical” we mean in terms of the Hamiltonian; in other words, these statements differ between diffusions and jump-diffusions entirely because the Hamiltonian is defined differently (which also causes some differences in the proofs of the results concerned). Most important of all, in the resulting RL algorithms, the Hamiltonian (or equivalently the $q$-function) can be computed using temporal difference of the value function by virtue of the Itô lemma; as a result the algorithms are completely identical no matter whether or not there are jumps. This has a significant practical implication: we can just use the same RL algorithms without needing to check in advance whether the underlying data come from a pure diffusion or a jump-diffusion. It is significant for the following reason. In practice, data are always observed or sampled at discrete times, no matter how frequently they arrive. Thus we encounter successive discontinuities along the sample trajectory even when the data actually come from a diffusion process. There are some criteria that can be used to check whether the underlying process is a diffusion or a jump-diffusion, e.g. Aït-Sahalia et al. (2012), Wang and Zheng (2022). But these methods typically require data with very high frequency to be effective, which may not always be available. In addition, noises must be properly handled for them to work.

Even though we can apply the same RL algorithms irrespective of the presence of jumps, the parametrization of the policy and value function may still depend on it, if we try to exploit certain special structure of the problem instead of using general neural networks for parameterization. Indeed, we give an example in which the optimal exploratory policy is Gaussian when there are no jumps, whereas an optimal policy either does not exist or becomes non-Gaussian when there are jumps. However, in the mean–variance portfolio selection we present as a concrete application, the optimal Gibbs exploration measure again reduces to Gaussian and the value function is quadratic as in Wang and Zhou (2020), both owing to the inherent linear–quadratic (LQ) structure of the problem. Hence in this particular case jumps do not even affect the parametrization of the policy and value function/$q$-function for learning.

We also consider mean–variance hedging of options as another application. This is a non-LQ problem and hence more difficult to solve than mean–variance portfolio selection. The mean–variance hedging problem has been studied in various early works, such as Schweizer (1996, 2001) and Lim (2005). Here, we introduce the entropy-regularized mean–variance hedging objective for an asset following a jump-diffusion and derive analytical representations for the optimal stochastic policy, which is again Gaussian, as well as the optimal value function. We use these representations to devise an actor–critic algorithm to learn the optimal hedging policy from data.

We compare our work with four recent related papers. (1) Bender and Thuan (2023) consider the continuous-time mean–variance portfolio selection problem with exploration under a jump-diffusion setting. Our paper differs from theirs in several aspects. First, they consider a specific application problem while we study RL for general controlled jump-diffusions. Second, they obtain the SDE for the exploratory state by taking the limit of the discrete-time exploratory dynamics, whereas our approach first derives the form of the infinitesimal generator of the sample state process and then infers the exploratory SDE from it. It is unclear how their approach works when dealing with general control problems. Finally, they do not consider how to develop RL algorithms based on their solution of the exploratory mean–variance portfolio selection, which we do in this paper. (2) Guo et al. (2023) consider continuous-time RL for linear–convex models with jumps. The scope and motivation are different from ours: They focus on the Lipschitz stability of feedback controls for this special class of control problems where the diffusion and jump terms are not controlled, and propose a least-square model-based algorithm and obtain sublinear regret guarantees in the episodic setting. By contrast, we consider RL for general jump-diffusions and develop model-free algorithms without considering regret bounds. (3) Denkert et al. (2024) aim to unify certain types of stochastic control problems by considering the so-called randomized control formulation which leads to the same optimal value functions as those of the original problems. They develop a policy gradient representation and actor–critic algorithms for RL. The randomized control formulation is fundamentally different from the framework we are considering: therein the control is applied at a set of random time points generated by a random point process instead of at every time point as in our framework. (4) Bo et al. (2024) develop $q$-learning for jump-diffusions by using Tsallis’ entropy for regularization instead of Shannon’s entropy considered in our paper and in Jia and Zhou (2022b), Jia and Zhou (2023). (The paper Bo et al. (2024) came to our attention after a previous version of our paper was completed and posted.) While this entropy presents an interesting alternative for developing RL algorithms, it may make the exploratory control problem less tractable to solve and lead to policy distributions that are inefficient to sample for exploration in certain applications.

The remainder of the paper is organized as follows. In Section 2, we discuss the problem formulation. In Section 3, we present the theory of $q$-learning for jump-diffusions, followed by the discussion of $q$-learning algorithms in Section 4. In Section 5, we apply the general theory and algorithms to a mean–variance portfolio selection problem, and discuss the impact of jumps. Section 6 presents the application to a mean–variance option hedging problem. Finally, Section 7 concludes. All the proofs are given in an appendix.

2 Problem Formulation and Preliminaries

For readers’ convenience, we first recall some basic concepts for one-dimensional (1D) Lévy processes, which can be found in standard references such as Sato (1999) and Applebaum (2009). A 1D process $L=\{L_{t}:t\geq 0\}$ is a Lévy process if it is continuous in probability, has stationary and independent increments, and $L_{0}=0$ almost surely. Denote the jump of $L$ at time $t$ by $\Delta L_{t}=L_{t}-L_{t-}$, and let $\boldsymbol{\text{B}}_{0}$ be the collection of Borel sets of $\mathbb{R}$ whose closure does not contain 0. The Poisson random measure (or jump measure) of $L$ is defined as

N(t,B)=\sum_{s:0<s\leq t}1_{B}(\Delta L_{s}),\quad B\in\boldsymbol{\text{B}}_{0}, \qquad (1)

which gives the number of jumps up to time $t$ with jump size in a Borel set $B$ away from 0. The Lévy measure $\nu$ of $L$ is defined by $\nu(B)=\mathbb{E}[N(1,B)]$ for $B\in\boldsymbol{\text{B}}_{0}$, which shows the expected number of jumps in $B$ in unit time, and $\nu(B)$ is finite. For any $B\in\boldsymbol{\text{B}}_{0}$, $\{N(t,B):t\geq 0\}$ is a Poisson process with intensity given by $\nu(B)$. The differential forms of these two measures are written as $N(dt,dz)$ and $\nu(dz)$, respectively. If $\nu$ is absolutely continuous, we write $\nu(dz)=\nu(z)dz$ by using the same letter for the measure and its density function. The Lévy measure $\nu$ must satisfy the integrability condition

\int_{\mathbb{R}}\min\{z^{2},1\}\nu(dz)<\infty. \qquad (2)

However, it is not necessarily a finite measure on $\mathbb{R}\setminus\{0\}$, but it is always a $\sigma$-finite measure. The Lévy process $L$ is said to have jumps of finite activity (resp. infinite activity) if $\int_{\mathbb{R}}\nu(dz)<\infty$ (resp. $=\infty$). The number of jumps on any finite time interval is finite in the former case but infinite in the latter. For any Borel set $B$ with its closure including 0, $\nu(B)$ is finite in the finite activity case but infinite otherwise. Finally, the compensated Poisson random measure is defined as $\widetilde{N}(dt,dz)=N(dt,dz)-\nu(dz)dt$. For any $B\in\boldsymbol{\text{B}}_{0}$, the process $\{\widetilde{N}(t,B):t\geq 0\}$ is a martingale.
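To make these objects concrete, here is a minimal simulation sketch (ours, not from the paper) for a finite-activity Lévy process, namely a compound Poisson process with intensity $\lambda$ and standard normal jump sizes; it checks empirically that $\mathbb{E}[N(1,B)]$ matches $\nu(B)=\lambda\,\mathbb{P}(Z\in B)$ for $B=[0.5,\infty)$. All numerical choices below are illustrative assumptions.

```python
# Sketch (ours): verify E[N(1,B)] = nu(B) for a compound Poisson process with
# jump intensity lam and N(0,1) jump marks, taking B = [0.5, infinity).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
lam, T, n_paths = 2.0, 1.0, 100_000
B_lower = 0.5

counts = np.empty(n_paths)
for i in range(n_paths):
    n_jumps = rng.poisson(lam * T)              # number of jumps on [0, T]
    jump_sizes = rng.standard_normal(n_jumps)   # the marks Y_i
    counts[i] = np.sum(jump_sizes >= B_lower)   # N(T, B) for this path

empirical = counts.mean()                       # Monte Carlo estimate of E[N(1, B)]
theoretical = lam * (1.0 - norm.cdf(B_lower))   # nu(B) = lam * P(Z in B)
print(f"E[N(1,B)] ~ {empirical:.4f},  nu(B) = {theoretical:.4f}")
```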

Throughout the paper, we use the following notations. By convention, all vectors are column vectors unless otherwise specified. We use $\mathbb{R}^{k}$ and $\mathbb{R}^{k\times\ell}$ to denote the space of $k$-dimensional vectors and $k\times\ell$ matrices, respectively. For a matrix $A$, we use $A^{\top}$ for its transpose, $|A|$ for its Euclidean/Frobenius norm, and write $A^{2}\coloneqq AA^{\top}$. Given two matrices $A$ and $B$ of the same size, we denote by $A\circ B$ the inner product between $A$ and $B$, which is given by $\text{tr}(AB^{\top})$. For a positive semidefinite matrix $A$, we write $\sqrt{A}=UD^{1/2}V^{\top}$, where $A=UDV^{\top}$ is its singular value decomposition with $U,V$ two orthogonal matrices and $D$ a diagonal matrix, and $D^{1/2}$ is the diagonal matrix whose entries are the square roots of those of $D$. We use $f=f(\cdot)$ to denote the function $f$, and $f(x)$ to denote the value of $f$ at $x$. We use both $f_{x},f_{xx}$ and $\frac{\partial f}{\partial x},\frac{\partial^{2}f}{\partial x^{2}}$ for the first and second (partial) derivatives of a function $f$ with respect to $x$. We write the minimum of two values $a$ and $b$ as $a\wedge b$. The notation $\mathcal{U}(B)$ denotes the uniform distribution over a set $B$, while $\mathcal{N}(\mu,\Sigma)$ refers to the Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
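As a quick numerical illustration of these conventions (ours, not part of the paper), the following snippet computes $A\circ B=\text{tr}(AB^{\top})$ and $\sqrt{A}$ via the SVD for a randomly generated positive semidefinite matrix.

```python
# Sketch (ours): the matrix conventions A o B = tr(A B^T) and sqrt(A) = U D^{1/2} V^T.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); A = A @ A.T   # a positive semidefinite matrix
B = rng.standard_normal((3, 3))

inner = np.trace(A @ B.T)                      # A o B
U, d, Vt = np.linalg.svd(A)                    # A = U diag(d) V^T
sqrtA = U @ np.diag(np.sqrt(d)) @ Vt           # U D^{1/2} V^T
print(inner, np.allclose(sqrtA @ sqrtA.T, A))  # second value is True: sqrt(A) sqrt(A)^T = A
```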

2.1 Classical stochastic control of jump-diffusions

Consider a filtered probability space $(\Omega,\mathcal{F},\{\mathcal{F}_{t}\},\mathbb{P})$ satisfying the usual hypothesis. Assume that this space is rich enough to support $W=\{W_{t}:t\geq 0\}$, a standard Brownian motion in $\mathbb{R}^{m}$, and $\ell$ independent one-dimensional (1D) Lévy processes $L_{1},\cdots,L_{\ell}$, which are also independent of $W$. Let $N(dt,dz)=(N_{1}(dt,dz_{1}),\cdots,N_{\ell}(dt,dz_{\ell}))^{\top}$ be the vector of their Poisson random measures, and similarly define $\nu(dz)$ and $\widetilde{N}(dt,dz)$. The controlled system dynamics are governed by the following Lévy SDE (Øksendal and Sulem 2007, Chapter 3):

dX_{s}^{a}=b(s,X_{s-}^{a},a_{s})ds+\sigma(s,X_{s-}^{a},a_{s})dW_{s}+\int_{\mathbb{R}^{\ell}}\gamma(s,X_{s-}^{a},a_{s},z)\widetilde{N}(ds,dz),\quad s\in[0,T], \qquad (3)

where

b:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R}^{d},\quad \sigma:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R}^{d\times m},\quad \gamma:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\times\mathbb{R}^{\ell}\rightarrow\mathbb{R}^{d\times\ell}, \qquad (4)

$a_{s}$ is the control or action at time $s$, $\mathcal{A}\subseteq\mathbb{R}^{n}$ is the control space, and $a=\{a_{s}:s\in[0,T]\}$ is the control process, assumed to be predictable with respect to $\{\mathcal{F}_{s}:s\in[0,T]\}$. We denote the $k$-th column of the matrix $\gamma$ by $\gamma_{k}$. The goal of stochastic control is, for each initial time-state pair $(t,x)$ of (3), to find the optimal control process $a$ that maximizes the expected total reward:

\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}r(s,X_{s}^{a},a_{s})ds+e^{-\beta(T-t)}h(X_{T}^{a})\Big{|}X_{t}^{a}=x\right], \qquad (5)

where $\beta\geq 0$ is a discount factor that measures the time value of the payoff.
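As a concrete (and entirely illustrative) sketch of how (3) and (5) can be simulated, the following Python code runs an Euler-type scheme for a toy one-dimensional controlled jump-diffusion under a constant action and estimates the expected total reward by Monte Carlo. The particular choices of $b$, $\sigma$, $\gamma$, $r$, $h$, the compound Poisson jumps, and the parameter values are our own assumptions, not the paper's specification.

```python
# Sketch (ours): Euler scheme for a toy 1D controlled jump-diffusion and a Monte
# Carlo estimate of the objective (5) under a constant action a.  Jumps are
# compound Poisson with intensity lam and N(0, 0.1^2) marks; since gamma below is
# linear in z and the marks have zero mean, the compensator term vanishes here.
import numpy as np

rng = np.random.default_rng(2)
T, n_steps, n_paths = 1.0, 200, 20_000
dt = T / n_steps
beta, lam, a = 0.1, 3.0, 0.5                   # discount, jump intensity, action

b     = lambda t, x, a: 0.05 * x * a           # drift coefficient
sigma = lambda t, x, a: 0.2 * x * a            # diffusion coefficient
gamma = lambda t, x, a, z: a * x * z           # jump coefficient (linear in z)
r     = lambda t, x, a: -0.5 * a ** 2          # running reward
h     = lambda x: np.log(np.maximum(x, 1e-12)) # terminal reward

x = np.ones(n_paths)
reward = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    reward += np.exp(-beta * t) * r(t, x, a) * dt
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    n_jumps = rng.poisson(lam * dt, n_paths)                       # jump counts on [t, t+dt]
    z_sum = 0.1 * np.sqrt(n_jumps) * rng.standard_normal(n_paths)  # sum of the Gaussian marks
    x = x + b(t, x, a) * dt + sigma(t, x, a) * dW + gamma(t, x, a, z_sum)
reward += np.exp(-beta * T) * h(x)
print("Monte Carlo estimate of the objective (5):", reward.mean())
```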

The stochastic control problem (3)–(5) is very general; in particular, control processes affect the drift, diffusion and jump coefficients. We now make the following assumption to ensure well-posedness of the problem. Define $\mathbb{R}^{d}_{K}\coloneqq\{x\in\mathbb{R}^{d}:|x|\leq K\}$.

Assumption 1.

Suppose the following conditions are satisfied by the state dynamics and reward functions:

  1. (i)

    $b,\sigma,\gamma,r,h$ are all continuous functions in their respective arguments;

  2. (ii)

    (local Lipschitz continuity) for any $K>0$ and any $p\geq 2$, there exist positive constants $C_{K}$ and $C_{K,p}$ such that $\forall(t,a)\in[0,T]\times\mathcal{A}$, $(x,x^{\prime})\in\mathbb{R}^{d}_{K}$,

    |b(t,x,a)-b(t,x^{\prime},a)|^{2}+|\sigma(t,x,a)-\sigma(t,x^{\prime},a)|^{2}\leq C_{K}|x-x^{\prime}|^{2}, \qquad (6)
    \sum_{k=1}^{\ell}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z_{k})-\gamma_{k}(t,x^{\prime},a,z_{k})|^{p}\nu_{k}(dz)\leq C_{K,p}|x-x^{\prime}|^{p}; \qquad (7)

  3. (iii)

    (linear growth in $x$) for any $p\geq 1$, there exist positive constants $C$ and $C_{p}$ such that $\forall(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}$,

    |b(t,x,a)|^{2}+|\sigma(t,x,a)|^{2}\leq C(1+|x|^{2}), \qquad (8)
    \sum_{k=1}^{\ell}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{p}\nu_{k}(dz)\leq C_{p}(1+|x|^{p}); \qquad (9)

  4. (iv)

    there exists a constant $C>0$ such that

    |r(t,x,a)|\leq C\left(1+|x|^{p}+|a|^{q}\right),\ |h(x)|\leq C\left(1+|x|^{p}\right),\quad \forall(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A} \qquad (10)

    for some $p\geq 2$ and some $q\geq 1$; moreover, $\mathbb{E}[\int_{0}^{T}|a_{s}|^{q}ds]$ is finite.

Conditions (i)-(iii) guarantee the existence of a unique strong solution to the Lévy SDE (3) with initial condition $X_{t}^{a}=x\in\mathbb{R}^{d}$. Furthermore, for any $p\geq 2$, there exists a constant $C_{p}>0$ such that

\mathbb{E}_{t,x}\left[\sup_{t\leq s\leq T}|X_{s}^{a}|^{p}\right]\leq C_{p}(1+|x|^{p}); \qquad (11)

see (Kunita 2004, Theorem 3.2) and (Situ 2006, Theorem 119). With the moment estimate (11), it follows that condition (iv) implies that the expected value in (5) is finite.

Let $\mathcal{L}^{a}$ be the infinitesimal generator associated with the Lévy SDE (3). Under condition (iii), we have $\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|\nu_{k}(dz)<\infty$ for $k=1,\cdots,\ell$. Thus, we can write $\mathcal{L}^{a}$ in the following form:

\mathcal{L}^{a}f(t,x)=\frac{\partial f}{\partial t}(t,x)+b(t,x,a)\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\sigma^{2}(t,x,a)\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz), \qquad (12)

where $\frac{\partial f}{\partial x}\in\mathbb{R}^{d}$ is the gradient and $\frac{\partial^{2}f}{\partial x^{2}}\in\mathbb{R}^{d\times d}$ is the Hessian matrix.

We recall Itô’s formula, which will be frequently used in our analysis; see e.g. (Øksendal and Sulem 2007, Theorem 1.16). Let $X^{a}$ be the unique strong solution to (3). For any $f\in C^{1,2}(\mathbb{R}^{+}\times\mathbb{R}^{d})$, we have

df(t,X^{a}_{t})=\frac{\partial f}{\partial t}(t,X^{a}_{t-})dt+b(t,X^{a}_{t-},a_{t})\circ\frac{\partial f}{\partial x}(t,X^{a}_{t-})dt+\frac{1}{2}\sigma^{2}(t,X^{a}_{t-},a_{t})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,X^{a}_{t-})dt \qquad (13)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,X^{a}_{t-}+\gamma_{k}(t,X^{a}_{t-},a_{t},z))-f(t,X^{a}_{t-})-\gamma_{k}(t,X^{a}_{t-},a_{t},z)\circ\frac{\partial f}{\partial x}(t,X^{a}_{t-})\right)\nu_{k}(dz)dt \qquad (14)
+\frac{\partial f}{\partial x}(t,X^{a}_{t-})\circ\sigma(t,X^{a}_{t-},a_{t})dW_{t}+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,X^{a}_{t-}+\gamma_{k}(t,X^{a}_{t-},a_{t},z))-f(t,X^{a}_{t-})\right)\widetilde{N}_{k}(dt,dz). \qquad (15)

It is known that the Hamilton–Jacobi–Bellman (HJB) equation for the control problem (3)–(5) is given by

\sup_{a\in\mathcal{A}}\{r(t,x,a)+\mathcal{L}^{a}V(t,x)\}-\beta V(t,x)=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d}, \qquad (16)
V(T,x)=h(x),

where $\mathcal{L}^{a}$ is given in (12). Under proper conditions, the solution to the above equation is the optimal value function $V^{*}$ for the control problem (5). Moreover, the following function, which maps a time-state pair to an action:

a^{*}(t,x)=\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\{r(t,x,a)+\mathcal{L}^{a}V^{*}(t,x)\right\} \qquad (17)

is the optimal feedback control policy of the problem.

Given a smooth function $V(t,x)\in C^{1,2}([0,T]\times\mathbb{R}^{d})$, we define the Hamiltonian $H$ by

H(t,x,a,V_{x},V_{xx},V)=r(t,x,a)+b(t,x,a)\circ V_{x}(t,x)+\frac{1}{2}\sigma^{2}(t,x,a)\circ V_{xx}(t,x) \qquad (18)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(V(t,x+\gamma_{k}(t,x,a,z))-V(t,x)-\gamma_{k}(t,x,a,z)\circ V_{x}(t,x)\right)\nu_{k}(dz). \qquad (19)

Then, the HJB equation (16) can be recast as

\frac{\partial V(t,x)}{\partial t}+\sup_{a\in\mathcal{A}}\{H(t,x,a,V_{x},V_{xx},V)\}-\beta V(t,x)=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d}, \qquad (20)
V(T,x)=h(x).
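For intuition, here is a small numerical sketch (ours, under illustrative modelling assumptions) of evaluating the Hamiltonian (18)–(19) at a given $(t,x,a)$ for a one-dimensional state and a single finite-activity jump component, with the Lévy integral computed by quadrature; the test value function $V$ and all model functions below are hypothetical.

```python
# Sketch (ours): evaluate the Hamiltonian (18)-(19) numerically for a toy 1D model
# with Levy density nu(z) = lam * phi(z), phi being the standard normal pdf.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

lam = 3.0
nu = lambda z: lam * norm.pdf(z)               # Levy density (finite activity)

b     = lambda t, x, a: 0.05 * x * a
sigma = lambda t, x, a: 0.2 * x * a
gamma = lambda t, x, a, z: a * x * z
r     = lambda t, x, a: -0.5 * a ** 2

V   = lambda t, x: np.exp(-0.1 * (1.0 - t)) * x ** 2   # a hypothetical test value function
Vx  = lambda t, x: np.exp(-0.1 * (1.0 - t)) * 2 * x
Vxx = lambda t, x: np.exp(-0.1 * (1.0 - t)) * 2.0

def hamiltonian(t, x, a):
    # local (drift + diffusion + reward) part of (18)
    local = r(t, x, a) + b(t, x, a) * Vx(t, x) + 0.5 * sigma(t, x, a) ** 2 * Vxx(t, x)
    # nonlocal jump part of (19), integrated against the Levy density
    integrand = lambda z: (V(t, x + gamma(t, x, a, z)) - V(t, x)
                           - gamma(t, x, a, z) * Vx(t, x)) * nu(z)
    jump_term, _ = quad(integrand, -8.0, 8.0)  # truncate the Levy integral
    return local + jump_term

print(hamiltonian(0.5, 1.0, 0.3))
```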

2.2 Randomized control and exploratory formulation

A key idea of RL is to explore the unknown environment by randomizing the actions. Let $\boldsymbol{\pi}:(t,x)\in[0,T]\times\mathbb{R}^{d}\rightarrow\boldsymbol{\pi}(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$ be a given stochastic feedback policy, where $\mathcal{P}(\mathcal{A})$ is the set of probability density functions defined on $\mathcal{A}$. Let ${\bf a}:(t,x)\in[0,T]\times\mathbb{R}^{d}\rightarrow{\bf a}(t,x)\in\mathcal{A}$ be sampled from $\boldsymbol{\pi}$ (i.e. ${\bf a}$ is a copy of $\boldsymbol{\pi}$), which is a deterministic feedback policy. Applying this policy to (3), we get for $s\in[0,T]$,

dX_{s}^{\bf a}=b(s,X_{s-}^{\bf a},{\bf a}(s,X_{s-}^{\bf a}))ds+\sigma(s,X_{s-}^{\bf a},{\bf a}(s,X_{s-}^{\bf a}))dW_{s}+\int_{\mathbb{R}^{\ell}}\gamma(s,X_{s-}^{\bf a},{\bf a}(s,X_{s-}^{\bf a}),z)\widetilde{N}(ds,dz). \qquad (21)

Assuming the solution to the above SDE exists uniquely, we say the action process $a^{\boldsymbol{\pi}}=\{a^{\boldsymbol{\pi}}_{s}={\bf a}(s,X_{s-}^{\bf a}):t\leq s\leq T\}$ is generated from $\boldsymbol{\pi}$. Note that $a^{\boldsymbol{\pi}}$ depends on the specific sample ${\bf a}\sim\boldsymbol{\pi}$, which we omit to write out for notational simplicity. In the following, we will also write $\boldsymbol{\pi}(\cdot|t,x)$ as $\boldsymbol{\pi}_{t,x}(\cdot)$.

We need to enlarge the original filtered probability space to include the additional randomness from sampling actions. Following Jia and Zhou (2022b), Jia and Zhou (2023), we assume that the probability space is rich enough to support independent copies of an $n$-dimensional random vector uniformly distributed over $[0,1]^{n}$, where $n$ is the dimension of the control space. These copies are also independent of $W$ and $L_{1},\cdots,L_{\ell}$. Let $\mathcal{G}_{s}$ be the new sigma-algebra generated by $\mathcal{F}_{s}$ and the copies of the uniform random vector up to time $s$. The new filtered probability space is $(\Omega,\mathcal{F},\{\mathcal{G}_{t}\},\bar{\mathbb{P}})$, where $\bar{\mathbb{P}}$ is the product extension from $\mathbb{P}$ and they coincide when restricted to $\mathcal{F}_{T}$.

Fix a stochastic feedback policy $\boldsymbol{\pi}$ and an initial time-state pair $(t,x)$. An action process $a^{\boldsymbol{\pi}}=\{a^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\}$ generated from $\boldsymbol{\pi}$ is a $\mathcal{G}_{s}$-progressively measurable process that is also predictable. Consider the sample state process $X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\}$ that follows the SDE

dX^{\boldsymbol{\pi}}_{s}=b(s,X^{\boldsymbol{\pi}}_{s-},a^{\boldsymbol{\pi}}_{s})ds+\sigma(s,X^{\boldsymbol{\pi}}_{s-},a^{\boldsymbol{\pi}}_{s})dW_{s}+\int_{\mathbb{R}^{\ell}}\gamma(s,X^{\boldsymbol{\pi}}_{s-},a^{\boldsymbol{\pi}}_{s},z)\widetilde{N}(ds,dz),\quad s\in[t,T]. \qquad (22)

Once again, bear in mind that the above equation depends on a specific sample ${\bf a}\sim\boldsymbol{\pi}$; so there are in fact infinitely many similar equations, each corresponding to a sample of $\boldsymbol{\pi}$.

To encourage exploration, we add an entropy regularizer to the running reward, leading to

J(t,x;\boldsymbol{\pi})=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-\beta(s-t)}\left(r(s,X_{s}^{\boldsymbol{\pi}},a_{s}^{\boldsymbol{\pi}})-\theta\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})\right)ds+e^{-\beta(T-t)}h(X_{T}^{\boldsymbol{\pi}})\right], \qquad (23)

where $\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}$ is the expectation conditioned on $X_{t}^{\boldsymbol{\pi}}=x$, taken with respect to the randomness in the Brownian motion, the Poisson random measures, and the action randomization. Here $\theta>0$ is the temperature parameter that controls the level of exploration. The function $J(\cdot,\cdot;\boldsymbol{\pi})$ is called the value function of the policy $\boldsymbol{\pi}$. The goal of RL is to find the policy that maximizes the value function among admissible policies that are to be specified in Definition 1 below.

For theoretical analysis, we consider the exploratory dynamics of $X^{\boldsymbol{\pi}}$, which represent the key averaged characteristics of the sample state process over infinitely many randomized actions. In the case of diffusions, Wang et al. (2020) derive such exploratory dynamics by applying a law of large numbers argument to the first two moments of the diffusion process. Their approach, however, cannot be applied to jump-diffusions. Here, we get around this by studying the infinitesimal generator of the sample state process, from which we identify the dynamics of the exploratory state process.

To this end, let $f\in C_{0}^{1,2}([0,T)\times\mathbb{R}^{d})$, which is continuously differentiable in $t$ and twice continuously differentiable in $x$ with compact support, and we need to analyze $\lim_{s\to 0}\frac{\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}[f(t+s,X_{t+s}^{\boldsymbol{\pi}})]-f(t,x)}{s}$. Fixing $(t,x)$, consider the SDE (22) starting from $X^{\boldsymbol{\pi}}_{t}=x$ with $N$ independent copies ${\bf a}_{1},\cdots,{\bf a}_{N}$ of $\boldsymbol{\pi}$. Let $s>0$ be very small and assume the corresponding actions $a_{1},\cdots,a_{N}$ are fixed from $t$ to $t+s$. Denote by $X_{t+s}^{a_{i}}$ the value of the state process corresponding to ${\bf a}_{i}$ at $t+s$. Then

\lim_{s\to 0}\frac{\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}[f(t+s,X_{t+s}^{\boldsymbol{\pi}})]-f(t,x)}{s} \qquad (24)
=\lim_{s\to 0}\frac{\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{t,x}^{\mathbb{P}}[f(t+s,X_{t+s}^{a_{i}})]-f(t,x)}{s} \qquad (25)
=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\lim_{s\to 0}\frac{\mathbb{E}_{t,x}^{\mathbb{P}}[f(t+s,X_{t+s}^{a_{i}})]-f(t,x)}{s} \qquad (26)
=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\partial f}{\partial t}(t,x)+b(t,x,a_{i})\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\sigma^{2}(t,x,a_{i})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x)\right) \qquad (27)
+\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{\ell}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a_{i},z))-f(t,x)-\gamma_{k}(t,x,a_{i},z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz). \qquad (28)

Using the law of large numbers, we obtain

(27)=\frac{\partial f}{\partial t}(t,x)+\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\tilde{\sigma}^{2}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x), \qquad (29)

where

\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})\coloneqq\int_{\mathcal{A}}b(t,x,a)\boldsymbol{\pi}(a|t,x)da,\quad\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})\coloneqq\left(\int_{\mathcal{A}}\sigma^{2}(t,x,a)\boldsymbol{\pi}(a|t,x)da\right)^{1/2}. \qquad (30)

These “exploratory” drift and diffusion coefficients are consistent with those in Wang et al. (2020). It is tempting to think that the exploratory jump coefficient $\tilde{\gamma}$ is similarly the average of $\gamma$ with respect to $\boldsymbol{\pi}$; but unfortunately this is generally not true. This in turn is one of the main distinctive features in studying RL for jump-diffusions.
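As a quick illustration of (30) (ours; the toy coefficients $b$ and $\sigma$ below are assumptions made only for this example), the exploratory drift and diffusion coefficients under a Gaussian policy can be approximated by averaging over sampled actions:

```python
# Sketch (ours): Monte Carlo estimates of the exploratory coefficients (30)
# under a Gaussian policy pi(.|t,x) = N(mu, std^2), for toy b and sigma.
import numpy as np

rng = np.random.default_rng(3)
b     = lambda t, x, a: 0.05 * x * a
sigma = lambda t, x, a: 0.2 * x * a

def exploratory_coefficients(t, x, mu, std, n_samples=100_000):
    """Estimate b_tilde(t,x,pi) and sigma_tilde(t,x,pi) by averaging over actions."""
    a = rng.normal(mu, std, n_samples)             # actions sampled from the policy
    b_tilde = b(t, x, a).mean()                    # average of b
    sigma_tilde = np.sqrt((sigma(t, x, a) ** 2).mean())  # root of the averaged sigma^2
    return b_tilde, sigma_tilde

print(exploratory_coefficients(0.5, 1.0, mu=0.3, std=0.2))
```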

We approach the problem by analyzing the integrals in (28). Using the second-order Taylor expansion, the boundedness of $\frac{\partial^{2}f}{\partial x^{2}}(t,x)$ for $x\in\mathbb{R}^{d}$ and condition (iii) of Assumption 1, we obtain that for fixed $(t,x)$ and each $k$,

\left|\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)\right| \qquad (31)
\leq C\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{2}\nu_{k}(dz)\leq C(1+|x|^{2}) \qquad (32)

for some constant $C>0$, which is independent of $a$. It follows that

(28)=\sum_{k=1}^{\ell}\int_{\mathcal{A}}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)\boldsymbol{\pi}(a|t,x)da. \qquad (33)

Combining (29) and (33), the infinitesimal generator $\mathcal{L}^{\boldsymbol{\pi}}$ of the sample state process is given by the probability weighted average of the generator $\mathcal{L}^{a}$ of the classical controlled jump-diffusion, i.e.,

\mathcal{L}^{\boldsymbol{\pi}}f(t,x)=\int_{\mathcal{A}}\mathcal{L}^{a}f(t,x)\boldsymbol{\pi}(a|t,x)da. \qquad (34)

Next, we reformulate the integrals in (33) to convert them to the same form as in (12), from which we can infer the SDE for the exploratory state process.

Recall that the Poisson random measure $N_{k}(dt,dz)$ with intensity measure $dt\nu_{k}(dz)$ ($k=1,\cdots,\ell$) is defined over the product space $[0,T]\times\mathbb{R}$. We can also interpret $N_{k}$ as a counting measure associated with a random configuration of points $(T_{i},Y_{i})\in[0,T]\times\mathbb{R}$ (Cont and Tankov 2004, Section 2.6.3), i.e.,

N_{k}=\sum_{i\geq 1}\delta_{(T_{i},Y_{i})}, \qquad (35)

where $\delta_{x}$ is the Dirac measure with mass one at point $x$, $T_{i}$ is the arrival time of the $i$th jump of the Lévy process $L_{k}$, and $Y_{i}=L_{k}(T_{i})-L_{k}(T_{i}-)$ is the size of this jump. We can interpret $Y_{i}$ as the mark of the $i$th event.

At $T_{i}$, the size of the jump in the controlled state $X$ under policy $\boldsymbol{\pi}$ is given by $\gamma(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}},a_{T_{i}}^{\boldsymbol{\pi}},Y_{i})$, where $X_{T_{i}-}^{\boldsymbol{\pi}}$ is the state right before the jump occurs and $a_{T_{i}}^{\boldsymbol{\pi}}$ is the action generated from the feedback policy $\boldsymbol{\pi}$. When the policy $\boldsymbol{\pi}$ is deterministic, the generated action is determined by $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}})$ and thus the size of the jump in $X^{\boldsymbol{\pi}}$ is a function of $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}},Y_{i})$. By contrast, when $\boldsymbol{\pi}$ becomes stochastic, an additional random noise is introduced at $T_{i}$ that determines the generated action together with $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}})$. Consequently, the size of the jump in $X^{\boldsymbol{\pi}}$ is a function of $(T_{i},X_{T_{i}-}^{\boldsymbol{\pi}},Y_{i})$ plus the random noise for exploration at $T_{i}$.

This motivates us to construct new Poisson random measures on an extended space to capture the effect of random noise on jumps for stochastic policies. Specifically, for each $k=1,\cdots,\ell$, we construct a new Poisson random measure, denoted by $N^{\prime}_{k}(dt,dz,du)$, on the product space $[0,T]\times\mathbb{R}\times[0,1]^{n}$, with its intensity measure given by $dt\nu_{k}(dz)du$. Here, $u$ is the realized value of the $n$-dimensional random vector $U$ that follows $\mathcal{U}([0,1]^{n})$, which is the random noise introduced in the probability space for exploration. The new Poisson random measure $N^{\prime}_{k}$ is also a counting measure associated with a random configuration of points $(T_{i},Y_{i},U_{i})\in[0,T]\times\mathbb{R}\times[0,1]^{n}$:

N^{\prime}_{k}=\sum_{i\geq 1}\delta_{(T_{i},Y_{i},U_{i})}, \qquad (36)

where $T_{i}$ and $Y_{i}$ are the same as above, and $U_{i}$ is the $\mathcal{U}([0,1]^{n})$ random vector that generates random exploration at $T_{i}$. Hence, the $i$th event is marked by both $Y_{i}$ and $U_{i}$ under $N^{\prime}_{k}$. We let $N^{\prime}(dt,dz,du)=(N^{\prime}_{1}(dt,dz_{1},du),\cdots,N^{\prime}_{\ell}(dt,dz_{\ell},du))^{\top}$.

In general, for any $n$-dimensional random vector $\xi$ that follows distribution $\boldsymbol{\pi}$, we can find a measurable function $G^{\boldsymbol{\pi}}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ such that $\xi=G^{\boldsymbol{\pi}}(U)$, where $U\sim\mathcal{U}([0,1]^{n})$. As an example, consider $\xi\sim\mathcal{N}(\mu,AA^{\top})$. We can represent it as $\xi=\mu+A\Phi^{-1}(U)$, where $\Phi$ is the cumulative distribution function of the univariate standard normal distribution and $\Phi^{-1}(U)$ is a vector obtained by applying $\Phi^{-1}$ to each component of $U$.
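The following short sketch (ours; the specific $\mu$ and $A$ are arbitrary choices for illustration) implements this map $G^{\boldsymbol{\pi}}$ for a Gaussian policy and checks the sample mean and covariance:

```python
# Sketch (ours): G^pi for a Gaussian policy.  With U ~ Uniform([0,1]^n),
# the action xi = mu + A * Phi^{-1}(U) has law N(mu, A A^T).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu = np.array([0.1, -0.2])
A = np.array([[0.3, 0.0],
              [0.1, 0.2]])

U = rng.random((100_000, 2))            # i.i.d. Uniform([0,1]^2) samples
actions = mu + norm.ppf(U) @ A.T        # G^pi applied row-wise

print(actions.mean(axis=0))             # close to mu
print(np.cov(actions.T))                # close to A A^T
```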

For the stochastic feedback policy $\boldsymbol{\pi}_{t,x}$, using $a=G^{\boldsymbol{\pi}_{t,x}}(u)$ we obtain

\int_{\mathcal{A}}\int_{\mathbb{R}}\left(f(t,x+\gamma_{k}(t,x,a,z))-f(t,x)-\gamma_{k}(t,x,a,z)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)\boldsymbol{\pi}(a|t,x)da \qquad (37)
=\int_{\mathbb{R}\times[0,1]^{n}}\left(f\left(t,x+\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right)-f(t,x)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)du. \qquad (38)

It follows that the infinitesimal generator of the sample state process can be written as

\mathcal{L}^{\boldsymbol{\pi}}f(t,x)=\frac{\partial f}{\partial t}(t,x)+\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial f}{\partial x}(t,x)+\frac{1}{2}\tilde{\sigma}^{2}(t,x,\boldsymbol{\pi}_{t,x})\circ\frac{\partial^{2}f}{\partial x^{2}}(t,x) \qquad (39)
+\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left(f\left(t,x+\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right)-f(t,x)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\circ\frac{\partial f}{\partial x}(t,x)\right)\nu_{k}(dz)du. \qquad (40)

Comparing (40) with (12) in terms of the integral part that characterizes the behavior of jumps, we observe that the new measure $\nu_{k}(dz)du$ replaces the Lévy measure $\nu_{k}(dz)$ and integration is done over an extended space to capture the effect of random exploration on jumps. The jump coefficient function that generates the jump size in the controlled state process $X$, given the Lévy jump with size $z$ and control variable $a$, is still the same. However, in (40) the control $a$ is generated from $u$ as $G^{\boldsymbol{\pi}_{t,x}}(u)$, where $u$ is the realized value of the random noise introduced for exploration. In the following, we will also write $G^{\boldsymbol{\pi}_{t,x}}(u)$ as $G^{\boldsymbol{\pi}}(t,x,u)$ whenever using the latter simplifies notations.

Based on (40), we see that the exploratory state should be the solution to the following Lévy SDE:

dX^{\boldsymbol{\pi}}_{s}=\tilde{b}(s,X^{\boldsymbol{\pi}}_{s-},\boldsymbol{\pi}(\cdot|s,X^{\boldsymbol{\pi}}_{s-}))ds+\tilde{\sigma}(s,X^{\boldsymbol{\pi}}_{s-},\boldsymbol{\pi}(\cdot|s,X^{\boldsymbol{\pi}}_{s-}))dW_{s} \qquad (41)
+\int_{\mathbb{R}\times[0,1]^{n}}\gamma\left(s,X^{\boldsymbol{\pi}}_{s-},G^{\boldsymbol{\pi}}(s,X^{\boldsymbol{\pi}}_{s-},u),z\right)\widetilde{N}^{\prime}(ds,dz,du),\quad X^{\boldsymbol{\pi}}_{t}=x,\ s\in[t,T], \qquad (42)

which we call the exploratory Lévy SDE. The solution process, if it exists, is denoted by $\tilde{X}^{\boldsymbol{\pi}}$ and called the exploratory (state) process. As we explain below, this process informs us of the behavior of the key characteristics of the sample state process after averaging over infinitely many actions sampled from the stochastic policy $\boldsymbol{\pi}$.

In general, the sample state process $X^{\boldsymbol{\pi}}$ defined by (22) is a semimartingale, as it is the sum of three processes: the drift process that has finite variation (the first term in (22)), the continuous (local) martingale driven by the Brownian motion (the second term in (22)), and the discontinuous (local) martingale driven by the compensated Poisson random measure (the third term in (22)). Any semimartingale is fully determined by three characteristics: the drift, the quadratic variation of the continuous local martingale, and the compensator of the random measure associated with the process’s jumps (the compensator gives the jump intensity); see Jacod and Shiryaev (2013) for detailed discussions of semimartingales and their characteristics.

For the sample state process, given that $X_{s-}^{\boldsymbol{\pi}}=x$ and the action sampled from $\boldsymbol{\pi}_{s,x}$ is $a\in\mathcal{A}$, the characteristics over an infinitesimally small time interval $[s,s+ds]$ are given by the triplet $\left(b(s,x,a)ds,\ \sigma^{2}(s,x,a)ds,\ \sum_{k=1}^{\ell}\gamma_{k}(s,x,a,z)\nu_{k}(dz)ds\right)$.

Now consider the exploratory state process $\tilde{X}^{\boldsymbol{\pi}}$, which is also a semimartingale by (42). Its characteristics over an infinitesimally small time interval $[s,s+ds]$ with $\tilde{X}_{s-}^{\boldsymbol{\pi}}=x$ are given by the triplet $\left(\tilde{b}(s,x,\boldsymbol{\pi}_{s,x})ds,\ \tilde{\sigma}^{2}(s,x,\boldsymbol{\pi}_{s,x})ds,\ \sum_{k=1}^{\ell}\int_{[0,1]^{n}}\gamma_{k}(s,x,G^{\boldsymbol{\pi}}(s,x,u),z)du\cdot\nu_{k}(dz)ds\right)$, where the third characteristic is obtained by calculating $\mathbb{E}\left[\sum_{k=1}^{\ell}\gamma_{k}\left(s,x,G^{\boldsymbol{\pi}}(s,x,u),z\right)N^{\prime}_{k}(ds,dz,du)\right]$ for Lévy jumps with size from $[z,z+dz]$. Using (30), we have

\tilde{b}(s,x,\boldsymbol{\pi}_{s,x})ds=\int_{\mathcal{A}}b(s,x,a)\boldsymbol{\pi}(a|s,x)da\,ds,\quad \tilde{\sigma}^{2}(s,x,\boldsymbol{\pi}_{s,x})ds=\int_{\mathcal{A}}\sigma^{2}(s,x,a)\boldsymbol{\pi}(a|s,x)da\,ds, \qquad (43)
\sum_{k=1}^{\ell}\int_{[0,1]^{n}}\gamma_{k}(s,x,G^{\boldsymbol{\pi}}(s,x,u),z)du\cdot\nu_{k}(dz)ds=\sum_{k=1}^{\ell}\int_{\mathcal{A}}\gamma_{k}\left(s,x,a,z\right)\nu_{k}(dz)ds\cdot\boldsymbol{\pi}(a|s,x)da. \qquad (44)

Thus, the semimartingale characteristics of the exploratory state process are the averages of those of the sample state process over action randomization.

Remark 1.

In general, there may be other ways to formulate the exploratory SDE in the jump-diffusion case, as we may be able to obtain alternative representations for the infinitesimal generator $\mathcal{L}^{\boldsymbol{\pi}}$ based on (34). However, the law of the exploratory state would not change because its generator stays the same.

A technical yet foundational question is the well-posedness (i.e. existence and uniqueness of solution) of the exploratory SDE (42), which we address below. For that we first specify the class of admissible strategies, which is the same as that considered in Jia and Zhou (2023) for pure diffusions.

Definition 1.

A policy $\boldsymbol{\pi}=\boldsymbol{\pi}(\cdot|\cdot,\cdot)$ is called admissible, if

  1. (i)

    $\boldsymbol{\pi}(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$, $\text{supp}\,\boldsymbol{\pi}(\cdot|t,x)=\mathcal{A}$ for every $(t,x)\in[0,T]\times\mathbb{R}^{d}$, and $\boldsymbol{\pi}(a|t,x):(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R}$ is measurable;

  2. (ii)

    $\boldsymbol{\pi}(a|t,x)$ is continuous in $(t,x)$, i.e., $\int_{\mathcal{A}}\left|\boldsymbol{\pi}(a|t,x)-\boldsymbol{\pi}(a|t^{\prime},x^{\prime})\right|da\to 0$ as $(t^{\prime},x^{\prime})\to(t,x)$. Furthermore, for any $K>0$, there is a constant $C_{K}>0$ independent of $(t,a)$ such that

    \int_{\mathcal{A}}\left|\boldsymbol{\pi}(a|t,x)-\boldsymbol{\pi}(a|t,x^{\prime})\right|da\leq C_{K}|x-x^{\prime}|,\quad \forall x,x^{\prime}\in\mathbb{R}^{d}_{K}; \qquad (45)

  3. (iii)

    $\forall(t,x)$, $\int_{\mathcal{A}}|\log\boldsymbol{\pi}(a|t,x)|\boldsymbol{\pi}(a|t,x)da\leq C(1+|x|^{p})$ for some $p\geq 2$ and a positive constant $C$; for any $q\geq 1$, $\int_{\mathcal{A}}|a|^{q}\boldsymbol{\pi}(a|t,x)da\leq C_{q}(1+|x|^{p})$ for some $p\geq 2$ and a positive constant $C_{q}$ that can depend on $q$.

Next, we establish the well-posedness of (42) under any admissible policy. The result of the next lemma regarding $\tilde{b}$ and $\tilde{\sigma}$ is provided in the proof of Lemma 2 in Jia and Zhou (2022b), which uses property (ii) of admissibility.

Lemma 1.

Under Assumption 1, for any admissible policy $\boldsymbol{\pi}$, the functions $\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})$ and $\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})$ have the following properties:

  1. (i)

    (local Lipschitz continuity) for any $K>0$, there exists a constant $C_{K}>0$ such that $\forall t\in[0,T]$, $(x,x^{\prime})\in\mathbb{R}^{d}_{K}$,

    |\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})-\tilde{b}(t,x^{\prime},\boldsymbol{\pi}_{t,x^{\prime}})|^{2}+|\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})-\tilde{\sigma}(t,x^{\prime},\boldsymbol{\pi}_{t,x^{\prime}})|^{2}\leq C_{K}|x-x^{\prime}|^{2}; \qquad (46)

  2. (ii)

    (linear growth in $x$) there exists a constant $C>0$ such that $\forall(t,x)\in[0,T]\times\mathbb{R}^{d}$,

    |\tilde{b}(t,x,\boldsymbol{\pi}_{t,x})|^{2}+|\tilde{\sigma}(t,x,\boldsymbol{\pi}_{t,x})|^{2}\leq C(1+|x|^{2}). \qquad (47)

We now establish similar properties for $\gamma(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z)$ in the following lemmas, whose proofs are relegated to the appendix.

Lemma 2 (linear growth in $x$).

Under Assumption 1, for any admissible $\boldsymbol{\pi}$ and any $p\geq 2$, there exists a constant $C_{p}>0$ that can depend on $p$ such that $\forall(t,x)\in[0,T]\times\mathbb{R}^{d}$,

\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right|^{p}\nu_{k}(dz)du\leq C_{p}(1+|x|^{p}). \qquad (48)

For the local Lipschitz continuity of $\gamma_{k}(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z)$, we make an additional assumption.

Assumption 2.

For $k=1,\cdots,\ell$, the following conditions hold.

  1. (i)

    For any $K>0$ and any $p\geq 2$, there exists a constant $C_{K,p}>0$ that can depend on $K$ and $p$ such that

    \int_{\mathbb{R}}\left|\gamma_{k}(t,x,a,z)-\gamma_{k}(t,x,a^{\prime},z)\right|^{p}\nu_{k}(dz)\leq C_{K,p}|a-a^{\prime}|^{p},\quad\forall t\in[0,T],\ a,a^{\prime}\in\mathcal{A},\ x\in\mathbb{R}^{d}_{K},\ z\in\mathbb{R}. \qquad (49)

  2. (ii)

    For any $K>0$ and any $p\geq 2$, there exists a constant $C_{K,p}>0$ that can depend on $K$ and $p$ such that

    \int_{[0,1]^{n}}\left|G^{\boldsymbol{\pi}}(t,x,u)-G^{\boldsymbol{\pi}}(t,x^{\prime},u)\right|^{p}du\leq C_{K,p}|x-x^{\prime}|^{p}. \qquad (50)

For a stochastic feedback policy $\boldsymbol{\pi}_{t,x}\sim\mathcal{N}(\mu(t,x),A(t,x)A(t,x)^{\top})$, we have $G^{\boldsymbol{\pi}}(t,x,u)=\mu(t,x)+A(t,x)\Phi^{-1}(u)$. Clearly, Assumption 2-(ii) holds provided that $\mu(t,x)$ and $A(t,x)$ are locally Lipschitz continuous in $x$.

Lemma 3 (local Lipschitz continuity).

Under Assumptions 1 and 2, for any admissible policy $\boldsymbol{\pi}$, any $K>0$, and any $p\geq 2$, there exists a constant $C_{K,p}>0$ that can depend on $K$ and $p$ such that $\forall t\in[0,T]$, $(x,x^{\prime})\in\mathbb{R}^{d}_{K}$,

\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du\leq C_{K,p}|x-x^{\prime}|^{p}. \qquad (51)

With Lemmas 1 to 3, we can now apply (Kunita 2004, Theorem 3.2) and (Situ 2006, Theorem 119) to obtain the well-posedness of (42) along with the moment estimate of its solution.

Proposition 1.

Under Assumptions 1 and 2, for any admissible policy $\boldsymbol{\pi}$, there exists a unique strong solution $\{\tilde{X}_{t}^{\boldsymbol{\pi}},0\leq t\leq T\}$ to the exploratory Lévy SDE (42). Furthermore, for any $p\geq 2$, there exists a constant $C_{p}>0$ such that

\mathbb{E}_{t,x}^{\bar{\mathbb{P}}}\left[\sup_{t\leq s\leq T}|\tilde{X}_{s}^{\boldsymbol{\pi}}|^{p}\right]\leq C_{p}(1+|x|^{p}). \qquad (52)

It should be noted that the conditions imposed in Assumptions 1 and 2 are sufficient but not necessary for obtaining the well-posedness and moment estimate of the exploratory Lévy SDE (42). For a specific problem, weaker conditions may suffice for these results if we exploit special structures of the problem.

From the previous discussion, we see that for a given admissible stochastic feedback policy $\boldsymbol{\pi}$, the sample state process $\{X^{\boldsymbol{\pi}}_{t},t\in[0,T]\}$ and the exploratory state process $\{\tilde{X}^{\boldsymbol{\pi}}_{t},t\in[0,T]\}$ associated with $\boldsymbol{\pi}$ share the same infinitesimal generator and hence the same probability law. This is justified by (Ethier and Kurtz 1986, Chapter 4, Theorem 4.1) on the condition that the function space $C_{0}^{1,2}([0,T)\times\mathbb{R}^{d})$ is a core of the generator, which we assume to hold. It follows that

\mathbb{E}_{t,x}^{\bar{\mathbb{P}}}\left[\sup_{t\leq s\leq T}|X_{s}^{\boldsymbol{\pi}}|^{p}\right]=\mathbb{E}_{t,x}^{\bar{\mathbb{P}}}\left[\sup_{t\leq s\leq T}|\tilde{X}_{s}^{\boldsymbol{\pi}}|^{p}\right]\leq C_{p}(1+|x|^{p}) \qquad (53)

if (52) holds.

2.3 Exploratory HJB equation

With the exploratory dynamics (42), for any admissible stochastic policy $\boldsymbol{\pi}$ the value function $J(t,x;\boldsymbol{\pi})$ given by (23) can be rewritten as

J(t,x;\boldsymbol{\pi})=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-\beta(s-t)}\int_{\mathcal{A}}\left(r(s,\tilde{X}_{s}^{\boldsymbol{\pi}},a)-\theta\log\boldsymbol{\pi}(a|s,\tilde{X}_{s-}^{\boldsymbol{\pi}})\right)\boldsymbol{\pi}(a|s,\tilde{X}_{s-}^{\boldsymbol{\pi}})da\,ds \qquad (54)
+e^{-\beta(T-t)}h(\tilde{X}_{T}^{\boldsymbol{\pi}})\right]. \qquad (55)

Under Assumption 1 and using the admissibility of $\boldsymbol{\pi}$ and (52), it is easy to see that $J$ has polynomial growth in $x$. We provide the Feynman–Kac formula for this function in Lemma 4 by working with the representation (55). In the proof, we consider the finite and infinite jump activity cases separately because special care is needed in the latter. We revise Assumption 1 by adding one more condition for this case.

Assumption 1′.

Conditions (i) to (iv) in Assumption 1 hold. We further assume condition (v): if $\int_{\mathbb{R}}\nu_{k}(dz)=\infty$, then $|\gamma_{k}(t,x,a,z)|$ is bounded for any $|z|\leq 1$, $t\in[0,T]$, $x\in\mathbb{R}_{K}^{d}$, and $a\in\mathcal{A}$.

For Lemma 4, Lemma 5 and Theorem 1, we impose Assumption 1′ and assume that the exploratory SDE (42) is well-posed with the moment estimate (52). For simplicity, we do not explicitly mention these assumptions in the statement of the results.

Lemma 4.

Given an admissible stochastic policy $\boldsymbol{\pi}$, suppose there exists a solution $\phi\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d})$ to the following partial integro-differential equation (PIDE):

\frac{\partial\phi}{\partial t}(t,x)+\int_{\mathcal{A}}\left[H(t,x,a,\phi_{x},\phi_{xx},\phi)-\theta\log\boldsymbol{\pi}(a|t,x)\right]\boldsymbol{\pi}(a|t,x)da-\beta\phi(t,x)=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d}, \qquad (56)

with terminal condition $\phi(T,x)=h(x)$, $x\in\mathbb{R}^{d}$. Moreover, for some $p\geq 2$, $\phi$ satisfies

|\phi(t,x)|\leq C(1+|x|^{p}),\quad \forall(t,x)\in[0,T]\times\mathbb{R}^{d}. \qquad (57)

Then $\phi$ is the value function of the policy $\boldsymbol{\pi}$, i.e. $J(t,x;\boldsymbol{\pi})=\phi(t,x)$.

For ease of presentation, we henceforth assume the value function $J(\cdot,\cdot;\boldsymbol{\pi})\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d})$ for any admissible stochastic policy $\boldsymbol{\pi}$.

Remark 2.

The conclusion in Lemma 4 still holds if we assume that $\phi_{x}(t,x)$ has polynomial growth in $x$ instead of imposing Assumption 1′-(v).

Next, we consider the optimal value function defined by

J(t,x)=sup𝝅𝚷J(t,x;𝝅),\displaystyle J^{*}(t,x)=\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}J(t,x;\boldsymbol{\pi}), (58)

where 𝚷\boldsymbol{\Pi} is the class of admissible strategies. The following result characterizes JJ^{*} and the optimal stochastic policy through the so-called exploratory HJB equation.

Lemma 5.

Suppose there exists a solution ψC1,2([0,T)×d)C([0,T]×d)\psi\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) to the exploratory HJB equation:

ψt(t,x)+sup𝝅𝒫(𝒜)𝒜{H(t,x,a,ψx,ψxx,ψ)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑aβψ(t,x)=0,\displaystyle\frac{\partial\psi}{\partial t}(t,x)+\sup_{\boldsymbol{\pi}\in\mathcal{P}(\mathcal{A})}\int_{\mathcal{A}}\{H(t,x,a,\psi_{x},\psi_{xx},\psi)-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da-\beta\psi(t,x)=0, (59)

with the terminal condition ψ(T,x)=h(x)\psi(T,x)=h(x), where HH is the Hamiltonian defined in (19). Moreover, for some p2p\geq 2, ψ\psi satisfies

|ψ(t,x)|C(1+|x|p),(t,x)[0,T]×d,|\psi(t,x)|\leq C(1+|x|^{p}),\ \forall(t,x)\in[0,T]\times\mathbb{R}^{d}, (60)

and it holds that

𝒜exp(1θH(t,x,a,ψx,ψxx,ψ))𝑑a<.\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}H(t,x,a,\psi_{x},\psi_{xx},\psi)\right)da<\infty. (61)

Then, the Gibbs measure or Boltzmann distribution

𝝅(a|t,x)exp(1θH(t,x,a,ψx,ψxx,ψ))\displaystyle\boldsymbol{\pi}^{*}(a|t,x)\propto\exp\left(\frac{1}{\theta}H(t,x,a,\psi_{x},\psi_{xx},\psi)\right) (62)

is the optimal stochastic policy and J(t,x)=ψ(t,x)J^{*}(t,x)=\psi(t,x) provided that 𝛑\boldsymbol{\pi}^{*} is admissible.

Plugging the optimal stochastic policy (62) into Eq. (59) to remove the supremum operator, we obtain the following nonlinear PIDE for the optimal value function JJ^{*}:

Jt(t,x)+θlog[𝒜exp(1θH(t,x,a,Jx,Jxx,J))𝑑a]βJ(t,x)=0;J(T,x)=h(x).\displaystyle\frac{\partial J^{*}}{\partial t}(t,x)+\theta\log\left[\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right)da\right]-\beta J^{*}(t,x)=0;\;\;J^{*}(T,x)=h(x). (63)
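To illustrate the soft-max structure of (63) numerically, the following sketch (ours, purely for illustration) approximates the log-partition term θlog∫_𝒜 exp(H/θ)da on a uniform action grid via a log-sum-exp; the Hamiltonian callable H, the action interval and the toy quadratic example are assumptions of the sketch, not objects defined in the paper.

```python
import numpy as np

def soft_hamiltonian(H, a_grid, theta):
    """Approximate theta * log( int_A exp(H(a)/theta) da ) on a uniform
    action grid, using a log-sum-exp shift for numerical stability.
    H: callable a -> H(t, x, a, ...) at a fixed (t, x); a_grid: 1-D grid."""
    da = a_grid[1] - a_grid[0]
    vals = np.array([H(a) for a in a_grid]) / theta
    m = vals.max()                                   # avoid overflow in exp
    return theta * (m + np.log(np.exp(vals - m).sum() * da))

# Toy check with a concave quadratic Hamiltonian H(a) = -c2*a^2/2 + c1*a,
# for which the exact value is c1^2/(2*c2) + (theta/2)*log(2*pi*theta/c2).
c1, c2, theta = 1.0, 2.0, 0.1
a_grid = np.linspace(-10.0, 10.0, 4001)
approx = soft_hamiltonian(lambda a: -0.5 * c2 * a**2 + c1 * a, a_grid, theta)
exact = c1**2 / (2 * c2) + 0.5 * theta * np.log(2 * np.pi * theta / c2)
print(approx, exact)
```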

3 qq-Learning Theory

3.1 qq-function and policy improvement

Now that the exploratory formulation has been set up, we present the qq-learning theory for jump-diffusions, covering both policy evaluation and policy improvement. The theory can be developed similarly to Jia and Zhou (2023); so we only highlight the main differences in the analysis and skip the parts that are similar.

Definition 2.

The qq-function of the problem (3)–(5) associated with a given policy 𝛑𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi} is defined by

q(t,x,a;𝝅)\displaystyle q(t,x,a;\boldsymbol{\pi}) =J(t,x;𝝅)t+H(t,x,a,Jx,Jxx,J)βJ(t,x;𝝅),(t,x,a)[0,T]×d×𝒜,\displaystyle=\frac{\partial J(t,x;\boldsymbol{\pi})}{\partial t}+H(t,x,a,J_{x},J_{xx},J)-\beta J(t,x;\boldsymbol{\pi}),\;\;(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}, (64)

where JJ is given in (55) and the Hamiltonian function HH is defined in (19).

It is an immediate consequence of Lemma 4 that the qq-function satisfies

𝒜[q(t,x,a;𝝅)θlog𝝅(a|t,x)]𝝅(a|t,x)𝑑a=0,(t,x)[0,T]×d.\displaystyle\int_{\mathcal{A}}[q(t,x,a;\boldsymbol{\pi})-\theta\log\boldsymbol{\pi}(a|t,x)]\boldsymbol{\pi}(a|t,x)da=0,\;\;(t,x)\in[0,T]\times\mathbb{R}^{d}. (65)

The following policy improvement theorem can be proved similarly to (Jia and Zhou 2023, Theorem 2) by using the arguments in the proof of Lemma 4.

Theorem 1 (Policy Improvement).

For any given 𝛑𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}, define

𝝅(|t,x)exp(1θH(t,x,,Jx(t,x;𝝅),Jxx(t,x;𝝅),J(t,;𝝅)))exp(1θq(t,x,;𝝅)).\boldsymbol{\pi}^{\prime}(\cdot|t,x)\propto\exp\left(\frac{1}{\theta}H(t,x,\cdot,J_{x}(t,x;\boldsymbol{\pi}),J_{xx}(t,x;\boldsymbol{\pi}),J(t,\cdot;\boldsymbol{\pi}))\right)\propto\exp\left(\frac{1}{\theta}q(t,x,\cdot;\boldsymbol{\pi})\right).

If 𝛑𝚷,\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi}, then

J(t,x;𝝅)J(t,x;𝝅).\displaystyle J(t,x;\boldsymbol{\pi}^{\prime})\geq J(t,x;\boldsymbol{\pi}). (66)

Moreover, if the following map

(𝝅)\displaystyle\mathcal{I}(\boldsymbol{\pi}) =exp(1θH(t,x,,Jx(t,x;𝝅),Jxx(t,x;𝝅),J(t,;𝝅)))𝒜exp(1θH(t,x,a,Jx(t,x;𝝅),Jxx(t,x;𝝅),J(t,;𝝅)))𝑑a,𝝅𝚷\displaystyle=\frac{\exp\left(\frac{1}{\theta}H(t,x,\cdot,J_{x}(t,x;\boldsymbol{\pi}),J_{xx}(t,x;\boldsymbol{\pi}),J(t,\cdot;\boldsymbol{\pi}))\right)}{\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}H(t,x,a,J_{x}(t,x;\boldsymbol{\pi}),J_{xx}(t,x;\boldsymbol{\pi}),J(t,\cdot;\boldsymbol{\pi}))\right)da},\quad\boldsymbol{\pi}\in\boldsymbol{\Pi} (67)
=exp(1θq(t,x,;𝝅))𝒜exp(1θq(t,x,a;𝝅))𝑑a\displaystyle=\frac{\exp\left(\frac{1}{\theta}q(t,x,\cdot;\boldsymbol{\pi})\right)}{\int_{\mathcal{A}}\exp\left(\frac{1}{\theta}q(t,x,a;\boldsymbol{\pi})\right)da} (68)

has a fixed point 𝛑\boldsymbol{\pi}^{*}, then 𝛑\boldsymbol{\pi}^{*} is an optimal policy.
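As an illustration of the policy improvement map (67)–(68), the sketch below computes the Gibbs (softmax) density generated by a qq-function evaluated on a discretized action grid at a fixed (t,x); the grid, the toy qq-values and all function names are hypothetical and serve only to show the mechanics on a bounded action set.

```python
import numpy as np

def improved_policy(q_vals, a_grid, theta):
    """Discretized policy-improvement map (67)-(68): given q(t, x, .) on an
    action grid at a fixed (t, x), return the Gibbs (softmax) density
    proportional to exp(q/theta), normalized on the grid."""
    da = a_grid[1] - a_grid[0]
    logits = q_vals / theta
    w = np.exp(logits - logits.max())        # stabilized exponentials
    return w / (w.sum() * da)                # approximate density values

# Toy usage with an assumed concave quadratic q in a:
a_grid = np.linspace(-5.0, 5.0, 1001)
density = improved_policy(-(a_grid - 1.0) ** 2, a_grid, theta=0.1)
print(np.trapz(density, a_grid))             # ~ 1.0
```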

3.2 Martingale characterization of the qq-function

Next we derive the martingale characterization of the qq-function associated with a policy 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}, assuming that its value function has already been learned and known. We will highlight the major differences in the proof, provided in the appendix, compared with the pure diffusion setting. For Theorem 2, Theorem 3, and Theorem 4, we impose Assumption 1 and assume the moment estimate (53) for the sample state process holds without explicitly mentioning them in the theorem statements.

Theorem 2.

Let a policy 𝛑𝚷,\boldsymbol{\pi}\in\boldsymbol{\Pi}, its value function J(,;𝛑)C1,2([0,T)×d)C([0,T]×d)J(\cdot,\cdot;\boldsymbol{\pi})\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) and a continuous function q^:[0,T]×d×𝒜\hat{q}:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R} be given. Assume that J(t,x;𝛑)J(t,x;\boldsymbol{\pi}) and Jx(t,x;𝛑)J_{x}(t,x;\boldsymbol{\pi}) both have polynomial growth in xx. Then the following results hold.

  1. (i)

    q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a) if and only if for any (t,x),(t,x), the following process

    eβsJ(s,Xs𝝅;𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}J(s,X_{s}^{\boldsymbol{\pi}};\boldsymbol{\pi})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (69)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where X𝝅={Xs𝝅:tsT}X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\} is the sample state process defined in (22) with Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x.

  2. (ii)

    If q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a), then for any 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} and any (t,x)(t,x), the following process

    eβsJ(s,Xs𝝅;𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}J(s,X_{s}^{\boldsymbol{\pi}^{\prime}};\boldsymbol{\pi})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})]d\tau (70)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where {Xs𝝅:tsT}\{X^{\boldsymbol{\pi}^{\prime}}_{s}:t\leq s\leq T\} is the solution to (22) under 𝝅\boldsymbol{\pi}^{\prime} with initial condition Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x.

  3. (iii)

    If there exists 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} such that for all (t,x)(t,x), the process (70) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale where Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x, then we have q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a).

Moreover, in any of the three cases above, the qq-function satisfies

𝒜{q(t,x,a;𝝅)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑a=0,for all (t,x)[0,T]×d.\displaystyle\int_{\mathcal{A}}\{q(t,x,a;\boldsymbol{\pi})-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da=0,\quad\text{for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$}. (71)
Remark 3.

Similar to Jia and Zhou (2023), Theorem 2-(i) facilitates on-policy learning, where the qq-function of the given target policy 𝛑\boldsymbol{\pi} is learned from data {(s,Xs𝛑,as𝛑),tsT}\{(s,X_{s}^{\boldsymbol{\pi}},a_{s}^{\boldsymbol{\pi}}),t\leq s\leq T\} generated by 𝛑\boldsymbol{\pi} itself. On the other hand, Theorem 2-(ii) and -(iii) are for off-policy learning, where the qq-function of 𝛑\boldsymbol{\pi} is learned from data generated by a different policy 𝛑\boldsymbol{\pi}^{\prime}, called the behavior policy.

Next, we extend Theorem 7 in Jia and Zhou (2023) and obtain a martingale characterization of the value function and the qq-function simultaneously. The proof is essentially the same and hence omitted.

Theorem 3.

Let a policy 𝛑𝚷,\boldsymbol{\pi}\in\boldsymbol{\Pi}, a function J^C1,2([0,T)×d)C([0,T]×d)\hat{J}\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) with polynomial growth and a continuous function q^:[0,T]×d×𝒜\hat{q}:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R} be given satisfying

J^(T,x)=h(x),𝒜{q^(t,x,a)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑a=0,for all (t,x)[0,T]×d.\displaystyle\hat{J}(T,x)=h(x),\quad\int_{\mathcal{A}}\{\hat{q}(t,x,a)-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da=0,\quad\text{for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$}. (72)

Assume that J^\hat{J} and J^x\hat{J}_{x} both have polynomial growth. Then

  1. (i)

    J^\hat{J} and q^\hat{q} are respectively the value function and the q-function associated with 𝝅\boldsymbol{\pi} if and only if for all (t,x)[0,T]×d(t,x)\in[0,T]\times\mathbb{R}^{d}, the following process

    eβsJ^(s,Xs𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}\hat{J}(s,X_{s}^{\boldsymbol{\pi}})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (73)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where X𝝅={Xs𝝅:tsT}X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\} satisfies (22) with Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x.

  2. (ii)

    If J^\hat{J} and q^\hat{q} are respectively the value function and the q-function associated with 𝝅\boldsymbol{\pi}, then for any 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} and for all (t,x)[0,T]×d(t,x)\in[0,T]\times\mathbb{R}^{d}, the following process

    eβsJ^(s,Xs𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}\hat{J}(s,X_{s}^{\boldsymbol{\pi}^{\prime}})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})-\hat{q}(\tau,X_{\tau}^{\boldsymbol{\pi}^{\prime}},a_{\tau}^{\boldsymbol{\pi}^{\prime}})]d\tau (74)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where {Xs𝝅:tsT}\{X^{\boldsymbol{\pi}^{\prime}}_{s}:t\leq s\leq T\} satisfies (22) with Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x.

  3. (iii)

    If there exists 𝝅𝚷\boldsymbol{\pi}^{\prime}\in\boldsymbol{\Pi} such that for all (t,x)(t,x), the process (74) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale where Xt𝝅=xX^{\boldsymbol{\pi}^{\prime}}_{t}=x, then we have J^(t,x)=J(t,x;𝝅)\hat{J}(t,x)=J(t,x;\boldsymbol{\pi}) and q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a).

In any of the three cases above, if it holds that 𝛑(a|t,x)=exp(1θq^(t,x,a))𝒜exp(1θq^(t,x,a))𝑑a\boldsymbol{\pi}(a|t,x)=\frac{\exp(\frac{1}{\theta}\hat{q}(t,x,a))}{\int_{\mathcal{A}}{\exp(\frac{1}{\theta}\hat{q}(t,x,a))da}}, then 𝛑\boldsymbol{\pi} is the optimal policy and J^\hat{J} is the optimal value function.

3.3 Optimal qq-function

We consider in this section the optimal qq-function, i.e., the qq-function associated with the optimal policy 𝝅\boldsymbol{\pi}^{*} in (62). Based on Definition 2, we can define it by

q(t,x,a)\displaystyle q^{*}(t,x,a) =J(t,x)t+H(t,x,a,Jx,Jxx,J)βJ(t,x),\displaystyle=\frac{\partial J^{*}(t,x)}{\partial t}+H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})-\beta J^{*}(t,x), (75)

where JJ^{*} is the optimal value function that solves the exploratory HJB equation in (59).

The following is the martingale condition that characterizes the optimal value function JJ^{*} and the optimal qq-function; it can be proved analogously to Theorem 9 in Jia and Zhou (2023).

Theorem 4.

Let a function J^C1,2([0,T)×d)C([0,T]×d)\hat{J}^{*}\in C^{1,2}([0,T)\times\mathbb{R}^{d})\cap C([0,T]\times\mathbb{R}^{d}) and a continuous function q^:[0,T]×d×𝒜\hat{q}^{*}:[0,T]\times\mathbb{R}^{d}\times\mathcal{A}\rightarrow\mathbb{R} be given satisfying

J^(T,x)=h(x),𝒜exp(1θq^(t,x,a))𝑑a=1,for all (t,x)[0,T]×d.\displaystyle\hat{J}^{*}(T,x)=h(x),\quad\int_{\mathcal{A}}{\exp\left(\frac{1}{\theta}\hat{q}^{*}(t,x,a)\right)da}=1,\quad\text{for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$}. (76)

Assume that J^(t,x)\hat{J}^{*}(t,x) and J^x(t,x)\hat{J}^{*}_{x}(t,x) both have polynomial growth in xx. Then

  1. (i)

    If J^\hat{J}^{*} and q^\hat{q}^{*} are respectively the optimal value function and the optimal qq-function, then for any 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi} and for all (t,x)[0,T]×d(t,x)\in[0,T]\times\mathbb{R}^{d}, the following process

    eβsJ^(s,Xs𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}\hat{J}^{*}(s,X_{s}^{\boldsymbol{\pi}})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}^{*}(\tau,X_{\tau}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (77)

    is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, where X𝝅={Xs𝝅:tsT}X^{\boldsymbol{\pi}}=\{X^{\boldsymbol{\pi}}_{s}:t\leq s\leq T\} satisfies (22) with Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x. Moreover, in this case, 𝝅^(a|t,x)=exp(1θq^(t,x,a))\boldsymbol{\hat{\pi}}^{*}(a|t,x)=\exp\left(\frac{1}{\theta}\hat{q}^{*}(t,x,a)\right) is the optimal stochastic policy.

  2. (ii)

    If there exists 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi} such that for all (t,x)(t,x), the process (77) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale where Xt𝝅=xX^{\boldsymbol{\pi}}_{t}=x, then J^\hat{J}^{*} and q^\hat{q}^{*} are respectively the optimal value function and the optimal qq-function.

4 qq-Learning Algorithms

In this section we present learning algorithms based on the martingale characterizations of the qq-function discussed in the previous section. We need to distinguish two cases, depending on whether or not the normalizing constant of the Gibbs measure generated from the qq-function can be computed explicitly.

We first discuss the case when the normalizing constant in the Gibbs measure can be computed explicitly. We denote by JψJ^{\psi} and qϕ{q}^{\phi} the parameterized function approximators for the optimal value function and optimal qq-function, respectively. In view of Theorem 4, these approximators are chosen to satisfy

Jψ(T,x)=h(x),𝒜exp(1θqϕ(t,x,a))𝑑a=1.{J}^{\psi}(T,x)=h(x),\ \int_{\mathcal{A}}\exp\left(\frac{1}{\theta}q^{\phi}(t,x,a)\right)da=1. (78)

We can then update (ψ,ϕ)(\psi,\phi) by enforcing the martingale condition discussed in Theorem 4 and applying the techniques developed in Jia and Zhou (2022a). This procedure is discussed in detail in Section 4.1 of Jia and Zhou (2023), and hence is not repeated here. For the reader’s convenience, we present Algorithms 1 and 2, which summarize the offline and online qq-learning algorithms respectively. These algorithms are based on the so-called martingale orthogonality conditions in Jia and Zhou (2022a), with the typical choices of test functions being ξt=Jψψ(t,Xt𝝅ϕ)\xi_{t}=\frac{\partial J^{\psi}}{\partial\psi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}}) and ζt=qϕϕ(t,Xt𝝅ϕ,at𝝅ϕ)\zeta_{t}=\frac{\partial q^{\phi}}{\partial\phi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}}), where 𝝅ϕ\boldsymbol{\pi}^{\phi} is the policy generated by qϕq^{\phi}. Note that these two algorithms are identical to Algorithms 2 and 3 in Jia and Zhou (2023).

Algorithm 1 Offline–Episodic qq-Learning Algorithm

Inputs: initial state x0x_{0}, horizon TT, time step Δt\Delta t, number of episodes NN, number of mesh grids KK, initial learning rates αψ,αϕ\alpha_{\psi},\alpha_{\phi} and a learning rate schedule function l()l(\cdot) (a function of the number of episodes), functional forms of parameterized value function Jψ(,)J^{\psi}(\cdot,\cdot) and qq-function qϕ(,,)q^{\phi}(\cdot,\cdot,\cdot) satisfying (78), functional forms of test functions 𝝃(t,xt,at)\boldsymbol{\xi}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}) and 𝜻(t,xt,at)\boldsymbol{\zeta}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}), and temperature parameter θ\theta.

Required program (on-policy): environment simulator (x,r)=EnvironmentΔt(t,x,a)(x^{\prime},r)=\textit{Environment}_{\Delta t}(t,x,a) that takes current time–state pair (t,x)(t,x) and action aa as inputs and generates state xx^{\prime} at time t+Δtt+\Delta t and instantaneous reward rr at time tt as outputs. Policy 𝝅ϕ(a|t,x)=exp(1θqϕ(t,x,a))\boldsymbol{\pi}^{\phi}(a|t,x)=\exp\left(\frac{1}{\theta}q^{\phi}(t,x,a)\right).

Required program (off-policy): observations {atk,rtk,xtk+1}k=0,,K1{xtK,h(xtK)}=Observation(Δt)\{a_{t_{k}},r_{t_{k}},x_{t_{k+1}}\}_{k=0,\cdots,K-1}\cup\{x_{t_{K}},h(x_{t_{K}})\}=\textit{Observation}(\Delta t) including the observed actions, rewards, and state trajectories under the given behavior policy at the sampling time grid with step size Δt\Delta t.

Learning procedure:

  Initialize ψ,ϕ\psi,\phi.
  for episode j=1j=1 to NN do
     Initialize k=0k=0. Observe initial state x0x_{0} and store xtkx0x_{t_{k}}\leftarrow x_{0}.
     On-policy case:
     while k<Kk<K do
         Generate action atk𝝅ϕ(|tk,xtk)a_{t_{k}}\sim\boldsymbol{\pi}^{\phi}(\cdot|t_{k},x_{t_{k}}). Apply atka_{t_{k}} to environment simulator (x,r)=EnvironmentΔt(tk,xtk,atk)(x,r)=Environment_{\Delta t}(t_{k},x_{t_{k}},a_{t_{k}}), and observe new state xx and reward rr as outputs. Store xtk+1xx_{t_{k+1}}\leftarrow x and rtkrr_{t_{k}}\leftarrow r. Update kk+1k\leftarrow k+1.
     end while
     Off-policy case: Obtain one observation {atk,rtk,xtk+1}k=0,,K1{xtK,h(xtK)}=Observation(Δt)\{a_{t_{k}},r_{t_{k}},x_{t_{k+1}}\}_{k=0,\cdots,K-1}\cup\{x_{t_{K}},h(x_{t_{K}})\}=\textit{Observation}(\Delta t).
     For every k=0,1,,K1k=0,1,\cdots,K-1, compute and store test functions ξtk=𝝃(tk,xt0,,xtk,at0,,atk)\xi_{t_{k}}=\boldsymbol{\xi}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}), ζtk=𝜻(tk,xt0,,xtk,at0,,atk)\zeta_{t_{k}}=\boldsymbol{\zeta}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}). Compute
Δψ=i=0K1ξti[Jψ(ti+1,xti+1)Jψ(ti,xti)+rtiΔtqϕ(ti,xti,ati)ΔtβJψ(ti,xti)Δt],\Delta\psi=\sum_{i=0}^{K-1}\xi_{t_{i}}\big{[}J^{\psi}(t_{i+1},x_{t_{i+1}})-J^{\psi}(t_{i},x_{t_{i}})+r_{t_{i}}\Delta t-q^{\phi}(t_{i},x_{t_{i}},a_{t_{i}})\Delta t-\beta J^{\psi}(t_{i},x_{t_{i}})\Delta t\big{]},
Δϕ=i=0K1ζti[Jψ(ti+1,xti+1)Jψ(ti,xti)+rtiΔtqϕ(ti,xti,ati)ΔtβJψ(ti,xti)Δt].\Delta\phi=\sum_{i=0}^{K-1}\zeta_{t_{i}}\big{[}J^{\psi}(t_{i+1},x_{t_{i+1}})-J^{\psi}(t_{i},x_{t_{i}})+r_{t_{i}}\Delta t-q^{\phi}(t_{i},x_{t_{i}},a_{t_{i}})\Delta t-\beta J^{\psi}(t_{i},x_{t_{i}})\Delta t\big{]}.
Update ψ\psi and ϕ\phi by
ψψ+l(j)αψΔψ.\psi\leftarrow\psi+l(j)\alpha_{\psi}\Delta\psi.
ϕϕ+l(j)αϕΔϕ.\phi\leftarrow\phi+l(j)\alpha_{\phi}\Delta\phi.
  end for
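A minimal Python sketch of the end-of-episode updates Δψ and Δϕ in Algorithm 1 is given below, with the test functions taken as ξ=∂Jψ/∂ψ and ζ=∂qϕ/∂ϕ as mentioned above; the callables J, q and their parameter gradients, the trajectory format and the learning-rate handling are assumptions of this sketch rather than prescriptions of the paper.

```python
import numpy as np

def offline_update(traj, J, dJ_dpsi, q, dq_dphi, psi, phi, beta, dt,
                   lr_psi, lr_phi):
    """One offline update of (psi, phi) from a single episode, forming the
    sums Delta psi and Delta phi of Algorithm 1 with the test functions
    xi = dJ/dpsi and zeta = dq/dphi. traj is a list of (t, x, a, r) tuples
    followed by a terminal (T, x_T, None, None); all callables are assumed
    differentiable parameterizations supplied by the user."""
    d_psi, d_phi = np.zeros_like(psi), np.zeros_like(phi)
    for (t, x, a, r), (t_next, x_next, _, _) in zip(traj[:-1], traj[1:]):
        # temporal difference of the value function along the data
        td = (J(t_next, x_next, psi) - J(t, x, psi)
              + r * dt - q(t, x, a, phi) * dt - beta * J(t, x, psi) * dt)
        d_psi += dJ_dpsi(t, x, psi) * td
        d_phi += dq_dphi(t, x, a, phi) * td
    return psi + lr_psi * d_psi, phi + lr_phi * d_phi
```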
Algorithm 2 Online-Incremental qq-Learning Algorithm

Inputs: initial state x0x_{0}, horizon TT, time step Δt\Delta t, number of mesh grids KK, initial learning rates αψ,αϕ\alpha_{\psi},\alpha_{\phi} and learning rate schedule function l()l(\cdot) (a function of the number of episodes), functional forms of parameterized value function Jψ(,)J^{\psi}(\cdot,\cdot) and qq-function qϕ(,,)q^{\phi}(\cdot,\cdot,\cdot) satisfying (78), functional forms of test functions 𝝃(t,xt,at)\boldsymbol{\xi}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}) and 𝜻(t,xt,at)\boldsymbol{\zeta}(t,x_{\cdot\wedge t},a_{\cdot\wedge t}), and temperature parameter θ\theta.

Required program (on-policy): environment simulator (x,r)=EnvironmentΔt(t,x,a)(x^{\prime},r)=\textit{Environment}_{\Delta t}(t,x,a) that takes current time–state pair (t,x)(t,x) and action aa as inputs and generates state xx^{\prime} at time t+Δtt+\Delta t and instantaneous reward rr at time tt as outputs. Policy 𝝅ϕ(a|t,x)=exp(1θqϕ(t,x,a))\boldsymbol{\pi}^{\phi}(a|t,x)=\exp\left(\frac{1}{\theta}q^{\phi}(t,x,a)\right).

Required program (off-policy): observations {a,r,x}=Observation(t,x;Δt)\{a,r,x^{\prime}\}=\textit{Observation}(t,x;\Delta t) including the observed actions, rewards, and state when the current time-state pair is (t,x)(t,x) under the given behavior policy at the sampling time grid with step size Δt\Delta t.

Learning procedure:

  Initialize ψ,ϕ\psi,\phi.
  for episode j=1j=1 to \infty do
     Initialize k=0k=0. Observe initial state x0x_{0} and store xtkx0x_{t_{k}}\leftarrow x_{0}.
     while k<Kk<K do
         On-policy case: Generate action atk𝝅ϕ(|tk,xtk)a_{t_{k}}\sim\boldsymbol{\pi}^{\phi}(\cdot|t_{k},x_{t_{k}}). Apply atka_{t_{k}} to environment simulator (x,r)=EnvironmentΔt(tk,xtk,atk)(x,r)=Environment_{\Delta t}(t_{k},x_{t_{k}},a_{t_{k}}), and observe new state xx and reward rr as outputs. Store xtk+1xx_{t_{k+1}}\leftarrow x and rtkrr_{t_{k}}\leftarrow r.
         Off-policy case: Obtain one observation atk,rtk,xtk+1=Observation(tk,xtk;Δt)a_{t_{k}},r_{t_{k}},x_{t_{k+1}}=\textit{Observation}(t_{k},x_{t_{k}};\Delta t).
         Compute test functions ξtk=𝝃(tk,xt0,,xtk,at0,,atk)\xi_{t_{k}}=\boldsymbol{\xi}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}), ζtk=𝜻(tk,xt0,,xtk,at0,,atk)\zeta_{t_{k}}=\boldsymbol{\zeta}(t_{k},x_{t_{0}},\cdots,x_{t_{k}},a_{t_{0}},\cdots,a_{t_{k}}). Compute
δ=Jψ(tk+1,xtk+1)Jψ(tk,xtk)+rtkΔtqϕ(tk,xtk,atk)ΔtβJψ(tk,xtk)Δt,\displaystyle\delta=J^{\psi}(t_{k+1},x_{t_{k+1}})-J^{\psi}(t_{k},x_{t_{k}})+r_{t_{k}}\Delta t-q^{\phi}(t_{k},x_{t_{k}},a_{t_{k}})\Delta t-\beta J^{\psi}(t_{k},x_{t_{k}})\Delta t,
Δψ=ξtkδ,\displaystyle\Delta\psi=\xi_{t_{k}}\delta,
Δϕ=ζtkδ.\displaystyle\Delta\phi=\zeta_{t_{k}}\delta.
Update ψ\psi and ϕ\phi by
ψψ+l(j)αψΔψ.\psi\leftarrow\psi+l(j)\alpha_{\psi}\Delta\psi.
ϕϕ+l(j)αϕΔϕ.\phi\leftarrow\phi+l(j)\alpha_{\phi}\Delta\phi.
Update kk+1k\leftarrow k+1
     end while
  end for
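For comparison, here is a sketch of the single-transition update in Algorithm 2, where the temporal difference δ is formed from one observed step and applied immediately; all callables, parameters and learning rates are again placeholders.

```python
def online_step(t, x, a, r, x_next, J, dJ_dpsi, q, dq_dphi,
                psi, phi, beta, dt, lr_psi, lr_phi):
    """One online-incremental update of (psi, phi) from a single observed
    transition, mirroring the per-step delta of Algorithm 2."""
    delta = (J(t + dt, x_next, psi) - J(t, x, psi)
             + r * dt - q(t, x, a, phi) * dt - beta * J(t, x, psi) * dt)
    return (psi + lr_psi * dJ_dpsi(t, x, psi) * delta,
            phi + lr_phi * dq_dphi(t, x, a, phi) * delta)
```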

When the normalizing constant in the Gibbs measure is not available, we take the same approach as in Jia and Zhou (2023) to develop learning algorithms. Specifically, we consider {𝝅ϕ(|t,x)}ϕΦ\{\boldsymbol{\pi}^{\phi}(\cdot|t,x)\}_{\phi\in\Phi}, which is a family of density functions of some tractable distributions, e.g. multivariate normal distributions. Starting from a stochastic policy 𝝅ϕ\boldsymbol{\pi}^{\phi} in this family, we update the policy by considering the optimization problem

minϕΦKL(𝝅ϕ(|t,x)|exp(1θq(t,x,;𝝅ϕ))).\displaystyle\min_{\phi^{\prime}\in\Phi}\text{KL}\left(\boldsymbol{\pi}^{\phi^{\prime}}(\cdot|t,x)\Big{|}\exp\left(\frac{1}{\theta}q(t,x,\cdot;\boldsymbol{\pi}^{\phi})\right)\right).

Specifically, using gradient descent, we can update ϕ\phi as in Jia and Zhou (2023), by

ϕϕθαϕdt[log𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ)1θq(t,Xt𝝅ϕ,at𝝅ϕ;𝝅ϕ)]ϕlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ).\displaystyle\phi\leftarrow\phi-\theta\alpha_{\phi}dt\left[\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}})-\frac{1}{\theta}q(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})\right]\frac{\partial}{\partial\phi}\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}}). (79)

In the above updating rule, we need only the values of the qq-function along the trajectory – the “data” – {(t,Xt𝝅ϕ,at𝝅ϕ);0tT}\{(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}});0\leq t\leq T\}, instead of its full functional form. These values can be learned through the “temporal difference” of the value function along the data. To see this, applying Itô’s formula (15) to J(,;𝝅ϕ)J(\cdot,\cdot;\boldsymbol{\pi}^{\phi}), we have

q(t,Xt𝝅ϕ,at𝝅ϕ;𝝅ϕ)dt\displaystyle q(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})dt =dJ(t,Xt𝝅ϕ;𝝅ϕ)+[r(t,Xt𝝅ϕ,at𝝅ϕ)βJ(t,Xt𝝅ϕ;𝝅ϕ)]dt\displaystyle=dJ(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})+[r(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})-\beta J(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})]dt (80)
Jx(t,Xt𝝅ϕ;𝝅ϕ)σ(t,Xt𝝅ϕ,at𝝅ϕ)dWt\displaystyle\quad-J_{x}(t,X_{t-}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi}){\sigma}(t,X_{t-}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})dW_{t}
d(J(t,Xt𝝅ϕ+γ(t,Xt𝝅ϕ,at𝝅ϕ,z);𝝅ϕ)J(t,Xt𝝅ϕ;𝝅ϕ))N~(dt,dz).\displaystyle\quad-\int_{\mathbb{R}^{d}}\left(J(t,X^{\boldsymbol{\pi}^{\phi}}_{t-}+{\gamma}(t,X^{\boldsymbol{\pi}^{\phi}}_{t-},a_{t}^{\boldsymbol{\pi}^{\phi}},z);\boldsymbol{\pi}^{\phi})-J(t,X^{\boldsymbol{\pi}^{\phi}}_{t-};\boldsymbol{\pi}^{\phi})\right)\widetilde{N}(dt,dz). (81)

We may ignore the dWtdW_{t} and N~(dt,dz)\widetilde{N}(dt,dz) terms which are martingale differences with mean zero, and then the updating rule in (79) becomes

ϕϕ+αϕ[θlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ)dt+dJ(t,Xt𝝅ϕ;𝝅ϕ)+(r(t,Xt𝝅ϕ,at𝝅ϕ)βJ(t,Xt𝝅ϕ;𝝅ϕ))dt]\displaystyle\phi\leftarrow\phi+\alpha_{\phi}\left[-\theta\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}})dt+dJ(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})+\left(r(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})-\beta J(t,X_{t}^{\boldsymbol{\pi}^{\phi}};\boldsymbol{\pi}^{\phi})\right)dt\right] (82)
ϕlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ).\displaystyle\qquad\qquad\qquad\cdot\frac{\partial}{\partial\phi}\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}}). (83)

Using Jψ(,)J^{\psi}(\cdot,\cdot) as the parameterized function approximator for J(,;𝝅ϕ)J(\cdot,\cdot;\boldsymbol{\pi}^{\phi}), we arrive at the updating rule for the policy parameter ϕ\phi:

ϕϕ+αϕ[θlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ)dt+dJψ(t,Xt𝝅ϕ)+(r(t,Xt𝝅ϕ,at𝝅ϕ)βJψ(t,Xt𝝅ϕ))dt]\displaystyle\phi\leftarrow\phi+\alpha_{\phi}\left[-\theta\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}})dt+dJ^{\psi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}})+\left(r(t,X_{t}^{\boldsymbol{\pi}^{\phi}},a_{t}^{\boldsymbol{\pi}^{\phi}})-\beta J^{\psi}(t,X_{t}^{\boldsymbol{\pi}^{\phi}})\right)dt\right] (84)
ϕlog𝝅ϕ(at𝝅ϕ|t,Xt𝝅ϕ).\displaystyle\qquad\qquad\qquad\cdot\frac{\partial}{\partial\phi}\log\boldsymbol{\pi}^{\phi}(a_{t}^{\boldsymbol{\pi}^{\phi}}|t,X_{t-}^{\boldsymbol{\pi}^{\phi}}). (85)

Therefore, we can update ψ\psi using the PE methods in Jia and Zhou (2022a), and update ϕ\phi using the above rule, leading to actor–critic-type algorithms.
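A minimal sketch of one time-discretized actor step implementing (84)–(85) follows; the parameterized log-density log πϕ, its gradient in ϕ, and the critic Jψ are assumed to be supplied by the user, and dJψ is approximated by the one-step increment of Jψ along the observed data.

```python
def actor_step(t, x_prev, x, a, r, log_pi, dlogpi_dphi, J_psi,
               phi, theta, beta, dt, lr_phi):
    """One time-discretized actor step following (84)-(85): dJ^psi is
    replaced by the one-step increment of the critic along the data, and the
    running reward and entropy terms are scaled by dt. The callables log_pi,
    dlogpi_dphi and J_psi are assumed user-supplied parameterizations."""
    dJ = J_psi(t + dt, x) - J_psi(t, x_prev)
    weight = (-theta * log_pi(a, t, x_prev, phi) * dt
              + dJ + (r - beta * J_psi(t, x_prev)) * dt)
    return phi + lr_phi * weight * dlogpi_dphi(a, t, x_prev, phi)
```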

To conclude, we are able to use the same RL algorithms to learn the optimal policy and optimal value function, without having to know a priori whether the unknown environment entails a pure diffusion or a jump-diffusion. This important conclusion is based on the theoretical analysis carried out in the previous sections.

5 Application: Mean–Variance Portfolio Selection

We now present an applied example of the general theory and algorithms derived. Consider investing in a market where there are a risk-free asset and a risky asset (e.g. a stock or an index). The risk-free rate is given by a constant rfr_{f} and the risky asset price process follows

dSt=St[μdt+σdWt+(exp(z)1)N~(dt,dz)].dS_{t}=S_{t-}\left[\mu dt+\sigma dW_{t}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(dt,dz)\right]. (86)

Let XtX_{t} be the discounted wealth value at time tt, and let ata_{t} be the discounted dollar value of the investment in the risky asset. The self-financing discounted wealth process follows

dXta=atσρdt+atσdWt+at(exp(z)1)N~(dt,dz),dX^{a}_{t}=a_{t}\sigma\rho dt+a_{t}\sigma dW_{t}+a_{t}\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(dt,dz), (87)

where ρ\rho is the Sharpe ratio of the risky asset, given by

ρ=μrfσ.\rho=\frac{\mu-r_{f}}{\sigma}. (88)

We assume

|z|>1exp(z)ν(dz)<,|z|>1exp(2z)ν(dz)<.\displaystyle\int_{|z|>1}\exp(z)\nu(dz)<\infty,\quad\int_{|z|>1}\exp(2z)\nu(dz)<\infty. (89)

Condition (89) implies that 𝔼[St]\mathbb{E}[S_{t}] and 𝔼[St2]\mathbb{E}[S_{t}^{2}] are finite for every t0t\geq 0; see (Cont and Tankov 2004, Proposition 3.14). We set

σJ2(exp(z)1)2ν(dz),\sigma_{J}^{2}\coloneqq\int_{\mathbb{R}}(\exp(z)-1)^{2}\nu(dz), (90)

which is finite by conditions (2) and (89).

Fix the investment horizon as [0,T][0,T]. The mean-variance (MV) portfolio selection problem considers

minaVar[XTa] subject to 𝔼[XTa]=z.\min_{a}\operatorname{Var}\left[X_{T}^{a}\right]\quad\text{ subject to }\mathbb{E}\left[X_{T}^{a}\right]=z. (91)

We seek the optimal pre-committed strategy for the MV problem as in Zhou and Li (2000). We can transform the above constrained problem into an unconstrained one by introducing a Lagrange multiplier, which yields

mina𝔼[(XTa)2]z22ω(𝔼[XTa]z)=mina𝔼[(XTaω)2](ωz)2.\min_{a}\mathbb{E}\left[\left(X_{T}^{a}\right)^{2}\right]-z^{2}-2\omega\left(\mathbb{E}\left[X_{T}^{a}\right]-z\right)=\min_{a}\mathbb{E}\left[\left(X_{T}^{a}-\omega\right)^{2}\right]-(\omega-z)^{2}. (92)

Note that the optimal solution to the unconstrained minimization problem depends on ω\omega, and we can obtain the optimal multiplier ω\omega^{*} by solving 𝔼[XTa(ω)]=z\mathbb{E}\left[X_{T}^{a^{*}}(\omega)\right]=z.

The exploratory formulation of the problem is

min𝝅𝔼t,x¯[(XT𝝅ω)2+θtTlog𝝅(as𝝅|s,Xs𝝅)𝑑s](ωz)2,\displaystyle\min_{\boldsymbol{\pi}}\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\left(X_{T}^{\boldsymbol{\pi}}-\omega\right)^{2}+\theta\int_{t}^{T}\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})ds\right]-(\omega-z)^{2}, (93)

where the discounted wealth under a stochastic policy 𝝅\boldsymbol{\pi} follows

dXs𝝅=as𝝅σρds+as𝝅σdWs+as𝝅(exp(z)1)N~(ds,dz),s[t,T],Xt𝝅=x.dX^{\boldsymbol{\pi}}_{s}=a_{s}^{\boldsymbol{\pi}}\sigma\rho ds+a_{s}^{\boldsymbol{\pi}}\sigma dW_{s}+a_{s}^{\boldsymbol{\pi}}\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(ds,dz),\quad s\in[t,T],\ X^{\boldsymbol{\pi}}_{t}=x. (94)

5.1 Solution of the exploratory control problem

We consider the HJB equation for problem (93):

Vt(t,x)+inf𝝅𝒫(){H(t,x,a,Vx,Vxx,V)+θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑a=0,\displaystyle\frac{\partial V}{\partial t}(t,x)+\inf_{\boldsymbol{\pi}\in\mathcal{P}(\mathbb{R})}\int_{\mathbb{R}}\{H(t,x,a,V_{x},V_{xx},V)+\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da=0, (95)

with the terminal condition V(T,x)=(xω)2(ωz)2V(T,x)=(x-\omega)^{2}-(\omega-z)^{2}. Note that supremum becomes infimum and the sign before θlog𝝅(a|t,x)\theta\log\boldsymbol{\pi}(a|t,x) flips compared with (59) because we consider minimization here. The Hamiltonian of the problem is given by

H(t,x,a,Vx,Vxx,V)\displaystyle H(t,x,a,V_{x},V_{xx},V) =aσρVx(t,x)+12a2σ2Vxx(t,x)\displaystyle=a\sigma\rho V_{x}(t,x)+\frac{1}{2}a^{2}\sigma^{2}V_{xx}(t,x) (96)
+(V(t,x+γ(a,z))V(t,x)γ(a,z)Vx(t,x))ν(dz),\displaystyle+\int_{\mathbb{R}}\left(V(t,x+\gamma(a,z))-V(t,x)-\gamma(a,z)V_{x}(t,x)\right)\nu(dz), (97)

where γ(a,z)=a(ez1).\gamma(a,z)=a(e^{z}-1). We take the following ansatz for the solution of the HJB equation (95):

V(t,x)=(xω)2f(t)+g(t)(ωz)2.V(t,x)=(x-\omega)^{2}f(t)+g(t)-(\omega-z)^{2}. (98)

As V(t,x)V(t,x) is quadratic in xx, we can easily calculate the integral term in the Hamiltonian and obtain

H(t,x,a,Vx,Vxx,V)=aσρVx(t,x)+12a2(σ2+σJ2)Vxx(t,x).H(t,x,a,V_{x},V_{xx},V)=a\sigma\rho V_{x}(t,x)+\frac{1}{2}a^{2}(\sigma^{2}+\sigma_{J}^{2})V_{xx}(t,x). (99)

The probability density function that minimizes the integral in (95) is given by

𝝅c(|t,x)exp(1θH(t,x,a,Vx,Vxx,V)),\boldsymbol{\pi}_{c}(\cdot|t,x)\propto\exp\left(-\frac{1}{\theta}H(t,x,a,V_{x},V_{xx},V)\right), (100)

which is a candidate for the optimal stochastic policy. From (99), we obtain

𝝅c(|t,x)𝒩(σρVx(σ2+σJ2)Vxx,θ(σ2+σJ2)Vxx).\boldsymbol{\pi}_{c}(\cdot|t,x)\sim\mathcal{N}\left(\cdot\mid-\frac{\sigma\rho V_{x}}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}},\frac{\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}\right). (101)

Substituting it back into the HJB equation (95), we obtain the following nonlinear PDE:

Vtρ2σ22(σ2+σJ2)(Vx)2Vxxθ2ln2πθ(σ2+σJ2)Vxx=0,(t,x)[0,T)×,\displaystyle V_{t}-\frac{\rho^{2}\sigma^{2}}{2(\sigma^{2}+\sigma_{J}^{2})}\frac{(V_{x})^{2}}{V_{xx}}-\frac{\theta}{2}\ln\frac{2\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}=0,\quad(t,x)\in[0,T)\times\mathbb{R}, (102)
V(T,x)=(xω)2(ωz)2.\displaystyle V(T,x)=(x-\omega)^{2}-(\omega-z)^{2}. (103)

We plug in the ansatz (98) to the above PDE and obtain that f(t)f(t) satisfies

f(t)ρ2σ2σ2+σJ2f(t)=0,f(T)=1,\displaystyle f^{\prime}(t)-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}f(t)=0,\ f(T)=1, (104)

and g(t)g(t) satisfies

g(t)θ2lnπθ(σ2+σJ2)f(t)=0,g(T)=0.\displaystyle g^{\prime}(t)-\frac{\theta}{2}\ln\frac{\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})f(t)}=0,\ g(T)=0. (105)

These two ordinary differential equations can be solved analytically, and we obtain

V(t,x)=\displaystyle V(t,x)= (xω)2exp(ρ2σ2σ2+σJ2(Tt))+θρ2σ24(σ2+σJ2)(T2t2)\displaystyle(x-\omega)^{2}\exp\left(-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right)+\frac{\theta\rho^{2}\sigma^{2}}{4(\sigma^{2}+\sigma_{J}^{2})}(T^{2}-t^{2}) (106)
θ2(ρ2σ2σ2+σJ2T+lnπθσ2+σJ2)(Tt)(ωz)2.\displaystyle-\frac{\theta}{2}\left(\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}T+\ln\frac{\pi\theta}{\sigma^{2}+\sigma_{J}^{2}}\right)(T-t)-(\omega-z)^{2}. (107)

It follows that

𝝅c(|t,x)𝒩(|σρσ2+σJ2(xω),θ2(σ2+σJ2)exp(ρ2σ2σ2+σJ2(Tt))).\boldsymbol{\pi}_{c}(\cdot|t,x)\sim\mathcal{N}\left(\cdot\ \Big{|}-\frac{\sigma\rho}{\sigma^{2}+\sigma_{J}^{2}}(x-\omega),\frac{\theta}{2(\sigma^{2}+\sigma_{J}^{2})}\exp\left(\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right)\right). (108)

It is straightforward to verify that 𝝅c\boldsymbol{\pi}_{c} is admissible by checking the four conditions in Definition 1. Furthermore, VV solves the HJB equation (95) and has quadratic growth. Therefore, by Lemma 5, we have the following conclusion.

Proposition 2.

For the unconstrained MV problem (92), the optimal value function J(t,x)=V(t,x)J^{*}(t,x)=V(t,x) and the optimal stochastic policy 𝛑=𝛑c\boldsymbol{\pi}^{*}=\boldsymbol{\pi}_{c}.

When there are no jumps, we have σJ2=0\sigma_{J}^{2}=0 and thus recover the expressions of the optimal value function and optimal policy derived in Wang and Zhou (2020) for the unconstrained MV problem in the pure diffusion setting.
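For concreteness, the sketch below evaluates and samples from the optimal stochastic policy (108), assuming the model primitives are known; setting σJ²=0 recovers the pure-diffusion policy. In the RL setting these primitives are unknown, which is why the policy is parameterized in the next subsection.

```python
import numpy as np

def mv_optimal_action(t, x, T, mu, r_f, sigma, sigma_J2, theta, omega,
                      rng=None):
    """Sample an action from the optimal stochastic policy (108) of the
    exploratory MV problem, given (assumed known) model primitives; with
    sigma_J2 = 0 this reduces to the pure-diffusion policy of Wang and
    Zhou (2020)."""
    rng = rng or np.random.default_rng()
    rho = (mu - r_f) / sigma                      # Sharpe ratio (88)
    total = sigma**2 + sigma_J2
    mean = -sigma * rho / total * (x - omega)
    var = theta / (2.0 * total) * np.exp(rho**2 * sigma**2 / total * (T - t))
    return rng.normal(mean, np.sqrt(var))
```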

5.2 Parametrizations for qq-learning

It is important to observe that the optimal value function, the optimal policy and the Hamiltonian given by (99) take the same structural forms regardless of the presence of jumps; the only differences lie in the constant coefficients of those functions. However, those coefficients are unknown anyway and will be parameterized in the implementation of our RL algorithms. Consequently, we can use the same parameterizations of the optimal value function and optimal qq-function for learning as in the diffusion setting of Jia and Zhou (2023). This important insight, reached only after a rigorous theoretical analysis, shows that the continuous-time RL algorithms are robust to the presence of jumps and essentially model-free, at least for the MV portfolio selection problem.

Following Jia and Zhou (2023), we parametrize the value function as

Jψ(t,x;ω)=(xω)2eψ3(Tt)+ψ2(t2T2)+ψ1(tT)(ωz)2,J^{\psi}(t,x;\omega)=(x-\omega)^{2}e^{-\psi_{3}(T-t)}+\psi_{2}\left(t^{2}-T^{2}\right)+\psi_{1}(t-T)-(\omega-z)^{2}, (109)

and the qq-function as

qϕ(t,x,a;w)=eϕ2ϕ3(Tt)2(a+ϕ1(xw))2θ2[log2πθ+ϕ2+ϕ3(Tt)].q^{\phi}(t,x,a;w)=-\frac{e^{-\phi_{2}-\phi_{3}(T-t)}}{2}\left(a+\phi_{1}(x-w)\right)^{2}-\frac{\theta}{2}\left[\log 2\pi\theta+\phi_{2}+\phi_{3}(T-t)\right]. (110)

Let ψ=(ψ1,ψ2,ψ3)\psi=\left(\psi_{1},\psi_{2},\psi_{3}\right)^{\top} and ϕ=(ϕ1,ϕ2,ϕ3)\phi=\left(\phi_{1},\phi_{2},\phi_{3}\right)^{\top}. The policy associated with the parametric qq-function is 𝝅ϕ(t,x;w)=𝒩(ϕ1(xw),θeϕ2+ϕ3(Tt)){\boldsymbol{\pi}}^{\phi}(\cdot\mid t,x;w)=\mathcal{N}\left(-\phi_{1}(x-w),\theta e^{\phi_{2}+\phi_{3}(T-t)}\right). In addition to ψ\psi and ϕ\phi, we learn the Lagrange multiplier ω\omega in the same way as in Jia and Zhou (2023) by the stochastic approximation algorithm that updates ω\omega with a learning rate after a fixed number of iterations.
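A minimal sketch of these parameterizations is given below: Jψ as in (109), qϕ as in (110) (which satisfies the normalization in (78) by construction), and the induced Gaussian policy sampler. We write omega for the Lagrange multiplier; all function and variable names are ours and merely illustrative.

```python
import numpy as np

def J_psi(t, x, psi, omega, z, T):
    """Parameterized value function (109); psi = (psi1, psi2, psi3); z is the
    target mean of terminal wealth."""
    psi1, psi2, psi3 = psi
    return ((x - omega)**2 * np.exp(-psi3 * (T - t))
            + psi2 * (t**2 - T**2) + psi1 * (t - T) - (omega - z)**2)

def q_phi(t, x, a, phi, omega, theta, T):
    """Parameterized q-function (110); phi = (phi1, phi2, phi3). It satisfies
    the normalization constraint in (78) by construction."""
    phi1, phi2, phi3 = phi
    return (-0.5 * np.exp(-phi2 - phi3 * (T - t)) * (a + phi1 * (x - omega))**2
            - 0.5 * theta * (np.log(2 * np.pi * theta) + phi2 + phi3 * (T - t)))

def sample_action(t, x, phi, omega, theta, T, rng=None):
    """Sample from the Gaussian policy generated by q_phi."""
    phi1, phi2, phi3 = phi
    rng = rng or np.random.default_rng()
    return rng.normal(-phi1 * (x - omega),
                      np.sqrt(theta * np.exp(phi2 + phi3 * (T - t))))
```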

5.3 Simulation study

We assess the effect of jumps on the convergence behavior of our algorithms via a simulation study. We use the same basic setting as in Jia and Zhou (2023): x0=1x_{0}=1, z=1.4z=1.4, T=1T=1 year, Δt=1/252\Delta t=1/252 years (corresponding to one trading day), and a chosen temperature parameter θ=0.1\theta=0.1. We consider two market simulators: one is given by the Black–Scholes (BS) model and the other is Merton’s jump-diffusion (MJD) model in which the Lévy density is a scaled Gaussian density, i.e.,

ν(z)=λ12πδ2exp((zm)22δ2),λ>0,δ>0,m,\nu(z)=\lambda\frac{1}{\sqrt{2\pi\delta^{2}}}\exp\left(-\frac{(z-m)^{2}}{2\delta^{2}}\right),\ \lambda>0,\delta>0,m\in\mathbb{R}, (111)

where λ\lambda is the arrival rate of the Poisson jumps. The Gaussian assumption is a common choice in the finance literature for modeling the jump-size distribution (see e.g. Merton 1976, Bates 1991, Das 2002), partly due to its tractability for statistical estimation and partly because heavy-tailed distributions may not be easily identified from real data when the number of jumps is limited (see Heyde and Kou 2004).

Under the latter model, we have

σJ2=λ[exp(2m+2δ2)2exp(m+12δ2)+1].\sigma_{J}^{2}=\lambda\left[\exp\left(2m+2\delta^{2}\right)-2\exp\left(m+\frac{1}{2}\delta^{2}\right)+1\right]. (112)

To mimic the real market, we set the parameters of these two simulators by estimating them from daily data of the S&P 500 index using maximum likelihood estimation (MLE). Our estimation data cover a long period from the beginning of 2000 to the end of 2023. In Table 1, we summarize the estimated parameter values (used for the simulators) and the corresponding value of ϕ1\phi_{1}^{*} in the optimal policy. Note that although we use a stochastic policy to interact with the environment during training to update the policy parameters, for actual execution of portfolio selection we apply a deterministic policy, namely the mean part of the optimal stochastic policy after it has been learned; in this sense the learning here is off-policy. One advantage of doing so, among others, is to reduce the variance of the final wealth; see Huang et al. (2022) for a discussion of this approach. As a result, here we only display the values of ϕ1\phi_{1}^{*} in these two environments and use them as benchmarks to check the convergence of our algorithm (see Figure 1).

Simulator Parameters Optimal
BS μ=0.0690,σ=0.1965\mu=0.0690,\sigma=0.1965 ϕ1=1.5940\phi_{1}^{*}=1.5940
MJD μ=0.0636,σ=0.1347,λ=28.4910,m=0.0039,δ=0.0275\mu=0.0636,\sigma=0.1347,\lambda=28.4910,m=-0.0039,\delta=0.0275 ϕ1=1.7869\phi_{1}^{*}=1.7869
Table 1: Parameters used in the two simulators. The column “Optimal” reports the values of ϕ1\phi_{1} in the optimal policies calculated using the respective simulators.

For offline learning, the Lagrange multiplier ω\omega is updated after every m=10m=10 iterations, and the parameter vectors ψ\psi and ϕ\phi are initialized as zero vectors. The learning rates are set to be αw=0.05\alpha_{w}=0.05, αψ=0.001\alpha_{\psi}=0.001, and αϕ=0.1\alpha_{\phi}=0.1 with decay rate l(j)=j0.51l(j)=j^{-0.51}, where jj is the iteration index. In each iteration, we generate 3232 independent TT-year trajectories to update the parameters. We train the model for N=20000N=20000 iterations.

We also consider online learning with Δt\Delta t equal to one trading day. We select a batch size of 128128 trading days and update the parameters once this number of observations has arrived. We set m=1m=1 for updating ω\omega and initialize ψ\psi and ϕ\phi as zero vectors. The learning rates are set as αw=0.01\alpha_{w}=0.01, αψ=0.001\alpha_{\psi}=0.001, and αϕ=0.05\alpha_{\phi}=0.05 with decay rate l(j)=j0.51l(j)=j^{-0.51}. Some of the rates are notably smaller than in the offline case because we now update with fewer observations and thus must be more cautious. The model is again trained for N=20000N=20000 iterations.

Figure 1 plots the convergence behavior of offline and online learning under both simulators or market environments (one with jumps and one without). The algorithms have converged after a sufficient number of iterations, whether jumps are present or not. This demonstrates that convergence of the offline and online qq-learning algorithms proposed in Jia and Zhou (2023) under the diffusion setting is robust to the presence of jumps for mean–variance portfolio selection. However, jumps in the environment can introduce more variability in the convergence process as seen from the plots.

Figure 1: Convergence of the offline and online qq-Learning algorithms under two market simulators for the policy parameter ϕ1\phi_{1} (iteration index on the xx-axis)

5.4 Effects of jumps

The theoretical analysis so far in this section shows that, for the mean–variance problem, one does not need to know in advance whether or not the stock prices have jumps in order to carry out the RL task, because the optimal stochastic policy is Gaussian and the corresponding value function and qq-function have the same structures for parameterization irrespective of the presence of jumps. However, we stress that this is the exception rather than the rule. Here we give a counterexample.

Consider a modification of the mean–variance problem where the controlled system dynamics is

dXta=atσρdt+atσdWt+γ(at,z)N~(dt,dz),\displaystyle dX_{t}^{a}=a_{t}\sigma\rho dt+a_{t}\sigma dW_{t}+\int_{\mathbb{R}}\gamma(a_{t},z)\widetilde{N}(dt,dz), (113)

with

γ(a,z)=a2,\displaystyle\gamma(a,z)=a^{2}, (114)

and the exploratory objective is

J(t,x;ω)=min𝝅𝔼t,x¯[(XT𝝅ω)2+θtTlog𝝅(as𝝅|s,Xs𝝅)𝑑s](ωz)2.\displaystyle J^{*}(t,x;\omega)=\min_{\boldsymbol{\pi}}\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[(X_{T}^{\boldsymbol{\pi}}-\omega)^{2}+\theta\int_{t}^{T}\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})ds\right]-(\omega-z)^{2}. (115)

Note that this is not a mean–variance portfolio selection problem because (113) does not correspond to a self-financed wealth equation with a reasonably modelled stock price process.

The Hamiltonian is given by

H(t,x,a,vx,vxx,v)\displaystyle H(t,x,a,v_{x},v_{xx},v) =aσρvx(t,x)+12a2σ2vxx(t,x)\displaystyle=a\sigma\rho v_{x}(t,x)+\frac{1}{2}a^{2}\sigma^{2}v_{xx}(t,x) (116)
+(v(t,x+a2)v(t,x)a2vx(t,x))ν(dz).\displaystyle\quad+\int_{\mathbb{R}}\left(v(t,x+a^{2})-v(t,x)-a^{2}\cdot v_{x}(t,x)\right)\nu(dz). (117)

If an optimal stochastic policy exists, then it must be

𝝅(a|t,x)exp(1θH(t,x,a,Jx,Jxx,J)).\displaystyle\boldsymbol{\pi}^{*}(a|t,x)\propto\exp\left(-\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right). (118)

We now show by contradiction that the optimal stochastic policy cannot be Gaussian in this case. Note that if there is no optimal stochastic policy, then this would already demonstrate that jumps matter, because the optimal stochastic policy for the case of no jumps exists and is Gaussian.

Remark 4.

The existence of an optimal stochastic policy in (118) is equivalent to the integrability of the quantity exp(1θH(t,x,a,Jx,Jxx,J))\exp\left(-\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right) over a𝒜=(,)a\in\mathcal{A}=(-\infty,\infty). This integrability depends on the tail behavior of the Hamiltonian and, in particular, the behavior of J(t,x+a2)J^{*}(t,x+a^{2}) when a2a^{2} is large.

Suppose the optimal stochastic policy 𝝅(|t,x)\boldsymbol{\pi}^{*}(\cdot|t,x) is Gaussian for all (t,x)(t,x), implying that the Hamiltonian H(t,x,a,Jx,Jxx,J)H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*}) is a quadratic function of aa. It then follows from (116) that there exist functions h1(t,x)h_{1}(t,x) and h2(t,x)h_{2}(t,x) such that

J(t,x+a2)J(t,x)a2Jx(t,x)=a2h1(t,x)+ah2(t,x),for all (t,x,a).\displaystyle J^{*}(t,x+a^{2})-J^{*}(t,x)-a^{2}J^{*}_{x}(t,x)=a^{2}\cdot h_{1}(t,x)+a\cdot h_{2}(t,x),\quad\text{for all $(t,x,a)$}. (119)

We do not include a term independent of aa on the right-hand side because the left-hand side is zero when a=0a=0. Taking the derivative with respect to aa, we obtain

Jx(t,x+a2)2a2aJx(t,x)=2ah1(t,x)+h2(t,x),for all (t,x,a).\displaystyle J^{*}_{x}(t,x+a^{2})\cdot 2a-2aJ^{*}_{x}(t,x)=2a\cdot h_{1}(t,x)+h_{2}(t,x),\quad\text{for all $(t,x,a)$}. (120)

Setting a=0a=0, we get h2=0.h_{2}=0. It follows that

a[Jx(t,x+a2)Jx(t,x)h1(t,x)]=0for all (t,x,a).\displaystyle a\cdot\left[J^{*}_{x}(t,x+a^{2})-J^{*}_{x}(t,x)-h_{1}(t,x)\right]=0\quad\text{for all $(t,x,a)$}. (121)

Hence we have h1(t,x)=Jx(t,x+a2)Jx(t,x)h_{1}(t,x)=J^{*}_{x}(t,x+a^{2})-J^{*}_{x}(t,x) for any a0.a\neq 0. Sending aa to zero yields h1(t,x)=0h_{1}(t,x)=0 for all (t,x)(t,x). Therefore, we obtain from (119) that

J(t,x+a2)J(t,x)a2Jx(t,x)=0,for all (t,x,a).\displaystyle J^{*}(t,x+a^{2})-J^{*}(t,x)-a^{2}J^{*}_{x}(t,x)=0,\quad\text{for all $(t,x,a)$}. (122)

Taking the derivative in aa in the above, we have

Jx(t,x+a2)Jx(t,x)=0,for all (t,x,a).\displaystyle J^{*}_{x}(t,x+a^{2})-J^{*}_{x}(t,x)=0,\quad\text{for all $(t,x,a)$}. (123)

Thus JxJ^{*}_{x} is constant in xx, i.e., JJ^{*} is affine in xx, leading to J(t,x)=g1(t)x+g2(t)J^{*}(t,x)=g_{1}(t)x+g_{2}(t) for some functions g1(t)g_{1}(t) and g2(t)g_{2}(t). The resulting Hamiltonian becomes

H(t,x,a,Jx,Jxx,J)\displaystyle H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*}) =aσρg1(t).\displaystyle=a\sigma\rho g_{1}(t). (124)

This is linear in a𝒜=(,)a\in\mathcal{A}=(-\infty,\infty) and hence the integral 𝒜exp(1θH(t,x,a,Jx,Jxx,J))𝑑a\int_{\mathcal{A}}\exp\left(-\frac{1}{\theta}H(t,x,a,J^{*}_{x},J^{*}_{xx},J^{*})\right)da does not exist. It follows that 𝝅(|t,x)\boldsymbol{\pi}^{*}(\cdot|t,x) does not exist, which is a contradiction. Therefore, we have shown that under (114), the optimal stochastic policy either does not exist or is not Gaussian when it exists.

Remark 5.

The argument above works for γ(a,z)=am\gamma(a,z)=a^{m} for any m>1m>1.

6 Application: Mean–Variance Hedging of Options

The MV portfolio selection problem considered in Section 5 is an LQ problem. In this section, we present another application that is non-LQ. Consider an option seller who needs to hedge a short position in a European-style option that expires at time TT. The option is written on a risky asset whose price process SS is described by the SDE (86) with condition (89) satisfied. At the terminal time TT, the seller pays G(ST)G(S_{T}) to the option holder. We assume that the seller’s hedging activity will not affect the risky asset price.

To hedge the random payoff, the seller constructs a portfolio consisting of the underlying risky asset and cash. We consider discounted quantities in the problem: the discounted risky asset price S^terftSt\hat{S}_{t}\coloneqq e^{-r_{f}t}S_{t} and discounted payoff G^(S^T):=erfTG(ST)\hat{G}(\hat{S}_{T}):=e^{-r_{f}T}G(S_{T}), where rfr_{f} is again the constant risk-free interest rate. As an example, for a put option with strike price 𝒦\mathcal{K}, G(ST)=(𝒦ST)+G(S_{T})=(\mathcal{K}-S_{T})^{+} (x+:=max(x,0)x^{+}:=\max(x,0)), and G^(S^T)=(𝒦^S^T)+\hat{G}(\hat{S}_{T})=(\hat{\mathcal{K}}-\hat{S}_{T})^{+}, where 𝒦^:=erfT𝒦\hat{\mathcal{K}}:=e^{-r_{f}T}\mathcal{K}. Here S^\hat{S} follows the SDE

dS^t=S^t[ρσdt+σdWt+(exp(z)1)N~(dt,dz)],d\hat{S}_{t}=\hat{S}_{t-}\left[\rho\sigma dt+\sigma dW_{t}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(dt,dz)\right], (125)

where ρ\rho is defined in (88).

We denote the discounted dollar value in the risky asset and the discounted value of the hedging portfolio at time tt by ata_{t} and XtX_{t}, respectively. As in the MV portfolio selection problem, ata_{t} is the control variable and XX follows the SDE (87). The seller seeks a hedging policy to minimize the deviation from the terminal payoff. A popular formulation is mean–variance hedging (also known as quadratic hedging), which considers the objective

mina𝔼[(XTaG^(S^T))2].\min_{a}\mathbb{E}\left[\left(X_{T}^{a}-\hat{G}(\hat{S}_{T})\right)^{2}\right]. (126)

Note that this expectation is taken under the real-world probability measure, rather than under any martingale measure for option pricing. An advantage of this formulation is that the relevant data are observable in the real world (but not necessarily in a risk-neutral world). Although objective (126) looks similar to that of the MV portfolio selection problem, the target the hedging portfolio tries to meet is now random instead of constant. Furthermore, the payoff is generally nonlinear in S^T\hat{S}_{T}. Thus, MV hedging is not an LQ problem. Also, both S^t\hat{S}_{t} and XtaX_{t}^{a} matter for the hedging decision at time tt, whereas S^t\hat{S}_{t} is irrelevant for the decision in the MV portfolio selection problem. This increase in the state dimension makes the hedging problem much more difficult to solve.

We consider the following exploratory formulation to encourage exploration for learning:

min𝝅𝔼t,x¯[(XT𝝅G^(S^T))2+θtTlog𝝅(as𝝅|s,Xs𝝅)𝑑s],\displaystyle\min_{\boldsymbol{\pi}}\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\left(X_{T}^{\boldsymbol{\pi}}-\hat{G}(\hat{S}_{T})\right)^{2}+\theta\int_{t}^{T}\log\boldsymbol{\pi}(a_{s}^{\boldsymbol{\pi}}|s,X_{s-}^{\boldsymbol{\pi}})ds\right], (127)

where the discounted value of the hedging portfolio under a stochastic policy 𝝅\boldsymbol{\pi} follows (94).

6.1 Solution of the exploratory control problem

We consider the HJB equation for problem (127):

Vt+inf𝝅𝒫(){H(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)+θlog𝝅(a|t,S,x)}𝝅(a|t,S,x)𝑑a=0,\frac{\partial V}{\partial t}+\inf_{\boldsymbol{\pi}\in\mathcal{P}(\mathbb{R})}\int_{\mathbb{R}}\left\{H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V)+\theta\log\boldsymbol{\pi}(a|t,S,x)\right\}\boldsymbol{\pi}(a|t,S,x)da=0, (128)

with terminal condition V(T,S,x)=(xG^(S))2V(T,S,x)=(x-\hat{G}(S))^{2}. The Hamiltonian of the problem is given by

H(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)\displaystyle H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V) (129)
=\displaystyle= ρσSVS(t,S,x)+12σ2S2VSS(t,S,x)+aρσVx(t,S,x)+12a2σ2Vxx(t,S,x)+aσ2SVSx(t,S,x)\displaystyle\rho\sigma SV_{S}(t,S,x)+\frac{1}{2}\sigma^{2}S^{2}V_{SS}(t,S,x)+a\rho\sigma V_{x}(t,S,x)+\frac{1}{2}a^{2}\sigma^{2}V_{xx}(t,S,x)+a\sigma^{2}SV_{Sx}(t,S,x) (130)
+\displaystyle+ (V(t,S+γ~(S,z),x+γ(a,z))V(t,S,x)γ~(S,z)VS(t,S,x)γ(a,z)Vx(t,S,x))ν(dz),\displaystyle\int_{\mathbb{R}}\left(V(t,S+\tilde{\gamma}(S,z),x+\gamma(a,z))-V(t,S,x)-\tilde{\gamma}(S,z)V_{S}(t,S,x)-\gamma(a,z)V_{x}(t,S,x)\right)\nu(dz), (131)

where γ(a,z)=a(exp(z)1)\gamma(a,z)=a(\exp(z)-1) and γ~(S,z)=S(exp(z)1)\tilde{\gamma}(S,z)=S(\exp(z)-1).

We make the following ansatz for the solution of the HJB equation (128):

V(t,S,x)=(xh(t,S))2f(t)+g(t,S).V(t,S,x)=(x-h(t,S))^{2}f(t)+g(t,S). (132)

With this, we can simplify the integral term in the Hamiltonian and obtain

H(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)\displaystyle H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V) (133)
=\displaystyle= ρσSVS(t,S,x)+12σ2S2VSS(t,S,x)+12a2(σ2+σJ2)Vxx(t,S,x)\displaystyle\rho\sigma SV_{S}(t,S,x)+\frac{1}{2}\sigma^{2}S^{2}V_{SS}(t,S,x)+\frac{1}{2}a^{2}(\sigma^{2}+\sigma_{J}^{2})V_{xx}(t,S,x) (134)
+a(ρσVx(t,S,x)+σ2SVSxVxx(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz))\displaystyle+a\left(\rho\sigma V_{x}(t,S,x)+\sigma^{2}SV_{Sx}-V_{xx}\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)\right) (135)
+(V(t,Sexp(z),x)V(t,S,x)S(exp(z)1)VS(t,S,x))ν(dz).\displaystyle+\int_{\mathbb{R}}\left(V(t,S\exp(z),x)-V(t,S,x)-S(\exp(z)-1)V_{S}(t,S,x)\right)\nu(dz). (136)

The probability density function that minimizes the integral in (128) is given by

𝝅c(|t,S,x)exp(1θH(t,S,x,a,VS,VSS,Vx,Vxx,VSx,V)),\boldsymbol{\pi}_{c}(\cdot|t,S,x)\propto\exp\left(-\frac{1}{\theta}H(t,S,x,a,V_{S},V_{SS},V_{x},V_{xx},V_{Sx},V)\right), (137)

which is a candidate for the optimal stochastic policy. From (136), we obtain that 𝝅c(|t,S,x)\boldsymbol{\pi}_{c}(\cdot|t,S,x) is given by

𝒩(|ρσVx+σ2SVSxVxx(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz)(σ2+σJ2)Vxx,θ(σ2+σJ2)Vxx).\mathcal{N}\left(\cdot\ \Big{|}-\frac{\rho\sigma V_{x}+\sigma^{2}SV_{Sx}-V_{xx}\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}},\frac{\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}\right). (138)

Substituting it back into the HJB equation (128), we obtain the nonlinear PIDE

Vt+ρσSVS+12σ2S2VSS+(V(t,Sexp(z),x)V(t,S,x)S(exp(z)1)VS(t,S,x))ν(dz)\displaystyle V_{t}+\rho\sigma SV_{S}+\frac{1}{2}\sigma^{2}S^{2}V_{SS}+\int_{\mathbb{R}}\left(V(t,S\exp(z),x)-V(t,S,x)-S(\exp(z)-1)V_{S}(t,S,x)\right)\nu(dz) (139)
(ρσVx+σ2SVSxVxx(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz))22(σ2+σJ2)Vxx\displaystyle-\frac{\left(\rho\sigma V_{x}+\sigma^{2}SV_{Sx}-V_{xx}\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)\right)^{2}}{2(\sigma^{2}+\sigma_{J}^{2})V_{xx}} (140)
θ2ln2πθ(σ2+σJ2)Vxx=0,(t,S,x)[0,T)×+×,V(T,S,x)=(xG^(S))2.\displaystyle-\frac{\theta}{2}\ln\frac{2\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})V_{xx}}=0,\quad(t,S,x)\in[0,T)\times\mathbb{R}_{+}\times\mathbb{R},\ V(T,S,x)=(x-\hat{G}(S))^{2}. (141)

We plug in the ansatz (132) to the above PIDE. After some lengthy calculations, we can collect similar terms and obtain the following equations satisfied by ff, hh and gg:

f(t)ρ2σ2σ2+σJ2f(t)=0,f(T)=1,\displaystyle f^{\prime}(t)-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}f(t)=0,\ f(T)=1, (142)
ht+(h(t,Sexp(z))h(t,S)S(exp(z)1)hS(t,S))[1ρσσ2+σJ2(exp(z)1)]ν(dz)\displaystyle h_{t}+\int_{\mathbb{R}}\left(h(t,S\exp(z))-h(t,S)-S(\exp(z)-1)h_{S}(t,S)\right)\left[1-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}(\exp(z)-1)\right]\nu(dz) (143)
+12σ2S2hSS=0,(t,S)[0,T)×+,h(T,S)=G^(S),\displaystyle+\frac{1}{2}\sigma^{2}S^{2}h_{SS}=0,\quad(t,S)\in[0,T)\times\mathbb{R}_{+},\ h(T,S)=\hat{G}(S), (144)

and

gt+ρσSgS+12σ2S2gSS+(g(t,Sexp(z))g(t,S)S(exp(z)1)gS(t,S))ν(dz)\displaystyle g_{t}+\rho\sigma Sg_{S}+\frac{1}{2}\sigma^{2}S^{2}g_{SS}+\int_{\mathbb{R}}\left(g(t,S\exp(z))-g(t,S)-S(\exp(z)-1)g_{S}(t,S)\right)\nu(dz) (145)
+σ2S2(hS)2f(t)+(h(t,Sexp(z))h(t,S))2f(t)ν(dz)\displaystyle+\sigma^{2}S^{2}(h_{S})^{2}f(t)+\int_{\mathbb{R}}\left(h(t,S\exp(z))-h(t,S)\right)^{2}f(t)\nu(dz) (146)
f(t)σ2+σJ2(σ2ShS+(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz))2\displaystyle-\frac{f(t)}{\sigma^{2}+\sigma_{J}^{2}}\left(\sigma^{2}Sh_{S}+\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)\right)^{2} (147)
θ2lnπθ(σ2+σJ2)f(t)=0,(t,S)[0,T)×+,g(T,S)=0.\displaystyle-\frac{\theta}{2}\ln\frac{\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})f(t)}=0,\quad(t,S)\in[0,T)\times\mathbb{R}_{+},\ g(T,S)=0. (148)

The function ff is given by

f(t)=exp(ρ2σ2σ2+σJ2(Tt)).f(t)=\exp\left(-\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right). (149)

However, the two PIDEs cannot be solved in closed form in general. Below we first present some properties of hh and gg.

Lemma 6.

Assume σ>0\sigma>0 and G^(S)\hat{G}(S) is Lipschitz continuous in SS. The PIDE (144) has a unique solution in C1,2([0,T)×+)C([0,T]×+)C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) that is Lipschitz continuous in SS. The PIDE (148) also has a unique solution in C1,2([0,T)×+)C([0,T]×+)C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}). Moreover, there exists a constant C>0C>0 such that

|h(t,S)|C(1+S),|g(t,S)|C(1+S2)for any(t,S)[0,T]×+.|h(t,S)|\leq C(1+S),\quad|g(t,S)|\leq C(1+S^{2})\ \ \text{for any}\ (t,S)\in[0,T]\times\mathbb{R}_{+}. (150)

Next, we provide stochastic representations for the solutions to PIDEs (144) and (148), which will be subsequently exploited for our RL study. Construct a new probability measure \mathbb{Q} from the given probability measure \mathbb{P} with the Radon–Nikodym density process Λs:=dd|s\Lambda_{s}:=\frac{d\mathbb{Q}}{d\mathbb{P}}|_{\mathcal{F}_{s}} defined by

dΛs=ρσσ2+σJ2Λs[σdWs+(exp(z)1)N~(ds,dz)],Λt=1.d\Lambda_{s}=-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}\Lambda_{s-}\left[\sigma dW_{s}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}(ds,dz)\right],\ \Lambda_{t}=1. (151)

Condition (89) guarantees that {Λs:tsT}\{\Lambda_{s}:t\leq s\leq T\} is a true martingale with unit expectation. The standard measure change results yield that, under \mathbb{Q},

Ws:=Ws+ρσ2σ2+σJ2(st),N~(ds,dz):=N(ds,dz)[1ρσσ2+σJ2(exp(z)1)]ν(dz)dsW^{\prime}_{s}:=W_{s}+\frac{\rho\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(s-t),\ \widetilde{N}^{\prime}(ds,dz):=N(ds,dz)-\left[1-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}(\exp(z)-1)\right]\nu(dz)ds (152)

are standard Brownian motion and compensated Poisson random measure, respectively. Using them we can rewrite the SDE for S^\hat{S} as

dS^s=S^s[σdWs+(exp(z)1)N~(ds,dz)].d\hat{S}_{s}=\hat{S}_{s-}\left[\sigma dW^{\prime}_{s}+\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}^{\prime}(ds,dz)\right]. (153)

Let Yt:=1+σWt+0t(exp(z)1)N~(ds,dz)Y_{t}:=1+\sigma W^{\prime}_{t}+\int_{0}^{t}\int_{\mathbb{R}}(\exp(z)-1)\widetilde{N}^{\prime}(ds,dz). Clearly, it is a Lévy process under \mathbb{Q} and, given condition (89), also a \mathbb{Q}-martingale. Process S^\hat{S} is the stochastic exponential of YY and it follows from (Cont and Tankov 2004, Proposition 8.23) that S^\hat{S} is also a martingale under \mathbb{Q}. The Feynman–Kac theorem gives the representation of the solution hh to PIDE (144) as

h(t,S)=𝔼[G^(S^T)|S^t=S],t[0,T].h(t,S)=\mathbb{E}^{\mathbb{Q}}\left[\hat{G}(\hat{S}_{T})\big{|}\hat{S}_{t}=S\right],\quad t\in[0,T]. (154)

We can view \mathbb{Q} as the martingale measure for option valuation and interpret h(t,S) as the option price at time t when the underlying price is S. For PIDE (148), again by the Feynman–Kac theorem, its solution g is given by

g(t,S)\displaystyle g(t,S) =ge(t,S)tTθ2lnπθ(σ2+σJ2)f(s)ds\displaystyle=g_{e}(t,S)-\int_{t}^{T}\frac{\theta}{2}\ln\frac{\pi\theta}{(\sigma^{2}+\sigma_{J}^{2})f(s)}ds (155)
=ge(t,S)θ2[ln(πθσ2+σJ2)(Tt)+ρ2σ22(σ2+σJ2)(Tt)2],\displaystyle=g_{e}(t,S)-\frac{\theta}{2}\left[\ln\left(\frac{\pi\theta}{\sigma^{2}+\sigma_{J}^{2}}\right)(T-t)+\frac{\rho^{2}\sigma^{2}}{2(\sigma^{2}+\sigma_{J}^{2})}(T-t)^{2}\right], (156)

where

ge(t,S):=𝔼[tTf(s)(σ2S^s2(hS(s,S^s))2+(h(s,S^sexp(z))h(s,S^s))2ν(dz))ds\displaystyle g_{e}(t,S):=\mathbb{E}^{\mathbb{P}}\left[\int_{t}^{T}f(s)\left(\sigma^{2}\hat{S}_{s}^{2}(h_{S}(s,\hat{S}_{s}))^{2}+\int_{\mathbb{R}}\left(h(s,\hat{S}_{s}\exp(z))-h(s,\hat{S}_{s})\right)^{2}\nu(dz)\right)ds\right. (157)
tTf(s)σ2+σJ2(σ2S^shS(s,S^s)+(exp(z)1)(h(s,S^sexp(z))h(s,S^s))ν(dz))2ds|S^t=S].\displaystyle\left.-\int_{t}^{T}\frac{f(s)}{\sigma^{2}+\sigma_{J}^{2}}\left(\sigma^{2}\hat{S}_{s}h_{S}(s,\hat{S}_{s})+\int_{\mathbb{R}}(\exp(z)-1)\left(h(s,\hat{S}_{s}\exp(z))-h(s,\hat{S}_{s})\right)\nu(dz)\right)^{2}ds\big{|}\hat{S}_{t}=S\right]. (158)

Using the expression for V(t,S,x)V(t,S,x), we obtain 𝝅c(|t,S,x)\boldsymbol{\pi}_{c}(\cdot|t,S,x) as

𝝅c(|t,S,x)𝒩(|(t,S,x),θ2(σ2+σJ2)exp(ρ2σ2σ2+σJ2(Tt))),\boldsymbol{\pi}_{c}(\cdot|t,S,x)\sim\mathcal{N}\left(\cdot\ \Big{|}\ \mathcal{M}^{*}(t,S,x),\frac{\theta}{2(\sigma^{2}+\sigma_{J}^{2})}\exp\left(\frac{\rho^{2}\sigma^{2}}{\sigma^{2}+\sigma_{J}^{2}}(T-t)\right)\right), (159)

where

(t,S,x)=σ2ShS(t,S)+(exp(z)1)(h(t,Sexp(z))h(t,S))ν(dz)ρσ(xh(t,S))σ2+σJ2.\mathcal{M}^{*}(t,S,x)=\frac{\sigma^{2}Sh_{S}(t,S)+\int_{\mathbb{R}}(\exp(z)-1)\left(h(t,S\exp(z))-h(t,S)\right)\nu(dz)-\rho\sigma(x-h(t,S))}{\sigma^{2}+\sigma_{J}^{2}}. (160)

Using Lemma 6, we can directly verify that \boldsymbol{\pi}_{c} is admissible. It is also easy to see that V(t,S,x) given by (132) has quadratic growth in S and x. Therefore, by Lemma 5, we have the following conclusion.

Proposition 3.

Assume that \sigma>0 and \hat{G}(S) is Lipschitz continuous in S. Then, for the MV hedging problem (126), the optimal value function is J^{*}(t,S,x)=V(t,S,x) and the optimal stochastic policy is \boldsymbol{\pi}^{*}=\boldsymbol{\pi}_{c}.

The mean part of 𝝅\boldsymbol{\pi}^{*}, (t,S,x)\mathcal{M}^{*}(t,S,x), comprises three terms. The quantity hS(t,S)h_{S}(t,S) is the delta of the option and hence ShS(t,S)Sh_{S}(t,S) gives the dollar amount of the risky asset as required by delta hedging, which is used to hedge continuous price movements. The integral term shows the dollar amount of the risky asset that should be held to hedge discontinuous changes. The two hedges are combined using weights σ2/(σ2+σJ2)\sigma^{2}/(\sigma^{2}+\sigma_{J}^{2}) and 1/(σ2+σJ2)1/(\sigma^{2}+\sigma_{J}^{2}). The last term reflects the adjustment that needs to be made due to the discrepancy between the hedging portfolio value and option price. It is easy to show that (t,S,x)\mathcal{M}^{*}(t,S,x) is the optimal deterministic policy of problem (126).

The variance of \boldsymbol{\pi}^{*} is the same as that in the MV portfolio selection problem (cf. (108)); it decreases as t approaches the terminal time T, implying that exploration is gradually reduced over time.

6.2 Parametrizations and actor–critic learning

We use the previously derived solution of the exploratory problem as knowledge for RL. As we will see, the exploratory solution lends itself to natural parametrizations for the policy (“actor”) and value function (“critic”), but less so for parametrizing the qq-function.

To simplify the integral in (t,S,x)\mathcal{M}^{*}(t,S,x) defined in (160), we first parametrize the Lévy density ν(z)\nu(z) as in (111) where we have three parameters: λ\lambda is the jump arrival rate, and mm and δ\delta are the mean and standard deviation of the normal distribution for the jump size. We then use the Fourier-cosine method developed in Fang and Oosterlee (2009) to obtain an expression for h(t,S)h(t,S). Let Yt:=log(S^t/S^0)Y_{t}:=\log(\hat{S}_{t}/\hat{S}_{0}). Applying Itô’s formula, under \mathbb{P} we obtain

dlogS^t=[ρσ12σ2λ(exp(m+12δ2)1)]dt+σdWt+zN(dt,dz).d\log\hat{S}_{t}=\left[\rho\sigma-\frac{1}{2}\sigma^{2}-\lambda\left(\exp\left(m+\frac{1}{2}\delta^{2}\right)-1\right)\right]dt+\sigma dW_{t}+\int_{\mathbb{R}}zN(dt,dz). (161)

Noting (152), under \mathbb{Q} we have

dlogS^t=[ρσσJ2σ2+σJ212σ2λ(exp(m+12δ2)1)]dt+σdWt+zN(dt,dz),d\log\hat{S}_{t}=\left[\rho\sigma\frac{\sigma_{J}^{2}}{\sigma^{2}+\sigma_{J}^{2}}-\frac{1}{2}\sigma^{2}-\lambda\left(\exp\left(m+\frac{1}{2}\delta^{2}\right)-1\right)\right]dt+\sigma dW^{\prime}_{t}+\int_{\mathbb{R}}zN(dt,dz), (162)

where the Lévy measure of N(dt,dz)N(dt,dz) becomes

ν(dz)=[1ρσσ2+σJ2(exp(z)1)]λ2πδ2exp((zm)22δ2)dz.\nu^{\prime}(dz)=\left[1-\frac{\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}(\exp(z)-1)\right]\frac{\lambda}{\sqrt{2\pi\delta^{2}}}\exp\left(-\frac{(z-m)^{2}}{2\delta^{2}}\right)dz. (163)

The characteristic function of YtY_{t} under \mathbb{Q} is given by

ΦYt(u)=exp[iut(ρσσJ2σ2+σJ212σ2λ(exp(m+12δ2)1))12σ2u2t+tA(u)],\Phi_{Y_{t}}^{\mathbb{Q}}(u)=\exp\left[iut\left(\rho\sigma\frac{\sigma_{J}^{2}}{\sigma^{2}+\sigma_{J}^{2}}-\frac{1}{2}\sigma^{2}-\lambda\left(\exp\left(m+\frac{1}{2}\delta^{2}\right)-1\right)\right)-\frac{1}{2}\sigma^{2}u^{2}t+tA(u)\right], (164)

where

A(u)=\displaystyle A(u)= (exp(iuz)1)ν(dz)\displaystyle\int_{\mathbb{R}}\left(\exp(iuz)-1\right)\nu^{\prime}(dz) (165)
=\displaystyle= λ(exp(ium12δ2u2)1)λρσσ2+σJ2×\displaystyle\lambda\left(\exp\left(ium-\frac{1}{2}\delta^{2}u^{2}\right)-1\right)-\frac{\lambda\rho\sigma}{\sigma^{2}+\sigma_{J}^{2}}\times (166)
(exp(m+12δ2+iu(m+δ2)12δ2u2)exp(ium12δ2u2)exp(m+12δ2)+1).\displaystyle\left(\exp\left(m+\frac{1}{2}\delta^{2}+iu(m+\delta^{2})-\frac{1}{2}\delta^{2}u^{2}\right)-\exp\left(ium-\frac{1}{2}\delta^{2}u^{2}\right)-\exp\left(m+\frac{1}{2}\delta^{2}\right)+1\right). (167)

The Fourier-cosine method then calculates h(t,S)h(t,S) as

h^(t,S):=k=0N1(ΦYt(kπba)exp(ikπln(S/𝒦^)aba))Vk.\hat{h}(t,S):=\sideset{}{{}^{\prime}}{\sum}_{k=0}^{N-1}\Re\left(\Phi_{Y_{t}}^{\mathbb{Q}}\left(\frac{k\pi}{b-a}\right)\exp\left(ik\pi\frac{\ln(S/\hat{\mathcal{K}})-a}{b-a}\right)\right)V_{k}. (168)

Here, \Re(x) denotes the real part of a complex number x, \Sigma^{\prime} indicates that the first term in the sum is halved, \hat{\mathcal{K}} is the discounted strike price of the call or put option under consideration, and [a,b] is the truncation interval for Y_{T}, chosen wide enough that the probability under \mathbb{Q} of Y_{T} falling outside it can be neglected. The expression for V_{k} is given by Eqs. (24) and (25) in Fang and Oosterlee (2009) for call and put options, respectively, and it depends only on a, b, and \hat{\mathcal{K}}. We can also calculate h_{S}(t,S) by differentiating \hat{h}(t,S) w.r.t. S, which yields

h^S(t,S):=k=0N1(ΦYt(kπba)ikπ(ba)Sexp(ikπln(S/𝒦^)aba))Vk.\hat{h}_{S}(t,S):=\sideset{}{{}^{\prime}}{\sum}_{k=0}^{N-1}\Re\left(\Phi_{Y_{t}}^{\mathbb{Q}}\left(\frac{k\pi}{b-a}\right)\frac{ik\pi}{(b-a)S}\exp\left(ik\pi\frac{\ln(S/\hat{\mathcal{K}})-a}{b-a}\right)\right)V_{k}. (169)
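
To make the above concrete, the following is a minimal Python sketch of (164)–(169) for a put option. It assumes a zero risk-free rate (so the discounted strike equals the strike), that the characteristic function is evaluated at the time to maturity T-t, and that the put coefficients V_k are the standard cosine coefficients built from \chi_k and \psi_k in Fang and Oosterlee (2009). The function names, the truncation interval [a,b]=[-1,1], the number of terms N=256, and the parameter values are our own illustrative choices, not the paper's.

```python
import numpy as np

def cf_Y_Q(u, tau, rho, sigma, sigma_J, lam, m, delta):
    # Characteristic function of the log-price over a horizon tau under Q,
    # following (164)-(167); tau is taken to be the time to maturity T - t.
    drift = (rho * sigma * sigma_J**2 / (sigma**2 + sigma_J**2)
             - 0.5 * sigma**2 - lam * (np.exp(m + 0.5 * delta**2) - 1.0))
    c = rho * sigma / (sigma**2 + sigma_J**2)
    A = (lam * (np.exp(1j*u*m - 0.5*delta**2*u**2) - 1.0)
         - lam * c * (np.exp(m + 0.5*delta**2 + 1j*u*(m + delta**2) - 0.5*delta**2*u**2)
                      - np.exp(1j*u*m - 0.5*delta**2*u**2)
                      - np.exp(m + 0.5*delta**2) + 1.0))
    return np.exp(1j*u*tau*drift - 0.5*sigma**2*u**2*tau + tau*A)

def put_coeffs(a, b, K_hat, N):
    # Cosine coefficients V_k of the put payoff on [a, b], built from
    # chi_k(a, 0) and psi_k(a, 0) as in Fang and Oosterlee (2009).
    k = np.arange(N)
    w = k * np.pi / (b - a)
    chi = (np.cos(-w * a) - np.exp(a) + w * np.sin(-w * a)) / (1.0 + w**2)
    psi = np.empty(N)
    psi[0] = -a
    psi[1:] = np.sin(-w[1:] * a) / w[1:]
    return 2.0 / (b - a) * K_hat * (psi - chi)

def cos_put_price_delta(t, S, T, K_hat, params, a=-1.0, b=1.0, N=256):
    # COS approximations (168)-(169) of h(t, S) and h_S(t, S) for a put.
    k = np.arange(N)
    u = k * np.pi / (b - a)
    cf = cf_Y_Q(u, T - t, *params)
    phase = np.exp(1j * u * (np.log(S / K_hat) - a))
    Vk = put_coeffs(a, b, K_hat, N)
    terms = np.real(cf * phase) * Vk
    terms_S = np.real(cf * 1j * u / S * phase) * Vk
    terms[0] *= 0.5
    terms_S[0] *= 0.5          # the first term of the cosine sum is halved
    return terms.sum(), terms_S.sum()

# illustrative parameters (rho, sigma, sigma_J, lam, m, delta), rho = (mu - r_f)/sigma
params = (0.46, 0.13, 0.16, 28.0, -0.004, 0.03)
price, delta_hedge = cos_put_price_delta(t=0.0, S=100.0, T=1/3, K_hat=100.0, params=params)
```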

Actor parametrization. We follow the form of the optimal stochastic policy given in (159) to parametrize our policy. To obtain a parsimonious parametrization, we introduce \phi_{1},\phi_{2},\phi_{3},\phi_{4},\phi_{5}, which respectively represent the agent's current understanding, to be updated as she learns, of the drift \mu, the volatility \sigma, the jump arrival rate \lambda, and the mean m and standard deviation \delta of the jump size of the risky asset. We emphasize that in our learning algorithm these parameters are not estimated by any statistical method (and hence they do not necessarily correspond to the true dynamics of the risky asset); rather, they are updated based on the hedging experiences (via exploration and exploitation) gained from the real environment. To implement (159), we also need two dependent parameters \phi_{6} and \phi_{7}, calculated by

ϕ6=ϕ1rfϕ2,ϕ7=ϕ3[exp(2ϕ4+2ϕ52)2exp(ϕ4+12ϕ52)+1].\phi_{6}=\frac{\phi_{1}-r_{f}}{\phi_{2}},\ \phi_{7}=\sqrt{\phi_{3}\left[\exp\left(2\phi_{4}+2\phi_{5}^{2}\right)-2\exp\left(\phi_{4}+\frac{1}{2}\phi_{5}^{2}\right)+1\right]}. (170)

These two parameters show the agent’s understanding of ρ\rho and σJ\sigma_{J}, and they will be updated in the learning process by the above equations with the update of the independent parameters ϕ1\phi_{1} to ϕ5\phi_{5}. Denote ϕ=(ϕi)i=17\phi=(\phi_{i})_{i=1}^{7}.

We use ϕ1,ϕ2,ϕ3,ϕ4,ϕ5,ϕ6,ϕ7\phi_{1},\phi_{2},\phi_{3},\phi_{4},\phi_{5},\phi_{6},\phi_{7} to replace their counterparts μ,σ,λ,m,δ,ρ,σJ\mu,\sigma,\lambda,m,\delta,\rho,\sigma_{J} in (159) to obtain a parametrized policy. For the mean part, we calculate the option price and its delta by plugging ϕ\phi into (168) and (169), with the results denoted by h^ϕ(t,S)\hat{h}^{\phi}(t,S) and h^Sϕ(t,S)\hat{h}_{S}^{\phi}(t,S) respectively, and approximate the integral by Gauss-Hermite (G-H) quadrature, which yields

ϕ(t,S,x)=ϕ22Sh^Sϕ(t,S)+ϕ32πq=1Qwq(exp(zq)1)(h^ϕ(t,Sexp(zq))h^ϕ(t,S))ϕ2ϕ6(xh^ϕ(t,S))ϕ22+ϕ72,\mathcal{M}^{\phi}(t,S,x)=\frac{\phi_{2}^{2}S\hat{h}_{S}^{\phi}(t,S)+\frac{\phi_{3}}{\sqrt{2\pi}}\sum\limits_{q=1}^{Q}w_{q}(\exp(z_{q})-1)\left(\hat{h}^{\phi}(t,S\exp(z_{q}))-\hat{h}^{\phi}(t,S)\right)-\phi_{2}\phi_{6}(x-\hat{h}^{\phi}(t,S))}{\phi_{2}^{2}+\phi_{7}^{2}}, (171)

where z_{q}=\phi_{4}+\phi_{5}u_{q} and \{(w_{q},u_{q}),q=1,\cdots,Q\} are the weights and abscissas of the Q-point G-H rule, which are predetermined and independent of \phi. For the variance part, we plug \phi into (149) to obtain f^{\phi}(t). Thus, our parametrized policy is given by

𝝅ϕ(|t,S,x)𝒩(|ϕ(t,S,x),θ2(ϕ22+ϕ72)fϕ(t)).\boldsymbol{\pi}^{\phi}(\cdot|t,S,x)\sim\mathcal{N}\left(\cdot\ \Big{|}\ \mathcal{M}^{\phi}(t,S,x),\frac{\theta}{2(\phi_{2}^{2}+\phi_{7}^{2})f^{\phi}(t)}\right). (172)
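
The sketch below illustrates how (170)–(172) might be implemented, with h(t,S) and h_S(t,S) supplied as callables (for instance, wrappers around the COS approximations above) and the probabilists' Gauss–Hermite rule from NumPy, whose nodes and weights satisfy (1/\sqrt{2\pi})\sum_q w_q f(\phi_4+\phi_5 u_q)\approx\mathbb{E}f(Z) for Z normal with mean \phi_4 and standard deviation \phi_5. Function names and the defaults Q=20, \theta=0.1, and r_f=0 are illustrative assumptions.

```python
import numpy as np

def dependent_params(phi, r_f=0.0):
    # phi = (phi1,...,phi5): the agent's mu, sigma, lambda, m, delta; returns (phi6, phi7) as in (170).
    mu, sig, lam, m, d = phi
    phi6 = (mu - r_f) / sig
    phi7 = np.sqrt(lam * (np.exp(2*m + 2*d**2) - 2*np.exp(m + 0.5*d**2) + 1.0))
    return phi6, phi7

def policy_mean(t, S, x, phi, h, h_S, Q=20):
    # Gauss-Hermite approximation (171) of M^phi(t, S, x); h(t, S) and h_S(t, S)
    # are callables, e.g. wrappers around the COS approximations sketched above.
    mu, sig, lam, m, d = phi
    phi6, phi7 = dependent_params(phi)
    u, w = np.polynomial.hermite_e.hermegauss(Q)        # rule for weight exp(-u^2/2)
    z = m + d * u
    hz = np.array([h(t, S * np.exp(zq)) for zq in z])
    jump = lam / np.sqrt(2.0 * np.pi) * np.sum(w * (np.exp(z) - 1.0) * (hz - h(t, S)))
    num = sig**2 * S * h_S(t, S) + jump - sig * phi6 * (x - h(t, S))
    return num / (sig**2 + phi7**2)

def sample_action(t, S, x, phi, h, h_S, T, theta=0.1, rng=None):
    # Draw one action from the Gaussian policy (172).
    rng = rng or np.random.default_rng(0)
    mu, sig, lam, m, d = phi
    phi6, phi7 = dependent_params(phi)
    f = np.exp(-phi6**2 * sig**2 / (sig**2 + phi7**2) * (T - t))   # f^phi(t), cf. (149)
    var = theta / (2.0 * (sig**2 + phi7**2) * f)
    return rng.normal(policy_mean(t, S, x, phi, h, h_S), np.sqrt(var))
```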

Critic parametrization. For the value function of 𝝅ϕ\boldsymbol{\pi}^{\phi}, following (132) we parametrize it as

Jψ,ϕ(t,S,x)=(xh^ϕ(t,S))2fϕ(t)+geψ(t,S)θ2[ln(πθϕ22+ϕ72)(Tt)+ϕ22ϕ622(ϕ22+ϕ72)(Tt)2].J^{\psi,\phi}(t,S,x)=(x-\hat{h}^{\phi}(t,S))^{2}f^{\phi}(t)+g_{e}^{\psi}(t,S)-\frac{\theta}{2}\left[\ln\left(\frac{\pi\theta}{\phi_{2}^{2}+\phi_{7}^{2}}\right)(T-t)+\frac{\phi_{2}^{2}\phi_{6}^{2}}{2(\phi_{2}^{2}+\phi_{7}^{2})}(T-t)^{2}\right]. (173)

We do not specify any parametric form for g_{e}^{\psi}(t,S), but will learn it using Gaussian process regression (Williams and Rasmussen 2006), where \psi denotes its parameters. We choose a Gaussian process for our problem because it can generate flexible shapes and often achieves a reasonably good fit with a moderate amount of data.
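
As a sketch, the critic (173) can be assembled from the agent's option-price function h^\phi (e.g. the COS approximation above) and any fitted regressor for g_e^\psi, such as the Gaussian process just mentioned. The helper below is a minimal illustration with hypothetical function arguments, again assuming r_f = 0.

```python
import numpy as np

def critic_value(t, S, x, phi, h, g_e, T, theta=0.1, r_f=0.0):
    # Assemble J^{psi,phi}(t, S, x) in (173) from callables h(t, S) and g_e(t, S).
    mu, sig, lam, m, d = phi
    phi6 = (mu - r_f) / sig
    phi7 = np.sqrt(lam * (np.exp(2*m + 2*d**2) - 2*np.exp(m + 0.5*d**2) + 1.0))
    f = np.exp(-phi6**2 * sig**2 / (sig**2 + phi7**2) * (T - t))       # f^phi(t)
    entropy_term = 0.5 * theta * (
        np.log(np.pi * theta / (sig**2 + phi7**2)) * (T - t)
        + phi6**2 * sig**2 / (2.0 * (sig**2 + phi7**2)) * (T - t)**2)
    return (x - h(t, S))**2 * f + g_e(t, S) - entropy_term
```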

Actor–critic learning. Consider a time grid {tk}k=0K\{t_{k}\}_{k=0}^{K} with t0=0t_{0}=0, tK=Tt_{K}=T, and time step Δt\Delta t. In an iteration, we start with a given policy 𝝅ϕ\boldsymbol{\pi}^{\phi} and use it to collect MM episodes from the environment, where the mm-th episode is given by {(tk,S^tk(m),Xtk𝝅ϕ,(m),atk𝝅ϕ,(m))}k=0K\{(t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)},a_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})\}_{k=0}^{K}. We then perform two steps.

In the critic step, the main task is to learn geψ(tk,S)g_{e}^{\psi}(t_{k},S) as a function of SS for each tkt_{k}. The stochastic representation (158) shows that geg_{e} can be viewed as the value function of a stream of rewards that depend on time and the risky asset price; so the task is a policy evaluation problem. We first obtain the running reward path in every episode, where the integral involving the Lévy measure is again calculated by the QQ-point G-H rule. Specifically, the reward at time tkt_{k} for k=0,,K1k=0,\cdots,K-1 is given by

Rtkg:=fϕ(tk)[ϕ22S^tk2(h^Sϕ(tk,S^tk))2+ϕ32πq=1Qwq(h^ϕ(tk,S^tkexp(zq))h^ϕ(tk,S^tk))2]Δt\displaystyle R_{t_{k}}^{g}:=f^{\phi}(t_{k})\left[\phi_{2}^{2}\hat{S}_{t_{k}}^{2}(\hat{h}^{\phi}_{S}(t_{k},\hat{S}_{t_{k}}))^{2}+\frac{\phi_{3}}{\sqrt{2\pi}}\sum_{q=1}^{Q}w_{q}\left(\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}}\exp(z_{q}))-\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}})\right)^{2}\right]\Delta t (174)
fϕ(tk)ϕ22+ϕ72[ϕ22S^tkh^Sϕ(tk,S^tk)+ϕ32πq=1Qwq(exp(zq)1)(h^ϕ(tk,S^tkexp(zq))h^ϕ(tk,S^tk))]2Δt,\displaystyle-\frac{f^{\phi}(t_{k})}{\phi_{2}^{2}+\phi_{7}^{2}}\left[\phi_{2}^{2}\hat{S}_{t_{k}}\hat{h}^{\phi}_{S}(t_{k},\hat{S}_{t_{k}})+\frac{\phi_{3}}{\sqrt{2\pi}}\sum_{q=1}^{Q}w_{q}(\exp(z_{q})-1)\left(\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}}\exp(z_{q}))-\hat{h}^{\phi}(t_{k},\hat{S}_{t_{k}})\right)\right]^{2}\Delta t, (175)

where the risky asset price path {(S^tk)}k=0K1\{(\hat{S}_{t_{k}})\}_{k=0}^{K-1} is collected from the real environment. We stress again that the above calculation uses the agent’s parameters ϕ\phi and does not involve any parameters of the true dynamics. We then fit a Gaussian process to the sample of the cumulative rewards {=kK1Rtg,(m)}m=1M\{\sum_{\ell=k}^{K-1}R_{t_{\ell}}^{g,(m)}\}_{m=1}^{M} to estimate geψ(tk,S)g_{e}^{\psi}(t_{k},S). Figure 2 illustrates the fit at three time points with the radial basis kernel. The Gaussian process is able to provide a good fit to the shape of ge(t,S)g_{e}(t,S) as a function of SS for each given tt.
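
A minimal sketch of the critic step is given below: the first function evaluates the running reward (174)–(175) (before multiplication by \Delta t) using the Q-point Gauss–Hermite rule, and the second fits a Gaussian process with an RBF kernel to the episode-wise cumulative rewards at a fixed grid time. The kernel hyperparameters and the scikit-learn interface are our illustrative choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def reward_ge(t, S, phi, h, h_S, T, Q=20, r_f=0.0):
    # Running reward in (174)-(175) before multiplication by Delta t, with the
    # Levy integrals replaced by the Q-point Gauss-Hermite rule and all model
    # quantities evaluated at the agent's parameters phi.
    mu, sig, lam, m, d = phi
    phi6 = (mu - r_f) / sig
    phi7 = np.sqrt(lam * (np.exp(2*m + 2*d**2) - 2*np.exp(m + 0.5*d**2) + 1.0))
    f = np.exp(-phi6**2 * sig**2 / (sig**2 + phi7**2) * (T - t))
    u, w = np.polynomial.hermite_e.hermegauss(Q)
    z = m + d * u
    dh = np.array([h(t, S * np.exp(zq)) for zq in z]) - h(t, S)
    quad_sq = lam / np.sqrt(2.0 * np.pi) * np.sum(w * dh**2)
    quad_lin = lam / np.sqrt(2.0 * np.pi) * np.sum(w * (np.exp(z) - 1.0) * dh)
    first = sig**2 * S**2 * h_S(t, S)**2 + quad_sq
    second = (sig**2 * S * h_S(t, S) + quad_lin)**2 / (sig**2 + phi7**2)
    return f * (first - second)

def fit_ge_gp(S_at_tk, cum_rewards):
    # Critic step at one grid time t_k: regress the episode-wise cumulative rewards
    # sum_{l >= k} R^g_{t_l} on S_{t_k} with a Gaussian process (RBF kernel).
    kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1e-2)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    return gp.fit(np.asarray(S_at_tk).reshape(-1, 1), np.asarray(cum_rewards))
```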

In the actor step, we update the policy parameter ϕi\phi_{i} for i=1,,5i=1,\cdots,5 by

ϕi\displaystyle\phi_{i}\leftarrow ϕiαϕiMm=1Mk=0K1[θlog𝝅ϕ(atk𝝅ϕ,(m)|tk,S^tk(m),Xtk𝝅ϕ,(m))Δt\displaystyle~{}\phi_{i}-\frac{\alpha_{\phi_{i}}}{M}\sum_{m=1}^{M}\sum_{k=0}^{K-1}\left.\Big{[}\theta\log\boldsymbol{\pi}^{\phi}(a_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)}|~{}t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})\Delta t\right. (176)
\displaystyle\left.+J^{\psi,\phi}(t_{k+1},\hat{S}_{t_{k+1}}^{(m)},X_{t_{k+1}}^{\boldsymbol{\pi}^{\phi},(m)})-J^{\psi,\phi}(t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})-\beta J^{\psi,\phi}(t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)})\Delta t\Big{]}\right. (177)
ϕilog𝝅ϕ(atk𝝅ϕ,(m)|tk,S^tk(m),Xtk𝝅ϕ,(m)).\displaystyle\ \cdot\frac{\partial}{\partial\phi_{i}}\log\boldsymbol{\pi}^{\phi}(a_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)}|~{}t_{k},\hat{S}_{t_{k}}^{(m)},X_{t_{k}}^{\boldsymbol{\pi}^{\phi},(m)}). (178)

This update essentially follows from (85); the difference is that we do batch processing and update the parameters only after observing all the episodes in a batch. Furthermore, we flip the signs in front of \alpha_{\phi_{i}} and \theta because we consider a minimization problem here, whereas the problem in the general discussion is a maximization.
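
A sketch of this batch update is given below. The paper does not specify how \partial/\partial\phi_{i}\log\boldsymbol{\pi}^{\phi} is computed, so the sketch approximates the score by central finite differences through a user-supplied log_pi(a, t, S, x, phi) that re-evaluates the policy (172), including h^\phi and f^\phi, at the perturbed \phi; the critic J(t, S, x, phi) is also supplied as a callable. The default learning rates are those quoted in the simulation study, and \beta=0 is used as an illustrative default; both are assumptions of this sketch.

```python
import numpy as np

def actor_step(phi, episodes, J, log_pi, theta=0.1, beta=0.0,
               lr=(5e-3, 2e-5, 15.0, 1e-6, 1e-6), eps=1e-6):
    # One batch update of (phi_1,...,phi_5) following (176)-(178).
    #   episodes : list of trajectories [(t_k, S_k, X_k, a_k), ...] on the time grid;
    #   J(t, S, x, phi)        : the critic J^{psi,phi};
    #   log_pi(a, t, S, x, phi): log-density of the policy (172) at the given phi.
    phi = np.asarray(phi, dtype=float)
    grad = np.zeros(phi.size)
    M = len(episodes)
    for ep in episodes:
        for (t, S, x, a), (t1, S1, x1, _) in zip(ep[:-1], ep[1:]):
            dt = t1 - t
            td = (theta * log_pi(a, t, S, x, phi) * dt
                  + J(t1, S1, x1, phi) - J(t, S, x, phi)
                  - beta * J(t, S, x, phi) * dt)
            for i in range(phi.size):
                e = np.zeros(phi.size); e[i] = eps
                score_i = (log_pi(a, t, S, x, phi + e)
                           - log_pi(a, t, S, x, phi - e)) / (2.0 * eps)
                grad[i] += td * score_i / M
    # minimization problem: move against the gradient (the sign flip noted above)
    return phi - np.asarray(lr) * grad
```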

Figure 2: Fit of Gaussian process to the cumulative reward data for geg_{e} at tkt_{k} (k=0,10,20k=0,10,20) for a one-month put option with strike 𝒦=100\mathcal{K}=100. The variable SS is on the xx-axis. We use the MLE estimates of the MJD model shown in Table 2 to calculate the rewards for geg_{e} from the sampled index price paths.

6.3 Simulation study

We check the convergence behavior of our actor–critic algorithm in a simulation study. We assume the environment is described by an MJD model with its true parameters shown in the upper row of Table 2 (these values are chosen close to the estimates obtained from the market data of the S&P 500 in Table 1, so that the MJD model better resembles reality in the simulation study). We consider hedging a 4-month put option with strike price \mathcal{K}=100, because short-term options with maturities of a few months are typically traded much more actively than longer-term options. We assume the risk-free rate is zero.
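
For reference, a simulator of this MJD environment might look as follows. We assume the standard Merton specification in which the log-return over each step is Gaussian plus a Poisson number of N(m,\delta^2) jumps, with the drift compensated so that the expected price grows at rate \mu. The grid of 84 daily steps for a 4-month option and the batch size M=32 follow the setup described here; the function itself is an illustrative sketch.

```python
import numpy as np

def simulate_mjd_paths(S0, mu, sigma, lam, m, delta, T, n_steps, n_paths, rng=None):
    # Sample price paths of a Merton jump-diffusion by exact simulation of the
    # log-return over each step: a Gaussian diffusion part plus a Poisson number
    # of N(m, delta^2) jumps, with the jump compensator subtracted from the drift.
    rng = rng or np.random.default_rng(0)
    dt = T / n_steps
    kappa = np.exp(m + 0.5 * delta**2) - 1.0          # mean relative jump size
    drift = (mu - 0.5 * sigma**2 - lam * kappa) * dt
    logS = np.full((n_paths, n_steps + 1), np.log(S0))
    for k in range(n_steps):
        n_jumps = rng.poisson(lam * dt, size=n_paths)
        jumps = rng.normal(m * n_jumps, delta * np.sqrt(n_jumps))   # sum of jump sizes
        diff = sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
        logS[:, k + 1] = logS[:, k] + drift + diff + jumps
    return np.exp(logS)

# true parameters of Table 2; a 4-month horizon on a daily grid, M = 32 episodes
paths = simulate_mjd_paths(S0=100.0, mu=0.06, sigma=0.13, lam=28.0,
                           m=-0.004, delta=0.03, T=4/12, n_steps=84, n_paths=32)
```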

We use 20 quadrature points for the G-H rule in our implementation, which suffices for high accuracy. In each iteration, we sample M=32 episodes from the environment simulated by the MJD model. The learning rates are chosen to be \alpha_{\phi_{1}}=5\mathrm{e}{-3}, \alpha_{\phi_{2}}=2\mathrm{e}{-5}, \alpha_{\phi_{3}}=15, \alpha_{\phi_{4}}=1\mathrm{e}{-6}, \alpha_{\phi_{5}}=1\mathrm{e}{-6}, and the temperature parameter is \theta=0.1. We keep the learning rates constant in the first 30 iterations and then let them decay at rate l(j)=j^{-0.5}, where j is the iteration index. Before our actor–critic learning, we sample a 15-year path from the MJD model as data for MLE and obtain the estimates of the parameters shown in Table 2. MLE estimates all the parameters quite well except the drift \mu, whose estimate is more than twice the true value (this is expected due to the well-known mean-blur problem: 15 years of data are far from sufficient to obtain a reasonably accurate estimate of the mean by statistical methods). We then use these estimates as initial values of \phi_{1} to \phi_{5} for learning. Figure 3 clearly shows that our RL algorithm converges to near the true values. In particular, the estimate of \mu eventually moves closely around the true level despite being distant from it initially.

Parameters
True μ=0.06,σ=0.13,λ=28,m=0.004,δ=0.03\mu=0.06,\sigma=0.13,\lambda=28,m=-0.004,\delta=0.03
MLE μ=0.1289,σ=0.1331,λ=26.1091,m=0.0033,δ=0.0317\mu=0.1289,\sigma=0.1331,\lambda=26.1091,m=-0.0033,\delta=0.0317
Table 2: Parameters of the true MJD model and MLE estimates from 15 years of data
Figure 3: Training paths of five parameters (iteration index on the xx-axis)
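
The MLE initialization referred to above can be sketched as follows, writing the daily log-return density under the MJD model as a Poisson mixture of normals truncated at n_max jumps per day. The truncation level, the optimizer, and the starting point are our own illustrative choices and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, poisson

def mjd_negloglik(params, returns, dt=1/252, n_max=5):
    # Negative log-likelihood of daily log-returns under the Merton jump-diffusion,
    # written as a Poisson mixture of normals and truncated at n_max jumps per day.
    mu, sigma, lam, m, delta = params
    if sigma <= 0 or lam <= 0 or delta <= 0:
        return np.inf
    kappa = np.exp(m + 0.5 * delta**2) - 1.0
    loc0 = (mu - 0.5 * sigma**2 - lam * kappa) * dt
    dens = np.zeros_like(returns, dtype=float)
    for n in range(n_max + 1):
        dens += poisson.pmf(n, lam * dt) * norm.pdf(
            returns, loc=loc0 + n * m, scale=np.sqrt(sigma**2 * dt + n * delta**2))
    return -np.sum(np.log(dens + 1e-300))

def fit_mjd(returns, start=(0.1, 0.15, 20.0, 0.0, 0.02)):
    res = minimize(mjd_negloglik, start, args=(np.asarray(returns),),
                   method="Nelder-Mead", options={"maxiter": 20000})
    return res.x          # used to initialize phi_1, ..., phi_5
```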

6.4 Empirical study

In the empirical study we consider hedging a European put option written on the S&P 500 index. This type of option is popular in the financial market as a tool for investors to protect their high-beta investments. We obtain daily observations of the index from the beginning of 2000 to the end of 2023 and split them into two periods: the training period (the first 20 years) and the test period (the last 4 years). The test period includes the U.S. stock market meltdown at the onset of the pandemic, the 2022 bear market, and various bull periods.

We use the MLE estimates from the first 20 years of observations to initialize ϕ1\phi_{1} to ϕ5\phi_{5} for learning. To sample an episode from the environment for training our RL algorithm, we bootstrap index returns from the training period. Similarly, we bootstrap index returns from the test period to generate index price paths to test the learned policy. We use the 20-point G-H rule with M=32M=32 episodes in each training iteration. The learning rates are set as αϕ1=5e3\alpha_{\phi_{1}}=5\mathrm{e}{-3}, αϕ2=5e4\alpha_{\phi_{2}}=5\mathrm{e}{-4}, αϕ3=50\alpha_{\phi_{3}}=50, αϕ4=5e6\alpha_{\phi_{4}}=5\mathrm{e}{-6}, αϕ5=1e5\alpha_{\phi_{5}}=1\mathrm{e}{-5} with the same learning rate decay schedule as in the simulation study. We set the temperature parameter θ=0.1\theta=0.1.
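
A minimal sketch of the bootstrap used to generate episodes: daily log-returns are resampled i.i.d. with replacement from the relevant period and compounded from an initial price drawn uniformly from [0.7\mathcal{K},1.3\mathcal{K}] (with \mathcal{K}_{p}=1, as discussed below). The synthetic return sample in the usage example is a placeholder for the actual index returns.

```python
import numpy as np

def bootstrap_episode(log_returns, S0, n_steps, rng=None):
    # One index-price episode of n_steps periods obtained by resampling
    # historical daily log-returns i.i.d. with replacement.
    rng = rng or np.random.default_rng(0)
    sampled = rng.choice(log_returns, size=n_steps, replace=True)
    return S0 * np.exp(np.concatenate(([0.0], np.cumsum(sampled))))

# illustrative usage; in practice log_returns = np.diff(np.log(index_levels)) of the
# training (or test) period, and the strike is normalized so that K_p = 1
rng = np.random.default_rng(2)
log_returns = rng.normal(2e-4, 0.012, size=5000)       # placeholder return sample
S0 = rng.uniform(0.7, 1.3)                             # initial price in [0.7K, 1.3K], K = 1
path = bootstrap_episode(log_returns, S0, n_steps=21)  # a one-month daily episode
```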

We consider four maturities ranging from 1 to 4 months for the put option. For a standard S&P 500 index option traded on the CBOE, the strike is in practice quoted in points commensurate with the scale of the index level and converted to a dollar amount by a multiplier of 100. That is, the dollar strike is \mathcal{K}=100\mathcal{K}_{p}, where \mathcal{K}_{p} is the strike in points (for example, for an at-the-money contract when the current index is 5000 points, we have \mathcal{K}_{p}=5000 and the contract's notional size is 500,000 dollars). Let S_{t} be the index level at time t converted to dollars by the multiplier of 100. We sample S_{0} uniformly from the range [0.7\mathcal{K},1.3\mathcal{K}] to generate training and test episodes, which allows us to include scenarios where the option is deep out of the money or in the money. We train our algorithm for 100 iterations and then test it on 5000 index price episodes. Finally, although we use a stochastic policy to interact with the environment for learning/updating the parameters during training, in the test stage we apply the deterministic policy given by the mean of the learned stochastic policy, which reduces the variance of the final hedging error.

To measure the test performance, we consider the mean squared hedging error (denominated in squared dollars) divided by \mathcal{K}_{p}^{2}. This effectively normalizes the contract's scale, so we can simply set \mathcal{K}_{p}=1 in our implementation. The test result can then be interpreted as the mean squared hedging error, in squared dollars, per point of strike price.

We compare the performance of the policy learned by our RL algorithm with that of the policy obtained by plugging the MLE estimates from the training-period data into (160). Both policies are tested on the same index price episodes. We check the statistical significance of the difference in the squared hedging errors of the two policies with a t-test. Table 3 reports the mean squared hedging error of each policy as well as the p-value of the t-test. Not surprisingly, the hedging error of each policy increases as the maturity becomes longer, due to the greater uncertainty in the problem. In every case, the RL policy achieves a notable reduction in hedging error compared with the MLE-based policy, and the p-value shows that the outperformance is statistically significant.

Maturity RL policy MLE policy p-value
1 month 0.3513 (0.034) 0.4006 (0.043) 3.2e-7
2 months 0.5668 (0.029) 0.6796 (0.050) 1.2e-5
3 months 0.8283 (0.031) 1.0080 (0.060) 1.6e-5
4 months 0.8649 (0.047) 1.1544 (0.068) 4.2e-13
Table 3: Mean squared hedging errors of the two policies in the empirical test with their standard errors shown in parenthesis
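
For concreteness, the test metric and the significance check above might be computed as follows. Since the exact variant of the t-test is not specified in the text, the sketch uses a paired t-test, which is one natural choice given that both policies are evaluated on the same episodes; the hedging errors below are placeholders.

```python
import numpy as np
from scipy import stats

# Squared terminal hedging errors of the two policies on the same test episodes,
# normalized by the squared strike in points (K_p = 1 in the implementation).
rng = np.random.default_rng(3)
err_rl = rng.normal(0.0, 0.6, size=5000) ** 2          # placeholder hedging errors
err_mle = rng.normal(0.0, 0.65, size=5000) ** 2

mse_rl, mse_mle = err_rl.mean(), err_mle.mean()
se_rl = err_rl.std(ddof=1) / np.sqrt(err_rl.size)       # standard error, as in Table 3
t_stat, p_value = stats.ttest_rel(err_rl, err_mle)      # paired test on common episodes
print(mse_rl, mse_mle, se_rl, p_value)
```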

7 Conclusions

Fluctuations in data or time series coming from nonlinear, complex dynamic systems are of two types: slow changes and sudden jumps, the latter occurring much more rarely than the former. Hence jump-diffusions capture the key structural characteristics of many data generating processes in areas such as physics, astrophysics, earth science, engineering, finance, and medicine. As a result, RL for jump-diffusions is important both theoretically and practically. This paper endeavors to lay a theoretical foundation for this line of study. A key insight from this research is that temporal-difference algorithms designed for diffusions can work seamlessly for jump-diffusions. However, unless one uses general neural network approximators, policy parameterization does need to respond to the presence of jumps if one is to take advantage of any special structure of the underlying problem.

There are plenty of open questions going forward, including the convergence of the algorithms, regret bounds, decay rates of the temperature parameters, and learning rates of the gradient descent and/or stochastic approximation procedures involved.

Acknowledgements

Gao acknowledges support from the Hong Kong Research Grants Council [GRF 14201424, 14212522, 14200123]. Li is supported by the Hong Kong Research Grant Council [GRF 14216322]. Zhou is supported by the Nie Center for Intelligent Asset Management at Columbia University.

References

  • Aït-Sahalia et al. (2012) Aït-Sahalia, Y., J. Jacod, and J. Li (2012). Testing for jumps in noisy high frequency data. Journal of Econometrics 168(2), 207–222.
  • Andersen et al. (2002) Andersen, T. G., L. Benzoni, and J. Lund (2002). An empirical investigation of continuous-time equity return models. The Journal of Finance 57(3), 1239–1284.
  • Applebaum (2009) Applebaum, D. (2009). Lévy Processes and Stochastic Calculus. Cambridge University Press.
  • Bates (1991) Bates, D. S. (1991). The crash of 87: was it expected? The evidence from options markets. The Journal of Finance 46(3), 1009–1044.
  • Bates (1996) Bates, D. S. (1996). Jumps and stochastic volatility: Exchange rate processes implicit in Deutsche Mark options. The Review of Financial Studies 9(1), 69–107.
  • Bender and Thuan (2023) Bender, C. and N. T. Thuan (2023). Entropy-regularized mean-variance portfolio optimization with jumps. arXiv preprint arXiv:2312.13409.
  • Bo et al. (2024) Bo, L., Y. Huang, X. Yu, and T. Zhang (2024). Continuous-time q-learning for jump-diffusion models under Tsallis entropy. arXiv preprint arXiv:2407.03888.
  • Cai and Kou (2011) Cai, N. and S. G. Kou (2011). Option pricing under a mixed-exponential jump diffusion model. Management Science 57(11), 2067–2081.
  • Cont and Tankov (2004) Cont, R. and P. Tankov (2004). Financial Modelling with Jump Processes. Chapman and Hall/CRC.
  • Dai et al. (2023) Dai, M., Y. Dong, and Y. Jia (2023). Learning equilibrium mean–variance strategy. Mathematical Finance 33(4), 1166–1212.
  • Das (2002) Das, S. R. (2002). The surprise element: Jumps in interest rates. Journal of Econometrics 106(1), 27–65.
  • Denkert et al. (2024) Denkert, R., H. Pham, and X. Warin (2024). Control randomisation approach for policy gradient and application to reinforcement learning in optimal switching. arXiv preprint arXiv:2404.17939.
  • Ethier and Kurtz (1986) Ethier, S. N. and T. G. Kurtz (1986). Markov Processes: Characterization and Convergence. John Wiley & Sons.
  • Fang and Oosterlee (2009) Fang, F. and C. W. Oosterlee (2009). A novel pricing method for European options based on Fourier-cosine series expansions. SIAM Journal on Scientific Computing 31(2), 826–848.
  • Gammaitoni et al. (1998) Gammaitoni, L., P. Hänggi, P. Jung, and F. Marchesoni (1998). Stochastic resonance. Reviews of Modern Physics 70(1), 223–287.
  • Gao et al. (2022) Gao, X., Z. Q. Xu, and X. Y. Zhou (2022). State-dependent temperature control for langevin diffusions. SIAM Journal on Control and Optimization 60(3), 1250–1268.
  • Giraudo and Sacerdote (1997) Giraudo, M. T. and L. Sacerdote (1997). Jump-diffusion processes as models for neuronal activity. Biosystems 40(1-2), 75–82.
  • Goswami et al. (2018) Goswami, B., N. Boers, A. Rheinwalt, N. Marwan, J. Heitzig, S. Breitenbach, and J. Kurths (2018). Abrupt transitions in time series with uncertainties. Nature Communications 9(1), 48–57.
  • Guo et al. (2023) Guo, X., A. Hu, and Y. Zhang (2023). Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. SIAM Journal on Control and Optimization 61(2), 755–787.
  • Guo et al. (2022) Guo, X., R. Xu, and T. Zariphopoulou (2022). Entropy regularization for mean field games with learning. Mathematics of Operations Research 47(4), 3239–3260.
  • Heyde and Kou (2004) Heyde, C. C. and S. G. Kou (2004). On the controversy over tailweight of distributions. Operations Research Letters 32(5), 399–408.
  • Huang et al. (2022) Huang, Y., Y. Jia, and X. Zhou (2022). Achieving mean–variance efficiency by continuous-time reinforcement learning. In Proceedings of the Third ACM International Conference on AI in Finance, pp.  377–385.
  • Jacod and Shiryaev (2013) Jacod, J. and A. Shiryaev (2013). Limit Theorems for Stochastic Processes. Springer.
  • Jia and Zhou (2022a) Jia, Y. and X. Y. Zhou (2022a). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research 23(154), 1–55.
  • Jia and Zhou (2022b) Jia, Y. and X. Y. Zhou (2022b). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research 23(154), 1–55.
  • Jia and Zhou (2023) Jia, Y. and X. Y. Zhou (2023). q-learning in continuous time. Journal of Machine Learning Research 24(161), 1–61.
  • Kou (2002) Kou, S. G. (2002). A jump-diffusion model for option pricing. Management Science 48(8), 1086–1101.
  • Kunita (2004) Kunita, H. (2004). Stochastic differential equations based on Lévy processes and stochastic flows of diffeomorphisms. In M. M. Rao (Ed.), Real and Stochastic Analysis: New Perspectives, pp.  305–373. Birkhäuser.
  • Kushner (2000) Kushner, H. J. (2000). Jump-diffusions with controlled jumps: Existence and numerical methods. Journal of Mathematical Analysis and Applications 249(1), 179–198.
  • Li and Linetsky (2014) Li, L. and V. Linetsky (2014). Time-changed Ornstein–Uhlenbeck processes and their applications in commodity derivative models. Mathematical Finance 24(2), 289–330.
  • Lim (2005) Lim, A. E. (2005). Mean-variance hedging when there are jumps. SIAM Journal on Control and Optimization 44(5), 1893–1922.
  • Merton (1976) Merton, R. C. (1976). Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics 3(1-2), 125–144.
  • Munos (2006) Munos, R. (2006). Policy gradient in continuous time. Journal of Machine Learning Research 7, 771–791.
  • Øksendal and Sulem (2007) Øksendal, B. K. and A. Sulem (2007). Applied Stochastic Control of Jump Diffusions, Volume 498. Springer.
  • Park et al. (2021) Park, S., J. Kim, and G. Kim (2021). Time discretization-invariant safe action repetition for policy gradient methods. Advances in Neural Information Processing Systems 34, 267–279.
  • Reisinger and Zhang (2021) Reisinger, C. and Y. Zhang (2021). Regularity and stability of feedback relaxed controls. SIAM Journal on Control and Optimization 59(5), 3118–3151.
  • Sato (1999) Sato, K.-I. (1999). Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press.
  • Schweizer (1996) Schweizer, M. (1996). Approximation pricing and the variance-optimal martingale measure. The Annals of Probability 24(1), 206–236.
  • Schweizer (2001) Schweizer, M. (2001). A guided tour through quadratic hedging approaches. In E. Jouini, J. Cvitanic, and M. Musiela (Eds.), Handbooks in Mathematical Finance: Option Pricing, Interest Rates and Risk Management, pp.  538–74. Cambridge University Press.
  • Situ (2006) Situ, R. (2006). Theory of Stochastic Differential Equations with Jumps and Applications. Springer.
  • Tallec et al. (2019) Tallec, C., L. Blier, and Y. Ollivier (2019). Making deep q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732.
  • Wang et al. (2023) Wang, B., X. Gao, and L. Li (2023). Reinforcement learning for continuous-time optimal execution: actor-critic algorithm and error analysis. Available at SSRN 4378950.
  • Wang and Zheng (2022) Wang, B. and X. Zheng (2022). Testing for the presence of jump components in jump diffusion models. Journal of Econometrics 230(2), 483–509.
  • Wang et al. (2020) Wang, H., T. Zariphopoulou, and X. Y. Zhou (2020). Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research 21(198), 1–34.
  • Wang and Zhou (2020) Wang, H. and X. Y. Zhou (2020). Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance 30(4), 1273–1308.
  • Williams and Rasmussen (2006) Williams, C. K. and C. E. Rasmussen (2006). Gaussian Processes for Machine Learning. MIT Press.
  • Wu and Li (2024) Wu, B. and L. Li (2024). Reinforcement learning for continuous-time mean–variance portfolio selection in a regime-switching market. Journal of Economic Dynamics and Control 158, 104787.
  • Zhou and Li (2000) Zhou, X. Y. and D. Li (2000). Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization 42(1), 19–33.
  • Zhu et al. (2015) Zhu, C., G. Yin, and N. A. Baran (2015). Feynman–Kac formulas for regime-switching jump diffusions and their applications. Stochastics 87(6), 1000–1032.

Appendix A Proofs

Proof of Lemma 2.

We observe that for each k=1,,k=1,\cdots,\ell,

×[0,1]n|γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u=𝒜|γk(t,x,a,z)|pνk(dz)𝝅(a|t,x)𝑑a.\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)\right|^{p}\nu_{k}(dz)du=\int_{\mathcal{A}}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{p}\nu_{k}(dz)\boldsymbol{\pi}(a|t,x)da. (179)

From Assumption 1-(iii), we have k=1|γk(t,x,a,z)|pνk(dz)Cp(1+|x|p)\sum_{k=1}^{\ell}\int_{\mathbb{R}}|\gamma_{k}(t,x,a,z)|^{p}\nu_{k}(dz)\leq C_{p}(1+|x|^{p}) for any (t,x,a)[0,T]×d×𝒜(t,x,a)\in[0,T]\times\mathbb{R}^{d}\times\mathcal{A}, where CpC_{p} does not depend on aa. Thus, integrating over aa with the measure 𝝅(a|t,x)da\boldsymbol{\pi}(a|t,x)da preserves the linear growth property. ∎

Proof of Lemma 3.

We consider

×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u,\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du, (180)

which is bounded by

Cp\displaystyle C_{p} (×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)du\displaystyle\left(\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du\right. (181)
+×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)du)\displaystyle\left.+\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du\right) (182)

for some constant Cp>0C_{p}>0.

For the first integral, using Assumption 2 we obtain

×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u\displaystyle\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x}}(u),z\right)-\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du (183)
\displaystyle\leq CK,p[0,1]n|G𝝅(t,x,u)G𝝅(t,x,u)|p𝑑uCK,p|xx|p.\displaystyle C_{K,p}\int_{[0,1]^{n}}\left|G^{\boldsymbol{\pi}}(t,x,u)-G^{\boldsymbol{\pi}}(t,x^{\prime},u)\right|^{p}du\leq C^{\prime}_{K,p}|x-x^{\prime}|^{p}. (184)

For the second integral, we have

×[0,1]n|γk(t,x,G𝝅t,x(u),z)γk(t,x,G𝝅t,x(u),z)|pνk(dz)𝑑u\displaystyle\int_{\mathbb{R}\times[0,1]^{n}}\left|\gamma_{k}\left(t,x,G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)-\gamma_{k}\left(t,x^{\prime},G^{\boldsymbol{\pi}_{t,x^{\prime}}}(u),z\right)\right|^{p}\nu_{k}(dz)du (185)
=\displaystyle= 𝒜|γk(t,x,a,z)γk(t,x,a,z)|pνk(dz)𝝅(a|t,x)𝑑a\displaystyle\int_{\mathcal{A}}\int_{\mathbb{R}}\left|\gamma_{k}\left(t,x,a,z\right)-\gamma_{k}\left(t,x^{\prime},a,z\right)\right|^{p}\nu_{k}(dz)\boldsymbol{\pi}(a|t,x^{\prime})da (186)
\displaystyle\leq 𝒜CK,p′′|xx|p𝝅(a|t,x)𝑑a\displaystyle\int_{\mathcal{A}}C^{\prime\prime}_{K,p}|x-x^{\prime}|^{p}\boldsymbol{\pi}(a|t,x^{\prime})da (187)
=\displaystyle= CK,p′′|xx|p,\displaystyle C^{\prime\prime}_{K,p}|x-x^{\prime}|^{p}, (188)

where we used Assumption 1-(ii) and CK,p′′C^{\prime\prime}_{K,p} is the constant there. The desired claim is obtained by combining these results. ∎

Proof of Lemma 4.

Fix t[0,T)t\in[0,T) and suppose X~tπ=x\tilde{X}^{\pi}_{t}=x. Define a sequence of stopping times τn=inf{st:|X~sπ|n}\tau_{n}=\inf\{s\geq t:|\tilde{X}^{\pi}_{s}|\geq n\} for nn\in\mathbb{N}. Applying Itô’s formula (15) to the process eβsϕ(s,X~sπ)e^{-\beta s}\phi(s,\tilde{X}^{\pi}_{s}), where X~sπ\tilde{X}^{\pi}_{s} follows the exploratory SDE (42), we obtain t[t,T]\forall t^{\prime}\in[t,T],

\displaystyle e^{-\beta(t^{\prime}\wedge\tau_{n})}\phi(t^{\prime}\wedge\tau_{n},\tilde{X}^{\pi}_{t^{\prime}\wedge\tau_{n}})-e^{-\beta t}\phi(t,x) (189)
=\displaystyle= ttτneβs(𝝅ϕ(s,X~sπ)βϕ(s,X~sπ))𝑑s\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\left(\mathcal{L}^{\boldsymbol{\pi}}\phi(s,\tilde{X}^{\pi}_{s-})-\beta\phi(s,\tilde{X}^{\pi}_{s-})\right)ds (190)
+\displaystyle+ ttτneβsϕx(s,X~sπ)σ~(s,X~sπ,π(|s,X~sπ))dWs\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\phi_{x}(s,\tilde{X}^{\pi}_{s-})\circ\tilde{\sigma}(s,\tilde{X}^{\pi}_{s-},\pi(\cdot|s,\tilde{X}^{\pi}_{s-}))dW_{s} (191)
+\displaystyle+ ttτneβsk=1×[0,1]n(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du),\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\sum_{k=1}^{\ell}\int_{\mathbb{R}\times[0,1]^{n}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}^{\prime}_{k}(ds,dz,du), (192)

where \mathcal{L}^{\boldsymbol{\pi}} is the generator of the exploratory process (\tilde{X}^{\pi}_{t}) given in (40). We next show that the expectations of (191) and (192) are zero.

Note that for s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}], |X~sπ|n|\tilde{X}^{\pi}_{s-}|\leq n. Thus, ϕx(s,X~sπ)\phi_{x}(s,\tilde{X}^{\pi}_{s-}) is also bounded. Using the linear growth of σ~(s,X~sπ,π(|s,X~sπ))\tilde{\sigma}(s,\tilde{X}^{\pi}_{s-},\pi(\cdot|s,\tilde{X}^{\pi}_{s-})) in Lemma 1 and the moment estimate (52), we can see that (191) is a square-integrable martingale and hence its expectation is zero.

Next, we analyze the following stochastic integral:

ttτneβs×[0,1]n(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du).\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}_{k}(ds,dz,du). (194)

We consider the finite and infinite jump activity cases separately.

Case 1: νk(dz)<\int_{\mathbb{R}}\nu_{k}(dz)<\infty. In this case, both of the processes

ttτneβs×[0,1]nϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))N~k(ds,dz,du),\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\widetilde{N}_{k}(ds,dz,du), (195)
ttτneβs×[0,1]nϕ(s,X~sπ)N~k(ds,dz,du),\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\phi(s,\tilde{X}^{\pi}_{s-})\widetilde{N}_{k}(ds,dz,du), (196)

are square-integrable martingales and hence have zero expectations. We prove this claim for (195); the analysis of (196) is entirely analogous.

Using the polynomial growth of ϕ\phi, Lemma 2, and |X~sπ|n|\tilde{X}^{\pi}_{s-}|\leq n for s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}], we obtain

𝔼t,x¯[ttτne2βs×[0,1]n|ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))|2νk(dz)𝑑u𝑑s]\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-2\beta s}\int_{\mathbb{R}\times[0,1]^{n}}\left|\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\right|^{2}\nu_{k}(dz)duds\right] (197)
\displaystyle\leq Cp𝔼t,x¯[ttτn×[0,1]n(1+|X~sπ|p+|γk(s,X~sπ,G𝝅(s,X~sπ,u),z))|p)νk(dz)duds]\displaystyle C_{p}\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}\int_{\mathbb{R}\times[0,1]^{n}}\left(1+|\tilde{X}^{\pi}_{s-}|^{p}+|\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))|^{p}\right)\nu_{k}(dz)duds\right] (198)
\displaystyle\leq Cp𝔼t,x¯[ttτn(1+|X~sπ|p)𝑑s]<.\displaystyle C_{p}^{\prime}\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}(1+|\tilde{X}^{\pi}_{s-}|^{p})ds\right]<\infty. (199)

This implies the process (195) is a square-integrable martingale; see e.g. (Situ, 2006, Section 1.9) for square-integrability of stochastic integrals with respect to compensated Poisson random measures.

Case 2: νk(dz)=\int_{\mathbb{R}}\nu_{k}(dz)=\infty. Let B1={|z|1}×[0,1]nB_{1}=\{|z|\leq 1\}\times[0,1]^{n} and B1c={|z|>1}×[0,1]nB_{1}^{c}=\{|z|>1\}\times[0,1]^{n}. The stochastic integral (194) can be written as the sum of two integrals:

ttτneβsB1(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du)\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}_{k}(ds,dz,du) (200)
+\displaystyle+ ttτneβsB1c(ϕ(s,X~sπ+γk(s,X~sπ,G𝝅(s,X~sπ,u),z))ϕ(s,X~sπ))N~k(ds,dz,du).\displaystyle\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}^{c}}\left(\phi(s,\tilde{X}^{\pi}_{s-}+\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))-\phi(s,\tilde{X}^{\pi}_{s-})\right)\widetilde{N}_{k}(ds,dz,du). (201)

Using the mean-value theorem, the stochastic integral (200) is equal to

ttτneβsB1ϕx(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))γk(s,X~sπ,G𝝅(s,X~sπ,u),z)N~k(ds,dz,du)\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}}\phi_{x}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\circ\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\widetilde{N}_{k}(ds,dz,du) (202)

for some αs[0,1]\alpha_{s}\in[0,1]. For s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}], |X~sπ|n|\tilde{X}^{\pi}_{s-}|\leq n. By Assumption 1-(v), |γk(s,X~sπ,G𝝅(s,X~sπ,u),z)||\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)| is bounded for any |z|1|z|\leq 1, s[t,tτn]s\in[t,t^{\prime}\wedge\tau_{n}] and u[0,1]nu\in[0,1]^{n}, which further implies the boundedness of |ϕx(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))||\phi_{x}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))|. Now, for each j=1,,dj=1,\cdots,d,

ttτneβsB1ϕxj(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))γjk(s,X~sπ,G𝝅(s,X~sπ,u),z)N~k(ds,dz,du)\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-\beta s}\int_{B_{1}}\phi_{x_{j}}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\gamma_{jk}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\widetilde{N}_{k}(ds,dz,du) (203)

is a square-integrable martingale because

𝔼t,x¯[ttτne2βsB1ϕxj2(s,X~sπ+αsγk(s,X~sπ,G𝝅(s,X~sπ,u),z))γjk2(s,X~sπ,G𝝅(s,X~sπ,u),z)νk(dz)𝑑u𝑑s]\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}e^{-2\beta s}\int_{B_{1}}\phi_{x_{j}}^{2}(s,\tilde{X}^{\pi}_{s-}+\alpha_{s}\gamma_{k}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z))\gamma_{jk}^{2}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\nu_{k}(dz)duds\right] (204)
\displaystyle\leq C𝔼t,x¯[ttτnB1γjk2(s,X~sπ,G𝝅(s,X~sπ,u),z)νk(dz)𝑑u𝑑s]\displaystyle C\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}\int_{B_{1}}\gamma_{jk}^{2}(s,\tilde{X}^{\pi}_{s-},G_{\boldsymbol{\pi}}(s,\tilde{X}^{\pi}_{s-},u),z)\nu_{k}(dz)duds\right] (205)
\displaystyle\leq C𝔼t,x¯[ttτn(1+|X~sπ|2)𝑑s]<,\displaystyle C^{\prime}\cdot\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{t^{\prime}\wedge\tau_{n}}(1+|\tilde{X}^{\pi}_{s-}|^{2})ds\right]<\infty, (206)

where we used Lemma 2 and boundedness of |X~sπ||\tilde{X}^{\pi}_{s-}| in the above. Thus, (200) is a square-integrable martingale with mean zero.

For (201), we can use the same arguments as in the finite activity case by observing |z|>1νk(dz)<\int_{|z|>1}\nu_{k}(dz)<\infty to show that each of the two processes in (201) is a square-integrable martingale with mean zero.

Combining the results above, setting t=Tt^{\prime}=T, and taking expectation, we obtain

𝔼t,x¯[eβ(Tτn)ϕ(X~Tτn𝝅)]eβtϕ(t,x)\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[e^{-\beta(T\wedge\tau_{n})}\phi(\tilde{X}^{\boldsymbol{\pi}}_{T\wedge\tau_{n}})\right]-e^{-\beta t}\phi(t,x) =𝔼t,x¯[tTτneβs(𝝅ϕ(s,X~sπ)βϕ(s,X~sπ))𝑑s].\displaystyle=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T\wedge\tau_{n}}e^{-\beta s}\left(\mathcal{L}^{\boldsymbol{\pi}}\phi(s,\tilde{X}^{\pi}_{s-})-\beta\phi(s,\tilde{X}^{\pi}_{s-})\right)ds\right].

As ϕ(s,x)\phi(s,x) satisfies Eq. (56), it follows from (19) that

ϕ(t,x)\displaystyle\phi(t,x) =𝔼t,x¯[tTτneβ(st)𝒜(r(s,X~s𝝅,a)θlog𝝅(a|s,X~sπ))𝝅(a|s,X~s𝝅)dads\displaystyle=\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T\wedge\tau_{n}}e^{-\beta(s-t)}\int_{\mathcal{A}}\left(r(s,\tilde{X}^{\boldsymbol{\pi}}_{s-},a)-\theta\log{\boldsymbol{\pi}}(a|s,\tilde{X}^{\pi}_{s-})\right)\boldsymbol{\pi}(a|s,\tilde{X}^{\boldsymbol{\pi}}_{s-})dads\right. (207)
+eβ(Tτnt)ϕ(X~Tτn𝝅)].\displaystyle\quad\quad\quad\left.+e^{-\beta(T\wedge\tau_{n}-t)}\phi(\tilde{X}^{\boldsymbol{\pi}}_{T\wedge\tau_{n}})\right]. (208)

It follows from Assumption 1-(iv), Definition 1-(iii), and Eq. (57) that the term on the right-hand side of (208) is dominated by C(1+suptsT|X~s𝝅|p)C(1+\sup_{t\leq s\leq T}|\tilde{X}_{s}^{\boldsymbol{\pi}}|^{p}) for some p2p\geq 2, which has finite expectation from the moment estimate (52). Therefore, sending nn to infinity in (208) and applying the dominated convergence theorem, we obtain ϕ(t,x)=J(t,x,𝝅)\phi(t,x)=J(t,x,\boldsymbol{\pi}).∎

Proof of Lemma 5.

We first maximize the integral in (59). Applying (Jia and Zhou, 2023, Lemma 13), we deduce that 𝝅\boldsymbol{\pi}^{*} given by (62) is the unique maximizer. Next, we show that ψ(t,x)\psi(t,x) is the optimal value function.

On one hand, given any admissible stochastic policy 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}, from (59) we have

ψ(t,x)t+𝒜{H(t,x,a,ψx,ψxx,ψ)θlog𝝅(a|t,x)}𝝅(a|t,x)𝑑aβψ(t,x)0.\displaystyle\frac{\partial\psi(t,x)}{\partial t}+\int_{\mathcal{A}}\{H(t,x,a,\psi_{x},\psi_{xx},\psi)-\theta\log\boldsymbol{\pi}(a|t,x)\}\boldsymbol{\pi}(a|t,x)da-\beta\psi(t,x)\leq 0. (209)

Using similar arguments as in the proof of Lemma 4, we obtain J(t,x,𝝅)ψ(t,x)J(t,x,\boldsymbol{\pi})\leq\psi(t,x) for any 𝝅𝚷\boldsymbol{\pi}\in\boldsymbol{\Pi}. Thus, J(t,x)ψ(t,x)J^{*}(t,x)\leq\psi(t,x).

On the other hand, with \boldsymbol{\pi}=\boldsymbol{\pi}^{*}, Eq. (59) becomes

ψ(t,x)t+𝒜{H(t,x,a,ψx,ψxx,ψ)θlog𝝅(a)}𝝅(a)𝑑aβψ(t,x)\displaystyle\frac{\partial\psi(t,x)}{\partial t}+\int_{\mathcal{A}}\{H(t,x,a,\psi_{x},\psi_{xx},\psi)-\theta\log\boldsymbol{\pi}^{*}(a)\}\boldsymbol{\pi}^{*}(a)da-\beta\psi(t,x) =0,\displaystyle=0, (210)

with ψ(T,x)=h(x).\psi(T,x)=h(x). By Lemma 4, we obtain that J(t,x,𝝅)=ψ(t,x)J(t,x,\boldsymbol{\pi}^{*})=\psi(t,x). It follows that J(t,x)ψ(t,x)J^{*}(t,x)\geq\psi(t,x).

Combining these results, we conclude that J(t,x)=ψ(t,x)J^{*}(t,x)=\psi(t,x) and 𝝅\boldsymbol{\pi}^{*} is the optimal stochastic policy. ∎

Proof of Theorem 2.

Consider part (i). Applying Itô’s formula to the value function of policy 𝝅\boldsymbol{\pi} over the sample state process defined by (22) and using the definition of qq-function, we obtain that for 0t<sT0\leq t<s\leq T,

eβsJ(s,Xs𝝅;𝝅)eβtJ(t,x;𝝅)+tseβτ[r(τ,Xτ𝝅,aτ𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\displaystyle e^{-\beta s}J(s,X_{s}^{\boldsymbol{\pi}};\boldsymbol{\pi})-e^{-\beta t}J(t,x;\boldsymbol{\pi})+\int_{t}^{s}e^{-\beta\tau}[r(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})-\hat{q}(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau
=tseβτ[q(τ,Xτ𝝅,aτ𝝅;𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ+tseβτJx(τ,Xτ𝝅;𝝅)σ(τ,Xτπ,aτπ)𝑑Wτ\displaystyle=\int_{t}^{s}e^{-\beta\tau}[q(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}};\boldsymbol{\pi})-\hat{q}(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau+\int_{t}^{s}e^{-\beta\tau}J_{x}(\tau,X_{\tau-}^{\boldsymbol{\pi}};\boldsymbol{\pi})\circ{\sigma}(\tau,X^{\pi}_{\tau-},a^{\pi}_{\tau})dW_{\tau}
+k=1tseβτ(J(τ,Xτ𝝅+γk(τ,Xτ𝝅,aτ𝝅,z))J(τ,Xτ𝝅;𝝅))N~k(dτ,dz).\displaystyle\quad+\sum_{k=1}^{\ell}\int_{t}^{s}e^{-\beta\tau}\int_{\mathbb{R}}\left(J(\tau,X^{\boldsymbol{\pi}}_{\tau-}+{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z))-J(\tau,X^{\boldsymbol{\pi}}_{\tau-};\boldsymbol{\pi})\right)\widetilde{N}_{k}(d\tau,dz). (211)

Suppose q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a). Hence the first term on the right-hand side of (211) is zero. We verify the following two conditions:

𝔼t,x¯[tTe2βτ|Jx(τ,Xτ𝝅;𝝅)σ(τ,Xτπ,aτπ)|2𝑑τ]<,\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-2\beta\tau}|J_{x}(\tau,X_{\tau-}^{\boldsymbol{\pi}};\boldsymbol{\pi})\circ{\sigma}(\tau,X^{\pi}_{\tau-},a^{\pi}_{\tau})|^{2}d\tau\right]<\infty, (212)
𝔼t,x¯[tTe2βτ|J(τ,Xτ𝝅+γk(τ,Xτ𝝅,aτ𝝅,z);𝝅)J(τ,Xτ𝝅;𝝅)|2νk(dz)𝑑τ]<.\displaystyle\mathbb{E}^{\bar{\mathbb{P}}}_{t,x}\left[\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left|J(\tau,X^{\boldsymbol{\pi}}_{\tau-}+{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z);\boldsymbol{\pi})-J(\tau,X^{\boldsymbol{\pi}}_{\tau-};\boldsymbol{\pi})\right|^{2}\nu_{k}(dz)d\tau\right]<\infty. (213)

Eq. (212) follows from Assumption 1-(iii), the polynomial growth of JxJ_{x} in xx, and the moment estimate (53). For (213), we apply the mean-value theorem and the integral becomes

tTe2βτ|Jx(τ,Xτ𝝅+ατγk(τ,Xτ𝝅,aτ𝝅,z);𝝅)γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left|J_{x}(\tau,X^{\boldsymbol{\pi}}_{\tau-}+\alpha_{\tau}{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z);\boldsymbol{\pi})\circ{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (214)

for some ατ[0,1]\alpha_{\tau}\in[0,1]. Using the polynomial growth of JxJ_{x} in xx, we can bound the integral by

tTe2βτ|Jx(τ,Xτ𝝅+ατγk(τ,Xτ𝝅,aτ𝝅,z);𝝅)|2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\displaystyle\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left|J_{x}(\tau,X^{\boldsymbol{\pi}}_{\tau-}+\alpha_{\tau}{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z);\boldsymbol{\pi})\right|^{2}\cdot\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (215)
\displaystyle\leq CtTe2βτ(1+|Xτ𝝅+ατγk(τ,Xτ𝝅,aτ𝝅,z)|p)2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\displaystyle C\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left(1+|X^{\boldsymbol{\pi}}_{\tau-}+\alpha_{\tau}{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)|^{p}\right)^{2}\cdot\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (216)
\displaystyle\leq CtTe2βτ(1+|Xτ𝝅|p+|γk(τ,Xτ𝝅,aτ𝝅,z)|p)2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)𝑑τ\displaystyle C^{\prime}\int_{t}^{T}e^{-2\beta\tau}\int_{\mathbb{R}}\left(1+\left|X^{\boldsymbol{\pi}}_{\tau-}\right|^{p}+\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{p}\right)^{2}\cdot\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)d\tau (217)
\displaystyle\leq CtT((1+|Xτ𝝅|p)2|γk(τ,Xτ𝝅,aτ𝝅,z)|2νk(dz)\displaystyle C^{\prime}\int_{t}^{T}\left((1+\left|X^{\boldsymbol{\pi}}_{\tau-}\right|^{p})^{2}\int_{\mathbb{R}}\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2}\nu_{k}(dz)\right. (218)
+2(1+|Xτ𝝅|p)|γk(τ,Xτ𝝅,aτ𝝅,z)|p+2νk(dz)+|γk(τ,Xτ𝝅,aτ𝝅,z)|2p+2νk(dz))dτ\displaystyle\left.+2(1+\left|X^{\boldsymbol{\pi}}_{\tau-}\right|^{p})\int_{\mathbb{R}}\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{p+2}\nu_{k}(dz)+\int_{\mathbb{R}}\left|{\gamma}_{k}(\tau,X^{\boldsymbol{\pi}}_{\tau-},a_{\tau}^{\boldsymbol{\pi}},z)\right|^{2p+2}\nu_{k}(dz)\right)d\tau (219)

Using Assumption 1-(iii) and the moment estimate (53), we obtain (213). It follows that the second and third processes on the right-hand side of (211) are ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingales and thus we have the martingale property of the process given by (69).

Conversely, if (69) is a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale, we see from (211) that the process

tseβτ[q(τ,Xτ𝝅,aτ𝝅;𝝅)q^(τ,Xτ𝝅,aτ𝝅)]𝑑τ\int_{t}^{s}e^{-\beta\tau}[q(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}};\boldsymbol{\pi})-\hat{q}(\tau,X_{\tau-}^{\boldsymbol{\pi}},a_{\tau}^{\boldsymbol{\pi}})]d\tau (220)

is also a ({𝒢s}s0,¯)(\{\mathcal{G}_{s}\}_{s\geq 0},\bar{\mathbb{P}})-martingale. Furthermore, it has continuous sample paths and finite variation and thus is equal to zero ¯\bar{\mathbb{P}}-almost surely. We can then follow the argument in the proof of Theorem 6 in Jia and Zhou (2023) to show that q^(t,x,a)=q(t,x,a;𝝅)\hat{q}(t,x,a)=q(t,x,a;\boldsymbol{\pi}) for all (t,x,a)(t,x,a). There is only one step in their proof that we need to modify due to the presence of jumps.

Specifically, consider the sample state process X𝝅X^{\boldsymbol{\pi}} starting from some time-state-action (t,x,a)(t^{*},x^{*},a^{*}). Fix δ>0\delta>0 and define

Tδ=inf{tt:|Xt𝝅x|>δ}(t+δ).\displaystyle T_{\delta}=\inf\{t^{\prime}\geq t^{*}:|X_{t^{\prime}}^{\boldsymbol{\pi}}-x^{*}|>\delta\}\wedge(t^{*}+\delta). (221)

In the pure diffusion case, Jia and Zhou (2023) uses the continuity of the sample paths of X^{\boldsymbol{\pi}} to argue that T_{\delta}>t^{*}, \bar{\mathbb{P}}-almost surely. This result still holds in the presence of jumps, because the Lévy processes that drive our controlled state X^{\boldsymbol{\pi}} are stochastically continuous, i.e., the probability of a jump occurring at the fixed time t^{*} is zero.

To prove parts (ii) and (iii), we can apply the arguments used in proving part (i) together with those arguments from the proof of Theorem 6 in Jia and Zhou (2023). The details are omitted. ∎

Proof of Lemma 6.

(1) Consider the function h(t,S)h^{\prime}(t,S) defined by the RHS of (154). Proposition 12.1 in Cont and Tankov (2004) shows that hC1,2([0,T)×+)C([0,T]×+)h^{\prime}\in C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and it satisfies the PIDE (144). Furthermore, h(t,S)h^{\prime}(t,S) is Lipschitz continuous in SS. The Lipschitz continuity of function G^\hat{G} also implies |G^(S^T)|C(1+S^T)|\hat{G}(\hat{S}_{T})|\leq C(1+\hat{S}_{T}) for some constant C>0C>0. It follows that

|h(t,S)|C(1+𝔼[S^T|S^t=S])C(1+S),|h^{\prime}(t,S)|\leq C\left(1+\mathbb{E}^{\mathbb{Q}}[\hat{S}_{T}|\hat{S}_{t}=S]\right)\leq C(1+S), (222)

where we used the martingale property of S^\hat{S} under \mathbb{Q}. The Feynman-Kac Theorem (see Zhu et al. (2015)) implies uniqueness of classical solutions satisfying the linear growth condition.

(2) To study the PIDE (148), we consider the function g^{\prime}(t,S) defined by the RHS of (156). Under the assumptions of Lemma 6, the arguments of Proposition 12.1 in Cont and Tankov (2004) can be used to show that g^{\prime}\in C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and that it satisfies the PIDE (148). The Lipschitz continuity of h(t,S) in S implies that h_{S}(t,S) is bounded and |h(t,S\exp(z))-h(t,S)|\leq CS|\exp(z)-1|. We also have \mathbb{E}^{\mathbb{P}}[\sup_{t\leq s\leq T}\hat{S}_{s}^{2}|\hat{S}_{t}=S]\leq C(1+S^{2}) (Kunita (2004), Theorem 3.2). Combining these results, we see that g^{\prime}(t,S) has quadratic growth in S. Again, the Feynman–Kac theorem implies uniqueness of classical solutions satisfying the quadratic growth condition. ∎