Reinforcement Learning for Jump-Diffusions, with Financial Applications
Abstract
We study continuous-time reinforcement learning (RL) for stochastic control in which system dynamics are governed by jump-diffusion processes. We formulate an entropy-regularized exploratory control problem with stochastic policies to capture the exploration–exploitation balance essential for RL. Unlike the pure diffusion case initially studied by Wang et al. (2020), the derivation of the exploratory dynamics under jump-diffusions calls for a careful formulation of the jump part. Through a theoretical analysis, we find that one can simply use the same policy evaluation and q-learning algorithms in Jia and Zhou (2022a) and Jia and Zhou (2023), originally developed for controlled diffusions, without needing to check a priori whether the underlying data come from a pure diffusion or a jump-diffusion. However, we show that the presence of jumps ought to affect parameterizations of actors and critics in general. We investigate as an application the mean–variance portfolio selection problem with the stock price modelled as a jump-diffusion, and show that both the RL algorithms and the parameterizations are invariant with respect to jumps. Finally, we present a detailed study on applying the general theory to option hedging.
Keywords. Reinforcement learning, continuous time, jump-diffusions, exploratory formulation, well-posedness, Hamiltonian, martingale, q-learning.
1 Introduction
Recently there has been an upsurge of interest in continuous-time reinforcement learning (RL) with continuous state spaces and possibly continuous action spaces. Continuous RL problems are important because: 1) many if not most practical problems are naturally continuous in time (and in space), such as autonomous driving, robot navigation, video game play and high frequency trading; 2) while one can discretize time upfront and turn a continuous-time problem into a discrete-time MDP, it has been known, and indeed shown experimentally in e.g. Munos (2006), Tallec et al. (2019) and Park et al. (2021), that this approach is very sensitive to time discretization and performs poorly with small time steps; 3) there are more analytical tools available for the continuous setting that enable a rigorous and thorough analysis, leading to interpretable (instead of black-box) and general (instead of ad hoc) RL algorithms.
Compared with the vast literature on RL for MDPs, continuous-time RL research is still in its infancy, with the latest studies focusing on establishing a rigorous mathematical theory and devising the resulting RL algorithms. This strand of research starts with Wang et al. (2020), which introduces a mathematical formulation to capture the essence of RL – the exploration–exploitation tradeoff – in the continuous setting, followed by a “trilogy” (Jia and Zhou 2022a, b, Jia and Zhou 2023) that develops intertwining theories on policy evaluation, policy gradient and q-learning respectively. The common underpinning of the entire theory is the martingale property of certain stochastic processes, the enforcement of which naturally leads to various temporal difference algorithms to train and learn q-functions, value functions and optimal (stochastic) policies. The research is characterized by carrying out all the analysis in the continuous setting, and discretizing time only at the final, implementation stage for approximating the integrated rewards and the temporal difference. The theory has been adapted and extended in different directions; see e.g. Reisinger and Zhang (2021), Guo et al. (2022), Dai et al. (2023), as well as employed for applications; see e.g. Wang and Zhou (2020), Huang et al. (2022), Gao et al. (2022), Wang et al. (2023), and Wu and Li (2024).
The study so far has been predominantly on pure diffusion processes, namely, the state processes are governed by controlled stochastic differential equations (SDEs) with a drift part and a diffusion one. While it is reasonable to model the underlying data generating processes as diffusions within a short period of time, sudden and drastic changes can and do happen over time. An example is a stock price process: while it is approximately a diffusion over a sufficiently short period, it may respond dramatically to a surprisingly good or bad earnings report. Other examples include neuron dynamics (Giraudo and Sacerdote 1997), stochastic resonance (Gammaitoni et al. 1998) and climate data (Goswami et al. 2018). It is therefore natural and necessary to extend the continuous RL theory and algorithms to the case when jumps are present. This is particularly important for decision making in financial markets, where it has been well recognized that using jumps to capture large sudden movements provides a more realistic way to model market dynamics; see the discussions in Chapter 1 of Cont and Tankov (2004). The financial modeling literature with jumps dates back to the seminal work of Merton (1976), who extends the classical Black–Scholes model by introducing a compound Poisson process with normally distributed jumps in the log returns. Since then, alternative jump size distributions have been proposed in e.g. Kou (2002) and Cai and Kou (2011). The empirical success of jump-diffusion models has been documented for many asset classes; see Bates (1991), Andersen et al. (2002), and Aït-Sahalia et al. (2012) for stocks and stock indices, Bates (1996) for exchange rates, Das (2002) for interest rates, and Li and Linetsky (2014) for commodities, among many others.
This paper makes two major contributions. The first is mathematical in terms of setting up the suitable exploratory formulation and proving the well-posedness of the resulting exploratory SDE, which form the foundation of the RL theory for jump-diffusions. Wang et al. (2020) apply the classical stochastic relaxed control to model the exploration or randomization prevalent in RL, and derive an exploratory state equation that dictates the dynamics of the “average” of infinitely many state processes generated by repeatedly sampling from the same exploratory, stochastic policy. The drift and variance coefficients of the exploratory SDE are the means of those coefficients against the given stochastic policy (which is a probability distribution) respectively. The derivation therein is based on a law of large number argument to the first two moments of the diffusion process. That argument fails for jump-diffusions which are not uniquely determined by the first two moments. We overcome this difficulty by analyzing instead the infinitesimal generator of the sample state process, from which we identify the dynamics of the exploratory state process. Inspired by Kushner (2000) who studies relaxed control for jump-diffusions, we formulate the exploratory SDE by extending the original Poisson random measures for jumps to capture the effect of random exploration. It should be noted that, like almost all the earlier works on relaxed control, Kushner (2000) is motivated by answering the theoretical question of whether an optimal control exists, as randomization convexifies the universe of control strategies. In comparison, our formulation is guided by the practical motivation of exploration for learning. There is also another subtle but important difference. We consider stochastic feedback policies while Kushner (2000) does not. This in turn creates technical issues in studying the well-posedness of the exploratory SDE in our framework.
The second main contribution is several implications regarding the impact of jumps on RL algorithm design. Thanks to the established exploratory formulation, we can define the Hamiltonian that, compared with the pure diffusion counterpart, has to include an additional term corresponding to the jumps. The resulting HJB equation – called the exploratory HJB – is now a partial integro-differential equation (PIDE) instead of a PDE due to that additional term. However, when expressed in terms of the Hamiltonian, the exploratory HJB equation has exactly the same form as that in the diffusion case. This leads to several completely identical statements of important results, including the optimality of the Gibbs exploration, the definition of a q-function, and martingale characterizations of value functions and q-functions. Here by “identical” we mean in terms of the Hamiltonian; in other words, these statements differ between diffusions and jump-diffusions entirely because the Hamiltonian is defined differently (which also causes some differences in the proofs of the results concerned). Most important of all, in the resulting RL algorithms, the Hamiltonian (or equivalently the q-function) can be computed using the temporal difference of the value function by virtue of Itô's lemma; as a result the algorithms are completely identical no matter whether or not there are jumps. This has a significant practical implication: we can just use the same RL algorithms without needing to check in advance whether the underlying data come from a pure diffusion or a jump-diffusion. It is significant for the following reason. In practice, data are always observed or sampled at discrete times, no matter how frequently they arrive. Thus we encounter successive discontinuities along the sample trajectory even when the data actually come from a diffusion process. There are some criteria that can be used to check whether the underlying process is a diffusion or a jump-diffusion, e.g. Aït-Sahalia et al. (2012), Wang and Zheng (2022). But these methods typically require data with very high frequency to be effective, which may not always be available. In addition, noise must be properly handled for them to work.
Even though we can apply the same RL algorithms irrespective of the presence of jumps, the parametrization of the policy and value function may still depend on it, if we try to exploit certain special structure of the problem instead of using general neural networks for parameterization. Indeed, we give an example in which the optimal exploratory policy is Gaussian when there are no jumps, whereas an optimal policy either does not exist or becomes non-Gaussian when there are jumps. However, in the mean–variance portfolio selection we present as a concrete application, the optimal Gibbs exploration measure again reduces to Gaussian and the value function is quadratic as in Wang and Zhou (2020), both owing to the inherent linear–quadratic (LQ) structure of the problem. Hence in this particular case jumps do not even affect the parametrization of the policy and value function/q-function for learning.
We also consider mean–variance hedging of options as another application. This is a non-LQ problem and hence more difficult to solve than mean–variance portfolio selection. The mean–variance hedging problem has been studied in various early works, such as Schweizer (1996, 2001) and Lim (2005). Here, we introduce the entropy-regularized mean–variance hedging objective for an asset following a jump-diffusion and derive analytical representations for the optimal stochastic policy, which is again Gaussian, as well as the optimal value function. We use these representations to devise an actor–critic algorithm to learn the optimal hedging policy from data.
We compare our work with four recent related papers. (1) Bender and Thuan (2023) consider the continuous-time mean–variance portfolio selection problem with exploration under a jump-diffusion setting. Our paper differs from theirs in several aspects. First, they consider a specific application problem while we study RL for general controlled jump-diffusions. Second, they obtain the SDE for the exploratory state by taking the limit of the discrete-time exploratory dynamics, whereas our approach first derives the form of the infinitesimal generator of the sample state process and then infers the exploratory SDE from it. It is unclear how their approach works when dealing with general control problems. Finally, they do not consider how to develop RL algorithms based on their solution of the exploratory mean–variance portfolio selection, which we do in this paper. (2) Guo et al. (2023) consider continuous-time RL for linear–convex models with jumps. The scope and motivation are different from ours: they focus on the Lipschitz stability of feedback controls for this special class of control problems where the diffusion and jump terms are not controlled, and propose a least-square model-based algorithm and obtain sublinear regret guarantees in the episodic setting. By contrast, we consider RL for general jump-diffusions and develop model-free algorithms without considering regret bounds. (3) Denkert et al. (2024) aim to unify certain types of stochastic control problems by considering the so-called randomized control formulation, which leads to the same optimal value functions as those of the original problems. They develop a policy gradient representation and actor-critic algorithms for RL. The randomized control formulation is fundamentally different from the framework we are considering: therein the control is applied at a set of random time points generated by a random point process instead of at every time point as in our framework. (4) Bo et al. (2024) develop q-learning for jump-diffusions by using Tsallis’ entropy for regularization instead of Shannon’s entropy considered in our paper and in Jia and Zhou (2022b), Jia and Zhou (2023). (The paper Bo et al. (2024) came to our attention after a previous version of our paper was completed and posted.) While this entropy presents an interesting alternative for developing RL algorithms, it may make the exploratory control problem less tractable to solve and lead to policy distributions that are inefficient to sample for exploration in certain applications.
The remainder of the paper is organized as follows. In Section 2, we discuss the problem formulation. In Section 3, we present the theory of q-learning for jump-diffusions, followed by the discussion of q-learning algorithms in Section 4. In Section 5, we apply the general theory and algorithms to a mean–variance portfolio selection problem, and discuss the impact of jumps. Section 6 presents the application to a mean–variance option hedging problem. Finally, Section 7 concludes. All the proofs are given in an appendix.
2 Problem Formulation and Preliminaries
For readers’ convenience, we first recall some basic concepts for one-dimensional (1D) Lévy processes, which can be found in standard references such as Sato (1999) and Applebaum (2009). A 1D process $X=\{X_t\}_{t\ge 0}$ is a Lévy process if it is continuous in probability, has stationary and independent increments, and satisfies $X_0=0$ almost surely. Denote the jump of $X$ at time $t$ by $\Delta X_t = X_t - X_{t-}$, and let $\mathcal{B}_0$ be the collection of Borel sets of $\mathbb{R}$ whose closure does not contain $0$. The Poisson random measure (or jump measure) of $X$ is defined as
(1) |
which gives the number of jumps up to time $t$ with jump size in a Borel set $A$ away from $0$. The Lévy measure of $X$ is defined by $\nu(A) = \mathbb{E}[N(1,A)]$ for $A \in \mathcal{B}_0$, which gives the expected number of jumps with size in $A$ per unit time and is finite. For any $A \in \mathcal{B}_0$, $\{N(t,A)\}_{t\ge 0}$ is a Poisson process with intensity $\nu(A)$. The differential forms of these two measures are written as $N(\mathrm{d}t,\mathrm{d}z)$ and $\nu(\mathrm{d}z)$, respectively. If $\nu$ is absolutely continuous, we use the same letter for the measure and its density function. The Lévy measure must satisfy the integrability condition
(2) |
However, it is not necessarily a finite measure on $\mathbb{R}\setminus\{0\}$ but is always a $\sigma$-finite measure. The Lévy process is said to have finite (infinite) jump activity if $\nu(\mathbb{R}\setminus\{0\})<\infty$ ($=\infty$). The number of jumps on any finite time interval is finite in the former case but infinite in the latter. For any Borel set $A$ whose closure contains $0$, $\nu(A)$ is finite in the finite activity case but infinite otherwise. Finally, the compensated Poisson random measure is defined as $\tilde N(\mathrm{d}t,\mathrm{d}z) = N(\mathrm{d}t,\mathrm{d}z) - \nu(\mathrm{d}z)\mathrm{d}t$. For any $A \in \mathcal{B}_0$, the process $\{N(t,A) - t\nu(A)\}_{t\ge 0}$ is a martingale.
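To make these objects concrete in the finite-activity case, the following minimal Python sketch simulates a compound Poisson process and evaluates its Poisson random measure by counting jumps up to a given time with sizes in a given Borel set; the intensity, the Gaussian jump-size law and the particular set used below are illustrative choices rather than quantities from this paper.

```python
import numpy as np

def simulate_compound_poisson(T, lam, jump_sampler, rng):
    """Simulate jump times and sizes of a compound Poisson process on [0, T].

    lam is the total jump intensity (finite-activity case) and jump_sampler draws
    i.i.d. jump sizes from the normalized Levy measure."""
    n_jumps = rng.poisson(lam * T)
    times = np.sort(rng.uniform(0.0, T, size=n_jumps))
    sizes = jump_sampler(n_jumps)
    return times, sizes

def poisson_random_measure(times, sizes, t, A):
    """N(t, A): the number of jumps up to time t with size in the set A,
    where A is passed as a boolean-valued predicate on jump sizes."""
    return int(np.sum((times <= t) & A(sizes)))

rng = np.random.default_rng(0)
times, sizes = simulate_compound_poisson(
    T=1.0, lam=2.0, jump_sampler=lambda n: rng.normal(-0.05, 0.1, size=n), rng=rng)
# Count jumps of size below -0.1 that occurred during the first half of the horizon.
print(poisson_random_measure(times, sizes, t=0.5, A=lambda z: z < -0.1))
```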
Throughout the paper, we use the following notations. By convention, all vectors are column vectors unless otherwise specified. We use and to denote the space of -dimensional vectors and matrices, respectively. For matrix , we use for its transpose, for its Euclidean/Frobenius norm, and write . Given two matrices and of the same size, we denote by the inner product between and , which is given by . For a positive semidefinite matrix , we write , where is its singular value decomposition with as two orthogonal matrices and as a diagonal matrix, and is the diagonal matrix whose entries are the square roots of those of . We use to denote the function , and to denote the function value of at . We use both and for the first and second (partial) derivatives of a function with respect to . We write the minimum of two values and as . The notation denotes the uniform distribution over set while refers to the Gaussian distribution with mean vector and covariance matrix .
2.1 Classical stochastic control of jump-diffusions
Consider a filtered probability space satisfying the usual hypothesis. Assume that this space is rich enough to support , a standard Brownian motion in , and independent one-dimensional (1D) Lévy processes , which are also independent of . Let be the vector of their Poisson random measures, and similarly define and . The controlled system dynamics are governed by the following Lévy SDE (Øksendal and Sulem 2007, Chapter 3):
(3) |
where
(4) |
is the control or action at time , is the control space, and is the control process assumed to be predictable with respect to . We denote the -th column of the matrix by . The goal of stochastic control is, for each initial time-state pair of (3), to find the optimal control process that maximizes the expected total reward:
(5) |
where is a discount factor that measures the time value of the payoff.
The stochastic control problem (3)–(5) is very general; in particular, control processes affect the drift, diffusion and jump coefficients. We now make the following assumption to ensure well-posedness of the problem. Define .
Assumption 1.
Suppose the following conditions are satisfied by the state dynamics and reward functions:
-
(i)
are all continuous functions in their respective arguments;
-
(ii)
(local Lipschitz continuity) for any and any , there exist positive constants and such that ,
(6) (7) -
(iii)
(linear growth in ) for any , there exist positive constants and such that ,
(8) (9) -
(iv)
there exists a constant such that
(10) for some and some ; moreover, is finite.
Conditions (i)-(iii) guarantee the existence of a unique strong solution to the Lévy SDE (3) with initial condition . Furthermore, for any , there exists a constant such that
(11) |
see (Kunita 2004, Theorem 3.2) and (Situ 2006, Theorem 119). With the moment estimate (11), it follows that condition (iv) implies that the expected value in (5) is finite.
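To illustrate how the coefficients of the Lévy SDE (3) enter a simulation, below is a minimal Euler-type step for a one-dimensional controlled jump-diffusion with a single finite-activity jump component; the callables b, sigma and gamma stand for the (generally unknown) model coefficients, lam is the jump intensity, and the compensator correction is assumed absorbed into the drift for brevity. This is a sketch under these simplifying assumptions, not the paper's algorithm.

```python
import numpy as np

def euler_step_jump_diffusion(t, x, a, dt, b, sigma, gamma, lam, jump_sampler, rng):
    """One Euler step of dX = b(t,X,a) dt + sigma(t,X,a) dW + jump part on [t, t+dt].

    Jumps arrive at rate lam; each mark z is drawn by jump_sampler and moves the
    state by gamma(t, x, a, z). All coefficient callables are placeholders."""
    dW = np.sqrt(dt) * rng.standard_normal()
    x_next = x + b(t, x, a) * dt + sigma(t, x, a) * dW
    for _ in range(rng.poisson(lam * dt)):
        x_next += gamma(t, x, a, jump_sampler(rng))
    return x_next
```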
Let be the infinitesimal generator associated with the Lévy SDE (3). Under condition (iii), we have for . Thus, we can write in the following form:
(12) |
where is the gradient and is the Hessian matrix.
We recall Itô’s formula, which will be frequently used in our analysis; see e.g. (Øksendal and Sulem 2007, Theorem 1.16). Let be the unique strong solution to (3). For any , we have
(13)–(15)
It is known that the Hamilton–Jacobi–Bellman (HJB) equation for the control problem (3)–(5) is given by
(16)
where is given in (2.1). Under proper conditions, the solution to the above equation is the optimal value function for control problem (5). Moreover, the following function, which maps a time-state pair to an action:
(17) |
is the optimal feedback control policy of the problem.
Given a smooth function we define the Hamiltonian by
(18)–(19)
Then, the HJB equation (16) can be recast as
(20)
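For concreteness, the Hamiltonian in (18)–(19) and the recast HJB equation (20) can be spelled out in standard notation as below (a sketch consistent with the formulations in Øksendal and Sulem 2007); the number of jump components, the discount rate and the terminal reward symbols used here are notational assumptions, and the paper's exact symbols may differ.

```latex
% Sketch of the Hamiltonian and recast HJB equation for controlled jump-diffusions;
% \ell (number of Levy components), \beta (discount rate) and h (terminal reward)
% are assumed symbols.
\begin{align*}
H(t,x,a,v)
  &= b(t,x,a)\cdot \nabla v(t,x)
   + \tfrac12 \operatorname{tr}\!\big(\sigma\sigma^{\top}(t,x,a)\,\nabla^2 v(t,x)\big)
   + r(t,x,a) \\
  &\quad + \sum_{i=1}^{\ell} \int_{\mathbb{R}\setminus\{0\}}
     \Big[ v\big(t, x+\gamma_i(t,x,a,z)\big) - v(t,x)
           - \gamma_i(t,x,a,z)\cdot \nabla v(t,x) \Big]\,\nu_i(\mathrm{d}z),
\end{align*}
\begin{equation*}
\frac{\partial v}{\partial t}(t,x)
 + \sup_{a \in \mathcal{A}} H(t,x,a,v) - \beta\, v(t,x) = 0,
 \qquad v(T,x) = h(x).
\end{equation*}
```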
2.2 Randomized control and exploratory formulation
A key idea of RL is to explore the unknown environment by randomizing the actions. Let be a given stochastic feedback policy, where is the set of probability density functions defined on . Let be sampled from (i.e. is a copy of ), which is a deterministic feedback policy. Applying this policy to (3), we get for ,
(21) |
Assuming the solution to the above SDE uniquely exists, we say the action process is generated from . Note that depends on the specific sample , which we do not write out explicitly for notational simplicity. In the following, we will also write as .
We need to enlarge the original filtered probability space to include the additional randomness from sampling actions. Following Jia and Zhou (2022b), Jia and Zhou (2023), we assume that the probability space is rich enough to support independent copies of an -dimensional random vector uniformly distributed over , where is the dimension of the control space. These copies are also independent of and . Let be the new sigma-algebra generated by and the copies of the uniform random vector up to time . The new filtered probability space is , where is the product extension from and they coincide when restricted to .
Fix a stochastic feedback policy and an initial time-state pair . An action process generated from is an -progressively measurable process that is also predictable. Consider the sample state process that follows the SDE
(22) |
Once again, bear in mind that the above equation depends on a specific sample ; so there are in fact infinitely many similar equations, each corresponding to a sample of .
To encourage exploration, we add an entropy regularizer to the running reward, leading to
(23) |
where is the expectation conditioned on , taken with respect to the randomness in the Brownian motion, the Poisson random measures, and the action randomization. Here is the temperature parameter that controls the level of exploration. The function is called the value function of the policy . The goal of RL is to find the policy that maximizes the value function among admissible policies that are to be specified in Definition 1 below.
For theoretical analysis, we consider the exploratory dynamics of , which represent the key averaged characteristics of the sample state process over infinitely many randomized actions. In the case of diffusions, Wang et al. (2020) derive such exploratory dynamics by applying a law of large numbers argument to the first two moments of the diffusion process. Their approach, however, cannot be applied to jump-diffusions. Here, we get around this difficulty by studying the infinitesimal generator of the sample state process, from which we will identify the dynamics of the exploratory state process.
To this end, let , which is continuously differentiable in and twice continuously differentiable in with compact support, and we need to analyze . Fixing , consider the SDE (22) starting from with independent copies of . Let be very small and assume the corresponding actions are fixed from to . Denote by the value of the state process corresponding to at . Then
(24)–(28)
Using the law of large numbers, we obtain
(29) |
where
(30) |
These “exploratory” drift and diffusion coefficients are consistent with those in Wang et al. (2020). It is tempting to think that the exploratory jump coefficient is similarly the average of with respect to ; but unfortunately this is generally not true. This is in turn one of the main distinctive features of studying RL for jump-diffusions.
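As an illustration of (30), the exploratory drift and (squared) diffusion coefficients are policy-weighted averages and can be approximated by Monte Carlo over actions sampled from the stochastic policy; the sketch below is for a one-dimensional state, and the coefficient callables and the policy sampler are placeholders rather than objects defined in the paper.

```python
import numpy as np

def exploratory_coefficients(t, x, sample_policy, b, sigma, n_samples=10_000, rng=None):
    """Monte Carlo estimates of the exploratory drift and diffusion at (t, x):
    the averages of b(t, x, a) and sigma(t, x, a)^2 over actions a ~ pi(.|t, x)."""
    rng = rng or np.random.default_rng()
    actions = sample_policy(t, x, n_samples, rng)
    b_tilde = np.mean([b(t, x, a) for a in actions])
    var_tilde = np.mean([sigma(t, x, a) ** 2 for a in actions])
    return b_tilde, np.sqrt(var_tilde)
```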
We approach the problem by analyzing the integrals in (28). Using the second-order Taylor expansion, the boundedness of for and condition (iii) of Assumption 1, we obtain that for fixed and each ,
(31)–(32)
for some constant , which is independent of . It follows that
(33) |
Combining (29) and (33), the infinitesimal generator of the sample state process is given by the probability weighted average of the generator of the classical controlled jump-diffusion, i.e.,
(34) |
Next, we reformulate the integrals in (33) to convert them to the same form as (2.1), from which we can infer the SDE for the exploratory state process.
Recall that the Poisson random measure with intensity measure () is defined over the product space . We can also interpret as a counting measure associated with a random configuration of points (Cont and Tankov 2004, Section 2.6.3), i.e.,
(35) |
where is the Dirac measure with mass one at point , is the arrival time of the th jump of the Lévy process , and is the size of this jump. We can interpret as the mark of the th event.
At , the size of the jump in the controlled state under policy is given by , where is the state right before the jump occurs and is the action generated from the feedback policy . When the policy is deterministic, the generated action is determined by and thus the size of the jump in is a function of . By contrast, when becomes stochastic, an additional random noise is introduced at that determines the generated action together with . Consequently, the size of the jump in is a function of plus the random noise for exploration at .
This motivates us to construct new Poisson random measures on an extended space to capture the effect of random noise on jumps for stochastic policies. Specifically, for each , we construct a new Poisson random measure, denoted by , on the product space , with its intensity measure given by . Here, is the realized value of the -dimensional random vector that follows , which is the random noise introduced in the probability space for exploration. The new Poisson random measure is also a counting measure associated with a random configuration of points :
(36) |
where and are the same as above, and is the random vector that generates random exploration at . Hence, the th event is marked by both and under . We let .
In general, for any -dimensional random vector that follows distribution , we can find a measurable function such that , where . As an example, consider . We can represent it as , where is the cumulative distribution function of the univariate standard normal distribution and is a vector obtained by applying to each component of .
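As an illustration of the Gaussian example just described, the following sketch maps a uniform random vector to a draw from a multivariate normal distribution by applying the inverse standard normal CDF componentwise and then an affine transformation with the matrix square root of the covariance (computed here by eigendecomposition); the specific mean and covariance below are illustrative.

```python
import numpy as np
from scipy.stats import norm

def sqrtm_psd(Sigma):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    w, U = np.linalg.eigh(Sigma)
    return U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

def psi_gaussian(z, mu, Sigma):
    """Map z ~ Uniform((0,1)^m) to a sample from N(mu, Sigma): the componentwise
    inverse CDF gives independent standard normals, which are then shifted and scaled."""
    return mu + sqrtm_psd(Sigma) @ norm.ppf(z)

rng = np.random.default_rng(0)
action = psi_gaussian(rng.uniform(size=2),
                      mu=np.array([0.0, 1.0]),
                      Sigma=np.array([[1.0, 0.3], [0.3, 0.5]]))
```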
For the stochastic feedback policy , using we obtain
(37)–(38)
It follows that the infinitesimal generator of the sample state process can be written as
(39)–(40)
Comparing (40) with (2.1) in terms of the integral part that characterizes the behavior of jumps, we observe that the new measure replaces the Lévy measure and integration is done over an extended space to capture the effect of random exploration on jumps. The jump coefficient function that generates the jump size in the controlled state process given the Lévy jump with size and control variable is still the same. However, in (40) the control is generated from as , where is the realized value of the random noise introduced for exploration. In the following, we will also write as whenever using the latter simplifies notations.
Based on (40), we see that the exploratory state should be the solution to the following Lévy SDE:
(41)–(42)
which we call the exploratory Lévy SDE. The solution process, if it exists, is denoted by and called the exploratory (state) process. As we explain below, this process informs us of the behavior of the key characteristics of the sample state process after averaging over infinitely many actions sampled from the stochastic policy .
In general, the sample state process defined by (22) is a semimartingale, as it is the sum of three processes: the drift process that has finite variation (the first term in (22)), the continuous (local) martingale driven by the Brownian motion (the second term in (22)), and the discontinuous (local) martingale driven by the compensated Poisson random measure (the third term in (22)). Any semimartingale is fully determined by three characteristics: the drift, the quadratic variation of the continuous local martingale, and the compensator of the random measure associated with the process’s jumps (the compensator gives the jump intensity); see Jacod and Shiryaev (2013) for detailed discussions of semimartingales and their characteristics.
For the sample state process, given that and the action sampled from is , the characteristics over an infinitesimally small time interval are given by the triplet .
Now consider the exploratory state process , which is also a semimartingale by (42). Its characteristics over an infinitesimally small time interval with are given by the triplet , where the third characteristic is obtained by calculating for Lévy jumps with size from . Using (30), we have
(43)–(44)
Thus, the semimartingale characteristics of the exploratory state process are the averages of those of the sample state process over action randomization.
Remark 1.
In general, there may be other ways to formulate the exploratory SDE in the jump-diffusion case as we may be able to obtain alternative representations for the infinitesimal generator based on (34). However, the law of the exploratory state would not change because its generator stays the same.
A technical yet foundational question is the well-posedness (i.e. existence and uniqueness of solution) of the exploratory SDE (42), which we address below. For that we first specify the class of admissible strategies, which is the same as that considered in Jia and Zhou (2023) for pure diffusions.
Definition 1.
A policy is called admissible, if
-
(i)
, for every and is measurable;
-
(ii)
is continuous in , i.e., as . Furthermore, for any , there is a constant independent of such that
(45) -
(iii)
, for some and is a positive constant; for any , for some and is a positive constant that can depend on .
Next, we establish the well-posedness of (42) under any admissible policy. The result of the next lemma regarding and is provided in the proof of Lemma 2 in Jia and Zhou (2022b), which uses property (ii) of admissibility.
Lemma 1.
Under Assumption 1, for any admissible policy , the functions and have the following properties:
-
(i)
(local Lipschitz continuity) for , there exists a constant such that ,
(46) -
(ii)
(linear growth in ) there exists a constant such that ,
(47)
We now establish similar properties for in the following lemmas, whose proofs are relegated to the appendix.
Lemma 2 (linear growth in ).
Under Assumption 1, for any admissible and any , there exists a constant that can depend on such that ,
(48) |
For the local Lipschitz continuity of , we make an additional assumption.
Assumption 2.
For , the following conditions hold.
-
(i)
For any and any , there exists a constant that can depend on and such that
(49) -
(ii)
For any and any , there exists a constant that can depend on and such that
(50)
For a stochastic feedback policy , we have . Clearly, Assumption 2-(ii) holds provided that and are locally Lipschitz continuous in .
Lemma 3 (local Lipschitz continuity).
With Lemmas 1 to 3, we can now apply (Kunita 2004, Theorem 3.2) and (Situ 2006, Theorem 119) to obtain the well-posedness of (42) along with the moment estimate of its solution.
Proposition 1.
It should be noted that the conditions imposed in Assumptions 1 and 2 are sufficient but not necessary for obtaining the well-posedness and moment estimate of the exploratory Lévy SDE (42). For a specific problem, weaker conditions may suffice for these results if we exploit special structures of the problem.
From the previous discussion, we see that for a given admissible stochastic feedback policy , the sample state process and the exploratory state process associated with share the same infinitesimal generator and hence the same probability law. This is justified by (Ethier and Kurtz 1986, Chapter 4, Theorem 4.1) on the condition that the function space is a core of the generator, which we assume to hold. It follows that
(53) |
if (52) holds.
2.3 Exploratory HJB equation
With the exploratory dynamics (42), for any admissible stochastic policy the value function given by (23) can be rewritten as
(54)–(55)
Under Assumption 1 and using the admissibility of and (52), it is easy to see that has polynomial growth in . We provide the Feynman–Kac formula for this function in Lemma 4 by working with the representation (55). In the proof, we consider the finite and infinite jump activity cases separately because special care is needed in the latter. We revise Assumption 1 by adding one more condition for this case.
Assumption 1′.
Conditions (i) to (iv) in Assumption 1 hold. We further assume condition (v): if , is bounded for any , , , and .
For Lemma 4, Lemma 5 and Theorem 1, we impose Assumption 1′ and assume that the exploratory SDE (42) is well-posed with the moment estimate (52). For simplicity, we do not explicitly mention these assumptions in the statement of the results.
Lemma 4.
Given an admissible stochastic policy , suppose there exists a solution to the following partial integro-differential equation (PIDE):
(56) |
with terminal condition , . Moreover, for some , satisfies
(57) |
Then is the value function of the policy , i.e. .
For ease of presentation, we henceforth assume the value function for any admissible stochastic policy .
Remark 2.
Next, we consider the optimal value function defined by
(58) |
where is the class of admissible strategies. The following result characterizes and the optimal stochastic policy through the so-called exploratory HJB equation.
Lemma 5.
Suppose there exists a solution to the exploratory HJB equation:
(59) |
with the terminal condition , where is the Hamiltonian defined in (19). Moreover, for some , satisfies
(60) |
and it holds that
(61) |
Then, the Gibbs measure or Boltzmann distribution
(62) |
is the optimal stochastic policy and provided that is admissible.
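As an illustration, when the action space is low-dimensional, the Gibbs measure (62) can be approximated and sampled on a finite grid of candidate actions; the sketch below assumes the Hamiltonian values on the grid are available (for instance from a learned q-function) and uses a generic temperature parameter temp.

```python
import numpy as np

def sample_gibbs_policy(H_values, actions, temp, rng):
    """Sample an action from the Gibbs measure proportional to exp(H(a)/temp),
    approximated on a discrete grid: H_values[i] is the Hamiltonian at actions[i]."""
    logits = np.asarray(H_values) / temp
    logits -= logits.max()        # numerical stabilization before exponentiation
    probs = np.exp(logits)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]
```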
3 q-Learning Theory
3.1 q-function and policy improvement
We present the q-learning theory for jump-diffusions, which includes both policy evaluation and policy improvement, now that the exploratory formulation has been set up. The theory can be developed similarly to Jia and Zhou (2023), so we will just highlight the main differences in the analysis, skipping the parts that are similar.
Definition 2.
It is an immediate consequence of Lemma 4 that the q-function satisfies
(65) |
The following policy improvement theorem can be proved similarly to (Jia and Zhou 2023, Theorem 2) by using the arguments in the proof of Lemma 4.
Theorem 1 (Policy Improvement).
For any given , define
If then
(66) |
Moreover, if the following map
(67)–(68)
has a fixed point , then is an optimal policy.
3.2 Martingale characterization of the q-function
Next we derive the martingale characterization of the q-function associated with a policy , assuming that its value function has already been learned and known. We will highlight the major differences in the proof, provided in the appendix, compared with the pure diffusion setting. For Theorem 2, Theorem 3, and Theorem 4, we impose Assumption 1 and assume that the moment estimate (53) for the sample state process holds, without explicitly mentioning them in the theorem statements.
Theorem 2.
Let a policy its value function and a continuous function be given. Assume that and both have polynomial growth in . Then the following results hold.
-
(i)
for all if and only if for any the following process
(69) is a -martingale, where is the sample state process defined in (22) with .
-
(ii)
If for all , then for any and any , the following process
(70) is a -martingale, where is the solution to (22) under with initial condition .
-
(iii)
If there exists such that for all , the process (70) is a -martingale where , then we have for all .
Moreover, in any of the three cases above, the q-function satisfies
(71) |
Remark 3.
Similar to Jia and Zhou (2023), Theorem 2-(i) facilitates on-policy learning, where learning the q-function of the given target policy is based on data generated by . On the other hand, Theorem 2-(ii) and -(iii) are for off-policy learning, where learning the q-function of is based on data generated by a different policy, called the behavior policy, .
Next, we extend Theorem 7 in Jia and Zhou (2023) and obtain a martingale characterization of the value function and the -function simultaneously. The proof is essentially the same and hence omitted.
Theorem 3.
Let a policy a function with polynomial growth and a continuous function be given satisfying
(72) |
Assume that and both have polynomial growth. Then
-
(i)
and are respectively the value function and the q-function associated with if and only if for all , the following process
(73) is a -martingale, where satisfies (22) with .
-
(ii)
If and are respectively the value function and the q-function associated with , then for any and for all , the following process
(74) is a -martingale, where satisfies (22) with .
-
(iii)
If there exists such that for all , the process (74) is a -martingale where , then we have and for all .
In any of the three cases above, if it holds that , then is the optimal policy and is the optimal value function.
3.3 Optimal q-function
We consider in this section the optimal q-function, i.e., the q-function associated with the optimal policy in (62). Based on Definition 2, we can define it by
(75) |
where is the optimal value function that solves the exploratory HJB equation in (59).
The following is the martingale condition that characterizes the optimal value function and the optimal q-function, which can be proved analogously to Theorem 9 in Jia and Zhou (2023).
Theorem 4.
Let a function and a continuous function be given satisfying
(76) |
Assume that and both have polynomial growth in . Then
-
(i)
If and are respectively the optimal value function and the optimal q-function, then for any and for all , the following process
(77) is a -martingale, where satisfies (22) with . Moreover, in this case, is the optimal stochastic policy.
-
(ii)
If there exists such that for all , the process (77) is a -martingale where , then and are respectively the optimal value function and the optimal q-function.
4 q-Learning Algorithms
In this section we present learning algorithms based on the martingale characterization of the q-function discussed in the previous section. We need to distinguish two cases, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed and integrated explicitly.
We first discuss the case when the normalizing constant in the Gibbs measure can be computed explicitly. We denote by and the parameterized function approximators for the optimal value function and the optimal q-function, respectively. In view of Theorem 4, these approximators are chosen to satisfy
(78) |
We can then update by enforcing the martingale condition discussed in Theorem 4 and applying the techniques developed in Jia and Zhou (2022a). This procedure has been discussed in detail in Section 4.1 of Jia and Zhou (2023), and hence we omit it here. For the reader’s convenience, we present Algorithms 1 and 2, which summarize the offline and online q-learning algorithms respectively. These algorithms are based on the so-called martingale orthogonality condition in Jia and Zhou (2022a), and the typical choices of test functions in these algorithms are , and , where is the policy generated by . Note that these two algorithms are identical to Algorithms 2 and 3 in Jia and Zhou (2023).
Inputs: initial state , horizon , time step , number of episodes , number of mesh grids , initial learning rates and a learning rate schedule function (a function of the number of episodes), functional forms of parameterized value function and q-function satisfying (78), functional forms of test functions and , and temperature parameter .
Required program (on-policy): environment simulator that takes current time–state pair and action as inputs and generates state at time and instantaneous reward at time as outputs. Policy .
Required program (off-policy): observations including the observed actions, rewards, and state trajectories under the given behavior policy at the sampling time grid with step size .
Learning procedure:
Inputs: initial state , horizon , time step , number of mesh grids , initial learning rates and learning rate schedule function (a function of the number of episodes), functional forms of parameterized value function and q-function satisfying (78), functional forms of test functions and , and temperature parameter .
Required program (on-policy): environment simulator that takes current time–state pair and action as inputs and generates state at time and instantaneous reward at time as outputs. Policy .
Required program (off-policy): observations including the observed actions, rewards, and state when the current time-state pair is under the given behavior policy at the sampling time grid with step size .
Learning procedure:
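As a minimal illustration of the updates performed in Algorithms 1 and 2 (whose detailed pseudocode is not reproduced above), the following Python sketch carries out one offline update from a single episode by enforcing the martingale condition, with the test functions taken to be the parameter gradients of the value function and q-function; the discount rate beta, the data layout and the exact temporal-difference increment are assumptions patterned after the diffusion-case algorithms of Jia and Zhou (2023), not a verbatim transcription. The online variant applies the same increment over a mini-batch of recent observations instead of a full episode.

```python
import numpy as np

def offline_q_update(traj, V, q, grad_V, grad_q, theta, psi,
                     alpha_theta, alpha_psi, beta=0.0):
    """One offline update of (theta, psi) from one episode.

    traj: list of (t_k, x_k, a_k, r_k); the last entry carries the terminal time and
    state with a_k = r_k = None. V, q and their parameter gradients are callables.
    The test functions are chosen as grad_V and grad_q (a typical choice)."""
    d_theta = np.zeros_like(theta)
    d_psi = np.zeros_like(psi)
    for (t0, x0, a0, r0), (t1, x1, _, _) in zip(traj[:-1], traj[1:]):
        dt = t1 - t0
        # discounted temporal-difference increment of the candidate martingale
        td = (np.exp(-beta * t1) * V(t1, x1, theta)
              - np.exp(-beta * t0) * V(t0, x0, theta)
              + np.exp(-beta * t0) * (r0 - q(t0, x0, a0, psi)) * dt)
        d_theta += grad_V(t0, x0, theta) * td
        d_psi += grad_q(t0, x0, a0, psi) * td
    return theta + alpha_theta * d_theta, psi + alpha_psi * d_psi
```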
When the normalizing constant in the Gibbs measure is not available, we take the same approach as in Jia and Zhou (2023) to develop learning algorithms. Specifically, we consider , which is a family of density functions of some tractable distributions, e.g. multivariate normal distributions. Starting from a stochastic policy in this family, we update the policy by considering the optimization problem
Specifically, using gradient descent, we can update as in Jia and Zhou (2023), by
(79) |
In the above updating rule, we need only the values of the q-function along the trajectory – the “data” – , instead of its full functional form. These values can be learned through the “temporal difference” of the value function along the data. To see this, applying Itô’s formula (15) to , we have
(80)–(81)
We may ignore the and terms, which are martingale differences with mean zero, and then the updating rule in (79) becomes
(82)–(83)
Using as the parameterized function approximator for , we arrive at the updating rule for the policy parameter :
(84)–(85)
Therefore, we can update using the policy evaluation (PE) methods in Jia and Zhou (2022a), and update using the above rule, leading to algorithms of the actor–critic type.
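Schematically, the actor update (84)–(85) can be rendered as follows, with the temporal difference of the learned value function standing in for the q-values along the data and the entropy term entering through the log-density of the policy; the function names, the data layout and the handling of discounting are illustrative assumptions rather than the paper's exact formulas.

```python
import numpy as np

def actor_update(traj, V, log_pi, grad_log_pi, theta, phi, alpha_phi, temp):
    """One policy (actor) update from one episode, using the temporal difference of
    the critic V(., .; theta) as a proxy for q dt along the data.

    traj: list of (t_k, x_k, a_k, r_k); the last entry carries the terminal time/state."""
    d_phi = np.zeros_like(phi)
    for (t0, x0, a0, r0), (t1, x1, _, _) in zip(traj[:-1], traj[1:]):
        dt = t1 - t0
        q_dt = V(t1, x1, theta) - V(t0, x0, theta) + r0 * dt   # TD proxy for q dt
        # entropy-regularized score-function (policy gradient) increment
        d_phi += grad_log_pi(t0, x0, a0, phi) * (q_dt - temp * log_pi(t0, x0, a0, phi) * dt)
    return phi + alpha_phi * d_phi
```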
To conclude, we are able to use the same RL algorithms to learn the optimal policy and optimal value function, without having to know a priori whether the unknown environment entails a pure diffusion or a jump-diffusion. This important conclusion is based on the theoretical analysis carried out in the previous sections.
5 Application: Mean–Variance Portfolio Selection
We now present an applied example of the general theory and algorithms derived. Consider investing in a market where there are a risk-free asset and a risky asset (e.g. a stock or an index). The risk-free rate is given by a constant and the risky asset price process follows
(86) |
Let be the discounted wealth value at time , and let be the discounted dollar value of the investment in the risky asset. The self-financing discounted wealth process follows
(87) |
where is the Sharpe ratio of the risky asset, given by
(88) |
We assume
(89) |
Condition (89) implies that and are finite for every ; see (Cont and Tankov 2004, Proposition 3.14). We set
(90) |
Fix the investment horizon as . The mean-variance (MV) portfolio selection problem considers
(91) |
We seek the optimal pre-committed strategy for the MV problem as in Zhou and Li (2000). We can transform the above constrained problem into an unconstrained one by introducing a Lagrange multiplier, which yields
(92) |
Note that the optimal solution to the unconstrained minimization problem depends on , and we can obtain the optimal multiplier by solving .
The exploratory formulation of the problem is
(93) |
where the discounted wealth under a stochastic policy follows
(94) |
5.1 Solution of the exploratory control problem
We consider the HJB equation for problem (93):
(95) |
with the terminal condition . Note that the supremum becomes an infimum and the sign in front of flips compared with (59) because we consider a minimization problem here. The Hamiltonian of the problem is given by
(96)–(97)
where We take the following ansatz for the solution of the HJB equation (95):
(98) |
As is quadratic in , we can easily calculate the integral term in the Hamiltonian and obtain
(99) |
The probability density function that minimizes the integral in (95) is given by
(100) |
which is a candidate for the optimal stochastic policy. From (99), we obtain
(101) |
Substituting it back into the HJB equation (95), we obtain the nonlinear PDE
(102)–(103)
We plug the ansatz (98) into the above PDE and obtain that satisfies
(104) |
and satisfies
(105) |
These two ordinary differential equations can be solved analytically, and we obtain
(106)–(107)
It follows that
(108) |
It is straightforward to verify that is admissible by checking the conditions in Definition 1. Furthermore, solves the HJB equation (95) and has quadratic growth. Therefore, by Lemma 5, we have the following conclusion.
Proposition 2.
For the unconstrained MV problem (92), the optimal value function and the optimal stochastic policy .
When there are no jumps, we have and thus recover the expressions of the optimal value function and optimal policy derived in Wang and Zhou (2020) for the unconstrained MV problem in the pure diffusion setting.
5.2 Parametrizations for q-learning
It is important to observe that the optimal value function, optimal policy and the Hamiltonian given by (99) take the same structural forms regardless of the presence of jumps, while the only differences lie in the constant coefficients in those functions. However, those coefficients are unknown anyway and will be parameterized in the implementation of our RL algorithms. Consequently, we can use the same parameterizations for the optimal value function and optimal q-function for learning as in the diffusion setting of Jia and Zhou (2023). This important insight, reached only after a rigorous theoretical analysis, shows that the continuous-time RL algorithms are robust to the presence of jumps and essentially model-free, at least for the MV portfolio selection problem.
Following Jia and Zhou (2023), we parametrize the value function as
(109) |
and the q-function as
(110) |
Let and . The policy associated with the parametric q-function is . In addition to and , we learn the Lagrange multiplier in the same way as in Jia and Zhou (2023) by the stochastic approximation algorithm that updates with a learning rate after a fixed number of iterations.
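For illustration, one convenient way to encode the quadratic/Gaussian structure derived above is the following pair of parameterized critic and actor; this is a sketch whose functional forms mirror the diffusion-setting parameterizations of Jia and Zhou (2023), and all coefficients below are learnable placeholders rather than the paper's exact formulas.

```python
import numpy as np

def V_param(t, x, theta, w, T):
    """Critic: quadratic-in-wealth value function ansatz with learnable time factors.
    w denotes the (learned) Lagrange-multiplier-related target level."""
    return (x - w) ** 2 * np.exp(-theta[2] * (T - t)) + theta[1] * t ** 2 + theta[0] * t

def sample_policy(t, x, phi, w, T, rng):
    """Actor: Gaussian policy with mean linear in (x - w) and variance shrinking as
    t approaches the terminal time T (for phi[2] > 0); all phi entries are learnable."""
    mean = -phi[0] * (x - w)
    var = np.exp(phi[1] + phi[2] * (T - t))
    return rng.normal(mean, np.sqrt(var))
```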
5.3 Simulation study
We assess the effect of jumps on the convergence behavior of our algorithms via a simulation study. We use the same basic setting as in Jia and Zhou (2023): , , year, years (corresponding to one trading day), and a chosen temperature parameter . We consider two market simulators: one is given by the Black–Scholes (BS) model and the other is Merton’s jump-diffusion (MJD) model in which the Lévy density is a scaled Gaussian density, i.e.,
(111) |
where is the arrival rate of the Poisson jumps. The Gaussian assumption is a common choice in the finance literature for modeling the jump-size distribution (see e.g. Merton 1976, Bates 1991, Das 2002), partly due to its tractability for statistical estimation and partly because heavy-tailed distributions may not be easily identified from real data when the number of jumps is limited (see Heyde and Kou 2004).
Under the latter model, we have
(112) |
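A minimal simulator of the MJD environment can be written as follows; the parameters mu, sigma, lam, m and delta denote the drift, the volatility, the jump arrival rate, and the mean and standard deviation of the Gaussian jump sizes, respectively, and the drift convention (e.g., compensation of the jump part) is left to the user. This is an illustrative sketch, not the exact simulator used in the experiments.

```python
import numpy as np

def simulate_mjd_path(S0, mu, sigma, lam, m, delta, T, dt, rng):
    """Simulate one price path of Merton's jump-diffusion: Brownian log returns plus
    compound Poisson jumps with N(m, delta^2) jump sizes arriving at rate lam."""
    n_steps = int(round(T / dt))
    log_S = np.empty(n_steps + 1)
    log_S[0] = np.log(S0)
    for k in range(n_steps):
        diffusion = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        jump = rng.normal(m, delta, size=rng.poisson(lam * dt)).sum()
        log_S[k + 1] = log_S[k] + diffusion + jump
    return np.exp(log_S)

rng = np.random.default_rng(42)
path = simulate_mjd_path(S0=100.0, mu=0.05, sigma=0.15, lam=0.5, m=-0.03,
                         delta=0.08, T=1.0, dt=1.0 / 250, rng=rng)
```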
To mimic the real market, we set the parameters of these two simulators by estimating them from the daily data of the S&P 500 index using maximum likelihood estimation (MLE). Our estimation data cover a long period from the beginning of 2000 to the end of 2023. In Table 1, we summarize the estimated parameter values (used for the simulators) and the corresponding value of in the optimal policy. Note that although we use a stochastic policy to interact with the environment during training to update the policy parameters, for actual execution of portfolio selection we apply a deterministic policy which is the mean part of the optimal stochastic policy after it has been learned. So it is off-policy learning here. The advantage of doing so among others is to reduce the variance of the final wealth; see Huang et al. (2022) for a discussion on this approach. As a result, here we only display the values for in these two environments and use them as benchmarks to check the convergence of our algorithm (see Figure 1).
Simulator | Parameters | Optimal |
---|---|---|
BS | ||
MJD |
For offline learning, the Lagrange multiplier is updated after every iterations, and the parameter vectors and are initialized as zero vectors. The learning rates are set to be , , and with decay rate , where is the iteration index. In each iteration, we generate independent -year trajectories to update the parameters. We train the model for iterations.
We also consider online learning with equal to one trading day. We select a batch size of trading days and update the parameters after this number of observations coming in. We set for updating and initialize and as zero vectors. The learning rates are set as , , and with decay rate . Some of the rates are notably smaller under online learning because now we update with fewer observations and thus must be more cautious. The model is again trained for iterations.
Figure 1 plots the convergence behavior of offline and online learning under both simulators, or market environments (one with jumps and one without). The algorithms have converged after a sufficient number of iterations, whether jumps are present or not. This demonstrates that the convergence of the offline and online q-learning algorithms proposed in Jia and Zhou (2023) under the diffusion setting is robust to the presence of jumps for mean–variance portfolio selection. However, jumps in the environment can introduce more variability in the convergence process, as seen from the plots.
[Figure 1: Convergence of offline and online q-learning under the BS and MJD market simulators.]
5.4 Effects of jumps
The theoretical analysis so far in this section shows that, for the mean–variance problem, one does not need to know in advance whether or not the stock prices have jumps in order to carry out the RL task, because the optimal stochastic policy is Gaussian and the corresponding value function and q-function have the same structures for parameterization irrespective of the presence of jumps. However, we stress that this is the exception rather than the rule. Here we give a counterexample.
Consider a modification of the mean–variance problem where the controlled system dynamics is
(113) |
with
(114) |
and the exploratory objective is
(115) |
Note that this is not a mean–variance portfolio selection problem because (113) does not correspond to a self-financing wealth equation with a reasonably modelled stock price process.
The Hamiltonian is given by
(116)–(117)
If an optimal stochastic policy exists, then it must be
(118) |
We show by contradiction that the optimal stochastic policy cannot be Gaussian in this case. Note that if there is no optimal stochastic policy, then this would already demonstrate that jumps matter, because the optimal stochastic policy for the case of no jumps exists and is Gaussian.
Remark 4.
The existence of the optimal stochastic policy in (118) is equivalent to the integrability of the quantity over . This integrability depends on the tail behavior of the Hamiltonian and, in particular, on the behavior of when is large.
Suppose the optimal stochastic policy is Gaussian for all , implying that the Hamiltonian is a quadratic function of . It then follows from (116) that there exist functions and such that
(119) |
We do not put a term independent of on the right-hand side because the left-hand side is zero when . Taking derivative with respect to , we obtain
(120) |
Setting , we get . It follows that
(121) |
Hence we have for any . Sending to zero yields for all . Therefore, we obtain from (119) that
(122) |
Taking the derivative in in the above, we have
(123) |
Thus is constant in or is affine in , leading to for some functions and . The resulting Hamiltonian becomes
(124) |
This is linear in and hence the integral does not exist. It follows that does not exist, which is a contradiction. Therefore, we have shown that under (114), the optimal stochastic policy either does not exist or is not Gaussian when it exists.
Remark 5.
The argument above works for for any .
6 Application: Mean–Variance Hedging of Options
The MV portfolio selection problem considered in Section 5 is an LQ problem. In this section, we present another application that is non-LQ. Consider an option seller who needs to hedge a short position in a European-style option that expires at time . The option is written on a risky asset whose price process is described by the SDE (86) with condition (89) satisfied. At the terminal time , the seller pays to the option holder. We assume that the seller’s hedging activity will not affect the risky asset price.
To hedge the random payoff, the seller constructs a portfolio consisting of the underlying risky asset and cash. We consider discounted quantities in the problem: the discounted risky asset price and discounted payoff , where is again the constant risk-free interest rate. As an example, for a put option with strike price , (), and , where . Here follows the SDE
(125) |
where is defined in (88).
We denote the discounted dollar value in the risky asset and the discounted value of the hedging portfolio at time by and , respectively. As in the MV portfolio selection problem, is the control variable and follows the SDE (87). The seller seeks a hedging policy to minimize the deviation from the terminal payoff. A popular formulation is mean–variance hedging (also known as quadratic hedging), which considers the objective
(126) |
Note that this expectation is taken under the real-world probability measure, rather than under any martingale measure for option pricing. An advantage of this formulation is that relevant data are observable in real world (but not necessarily in a risk-neutral world). Although objective (126) looks similar to that of the MV portfolio selection problem, the target the hedging portfolio tries to meet is now random instead of constant. Furthermore, the payoff is generally nonlinear in . Thus, MV hedging is not an LQ problem. Also, both and matter for the hedging decision at time , whereas is irrelevant for decision in the MV portfolio selection problem. This increase in the state dimension makes the hedging problem much more difficult to solve.
We consider the following exploratory formulation to encourage exploration for learning:
(127) |
where the discounted value of the hedging portfolio under a stochastic policy follows (94).
6.1 Solution of the exploratory control problem
We consider the HJB equation for problem (127):
(128) |
with terminal condition . The Hamiltonian of the problem is given by
(129)–(131)
where and .
We make the following ansatz for the solution of the HJB equation (128):
(132) |
With this, we can simplify the integral term in the Hamiltonian and obtain
(133)–(136)
The probability density function that minimizes the integral in (128) is given by
(137) |
which is a candidate for the optimal stochastic policy. From (136), we obtain that is given by
(138) |
Substituting it back into the HJB equation (128), we obtain the nonlinear PIDE
(139)–(141)
We plug the ansatz (132) into the above PIDE. After some lengthy calculations, we can collect similar terms and obtain the following equations satisfied by , and :
(142) |
(143)–(144)
and
(145)–(148)
The function is given by
(149) |
However, the two PIDEs cannot be solved in closed form in general. Below we first present some properties of and .
Lemma 6.
Next, we provide stochastic representations for the solutions to PIDEs (144) and (148), which will be subsequently exploited for our RL study. Construct a new probability measure from the given probability measure with the Radon–Nikodym density process defined by
(151) |
Condition (89) guarantees that is a true martingale with unit expectation. Standard measure-change results yield that, under ,
(152) |
are a standard Brownian motion and a compensated Poisson random measure, respectively. Using them, we can rewrite the SDE for as
(153) |
Let . Clearly, it is a Lévy process and also a martingale under condition (89). The process is the stochastic exponential of , and it follows from (Cont and Tankov 2004, Proposition 8.23) that is also a martingale under . The Feynman–Kac theorem gives the representation of the solution to PIDE (144) as
(154) |
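In practice, the representation (154) can be evaluated by plain Monte Carlo once a simulator of the discounted price under the new measure is available; in the sketch below, simulate_Y_Q is a hypothetical user-supplied routine returning the terminal value of one path simulated under the dynamics (153), and payoff is the discounted payoff function.

```python
import numpy as np

def option_value_mc(t, y, T, payoff, simulate_Y_Q, n_paths=100_000, rng=None):
    """Monte Carlo estimate of F(t, y) = E[payoff(Y_T) | Y_t = y] under the pricing measure.

    Discounting is already absorbed into the discounted price Y and the payoff, so no
    separate discount factor appears."""
    rng = rng or np.random.default_rng()
    terminal = np.array([simulate_Y_Q(t, y, T, rng) for _ in range(n_paths)])
    return float(np.mean(payoff(terminal)))
```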
We can view as the martingale measure for option valuation and interpret as the option price at time and underlying price . For PIDE (148), again by the Feynman–Kac theorem we obtain that its solution is given by
(155)–(156)
where
(157)–(158)
Using the expression for , we obtain as
(159) |
where
(160) |
Using Lemma 6, we can directly verify that is admissible. It is also easy to see that given by (132) has quadratic growth in and . Therefore, by Lemma 5, we have the following conclusion.
Proposition 3.
For the MV hedging problem (126), the optimal value function and the optimal stochastic policy under the assumptions that and is Lipschitz continuous in .
The mean part of the optimal stochastic policy comprises three terms. The first involves the delta of the option and hence gives the dollar amount of the risky asset required by delta hedging, which is used to hedge continuous price movements. The integral term shows the dollar amount of the risky asset that should be held to hedge discontinuous changes. The two hedges are combined using certain weights. The last term reflects the adjustment that needs to be made due to the discrepancy between the hedging portfolio value and the option price. It is easy to show that this mean is the optimal deterministic policy of problem (126).
The variance of the optimal stochastic policy is the same as that in the MV portfolio selection problem (cf. (108)); it decreases as time approaches the terminal time, which implies that one gradually reduces exploration over time.
6.2 Parametrizations and actor–critic learning
We use the previously derived solution of the exploratory problem as structural knowledge for RL. As we will see, the exploratory solution lends itself to natural parametrizations for the policy (“actor”) and value function (“critic”), but less so for parametrizing the q-function.
To simplify the integral appearing in (160), we first parametrize the Lévy density as in (111), which involves three parameters: the jump arrival rate, and the mean and standard deviation of the normal distribution for the jump size. We then use the Fourier-cosine method developed in Fang and Oosterlee (2009) to obtain an expression for the option price. Consider the log-price of the risky asset; applying Itô’s formula, we obtain
(161)
Noting (152), under the new measure we have
(162) |
where the corresponding Lévy measure becomes
(163)
The characteristic function of the log-price under the new measure is given by
(164)
where
(165)–(167)
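For reference, under a Merton jump-diffusion whose jump sizes are normal with mean $\mu_J$ and standard deviation $\sigma_J$ and arrive at rate $\lambda$, the characteristic function of the log-price increment over a horizon $\tau$ takes the standard form below; the symbols here are illustrative and need not coincide with those in (164)–(167), and the drift $b$ is whatever the measure in use dictates:
$$\varphi(u)=\exp\!\Big(\tau\Big[\,\mathrm{i}\,u\,b-\tfrac{1}{2}\sigma^{2}u^{2}+\lambda\big(e^{\mathrm{i}u\mu_J-\frac{1}{2}\sigma_J^{2}u^{2}}-1\big)\Big]\Big).$$
In particular, under a martingale measure with risk-free rate $r$ one has $b=r-\tfrac{1}{2}\sigma^{2}-\lambda\big(e^{\mu_J+\sigma_J^{2}/2}-1\big)$, which is the form used in the pricing sketch below.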
The Fourier-cosine method then computes the option price as
(168)
Here, the operator takes the real part of a complex number, the primed summation indicates that its first term is halved, the strike entering the formula is the discounted strike price of the call or put option under consideration, and the truncation region for the log-price is chosen wide enough so that the probability of leaving it under the new measure can be neglected. The payoff coefficients are given by Eqs. (24) and (25) in Fang and Oosterlee (2009) for call and put options, respectively, and they do not depend on the model parameters. We can also calculate the option delta by differentiating the price w.r.t. the underlying, which yields
(169)
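To make the pricing step concrete, the following is a minimal Python sketch of a COS pricer for a European put under a Merton jump-diffusion. The function names, the two-cumulant truncation rule, and the parameter defaults are our own illustrative choices and are not taken from the paper; the payoff coefficients follow the standard put formulas of Fang and Oosterlee (2009).

```python
import numpy as np

def merton_cf(u, tau, sigma, lam, muJ, sigJ, r=0.0):
    """Characteristic function of the log-return over horizon tau under a
    Merton jump-diffusion with the martingale drift (risk-free rate r)."""
    kappa = np.exp(muJ + 0.5 * sigJ**2) - 1.0            # mean relative jump size
    drift = r - 0.5 * sigma**2 - lam * kappa
    return np.exp(tau * (1j * u * drift - 0.5 * sigma**2 * u**2
                         + lam * (np.exp(1j * u * muJ - 0.5 * sigJ**2 * u**2) - 1.0)))

def cos_put_price(S0, K, tau, sigma, lam, muJ, sigJ, r=0.0, N=256, L=10.0):
    """European put price via the Fourier-cosine expansion of Fang & Oosterlee (2009)."""
    # truncation range [a, b] for the log-return from the first two cumulants
    c1 = (r - 0.5 * sigma**2 - lam * (np.exp(muJ + 0.5 * sigJ**2) - 1.0) + lam * muJ) * tau
    c2 = (sigma**2 + lam * (muJ**2 + sigJ**2)) * tau
    a, b = c1 - L * np.sqrt(c2), c1 + L * np.sqrt(c2)

    k = np.arange(N)
    omega = k * np.pi / (b - a)

    def chi(c, d):   # cosine coefficients of e^y on [c, d]
        return (1.0 / (1.0 + omega**2)) * (
            np.cos(omega * (d - a)) * np.exp(d) - np.cos(omega * (c - a)) * np.exp(c)
            + omega * np.sin(omega * (d - a)) * np.exp(d)
            - omega * np.sin(omega * (c - a)) * np.exp(c))

    def psi(c, d):   # cosine coefficients of 1 on [c, d]
        out = (np.sin(omega * (d - a)) - np.sin(omega * (c - a))) / np.where(k == 0, 1.0, omega)
        out[0] = d - c
        return out

    # put payoff (K - S)^+ is nonzero for log-moneyness in [a, 0]
    Vk = 2.0 / (b - a) * K * (psi(a, 0.0) - chi(a, 0.0))

    x = np.log(S0 / K)
    phi = merton_cf(omega, tau, sigma, lam, muJ, sigJ, r)
    terms = np.real(phi * np.exp(1j * omega * (x - a))) * Vk
    terms[0] *= 0.5                                      # first term of the sum is halved
    return np.exp(-r * tau) * np.sum(terms)
```

For instance, `cos_put_price(100.0, 100.0, 4/12, 0.2, 0.5, -0.05, 0.1)` values an at-the-money 4-month put under these illustrative parameters; the delta in (169) can be obtained analogously by differentiating the cosine series term by term.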
Actor parametrization. We follow the form of the optimal stochastic policy given in (159) to parametrize our policy. To obtain a parsimonious parametrization, we introduce five learnable parameters, which respectively represent the agent’s understanding (acquired through her learning) of the drift, the volatility, the jump arrival rate, and the mean and standard deviation of the jump size of the risky asset. We emphasize that in our learning algorithm these parameters are not estimated by any statistical method (and hence they do not necessarily correspond to the true dynamics of the risky asset); rather, they will be updated based on the hedging experiences (via exploration and exploitation) gained from the real environment. To implement (159), we also need two dependent parameters that are calculated by
(170)
These two dependent parameters reflect the agent’s understanding of the corresponding model quantities, and they will be updated in the learning process via the above equations whenever the independent parameters are updated. For notational convenience, we collect all these parameters into a single vector.
We use these parameters to replace their counterparts in (159) to obtain a parametrized policy. For the mean part, we calculate the option price and its delta by plugging the parameters into (168) and (169), with the resulting price and delta denoted accordingly, and approximate the remaining integral by Gauss–Hermite (G-H) quadrature, which yields
(171)
where the weights and abscissas of the G-H rule are predetermined and independent of the learnable parameters. For the variance part, we use (149) with the learnable parameters substituted in. Thus, our parametrized policy is given by
(172)
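The integral against the normal jump-size density in (171) follows the generic pattern $\mathbb{E}[g(Z)]$ with $Z$ normally distributed, which the G-H rule approximates by a weighted sum over transformed abscissas. A minimal sketch follows, where the integrand `g` and the parameter values are placeholders rather than the specific integrand of (171):

```python
import numpy as np

def gh_expectation(g, mu_J, sig_J, m=20):
    """Approximate E[g(Z)] for Z ~ N(mu_J, sig_J^2) with the m-point Gauss-Hermite rule."""
    z, w = np.polynomial.hermite.hermgauss(m)        # abscissas/weights for weight e^{-z^2}
    return np.sum(w * g(mu_J + np.sqrt(2.0) * sig_J * z)) / np.sqrt(np.pi)

# example: E[e^Z - 1], the mean relative jump size, has closed form e^{mu_J + sig_J^2/2} - 1
approx = gh_expectation(lambda z: np.exp(z) - 1.0, mu_J=-0.05, sig_J=0.1)
exact = np.exp(-0.05 + 0.5 * 0.1**2) - 1.0
```

With 20 quadrature points the approximation error for smooth integrands of this type is negligible, which is consistent with the implementation choice reported in the simulation study below.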
Critic parametrization. For the value function of a given policy, following (132), we parametrize it as
(173)
We do not specify any parametric form for the remaining component of (173), but will learn it using Gaussian process regression (Williams and Rasmussen 2006). We choose a Gaussian process for our problem because it can generate flexible shapes and often achieves a reasonably good fit with a moderate amount of data.
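As an illustration (not the paper’s actual code) of fitting this nonparametric component at a fixed grid time, one can use scikit-learn’s Gaussian process regressor with a radial basis function kernel; the inputs and targets below are synthetic stand-ins for the asset prices and realized cumulative rewards collected from the episodes:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# placeholder training data at one grid time: asset prices and realized cumulative rewards
s = rng.uniform(80.0, 120.0, size=(64, 1))
y = np.maximum(100.0 - s[:, 0], 0.0) + 0.5 * rng.standard_normal(64)

kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(s, y)

s_grid = np.linspace(80.0, 120.0, 100).reshape(-1, 1)
mean, std = gp.predict(s_grid, return_std=True)   # fitted critic component and its uncertainty
```

The kernel choice and hyperparameter initialization here are illustrative; in practice the kernel hyperparameters are refit from the episode data at every grid time.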
Actor–critic learning. Consider an equally spaced time grid with a fixed time step. In each iteration, we start with a given policy and use it to collect a batch of episodes from the environment, each episode recording the state and action paths on the grid. We then perform two steps.
In the critic step, the main task is to learn the nonparametric component of the value function as a function of the risky asset price at each grid time. The stochastic representation (158) shows that it can be viewed as the value function of a stream of rewards that depend on time and the risky asset price, so the task is a policy evaluation problem. We first obtain the running reward path in every episode, where the integral involving the Lévy measure is again calculated by the G-H rule. Specifically, the reward at each grid time is given by
(174)–(175)
where the risky asset price path is collected from the real environment. We stress again that the above calculation uses the agent’s parameters and does not involve any parameters of the true dynamics. We then fit a Gaussian process to the sample of the cumulative rewards to estimate . Figure 2 illustrates the fit at three time points with the radial basis kernel. The Gaussian process is able to provide a good fit to the shape of as a function of for each given .
In the actor step, we update the policy parameters by
(176)–(178)
This update essentially follows from (85); the difference is that we do batch processing and update the parameters only after observing all the episodes in a batch. Furthermore, we flip the signs of the updates because we consider minimization here, while the problem in the general discussion is a maximization.
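Schematically, the actor step is a plain batch stochastic-gradient update with per-parameter learning rates and the decay schedule described in the simulation study below; the gradient expressions themselves are those in (176)–(178). The sketch below is generic, and the parameter names, dictionary interface, and decay exponent are our own illustrative choices:

```python
import numpy as np

def actor_step(theta, grads, lr, it, decay_start=30, decay_power=0.51):
    """One batch update of the actor parameters.

    theta : dict of current parameter values, e.g. {"mu": ..., "sigma": ..., ...}
    grads : dict of gradient estimates averaged over the batch of episodes
    lr    : dict of base learning rates, one per parameter
    it    : iteration index; rates are constant up to `decay_start`, then decay
            polynomially (the exponent here is a placeholder, not the paper's value)
    """
    scale = 1.0 if it <= decay_start else (it - decay_start) ** (-decay_power)
    # minimization problem: move against the gradient estimate
    return {k: theta[k] - scale * lr[k] * grads[k] for k in theta}
```

After each actor step, the two dependent parameters in (170) are recomputed from the updated independent ones before the next batch of episodes is collected.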
Figure 2: Gaussian process fit of the critic component as a function of the asset price at three time points (radial basis function kernel).
6.3 Simulation study
We check the convergence behavior of our actor–critic algorithm in a simulation study. We assume the environment is described by an MJD model with its true parameters shown in the upper row of Table 2. (These values are chosen close to the estimates obtained from the market data of the S&P 500 in Table 1 so that the MJD model better resembles reality in the simulation study.) We consider hedging a 4-month put option with a fixed strike price, because short-term options with maturities of a few months are typically traded much more actively than longer-term options. We assume the risk-free rate is zero.
We use 20 quadrature points for the G-H rule in our implementation, which suffices for high accuracy. In each iteration, we sample a batch of episodes from the environment simulated by the MJD model. The learning rates and the temperature parameter are set to constant values; we keep the learning rates constant in the first 30 iterations and then let them decay with the iteration index. Before our actor–critic learning, we sample a 15-year path from the MJD model as data for MLE and obtain estimates of the parameters shown in Table 2. MLE estimates all the parameters quite well except the drift, whose estimate is more than twice the true value. (This is expected due to the well-known mean-blur problem: 15 years of data are far from sufficient to obtain a reasonably accurate estimate of the mean by statistical methods.) We then use these estimates as initial values of the learnable parameters. Figure 3 clearly shows that our RL algorithm is able to converge near the true values. In particular, the estimate of the drift eventually moves closely around the true level despite being distant from it initially.
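For the environment simulator, one common way to generate an MJD price path on the time grid is via the log-price representation, as sketched below; the parameter values are placeholders rather than the true values in Table 2, and the drift convention may differ from the paper’s model by a jump-compensator term.

```python
import numpy as np

def simulate_mjd_path(S0, mu, sigma, lam, muJ, sigJ, T, n_steps, rng):
    """Simulate one Merton jump-diffusion price path on an equally spaced grid."""
    dt = T / n_steps
    S = np.empty(n_steps + 1)
    S[0] = S0
    for i in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt))
        n_jumps = rng.poisson(lam * dt)
        J = rng.normal(muJ, sigJ, size=n_jumps).sum()    # sum of log jump sizes over the step
        S[i + 1] = S[i] * np.exp((mu - 0.5 * sigma**2) * dt + sigma * dW + J)
    return S

rng = np.random.default_rng(0)
path = simulate_mjd_path(S0=100.0, mu=0.08, sigma=0.2, lam=0.5,
                         muJ=-0.05, sigJ=0.1, T=4/12, n_steps=84, rng=rng)
```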
Table 2: True parameters of the MJD model (upper row) and their MLE estimates from the simulated 15-year path (lower row).
Figure 3: Convergence of the learned parameters toward the true values of the MJD model over training iterations.
6.4 Empirical study
In the empirical study we consider hedging a European put option written on the S&P 500 index. This type of option is popular in the financial market as a tool for investors to protect their high-beta investments. We obtain daily observations of the index from the beginning of 2000 to the end of 2023, and split them into two periods: the training period (the first 20 years) and the test period (the last 4 years). The test period includes the U.S. stock market meltdown at the onset of the pandemic, the 2022 bear period, and various bull periods.
We use the MLE estimates from the first 20 years of observations to initialize the learnable parameters. To sample an episode from the environment for training our RL algorithm, we bootstrap index returns from the training period; similarly, we bootstrap index returns from the test period to generate index price paths for testing the learned policy. We use the 20-point G-H rule with a batch of episodes in each training iteration. The learning rates are set to constant values with the same decay schedule as in the simulation study, and the temperature parameter is fixed.
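Generating an episode by bootstrapping daily index returns can be sketched as follows; the return series, horizon, and initial level are placeholders for the actual training or test data:

```python
import numpy as np

def bootstrap_price_path(daily_returns, S0, n_days, rng):
    """Build a price path by resampling historical daily simple returns with replacement."""
    sampled = rng.choice(daily_returns, size=n_days, replace=True)
    return S0 * np.cumprod(np.concatenate(([1.0], 1.0 + sampled)))

# daily_returns would be computed from the training-period index observations;
# the normal draws below are only a stand-in so the sketch runs on its own
rng = np.random.default_rng(1)
train_returns = rng.normal(0.0003, 0.012, size=5000)
episode = bootstrap_price_path(train_returns, S0=100.0, n_days=84, rng=rng)
```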
We consider four maturities ranging from 1 to 4 months for the put option. For a standard S&P 500 index option traded on the CBOE, in practice the strike is quoted in points commensurate with the scale of the index level and converted to a dollar amount by a multiplier of 100; that is, the dollar strike equals 100 times the strike in points. (For example, for an at-the-money contract where the current index is 5000 points, the dollar strike is 500,000 and the contract’s notional size is 500,000 dollars.) We likewise convert the index level to dollars by the multiplier of 100. We sample the initial index level uniformly from a prespecified range to generate training and test episodes, which allows us to include scenarios where the option is deep out of the money or in the money. We train our algorithm for 100 iterations and then test it on 5000 index price episodes. Finally, although we use a stochastic policy to interact with the environment for learning/updating the parameters during training, in the test stage we apply a deterministic policy, namely the mean of the learned stochastic policy. This reduces the variance of the final hedging error.
To measure the test performance, we consider the mean squared hedging error (denominated in squared dollars) divided by the square of the strike in points. This effectively normalizes the contract’s scale, so we can simply set the strike to one point in our implementation. The test result can then be interpreted as the average squared hedging error in squared dollars per point of strike price.
We compare the performance of the policy learned by our RL algorithm with that of the policy obtained by plugging the MLE estimates from the training-period data into (160). Both policies are tested on the same index price episodes. We check the statistical significance of the difference in the squared hedging errors of the two policies by applying a t-test. Table 3 reports the mean squared hedging error for each policy as well as the p-value of the t-test. Not surprisingly, the hedging error of each policy increases as the maturity becomes longer, due to the greater uncertainty in the problem. In every case, the RL policy achieves a notable reduction in the hedging error compared with the MLE-based policy, and the p-values further show that this outperformance is statistically significant.
Maturity | RL policy | MLE policy | p-value |
---|---|---|---|
1 month | 0.3513 (0.034) | 0.4006 (0.043) | 3.2e-7 |
2 months | 0.5668 (0.029) | 0.6796 (0.050) | 1.2e-5 |
3 months | 0.8283 (0.031) | 1.0080 (0.060) | 1.6e-5 |
4 months | 0.8649 (0.047) | 1.1544 (0.068) | 4.2e-13 |
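Since both policies are evaluated on the same 5000 test episodes, a paired t-test on the per-episode squared hedging errors is a natural way to obtain such p-values (the paper does not spell out the exact variant of the test used). A sketch with synthetic placeholder errors:

```python
import numpy as np
from scipy import stats

# per-episode squared hedging errors on the same test episodes (synthetic placeholders)
rng = np.random.default_rng(2)
err_rl = rng.gamma(shape=2.0, scale=0.18, size=5000)
err_mle = err_rl + rng.normal(0.05, 0.2, size=5000)

t_stat, p_value = stats.ttest_rel(err_rl, err_mle)     # paired t-test on the differences
print(f"mean RL = {err_rl.mean():.4f}, mean MLE = {err_mle.mean():.4f}, p = {p_value:.2e}")
```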
7 Conclusions
Fluctuations in data or time series coming from nonlinear, complex dynamic systems are of two types: slow changes and sudden jumps, the latter occurring much more rarely than the former. Hence jump-diffusions capture the key structural characteristics of many data-generating processes in areas such as physics, astrophysics, earth science, engineering, finance, and medicine. As a result, RL for jump-diffusions is important both theoretically and practically. This paper endeavors to lay a theoretical foundation for this study. A key insight from this research is that temporal-difference algorithms designed for diffusions can work seamlessly for jump-diffusions. However, unless general neural networks are used, policy parameterization does need to respond to the presence of jumps if one is to take advantage of any special structure of the underlying problem.
There are plenty of open questions starting from here, including the convergence of the algorithms, regret bounds, decay rates of the temperature parameters, and learning rates of the gradient descent and/or stochastic approximation procedures involved.
Acknowledgements
Gao acknowledges support from the Hong Kong Research Grants Council [GRF 14201424, 14212522, 14200123]. Li is supported by the Hong Kong Research Grants Council [GRF 14216322]. Zhou is supported by the Nie Center for Intelligent Asset Management at Columbia University.
References
- Aït-Sahalia et al. (2012) Aït-Sahalia, Y., J. Jacod, and J. Li (2012). Testing for jumps in noisy high frequency data. Journal of Econometrics 168(2), 207–222.
- Andersen et al. (2002) Andersen, T. G., L. Benzoni, and J. Lund (2002). An empirical investigation of continuous-time equity return models. The Journal of Finance 57(3), 1239–1284.
- Applebaum (2009) Applebaum, D. (2009). Lévy Processes and Stochastic Calculus. Cambridge University Press.
- Bates (1991) Bates, D. S. (1991). The crash of 87: was it expected? The evidence from options markets. The Journal of Finance 46(3), 1009–1044.
- Bates (1996) Bates, D. S. (1996). Jumps and stochastic volatility: Exchange rate processes implicit in Deutsche Mark options. The Review of Financial Studies 9(1), 69–107.
- Bender and Thuan (2023) Bender, C. and N. T. Thuan (2023). Entropy-regularized mean-variance portfolio optimization with jumps. arXiv preprint arXiv:2312.13409.
- Bo et al. (2024) Bo, L., Y. Huang, X. Yu, and T. Zhang (2024). Continuous-time q-learning for jump-diffusion models under Tsallis entropy. arXiv preprint arXiv:2407.03888.
- Cai and Kou (2011) Cai, N. and S. G. Kou (2011). Option pricing under a mixed-exponential jump diffusion model. Management Science 57(11), 2067–2081.
- Cont and Tankov (2004) Cont, R. and P. Tankov (2004). Financial Modelling with Jump Processes. Chapman and Hall/CRC.
- Dai et al. (2023) Dai, M., Y. Dong, and Y. Jia (2023). Learning equilibrium mean–variance strategy. Mathematical Finance 33(4), 1166–1212.
- Das (2002) Das, S. R. (2002). The surprise element: Jumps in interest rates. Journal of Econometrics 106(1), 27–65.
- Denkert et al. (2024) Denkert, R., H. Pham, and X. Warin (2024). Control randomisation approach for policy gradient and application to reinforcement learning in optimal switching. arXiv preprint arXiv:2404.17939.
- Ethier and Kurtz (1986) Ethier, S. N. and T. G. Kurtz (1986). Markov Processes: Characterization and Convergence. John Wiley & Sons.
- Fang and Oosterlee (2009) Fang, F. and C. W. Oosterlee (2009). A novel pricing method for European options based on Fourier-cosine series expansions. SIAM Journal on Scientific Computing 31(2), 826–848.
- Gammaitoni et al. (1998) Gammaitoni, L., P. Hänggi, P. Jung, and F. Marchesoni (1998). Stochastic resonance. Reviews of Modern Physics 70(1), 223–287.
- Gao et al. (2022) Gao, X., Z. Q. Xu, and X. Y. Zhou (2022). State-dependent temperature control for Langevin diffusions. SIAM Journal on Control and Optimization 60(3), 1250–1268.
- Giraudo and Sacerdote (1997) Giraudo, M. T. and L. Sacerdote (1997). Jump-diffusion processes as models for neuronal activity. Biosystems 40(1-2), 75–82.
- Goswami et al. (2018) Goswami, B., N. Boers, A. Rheinwalt, N. Marwan, J. Heitzig, S. Breitenbach, and J. Kurths (2018). Abrupt transitions in time series with uncertainties. Nature Communications 9(1), 48–57.
- Guo et al. (2023) Guo, X., A. Hu, and Y. Zhang (2023). Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. SIAM Journal on Control and Optimization 61(2), 755–787.
- Guo et al. (2022) Guo, X., R. Xu, and T. Zariphopoulou (2022). Entropy regularization for mean field games with learning. Mathematics of Operations Research 47(4), 3239–3260.
- Heyde and Kou (2004) Heyde, C. C. and S. G. Kou (2004). On the controversy over tailweight of distributions. Operations Research Letters 32(5), 399–408.
- Huang et al. (2022) Huang, Y., Y. Jia, and X. Zhou (2022). Achieving mean–variance efficiency by continuous-time reinforcement learning. In Proceedings of the Third ACM International Conference on AI in Finance, pp. 377–385.
- Jacod and Shiryaev (2013) Jacod, J. and A. Shiryaev (2013). Limit Theorems for Stochastic Processes. Springer.
- Jia and Zhou (2022a) Jia, Y. and X. Y. Zhou (2022a). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research 23(154), 1–55.
- Jia and Zhou (2022b) Jia, Y. and X. Y. Zhou (2022b). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research 23(154), 1–55.
- Jia and Zhou (2023) Jia, Y. and X. Y. Zhou (2023). q-learning in continuous time. Journal of Machine Learning Research 24(161), 1–61.
- Kou (2002) Kou, S. G. (2002). A jump-diffusion model for option pricing. Management Science 48(8), 1086–1101.
- Kunita (2004) Kunita, H. (2004). Stochastic differential equations based on Lévy processes and stochastic flows of diffeomorphisms. In M. M. Rao (Ed.), Real and Stochastic Analysis: New Perspectives, pp. 305–373. Birkhäuser.
- Kushner (2000) Kushner, H. J. (2000). Jump-diffusions with controlled jumps: Existence and numerical methods. Journal of Mathematical Analysis and Applications 249(1), 179–198.
- Li and Linetsky (2014) Li, L. and V. Linetsky (2014). Time-changed Ornstein–Uhlenbeck processes and their applications in commodity derivative models. Mathematical Finance 24(2), 289–330.
- Lim (2005) Lim, A. E. (2005). Mean-variance hedging when there are jumps. SIAM Journal on Control and Optimization 44(5), 1893–1922.
- Merton (1976) Merton, R. C. (1976). Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics 3(1-2), 125–144.
- Munos (2006) Munos, R. (2006). Policy gradient in continuous time. Journal of Machine Learning Research 7, 771–791.
- Øksendal and Sulem (2007) Øksendal, B. K. and A. Sulem (2007). Applied Stochastic Control of Jump Diffusions, Volume 498. Springer.
- Park et al. (2021) Park, S., J. Kim, and G. Kim (2021). Time discretization-invariant safe action repetition for policy gradient methods. Advances in Neural Information Processing Systems 34, 267–279.
- Reisinger and Zhang (2021) Reisinger, C. and Y. Zhang (2021). Regularity and stability of feedback relaxed controls. SIAM Journal on Control and Optimization 59(5), 3118–3151.
- Sato (1999) Sato, K.-I. (1999). Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press.
- Schweizer (1996) Schweizer, M. (1996). Approximation pricing and the variance-optimal martingale measure. The Annals of Probability 24(1), 206–236.
- Schweizer (2001) Schweizer, M. (2001). A guided tour through quadratic hedging approaches. In E. Jouini, J. Cvitanic, and M. Musiela (Eds.), Handbooks in Mathematical Finance: Option Pricing, Interest Rates and Risk Management, pp. 538–74. Cambridge University Press.
- Situ (2006) Situ, R. (2006). Theory of Stochastic Differential Equations with Jumps and Applications. Springer.
- Tallec et al. (2019) Tallec, C., L. Blier, and Y. Ollivier (2019). Making deep q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732.
- Wang et al. (2023) Wang, B., X. Gao, and L. Li (2023). Reinforcement learning for continuous-time optimal execution: actor-critic algorithm and error analysis. Available at SSRN 4378950.
- Wang and Zheng (2022) Wang, B. and X. Zheng (2022). Testing for the presence of jump components in jump diffusion models. Journal of Econometrics 230(2), 483–509.
- Wang et al. (2020) Wang, H., T. Zariphopoulou, and X. Y. Zhou (2020). Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research 21(198), 1–34.
- Wang and Zhou (2020) Wang, H. and X. Y. Zhou (2020). Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance 30(4), 1273–1308.
- Williams and Rasmussen (2006) Williams, C. K. and C. E. Rasmussen (2006). Gaussian Processes for Machine Learning. MIT Press.
- Wu and Li (2024) Wu, B. and L. Li (2024). Reinforcement learning for continuous-time mean–variance portfolio selection in a regime-switching market. Journal of Economic Dynamics and Control 158, 104787.
- Zhou and Li (2000) Zhou, X. Y. and D. Li (2000). Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization 42(1), 19–33.
- Zhu et al. (2015) Zhu, C., G. Yin, and N. A. Baran (2015). Feynman–Kac formulas for regime-switching jump diffusions and their applications. Stochastics 87(6), 1000–1032.
Appendix A Proofs
Proof of Lemma 2.
We observe that for each ,
(179)
From Assumption 1-(iii), we have for any , where does not depend on . Thus, integrating over with the measure preserves the linear growth property. ∎
Proof of Lemma 3.
We consider
(180)
which is bounded by
(181)–(182)
for some constant .
For the first integral, using Assumption 2 we obtain
(183)–(184)
For the second integral, we have
(185)–(188)
where we used Assumption 1-(ii) and is the constant there. The desired claim is obtained by combining these results. ∎
Proof of Lemma 4.
Fix and suppose . Define a sequence of stopping times for . Applying Itô’s formula (15) to the process , where follows the exploratory SDE (42), we obtain ,
(189)–(192)
where the operator is the generator of the exploratory process given in (40). We next show that the expectations of (191) and of the stochastic integral with respect to the compensated Poisson random measure are zero.
Note that for , . Thus, is also bounded. Using the linear growth of in Lemma 1 and the moment estimate (52), we can see that (191) is a square-integrable martingale and hence its expectation is zero.
Next, we analyze the following stochastic integral:
(194)
We consider the finite and infinite jump activity cases separately.
Case 1: finite jump activity. In this case, both of the processes
(195)
(196)
are square-integrable martingales and hence have zero expectations. We prove this claim for (195) and analyzing (196) is entirely analogous.
Using the polynomial growth of , Lemma 2, and for , we obtain
(197)–(199)
This implies the process (195) is a square-integrable martingale; see e.g. (Situ, 2006, Section 1.9) for square-integrability of stochastic integrals with respect to compensated Poisson random measures.
Case 2: infinite jump activity. Split the jump sizes into small and large ones. The stochastic integral (194) can then be written as the sum of two integrals:
(200)
(201)
Using the mean-value theorem, the stochastic integral (200) is equal to
(202)
for some . For , . By Assumption 1′-(v), is bounded for any , and , which further implies the boundedness of . Now, for each ,
(203)
is a square-integrable martingale because
(204)–(206)
where we used Lemma 2 and boundedness of in the above. Thus, (200) is a square-integrable martingale with mean zero.
For (201), we can use the same arguments as in the finite activity case to show that each of the two processes in (201) is a square-integrable martingale with mean zero.
Combining the results above, setting , and taking expectation, we obtain
As satisfies Eq. (56), it follows from (19) that
(207)–(208)
It follows from Assumption 1-(iv), Definition 1-(iii), and Eq. (57) that the term on the right-hand side of (208) is dominated by a quantity which has finite expectation by the moment estimate (52). Therefore, sending the index to infinity in (208) and applying the dominated convergence theorem, we obtain the desired result. ∎
Proof of Lemma 5.
We first maximize the integral in (59). Applying (Jia and Zhou, 2023, Lemma 13), we deduce that the policy given by (62) is the unique maximizer. Next, we show that the candidate value function is the optimal value function.
On one hand, given any admissible stochastic policy , from (59) we have
(209)
Using similar arguments as in the proof of Lemma 4, we obtain for any . Thus, .
Combining these results, we conclude that and is the optimal stochastic policy. ∎
Proof of Theorem 2.
Consider part (i). Applying Itô’s formula to the value function of the policy along the sample state process defined by (22), and using the definition of the q-function, we obtain
(211)
Suppose for all . Hence the first term on the right-hand side of (211) is zero. We verify the following two conditions:
(212)
(213)
Eq. (212) follows from Assumption 1-(iii), the polynomial growth of in , and the moment estimate (53). For (213), we apply the mean-value theorem and the integral becomes
(214)
for some . Using the polynomial growth of in , we can bound the integral by
(215)–(219)
Using Assumption 1-(iii) and the moment estimate (53), we obtain (213). It follows that the second and third processes on the right-hand side of (211) are -martingales and thus we have the martingale property of the process given by (69).
Conversely, if (69) is a -martingale, we see from (211) that the process
(220)
is also a -martingale. Furthermore, it has continuous sample paths and finite variation and thus is equal to zero -almost surely. We can then follow the argument in the proof of Theorem 6 in Jia and Zhou (2023) to show that for all . There is only one step in their proof that we need to modify due to the presence of jumps.
Specifically, consider the sample state process starting from some time-state-action . Fix and define
(221)
In the pure diffusion case, Jia and Zhou (2023) use the continuity of the sample paths of the state process for this step. The result still holds in the presence of jumps, because the Lévy processes that drive our controlled state are stochastically continuous, i.e., the probability of a jump occurring at any fixed time is zero.
To prove parts (ii) and (iii), we can apply the arguments used in proving part (i) together with those from the proof of Theorem 6 in Jia and Zhou (2023). The details are omitted. ∎
Proof of Lemma 6.
(1) Consider the function defined by the RHS of (154). Proposition 12.1 in Cont and Tankov (2004) shows that and it satisfies the PIDE (144). Furthermore, is Lipschitz continuous in . The Lipschitz continuity of function also implies for some constant . It follows that
(222)
where we used the martingale property of the price process under the new measure. The Feynman–Kac theorem (see Zhu et al. (2015)) implies uniqueness of classical solutions satisfying the linear growth condition.
(2) To study the PIDE (148), we consider the function defined by the RHS of (156). Under the assumption of Lemma 6, the arguments of Proposition 12.1 in Cont and Tankov (2004) can be used to show that it satisfies the PIDE (148). The Lipschitz continuity of the payoff in the underlying implies the required boundedness, and we also have the relevant estimate from Kunita (2004, Theorem 3.2). Combining these results, we see that the solution has quadratic growth. Again, the Feynman–Kac theorem implies uniqueness of classical solutions satisfying the quadratic growth condition. ∎