Continuous-time optimal investment with portfolio constraints: a reinforcement learning approach
Abstract.
In a reinforcement learning (RL) framework, we study the exploratory version of the continuous-time expected utility (EU) maximization problem with a portfolio constraint that includes widely used financial regulations such as short-selling constraints and borrowing prohibition. The optimal feedback policy of the exploratory unconstrained classical EU problem is shown to be Gaussian. When the portfolio weight is constrained to a given interval, the corresponding constrained optimal exploratory policy follows a truncated Gaussian distribution. We verify that the closed-form optimal solutions obtained for logarithmic and quadratic utility, in both the unconstrained and constrained situations, converge to their non-exploratory expected utility counterparts as the exploration weight goes to zero. Finally, we establish a policy improvement theorem and devise an implementable reinforcement learning algorithm by casting the optimization problem in a martingale framework. Our numerical examples show that exploration leads to an optimal wealth process that is more dispersed, with a heavier tail, than in the case without exploration; this effect becomes less significant as the exploration parameter decreases. Moreover, the numerical implementation confirms the intuitive understanding that a broader domain of investment opportunities necessitates a higher exploration cost. Notably, when subjected to both short-selling and money-borrowing constraints, the exploration cost becomes negligible compared to the unconstrained case.
Key words and phrases:
Optimal investment, entropy regularized, reinforcement learning, exploration, stochastic optimal control, portfolio constraint.
2010 Mathematics Subject Classification: 34D20, 60H10, 92D25, 93D05, 93D20.
1. Introduction
Reinforcement learning (RL) is a dynamic and rapidly advancing subset within the field of machine learning. In recent times, RL has been applied across diverse disciplines to address substantial real-world challenges, extending its reach into domains such as game theory (Silver et al., 2016, 2017), control
theory (Bertsekas, 2019),
information theory (Williams et al., 2017), and statistics (Kaelbling
et al., 1996).
RL has been also applied to study many important issues of operations research (OR) such as manufacturing planning, control systems (Schneckenreither and
Haeussler, 2019) and inventory management (Bertsimas and Thiele, 2006). In quantitative finance, RL has been used to study important problems such as algorithmic and high-frequency trading and portfolio management.
The availability of big data sets has also facilitated the rapid development of RL techniques in financial engineering. As a typical example, electronic markets can provide a sufficient amount of microstructure data for training and adaptive learning, far beyond what human traders and portfolio managers could handle in the old days. Numerous studies
have been done in this direction
including optimal order execution (Nevmyvaka
et al., 2006; Schnaubelt, 2022),
optimal trading (Hendricks and
Wilcox, 2014), portfolio allocation (Moody et al., 1998),
mean-variance portfolio allocation (Wang and Zhou, 2020; Wang, 2019; Dai et al., 2020; Jia and Zhou, 2022b).
Notably, the authors in Wang and Zhou (2020) considered continuous-time mean-variance portfolio selection in a continuous RL framework; their algorithm outperforms both traditional and deep neural network based algorithms by a large margin. For recent reviews of applications of RL in OR, finance and economics, the interested reader is referred to the survey papers, e.g., Gosavi (2009); Hambly et al. (2021); Jaimungal (2022); Charpentier et al. (2021).
Differing from conventional econometric or supervised learning approaches prevalent in quantitative
finance research, which often necessitate the assumption of parametric models, an RL agent refrains from pre-specifying a structural model. Instead, she progressively learns optimal strategies through trial and error, engaging in interactions with a black-box environment (e.g. the market).
By repeatedly trying
a policy for actions,
receiving and evaluating
reward signals, and improving
the policy, the agent gradually improves her
performance.
The three key components
that essentially capture
the heart of RL are: i) exploration, ii) policy evaluation, and iii)
policy improvement.
The agent first explores
the surrounding
environment by trying
a policy. She
then evaluates
her reward for the given policy.
Lastly, using the information received from both exploration and the current reward, she devises a new policy with a larger reward.
The whole process is then repeated.
Despite the fast development and vast applications, very few existing RL studies are done in continuous settings, e.g. Doya (2000); Frémaux et al. (2013); Lee and Sutton (2021), whereas a large body of RL investigations is limited to discrete learning frameworks such as discrete-time Markov decision processes (MDP) or deterministic settings, see e.g. Sutton and Barto (2018); Hambly et al. (2021); Liu et al. (2020); Lee and Lee (2021). Since discrete-time dynamics only provide an approximation of real-world systems, it is important, from the practical point of view, to consider continuous-time systems with both continuous state and action spaces.
As mentioned previously, in RL settings an agent simultaneously interacts with the surrounding environment (exploration) and improves her overall performance (exploitation). Since exploration is inherently costly in terms of resources, it is important to design an active learning approach that balances both exploration and exploitation. There is therefore a critical need to extend RL techniques to continuous settings in which the agent can find learning strategies that achieve this balance.
The studies of RL
under a continuous time framework
in both time and space
have been initiated in a series of recent papers (Wang et al., 2020; Wang and Zhou, 2020; Wang, 2019; Dai et al., 2020; Jia and Zhou, 2022b).
In Wang et al. (2020)
the authors
propose a theoretical framework, called exploratory formulation, for studying
RL problems in continuous systems in both time and space
that captures repetitive learning
under exploration in the continuous time limit.
Wang and Zhou (2020) adopts the RL setting of Wang et al. (2020)
to study the continuous-time mean variance
portfolio selection with a finite time horizon.
They show that in a learning framework incorporating both exploration and exploitation,
the optimal feedback policy is Gaussian, with time-decaying variance.
As shown in Jia and Zhou (2022a), the framework in Wang et al. (2020) minimizes the mean-square temporal difference (TD) error (Barnard, 1993; Baird, 1995; Sutton and Barto, 2018) when learning the value function, which later turns out to be inconsistent in stochastic settings.
Dai et al. (2020) further consider equilibrium mean–variance strategies, addressing the time-inconsistency issue of the problem. Guo et al. (2022); Mou et al. (2021)
extend the formulation and results of Wang and Zhou (2020) to mean-field games
and a mean-variance problem with drift
uncertainty, respectively. In Jia and Zhou (2022b), using a martingale approach,
the authors
are able to represent the gradient of the value function with respect to a given parameterized stochastic policy as the expectation of an integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns
policy gradient into a policy evaluation problem. Studies of convergence under the exploratory formulation can be found in Tang et al. (2022); Huang et al. (2022).
In this paper, we adopt the exploratory stochastic
control learning framework of Wang et al. (2020)
to study the continuous-time optimal investment problem without and with a portfolio constraint that covers widely used financial regulations such as short-selling constraints and borrowing prohibition (Cuoco, 1997). In both the constrained and unconstrained settings with exploration, we obtain the closed-form solution of the exploratory problem for logarithmic and quadratic utility functions. We show that the optimal feedback policy of the exploratory unconstrained expected utility (EU) problem is Gaussian, in line with the result obtained in Wang and Zhou (2020) for the mean-variance setting. However, when the risky investment ratio is restricted to a given interval, the constrained optimal exploratory policy follows a truncated Gaussian distribution.
The explicit form of the solutions to the HJB equations enables us to obtain a policy improvement theorem, which confirms that exploration can be performed within the class of Gaussian (resp. truncated Gaussian) policies for the unconstrained (resp. constrained) EU problem. Moreover, by casting the optimization problem in a martingale framework as in Jia and Zhou (2022b), we devise an implementable reinforcement learning
algorithm. We observe that, compared to the classical case (without exploration), the exploration procedure in the presence of a portfolio constraint leads to an optimal portfolio process that exhibits a more dispersed distribution with a heavier tail. This effect becomes less significant as the exploration parameter decreases. Moreover, a decrease (resp. increase) in the lower (resp. upper) bound of the portfolio strategy leads to a corresponding increase in the exploration cost. This aligns with the intuitive understanding that a larger investment opportunity domain requires a higher exploration cost. Notably, when facing both short-selling and money-borrowing constraints, the exploration cost becomes negligible compared to the unconstrained case. Our paper is the first attempt to explicitly analyze the classical continuous-time EU problem with possible portfolio constraints in an entropy-regularized RL framework. Although this paper specifically addresses an optimal portfolio problem, it is highly relevant to OR challenges from multiple perspectives. The techniques and methodologies discussed here can be applied directly to various areas of OR that involve decision making under model parameter uncertainty, where reliable data are used to learn the true model parameters. It is worth highlighting that our findings in Section 6 establish a direct connection between quadratic utility and the mean-variance (MV) framework, a cornerstone in OR for evaluating risk-return profiles under portfolio constraints. With reinforcement learning (RL) methods increasingly adopted in OR for solving dynamic stochastic problems, our work offers a versatile framework with clear applications in portfolio optimization, inventory management, supply chain systems, and other OR domains involving constrained stochastic decision-making.
While revising this paper, we became aware of the recent work Dai et al. (2023), where the authors consider the Merton optimal portfolio for a power utility function in a stochastic volatility setting with a recursive weighting scheme on exploration that endogenously discounts the current exploration reward by the past accumulated amount of exploration. By adopting a two-step optimization of the corresponding exploratory HJB equation for an exploratory learning scheme that incorporates the correlation between the risky asset and the volatility dynamics, as suggested in Dai et al. (2021), the authors in Dai et al. (2023) are able to characterize the Gaussian optimal policy, with a biased structure due to market incompleteness. Compared to Dai et al. (2021, 2023), our paper studies the Merton problem for logarithmic and quadratic utility functions in complete market settings, aiming at explicit solutions, while accounting for possible portfolio constraints. We remark that the approach in Dai et al. (2023) appears to be no longer applicable to the classical expected utility maximization when the endogenously recursive weighting scheme on exploration is removed. In addition, it is not clear how the two-step optimization procedure in Dai et al. (2023) would apply to our constrained setting.
The rest of the paper is organized as follows: Section 2 formulates the exploratory optimal investment problem incorporating both exploration and exploitation and provides a general form of the optimal distributional control for the exploratory HJB equation.
Section 3 studies the unconstrained optimal investment problem for a logarithmic utility function. We elaborate on the constrained EU problem with exploration and discuss the exploration cost and its impact in Section 4. Section 5 discusses an implementable reinforcement learning algorithm in a martingale framework and provides some numerical examples. Section 6 studies the case of a quadratic utility function. An extension to random coefficient markets is presented in Section 7. Section 8 concludes
the paper with future research perspectives. Technical proofs and additional details can be found in the Appendix.
Notation. In this paper, we use or to denote a stochastic process; also refers to the values of the stochastic process when it is clear from the context. We use to denote the normal distribution with mean and standard deviation , and denotes the mathematical constant pi. Additionally, denotes the partial first/second derivatives with respect to the corresponding arguments. and are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.
2. Formulation
Consider a filtered probability space (). Furthermore, let be a standard Brownian motion (restricted to ) under the real-world measure, and let be the standard filtration generated by . For our analysis, we assume a Black–Scholes economy. To be more precise, there are two assets traded in the market: one risk-free asset earning a constant interest rate and one risky asset following a geometric Brownian motion with instantaneous rate of return and volatility . Their dynamics are given by
(2.1)  $dS^0_t = r\,S^0_t\,dt,$
(2.2)  $dS_t = S_t\left(\mu\,dt + \sigma\,dW_t\right).$
Consider an investor endowed with an initial wealth , which will be used for investment. We assume that the investor splits her wealth between the two assets given above. We use to denote the fraction of wealth that the investor invests in the risky asset; the remaining money is invested in the risk-free asset. We assume that the process is -adapted. The investor chooses an investment strategy from the following admissible set
Here, is the investor’s wealth process which satisfies the following stochastic differential equation:
(2.3)  $dX^u_t = X^u_t\left[\left(r + u_t(\mu - r)\right)dt + u_t\,\sigma\,dW_t\right], \qquad X^u_0 = x_0.$
2.1. The classic optimal investment problem
The classical asset allocation problem can be written as
(2.4)  $\sup_{u}\ \mathbb{E}\left[U\!\left(X^u_T\right)\right],$
where is a concave utility function. The optimization problem (2.4) is solely an exploitation problem, which is a typical setup of classical stochastic control problems. In a financial market with complete knowledge of the model parameters, one can readily employ classical model-based stochastic control theory (see e.g. Fleming and Soner (2006); Pham (2009)), a duality approach (see e.g. Chen and Vellekoop (2017); Kamma and Pelsser (2022)), or the so-called martingale approach (see e.g. Karatzas and Shreve (1998)) to find the solution of Problem (2.4). When implementing the optimal strategy, one needs to estimate the market parameters from historical time series of asset prices. However, in practice, estimating these parameters with acceptable accuracy is notoriously difficult, especially for the mean return; this difficulty is known as the mean-blur problem, see e.g. Luenberger (1998).
2.2. Optimal investment with exploration
In the RL setting where the underlying model is not known, dynamic learning is needed and the agent employs exploration to interact with and learn from the unknown environment through trial and error. The key idea is to model exploration via a distribution over the space of controls from which the trials are sampled. Here we assume that the action space is continuous and that randomization is restricted to distributions that have density functions. In particular, at time , with the corresponding current wealth , the agent considers a classical control sampled from a policy distribution . Mathematically, such a policy can be generated, e.g., by introducing an additional random variable that is uniformly distributed in and is independent of the filtration of the Brownian motion , see e.g. Jia and Zhou (2022a). In this sense, the probability space is expanded to (), where and is an extended probability measure on whose projection on is . This sampling is executed for N rounds over the same time horizon. Intuitively, the estimated utility of such a (feedback) policy becomes accurate enough when N is large, by the law of large numbers. This procedure, known as policy evaluation, is a fundamental element of most RL algorithms in practice. For our continuous-time setting, we follow the exploratory formulation suggested in Wang et al. (2020) and refer to, e.g., Wang et al. (2020); Wang and Zhou (2020); Jia and Zhou (2022a, b) for motivations and additional details. In particular, we consider the exploratory version of the wealth dynamics (2.3) given by
(2.5)  $dX^\pi_t = \tilde b\left(X^\pi_t, \pi_t\right)dt + \tilde\sigma\left(X^\pi_t, \pi_t\right)dW_t, \qquad X^\pi_0 = x_0,$
where the exploration drift and exploration volatility are defined by
(2.6)  $\tilde b(x,\pi) := \int_{\mathbb{R}} x\left(r + u(\mu - r)\right)\pi(u)\,du, \qquad \tilde\sigma(x,\pi) := \sigma x\sqrt{\int_{\mathbb{R}} u^2\,\pi(u)\,du},$
where denotes the density of the exploration distribution at time .
From (2.6) we observe that
(2.7)  $\tilde b(x,\pi) = x\left(r + (\mu - r)\int_{\mathbb{R}} u\,\pi(u)\,du\right), \qquad \tilde\sigma^2(x,\pi) = \sigma^2 x^2 \int_{\mathbb{R}} u^2\,\pi(u)\,du.$
As a result, the agent may want to maximize the exploratory expected investment utility over all admissible policies. Intuitively, the agent must account for the exploration effect in her objective. As suggested in Wang et al. (2020), for a given exploration distribution , the following Shannon differential entropy at time ,
$\mathcal{H}(\pi_t) := -\int_{\mathbb{R}} \pi_t(u)\ln \pi_t(u)\,du,$
can be used to measure the exploration impact. Note that when the model is fully known, there would be no requirement for exploration and learning. In such a scenario, control distributions would collapse to Dirac measures, placing us in the domain of classical stochastic control, Fleming and Soner (2006); Pham (2009).
Motivated by the above discussion, below we consider the following Shannon’s entropy-regularized exploratory optimization problem
(2.8) |
where is the set of admissible (feedback) exploration distributions (or, more precisely, densities). Mathematically, (2.8) is a relaxed stochastic control problem, which has been widely studied in the control literature, see e.g. Fleming (1976); Fleming and Nisio (1984); El Karoui et al. (1987). Note that relaxed controls have a natural interpretation: at each time , the agent chooses not a single action (strategy) but rather a probability measure over the action space from which a specific action is randomly sampled. In more general settings of relaxed control, e.g. Fleming (1976); Fleming and Nisio (1984); El Karoui et al. (1987), the admissible set may contain open-loop distributions that are measure-valued stochastic processes. This is also in line with the notion of a mixed strategy in game theory, in which players choose probability measures over the action set rather than single actions (pure strategies).
We note that the minus sign in front of the entropy term accounts for the fact that exploration is inherently costly in terms of resources. Also, we can see that the optimization problem (2.8) incorporates both exploration (due to the entropy factor) and exploitation. The rate of exploration is determined by the exogenous temperature parameter , in the sense that a larger value of indicates that more exploration is needed, and vice versa. Hence, the agent can personalize her exploration rate by selecting an appropriate exogenous temperature parameter .
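For concreteness, a sketch of the entropy-regularized value in (2.8), written in the convention of Wang et al. (2020) under the notation introduced above (a reconstruction; the exploration weight $\lambda > 0$ acts as a temperature parameter):

$$ V(t,x) \;=\; \sup_{\pi} \; \mathbb{E}\Big[\, U\!\big(X^{\pi}_{T}\big) \;-\; \lambda \int_{t}^{T}\!\!\int_{\mathbb{R}} \pi_{s}(u) \ln \pi_{s}(u)\, du\, ds \;\Big|\; X^{\pi}_{t} = x \Big], $$

where the inner integral is the negative of the Shannon entropy, so the regularizer rewards more diffuse (exploratory) policies.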
Next, we specify the set of admissible controls as follows.
Assumption 2.1.
The set of admissible controls has the following properties:
(1) For each , is a density function on .
(2) The mapping is measurable.
(3) For each , the exploratory SDE (2.5) admits a unique strong solution denoted by , which is positive and
2.3. HJB equation and optimal distribution control
For each , we denote by the optimal value function of Problem (2.8), i.e.
By standard arguments of Bellman’s principle of optimality, the optimal value function satisfies the following HJB equation
(2.9) |
with terminal condition . We observe that the formula in the bracket of (2.9) can be expressed as
The optimal distribution can be obtained by Donsker and Varadhan’s variational formula (see e.g. Donsker and Varadhan (2006))
(2.10) |
which is a feedback form and can be seen as a Boltzmann distribution (see e.g., Tang et al. (2022) for a similar result). Here the notation is introduced to indicate that depends on the exploration parameter . Note that we can write as
which is Gaussian with mean and variance (assuming that ) defined by
(2.11) |
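As a sketch of the computation behind (2.10)–(2.11) (a reconstruction in standard Black–Scholes notation, assuming the concavity condition above), the exponent appearing in the variational formula is quadratic in the action $u$,

$$ \pi^{*}(u \mid t, x) \;\propto\; \exp\Big\{ \tfrac{1}{\lambda}\Big[ \big(r + u(\mu - r)\big) x\, v_x(t,x) + \tfrac{1}{2}\,\sigma^{2} u^{2} x^{2}\, v_{xx}(t,x) \Big] \Big\}, $$

and completing the square in $u$ identifies a Gaussian density with

$$ \text{mean} \;=\; -\,\frac{(\mu - r)\, v_x(t,x)}{\sigma^{2} x\, v_{xx}(t,x)}, \qquad \text{variance} \;=\; -\,\frac{\lambda}{\sigma^{2} x^{2}\, v_{xx}(t,x)}, $$

which is well defined whenever $v_{xx}(t,x) < 0$.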
Substituting (2.10) back into the HJB equation (2.9), we obtain the following non-linear PDE
(2.12) |
with terminal condition
3. Unconstrained optimal investment problem with exploration
3.1. Unconstrained optimal policy
Below we show that the optimal solution to (2.12) can be given in explicit form for the case of logarithmic utility. Logarithmic utility is widely recognized as a preference function for rational investors, as highlighted in Rubinstein (1977). The study in Pulley (1983) supports this assertion by showing that logarithmic utility serves as an excellent approximation in the context of the Markowitz mean-variance setting and that, remarkably, the optimal portfolio under logarithmic utility is almost identical to that derived from the mean-variance strategy. It is crucial to emphasize that while our optimal strategy with logarithmic utility maintains time consistency, the mean-variance strategy is only precommitted, meaning it is optimal only at time 0; this stands out as one of the most critical drawbacks of the mean-variance setting. Logarithmic utility's versatility extends to diverse applications, including long-term investor preferences (Gerrard et al., 2023) and optimal hedging theory within semimartingale market models (Merton, 1975), and it belongs to the hyperbolic absolute risk aversion (HARA) class. This class has been explored in portfolio optimization under finite-horizon economies (Cuoco, 1997), infinite-horizon consumption-portfolio problems (El Karoui and Jeanblanc-Picqué, 1998), and scenarios involving allowed terminal debt (Chen and Vellekoop, 2017).
We summarize the results for the unconstrained case, i.e. , in the following theorem.
Theorem 3.1.
The optimal value function of the entropy-regularized exploratory optimal investment problem with logarithmic utility is given by (recall that is the regular constant pi)
(3.1) |
for . Moreover, the optimal feedback distribution control (which is independent of ) is a Gaussian with mean and variance , i.e.
(3.2) |
and the associated optimal wealth under is given by the following SDE
(3.3) |
From Theorem 3.1 we can see that the best control distribution for balancing exploration and exploitation is Gaussian. This demonstrates the popularity of the Gaussian distribution in RL studies. The mean of the optimal distributional control is given by the Merton strategy
(3.4)  $u^{\mathrm{Merton}} = \dfrac{\mu - r}{\sigma^{2}},$
whereas the variance of the optimal Gaussian policy is controlled by the degree of exploration . We also observe that at any , the exploration variance decreases as increases, which means that exploration is less necessary in a more random market environment.
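For concreteness (a reconstruction consistent with Theorem 3.1 and the observation above, assuming standard Black–Scholes notation): for logarithmic utility the value function has the form $v(t,x) = \ln x + c(t)$, so that $v_x = 1/x$ and $v_{xx} = -1/x^{2}$, and the Gaussian mean and variance of the optimal policy reduce to

$$ \pi^{*}(\cdot \mid t, x) \;=\; \mathcal{N}\!\left( \frac{\mu - r}{\sigma^{2}},\; \frac{\lambda}{\sigma^{2}} \right), $$

i.e. the mean is the Merton fraction (3.4) and the variance is proportional to the exploration weight and decreasing in the market volatility, as noted above.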
Remark 3.1.
The optimal value function of the regularized exploration problem can be expressed as
(3.5) |
where
Intuitively, measures the exploration effect. It is easy to see that as , hence, which is the optimal value function in the absence of exploration.
The following result shows the solvability equivalence between the classical and exploratory EU problems, meaning that the solution to one directly provides the solution to the other, without requiring a separate resolution.
Theorem 3.2.
The following statements are equivalent:
(a)
(b) The function
is the optimal value function of the classical (without exploration) EU problem, where the optimal investment strategy is given by the Merton fraction .
Proof.
It is easy to see that and solve the HJB equation (2.12) with and without exploration, respectively. The admissibility of the two optimal strategies is straightforward to verify. ∎
3.2. Exploration cost and exploration effect
The exploration cost of a general RL problem is defined as the difference between the expected utility following the corresponding optimal control under the classical objective and under the exploratory objective, net of the value of the entropy. Note that the exploration cost is well defined only if both the classical and the exploratory problems are solvable. Let be the value function of the classical EU problem. Below, we often write as shorthand when there is no confusion. The exploration cost at time is defined by
(3.6) |
The first term is the classical value function at time . The second term is the expected utility (before being regularized) of the exploration problem at time .
Proposition 3.1.
Assume that one of the statements in Theorem 3.2 holds. Then, the exploration cost is given by .
Proof.
Observe that the exploration cost converges to zero as . Since the exploration weight has been taken as an exogenous parameter reflecting the level of exploration desired by the learning agent, it is intuitive to expect that the smaller is, the more emphasis is placed on exploitation. Moreover, when is sufficiently close to zero, the exploratory formulation approaches the problem without exploration. The following theorem confirms the desirable result that the entropy-regularized EU problem converges to its classical EU counterpart when the exploration weight goes to zero.
Theorem 3.3.
Assume that one of the statements in Theorem 3.2 holds. Then, for each , it holds that
(3.9) |
where is the Dirac measure at the Merton strategy. Furthermore, for each ,
3.3. Policy improvement
The following policy improvement theorem is crucial for interpretable RL algorithms, as it ensures that the iterated value functions are non-decreasing and converge to the optimal value function. Below, for a given policy (not necessarily Gaussian), we denote the corresponding value function by
(3.10) |
Theorem 3.4.
For a given admissible policy (not necessarily Gaussian), we assume that the corresponding value function is and satisfies for any . Suppose furthermore that the feedback policy defined by
(3.11) |
is admissible. Let be the value function corresponding to this new (Gaussian) policy . Then,
(3.12) |
Note that Theorem 3.4 holds true for a general utility function . One important implication of Theorem 3.4 is that for any given (not necessarily Gaussian) policy , there are always policies in the Gaussian family that improve the value function of (i.e. provide higher expected utility values). Therefore, it is natural to focus on Gaussian policies when choosing an initial exploration distribution. Note that the optimal Gaussian policy given by (3.2) also suggests that a candidate initial feedback policy may take the form for some real number . As shown below, such a choice leads to fast convergence of both the value functions and the policies in a finite number of iterations.
Theorem 3.5.
Let with some real number and consider the logarithmic utility function . Define the sequence of feedback policies updated by the policy improvement scheme (3.11), i.e.,
(3.13) |
where is the value function corresponding to the policy defined by
(3.14) |
Then,
(3.15) |
and
(3.16) |
which is the optimal value function given by (3.1).
As shown in Appendix A.3, starting with a Gaussian policy, the optimal solution of the exploratory EU problem with logarithmic utility is attained rapidly (after one update).
Remark 3.2.
It is valuable to extend Theorem 3.5 to contexts with general utility functions. However, obtaining closed-form solutions of the resulting highly nonlinear average PDEs (see Appendix A.3) is not feasible. Note that even the existence of a solution to the next-step average PDE is not obvious and would require a thorough analysis, including asymptotic expansions.
3.4. Policy evaluation and algorithm
Following Jia and Zhou (2022a, b), we consider in this section a policy evaluation procedure that takes the martingale property into account. Recall, with a slight abuse of notation, the value function
(3.17) |
where is a given learning (feedback) policy. By the Feynman–Kac theorem, it can be seen that solves (e.g. in the viscosity sense) the following average PDE
(3.18) |
where is the infinitesimal generator associated with , i.e.
(3.19) |
As in Jia and Zhou (2022a, b), the value function of a policy can be characterized by the following martingale property.
Theorem 3.6.
A function is the value function associated with the policy if and only if it satisfies the terminal condition , and
(3.20) |
is a martingale on .
It follows from Itô’s Lemma and Theorem 3.6 that for all adapted square-integrable processes satisfying . Equivalently,
(3.21) |
Such a process is called test function. Let be a parameterized family that is used to approximate , where . Here satisfies Assumption 3.1 below and our goal is to find the best parameter such that the martingale property in Theorem 3.6 holds. In this sense, the process
should be a martingale in with terminal value
As discussed in (Jia and Zhou, 2022a, page 18), the martingale condition of is further equivalent to a requirement that the process at any given time is the expectation of the terminal value conditional on all the information available at that time. A fundamental property of the conditional expectation then yields that minimizes the -error between and any -measurable random variables. Therefore, our objective is to minimize the martingale cost function defined by . In other words, for the policy evaluation (PE) procedure, we look at the following (offline) minimization problem
(3.22) |
and check the martingale orthogonal property
(3.23) |
The orthogonality condition (3.23) can be ensured by solving the following minimization problem, which is a quadratic form of (3.23):
(3.24) |
where is a positive definite matrix of suitable size (e.g. or ) for any test process . Our procedure is inspired by (Jia and Zhou, 2022a, page 22): we optimize with respect to the parameter , which is in general a vector. In fact, we must ensure that the parameterization satisfies the martingale orthogonality condition (3.23) by choosing a suitable test function . For numerical approximation methods involving a parametric family , in principle we need at least as many equations among the martingale orthogonality conditions as there are parameters in order to fully determine ; consequently, may be vector-valued, making (3.23) a vector-valued equation or, equivalently, a system of equations. A common choice of test process is to take . The following assumption is needed in such an approximation procedure, see e.g. Jia and Zhou (2022b). (There is a rich theory on the existence and regularity of solutions of general parabolic PDEs, see e.g. Friedman (2008). However, the average PDE (3.18) appears to be a new type of PDE for which further studies on well-posedness (existence and uniqueness) and regularity of solutions are needed; see Tang et al. (2022) for related discussions on a similar exploratory elliptic PDE.)
Assumption 3.1.
For all , is a with polynomial growth condition in . Moreover, is smooth in and its first two derivatives are and satisfy the polynomial growth condition in .
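To make the policy-evaluation step concrete, the following is a minimal sketch in Python, not the paper's exact algorithm. All names are hypothetical: `J(t, x, theta)` is a parametric value function satisfying the terminal condition, the test function is taken to be the gradient of `J` with respect to `theta` (the common choice mentioned above), and the routine performs one stochastic-approximation step on the empirical martingale orthogonality condition (3.23) along a sampled trajectory of the exploratory wealth process.

```python
import numpy as np

T, lam = 1.0, 0.1          # horizon and exploration weight (temperature); illustrative values


def J(t, x, theta):
    """Hypothetical parametric value function: log-wealth plus a linear-in-time term.
    It satisfies the terminal condition J(T, x) = log(x) by construction."""
    return np.log(x) + theta[0] * (T - t)


def grad_J(t, x, theta):
    """Gradient of J with respect to theta; also used as the test function."""
    return np.array([T - t])


def critic_update(theta, t_grid, X_path, entropy_path, lr=0.01):
    """One stochastic-approximation step on the empirical martingale orthogonality
    condition (3.23) along a single sampled trajectory of the exploratory wealth."""
    update = np.zeros_like(theta)
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        # Increment of the candidate martingale: change in J plus the accumulated
        # entropy reward lambda * H(pi_{t_k}) * dt of the executed policy.
        dM = (J(t_grid[k + 1], X_path[k + 1], theta)
              - J(t_grid[k], X_path[k], theta)
              + lam * entropy_path[k] * dt)
        update += grad_J(t_grid[k], X_path[k], theta) * dM
    return theta + lr * update
```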
Suppose that a PE step can be performed to obtain an estimate of the value function for a given admissible policy . If the policy can be parameterized by , it is possible to learn the corresponding value function by following a policy gradient (PG) step. Recall first that the value function satisfies the average PDE
(3.25) |
By differentiating (3.25) w.r.t. we obtain the following average PDE for the policy gradient
(3.26) |
with terminal condition , where
(3.27) |
By Feynman-Kac’s formula, admits the following representation
(3.28) |
Since can be neither observed nor computed without knowing the environment, it is important to replace it by a suitable approximation. To this end, using Itô's formula, the term can be replaced by . Taking into account the fact that the term is a martingale, we obtain the following.
Theorem 3.7.
Let be an admissible policy. The policy gradient admits the following representation
(3.29) |
Proof.
The proof is similar to that of Theorem 5 in Jia and Zhou (2022b), and hence is omitted. ∎
Note that for an arbitrary policy, the gradient of the corresponding value function given by the expectation (3.29) is not zero in general. As mentioned in Jia and Zhou (2022a, b), the terms inside the expectation above are computable only when the action trajectories and the corresponding state trajectories on are available. The sample is also needed to estimate the value function obtained in the previous PE step.
4. Constrained optimal investment problem with exploration
In this section, we extend the results obtained in Section 3 to settings with a convex portfolio constraint, see e.g. Cuoco (1997). In particular, we assume that the agent does not consume over her investment period and, in addition, due to regulatory reasons, her risky investment ratio is restricted to a given interval , where are real numbers. Observe that the well-known short-selling constraint can be included by taking , . If, in addition, , then both short-selling and money borrowing are prohibited. The case of a buying constraint can also be covered by setting . Clearly, we are back in the unconstrained setting when sending and . In such a constrained framework, the set of admissible investment strategies is now defined by
where is the non-exploration wealth process starting with for strategy . The objective is to maximize the terminal expected utility
(4.1) |
where is a utility function.
As in the unconstrained case, we look at the exploratory version of the wealth dynamics given by (2.5), with the exploration drift and exploration volatility defined by (2.6). Intuitively, the corresponding exploration setting is slightly adjusted to take the constraint into account. In particular, let be the set of admissible exploration distributions in that satisfy the following properties:
(1) For each , is a density function on .
(2) The mapping is measurable.
(3) For each , the exploration SDE (2.5) admits a unique strong solution denoted by , which is positive and
The exploration optimization is now stated by
(4.2) |
As before, the optimal value function satisfies the following HJB equation
(4.3) |
with terminal condition . Using again the standard argument of DPP we observe that under the portfolio constraint , the optimal feedback policy now follows a truncated Gaussian distribution.
Lemma 4.1.
In the exploratory constrained EU setting, if then the optimal distribution is a Gaussian distribution with mean and variance , truncated on interval , where
(4.4) |
The density of the optimal policy is given by
(4.5) |
where and are the PDF and CDF functions of the standard normal distribution, respectively.
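In practice, sampling actions from such a truncated Gaussian feedback policy is straightforward with standard scientific libraries. The sketch below uses scipy's `truncnorm`; the mean, standard deviation and bounds are hypothetical placeholders standing in for the quantities in (4.4).

```python
from scipy.stats import truncnorm

def sample_truncated_gaussian_policy(mean, std, a, b, size=1, seed=None):
    """Draw risky-asset weights from a Gaussian N(mean, std^2) truncated to [a, b].
    scipy's truncnorm expects the truncation bounds in standardized units."""
    alpha, beta = (a - mean) / std, (b - mean) / std
    return truncnorm.rvs(alpha, beta, loc=mean, scale=std, size=size, random_state=seed)

# Example: a policy centered at a hypothetical constrained Merton mean 0.55 with
# exploration-induced standard deviation 0.3, restricted to [0, 1]
# (no short-selling and no borrowing).
print(sample_truncated_gaussian_policy(0.55, 0.3, 0.0, 1.0, size=5, seed=42))
```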
Note that the first two moments and the entropy of a truncated Gaussian distribution can be computed explicitly, see e.g. Kotz et al. (2004). Substituting (4.5) back into the HJB equation (4.3), we obtain the following highly non-linear PDE
(4.6) |
with terminal condition , where by abusing notations,
(4.7) |
with
(4.8) |
4.1. Optimal policy with portfolio constraints
Below we show that the optimal solution to (4.6) can be given in explicit form for the case of logarithmic utility. We summarize the results in the following theorem.
Theorem 4.1 (Exploratory optimal investment under portfolio constraint).
For logarithmic utility , the optimal value function of the entropy-regularized exploratory constrained optimal investment problem is given by
(4.9) |
for , where
(4.10) |
Moreover, the optimal feedback distribution control is a Gaussian distribution with parameter and , truncated on the interval i.e.
(4.11) |
and the associated optimal wealth under is given by the following SDE
(4.12) |
where
(4.13) |
with, by abusing notations, , and
(4.14) |
Remark 4.1.
At any , the exploration variance decreases as increases, which means that exploration is less necessary in a more random market environment. However, in contrast to the unconstrained case, both the mean and the variance of the optimal distribution are controlled by the degree of exploration .
From Theorem 4.1 we can see that, in such a setting with portfolio constraints, the best control distribution for balancing exploration and exploitation is a truncated Gaussian, demonstrating once again the popularity of the Gaussian distribution in RL studies, even in constrained optimization settings.
Lemma 4.2.
The optimal value function of the regularized exploration constrained problem can be expressed as
(4.15) |
where
(4.16) |
which is the constrained optimal value function without exploration using the corresponding constrained optimal investment strategy
(4.17) |
where
and
Note that . The following lemma confirms the intuition that , which measures the exploration effect, converges to 0 as .
Lemma 4.3.
Consider defined by (4.10) for each . We have
(4.18) |
Proof.
It can be seen directly using e.g. the classical L’Hôpital’s rule. ∎
Similarly to Theorem 3.2 for the unconstrained case, it is straightforward to obtain the solvability equivalence between the classical and exploratory constrained EU problems.
4.2. Exploration cost and exploration effect under portfolio constraints
We now turn our attention to the exploration cost. As before, the exploration cost at time is defined by
(4.19) |
where and are the optimal value functions at time of the constrained EU problem with and without exploration (defined by (4.15) and (4.16), respectively). The integral term is the expected utility (before being regularized) of the exploration problem at time .
Proposition 4.1.
In the constrained problem with exploration, the exploration cost is given by
where, by abusing notations,
Moreover, .
Proof.
Recall first that the optimal distribution is a Gaussian distribution with mean and variance truncated on the interval . It is known, see e.g. Kotz et al. (2004), that for any Gaussian distribution truncated on , its entropy is given by
(4.20)  $\mathcal{H} = \ln\!\left(\sqrt{2\pi e}\,\hat\sigma\,Z\right) + \dfrac{\alpha\,\varphi(\alpha) - \beta\,\varphi(\beta)}{2Z},$
where $\alpha = (a-\hat\mu)/\hat\sigma$, $\beta = (b-\hat\mu)/\hat\sigma$ and $Z = \Phi(\beta)-\Phi(\alpha)$, with and the mean and standard deviation of the underlying (untruncated) Gaussian. The explicit form of the exploration cost now follows directly from Lemma 4.2. Finally, it is straightforward to verify that . ∎
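The closed-form entropy (4.20) is also convenient for numerical checks; the short sketch below (with illustrative, hypothetical parameter values) compares the formula against scipy's generic entropy computation for a truncated normal.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncated_normal_entropy(mu, sigma, a, b):
    """Entropy of N(mu, sigma^2) truncated to [a, b], cf. Kotz et al. (2004)."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    Z = norm.cdf(beta) - norm.cdf(alpha)
    return (np.log(np.sqrt(2.0 * np.pi * np.e) * sigma * Z)
            + (alpha * norm.pdf(alpha) - beta * norm.pdf(beta)) / (2.0 * Z))

mu, sigma, a, b = 0.55, 0.3, 0.0, 1.0   # illustrative values
alpha, beta = (a - mu) / sigma, (b - mu) / sigma
print(truncated_normal_entropy(mu, sigma, a, b))
print(truncnorm(alpha, beta, loc=mu, scale=sigma).entropy())  # should agree
```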
As shown above, the exploration cost converges to zero as . Since the exploration weight has been taken as an exogenous parameter reflecting the level of exploration desired by the learning agent, it is intuitive to expect that the smaller is, the more emphasis is placed on exploitation; and when is sufficiently close to zero, the exploratory formulation approaches the problem without exploration. The following lemma confirms the intuition that the exploration cost under a portfolio constraint is smaller than that without the constraint when the investment strategy is restricted to an interval.
Lemma 4.4.
If the investment strategy is restricted to , where , then for all . The same conclusion also holds for the case where or .
The following theorem confirms a desirable result that the entropy-regularized constrained EU problem converges to its non-exploratory constrained EU counterpart when the exploration weight goes to zero.
Theorem 4.2.
For each ,
(4.21) |
where the (without exploration) constrained Merton strategy is defined by (4.17). Furthermore, for each ,
4.3. Policy improvement
Below, for some given admissible (feedback) policy (not necessarily truncated Gaussian) on , we denote the corresponding value function by
(4.22) |
Theorem 4.3.
For a given policy (not necessarily Gaussian) on the interval , we assume that the corresponding value function is and satisfies for any . Suppose furthermore that the feedback policy defined by
(4.23) |
(Gaussian truncated on ) is admissible. Let be the value function corresponding to this new truncated (Gaussian) policy . Then,
(4.24) |
Proof.
The proof is similar to that of the unconstrained case hence, omitted. ∎
Theorem 4.3 also suggests that a candidate initial feedback policy may take the form for some parameters . Similarly to the unconstrained case, such a choice of policy leads to the convergence of both the value functions and the policies in a finite number of iterations.
Theorem 4.4.
Let with and assume that . Define the sequence of feedback policies updated by the policy improvement scheme (4.23), i.e.,
(4.25) |
where is the value function corresponding to the policy defined by
Proof.
The proof can be done using the Feynman–Kac representation and the update result of Theorem 4.3, similarly to the unconstrained problem. ∎
The above improvement theorem allows us to establish the policy evaluation step and the corresponding algorithm taking the martingale property into account. Since the steps are similar to those for the unconstrained problem in Section 3.4, we omit the details.
5. Learning implementation and numerical example
We are now in a position to discuss an RL algorithm for this problem. A common practice in the RL literature is to represent the value function and the policy using (deep) neural networks. In this section, we do not follow this path. Inspired by the (offline) actor-critic approach in Jia and Zhou (2022b), we instead take advantage of the explicit parametric forms of the optimal value function and the improved policy given in Theorem 3.2. This will in turn facilitate the learning process and lead to faster learning and convergence.
Below, to implement a workable algorithm, one can approximate a generic expectation in the following manner. Let be a partition of the finite interval . We then collect a sample as follows: for , the initial sample point is . Next, at each , is sampled to obtain an allocation for the risky asset. The wealth at the next time instant is computed by equation (2.5). As a result, a generic expectation can be approximated by the following finite sum
(5.1) |
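A minimal sketch of this simulation scheme is given below (hypothetical market parameters and policy; not the paper's exact implementation). It advances the wealth with an Euler step on the chosen grid, re-sampling the risky allocation from a generic policy at every date, and approximates a generic expectation by the sample average as in (5.1).

```python
import numpy as np

r, mu, sigma = 0.02, 0.05, 0.3      # hypothetical market parameters
T, n_steps, n_paths, x0 = 1.0, 252, 1000, 1.0

def simulate_exploratory_wealth(sample_action, rng):
    """Euler simulation of the wealth process, with the risky fraction
    re-sampled from the (exploratory) policy at every grid point."""
    dt = T / n_steps
    t_grid = np.linspace(0.0, T, n_steps + 1)
    X = np.full(n_paths, x0)
    paths = [X.copy()]
    for k in range(n_steps):
        u = sample_action(t_grid[k], X, rng)          # one sampled action per path
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X = X + X * (r + u * (mu - r)) * dt + X * u * sigma * dW
        paths.append(X.copy())
    return t_grid, np.array(paths)

# Example: unconstrained Gaussian policy centered at the Merton fraction.
lam = 0.1
merton = (mu - r) / sigma**2
policy = lambda t, x, rng: rng.normal(merton, np.sqrt(lam) / sigma, size=x.shape)

rng = np.random.default_rng(1)
t_grid, paths = simulate_exploratory_wealth(policy, rng)
print("E[log X_T] approx.", np.mean(np.log(paths[-1])))   # finite-sum approximation (5.1)
```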
5.1. Implementation for the unconstrained problem
Recall first that the exploratory optimal value function is given by
(5.2) |
We learn it from the following parametric form
(5.3) |
which clearly satisfies Assumption 3.1. It follows from (5.2) that
(5.4) |
Recall that for a Gaussian distribution its entropy is given by . Since we do not know the model’s parameters and , we will learn them using Theorem 3.2 with the parametric policy It follows that
(5.5) |
Also,
Therefore
Moreover, it can be seen from (5.5) that and Finally, in such a learning framework, we choose the test function .
For the learning step, we adopt the (offline) actor-critic approach of Jia and Zhou (2022b), which is summarized in Algorithm 1 below.
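The following compact sketch illustrates such an offline actor-critic iteration for the unconstrained logarithmic-utility problem, with hypothetical parameter names: `theta` are the critic parameters of a parametric value function (it reuses `critic_update` and the module-level constants `T`, `lam` from the sketch in Section 3.4), and `phi = (mean, variance)` parameterizes a Gaussian policy. A generic score-function policy-gradient update with an explicit entropy-gradient term is used in place of the exact representation (3.29); it is an illustration, not the paper's Algorithm 1.

```python
import numpy as np

def gaussian_score(u, phi):
    """Score of a Gaussian policy N(phi[0], phi[1]) with respect to (mean, variance)."""
    m, v = phi
    return np.array([(u - m) / v, 0.5 * ((u - m) ** 2 / v**2 - 1.0 / v)])

def actor_critic(n_episodes=2000, lr_theta=0.05, lr_phi=0.005, seed=0):
    """Offline actor-critic sketch for the unconstrained log-utility problem.
    Critic: parametric value function updated via `critic_update` (see Section 3.4 sketch).
    Actor: Gaussian policy N(phi[0], phi[1]), updated with a score-function
    policy-gradient surrogate plus the explicit entropy-gradient term."""
    rng = np.random.default_rng(seed)
    theta = np.array([0.0])                  # critic parameters
    phi = np.array([0.3, 1.0])               # actor parameters: mean and variance
    n_steps = 50
    dt = T / n_steps                         # T, lam: module-level constants from Section 3.4
    r, mu, sigma, x0 = 0.02, 0.05, 0.3, 1.0  # used only by the simulated "environment"
    for _ in range(n_episodes):
        # Sample one trajectory under the current Gaussian policy.
        t_grid = np.linspace(0.0, T, n_steps + 1)
        X, actions = [x0], []
        for k in range(n_steps):
            u = rng.normal(phi[0], np.sqrt(phi[1]))
            dW = rng.normal(0.0, np.sqrt(dt))
            X.append(X[-1] * (1.0 + (r + u * (mu - r)) * dt + u * sigma * dW))
            actions.append(u)
        X = np.array(X)
        entropy = np.full(n_steps, 0.5 * np.log(2.0 * np.pi * np.e * phi[1]))
        # Critic step: martingale-based policy evaluation.
        theta = critic_update(theta, t_grid, X, entropy, lr=lr_theta)
        # Actor step: sampled regularized utility times the total score,
        # plus the direct gradient of the entropy bonus.
        G = np.log(X[-1]) + lam * entropy.sum() * dt
        total_score = sum(gaussian_score(u, phi) for u in actions)
        grad_entropy = np.array([0.0, 0.5 / phi[1]])
        phi = phi + lr_phi * (G * total_score + lam * T * grad_entropy)
        phi[1] = max(phi[1], 1e-4)           # keep the variance positive
    return theta, phi
```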
5.2. Implementation for the constrained problem
First, recall from (4.9) that the exploratory optimal value function is given by
for , where is given by (4.10). We learn it from the following parametric form
(5.6) |
which clearly satisfies Assumption 3.1. It follows from (5.6) that
(5.7) |
Again, we parametrize the policy from Theorem 4.1 as follows:
Also recall that for , a Gaussian distribution truncated on , its entropy is given by
Also,
Therefore
The partial derivatives of can be obtained explicitly.
Similarly to the unconstrained case, the learning step can be carried out using the actor-critic algorithm adapted to the constrained problem, see Algorithm 2 below.
5.3. Numerical demonstration
We provide an example to demonstrate our results under a portfolio constraint in a setting with and . The annualized interest rate is . We choose to simulate sample paths of the diffusion wealth process (6.1) based on the updated policy after each episode. First, for we plot the corresponding exploration cost function for both the constrained and unconstrained cases. Figure 1 clearly confirms the theoretical results obtained in Proposition 3.1. Interestingly, the exploration cost of the unconstrained case () is much larger than that of the constrained case (). Consistent with Lemma 4.4, this can be explained by the fact that, in the constrained case, one only needs to search for the optimal strategy in a finite domain, as opposed to an infinite domain in the unconstrained case.
[Figure 1: exploration cost for the constrained and unconstrained cases; left panel: effect of the lower portfolio bound, right panel: effect of the upper portfolio bound.]
The left (resp. right) panel of Figure 1 demonstrates the impact of the lower (resp. upper) portfolio bound on the exploration cost. Clearly, as the lower (resp. upper) bound of the portfolio decreases (resp. increases), the exploration cost increases. This finding aligns with the intuitive understanding that a broader domain of investment opportunities necessitates higher exploration expenses. Notably, when subjected to both short-selling (i.e., ) and money borrowing constraints (i.e., ), the exploration cost becomes negligible compared to the unconstrained case.
Recall from Theorem 3.1 that the difference between the constrained value function and the exploratory constrained value function is determined by the exploration rate . In Figure 2, we plot the difference for (left panel) and (right panel). It is clear from Figure 2 that the difference is substantial for a large exploration rate but decreases greatly for a small exploration rate.
[Figure 2: difference between the constrained value function and the exploratory constrained value function for a large (left panel) and a small (right panel) exploration rate.]
Next, we choose the learning rates used to update and to be and , respectively, with decay rate . From the closed-form formula, the true value of is given by . For each learning episode, we start with an initial guess
where are independent and generated from a (Gaussian noise) normal distribution. For each iteration, we randomly sample 1-year trajectories of the wealth process . The gradients of the value function are calculated by the approximate discrete sum (5.1) above. The model is trained for iterations. The results reported in this section are averages over training episodes. The portfolio strategy is limited by the lower bound (i.e. short-selling is not allowed) and the upper bound (i.e. money borrowing is also prohibited). Finally, we choose the test function .
[Figure 3: learned policy versus the true optimal truncated Gaussian policy for different exploration rates.]
Figure 3 reports the learned policy in comparison with the true optimal policy, which is a Gaussian distribution truncated on . It can be observed that the smaller the value of the exploration rate is, the more closely the learned optimal strategy matches the ground-truth optimal policy. As shown in Figure 4, when is small, the (truncated Gaussian) optimal policy gets closer to the Dirac distribution at the “constrained” Merton strategy defined in (4.17), which is consistent with the result obtained in Theorem 4.4.
[Figure 4: optimal truncated Gaussian policy for small exploration rates, approaching the Dirac measure at the constrained Merton strategy.]
To have a more concrete comparison between the true optimal policy and the learned optimal policy, in Table 1 we compare the true mean, the learned mean, and the standard deviation of the learned mean. In addition, we report the (empirical) Kullback–Leibler divergence (see e.g. Csiszár (1975)) from the learned policy to the true policy for different values of , see the last column.
Exploration rate | True mean | Learned mean | std | KL divergence
---|---|---|---|---
0.01 | 0.555556 | 0.555927 | 0.000005 | 0.000000
0.1 | 0.555556 | 0.554679 | 0.000007 | 0.000000
1 | 0.555556 | 0.555750 | 0.000189 | 0.000052
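The empirical Kullback–Leibler divergence reported in the last column can be estimated by Monte Carlo; a minimal sketch (with hypothetical learned and true parameters, and an assumed common standard deviation) is given below.

```python
import numpy as np
from scipy.stats import truncnorm

def kl_truncnorm_mc(p_params, q_params, bounds, n=100_000, seed=0):
    """Monte Carlo estimate of KL(p || q) for two Gaussians truncated to `bounds`.
    Each params tuple is (mean, std); p = learned policy, q = true policy."""
    a, b = bounds
    def frozen(mean, std):
        return truncnorm((a - mean) / std, (b - mean) / std, loc=mean, scale=std)
    p, q = frozen(*p_params), frozen(*q_params)
    x = p.rvs(size=n, random_state=seed)
    return np.mean(p.logpdf(x) - q.logpdf(x))

# Hypothetical values in the spirit of Table 1 (assumed common std of 0.5 on [0, 1]).
print(kl_truncnorm_mc((0.555750, 0.5), (0.555556, 0.5), bounds=(0.0, 1.0)))
```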
Next we consider the dependence of on the number of iterations used. To ease the comparison, we consider the difference as a function of the number of iterations. More specifically, we run Algorithm 1 to learn the coefficients from data. We then calculate the difference . In Figure 5, we report the difference averaged over 20 runs.
[Figure 5: difference between the optimal and learned value functions as a function of the number of iterations (average over 20 runs).]
It is clear from Figure 5 that the error between the optimal value function and the learned value function decreases as the number of iterations increases.
To have a more concrete comparison, we report the true parameters and versus the corresponding learned values in Table 2. As shown in Table 2, the quality of learning improves as the number of iterations increases.
Iteration | True | Learned | True | Learned |
---|---|---|---|---|
500 | (0.055929,-0.001497) | (0.052141,-0.001261) | (0.555556,2.407946) | (0.552537,2.407276) |
1000 | (0.055929,-0.001497) | (0.057109 ,-0.001836) | (0.555556,2.407946) | (0.554388,2.413809 ) |
2000 | (0.055929,-0.001497) | (0.054673,-0.003369 ) | (0.555556,2.407946) | (0.555207,2.410082) |
3000 | (0.055929,-0.001497) | (0.057058,-0.004642 ) | (0.555556,2.407946) | (0.555872,2.408604) |
4000 | (0.055929,-0.001497) | (0.054119 ,0.001334) | (0.555556,2.407946) | (0.554001,2.409800) |
5000 | (0.055929,-0.001497) | (0.055721,-0.000253) | (0.555556,2.407946) | (0.557341,2.404103) |
10000 | (0.055929,-0.001497) | (0.056456,-0.002821) | (0.555556,2.407946) | (0.554809,2.405365) |
Lastly in this section, we compare the effect of the exploration parameter on the distribution of the optimal wealth process. In Figure 6 we plot the distribution of the optimal wealth process at time using the dynamics in (4.12) and a sample size of . From Figure 6, it can be seen that the exploration parameter plays a central role in the distribution of the constrained optimal wealth process. We can clearly distinguish the difference between the case with exploration (i.e. ) and the case without exploration (i.e. ). In particular, exploration leads to a more dispersed distribution with a heavier tail. Again, as confirmed by the theory in the previous section, this effect becomes less significant as the exploration parameter decreases.
[Figure 6: distribution of the constrained optimal terminal wealth for different exploration parameters, including the case without exploration.]
6. Optimal portfolio with portfolio constraints and exploration for quadratic utility
We consider in this section the case of a quadratic utility function. It is worth noting that in portfolio theory the use of quadratic utility functions is very close to the classic Markowitz mean-variance portfolio, and the reader is referred to, e.g., Duffie and Richardson (1991); Bodnar et al. (2013) for discussions of the connections between three quadratic optimization problems: the Markowitz mean–variance problem, the mean–variance utility function, and the quadratic utility. Note that, unlike the mean-variance (MV) solution, the optimal solution under a quadratic utility function is not only time-consistent but, as shown below, also lies on the MV efficient frontier under mild conditions. Moreover, while the problem becomes extremely complex even without exploration (e.g. Li et al. (2002); Bielecki et al. (2005); Li and Xu (2016)) when a portfolio constraint is added to an MV problem, we are still able to obtain closed-form solutions when portfolio constraints are included in our exploratory learning setting with a quadratic utility. Note that for a quadratic utility function the wealth process can be negative. Therefore, below we consider the amount of wealth invested in the risky asset at time . To include portfolio constraints, we assume that, given the current wealth , the risky investment amount at time is bounded by , where (resp. ) is a deterministic continuous function defining the lower (resp. upper) portfolio bound. The wealth dynamics follows the SDE
(6.1) |
The set of admissible investment strategies is now defined by
Our objective is to maximize the terminal expected utility
(6.2) |
where , with parameter . The constant reflects the agent’s risk aversion and can be regarded as the risk aversion parameter in mean–variance analysis. We remark that the quadratic utility function has its global maximum at the so-called bliss point . Obviously, the quadratic utility function is symmetric with respect to the bliss point : it is increasing for and decreasing for , and the utility values at and at are equal.
(6.3) |
where
(6.4) |
and is a probability density on the interval . The exploratory optimization is now stated by
(6.5) |
where is the set of admissible feedback policies that satisfy the following properties:
(1) For each , is a density function on .
(2) The mapping is measurable.
(3) The exploration SDE (6.3) admits a unique strong solution denoted by and
As before, the optimal value function satisfies the following HJB equation
(6.6) |
with terminal condition . Using again the standard argument of DPP we observe that under the portfolio constraint , the optimal feedback policy now follows a truncated Gaussian distribution.
Lemma 6.1.
In the exploratory constrained EU setting with quadratic utility function and , the optimal feedback policy is a Gaussian distribution with mean and variance truncated on interval , where
(6.7) |
The density of the optimal policy is given by
(6.8) |
where and are the PDF and CDF functions of the standard normal distribution, respectively.
Substituting (6.8) back into the HJB equation (6.6), we obtain the following non-linear PDE
(6.9) |
with terminal condition , where by abusing notations,
(6.10) |
with
(6.11) | |||
(6.12) |
In the rest of this section we assume that the portfolio bounds are given by
where, as before , and are two time-varying continuous bounded functions with .
Theorem 6.1 (Quadratic utility-exploratory optimal investment under portfolio constraint).
The optimal value function of the entropy-regularized exploratory constrained optimal investment problem with quadratic utility is given by
(6.13) | ||||
(6.14) |
for , where is the Sharpe ratio, ,
(6.15) |
where
Moreover, the optimal feedback distribution control is a Gaussian distribution with parameter and , truncated on the interval i.e.
(6.16) |
Corollary 6.1 (Unconstrained exploratory quadratic portfolio).
The optimal value function of the entropy-regularized exploratory unconstrained optimal investment problem for the quadratic utility is given by
(6.17) | ||||
(6.18) |
for . Moreover, the unconstrained optimal feedback distribution control is a Gaussian distribution with parameter and .
Proof.
This is a direct consequence of Theorem 6.1 by letting and . ∎
Remark 6.1.
Compared to the mean-variance results obtained in Wang and Zhou (2020), the risk-reward parameter plays the role of the Lagrange multiplier (in their notation) in the mean-variance setting. It should be mentioned that, instead of considering the true portfolio, the authors in Wang and Zhou (2020) study the mean-variance problem for the discounted portfolio, which can be obtained by assuming in our setting. Note that the variance of our optimal policy differs from the one obtained in Wang and Zhou (2020) by a multiplicative factor , since our quadratic utility is nothing but the mean-variance utility in Wang and Zhou (2020) scaled by .
Letting in Corollary 6.1, we obtain the unconstrained optimal portfolio without exploration for a quadratic utility function.
Corollary 6.2 (Unconstrained quadratic portfolio).
For a quadratic utility , the optimal value function of unconstrained optimal investment problem without exploration is given by
(6.19) |
and the unconstrained optimal control strategy is given by .
Following arguments similar to those used in Proposition 4.1 and Lemma 4.4, we can confirm that the exploration cost of the constrained problem is smaller than that of the unconstrained problem for a quadratic utility function.
Proposition 6.1.
In the constrained problem with exploration and quadratic utility function, the exploration cost is given by
Moreover,
Similarly to Theorems 4.3-4.4 for logarithmic utility, a policy improvement theorem can be shown for quadratic utility functions. Below we show that the optimal solution of the exploratory quadratic utility problem is mean-variance efficient (see e.g. Luenberger (1998); Kato et al. (2020); Duffie and Richardson (1991) for similar discussions in the case without exploration).
Proposition 6.2.
Assume that the agent’s initial wealth is smaller than the discounted reward level , i.e. . Then, the exploratory unconstrained optimal portfolio belongs to the mean-variance frontier.
Remark 6.2.
Note that since the exploration parameter only affects the diffusion term, we can see that Proposition 6.2 also holds for the unconstrained case without exploration. Moreover, it can be shown that Proposition 6.2 still holds true for the constrained case with exploration. Indeed, recall the optimal constrained exploratory solution whose dynamics are given by
where , are given by (C.5). Note that the second term of in (C.5) is negative because
Using the comparison principle of ordinary differential equations we can conclude that , which implies that the optimal solution of the expected quadratic utility problem with both portfolio constraints and exploration lies on the (constrained) mean-variance frontier.
6.1. Implementation
We consider the case without constraint. Recall the value function
(6.20) | ||||
(6.21) |
for . We choose to parametrize the value function as
That is,
From here the test functions are chosen as
(6.22) |
Also, recall the unconstrained optimal feedback distribution control is a Gaussian distribution with parameter and . For the policy distribution, we parametrize it as follows
From here, it can be seen that the entropy is given by
The log-likelihood is given by
Also, the derivatives of the log-likelihood are given by
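A minimal sketch of such a parametrized Gaussian policy is given below, collecting the entropy, log-likelihood and log-likelihood gradients used in the actor update. The constant-parameter form and the names phi1 (mean) and phi2 (log-variance) are illustrative assumptions, not the paper's exact parametrization.

```python
import numpy as np

class GaussianPolicy:
    """Gaussian feedback policy pi(u | t, x) = N(mean, var).

    The parametrization (constant mean phi1 and constant log-variance phi2)
    is an illustrative assumption; the time argument t is kept only to match
    the feedback-policy interface and is unused here.
    """

    def __init__(self, phi1=0.0, phi2=0.0):
        self.phi = np.array([phi1, phi2], dtype=float)

    def mean_var(self, t):
        mean = self.phi[0]
        var = np.exp(self.phi[1])          # keep the variance positive
        return mean, var

    def sample(self, t, rng):
        m, v = self.mean_var(t)
        return rng.normal(m, np.sqrt(v))

    def entropy(self, t):
        # Differential entropy of N(m, v): 0.5 * log(2 * pi * e * v).
        _, v = self.mean_var(t)
        return 0.5 * np.log(2.0 * np.pi * np.e * v)

    def log_likelihood(self, u, t):
        m, v = self.mean_var(t)
        return -0.5 * np.log(2.0 * np.pi * v) - (u - m) ** 2 / (2.0 * v)

    def grad_log_likelihood(self, u, t):
        # Gradients w.r.t. (phi1, phi2) used in the policy-gradient update.
        m, v = self.mean_var(t)
        d_mean = (u - m) / v                       # d/d phi1
        d_logvar = 0.5 * ((u - m) ** 2 / v - 1.0)  # d/d phi2
        return np.array([d_mean, d_logvar])
```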
6.2. Numerical example
We illustrate the results for the quadratic utility function. Specifically, we consider the scenario without constraints; the chosen model parameters satisfy the required condition. First, we run the algorithm for a fixed number of iterations; the learning rates used to update the value function and policy parameters are chosen with a suitable decay rate, and a fixed exploration rate is used. Figure 8 plots the evolution of the learned coefficients, whose true values are known in closed form.
In this case, it is evident from Figure 8 that the learned parameter vector converges to its true counterpart. Next, recall from Corollary 6.1 that the optimal feedback distribution control is a Gaussian distribution with the mean and variance given there. To have a concrete comparison, we run the algorithm to obtain the parameter estimates; using these estimates, we plot in Figure 8 the density of the true optimal control against the density of the learned optimal control for various exploration rates.
It is clear from Figure 8 that the learned policies match the true policies extremely well for the different exploration rates. This once again confirms the theoretical findings presented in this section. We remark that, compared to the existing literature on the exploratory MV problem studied in e.g. Wang and Zhou (2020); Dai et al. (2020); Jia and Zhou (2022a, b), which has to deal with an additional Lagrange multiplier, our numerical implementation for the quadratic utility is much simpler thanks to our closed-form solutions. As mentioned above, since the classical MV problem is time-inconsistent, the optimal value and strategy might only be meaningful when computed at the initial time. In our case, the optimal investment strategy and value function can be learned at any time over the investment horizon. Finally, we remark that the above learning procedure can be extended to the constrained problem by slightly adapting the implementation, as was done for the logarithmic utility. Since the paper is already rather lengthy, we do not present this implementation here.
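To make the learning loop concrete, the following REINFORCE-style sketch (not the paper's martingale-based algorithm) learns an exploratory Gaussian policy. For transparency it uses the logarithmic utility, for which the optimal mean is the Merton fraction (mu - r)/sigma^2; all market parameters, learning rates and the constant-parameter policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, r, sigma = 0.08, 0.02, 0.3
T, n_steps = 1.0, 50
dt = T / n_steps
lam = 0.1                       # exploration (temperature) weight
x0 = 1.0

phi = np.array([0.0, -2.0])     # policy N(phi[0], exp(phi[1]))
lr = 0.01
baseline = 0.0                  # running baseline for variance reduction

for episode in range(20000):
    m, v = phi[0], np.exp(phi[1])
    x = x0
    score = np.zeros(2)
    for _ in range(n_steps):
        u = rng.normal(m, np.sqrt(v))          # sampled portfolio weight
        # Classical wealth step under the sampled action (Euler scheme).
        x *= 1.0 + (r + u * (mu - r)) * dt + u * sigma * np.sqrt(dt) * rng.standard_normal()
        x = max(x, 1e-8)                       # keep wealth positive for log utility
        # Accumulate the score function d log pi / d phi.
        score += np.array([(u - m) / v, 0.5 * ((u - m) ** 2 / v - 1.0)])
    reward = np.log(x)
    baseline += 0.01 * (reward - baseline)
    # Likelihood-ratio gradient of E[U(X_T)], plus the direct gradient of the
    # entropy bonus lam * T * H(pi), where dH/dphi = (0, 0.5).
    grad = (reward - baseline) * score + lam * T * np.array([0.0, 0.5])
    phi += lr * grad

print("learned mean weight:", phi[0], " Merton fraction:", (mu - r) / sigma ** 2)
```

The learned mean should drift toward the Merton fraction, while the entropy bonus keeps the learned variance bounded away from zero; convergence is slow and noisy, as expected for a plain score-function estimator.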
7. An extension to random coefficient models
This section shows that our model can be extended beyond the classical Black-Scholes model. In particular, we assume that the drift and the volatility of the risky asset are driven by a state process as follows:
(7.1)
and
(7.2)
where the driving Brownian motions are independent. We assume that the coefficient functions are globally Lipschitz and linearly bounded, so that there exists a unique strong solution to the factor SDE (7.2). It is worth mentioning that, due to the additional source of randomness, the market is incomplete. Such a factor model has been well studied in the literature. Now, the exploratory optimal value function is defined by
where, as before, the exploratory wealth process and the set of admissible distributions are defined analogously to the previous sections. Hence, the optimal value function satisfies the following HJB equation
(7.3)
with the appropriate terminal condition. Following the same steps as in Section 3, the optimal distribution is given by
(7.4)
which is Gaussian, with mean and variance (well defined under the assumptions below) given by
(7.5)
As before, using the basic properties of Gaussian laws we obtain the following PDE
(7.6)
with the appropriate terminal condition. Compared to the previous sections, we now have to deal with a two-dimensional PDE. To solve (7.6), we seek an ansatz in product form, where the unknown factor is a smooth function of time and the factor variable only. For this ansatz we obtain the following PDE:
(7.7)
where .
Theorem 7.1.
Assume that the coefficient functions are bounded and differentiable with bounded derivatives, and that
(7.8)
Then the value function constructed from the solution of the PDE (7.7) via the ansatz above is the optimal value function of the exploratory problem. Moreover, the optimal distribution control is Gaussian with the mean and variance given in (7.5), and the exploratory wealth process is given by
(7.9)
Proof.
Changing variables, we can rewrite the PDE (7.7) as
(7.10)
By assumption, the coefficients are continuous and bounded, and the Cauchy problem (7.10) satisfies all conditions of Theorems 12 and 17 (Chapter 1) in Friedman (2008). Therefore, there exists a unique solution in the relevant class. It is now straightforward to see that the corresponding product ansatz satisfies (7.6). Note that, by the Feynman-Kac theorem, the function can be represented as
(7.11)
where
(7.12)
and the expectation in (7.11) is taken under an auxiliary probability measure, with respect to which the corresponding driving process is a Brownian motion. By following the same steps as in the proof of Theorem 3.1, we can show that (7.4) is admissible and hence is an optimal policy. The corresponding optimal exploratory wealth process is given by
(7.13)
∎
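As a hedged illustration of the factor dynamics (7.1)-(7.2), the following sketch simulates the risky asset and the factor process with an Euler-Maruyama scheme driven by independent Brownian motions; the coefficient functions and all numerical values are placeholders, not the paper's specification.

```python
import numpy as np

# Euler-Maruyama simulation of a stochastic-factor market of the type in
# (7.1)-(7.2). The coefficients below (an Ornstein-Uhlenbeck-like factor
# driving drift and volatility) are illustrative placeholders.

rng = np.random.default_rng(1)

def mu_fn(y):    return 0.03 + 0.05 * np.tanh(y)   # drift of the risky asset
def sigma_fn(y): return 0.20 + 0.05 * np.tanh(y)   # volatility, bounded and positive
def m_fn(y):     return -1.0 * y                   # factor drift (mean reversion to 0)
def nu_fn(y):    return 0.3                        # factor volatility

T, n_steps = 1.0, 250
dt = T / n_steps
s, y = 1.0, 0.0

path_s, path_y = [s], [y]
for _ in range(n_steps):
    dw, db = rng.standard_normal(2) * np.sqrt(dt)  # independent Brownian increments
    s += s * (mu_fn(y) * dt + sigma_fn(y) * dw)
    y += m_fn(y) * dt + nu_fn(y) * db
    path_s.append(s)
    path_y.append(y)

print("terminal asset price:", path_s[-1], " terminal factor:", path_y[-1])
```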
Remark 7.1.
- When the factor dependence vanishes (i.e., the drift and volatility are constant), we are back to the classical Black-Scholes model in Section 3.
8. Conclusion
We study an exploratory version of the continuous-time expected utility maximization problem with reinforcement learning. We show that the optimal feedback policy of the unconstrained exploratory problem is Gaussian. However, when the risky investment ratio is restricted to a given interval, the constrained optimal exploratory policy follows a truncated Gaussian distribution. For logarithmic and quadratic utility functions, the solution to the exploratory problem can be obtained in closed form and converges to the classical expected utility counterpart when the exploration weight goes to zero. For interpretable RL algorithms, a policy improvement theorem is provided. Finally, we devise an implementable reinforcement learning algorithm by casting the optimal problem in a martingale framework. Our work can be extended in various directions. For example, it would be interesting to consider both consumption and investment with general utility functions in more general market settings. We foresee that the q-learning framework developed in Jia and Zhou (2023) might be useful in such a setting. It would also be interesting to extend the work to the higher-dimensional case.
We leave these exciting research problems for future studies.
Acknowledgements
Thai Nguyen acknowledges the support of the Natural Sciences and Engineering Research Council of Canada [RGPIN-2021-02594].
References
- Baird (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
- Barnard (1993) Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357–365.
- Bertsekas (2019) Bertsekas, D. (2019). Reinforcement learning and optimal control. Athena Scientific.
- Bertsimas and Thiele (2006) Bertsimas, D. and Thiele, A. (2006). A robust optimization approach to inventory theory. Operations Research, 54(1):150–168.
- Bielecki et al. (2005) Bielecki, T. R., Jin, H., Pliska, S. R., and Zhou, X. Y. (2005). Continuous-time mean-variance portfolio selection with bankruptcy prohibition. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 15(2):213–244.
- Bodnar et al. (2013) Bodnar, T., Parolya, N., and Schmid, W. (2013). On the equivalence of quadratic optimization problems commonly used in portfolio theory. European Journal of Operational Research, 229(3):637–644.
- Charpentier et al. (2021) Charpentier, A., Elie, R., and Remlinger, C. (2021). Reinforcement learning in economics and finance. Computational Economics, pages 1–38.
- Chen and Vellekoop (2017) Chen, A. and Vellekoop, M. (2017). Optimal investment and consumption when allowing terminal debt. European Journal of Operational Research, 258(1):385–397.
- Csiszár (1975) Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, pages 146–158.
- Cuoco (1997) Cuoco, D. (1997). Optimal consumption and equilibrium prices with portfolio constraints and stochastic income. Journal of Economic Theory, 72(1):33–73.
- Dai et al. (2020) Dai, M., Dong, Y., and Jia, Y. (2020). Learning equilibrium mean-variance strategy. Available at SSRN 3770818.
- Dai et al. (2023) Dai, M., Dong, Y., and Jia, Y. (2023). Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4):1166–1212.
- Dai et al. (2021) Dai, M., Jin, H., Kou, S., and Xu, Y. (2021). A dynamic mean-variance analysis for log returns. Management Science, 67(2):1093–1108.
- Donsker and Varadhan (2006) Donsker, M. and Varadhan, S. (2006). Large deviations for Markov processes and the asymptotic evaluation of certain Markov process expectations for large times. In Probabilistic Methods in Differential Equations: Proceedings of the Conference Held at the University of Victoria, August 19–20, 1974, pages 82–88. Springer.
- Doya (2000) Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245.
- Duffie and Richardson (1991) Duffie, D. and Richardson, H. R. (1991). Mean-variance hedging in continuous time. The Annals of Applied Probability, pages 1–15.
- El Karoui and Jeanblanc-Picqué (1998) El Karoui, N. and Jeanblanc-Picqué, M. (1998). Optimization of consumption with labor income. Finance and Stochastics, 2:409–440.
- Fleming (1976) Fleming, W. H. (1976). Generalized solutions in optimal stochastic control. Brown Univ.
- Fleming and Nisio (1984) Fleming, W. H. and Nisio, M. (1984). On stochastic relaxed control for partially observed diffusions. Nagoya Mathematical Journal, 93:71–108.
- Fleming and Soner (2006) Fleming, W. H. and Soner, H. M. (2006). Controlled Markov processes and viscosity solutions, volume 25. Springer Science & Business Media.
- Frémaux et al. (2013) Frémaux, N., Sprekeler, H., and Gerstner, W. (2013). Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS computational biology, 9(4):e1003024.
- Friedman (2008) Friedman, A. (2008). Partial differential equations of parabolic type. Courier Dover Publications.
- Gerrard et al. (2023) Gerrard, R., Kyriakou, I., Nielsen, J. P., and Vodička, P. (2023). On optimal constrained investment strategies for long-term savers in stochastic environments and probability hedging. European Journal of Operational Research, 307(2):948–962.
- Gosavi (2009) Gosavi, A. (2009). Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192.
- Guo et al. (2022) Guo, X., Hu, A., Xu, R., and Zhang, J. (2022). A general framework for learning mean-field games. Mathematics of Operations Research.
- Hambly et al. (2021) Hambly, B., Xu, R., and Yang, H. (2021). Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553.
- Hendricks and Wilcox (2014) Hendricks, D. and Wilcox, D. (2014). A reinforcement learning extension to the Almgren–Chriss framework for optimal trade execution. In 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), pages 457–464. IEEE.
- Huang et al. (2022) Huang, Y.-J., Wang, Z., and Zhou, Z. (2022). Convergence of policy improvement for entropy-regularized stochastic control problems. arXiv preprint arXiv:2209.07059.
- Jaimungal (2022) Jaimungal, S. (2022). Reinforcement learning and stochastic optimisation. Finance and Stochastics, 26(1):103–129.
- Jia and Zhou (2022a) Jia, Y. and Zhou, X. Y. (2022a). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154):1–55.
- Jia and Zhou (2022b) Jia, Y. and Zhou, X. Y. (2022b). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(154):1–55.
- Jia and Zhou (2023) Jia, Y. and Zhou, X. Y. (2023). Q-learning in continuous time. Journal of Machine Learning Research.
- Kaelbling et al. (1996) Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.
- Kamma and Pelsser (2022) Kamma, T. and Pelsser, A. (2022). Near-optimal asset allocation in financial markets with trading constraints. European Journal of Operational Research, 297(2):766–781.
- Karatzas and Shreve (1998) Karatzas, I. and Shreve, S. E. (1998). Methods of mathematical finance, volume 39. Springer.
- Kato et al. (2020) Kato, M., Nakagawa, K., Abe, K., and Morimura, T. (2020). Mean-variance efficient reinforcement learning by expected quadratic utility maximization. arXiv preprint arXiv:2010.01404.
- Kotz et al. (2004) Kotz, S., Balakrishnan, N., and Johnson, N. L. (2004). Continuous multivariate distributions, Volume 1: Models and applications, volume 1. John Wiley & Sons.
- Lee and Lee (2021) Lee, H.-R. and Lee, T. (2021). Multi-agent reinforcement learning algorithm to solve a partially-observable multi-agent problem in disaster response. European Journal of Operational Research, 291(1):296–308.
- Lee and Sutton (2021) Lee, J. and Sutton, R. S. (2021). Policy iterations for reinforcement learning problems in continuous time and space—fundamental theory and methods. Automatica, 126:109421.
- Li and Xu (2016) Li, X. and Xu, Z. Q. (2016). Continuous-time Markowitz's model with constraints on wealth and portfolio. Operations Research Letters, 44(6):729–736.
- Li et al. (2002) Li, X., Zhou, X. Y., and Lim, A. E. (2002). Dynamic mean-variance portfolio selection with no-shorting constraints. SIAM Journal on Control and Optimization, 40(5):1540–1555.
- Liu et al. (2020) Liu, Y., Chen, Y., and Jiang, T. (2020). Dynamic selective maintenance optimization for multi-state systems over a finite horizon: A deep reinforcement learning approach. European Journal of Operational Research, 283(1):166–181.
- Luenberger (1998) Luenberger, D. G. (1998). Investment science. Oxford University Press.
- Merton (1975) Merton, R. C. (1975). Optimum consumption and portfolio rules in a continuous-time model. In Stochastic optimization models in finance, pages 621–661. Elsevier.
- Moody et al. (1998) Moody, J., Wu, L., Liao, Y., and Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470.
- Mou et al. (2021) Mou, C., Zhang, W., and Zhou, C. (2021). Robust exploratory mean-variance problem with drift uncertainty. arXiv preprint arXiv:2108.04100.
- Nevmyvaka et al. (2006) Nevmyvaka, Y., Feng, Y., and Kearns, M. (2006). Reinforcement learning for optimized trade execution. In Proceedings of the 23rd international conference on Machine learning, pages 673–680.
- El Karoui et al. (1987) El Karoui, N., Nguyen, D. H., and Jeanblanc-Picqué, M. (1987). Compactification methods in the control of degenerate diffusions: existence of an optimal control. Stochastics: An International Journal of Probability and Stochastic Processes, 20(3):169–219.
- Pham (2009) Pham, H. (2009). Continuous-time stochastic control and optimization with financial applications, volume 61. Springer Science & Business Media.
- Pulley (1983) Pulley, L. B. (1983). Mean-variance approximations to expected logarithmic utility. Operations Research, 31(4):685–696.
- Rubinstein (1977) Rubinstein, M. (1977). The strong case for the generalized logarithmic utility model as the premier model of financial markets. In Financial Decision Making Under Uncertainty, pages 11–62. Elsevier.
- Schnaubelt (2022) Schnaubelt, M. (2022). Deep reinforcement learning for the optimal placement of cryptocurrency limit orders. European Journal of Operational Research, 296(3):993–1006.
- Schneckenreither and Haeussler (2019) Schneckenreither, M. and Haeussler, S. (2019). Reinforcement learning methods for operations research applications: The order release problem. In Machine Learning, Optimization, and Data Science: 4th International Conference, LOD 2018, Volterra, Italy, September 13-16, 2018, Revised Selected Papers 4, pages 545–559. Springer.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354–359.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Tang et al. (2022) Tang, W., Zhang, Y. P., and Zhou, X. Y. (2022). Exploratory hjb equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191–3216.
- Wang (2019) Wang, H. (2019). Large scale continuous-time mean-variance portfolio allocation via reinforcement learning. arXiv preprint arXiv:1907.11718.
- Wang et al. (2020) Wang, H., Zariphopoulou, T., and Zhou, X. Y. (2020). Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34.
- Wang and Zhou (2020) Wang, H. and Zhou, X. Y. (2020). Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308.
- Williams et al. (2017) Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J. M., Boots, B., and Theodorou, E. A. (2017). Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721. IEEE.
Appendix A Technical proofs of Section 3
A.1. Proof of Theorem 3.1
Proof.
We find the solution of the HJB equation (2.12) in separable form, where the component functions are smooth and satisfy the appropriate terminal conditions. Direct calculation shows that (2.12) boils down to
(A.1)
where the constant appearing in the entropy term is Archimedes' constant. It follows that
Hence, the component functions are obtained explicitly, with the constant chosen such that the terminal condition is fulfilled. Direct calculation leads to
and the solution of the PDE (2.12) is given by (3.1). Let us first show that the candidate in (3.1) is indeed the optimal value function. It follows that the optimal distribution is a Gaussian distribution, whose mean is independent of the exploration parameter, with the corresponding variance. The law of this optimal feedback Gaussian control allows us to determine the exploratory wealth drift and volatility from (2.6) as
Hence, the exploratory wealth dynamics are given by (3.3). Now, it is clear that the SDE (3.3) admits a solution given by
(A.2)
Moreover, it can be seen directly from (A.2) that
which means that the policy is admissible. ∎
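For intuition, the following sketch simulates the exploratory wealth process under a Gaussian relaxed control, using the standard aggregation of the drift and of the squared volatility over the policy; the market parameters and the constant policy parameters are placeholder assumptions.

```python
import numpy as np

# Euler-Maruyama simulation of the exploratory (relaxed-control) wealth
# dynamics under a Gaussian policy N(m, v): drift x*(r + m*(mu - r)) and
# squared volatility x^2 * sigma^2 * (m^2 + v). Values are illustrative.

rng = np.random.default_rng(2)
mu, r, sigma = 0.08, 0.02, 0.3
m, v = (mu - r) / sigma ** 2, 0.5      # policy mean (Merton ratio) and variance
T, n_steps = 1.0, 250
dt = T / n_steps

n_paths = 10000
x = np.full(n_paths, 1.0)
for _ in range(n_steps):
    drift = x * (r + m * (mu - r))
    vol = np.abs(x) * sigma * np.sqrt(m ** 2 + v)
    x = x + drift * dt + vol * np.sqrt(dt) * rng.standard_normal(n_paths)

print("mean terminal wealth:", x.mean(), " std:", x.std())
```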
A.2. Proof of Theorem 3.4
Proof.
Recall first that, for any admissible policy (not necessarily Gaussian), the corresponding value function defined by (3.10) solves the following average PDE
with the appropriate terminal condition. It follows that
Note that the above supremum is attained at the updated policy defined by (3.11). In other words,
which implies that
(A.3)
Now, for the updated policy, the corresponding value function is given by
(A.4)
Applying Itô's formula, we obtain
Consider a localizing sequence of stopping times for the stochastic integral above. By (A.3) we obtain, along this sequence,
Taking expectations, passing to the limit along the localizing sequence and using (A.4), we obtain the desired inequality between the value functions.
∎
A.3. Proof of Theorem 3.5
Proof.
Observe first that the corresponding value function solves the following average PDE
Solving the above PDE with the terminal condition, we obtain a solution in product form, where the remaining factor is a continuous function independent of the wealth level. Therefore, the value function satisfies the hypothesis of Theorem 3.4 and its conclusions apply. In particular, the next Gaussian policy
(A.5)
is admissible and
(A.6)
where the newly introduced function is the value function corresponding to this new policy. Again, it solves the following average PDE
Direct computations related to the distribution of the new policy lead to the following PDE
It is hence straightforward to see that the function given by (3.1) solves the above PDE, and the proof is complete. ∎
A.4. Proof of Theorem 3.6
Appendix B Technical Proofs of Section 4
B.1. Proof of Theorem 4.1
Proof.
Similarly to the unconstrained problem, we find the solution of the HJB equation (4.6) in separable form, where the component functions are smooth and satisfy the appropriate terminal conditions. Direct calculation shows that
which are independent of the wealth level. Hence, the resulting quantity is also independent of the wealth level. Now, (4.6) boils down to
(B.1)
for , where . It follows that and
where the constant is chosen such that the terminal condition is fulfilled. Direct calculation leads to
and the solution of the PDE (4.6) is given by (4.9). It is straightforward to verify that the function defined by (4.9) solves (4.6). Indeed, it follows that the optimal distribution is a Gaussian distribution, with a mean that is independent of exploration, truncated on the constraint interval. Its density is given by (4.14) and is independent of the wealth level. This truncated Gaussian control allows us to determine the exploratory wealth drift and volatility as
(B.2)
where the relevant quantities are given by (4.14). This leads to the exploratory wealth dynamics given in (4.12). Now, it is clear that the SDE (4.12) admits a unique solution, which means that the policy is admissible. ∎
B.2. Proof of Lemma 4.4
Proof.
Appendix C Technical proofs for Section 6
C.1. Proof of Theorem 6.1
Proof.
We find the solution of the HJB equation (6.8) in separable form, where the component functions are smooth and satisfy the appropriate terminal conditions. Differentiating this ansatz, plugging it into the PDE (6.8) and matching the coefficients, we obtain
Similarly, we obtain the following ODE when considering the remaining coefficient:
(C.1)
Direct calculation shows that
(C.2)
(C.3)
which are independent of the wealth level. Hence, the function given in (6.15) is also independent of the wealth level. Therefore,
Now, it is straightforward to verify that the function defined by (6.14) solves (6.8). Next, it follows that the optimal feedback distribution control is a Gaussian distribution, with the mean and variance parameters given above, truncated on the constraint interval, i.e.
(C.4)
This explicit truncated Gaussian policy allows us to determine the exploratory wealth drift and volatility as
where
(C.5)
and
(C.6)
which guarantees that the SDE (6.3) admits a strong solution. Finally, it is straightforward to verify the integrability conditions and conclude that the policy is admissible. ∎
C.2. Proof of Proposition 6.2
Proof.
First, from the form of the quadratic utility, it is easy to see that, for any admissible terminal portfolio,
(C.7)
Observe that, when the variance is fixed, the right-hand side of (C.7) is increasing in the mean as long as the mean stays below the reward level. Therefore, among policies with a fixed variance, the policy that maximizes the expected quadratic utility attains the highest mean on the right-hand side of (C.7). In other words, to show that the unconstrained exploratory optimal terminal wealth of the expected quadratic utility maximization problem lies on the mean-variance efficient frontier, it suffices to verify that its mean stays below the reward level. Indeed, from (C.5), in the unconstrained case the optimal wealth dynamics with exploration are given by
which yields an equation for the expected wealth. After a suitable change of variable, we obtain the following ODE
Solving this ODE with the given initial condition, we obtain
(C.8)
It then follows directly from the condition on the initial wealth that the required mean condition holds. ∎