Continuous-time optimal investment with portfolio constraints: a reinforcement learning approach
Abstract.
In a reinforcement learning (RL) framework, we study the exploratory version of the continuous-time expected utility (EU) maximization problem with a portfolio constraint that includes widely used financial regulations such as short-selling constraints and borrowing prohibition. The optimal feedback policy of the exploratory unconstrained classical EU problem is shown to be Gaussian. When the portfolio weight is constrained to a given interval, the corresponding constrained optimal exploratory policy follows a truncated Gaussian distribution. We verify that the closed-form optimal solutions obtained for logarithmic and quadratic utility, in both the unconstrained and constrained situations, converge to their non-exploratory expected utility counterparts as the exploration weight goes to zero. Finally, we establish a policy improvement theorem and devise an implementable reinforcement learning algorithm by casting the optimization problem in a martingale framework. Our numerical examples show that exploration leads to an optimal wealth process that is more dispersed, with a heavier tail, than in the case without exploration; this effect becomes less significant as the exploration parameter decreases. Moreover, the numerical implementation confirms the intuitive understanding that a broader domain of investment opportunities necessitates a higher exploration cost. Notably, when subjected to both short-selling and money-borrowing constraints, the exploration cost becomes negligible compared to the unconstrained case.
Key words and phrases:
Optimal investment, entropy regularized, reinforcement learning, exploration, stochastic optimal control, portfolio constraint.
2010 Mathematics Subject Classification: 34D20, 60H10, 92D25, 93D05, 93D20.
1. Introduction
Reinforcement learning (RL) is a dynamic and rapidly advancing subset within the field of machine learning. In recent times, RL has been applied across diverse disciplines to address substantial real-world challenges, extending its reach into domains such as game theory (Silver et al., 2016, 2017), control
theory (Bertsekas, 2019),
information theory (Williams et al., 2017), and statistics (Kaelbling
et al., 1996).
RL has been also applied to study many important issues of operations research (OR) such as manufacturing planning, control systems (Schneckenreither and
Haeussler, 2019) and inventory management (Bertsimas and Thiele, 2006). In quantitative finance, RL has been used to study important problems such as algorithmic and high-frequency trading and portfolio management.
The availability of big data sets has also facilitated the rapid development of RL techniques in financial engineering. As a typical example, electronic markets can provide a sufficient amount of microstructure data for training and adaptive learning, far beyond what human traders and portfolio managers could handle in the old days. Numerous studies
have been done in this direction
including optimal order execution (Nevmyvaka
et al., 2006; Schnaubelt, 2022),
optimal trading (Hendricks and
Wilcox, 2014), portfolio allocation (Moody et al., 1998),
mean-variance portfolio allocation (Wang and Zhou, 2020; Wang, 2019; Dai et al., 2020; Jia and Zhou, 2022b).
Notably, the authors in Wang and Zhou (2020) considered continuous-time mean-variance portfolio selection in a continuous RL framework; their algorithm outperforms both traditional and deep neural network based algorithms by a large margin. For recent reviews of applications of RL in OR, finance and economics, the interested reader is referred to the survey papers, e.g., Gosavi (2009); Hambly et al. (2021); Jaimungal (2022); Charpentier et al. (2021).
Differing from conventional econometric or supervised learning approaches prevalent in quantitative
finance research, which often necessitate the assumption of parametric models, an RL agent refrains from pre-specifying a structural model. Instead, she progressively learns optimal strategies through trial and error, engaging in interactions with a black-box environment (e.g. the market).
By repeatedly trying
a policy for actions,
receiving and evaluating
reward signals, and improving
the policy, the agent gradually improves her
performance.
The three key components
that essentially capture
the heart of RL are: i) exploration, ii) policy evaluation, and iii)
policy improvement.
The agent first explores
the surrounding
environment by trying
a policy. She
then evaluates
her reward for the given policy.
Lastly, using the information received from both exploration and the current reward, she devises a new policy with a larger reward.
The whole process is then repeated.
Despite the fast development and vast applications, very few existing RL studies are done in continuous settings, e.g. Doya (2000); Frémaux et al. (2013); Lee and Sutton (2021), whereas a large body of RL investigations is limited to discrete learning frameworks such as discrete-time Markov decision processes (MDP) or deterministic settings, see e.g. Sutton and Barto (2018); Hambly et al. (2021); Liu et al. (2020); Lee and Lee (2021). Since discrete-time dynamics only provide an approximation of real-world systems, it is important, from the practical point of view, to consider continuous-time systems with both continuous state and action spaces.
As mentioned previously, in RL settings an agent simultaneously interacts with the surrounding environment (exploration) and improves her overall performance (exploitation). Since exploration is inherently costly in terms of resources, it is important to design an active learning approach that balances both exploration and exploitation. There is therefore a critical need to extend RL techniques to continuous settings in which the agent can find learning strategies that achieve this balance.
The studies of RL
under a continuous time framework
in both time and space
have been initiated in a series of recent papers (Wang et al., 2020; Wang and Zhou, 2020; Wang, 2019; Dai et al., 2020; Jia and Zhou, 2022b).
In Wang et al. (2020)
the authors
propose a theoretical framework, called exploratory formulation, for studying
RL problems in continuous systems in both time and space
that captures repetitive learning
under exploration in the continuous time limit.
Wang and Zhou (2020) adopts the RL setting of Wang et al. (2020)
to study the continuous-time mean variance
portfolio selection with a finite time horizon.
They show that in a learning framework incorporating both exploration and exploitation,
the optimal feedback policy is Gaussian, with time-decaying variance.
As shown in Jia and Zhou (2022a), the framework in Wang et al. (2020) minimizes the mean-square temporal difference (TD) error (Barnard, 1993; Baird, 1995; Sutton and Barto, 2018) when learning the value function, which later turns out to be inconsistent in stochastic settings.
Dai et al. (2020) further consider equilibrium mean–variance strategies, addressing the time-inconsistency issue of the problem. Guo et al. (2022); Mou et al. (2021)
extend the formulation and results of Wang and Zhou (2020) to mean-field games
and a mean-variance problem with drift
uncertainty, respectively. In Jia and Zhou (2022b), using a martingale approach,
the authors
are able to represent the gradient of the value function with respect to a given parameterized stochastic policy as the expectation of an integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns
policy gradient into a policy evaluation problem. Studies of convergence under the exploratory formulation can be found in Tang et al. (2022); Huang et al. (2022).
In this paper, we adopt the exploratory stochastic
control learning framework of Wang et al. (2020)
to study the continuous-time optimal investment problem without and with a portfolio constraint that covers widely used financial regulations such as short-selling constraints and borrowing prohibition (Cuoco, 1997). In both the constrained and unconstrained settings with exploration, we obtain the closed-form solution of the exploratory problem for logarithmic and quadratic utility functions. We show that the optimal feedback policy of the exploratory unconstrained expected utility (EU) problem is Gaussian, in line with the result obtained in Wang and Zhou (2020) for the mean-variance setting. However, when the risky investment ratio is restricted to a given interval, the constrained optimal exploratory policy follows a truncated Gaussian distribution.
The explicit form of the solutions to the HJB equations enables us to obtain a policy improvement theorem, which confirms that exploration can be performed within the class of Gaussian (resp. truncated Gaussian) policies for the unconstrained (resp. constrained) EU problem. Moreover, by casting the optimization problem in a martingale framework as in Jia and Zhou (2022b), we devise an implementable reinforcement learning
algorithm. We observe that, compared to the classical case (without exploration), the exploration procedure in the presence of a portfolio constraint leads to an optimal portfolio process that exhibits a more dispersed distribution with a heavier tail. This effect becomes less significant as the exploration parameter decreases. Moreover, a decrease (resp. increase) in the lower (resp. upper) bound of the portfolio strategy leads to a corresponding increase in the exploration cost. This aligns with the intuitive understanding that a larger investment opportunity domain requires a higher exploration cost. Notably, when facing both short-selling and money-borrowing constraints, the exploration cost becomes negligible compared to the unconstrained case. Our paper is the first attempt to explicitly analyze the classical continuous-time EU problem with possible portfolio constraints in an entropy-regularized RL framework. Although this paper specifically addresses an optimal portfolio problem, it is highly relevant to OR challenges from multiple perspectives. The techniques and methodologies discussed here can be applied directly to various areas of OR that involve decision making under model parameter uncertainty, where reliable data are used to learn the true model parameters. It is worth highlighting that our findings in Section 6 establish a direct connection between quadratic utility and the mean-variance (MV) framework, a cornerstone in OR for evaluating risk-return profiles under portfolio constraints. With reinforcement learning (RL) methods increasingly adopted in OR for solving dynamic stochastic problems, our work offers a versatile framework with clear applications in portfolio optimization, inventory management, supply chain systems, and other OR domains involving constrained stochastic decision-making.
While revising this paper, we became aware of the recent work Dai et al. (2023), where the authors consider the Merton optimal portfolio for a power utility function in a stochastic volatility setting with a recursive weighting scheme on exploration that endogenously discounts the current exploration reward by the past accumulated amount of exploration. By adopting a two-step optimization of the corresponding exploratory HJB equation for an exploratory learning scheme that incorporates the correlation between the risky asset and the volatility dynamics, as suggested in Dai et al. (2021), the authors in Dai et al. (2023) are able to characterize the Gaussian optimal policy, with a biased structure due to market incompleteness. Compared to Dai et al. (2021, 2023), our paper studies the Merton problem for logarithmic and quadratic utility functions in complete market settings, aiming at explicit solutions, while accounting for possible portfolio constraints. We remark that the approach in Dai et al. (2023) appears to be no longer applicable to the classical expected utility maximization when the endogenously recursive weighting scheme on exploration is removed. In addition, it is not clear how the two-step optimization procedure in Dai et al. (2023) would apply to our constrained setting.
The rest of the paper is organized as follows: Section 2 formulates the exploratory optimal investment problem incorporating both exploration and exploitation and provides a general form of the optimal distributional control for the exploratory HJB equation.
Section 3 studies the unconstrained optimal investment problem for a logarithmic utility function. We elaborate on the constrained EU problem with exploration and discuss the exploration cost and its impact in Section 4. Section 5 discusses an implementable reinforcement learning algorithm in a martingale framework and provides some numerical examples. Section 6 studies the case of a quadratic utility function. An extension to random coefficient markets is presented in Section 7. Section 8 concludes
the paper with future research perspectives. Technical proofs and additional details can be found in the Appendix.
Notation. In this paper, we use or to denote a stochastic process; also refers to the values of the stochastic process when it is clear from the context. We use to denote the normal distribution with mean and standard deviation , and denotes the mathematical constant pi. Additionally, denotes the partial first/second derivatives with respect to the corresponding arguments. and are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.
2. Formulation
Consider a filtered probability space (). Furthermore, let be a standard Brownian motion (restricted to ) under the real-world measure, and let be the standard filtration generated by . For our analysis, we assume a Black–Scholes economy. To be more precise, there are two assets traded in the market: one risk-free asset earning a constant interest rate and one risky asset following a geometric Brownian motion with instantaneous rate of return and volatility . Their dynamics are given by
(2.1)  $dS^0_t = r\,S^0_t\,dt,$
(2.2)  $dS_t = S_t\left(\mu\,dt + \sigma\,dW_t\right).$
Consider an investor endowed with an initial wealth , which will be used for investment. We assume that the investor splits her wealth between the two assets given above. We use to denote the fraction of wealth that the investor invests in the risky asset; the remaining money is invested in the risk-free asset. We assume that the process is -adapted. The investor chooses an investment strategy from the following admissible set
Here, is the investor’s wealth process which satisfies the following stochastic differential equation:
(2.3)  $dX^u_t = X^u_t\left[\left(r + u_t(\mu - r)\right)dt + u_t\,\sigma\,dW_t\right], \qquad X^u_0 = x_0.$
2.1. The classic optimal investment problem
The classical asset allocation problem can be written as
(2.4)  $\sup_{u}\ \mathbb{E}\left[U\!\left(X^u_T\right)\right],$
where is a concave utility function. The optimization problem (2.4) is solely an exploitation problem, which is a typical setup of classical stochastic control problems. In a financial market with complete knowledge of the model parameters, one can readily employ classical model-based stochastic control theory (see e.g. Fleming and Soner (2006); Pham (2009)), a duality approach (see e.g. Chen and Vellekoop (2017); Kamma and Pelsser (2022)), or the so-called martingale approach (see e.g. Karatzas and Shreve (1998)) to find the solution of Problem (2.4). When implementing the optimal strategy, one needs to estimate the market parameters from historical time series of asset prices. However, in practice, estimating these parameters with acceptable accuracy is notoriously difficult, especially for the mean return; this difficulty is known as the mean-blur problem, see e.g. Luenberger (1998).
2.2. Optimal investment with exploration
In the RL setting where the underlying model is not known, dynamic learning is needed and the agent employs exploration to interact with and learn from the unknown environment through trial and error. The key idea is to model exploration via a distribution over the space of controls from which the trials are sampled. Here we assume that the action space is continuous and that randomization is restricted to distributions that have density functions. In particular, at time , with the corresponding current wealth , the agent considers a classical control sampled from a policy distribution . Mathematically, such a policy can be generated, e.g., by introducing an additional random variable that is uniformly distributed in and is independent of the filtration of the Brownian motion , see e.g. Jia and Zhou (2022a). In this sense, the probability space is expanded to (), where and is an extended probability measure on whose projection on is . This sampling is executed for N rounds over the same time horizon. Intuitively, the estimated utility of such a (feedback) policy becomes accurate enough when N is large, by the law of large numbers. This procedure, known as policy evaluation, is a fundamental element of most RL algorithms in practice. For our continuous-time setting, we follow the exploratory formulation suggested in Wang et al. (2020) and refer to, e.g., Wang et al. (2020); Wang and Zhou (2020); Jia and Zhou (2022a, b) for motivations and additional details. In particular, we consider the exploratory version of the wealth dynamics (2.3) given by
(2.5)  $dX^\pi_t = \tilde b\left(X^\pi_t, \pi_t\right)dt + \tilde\sigma\left(X^\pi_t, \pi_t\right)dW_t, \qquad X^\pi_0 = x_0,$
where the exploration drift and exploration volatility are defined by
(2.6)  $\tilde b(x,\pi) := \int_{\mathbb{R}} x\left(r + u(\mu - r)\right)\pi(u)\,du, \qquad \tilde\sigma(x,\pi) := \sigma x\sqrt{\int_{\mathbb{R}} u^2\,\pi(u)\,du},$
where denotes the density of the exploration distribution at time .
From (2.6) we observe that
(2.7)  $\tilde b(x,\pi) = x\left(r + (\mu - r)\int_{\mathbb{R}} u\,\pi(u)\,du\right), \qquad \tilde\sigma^2(x,\pi) = \sigma^2 x^2 \int_{\mathbb{R}} u^2\,\pi(u)\,du.$
As a result, the agent may want to maximize the exploratory expected investment utility over all admissible policies. Intuitively, the agent must account for the exploration effect in her objective. As suggested in Wang et al. (2020), for a given exploration distribution , the following Shannon differential entropy at time ,
$\mathcal{H}(\pi_t) := -\int_{\mathbb{R}} \pi_t(u)\ln \pi_t(u)\,du,$
can be used to measure the exploration impact. Note that when the model is fully known, there would be no requirement for exploration and learning. In such a scenario, control distributions would collapse to Dirac measures, placing us in the domain of classical stochastic control, Fleming and Soner (2006); Pham (2009).
Motivated by the above discussion, below we consider the following Shannon’s entropy-regularized exploratory optimization problem
(2.8) |
where is the set of admissible (feedback) exploration distributions (or, more precisely, densities). Mathematically, (2.8) is a relaxed stochastic control problem, which has been widely studied in the control literature, see e.g. Fleming (1976); Fleming and Nisio (1984); El Karoui et al. (1987). Note that relaxed controls have a natural interpretation: at each time , the agent chooses not a single action (strategy) but rather a probability measure over the action space from which a specific action is randomly sampled. In more general settings of relaxed control, e.g. Fleming (1976); Fleming and Nisio (1984); El Karoui et al. (1987), the admissible set may contain open-loop distributions that are measure-valued stochastic processes. This is also in line with the notion of a mixed strategy in game theory, in which players choose probability measures over the action set rather than single actions (pure strategies).
We note that the minus sign in front of the entropy term accounts for the fact that exploration is inherently costly in terms of resources. Also, we can see that the optimization problem (2.8) incorporates both exploration (due to the entropy factor) and exploitation. The rate of exploration is determined by the exogenous temperature parameter , in the sense that a larger value of indicates that more exploration is needed, and vice versa. Hence, the agent can personalize her exploration rate by selecting an appropriate exogenous temperature parameter .
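For concreteness, a sketch of the entropy-regularized value in (2.8), written in the convention of Wang et al. (2020) under the notation introduced above (a reconstruction; the exploration weight $\lambda > 0$ acts as a temperature parameter):

$$ V(t,x) \;=\; \sup_{\pi} \; \mathbb{E}\Big[\, U\!\big(X^{\pi}_{T}\big) \;-\; \lambda \int_{t}^{T}\!\!\int_{\mathbb{R}} \pi_{s}(u) \ln \pi_{s}(u)\, du\, ds \;\Big|\; X^{\pi}_{t} = x \Big], $$

where the inner integral is the negative of the Shannon entropy, so the regularizer rewards more diffuse (exploratory) policies.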
Next, we specify the set of admissible controls as follows.
Assumption 2.1.
The set of admissible controls has the following properties:
(1) For each , is a density function on .
(2) The mapping is measurable.
(3) For each , the exploratory SDE (2.5) admits a unique strong solution denoted by , which is positive and
2.3. HJB equation and optimal distribution control
For each , we denote by the optimal value function of Problem (2.8), i.e.
By standard arguments of Bellman’s principle of optimality, the optimal value function satisfies the following HJB equation
(2.9) |
with terminal condition . We observe that the formula in the bracket of (2.9) can be expressed as
The optimal distribution can be obtained by Donsker and Varadhan’s variational formula (see e.g. Donsker and Varadhan (2006))
(2.10) |
which is a feedback form and can be seen as a Boltzmann distribution (see e.g., Tang et al. (2022) for a similar result). Here the notation is introduced to indicate that depends on the exploration parameter . Note that we can write as
which is Gaussian with mean and variance (assuming that ) defined by
(2.11) |
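As a sketch of the computation behind (2.10)–(2.11) (a reconstruction in standard Black–Scholes notation, assuming the concavity condition above), the exponent appearing in the variational formula is quadratic in the action $u$,

$$ \pi^{*}(u \mid t, x) \;\propto\; \exp\Big\{ \tfrac{1}{\lambda}\Big[ \big(r + u(\mu - r)\big) x\, v_x(t,x) + \tfrac{1}{2}\,\sigma^{2} u^{2} x^{2}\, v_{xx}(t,x) \Big] \Big\}, $$

and completing the square in $u$ identifies a Gaussian density with

$$ \text{mean} \;=\; -\,\frac{(\mu - r)\, v_x(t,x)}{\sigma^{2} x\, v_{xx}(t,x)}, \qquad \text{variance} \;=\; -\,\frac{\lambda}{\sigma^{2} x^{2}\, v_{xx}(t,x)}, $$

which is well defined whenever $v_{xx}(t,x) < 0$.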
Substituting (2.10) back into the HJB equation (2.9), we obtain the following non-linear PDE
(2.12) |
with terminal condition
3. Unconstrained optimal investment problem with exploration
3.1. Unconstrained optimal policy
Below we show that the optimal solution to (2.12) can be given in explicit form for the case of logarithmic utility. Logarithmic utility is widely recognized as a preference function for rational investors, as highlighted in Rubinstein (1977). The study in Pulley (1983) supports this assertion by showing that logarithmic utility serves as an excellent approximation in the context of the Markowitz mean-variance setting and that, remarkably, the optimal portfolio under logarithmic utility is almost identical to that derived from the mean-variance strategy. It is crucial to emphasize that while our optimal strategy with logarithmic utility maintains time consistency, the mean-variance strategy is only precommitted, meaning it is optimal only at time 0; this stands out as one of the most critical drawbacks of the mean-variance setting. Logarithmic utility's versatility extends to diverse applications, including long-term investor preferences (Gerrard et al., 2023) and optimal hedging theory within semimartingale market models (Merton, 1975), and it belongs to the hyperbolic absolute risk aversion (HARA) class. This class has been explored in portfolio optimization under finite-horizon economies (Cuoco, 1997), infinite-horizon consumption-portfolio problems (El Karoui and Jeanblanc-Picqué, 1998), and scenarios involving allowed terminal debt (Chen and Vellekoop, 2017).
We summarize the results for the unconstrained case, i.e. , in the following theorem.
Theorem 3.1.
The optimal value function of the entropy-regularized exploratory optimal investment problem with logarithmic utility is given by (recall that is the regular constant pi)
(3.1) |
for . Moreover, the optimal feedback distribution control (which is independent of ) is a Gaussian with mean and variance , i.e.
(3.2) |
and the associated optimal wealth under is given by the following SDE
(3.3) |
From Theorem 3.1 we can see that the best control distribution for balancing exploration and exploitation is Gaussian. This demonstrates the popularity of the Gaussian distribution in RL studies. The mean of the optimal distributional control is given by the Merton strategy
(3.4)  $u^{\mathrm{Merton}} = \dfrac{\mu - r}{\sigma^{2}},$
whereas the variance of the optimal Gaussian policy is controlled by the degree of exploration . We also observe that at any , the exploration variance decreases as increases, which means that exploration is less necessary in a more random market environment.
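For concreteness (a reconstruction consistent with Theorem 3.1 and the observation above, assuming standard Black–Scholes notation): for logarithmic utility the value function has the form $v(t,x) = \ln x + c(t)$, so that $v_x = 1/x$ and $v_{xx} = -1/x^{2}$, and the Gaussian mean and variance of the optimal policy reduce to

$$ \pi^{*}(\cdot \mid t, x) \;=\; \mathcal{N}\!\left( \frac{\mu - r}{\sigma^{2}},\; \frac{\lambda}{\sigma^{2}} \right), $$

i.e. the mean is the Merton fraction (3.4) and the variance is proportional to the exploration weight and decreasing in the market volatility, as noted above.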
Remark 3.1.
The optimal value function of the regularized exploration problem can be expressed as
(3.5) |
where
Intuitively, measures the exploration effect. It is easy to see that as , hence, which is the optimal value function in the absence of exploration.
The following result shows the solvability equivalence between the classical and exploratory EU problems, meaning that the solution to one directly provides the solution to the other, without requiring a separate resolution.
Theorem 3.2.
The following statements are equivalent:
(a)
(b) The function
is the optimal value function of the classical (without exploration) EU problem, where the optimal investment strategy is given by the Merton fraction .
Proof.
It is easy to see that and solve the HJB equation (2.12) with and without exploration, respectively. The admissibility of the two optimal strategies is straightforward to verify. ∎
3.2. Exploration cost and exploration effect
The exploration cost of a general RL problem is defined as the difference between the expected utility following the corresponding optimal control under the classical objective and under the exploratory objective, net of the value of the entropy. Note that the exploration cost is well defined only if both the classical and the exploratory problems are solvable. Let be the value function of the classical EU problem. Below, we often write as shorthand when there is no confusion. The exploration cost at time is defined by
(3.6) |
The first term is the classical value function at time . The second term is the expected utility (before being regularized) of the exploration problem at time .
Proposition 3.1.
Assume that one of the statements in Theorem 3.2 holds. Then, the exploration cost is given by .
Proof.
Observe that the exploration cost converges to zero as . Since the exploration weight has been taken as an exogenous parameter reflecting the level of exploration desired by the learning agent, it is intuitive to expect that the smaller is, the more emphasis is placed on exploitation. Moreover, when is sufficiently close to zero, the exploratory formulation approaches the problem without exploration. The following theorem confirms the desirable result that the entropy-regularized EU problem converges to its classical EU counterpart when the exploration weight goes to zero.
Theorem 3.3.
Assume that one of the statements in Theorem 3.2 holds. Then, for each , it holds that
(3.9) |
where is the Dirac measure at the Merton strategy. Furthermore, for each ,
3.3. Policy improvement
The following policy improvement theorem is crucial for interpretable RL algorithms, as it ensures that the iterated value functions are non-decreasing and converge to the optimal value function. Below, for a given policy (not necessarily Gaussian), we denote the corresponding value function by
(3.10) |
Theorem 3.4.
For a given admissible policy (not necessarily Gaussian), we assume that the corresponding value function is and satisfies for any . Suppose furthermore that the feedback policy defined by
(3.11) |
is admissible. Let be the value function corresponding to this new (Gaussian) policy . Then,
(3.12) |
Note that Theorem 3.4 holds true for a general utility function . One important implication of Theorem 3.4 is that for any given (not necessarily Gaussian) policy , there are always policies in the Gaussian family that improve the value function of (i.e. provide higher expected utility values). Therefore, it is natural to focus on Gaussian policies when choosing an initial exploration distribution. Note that the optimal Gaussian policy given by (3.2) also suggests that a candidate initial feedback policy may take the form for some real number . As shown below, such a choice leads to fast convergence of both the value functions and the policies in a finite number of iterations.
Theorem 3.5.
Let with some real number and consider the logarithmic utility function . Define the sequence of feedback policies updated by the policy improvement scheme (3.11), i.e.,
(3.13) |
where is the value function corresponding to the policy defined by
(3.14) |
Then,
(3.15) |
and
(3.16) |
which is the optimal value function given by (3.1).
As shown in Appendix A.3, starting with a Gaussian policy, the optimal solution of the exploratory EU problem with logarithmic utility is attained rapidly (after one update).
Remark 3.2.
It is valuable to extend Theorem 3.5 to contexts with general utility functions. However, obtaining closed-form solutions of the resulting highly nonlinear average PDEs (see Appendix A.3) is not feasible. Note that even the existence of a solution to the next-step average PDE is not obvious and would require a thorough analysis, including asymptotic expansions.
3.4. Policy evaluation and algorithm
Following Jia and Zhou (2022a, b), we consider in this section a policy evaluation procedure that takes the martingale property into account. Recall, with a slight abuse of notation, the value function
(3.17) |
where is a given learning (feedback) policy. By the Feynman–Kac theorem, it can be seen that solves (e.g. in the viscosity sense) the following average PDE
(3.18) |
where is the infinitesimal generator associated with , i.e.
(3.19) |
As in Jia and Zhou (2022a, b), the value function of a policy can be characterized by the following martingale property.
Theorem 3.6.
A function is the value function associated with the policy if and only if it satisfies the terminal condition , and
(3.20) |
is a martingale on .
It follows from Itô’s Lemma and Theorem 3.6 that for all adapted square-integrable processes satisfying . Equivalently,
(3.21) |
Such a process is called test function. Let be a parameterized family that is used to approximate , where . Here satisfies Assumption 3.1 below and our goal is to find the best parameter such that the martingale property in Theorem 3.6 holds. In this sense, the process
should be a martingale in with terminal value
As discussed in (Jia and Zhou, 2022a, page 18), the martingale condition of is further equivalent to a requirement that the process at any given time is the expectation of the terminal value conditional on all the information available at that time. A fundamental property of the conditional expectation then yields that minimizes the -error between and any -measurable random variables. Therefore, our objective is to minimize the martingale cost function defined by . In other words, for the policy evaluation (PE) procedure, we look at the following (offline) minimization problem
(3.22) |
and check the martingale orthogonal property
(3.23) |
The orthogonality condition (3.23) can be ensured by solving the following minimization problem, which is a quadratic form of (3.23):
(3.24) |
where is a positive definite matrix of suitable size (e.g. or ) for any test process . Our procedure is inspired by (Jia and Zhou, 2022a, page 22): we optimize with respect to the parameter , which is in general a vector. In fact, we must ensure that the parameterization satisfies the martingale orthogonality condition (3.23) by choosing a suitable test function . For numerical approximation methods involving a parametric family , in principle we need at least as many equations among the martingale orthogonality conditions as there are parameters in order to fully determine ; consequently, may be vector-valued, making (3.23) a vector-valued equation or, equivalently, a system of equations. A common choice of test process is to take . The following assumption is needed in such an approximation procedure, see e.g. Jia and Zhou (2022b). (There is a rich theory on the existence and regularity of solutions of general parabolic PDEs, see e.g. Friedman (2008). However, the average PDE (3.18) appears to be a new type of PDE for which further studies on well-posedness (existence and uniqueness) and regularity of solutions are needed; see Tang et al. (2022) for related discussions on a similar exploratory elliptic PDE.)
Assumption 3.1.
For all , is a with polynomial growth condition in . Moreover, is smooth in and its first two derivatives are and satisfy the polynomial growth condition in .
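To make the policy-evaluation step concrete, the following is a minimal sketch in Python, not the paper's exact algorithm. All names are hypothetical: `J(t, x, theta)` is a parametric value function satisfying the terminal condition, the test function is taken to be the gradient of `J` with respect to `theta` (the common choice mentioned above), and the routine performs one stochastic-approximation step on the empirical martingale orthogonality condition (3.23) along a sampled trajectory of the exploratory wealth process.

```python
import numpy as np

T, lam = 1.0, 0.1          # horizon and exploration weight (temperature); illustrative values


def J(t, x, theta):
    """Hypothetical parametric value function: log-wealth plus a linear-in-time term.
    It satisfies the terminal condition J(T, x) = log(x) by construction."""
    return np.log(x) + theta[0] * (T - t)


def grad_J(t, x, theta):
    """Gradient of J with respect to theta; also used as the test function."""
    return np.array([T - t])


def critic_update(theta, t_grid, X_path, entropy_path, lr=0.01):
    """One stochastic-approximation step on the empirical martingale orthogonality
    condition (3.23) along a single sampled trajectory of the exploratory wealth."""
    update = np.zeros_like(theta)
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        # Increment of the candidate martingale: change in J plus the accumulated
        # entropy reward lambda * H(pi_{t_k}) * dt of the executed policy.
        dM = (J(t_grid[k + 1], X_path[k + 1], theta)
              - J(t_grid[k], X_path[k], theta)
              + lam * entropy_path[k] * dt)
        update += grad_J(t_grid[k], X_path[k], theta) * dM
    return theta + lr * update
```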
Suppose that a PE step can be performed to obtain an estimate of the value function for a given admissible policy . If the policy can be parameterized by , it is possible to learn the corresponding value function by following a policy gradient (PG) step. Recall first that the value function satisfies the average PDE
(3.25) |
By differentiating (3.25) w.r.t. we obtain the following average PDE for the policy gradient
(3.26) |
with terminal condition , where
(3.27) |
By Feynman-Kac’s formula, admits the following representation
(3.28) |
Since can be neither observed nor computed without knowing the environment, it is important to replace it by a suitable approximation. To this end, using Itô's formula, the term can be replaced by . Taking into account the fact that the term is a martingale, we obtain the following.
Theorem 3.7.
Let be an admissible policy. The policy gradient admits the following representation
(3.29) |
Proof.
The proof is similar to that of Theorem 5 in Jia and Zhou (2022b), and hence is omitted. ∎
Note that for an arbitrary policy, the gradient of the corresponding value function given by the expectation (3.29) is not zero in general. As mentioned in Jia and Zhou (2022a, b), the terms inside the expectation above are computable only when the action trajectories and the corresponding state trajectories on are available. The sample is also needed to estimate the value function obtained in the previous PE step.
4. Constrained optimal investment problem with exploration
In this section, we extend the results obtained in Section 3 to settings with a convex portfolio constraint, see e.g. Cuoco (1997). In particular, we assume that the agent does not consume over her investment period and, in addition, due to regulatory reasons, her risky investment ratio is restricted to a given interval , where are real numbers. Observe that the well-known short-selling constraint can be included by taking , . If, in addition, , then both short-selling and money borrowing are prohibited. The case of a buying constraint can also be covered by setting . Clearly, we are back in the unconstrained setting when sending and . In such a constrained framework, the set of admissible investment strategies is now defined by
where is the non-exploration wealth process starting with for strategy . The objective is to maximize the terminal expected utility
(4.1) |
where is a utility function.
As in the unconstrained case, we look at the exploratory version of the wealth dynamics given by (2.5), with the exploration drift and exploration volatility defined by (2.6). Intuitively, the corresponding exploration setting is slightly adjusted to take the constraint into account. In particular, let be the set of admissible exploration distributions in that satisfy the following properties:
(1) For each , is a density function on .
(2) The mapping is measurable.
(3) For each , the exploration SDE (2.5) admits a unique strong solution denoted by , which is positive and
The exploration optimization is now stated by
(4.2) |
As before, the optimal value function satisfies the following HJB equation
(4.3) |
with terminal condition . Using again the standard argument of DPP we observe that under the portfolio constraint , the optimal feedback policy now follows a truncated Gaussian distribution.
Lemma 4.1.
In the exploratory constrained EU setting, if then the optimal distribution is a Gaussian distribution with mean and variance , truncated on interval , where
(4.4) |
The density of the optimal policy is given by
(4.5) |
where and are the PDF and CDF functions of the standard normal distribution, respectively.
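In practice, sampling actions from such a truncated Gaussian feedback policy is straightforward with standard scientific libraries. The sketch below uses scipy's `truncnorm`; the mean, standard deviation and bounds are hypothetical placeholders standing in for the quantities in (4.4).

```python
from scipy.stats import truncnorm

def sample_truncated_gaussian_policy(mean, std, a, b, size=1, seed=None):
    """Draw risky-asset weights from a Gaussian N(mean, std^2) truncated to [a, b].
    scipy's truncnorm expects the truncation bounds in standardized units."""
    alpha, beta = (a - mean) / std, (b - mean) / std
    return truncnorm.rvs(alpha, beta, loc=mean, scale=std, size=size, random_state=seed)

# Example: a policy centered at a hypothetical constrained Merton mean 0.55 with
# exploration-induced standard deviation 0.3, restricted to [0, 1]
# (no short-selling and no borrowing).
print(sample_truncated_gaussian_policy(0.55, 0.3, 0.0, 1.0, size=5, seed=42))
```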
Note that the first two moments and the entropy of a truncated Gaussian distribution can be computed explicitly, see e.g. Kotz et al. (2004). Substituting (4.5) back into the HJB equation (4.3), we obtain the following highly non-linear PDE
(4.6) |
with terminal condition , where by abusing notations,
(4.7) |
with
(4.8) |
4.1. Optimal policy with portfolio constraints
Below we show that the optimal solution to (4.6) can be given in explicit form for the case of logarithmic utility. We summarize the results in the following theorem.
Theorem 4.1 (Exploratory optimal investment under portfolio constraint).
For logarithmic utility , the optimal value function of the entropy-regularized exploratory constrained optimal investment problem is given by
(4.9) |
for , where
(4.10) |
Moreover, the optimal feedback distribution control is a Gaussian distribution with parameter and , truncated on the interval i.e.
(4.11) |
and the associated optimal wealth under is given by the following SDE
(4.12) |
where
(4.13) |
with, by abusing notations, , and
(4.14) |
Remark 4.1.
At any , the exploration variance decreases as increases, which means that exploration is less necessary in a more random market environment. However, in contrast to the unconstrained case, both the mean and the variance of the optimal distribution are controlled by the degree of exploration .
From Theorem 4.1 we can see that, in such a setting with portfolio constraints, the best control distribution for balancing exploration and exploitation is a truncated Gaussian, demonstrating once again the popularity of the Gaussian distribution in RL studies, even in constrained optimization settings.
Lemma 4.2.
The optimal value function of the regularized exploration constrained problem can be expressed as
(4.15) |
where
(4.16) |
which is the constrained optimal value function without exploration using the corresponding constrained optimal investment strategy
(4.17) |
where
and
Note that . The following lemma confirms the intuition that , which measures the exploration effect, converges to 0 as .
Lemma 4.3.
Consider defined by (4.10) for each . We have
(4.18) |
Proof.
It can be seen directly using e.g. the classical L’Hôpital’s rule. ∎
Similarly to Theorem 3.2 for the unconstrained case, it is straightforward to obtain the solvability equivalence between the classical and exploratory constrained EU problems.
4.2. Exploration cost and exploration effect under portfolio constraints
We now turn our attention to the exploration cost. As before, the exploration cost at time is defined by
(4.19) |
where and are the optimal value functions at time of the constrained EU problem with and without exploration (defined by (4.15) and (4.16), respectively). The integral term is the expected utility (before being regularized) of the exploration problem at time .
Proposition 4.1.
In the constrained problem with exploration, the exploration cost is given by
where, by abusing notations,
Moreover, .
Proof.
Recall first that the optimal distribution is a Gaussian distribution with mean and variance truncated on the interval . It is known, see e.g. Kotz et al. (2004), that for any Gaussian distribution truncated on , its entropy is given by
(4.20)  $\mathcal{H} = \ln\!\left(\sqrt{2\pi e}\,\hat\sigma\,Z\right) + \dfrac{\alpha\,\varphi(\alpha) - \beta\,\varphi(\beta)}{2Z},$
where $\alpha = (a-\hat\mu)/\hat\sigma$, $\beta = (b-\hat\mu)/\hat\sigma$ and $Z = \Phi(\beta)-\Phi(\alpha)$, with and the mean and standard deviation of the underlying (untruncated) Gaussian. The explicit form of the exploration cost now follows directly from Lemma 4.2. Finally, it is straightforward to verify that . ∎
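The closed-form entropy (4.20) is also convenient for numerical checks; the short sketch below (with illustrative, hypothetical parameter values) compares the formula against scipy's generic entropy computation for a truncated normal.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncated_normal_entropy(mu, sigma, a, b):
    """Entropy of N(mu, sigma^2) truncated to [a, b], cf. Kotz et al. (2004)."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    Z = norm.cdf(beta) - norm.cdf(alpha)
    return (np.log(np.sqrt(2.0 * np.pi * np.e) * sigma * Z)
            + (alpha * norm.pdf(alpha) - beta * norm.pdf(beta)) / (2.0 * Z))

mu, sigma, a, b = 0.55, 0.3, 0.0, 1.0   # illustrative values
alpha, beta = (a - mu) / sigma, (b - mu) / sigma
print(truncated_normal_entropy(mu, sigma, a, b))
print(truncnorm(alpha, beta, loc=mu, scale=sigma).entropy())  # should agree
```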
As shown above, the exploration cost converges to zero as . Since the exploration weight has been taken as an exogenous parameter reflecting the level of exploration desired by the learning agent, it is intuitive to expect that the smaller is, the more emphasis is placed on exploitation; and when is sufficiently close to zero, the exploratory formulation approaches the problem without exploration. The following lemma confirms the intuition that the exploration cost under a portfolio constraint is smaller than that without the constraint when the investment strategy is restricted to an interval.
Lemma 4.4.
If the investment strategy is restricted to , where , then for all . The same conclusion also holds for the case where or .
The following theorem confirms a desirable result that the entropy-regularized constrained EU problem converges to its non-exploratory constrained EU counterpart when the exploration weight goes to zero.
Theorem 4.2.
For each ,
(4.21) |
where the (without exploration) constrained Merton strategy is defined by (4.17). Furthermore, for each ,
4.3. Policy improvement
Below, for some given admissible (feedback) policy (not necessarily truncated Gaussian) on , we denote the corresponding value function by
(4.22) |
Theorem 4.3.
For a given policy (not necessarily Gaussian) on the interval , we assume that the corresponding value function is and satisfies for any . Suppose furthermore that the feedback policy defined by
(4.23) |
(Gaussian truncated on ) is admissible. Let be the value function corresponding to this new truncated (Gaussian) policy . Then,
(4.24) |
Proof.
The proof is similar to that of the unconstrained case hence, omitted. ∎
Theorem 4.3 also suggests that a candidate initial feedback policy may take the form for some parameters . Similarly to the unconstrained case, such a choice of policy leads to the convergence of both the value functions and the policies in a finite number of iterations.
Theorem 4.4.
Let with and assume that . Define the sequence of feedback policies updated by the policy improvement scheme (4.23), i.e.,
(4.25) |
where is the value function corresponding to the policy defined by
Proof.
The proof can be done using the Feynman–Kac representation and the update result of Theorem 4.3, similarly to the unconstrained problem. ∎
The above improvement theorem allows us to establish the policy evaluation step and the corresponding algorithm taking the martingale property into account. Since the steps are similar to those for the unconstrained problem in Section 3.4, we omit the details.
5. Learning implementation and numerical example
We are now in a position to discuss an RL algorithm for this problem. A common practice in the RL literature is to represent the value function and the policy using (deep) neural networks. In this section, we do not follow this path. Inspired by the (offline) actor-critic approach in Jia and Zhou (2022b), we instead take advantage of the explicit parametric forms of the optimal value function and the improved policy given in Theorem 3.2. This will in turn facilitate the learning process and lead to faster learning and convergence.
Below, to implement a workable algorithm, one can approximate a generic expectation in the following manner. Let be a partition of the finite interval . We then collect a sample as follows: for , the initial sample point is . Next, at each , is sampled to obtain an allocation for the risky asset. The wealth at the next time instant is computed by equation (2.5). As a result, a generic expectation can be approximated by the following finite sum
(5.1) |
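A minimal sketch of this simulation scheme is given below (hypothetical market parameters and policy; not the paper's exact implementation). It advances the wealth with an Euler step on the chosen grid, re-sampling the risky allocation from a generic policy at every date, and approximates a generic expectation by the sample average as in (5.1).

```python
import numpy as np

r, mu, sigma = 0.02, 0.05, 0.3      # hypothetical market parameters
T, n_steps, n_paths, x0 = 1.0, 252, 1000, 1.0

def simulate_exploratory_wealth(sample_action, rng):
    """Euler simulation of the wealth process, with the risky fraction
    re-sampled from the (exploratory) policy at every grid point."""
    dt = T / n_steps
    t_grid = np.linspace(0.0, T, n_steps + 1)
    X = np.full(n_paths, x0)
    paths = [X.copy()]
    for k in range(n_steps):
        u = sample_action(t_grid[k], X, rng)          # one sampled action per path
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X = X + X * (r + u * (mu - r)) * dt + X * u * sigma * dW
        paths.append(X.copy())
    return t_grid, np.array(paths)

# Example: unconstrained Gaussian policy centered at the Merton fraction.
lam = 0.1
merton = (mu - r) / sigma**2
policy = lambda t, x, rng: rng.normal(merton, np.sqrt(lam) / sigma, size=x.shape)

rng = np.random.default_rng(1)
t_grid, paths = simulate_exploratory_wealth(policy, rng)
print("E[log X_T] approx.", np.mean(np.log(paths[-1])))   # finite-sum approximation (5.1)
```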
5.1. Implementation for the unconstrained problem
Recall first that the exploratory optimal value function is given by
(5.2) |
We learn it from the following parametric form
(5.3) |
which clearly satisfies Assumption 3.1. It follows from (5.2) that
(5.4) |
Recall that for a Gaussian distribution its entropy is given by . Since we do not know the model’s parameters and , we will learn them using Theorem 3.2 with the parametric policy It follows that
(5.5) |
Also,
Therefore
Moreover, it can be seen from (5.5) that and Finally, in such a learning framework, we choose the test function .
For the learning step, we adopt the (offline) actor-critic approach of Jia and Zhou (2022b), which is summarized in Algorithm 1 below.
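The following compact sketch illustrates such an offline actor-critic iteration for the unconstrained logarithmic-utility problem, with hypothetical parameter names: `theta` are the critic parameters of a parametric value function (it reuses `critic_update` and the module-level constants `T`, `lam` from the sketch in Section 3.4), and `phi = (mean, variance)` parameterizes a Gaussian policy. A generic score-function policy-gradient update with an explicit entropy-gradient term is used in place of the exact representation (3.29); it is an illustration, not the paper's Algorithm 1.

```python
import numpy as np

def gaussian_score(u, phi):
    """Score of a Gaussian policy N(phi[0], phi[1]) with respect to (mean, variance)."""
    m, v = phi
    return np.array([(u - m) / v, 0.5 * ((u - m) ** 2 / v**2 - 1.0 / v)])

def actor_critic(n_episodes=2000, lr_theta=0.05, lr_phi=0.005, seed=0):
    """Offline actor-critic sketch for the unconstrained log-utility problem.
    Critic: parametric value function updated via `critic_update` (see Section 3.4 sketch).
    Actor: Gaussian policy N(phi[0], phi[1]), updated with a score-function
    policy-gradient surrogate plus the explicit entropy-gradient term."""
    rng = np.random.default_rng(seed)
    theta = np.array([0.0])                  # critic parameters
    phi = np.array([0.3, 1.0])               # actor parameters: mean and variance
    n_steps = 50
    dt = T / n_steps                         # T, lam: module-level constants from Section 3.4
    r, mu, sigma, x0 = 0.02, 0.05, 0.3, 1.0  # used only by the simulated "environment"
    for _ in range(n_episodes):
        # Sample one trajectory under the current Gaussian policy.
        t_grid = np.linspace(0.0, T, n_steps + 1)
        X, actions = [x0], []
        for k in range(n_steps):
            u = rng.normal(phi[0], np.sqrt(phi[1]))
            dW = rng.normal(0.0, np.sqrt(dt))
            X.append(X[-1] * (1.0 + (r + u * (mu - r)) * dt + u * sigma * dW))
            actions.append(u)
        X = np.array(X)
        entropy = np.full(n_steps, 0.5 * np.log(2.0 * np.pi * np.e * phi[1]))
        # Critic step: martingale-based policy evaluation.
        theta = critic_update(theta, t_grid, X, entropy, lr=lr_theta)
        # Actor step: sampled regularized utility times the total score,
        # plus the direct gradient of the entropy bonus.
        G = np.log(X[-1]) + lam * entropy.sum() * dt
        total_score = sum(gaussian_score(u, phi) for u in actions)
        grad_entropy = np.array([0.0, 0.5 / phi[1]])
        phi = phi + lr_phi * (G * total_score + lam * T * grad_entropy)
        phi[1] = max(phi[1], 1e-4)           # keep the variance positive
    return theta, phi
```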
5.2. Implementation for the constrained problem
First, recall from (4.9) that the exploratory optimal value function is given by
for , where is given by (4.10). We learn it from the following parametric form
(5.6) |
which clearly satisfies Assumption 3.1. It follows from (5.6) that
(5.7) |
Again, we parametrize the policy from Theorem 4.1 as follows:
Also recall that for , a Gaussian distribution truncated on , its entropy is given by
Also,
Therefore
The partial derivatives of can be obtained explicitly.
Similarly to the unconstrained case, the learning step can be carried out using the actor-critic algorithm adapted to the constrained problem, see Algorithm 2 below.
5.3. Numerical demonstration
We provide an example to demonstrate our results under a portfolio constraint in a setting with and . The annualized interest rate is . We choose to simulate sample paths of the diffusion wealth process (6.1) based on the updated policy after each episode. First, for we plot the corresponding exploration cost function for both the constrained and unconstrained cases. Figure 1 clearly confirms the theoretical results obtained in Proposition 3.1. Interestingly, the exploration cost of the unconstrained case () is much larger than that of the constrained case (). Consistent with Lemma 4.4, this can be explained by the fact that, in the constrained case, one only needs to search for the optimal strategy in a finite domain, as opposed to an infinite domain in the unconstrained case.
[Figure 1: exploration cost for the constrained and unconstrained cases; left panel: effect of the lower portfolio bound, right panel: effect of the upper portfolio bound.]
The left (resp. right) panel of Figure 1 demonstrates the impact of the lower (resp. upper) portfolio bound on the exploration cost. Clearly, as the lower (resp. upper) bound of the portfolio decreases (resp. increases), the exploration cost increases. This finding aligns with the intuitive understanding that a broader domain of investment opportunities necessitates higher exploration expenses. Notably, when subjected to both short-selling (i.e., ) and money borrowing constraints (i.e., ), the exploration cost becomes negligible compared to the unconstrained case.
Recall from Theorem 3.1 that the difference between the constrained value function and the exploratory constrained value function is determined by the exploration rate . In Figure 2, we plot the difference for (left panel) and (right panel). It is clear from Figure 2 that the difference is substantial for a large exploration rate but decreases greatly for a small exploration rate.
[Figure 2: difference between the constrained value function and the exploratory constrained value function for a large (left panel) and a small (right panel) exploration rate.]
Next, we choose the learning rates used to update and to be and , respectively, with decay rate . From the closed-form formula, the true value of is given by . For each learning episode, we start with an initial guess
where are independent and generated from a (Gaussian noise) normal distribution. For each iteration, we randomly sample 1-year trajectories of the wealth process . The gradients of the value function are calculated by the approximate discrete sum (5.1) above. The model is trained for iterations. The results reported in this section are averages over training episodes. The portfolio strategy is limited by the lower bound (i.e. short-selling is not allowed) and the upper bound (i.e. money borrowing is also prohibited). Finally, we choose the test function .
[Figure 3: learned policy versus the true optimal truncated Gaussian policy for different exploration rates.]
Figure 3 reports the learned policy in comparison with the true optimal policy, which is a Gaussian distribution truncated on . It can be observed that the smaller the value of the exploration rate is, the more closely the learned optimal strategy matches the ground-truth optimal policy. As shown in Figure 4, when is small, the (truncated Gaussian) optimal policy gets closer to the Dirac distribution at the “constrained” Merton strategy defined in (4.17), which is consistent with the result obtained in Theorem 4.4.
[Figure 4: optimal truncated Gaussian policy for small exploration rates, approaching the Dirac measure at the constrained Merton strategy.]
To have a more concrete comparison between the true optimal policy and the learned optimal policy, in Table 1 we compare the true mean, the learned mean, and the standard deviation of the learned mean. In addition, we report the (empirical) Kullback–Leibler divergence (see e.g. Csiszár (1975)) from the learned policy to the true policy for different values of , see the last column.
Exploration rate | True mean | Learned mean | std | KL divergence
---|---|---|---|---
0.01 | 0.555556 | 0.555927 | 0.000005 | 0.000000
0.1 | 0.555556 | 0.554679 | 0.000007 | 0.000000
1 | 0.555556 | 0.555750 | 0.000189 | 0.000052
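The empirical Kullback–Leibler divergence reported in the last column can be estimated by Monte Carlo; a minimal sketch (with hypothetical learned and true parameters, and an assumed common standard deviation) is given below.

```python
import numpy as np
from scipy.stats import truncnorm

def kl_truncnorm_mc(p_params, q_params, bounds, n=100_000, seed=0):
    """Monte Carlo estimate of KL(p || q) for two Gaussians truncated to `bounds`.
    Each params tuple is (mean, std); p = learned policy, q = true policy."""
    a, b = bounds
    def frozen(mean, std):
        return truncnorm((a - mean) / std, (b - mean) / std, loc=mean, scale=std)
    p, q = frozen(*p_params), frozen(*q_params)
    x = p.rvs(size=n, random_state=seed)
    return np.mean(p.logpdf(x) - q.logpdf(x))

# Hypothetical values in the spirit of Table 1 (assumed common std of 0.5 on [0, 1]).
print(kl_truncnorm_mc((0.555750, 0.5), (0.555556, 0.5), bounds=(0.0, 1.0)))
```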
Next we consider the dependence of on the number of iterations used. To ease the comparison, we consider the difference as a function of the number of iterations. More specifically, we run Algorithm 1 to learn the coefficients from data. We then calculate the difference . In Figure 5, we report the difference averaged over 20 runs.
[Figure 5: difference between the optimal and learned value functions as a function of the number of iterations (average over 20 runs).]
It is clear from Figure 5 that the error between the optimal value function and the learned value function decreases as the number of iterations increases.
To have a more concrete comparison, we report the true parameters and versus the corresponding learned values in Table 2. As shown in Table 2, the quality of learning improves as the number of iterations increases.
Iteration | True | Learned | True | Learned |
---|---|---|---|---|
500 | (0.055929,-0.001497) | (0.052141,-0.001261) | (0.555556,2.407946) | (0.552537,2.407276) |
1000 | (0.055929,-0.001497) | (0.057109 ,-0.001836) | (0.555556,2.407946) | (0.554388,2.413809 ) |
2000 | (0.055929,-0.001497) | (0.054673,-0.003369 ) | (0.555556,2.407946) | (0.555207,2.410082) |
3000 | (0.055929,-0.001497) | (0.057058,-0.004642 ) | (0.555556,2.407946) | (0.555872,2.408604) |
4000 | (0.055929,-0.001497) | (0.054119 ,0.001334) | (0.555556,2.407946) | (0.554001,2.409800) |
5000 | (0.055929,-0.001497) | (0.055721,-0.000253) | (0.555556,2.407946) | (0.557341,2.404103) |
10000 | (0.055929,-0.001497) | (0.056456,-0.002821) | (0.555556,2.407946) | (0.554809,2.405365) |
Lastly in this section, we compare the effect of the exploration parameter on the distribution of the optimal wealth process. In Figure 6 we plot the distribution of the optimal wealth process at time using the dynamics in (4.12) and a sample size of . From Figure 6, it can be seen that the exploration parameter plays a central role in the distribution of the constrained optimal wealth process. We can clearly distinguish the difference between the case with exploration (i.e. ) and the case without exploration (i.e. ). In particular, exploration leads to a more dispersed distribution with a heavier tail. Again, as confirmed by the theory in the previous section, this effect becomes less significant as the exploration parameter decreases.
[Figure 6: distribution of the constrained optimal terminal wealth for different exploration parameters, including the case without exploration.]
6. Optimal portfolio with portfolio constraints and exploration for quadratic utility
We consider in this section the case of a quadratic utility function. It is worth noting that in portfolio theory the use of quadratic utility functions is very close to the classic Markowitz mean-variance portfolio, and the reader is referred to, e.g., Duffie and Richardson (1991); Bodnar et al. (2013) for discussions of the connections between three quadratic optimization problems: the Markowitz mean–variance problem, the mean–variance utility function, and the quadratic utility. Note that, unlike the mean-variance (MV) solution, the optimal solution under a quadratic utility function is not only time-consistent but, as shown below, also lies on the MV efficient frontier under mild conditions. Moreover, while the problem becomes extremely complex even without exploration (e.g. Li et al. (2002); Bielecki et al. (2005); Li and Xu (2016)) when a portfolio constraint is added to an MV problem, we are still able to obtain closed-form solutions when portfolio constraints are included in our exploratory learning setting with a quadratic utility. Note that for a quadratic utility function the wealth process can be negative. Therefore, below we consider the amount of wealth invested in the risky asset at time . To include portfolio constraints, we assume that, given the current wealth , the risky investment amount at time is bounded by , where (resp. ) is a deterministic continuous function defining the lower (resp. upper) portfolio bound. The wealth dynamics follows the SDE
(6.1) |
The set of admissible investment strategies is now defined by
Our objective is to maximize the terminal expected utility
(6.2) |
where , with parameter . The constant reflects the agent’s risk aversion and can be regarded as the risk aversion parameter in mean–variance analysis. We remark that the quadratic utility function has its global maximum at the so-called bliss point . Obviously, the quadratic utility function is symmetric with respect to the bliss point : it is increasing for and decreasing for , and the utility values at and at are equal.
(6.3) |
where
(6.4) |
and is a probability density on the interval . The exploratory optimization is now stated by
(6.5) |
where is the set of admissible feedback policies that satisfy the following properties:
(1) For each , is a density function on .
(2) The mapping is measurable.
(3) The exploration SDE (6.3) admits a unique strong solution denoted by and
As before, the optimal value function satisfies the following HJB equation
(6.6) |
with terminal condition . Using again the standard argument of DPP we observe that under the portfolio constraint , the optimal feedback policy now follows a truncated Gaussian distribution.
Lemma 6.1.
In the exploratory constrained EU setting with quadratic utility function and , the optimal feedback policy is a Gaussian distribution with mean and variance truncated on interval , where
(6.7) |
The density of the optimal policy is given by
(6.8) |
where and are the PDF and CDF functions of the standard normal distribution, respectively.
Substituting (6.8) back into the HJB equation (6.6), we obtain the following non-linear PDE
(6.9) |
with terminal condition , where by abusing notations,
(6.10) |
with
(6.11) | |||
(6.12) |
In the rest of this section we assume that the portfolio bounds are given by
where, as before , and are two time-varying continuous bounded functions with .
Theorem 6.1 (Quadratic utility-exploratory optimal investment under portfolio constraint).
The optimal value function of the entropy-regularized exploratory constrained optimal investment problem with quadratic utility is given by
(6.13) | ||||
(6.14) |
for , where is the Sharpe ratio, ,
(6.15) |
where
Moreover, the optimal feedback distribution control is a Gaussian distribution with parameter and , truncated on the interval i.e.
(6.16) |
Corollary 6.1 (Unconstrained exploratory quadratic portfolio).
The optimal value function of the entropy-regularized exploratory unconstrained optimal investment problem for the quadratic utility is given by
(6.17) | ||||
(6.18) |
for . Moreover, the unconstrained optimal feedback distribution control is a Gaussian distribution with parameter and .
Proof.
This is a direct consequence of Theorem 6.1 by letting and . ∎
Remark 6.1.
Compared to the mean-variance results obtained in Wang and Zhou (2020), the risk-reward parameter plays the role of the Lagrange multiplier (in their notation) in the mean-variance setting. It should be mentioned that, instead of considering the true portfolio, the authors in Wang and Zhou (2020) study the mean-variance problem for the discounted portfolio, which can be obtained by assuming in our setting. Note that the variance of our optimal policy differs from the one obtained in Wang and Zhou (2020) by a multiplicative factor , since our quadratic utility is nothing but the mean-variance utility in Wang and Zhou (2020) scaled by .
Letting in Corollary 6.1, we obtain the unconstrained optimal portfolio without exploration for a quadratic utility function.
Corollary 6.2 (Unconstrained quadratic portfolio).
For a quadratic utility , the optimal value function of unconstrained optimal investment problem without exploration is given by
(6.19) |
and the unconstrained optimal control strategy is given by .
Following arguments similar to those used in Proposition 4.1 and Lemma 4.4, we can confirm that the exploration cost of the constrained problem is smaller than that of the unconstrained problem for a quadratic utility function.
Proposition 6.1.
In the constrained problem with exploration and quadratic utility function, the exploration cost is given by
Moreover,
Similarly to Theorems 4.3-4.4 for logarithmic utility, a policy improvement theorem can be shown for quadratic utility functions. Below we show that the optimal solution of the exploratory quadratic utility problem is mean-variance efficient (see e.g. Luenberger (1998); Kato et al. (2020); Duffie and Richardson (1991) for similar discussions in the case without exploration).
Proposition 6.2.
Assume that the agent’s initial wealth is smaller than the discounted reward level , i.e. . Then, the exploratory unconstrained optimal portfolio belongs to the mean-variance frontier.
Remark 6.2.
Note that since the exploration parameter only affects the diffusion term, we can see that Proposition 6.2 also holds for the unconstrained case without exploration. Moreover, it can be shown that Proposition 6.2 still holds true for the constrained case with exploration. Indeed, recall the optimal constrained exploratory solution whose dynamics are given by
where , are given by (C.5). Note that the second term of in (C.5) is negative because
Using the comparison principle of ordinary differential equations we can conclude that , which implies that the optimal solution of the expected quadratic utility problem with both portfolio constraints and exploration lies on the (constrained) mean-variance frontier.
6.1. Implementation
We consider the case without constraint. Recall the value function
(6.20) | ||||
(6.21) |
for . We choose to parametrize the value function as
That is,
From here the test functions are chosen as
(6.22) |
Also, recall the unconstrained optimal feedback distribution control is a Gaussian distribution with parameter and . For the policy distribution, we parametrize it as follows
From here, it can be seen that the entropy is given by
The log-likelihood is given by
Also, the derivatives of the log-likelihood are given by
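A minimal sketch of such a parametrized Gaussian policy is given below, collecting the entropy, log-likelihood and log-likelihood gradients used in the actor update. The constant-parameter form and the names phi1 (mean) and phi2 (log-variance) are illustrative assumptions, not the paper's exact parametrization.

```python
import numpy as np

class GaussianPolicy:
    """Gaussian feedback policy pi(u | t, x) = N(mean, var).

    The parametrization (constant mean phi1 and constant log-variance phi2)
    is an illustrative assumption; the time argument t is kept only to match
    the feedback-policy interface and is unused here.
    """

    def __init__(self, phi1=0.0, phi2=0.0):
        self.phi = np.array([phi1, phi2], dtype=float)

    def mean_var(self, t):
        mean = self.phi[0]
        var = np.exp(self.phi[1])          # keep the variance positive
        return mean, var

    def sample(self, t, rng):
        m, v = self.mean_var(t)
        return rng.normal(m, np.sqrt(v))

    def entropy(self, t):
        # Differential entropy of N(m, v): 0.5 * log(2 * pi * e * v).
        _, v = self.mean_var(t)
        return 0.5 * np.log(2.0 * np.pi * np.e * v)

    def log_likelihood(self, u, t):
        m, v = self.mean_var(t)
        return -0.5 * np.log(2.0 * np.pi * v) - (u - m) ** 2 / (2.0 * v)

    def grad_log_likelihood(self, u, t):
        # Gradients w.r.t. (phi1, phi2) used in the policy-gradient update.
        m, v = self.mean_var(t)
        d_mean = (u - m) / v                       # d/d phi1
        d_logvar = 0.5 * ((u - m) ** 2 / v - 1.0)  # d/d phi2
        return np.array([d_mean, d_logvar])
```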
6.2. Numerical example
We illustrate the results for the quadratic utility function. Specifically, we consider the scenario without constraints; the chosen model parameters satisfy the required condition. First, we run the algorithm for a fixed number of iterations; the learning rates used to update the value function and policy parameters are chosen with a suitable decay rate, and a fixed exploration rate is used. Figure 8 plots the evolution of the learned coefficients, whose true values are known in closed form.
In this case, it is evident from Figure 8 that the learned parameter vector converges to its true counterpart. Next, recall from Corollary 6.1 that the optimal feedback distribution control is a Gaussian distribution with the mean and variance given there. To have a concrete comparison, we run the algorithm to obtain the parameter estimates; using these estimates, we plot in Figure 8 the density of the true optimal control against the density of the learned optimal control for various exploration rates.
It is clear from Figure 8 that the learned policies match the true policies extremely well for the different exploration rates. This once again confirms the theoretical findings presented in this section. We remark that, compared to the existing literature on the exploratory MV problem studied in e.g. Wang and Zhou (2020); Dai et al. (2020); Jia and Zhou (2022a, b), which has to deal with an additional Lagrange multiplier, our numerical implementation for the quadratic utility is much simpler thanks to our closed-form solutions. As mentioned above, since the classical MV problem is time-inconsistent, the optimal value and strategy might only be meaningful when computed at the initial time. In our case, the optimal investment strategy and value function can be learned at any time over the investment horizon. Finally, we remark that the above learning procedure can be extended to the constrained problem by slightly adapting the implementation, as was done for the logarithmic utility. Since the paper is already rather lengthy, we do not present this implementation here.
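To make the learning loop concrete, the following REINFORCE-style sketch (not the paper's martingale-based algorithm) learns an exploratory Gaussian policy. For transparency it uses the logarithmic utility, for which the optimal mean is the Merton fraction (mu - r)/sigma^2; all market parameters, learning rates and the constant-parameter policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, r, sigma = 0.08, 0.02, 0.3
T, n_steps = 1.0, 50
dt = T / n_steps
lam = 0.1                       # exploration (temperature) weight
x0 = 1.0

phi = np.array([0.0, -2.0])     # policy N(phi[0], exp(phi[1]))
lr = 0.01
baseline = 0.0                  # running baseline for variance reduction

for episode in range(20000):
    m, v = phi[0], np.exp(phi[1])
    x = x0
    score = np.zeros(2)
    for _ in range(n_steps):
        u = rng.normal(m, np.sqrt(v))          # sampled portfolio weight
        # Classical wealth step under the sampled action (Euler scheme).
        x *= 1.0 + (r + u * (mu - r)) * dt + u * sigma * np.sqrt(dt) * rng.standard_normal()
        x = max(x, 1e-8)                       # keep wealth positive for log utility
        # Accumulate the score function d log pi / d phi.
        score += np.array([(u - m) / v, 0.5 * ((u - m) ** 2 / v - 1.0)])
    reward = np.log(x)
    baseline += 0.01 * (reward - baseline)
    # Likelihood-ratio gradient of E[U(X_T)], plus the direct gradient of the
    # entropy bonus lam * T * H(pi), where dH/dphi = (0, 0.5).
    grad = (reward - baseline) * score + lam * T * np.array([0.0, 0.5])
    phi += lr * grad

print("learned mean weight:", phi[0], " Merton fraction:", (mu - r) / sigma ** 2)
```

The learned mean should drift toward the Merton fraction, while the entropy bonus keeps the learned variance bounded away from zero; convergence is slow and noisy, as expected for a plain score-function estimator.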
7. An extension to random coefficient models
This section shows that our model can be extended beyond the classical Black-Scholes model. In particular, we assume that the drift and the volatility of the risky asset are driven by a state process as follows:
(7.1)
and
(7.2)
where the driving Brownian motions are independent. We assume that the coefficient functions are globally Lipschitz and linearly bounded, so that there exists a unique strong solution to the factor SDE (7.2). It is worth mentioning that, due to the additional source of randomness, the market is incomplete. Such a factor model has been well studied in the literature. Now, the exploratory optimal value function is defined by
where, as before, the exploratory wealth process and the set of admissible distributions are defined analogously to the previous sections. Hence, the optimal value function satisfies the following HJB equation
(7.3)
with the appropriate terminal condition. Following the same steps as in Section 3, the optimal distribution is given by
(7.4)
which is Gaussian, with mean and variance (well defined under the assumptions below) given by
(7.5)
As before, using the basic properties of Gaussian laws we obtain the following PDE
(7.6)
with the appropriate terminal condition. Compared to the previous sections, we now have to deal with a two-dimensional PDE. To solve (7.6), we seek an ansatz in product form, where the unknown factor is a smooth function of time and the factor variable only. For this ansatz we obtain the following PDE:
(7.7)
where .
Theorem 7.1.
Assume that the coefficient functions are bounded and differentiable with bounded derivatives, and that
(7.8)
Then the value function constructed from the solution of the PDE (7.7) via the ansatz above is the optimal value function of the exploratory problem. Moreover, the optimal distribution control is Gaussian with the mean and variance given in (7.5), and the exploratory wealth process is given by
(7.9)
Proof.
Changing variables, we can rewrite the PDE (7.7) as
(7.10)
By assumption, the coefficients are continuous and bounded, and the Cauchy problem (7.10) satisfies all conditions of Theorems 12 and 17 (Chapter 1) in Friedman (2008). Therefore, there exists a unique solution in the relevant class. It is now straightforward to see that the corresponding product ansatz satisfies (7.6). Note that, by the Feynman-Kac theorem, the function can be represented as
(7.11)
where
(7.12)
and the expectation in (7.11) is taken under an auxiliary probability measure, with respect to which the corresponding driving process is a Brownian motion. By following the same steps as in the proof of Theorem 3.1, we can show that (7.4) is admissible and hence is an optimal policy. The corresponding optimal exploratory wealth process is given by
(7.13)
∎
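As a hedged illustration of the factor dynamics (7.1)-(7.2), the following sketch simulates the risky asset and the factor process with an Euler-Maruyama scheme driven by independent Brownian motions; the coefficient functions and all numerical values are placeholders, not the paper's specification.

```python
import numpy as np

# Euler-Maruyama simulation of a stochastic-factor market of the type in
# (7.1)-(7.2). The coefficients below (an Ornstein-Uhlenbeck-like factor
# driving drift and volatility) are illustrative placeholders.

rng = np.random.default_rng(1)

def mu_fn(y):    return 0.03 + 0.05 * np.tanh(y)   # drift of the risky asset
def sigma_fn(y): return 0.20 + 0.05 * np.tanh(y)   # volatility, bounded and positive
def m_fn(y):     return -1.0 * y                   # factor drift (mean reversion to 0)
def nu_fn(y):    return 0.3                        # factor volatility

T, n_steps = 1.0, 250
dt = T / n_steps
s, y = 1.0, 0.0

path_s, path_y = [s], [y]
for _ in range(n_steps):
    dw, db = rng.standard_normal(2) * np.sqrt(dt)  # independent Brownian increments
    s += s * (mu_fn(y) * dt + sigma_fn(y) * dw)
    y += m_fn(y) * dt + nu_fn(y) * db
    path_s.append(s)
    path_y.append(y)

print("terminal asset price:", path_s[-1], " terminal factor:", path_y[-1])
```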
Remark 7.1.
- When the factor dependence vanishes (i.e., the drift and volatility are constant), we are back to the classical Black-Scholes model in Section 3.
8. Conclusion
We study an exploratory version of the continuous-time expected utility maximization problem with reinforcement learning. We show that the optimal feedback policy of the unconstrained exploratory problem is Gaussian. However, when the risky investment ratio is restricted to a given interval, the constrained optimal exploratory policy follows a truncated Gaussian distribution. For logarithmic and quadratic utility functions, the solution to the exploratory problem can be obtained in closed form and converges to the classical expected utility counterpart when the exploration weight goes to zero. For interpretable RL algorithms, a policy improvement theorem is provided. Finally, we devise an implementable reinforcement learning algorithm by casting the optimal problem in a martingale framework. Our work can be extended in various directions. For example, it would be interesting to consider both consumption and investment with general utility functions in more general market settings. We foresee that the q-learning framework developed in Jia and Zhou (2023) might be useful in such a setting. It would also be interesting to extend the work to the higher-dimensional case.
We leave these exciting research problems for future studies.
Acknowledgements
Thai Nguyen acknowledges the support of the Natural Sciences and Engineering Research Council of Canada [RGPIN-2021-02594].
References
- Baird (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
- Barnard (1993) Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357–365.
- Bertsekas (2019) Bertsekas, D. (2019). Reinforcement learning and optimal control. Athena Scientific.
- Bertsimas and Thiele (2006) Bertsimas, D. and Thiele, A. (2006). A robust optimization approach to inventory theory. Operations Research, 54(1):150–168.
- Bielecki et al. (2005) Bielecki, T. R., Jin, H., Pliska, S. R., and Zhou, X. Y. (2005). Continuous-time mean-variance portfolio selection with bankruptcy prohibition. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 15(2):213–244.
- Bodnar et al. (2013) Bodnar, T., Parolya, N., and Schmid, W. (2013). On the equivalence of quadratic optimization problems commonly used in portfolio theory. European Journal of Operational Research, 229(3):637–644.
- Charpentier et al. (2021) Charpentier, A., Elie, R., and Remlinger, C. (2021). Reinforcement learning in economics and finance. Computational Economics, pages 1–38.
- Chen and Vellekoop (2017) Chen, A. and Vellekoop, M. (2017). Optimal investment and consumption when allowing terminal debt. European Journal of Operational Research, 258(1):385–397.
- Csiszár (1975) Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, pages 146–158.
- Cuoco (1997) Cuoco, D. (1997). Optimal consumption and equilibrium prices with portfolio constraints and stochastic income. Journal of Economic Theory, 72(1):33–73.
- Dai et al. (2020) Dai, M., Dong, Y., and Jia, Y. (2020). Learning equilibrium mean-variance strategy. Available at SSRN 3770818.
- Dai et al. (2023) Dai, M., Dong, Y., and Jia, Y. (2023). Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4):1166–1212.
- Dai et al. (2021) Dai, M., Jin, H., Kou, S., and Xu, Y. (2021). A dynamic mean-variance analysis for log returns. Management Science, 67(2):1093–1108.
- Donsker and Varadhan (2006) Donsker, M. and Varadhan, S. (2006). Large deviations for Markov processes and the asymptotic evaluation of certain Markov process expectations for large times. In Probabilistic Methods in Differential Equations: Proceedings of the Conference Held at the University of Victoria, August 19–20, 1974, pages 82–88. Springer.
- Doya (2000) Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245.
- Duffie and Richardson (1991) Duffie, D. and Richardson, H. R. (1991). Mean-variance hedging in continuous time. The Annals of Applied Probability, pages 1–15.
- El Karoui and Jeanblanc-Picqué (1998) El Karoui, N. and Jeanblanc-Picqué, M. (1998). Optimization of consumption with labor income. Finance and Stochastics, 2:409–440.
- Fleming (1976) Fleming, W. H. (1976). Generalized solutions in optimal stochastic control. Brown Univ.
- Fleming and Nisio (1984) Fleming, W. H. and Nisio, M. (1984). On stochastic relaxed control for partially observed diffusions. Nagoya Mathematical Journal, 93:71–108.
- Fleming and Soner (2006) Fleming, W. H. and Soner, H. M. (2006). Controlled Markov processes and viscosity solutions, volume 25. Springer Science & Business Media.
- Frémaux et al. (2013) Frémaux, N., Sprekeler, H., and Gerstner, W. (2013). Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS computational biology, 9(4):e1003024.
- Friedman (2008) Friedman, A. (2008). Partial differential equations of parabolic type. Courier Dover Publications.
- Gerrard et al. (2023) Gerrard, R., Kyriakou, I., Nielsen, J. P., and Vodička, P. (2023). On optimal constrained investment strategies for long-term savers in stochastic environments and probability hedging. European Journal of Operational Research, 307(2):948–962.
- Gosavi (2009) Gosavi, A. (2009). Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192.
- Guo et al. (2022) Guo, X., Hu, A., Xu, R., and Zhang, J. (2022). A general framework for learning mean-field games. Mathematics of Operations Research.
- Hambly et al. (2021) Hambly, B., Xu, R., and Yang, H. (2021). Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553.
- Hendricks and Wilcox (2014) Hendricks, D. and Wilcox, D. (2014). A reinforcement learning extension to the Almgren–Chriss framework for optimal trade execution. In 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), pages 457–464. IEEE.
- Huang et al. (2022) Huang, Y.-J., Wang, Z., and Zhou, Z. (2022). Convergence of policy improvement for entropy-regularized stochastic control problems. arXiv preprint arXiv:2209.07059.
- Jaimungal (2022) Jaimungal, S. (2022). Reinforcement learning and stochastic optimisation. Finance and Stochastics, 26(1):103–129.
- Jia and Zhou (2022a) Jia, Y. and Zhou, X. Y. (2022a). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154):1–55.
- Jia and Zhou (2022b) Jia, Y. and Zhou, X. Y. (2022b). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(154):1–55.
- Jia and Zhou (2023) Jia, Y. and Zhou, X. Y. (2023). Q-learning in continuous time. Journal of Machine Learning Research.
- Kaelbling et al. (1996) Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.
- Kamma and Pelsser (2022) Kamma, T. and Pelsser, A. (2022). Near-optimal asset allocation in financial markets with trading constraints. European Journal of Operational Research, 297(2):766–781.
- Karatzas and Shreve (1998) Karatzas, I. and Shreve, S. E. (1998). Methods of mathematical finance, volume 39. Springer.
- Kato et al. (2020) Kato, M., Nakagawa, K., Abe, K., and Morimura, T. (2020). Mean-variance efficient reinforcement learning by expected quadratic utility maximization. arXiv preprint arXiv:2010.01404.
- Kotz et al. (2004) Kotz, S., Balakrishnan, N., and Johnson, N. L. (2004). Continuous multivariate distributions, Volume 1: Models and applications, volume 1. John Wiley & Sons.
- Lee and Lee (2021) Lee, H.-R. and Lee, T. (2021). Multi-agent reinforcement learning algorithm to solve a partially-observable multi-agent problem in disaster response. European Journal of Operational Research, 291(1):296–308.
- Lee and Sutton (2021) Lee, J. and Sutton, R. S. (2021). Policy iterations for reinforcement learning problems in continuous time and space—fundamental theory and methods. Automatica, 126:109421.
- Li and Xu (2016) Li, X. and Xu, Z. Q. (2016). Continuous-time Markowitz's model with constraints on wealth and portfolio. Operations Research Letters, 44(6):729–736.
- Li et al. (2002) Li, X., Zhou, X. Y., and Lim, A. E. (2002). Dynamic mean-variance portfolio selection with no-shorting constraints. SIAM Journal on Control and Optimization, 40(5):1540–1555.
- Liu et al. (2020) Liu, Y., Chen, Y., and Jiang, T. (2020). Dynamic selective maintenance optimization for multi-state systems over a finite horizon: A deep reinforcement learning approach. European Journal of Operational Research, 283(1):166–181.
- Luenberger (1998) Luenberger, D. G. (1998). Investment science. Oxford University Press.
- Merton (1975) Merton, R. C. (1975). Optimum consumption and portfolio rules in a continuous-time model. In Stochastic optimization models in finance, pages 621–661. Elsevier.
- Moody et al. (1998) Moody, J., Wu, L., Liao, Y., and Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470.
- Mou et al. (2021) Mou, C., Zhang, W., and Zhou, C. (2021). Robust exploratory mean-variance problem with drift uncertainty. arXiv preprint arXiv:2108.04100.
- Nevmyvaka et al. (2006) Nevmyvaka, Y., Feng, Y., and Kearns, M. (2006). Reinforcement learning for optimized trade execution. In Proceedings of the 23rd international conference on Machine learning, pages 673–680.
- El Karoui et al. (1987) El Karoui, N., Nguyen, D. H., and Jeanblanc-Picqué, M. (1987). Compactification methods in the control of degenerate diffusions: existence of an optimal control. Stochastics: An International Journal of Probability and Stochastic Processes, 20(3):169–219.
- Pham (2009) Pham, H. (2009). Continuous-time stochastic control and optimization with financial applications, volume 61. Springer Science & Business Media.
- Pulley (1983) Pulley, L. B. (1983). Mean-variance approximations to expected logarithmic utility. Operations Research, 31(4):685–696.
- Rubinstein (1977) Rubinstein, M. (1977). The strong case for the generalized logarithmic utility model as the premier model of financial markets. In Financial Decision Making Under Uncertainty, pages 11–62. Elsevier.
- Schnaubelt (2022) Schnaubelt, M. (2022). Deep reinforcement learning for the optimal placement of cryptocurrency limit orders. European Journal of Operational Research, 296(3):993–1006.
- Schneckenreither and Haeussler (2019) Schneckenreither, M. and Haeussler, S. (2019). Reinforcement learning methods for operations research applications: The order release problem. In Machine Learning, Optimization, and Data Science: 4th International Conference, LOD 2018, Volterra, Italy, September 13-16, 2018, Revised Selected Papers 4, pages 545–559. Springer.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354–359.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Tang et al. (2022) Tang, W., Zhang, Y. P., and Zhou, X. Y. (2022). Exploratory hjb equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191–3216.
- Wang (2019) Wang, H. (2019). Large scale continuous-time mean-variance portfolio allocation via reinforcement learning. arXiv preprint arXiv:1907.11718.
- Wang et al. (2020) Wang, H., Zariphopoulou, T., and Zhou, X. Y. (2020). Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34.
- Wang and Zhou (2020) Wang, H. and Zhou, X. Y. (2020). Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308.
- Williams et al. (2017) Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J. M., Boots, B., and Theodorou, E. A. (2017). Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721. IEEE.
Appendix A Technical proofs of Section 3
A.1. Proof of Theorem 3.1
Proof.
We find the solution of the HJB equation (2.12) in separable form, where the component functions are smooth and satisfy the appropriate terminal conditions. Direct calculation shows that (2.12) boils down to
(A.1)
where the constant appearing in the entropy term is Archimedes' constant. It follows that
Hence, the component functions are obtained explicitly, with the constant chosen such that the terminal condition is fulfilled. Direct calculation leads to
and the solution of the PDE (2.12) is given by (3.1). Let us first show that the candidate in (3.1) is indeed the optimal value function. It follows that the optimal distribution is a Gaussian distribution, whose mean is independent of the exploration parameter, with the corresponding variance. The law of this optimal feedback Gaussian control allows us to determine the exploratory wealth drift and volatility from (2.6) as
Hence, the exploratory wealth dynamics are given by (3.3). Now, it is clear that the SDE (3.3) admits a solution given by
(A.2)
Moreover, it can be seen directly from (A.2) that
which means that the policy is admissible. ∎
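For intuition, the following sketch simulates the exploratory wealth process under a Gaussian relaxed control, using the standard aggregation of the drift and of the squared volatility over the policy; the market parameters and the constant policy parameters are placeholder assumptions.

```python
import numpy as np

# Euler-Maruyama simulation of the exploratory (relaxed-control) wealth
# dynamics under a Gaussian policy N(m, v): drift x*(r + m*(mu - r)) and
# squared volatility x^2 * sigma^2 * (m^2 + v). Values are illustrative.

rng = np.random.default_rng(2)
mu, r, sigma = 0.08, 0.02, 0.3
m, v = (mu - r) / sigma ** 2, 0.5      # policy mean (Merton ratio) and variance
T, n_steps = 1.0, 250
dt = T / n_steps

n_paths = 10000
x = np.full(n_paths, 1.0)
for _ in range(n_steps):
    drift = x * (r + m * (mu - r))
    vol = np.abs(x) * sigma * np.sqrt(m ** 2 + v)
    x = x + drift * dt + vol * np.sqrt(dt) * rng.standard_normal(n_paths)

print("mean terminal wealth:", x.mean(), " std:", x.std())
```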
A.2. Proof of Theorem 3.4
Proof.
Recall first that, for any admissible policy (not necessarily Gaussian), the corresponding value function defined by (3.10) solves the following average PDE
with the appropriate terminal condition. It follows that
Note that the above supremum is attained at the updated policy defined by (3.11). In other words,
which implies that
(A.3)
Now, for the updated policy, the corresponding value function is given by
(A.4)
Applying Itô's formula, we obtain
Consider a localizing sequence of stopping times for the stochastic integral above. By (A.3) we obtain, along this sequence,
Taking expectations, passing to the limit along the localizing sequence and using (A.4), we obtain the desired inequality between the value functions.
∎
A.3. Proof of Theorem 3.5
Proof.
Observe first that the corresponding value function solves the following average PDE
Solving the above PDE with the terminal condition, we obtain a solution in product form, where the remaining factor is a continuous function independent of the wealth level. Therefore, the value function satisfies the hypothesis of Theorem 3.4 and its conclusions apply. In particular, the next Gaussian policy
(A.5)
is admissible and
(A.6)
where the newly introduced function is the value function corresponding to this new policy. Again, it solves the following average PDE
Direct computations related to the distribution of the new policy lead to the following PDE
It is hence straightforward to see that the function given by (3.1) solves the above PDE, and the proof is complete. ∎
A.4. Proof of Theorem 3.6
Appendix B Technical Proofs of Section 4
B.1. Proof of Theorem 4.1
Proof.
Similarly to the unconstrained problem, we find the solution of the HJB equation (4.6) in separable form, where the component functions are smooth and satisfy the appropriate terminal conditions. Direct calculation shows that
which are independent of the wealth level. Hence, the resulting quantity is also independent of the wealth level. Now, (4.6) boils down to
(B.1)
for , where . It follows that and
where the constant is chosen such that the terminal condition is fulfilled. Direct calculation leads to
and the solution of the PDE (4.6) is given by (4.9). It is straightforward to verify that the function defined by (4.9) solves (4.6). Indeed, it follows that the optimal distribution is a Gaussian distribution, with a mean that is independent of exploration, truncated on the constraint interval. Its density is given by (4.14) and is independent of the wealth level. This truncated Gaussian control allows us to determine the exploratory wealth drift and volatility as
(B.2)
where the relevant quantities are given by (4.14). This leads to the exploratory wealth dynamics given in (4.12). Now, it is clear that the SDE (4.12) admits a unique solution, which means that the policy is admissible. ∎
B.2. Proof of Lemma 4.4
Proof.
Appendix C Technical proofs for Section 6
C.1. Proof of Theorem 6.1
Proof.
We find the solution of the HJB equation (6.8) in separable form, where the component functions are smooth and satisfy the appropriate terminal conditions. Differentiating this ansatz, plugging it into the PDE (6.8) and matching the coefficients, we obtain
Similarly, we obtain the following ODE when considering the remaining coefficient:
(C.1)
Direct calculation shows that
(C.2)
(C.3)
which are independent of the wealth level. Hence, the function given in (6.15) is also independent of the wealth level. Therefore,
Now, it is straightforward to verify that the function defined by (6.14) solves (6.8). Next, it follows that the optimal feedback distribution control is a Gaussian distribution, with the mean and variance parameters given above, truncated on the constraint interval, i.e.
(C.4)
This explicit truncated Gaussian policy allows us to determine the exploratory wealth drift and volatility as
where
(C.5)
and
(C.6)
which guarantees that the SDE (6.3) admits a strong solution. Finally, it is straightforward to verify the integrability conditions and conclude that the policy is admissible. ∎
C.2. Proof of Proposition 6.2
Proof.
First, from the form of the quadratic utility, it is easy to see that, for any admissible terminal portfolio,
(C.7)
Observe that, when the variance is fixed, the right-hand side of (C.7) is increasing in the mean as long as the mean stays below the reward level. Therefore, among policies with a fixed variance, the policy that maximizes the expected quadratic utility attains the highest mean on the right-hand side of (C.7). In other words, to show that the unconstrained exploratory optimal terminal wealth of the expected quadratic utility maximization problem lies on the mean-variance efficient frontier, it suffices to verify that its mean stays below the reward level. Indeed, from (C.5), in the unconstrained case the optimal wealth dynamics with exploration are given by
which yields an equation for the expected wealth. After a suitable change of variable, we obtain the following ODE
Solving this ODE with the given initial condition, we obtain
(C.8)
It then follows directly from the condition on the initial wealth that the required mean condition holds. ∎