
Continuous-time optimal investment with portfolio constraints: a reinforcement learning approach

H. Chau Department of Mathematics
University of Manchester
Manchester M13 9PL, United Kingdom
[email protected]
D. Nguyen Department of Mathematics
Marist College
3399 North Road
Poughkeepsie NY 12601
United States
[email protected]
 and  T. Nguyen École d’Actuariat
Université Laval
2325 Rue de l’Université, Québec, QC G1V 0A6, Canada
[email protected]
Abstract.

In a reinforcement learning (RL) framework, we study the exploratory version of the continuous-time expected utility (EU) maximization problem with a portfolio constraint that includes widely used financial regulations such as short-selling constraints and borrowing prohibition. The optimal feedback policy of the exploratory unconstrained classical EU problem is shown to be Gaussian. In the case where the portfolio weight is constrained to a given interval, the corresponding constrained optimal exploratory policy follows a truncated Gaussian distribution. We verify that the closed-form optimal solutions obtained for logarithmic and quadratic utility, in both the unconstrained and constrained settings, converge to their non-exploratory expected utility counterparts when the exploration weight goes to zero. Finally, we establish a policy improvement theorem and devise an implementable reinforcement learning algorithm by casting the optimal problem in a martingale framework. Our numerical examples show that exploration leads to an optimal wealth process that is more dispersed, with heavier tails, than in the case without exploration. This effect weakens as the exploration parameter decreases. Moreover, the numerical implementation also confirms the intuitive understanding that a broader domain of investment opportunities necessitates a higher exploration cost. Notably, when subjected to both short-selling and money borrowing constraints, the exploration cost becomes negligible compared to the unconstrained case.

Key words and phrases:
Optimal investment, entropy regularized, reinforcement learning, exploration, stochastic optimal control, portfolio constraint
2010 Mathematics Subject Classification:
34D20, 60H10, 92D25, 93D05, 93D20.

1. Introduction

Reinforcement learning (RL) is a dynamic and rapidly advancing subset of machine learning. In recent times, RL has been applied across diverse disciplines to address substantial real-world challenges, extending its reach into domains such as game theory (Silver et al., 2016, 2017), control theory (Bertsekas, 2019), information theory (Williams et al., 2017), and statistics (Kaelbling et al., 1996). RL has also been applied to many important issues of operations research (OR), such as manufacturing planning, control systems (Schneckenreither and Haeussler, 2019) and inventory management (Bertsimas and Thiele, 2006). In quantitative finance, RL has been used to study important problems such as algorithmic and high-frequency trading and portfolio management. The availability of big data sets has also facilitated the rapid adoption of RL techniques in financial engineering. As a typical example, electronic markets can provide a sufficient amount of microstructure data for training and adaptive learning, far beyond what human traders and portfolio managers could handle in the past. Numerous studies have been carried out in this direction, including optimal order execution (Nevmyvaka et al., 2006; Schnaubelt, 2022), optimal trading (Hendricks and Wilcox, 2014), portfolio allocation (Moody et al., 1998), and mean-variance portfolio allocation (Wang and Zhou, 2020; Wang, 2019; Dai et al., 2020; Jia and Zhou, 2022b). Notably, the authors in Wang and Zhou (2020) considered continuous-time mean-variance portfolio selection in a continuous RL framework. Their algorithm outperforms both traditional and deep neural network based algorithms by a large margin. For recent reviews on applications of RL in OR, finance and economics, we refer the interested reader to the survey papers, e.g. Gosavi (2009); Hambly et al. (2021); Jaimungal (2022); Charpentier et al. (2021).

Differing from conventional econometric or supervised learning approaches prevalent in quantitative finance research, which often necessitate the assumption of parametric models, an RL agent refrains from pre-specifying a structural model. Instead, she progressively learns optimal strategies through trial and error, engaging in interactions with a black-box environment (e.g. the market). By repeatedly trying a policy for actions, receiving and evaluating reward signals, and improving the policy, the agent gradually improves her performance. The three key components that capture the heart of RL are: i) exploration, ii) policy evaluation, and iii) policy improvement. The agent first explores the surrounding environment by trying a policy. She then evaluates her reward for the given policy. Lastly, using the information received from both exploration and the current reward, she devises a new policy with a larger reward. The whole process is then repeated.

Despite the fast development and vast applications, very few existing RL studies are carried out in continuous settings, e.g. Doya (2000); Frémaux et al. (2013); Lee and Sutton (2021), whereas a large body of RL investigations is limited to discrete learning frameworks such as discrete-time Markov decision processes (MDPs) or deterministic settings, see e.g. Sutton and Barto (2018); Hambly et al. (2021); Liu et al. (2020); Lee and Lee (2021). Since discrete-time dynamics only provide an approximation of real-world systems, it is important, from the practical point of view, to consider continuous-time systems with both continuous state and action spaces. As mentioned previously, under RL settings, an agent simultaneously interacts with the surrounding environment (exploration) and improves her overall performance (exploitation). Since exploration is inherently costly in terms of resources, it is important to design an active learning approach which balances both exploration and exploitation. Therefore, there is a critical need to extend RL techniques to continuous settings where the agent can find learning strategies that balance exploration and exploitation.

The study of RL under a framework that is continuous in both time and space has been initiated in a series of recent papers (Wang et al., 2020; Wang and Zhou, 2020; Wang, 2019; Dai et al., 2020; Jia and Zhou, 2022b). In Wang et al. (2020) the authors propose a theoretical framework, called the exploratory formulation, for studying RL problems in systems that are continuous in both time and space, capturing repetitive learning under exploration in the continuous-time limit. Wang and Zhou (2020) adopts the RL setting of Wang et al. (2020) to study continuous-time mean-variance portfolio selection with a finite time horizon. They show that in a learning framework incorporating both exploration and exploitation, the optimal feedback policy is Gaussian, with time-decaying variance. As shown in Jia and Zhou (2022a), the framework in Wang et al. (2020) minimizes the mean-square temporal difference (TD) error (Barnard, 1993; Baird, 1995; Sutton and Barto, 2018) when learning the value function, which turns out to be inconsistent in stochastic settings. Dai et al. (2020) further considers equilibrium mean-variance strategies, addressing the time-inconsistency issue of the problem. Guo et al. (2022); Mou et al. (2021) extend the formulation and results of Wang and Zhou (2020) to mean-field games and a mean-variance problem with drift uncertainty, respectively. In Jia and Zhou (2022b), using a martingale approach, the authors are able to represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns policy gradient into a policy evaluation problem. Studies of convergence under the exploratory formulation can be found in Tang et al. (2022); Huang et al. (2022).

In this paper, we adopt the exploratory stochastic control learning framework of Wang et al. (2020) to study the continuous-time optimal investment problem without and with a portfolio constraint which covers widely used financial regulations such as short-selling constraints and borrowing prohibition (Cuoco, 1997). In both constrained and unconstrained settings with exploration, we manage to find the closed-form solution of the exploratory problem for logarithmic and quadratic utility functions. We show that the optimal feedback policy of the exploratory unconstrained expected utility (EU) problem is Gaussian, which is aligned with the result obtained in Wang and Zhou (2020) for the mean-variance setting. However, when the risky investment ratio is restricted to a given interval, the constrained optimal exploratory policy follows a truncated Gaussian distribution.

The explicit form of the solution to the HJB equations enables us to obtain a policy improvement theorem which confirms that exploration can be performed within the class of Gaussian (resp. truncated Gaussian) policies for the unconstrained (resp. constrained) EU problem. Moreover, by casting the optimal problem in a martingale framework as in Jia and Zhou (2022b), we devise an implementable reinforcement learning algorithm. We observe that, compared to the classical case (without exploration), the exploration procedure in the presence of a portfolio constraint leads to an optimal portfolio process that exhibits a more dispersed distribution with heavier tails. This effect weakens as the exploration parameter decreases. Moreover, a decrease (resp. increase) in the lower (resp. upper) bound of the portfolio strategy leads to a corresponding increase in the exploration cost. This aligns with the intuitive understanding that a larger investment opportunity domain requires a higher exploration cost. Notably, when facing both short-selling and money borrowing constraints, the exploration cost becomes negligible compared to the unconstrained case. Our paper is the first attempt to explicitly analyze the classical continuous-time EU problem with possible portfolio constraints in an entropy-regularized RL framework. Although this paper addresses specifically an optimal portfolio problem, it is highly relevant to OR challenges from multiple perspectives. The techniques and methodologies discussed here can be applied directly to various areas of OR that involve decision making under model parameter uncertainty, where reliable data are used to learn the true model parameters. It is worth highlighting that our findings in Section 6 establish a direct connection between quadratic utility and the mean-variance (MV) framework, a cornerstone in OR for evaluating risk-return profiles under portfolio constraints. With RL methods increasingly adopted in OR for solving dynamic stochastic problems, our work offers a versatile framework with clear applications in portfolio optimization, inventory management, supply chain systems, and other OR domains involving constrained stochastic decision-making.

While revising this paper, we became aware of a recent work, Dai et al. (2023), where the authors consider the Merton optimal portfolio for a power utility function in a stochastic volatility setting with a recursive weighting scheme on exploration that endogenously discounts the current exploration reward by the past accumulated amount of exploration. By adopting a two-step optimization of the corresponding exploratory HJB equation for an exploratory learning scheme that incorporates the correlation between the risky asset and the volatility dynamics suggested in Dai et al. (2021), the authors in Dai et al. (2023) are able to characterize the Gaussian optimal policy with a biased structure due to market incompleteness. Compared to Dai et al. (2021, 2023), our paper studies the Merton problem for logarithmic and quadratic utility functions in complete market settings, aiming at explicit solutions, while accounting for possible portfolio constraints. We remark that the approach in Dai et al. (2023) seems no longer applicable to the classical expected utility maximization when the endogenously recursive weighting scheme on exploration is removed. In addition, it is not clear how the two-step optimization procedure in Dai et al. (2023) could be applied to our constrained setting.

The rest of the paper is organized as follows: Section 2 formulates the exploratory optimal investment problem incorporating both exploration and exploitation and provides a general form of the optimal distributional control to the exploratory HJB equation. Section 3 studies the unconstrained optimal investment problem for the logarithmic utility function. We elaborate on the constrained EU problem with exploration and discuss the exploration cost and its impact in Section 4. Section 5 discusses an implementable reinforcement learning algorithm in a martingale framework and provides some numerical examples. Section 6 studies the case of a quadratic utility function. An extension to random coefficient markets is presented in Section 7. Section 8 concludes the paper with future research perspectives. Technical proofs and additional details can be found in the Appendix.

Notation. In this paper, we use Y or Y_{t} to denote a stochastic process Y:=\{Y_{t}\}_{t\in[0,T]}; Y_{t} also refers to the value of the process at time t when this is clear from the context. We use \mathcal{N}(x|a,b^{2}) to denote the normal distribution with mean a and standard deviation b>0, and \pi_{e}\approx 3.14 denotes the mathematical constant pi. Additionally, f_{t},f_{x},f_{xx} denote the first/second partial derivatives with respect to the corresponding arguments. \varphi(y)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^{2}} and \Phi(y)=\int_{-\infty}^{y}\varphi(x)dx are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.

2. Formulation

Consider a filtered probability space (\Omega,\mathcal{F}^{W},\{\mathcal{F}^{W}_{t}\}_{t\in[0,T]},\mathbb{P}^{W}). Furthermore, let \{W_{t}\}_{t\in[0,T]} be a standard Brownian motion (restricted to [0,T]) under the real-world measure \mathbb{P}^{W}, and let \{\mathcal{F}^{W}_{t}\}_{t\in[0,T]} be the natural filtration of \{W_{t}\}_{t\in[0,T]}. For our analysis, we assume a Black-Scholes economy. More precisely, there are two assets traded in the market: one risk-free asset B_{t} earning a constant interest rate r and one risky asset S_{t} following a geometric Brownian motion with instantaneous rate of return \mu>r and volatility \sigma>0. Their dynamics are given by

(2.1) dB_{t}=rB_{t}dt,\quad B_{0}=1,
(2.2) dS_{t}=\mu S_{t}dt+\sigma S_{t}dW_{t},\quad t\in[0,T],\quad S_{0}=s.

Consider an investor endowed with an initial wealth x_{0}, which will be used for investments. We assume that the investor splits her initial wealth between the two assets given above. We use \pi=\{\pi_{t}\}_{t\in[0,T]} to denote the fraction of wealth that the investor invests in the risky asset. The remaining money is invested in the risk-free asset. We assume that the process \{\pi_{t}\}_{t\in[0,T]} is \mathcal{F}^{W}-adapted. The investor chooses an investment strategy from the following admissible set

\mathcal{A}(x_{0}):=\Big\{\pi\;\text{is progressively measurable},\;X_{t}^{\pi}\geq 0\;\text{for all}\;t\in[0,T],\;\int_{0}^{T}\pi_{t}^{2}dt<\infty\Big\}.

Here, X_{t}^{\pi} is the investor's wealth process, which satisfies the following stochastic differential equation:

(2.3) dX_{t}^{\pi}=\left(r+\pi_{t}(\mu-r)\right)X_{t}^{\pi}dt+\sigma\pi_{t}X_{t}^{\pi}dW_{t},\quad X_{0}=x_{0}>0.

2.1. The classic optimal investment problem

The classical asset allocation problem can be written as

(2.4) \max_{\pi\in\mathcal{A}(x_{0})}\mathbb{E}^{\mathbb{P}^{W}}\Big[U(X_{T}^{\pi})\Big],

where U is a concave utility function. The optimization problem (2.4) is solely an exploitation problem, which is a typical setup of classical stochastic control optimization problems. In a financial market with complete knowledge of the model parameters, one can readily employ the classical model-based stochastic control theory (see e.g. Fleming and Soner (2006); Pham (2009)), the duality approach (see e.g. Chen and Vellekoop (2017); Kamma and Pelsser (2022)) or the so-called martingale approach (see e.g. Karatzas and Shreve (1998)) to find the solution of Problem (2.4). When implementing the optimal strategy, one needs to estimate the market parameters from historical time series of asset prices. However, in practice, estimating these parameters with acceptable accuracy is notoriously difficult, especially the mean return \mu; this is also known as the mean-blur problem, see e.g. Luenberger (1998).

2.2. Optimal investment with exploration

In the RL setting where the underlying model is not known, dynamic learning is needed and the agent employs exploration to interact with and learn from the unknown environment through trial and error. The key idea is to model exploration via a distribution over the space of controls \mathcal{C}\subseteq\mathbb{R} from which the trials are sampled. Here we assume that the action space \mathcal{C} is continuous and randomization is restricted to those distributions that have density functions. In particular, at time t\in[0,T], with the corresponding current wealth X_{t}, the agent considers a classical control \pi_{t} sampled from a policy distribution \lambda_{t}(\pi):=\lambda(\pi|t,X_{t}). Mathematically, such a policy \lambda can be generated, e.g., by introducing an additional random variable Z that is uniformly distributed on [0,1] and is independent of the filtration of the Brownian motion W, see e.g. Jia and Zhou (2022a). In this sense, the probability space is expanded to (\Omega,\mathcal{F},\{\mathcal{F}_{t}\}_{t\in[0,T]},\mathbb{P}), where \mathcal{F}:=\mathcal{F}^{W}\vee\sigma(Z) and \mathbb{P} is an extended probability measure on \mathcal{F} whose projection on \mathcal{F}^{W} is \mathbb{P}^{W}. This sampling is executed for N rounds over the same time horizon. Intuitively, by the law of large numbers, the estimated utility of such a (feedback) policy becomes accurate when N is large. This procedure, known as policy evaluation, is a fundamental element of most RL algorithms in practice. For our continuous-time setting, we follow the exploratory formulation suggested in Wang et al. (2020) and refer to, e.g., Wang et al. (2020); Wang and Zhou (2020); Jia and Zhou (2022a, b) for motivations and additional details. In particular, we consider the exploratory version of the wealth dynamics (2.3) given by

(2.5) dX_{t}^{\lambda}=\hat{A}(t,X_{t}^{\lambda};\lambda)dt+\hat{B}(t,X_{t}^{\lambda};\lambda)dW_{t},\quad X_{0}=x_{0},

where the exploration drift and exploration volatility are defined by

(2.6) \hat{A}(t,x;\lambda):=\int_{\mathcal{C}}A(\pi,x)\lambda(\pi|t,x)d\pi;\quad\hat{B}(t,x;\lambda):=\sqrt{\int_{\mathcal{C}}B^{2}(\pi,x)\lambda(\pi|t,x)d\pi},

where

A(\pi,x):=(r+\pi(\mu-r))x;\quad B(\pi,x):=\sigma\pi x.

From (2.6) we observe that

(2.7) \hat{A}(t,x;\lambda)=x\Big(r+(\mu-r)\int_{\mathcal{C}}\pi\lambda(\pi|t,x)d\pi\Big);\quad\hat{B}^{2}(t,x;\lambda)=\sigma^{2}x^{2}\int_{\mathcal{C}}\pi^{2}\lambda(\pi|t,x)d\pi.
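For a Gaussian feedback policy, the integrals in (2.7) reduce to the first two moments of the policy. The short Python sketch below makes this explicit; the wealth level, policy moments and market parameters used in the call are purely illustrative.

import numpy as np

def exploratory_coefficients(x, mean, var, mu, r, sigma):
    # Exploratory drift A_hat and volatility B_hat of (2.5)-(2.7) when the
    # feedback policy lambda(.|t,x) is Gaussian with the given mean and variance.
    # For a Gaussian, E[pi] = mean and E[pi^2] = mean**2 + var, so the integrals
    # in (2.7) are available in closed form.
    first_moment = mean
    second_moment = mean**2 + var
    A_hat = x * (r + (mu - r) * first_moment)
    B_hat = sigma * x * np.sqrt(second_moment)
    return A_hat, B_hat

# Illustrative (hypothetical) inputs.
print(exploratory_coefficients(x=1.0, mean=0.5, var=0.04, mu=0.08, r=0.02, sigma=0.2))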

As a result, the agent may want to maximize, over all admissible policies \lambda, the exploratory expected investment utility \mathbb{E}[U(X_{T}^{\lambda})]. Intuitively, the agent must account for the exploration effect in her objective. As suggested in Wang et al. (2020), for a given exploration distribution \lambda, the following Shannon differential entropy penalty at time t\in[0,T],

\mathcal{K}^{\lambda}(t,x):=-\int_{\mathcal{C}}\lambda(\pi|t,x)\ln\lambda(\pi|t,x)d\pi,

can be used to measure the exploration impact. Note that when the model is fully known, there would be no requirement for exploration and learning. In such a scenario, control distributions would collapse to Dirac measures, placing us in the domain of classical stochastic control Fleming and Soner (2006); Pham (2009).

Motivated by the above discussion, below we consider the following Shannon’s entropy-regularized exploratory optimization problem

(2.8) \max_{\lambda\in\mathcal{H}}\mathbb{E}\Big[U(X_{T}^{\lambda})-m\int_{0}^{T}\int_{\mathcal{C}}\lambda(\pi|t,X_{t}^{\lambda})\ln\lambda(\pi|t,X_{t}^{\lambda})d\pi dt\Big],

where \mathcal{H} is the set of admissible (feedback) exploration distributions (or, more precisely, densities). Mathematically, (2.8) is a relaxed stochastic control problem, which has been widely studied in the control literature, see e.g. Fleming (1976); Fleming and Nisio (1984); El Karoui et al. (1987). Note that relaxed controls have a natural interpretation: at each time t, the agent chooses not a single action (strategy) but rather a probability measure over the action space \mathcal{C} from which a specific action is randomly sampled. In more general settings of relaxed control, e.g. Fleming (1976); Fleming and Nisio (1984); El Karoui et al. (1987), the admissible set may contain open-loop distributions that are measure-valued stochastic processes. This is also in line with the notion of a mixed strategy in game theory, in which players choose probability measures over the action set, rather than single actions (pure strategies).

We note that the minus sign in front of the entropy term accounts for the fact that exploration is inherently costly in terms of resources. Also, we can see that the optimization problem (2.8) incorporates both exploration (through the entropy term) and exploitation. The rate of exploration is determined by the exogenous temperature parameter m>0, in the sense that a larger value of m indicates that more exploration is needed, and vice versa. Hence, the agent can personalize her exploration rate by selecting an appropriate exogenous temperature parameter m.

Next, we specify the set of admissible controls as follows.

Assumption 2.1.

The admissible controls \lambda\in\mathcal{H} have the following properties:

  1. (1)

    For each (t,x)\in[0,T]\times\mathbb{R}, \lambda(\cdot|t,x) is a density function on \mathcal{C}.

  2. (2)

    The mapping [0,T]\times\mathbb{R}\times\mathcal{C}\ni(t,x,\pi)\mapsto\lambda(\pi|t,x) is measurable.

  3. (3)

    For each \lambda\in\mathcal{H}, the exploratory SDE (2.5) admits a unique strong solution, denoted by X^{\lambda}, which is positive and

    \mathbb{E}\Big[U(X_{T}^{\lambda})-m\int_{0}^{T}\int_{\mathcal{C}}\lambda(\pi|t,X_{t}^{\lambda})\ln\lambda(\pi|t,X_{t}^{\lambda})d\pi dt\Big]<\infty.

2.3. HJB equation and optimal distribution control

For each m\in\mathbb{R}, we denote by v(t,x;m) the optimal value function of Problem (2.8), i.e.

v(t,x;m):=\max_{\lambda\in\mathcal{H}}J(t,x,m,\lambda):=\mathbb{E}\Big[U(X_{T}^{\lambda})-m\int_{t}^{T}\int_{\mathcal{C}}\lambda(\pi|s,X_{s}^{\lambda})\ln\lambda(\pi|s,X_{s}^{\lambda})d\pi ds\,\Big|\,X^{\lambda}_{t}=x\Big].

By standard arguments of Bellman's principle of optimality, the optimal value function v satisfies the following HJB equation

(2.9) v_{t}(t,x;m)+\sup_{\lambda}\Big\{\hat{A}(t,x;\lambda)v_{x}(t,x;m)+\frac{1}{2}\hat{B}^{2}(t,x;\lambda)v_{xx}(t,x;m)-m\int_{\mathcal{C}}\ln\lambda(\pi|t,x)\lambda(\pi|t,x)d\pi\Big\}=0,

with terminal condition v(T,x;m)=U(x). We observe that the expression in the brackets of (2.9) can be written as

\int_{\mathcal{C}}\Big((r+\pi(\mu-r))xv_{x}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}(t,x;m)-m\ln\lambda(\pi|t,x)\Big)\lambda(\pi|t,x)d\pi.

The optimal distribution \lambda^{*} can be obtained by Donsker and Varadhan's variational formula (see e.g. Donsker and Varadhan (2006))

(2.10) \lambda^{*}(\pi|t,x;m)\propto\exp\Big\{\frac{1}{m}\Big((r+\pi(\mu-r))xv_{x}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}(t,x;m)\Big)\Big\},

which is of feedback form and can be seen as a Boltzmann distribution (see e.g. Tang et al. (2022) for a similar result). Here the notation \lambda^{*}(\pi|t,x;m) is introduced to indicate that \lambda^{*} depends on the exploration parameter m. Note that we can write \lambda^{*}(\pi|t,x;m) as

\lambda^{*}(\pi|t,x;m)\propto\exp\Big(\frac{1}{2m}\sigma^{2}x^{2}v_{xx}\Big(\pi-\frac{(r-\mu)xv_{x}}{\sigma^{2}x^{2}v_{xx}}\Big)^{2}\Big),

which is Gaussian with mean \alpha and variance \beta^{2} (assuming that v_{xx}<0) defined by

(2.11) \alpha=-\frac{(\mu-r)xv_{x}}{\sigma^{2}x^{2}v_{xx}};\quad\beta^{2}=-\frac{m}{\sigma^{2}x^{2}v_{xx}}.

Substituting the optimal distribution (2.10) back into the HJB equation (2.9), we obtain the following non-linear PDE

(2.12) v_{t}(t,x;m)+rxv_{x}(t,x;m)-\frac{1}{2}\frac{(\mu-r)^{2}v_{x}^{2}(t,x;m)}{\sigma^{2}v_{xx}(t,x;m)}+\frac{m}{2}\ln\Big(-\frac{2\pi_{e}m}{\sigma^{2}x^{2}v_{xx}(t,x;m)}\Big)=0,

with terminal condition v(T,x;m)=U(x).
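For the logarithmic utility U(x)=\ln x treated in the next section, a quick way to see the structure of the solution of (2.12) is to plug in the ansatz v(t,x;m)=\ln x+g(t) with g(T)=0, so that v_{x}=1/x and v_{xx}=-1/x^{2}; this is only a sketch of the verification behind Theorem 3.1 below. The PDE (2.12) then reduces to the ODE

g'(t)+r+\frac{(\mu-r)^{2}}{2\sigma^{2}}+\frac{m}{2}\ln\Big(\frac{2\pi_{e}m}{\sigma^{2}}\Big)=0,\qquad g(T)=0,

whose solution g(t)=\Big(r+\frac{(\mu-r)^{2}}{2\sigma^{2}}+\frac{m}{2}\ln\big(\sigma^{-2}2\pi_{e}m\big)\Big)(T-t) recovers the value function (3.1).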

3. Unconstrained optimal investment problem with exploration

3.1. Unconstrained optimal policy

Below we show that the optimal solution to (2.12) can be given in explicit form for the case of logarithmic utility. Logarithmic utility is widely recognized as a preference function for rational investors, as highlighted in Rubinstein (1977). The study in Pulley (1983) supports this assertion by showing that logarithmic utility serves as an excellent approximation in the context of the Markowitz mean-variance setting and that, remarkably, the optimal portfolio under logarithmic utility is almost identical to the one derived from the mean-variance strategy. It is crucial to emphasize that while our optimal strategy with logarithmic utility maintains time consistency, the mean-variance strategy is only precommitted, meaning it is only optimal at time 0. This stands out as one of the most critical drawbacks of the mean-variance setting. The versatility of logarithmic utility extends to diverse applications, including long-term investor preferences (Gerrard et al., 2023) and optimal hedging within semimartingale market models (Merton, 1975), and it belongs to the hyperbolic absolute risk aversion (HARA) class. This class has been explored in portfolio optimization under finite-horizon economies (Cuoco, 1997), infinite-horizon consumption-portfolio problems (El Karoui and Jeanblanc-Picqué, 1998), and scenarios involving allowed terminal debt (Chen and Vellekoop, 2017).

We summarize the results for the unconstrained case, i.e. \mathcal{C}=\mathbb{R}, in the following theorem.

Theorem 3.1.

The optimal value function of the entropy-regularized exploratory optimal investment problem with logarithmic utility U(x)=\ln x is given by (recall that \pi_{e}=3.14\ldots denotes the mathematical constant pi)

(3.1) v(t,x;m)=\ln x+\Big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\Big)(T-t)+\frac{m}{2}\ln\big(\sigma^{-2}2\pi_{e}m\big)(T-t)

for (t,x)\in[0,T]\times\mathbb{R}_{+}. Moreover, the optimal feedback distribution control \lambda^{*} (which is independent of (t,x)) is Gaussian with mean \frac{\mu-r}{\sigma^{2}} and variance \frac{m}{\sigma^{2}}, i.e.

(3.2) \lambda^{*}(\pi)=\mathcal{N}\Big(\pi\,\Big|\,\frac{\mu-r}{\sigma^{2}},\frac{m}{\sigma^{2}}\Big)

and the associated optimal wealth process under \lambda^{*} is given by the following SDE

(3.3) dX_{t}^{\lambda^{*}}=X_{t}^{\lambda^{*}}\Big(r+\frac{(\mu-r)^{2}}{\sigma^{2}}\Big)dt+\sqrt{m+\frac{(\mu-r)^{2}}{\sigma^{2}}}\,X_{t}^{\lambda^{*}}dW_{t},\quad X_{0}=x_{0}.

From Theorem 3.1 we can see that the best control distribution to balance exploration and exploitation is Gaussian. This demonstrates the prominence of the Gaussian distribution in RL studies. The mean of the optimal distribution control is given by the Merton strategy

(3.4) \pi^{Merton}:=\frac{\mu-r}{\sigma^{2}},

whereas the variance of the optimal Gaussian policy is controlled by the degree of exploration m. We also observe that at any t\in[0,T], the exploration variance decreases as \sigma increases, which means that less exploration is needed in a more random market environment.
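The following Python sketch collects these quantities and simulates the optimal exploratory wealth dynamics (3.3); since (3.3) is a geometric Brownian motion, the exact exponential scheme is used. All market parameters and the exploration weight m below are purely illustrative.

import numpy as np

# Illustrative (hypothetical) market parameters and exploration weight.
mu, r, sigma, m = 0.08, 0.02, 0.2, 0.1
T, n_steps, n_paths, x0 = 1.0, 250, 10_000, 1.0

pi_merton = (mu - r) / sigma**2          # mean of the optimal policy (3.2), cf. (3.4)
policy_var = m / sigma**2                # variance of the optimal policy (3.2)

# Simulation of the exploratory wealth (3.3): a GBM with the drift and squared
# volatility below, discretized with the exact exponential scheme.
drift = r + (mu - r)**2 / sigma**2
vol = np.sqrt(m + (mu - r)**2 / sigma**2)
rng = np.random.default_rng(0)
dt = T / n_steps
X = np.full(n_paths, x0)
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    X *= np.exp((drift - 0.5 * vol**2) * dt + vol * dW)

print(pi_merton, policy_var, X.mean(), X.std())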

Remark 3.1.

The optimal value function of the regularized exploration problem can be expressed as

(3.5) v(t,x;m)=v(t,x;0)+\psi(t,x;m),

where

\psi(t,x;m):=\frac{m}{2}\ln\big(\sigma^{-2}2\pi_{e}m\big)(T-t).

Intuitively, \psi(t,x;m) measures the exploration effect. It is easy to see that \psi(t,x;m)\to 0 as m\to 0; hence, v(t,x;m)\to v^{Merton}(t,x)=v(t,x;0)=\ln x+\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)(T-t), which is the optimal value function in the absence of exploration.

The following theorem shows the solvability equivalence between the classical and exploratory EU problems, meaning that the solution to one directly provides the solution to the other, without requiring a separate resolution.

Theorem 3.2.

The following statements are equivalent

  1. (a)

    The function

    v(t,x;m)=\ln x+\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)(T-t)+\frac{m}{2}\ln\big(\sigma^{-2}2\pi_{e}m\big)(T-t)

    is the optimal value function of the exploratory problem. The optimal distribution control \lambda^{*}(\pi|t,x;m) is given by (3.2) and the exploratory wealth process by (3.3).

  2. (b)

    The function

    v^{Merton}(t,x):=v(t,x;0)=\ln x+\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)(T-t)

    is the optimal value function of the classical (without exploration) EU problem, where the optimal investment strategy is given by the Merton fraction \pi_{t}^{Merton}=\frac{\mu-r}{\sigma^{2}}.

Proof.

It is easy to see that v(t,x;m) and v(t,x;0) solve the HJB equation (2.12) with and without exploration, respectively. The admissibility of the two optimal strategies is straightforward to verify. ∎

3.2. Exploration cost and exploration effect

The exploration cost for a general RL problem is defined as the difference between the expected utility following the corresponding optimal control under the classical objective and that under the exploratory objective, net of the value of the entropy. Note that the exploration cost is well defined only if both the classical and the exploratory problems are solvable. Let v(t,x;0) be the value function of the classical EU problem. Below, we often write \lambda_{s}(\pi)=\lambda(\pi|s,X^{\lambda}_{s}) for short when there is no confusion. The exploration cost at time t=0 is defined by

(3.6) v(0,x;0)-\Big(v(0,x;m)+m\mathbb{E}\Big[\int_{0}^{T}\int_{\mathcal{C}}\lambda_{s}^{*}(\pi)\ln\lambda_{s}^{*}(\pi)d\pi ds\,\Big|\,X^{\lambda^{*}}_{0}=x\Big]\Big).

The first term is the classical value function at time t=0. The second term is the expected utility (before being regularized) of the exploratory problem at time t=0.

Proposition 3.1.

Assume that one of the statements in Theorem 3.2 holds. Then the exploration cost is given by \frac{mT}{2}.

Proof.

By Remark 3.1, the optimal value function of the regularized exploration problem can be expressed as

(3.7) v(0,x;m)=v^{Merton}(0,x)+\frac{m}{2}T\ln\big(\sigma^{-2}2\pi_{e}m\big).

On the other hand, the entropy term can be computed using the Gaussian form of \lambda^{*} as follows

\int_{0}^{T}\mathbb{E}\Big[\int_{-\infty}^{+\infty}\lambda_{s}^{*}(\pi)\ln\lambda_{s}^{*}(\pi)d\pi\,\Big|\,X^{\lambda^{*}}_{0}=x\Big]ds=T\int_{-\infty}^{+\infty}\Big(-\frac{1}{2\beta^{2}}(\pi-\alpha)^{2}-\ln\big(\sqrt{2\pi_{e}\beta^{2}}\big)\Big)\mathcal{N}(\pi|\alpha,\beta^{2})d\pi
(3.8) =-T\Big(\ln\sqrt{\frac{2m\pi_{e}}{\sigma^{2}}}+\frac{1}{2}\Big).

Substituting this into (3.6) leads to the desired conclusion.∎
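The Gaussian entropy computation above can also be checked numerically. The minimal Python sketch below compares a Monte Carlo estimate of the expected log-density under the optimal policy (3.2) with the closed-form value appearing in (3.8); the market parameters and the exploration weight are illustrative.

import numpy as np
from scipy.stats import norm

# Monte Carlo check of the per-unit-time entropy term in (3.8) for the
# Gaussian policy (3.2).  Parameter values are illustrative.
mu, r, sigma, m = 0.08, 0.02, 0.2, 0.1
alpha, beta = (mu - r) / sigma**2, np.sqrt(m) / sigma

rng = np.random.default_rng(0)
samples = rng.normal(alpha, beta, size=1_000_000)
mc_estimate = norm.logpdf(samples, loc=alpha, scale=beta).mean()   # estimate of E[ln lambda*]
closed_form = -(np.log(np.sqrt(2.0 * np.pi * m) / sigma) + 0.5)    # value appearing in (3.8)
print(mc_estimate, closed_form)   # the two numbers should agree closely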

Observe that the exploration cost converges to zero as m\to 0. Since the exploration weight has been taken as an exogenous parameter m which reflects the level of exploration desired by the learning agent, it is intuitive to expect that the smaller m is, the more emphasis is placed on exploitation. Moreover, when m is sufficiently close to zero, the exploratory formulation gets close to the problem without exploration. The following theorem confirms the desirable result that the entropy-regularized EU problem converges to its classical EU counterpart when the exploration weight m goes to zero.

Theorem 3.3.

Assume that one of the statements in Theorem 3.2 holds. Then, for each x\in\mathbb{R}_{+}, it holds that

(3.9) \lambda^{*}(\cdot|t,x;m)\to\delta_{\pi^{Merton}}(\cdot)\quad\mbox{as}\quad m\to 0,

where \delta_{\pi^{Merton}} is the Dirac measure at the Merton strategy. Furthermore, for each (t,x)\in[0,T]\times\mathbb{R}_{+},

\lim_{m\to 0}|v(t,x;m)-v^{Merton}(t,x)|=0.
Proof.

The convergence of the optimal distribution control \lambda^{*} to the Dirac measure \delta_{\pi^{Merton}} is straightforward because it is a Gaussian density with mean identical to \pi^{Merton}=\frac{\mu-r}{\sigma^{2}} and variance \frac{m}{\sigma^{2}}\to 0 as m\to 0. The convergence of the value function follows directly from the relation (3.5) in Remark 3.1. ∎

3.3. Policy improvement

The following policy improvement theorem is crucial for interpretable RL algorithms as it ensures that the iterated value functions are non-decreasing and converge to the optimal value function. Below, for a given policy \lambda (not necessarily Gaussian), we denote the corresponding value function by

(3.10) v^{\lambda}(t,x;m):=\mathbb{E}\Big[U(X_{T}^{\lambda})-m\int_{t}^{T}\int_{-\infty}^{+\infty}\lambda_{s}(\pi)\ln\lambda_{s}(\pi)d\pi ds\,\Big|\,X^{\lambda}_{t}=x\Big].
Theorem 3.4.

For a given admissible policy \lambda (not necessarily Gaussian), assume that the corresponding value function satisfies v^{\lambda}(\cdot,\cdot;m)\in C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and v^{\lambda}_{xx}(t,x;m)<0 for any (t,x)\in[0,T)\times\mathbb{R}_{+}. Suppose furthermore that the feedback policy \widetilde{\lambda} defined by

(3.11) \widetilde{\lambda}(\pi|t,x;m)=\mathcal{N}\Big(\pi\,\Big|-\frac{(\mu-r)xv_{x}^{\lambda}}{\sigma^{2}x^{2}v_{xx}^{\lambda}},-\frac{m}{\sigma^{2}x^{2}v_{xx}^{\lambda}}\Big)

is admissible. Let v^{\widetilde{\lambda}}(t,x;m) be the value function corresponding to this new (Gaussian) policy \widetilde{\lambda}. Then,

(3.12) v^{\widetilde{\lambda}}(t,x;m)\geq v^{\lambda}(t,x;m),\quad(t,x)\in[0,T)\times\mathbb{R}_{+}.

Note that Theorem 3.4 holds for a general utility function U. One important implication of Theorem 3.4 is that for any given (not necessarily Gaussian) policy \lambda, there are always policies in the Gaussian family that improve the value function of \lambda (i.e. provide higher expected utility values). Therefore, it is natural to focus on Gaussian policies when choosing an initial exploration distribution. Note that the optimal Gaussian policy given by (3.2) also suggests that a candidate initial feedback policy may take the form \lambda^{0}(\pi|t,x,m)=\mathcal{N}\big(\pi\big|a,b^{2}\big) for some real numbers a,b. As shown below, such a choice leads to fast convergence of both value functions and policies in a finite number of iterations.

Theorem 3.5.

Let \lambda^{0}(\pi|t,x,m)=\mathcal{N}\big(\pi\big|a,b^{2}\big) with some real numbers a and b>0, and consider the logarithmic utility function U(x)=\ln x. Define the sequence of feedback policies \{\lambda^{n}(\cdot|t,x;m)\}_{n\geq 1} updated by the policy improvement scheme (3.11), i.e.,

(3.13) \lambda^{n}(\pi|t,x;m)=\mathcal{N}\Big(\pi\,\Big|-\frac{(\mu-r)xv_{x}^{\lambda^{n-1}}(t,x;m)}{\sigma^{2}x^{2}v^{\lambda^{n-1}}_{xx}(t,x;m)},-\frac{m}{\sigma^{2}x^{2}v^{\lambda^{n-1}}_{xx}(t,x;m)}\Big),\quad n=1,2,\ldots,

where v^{\lambda^{n-1}} is the value function corresponding to the policy \lambda^{n-1} defined by

(3.14) v^{\lambda^{n-1}}(t,x;m):=\mathbb{E}\Big[U(X_{T}^{\lambda^{n-1}})-m\int_{t}^{T}\int_{-\infty}^{+\infty}\lambda^{n-1}(\pi|s,X^{\lambda^{n-1}}_{s})\ln\lambda^{n-1}(\pi|s,X^{\lambda^{n-1}}_{s})d\pi ds\,\Big|\,X^{\lambda^{n-1}}_{t}=x\Big].

Then,

(3.15) \lim_{n\to\infty}\lambda^{n}(\cdot|t,x;m)=\lambda^{*}(\cdot|t,x;m)\quad\mbox{weakly}

and

(3.16) \lim_{n\to\infty}v^{\lambda^{n}}(t,x;m)=v^{\lambda^{*}}(t,x;m)=v(t,x;m),

which is the optimal value function given by (3.1).

As shown in Appendix A.3, starting with a Gaussian policy, the optimal solution of the exploratory EU problem with logarithmic utility is attained rapidly (after one update).
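To illustrate why a single update suffices, the sketch below applies the improvement map (3.11) to any value function of the form v(t,x)=\ln x+f(t) (so that v_{x}=1/x and v_{xx}=-1/x^{2}); it is a minimal Python illustration with illustrative inputs, not part of the learning algorithm itself.

def improved_gaussian_policy(x, mu, r, sigma, m):
    # One step of the policy-improvement map (3.11) assuming the current value
    # function has the form v(t, x) = ln x + f(t), so v_x = 1/x, v_xx = -1/x^2.
    v_x = 1.0 / x
    v_xx = -1.0 / x**2
    mean = -(mu - r) * x * v_x / (sigma**2 * x**2 * v_xx)   # = (mu - r) / sigma^2
    var = -m / (sigma**2 * x**2 * v_xx)                     # = m / sigma^2
    return mean, var

# Independently of x (and of f), one update already returns the parameters of
# the optimal policy (3.2), in line with Theorem 3.5.
print(improved_gaussian_policy(x=2.0, mu=0.08, r=0.02, sigma=0.2, m=0.1))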

Remark 3.2.

It would be valuable to extend Theorem 3.5 to contexts with general utility functions. However, obtaining closed-form solutions to the highly nonlinear average PDEs (see Appendix A.3) is not feasible. Note that even the existence of a solution to the next-step average PDE is not obvious and would require a thorough analysis, including asymptotic expansions.

3.4. Policy evaluation and algorithm

Following Jia and Zhou (2022a, b), we consider in this section a policy evaluation procedure that takes the martingale property into account. Recall, with a slight abuse of notation, the value function

(3.17) J(t,x;\lambda):=\mathbb{E}\Big[U(X_{T}^{\lambda})-m\int_{t}^{T}\int_{-\infty}^{+\infty}\lambda(\pi|s,X^{\lambda}_{s})\ln\lambda(\pi|s,X^{\lambda}_{s})d\pi ds\,\Big|\,X_{t}^{\lambda}=x\Big],

where \lambda is a given learning (feedback) policy. By the Feynman-Kac theorem, it can be seen that J(t,x;\lambda) solves (e.g. in the viscosity sense) the following average PDE

(3.18) \int_{-\infty}^{+\infty}\Big(\mathcal{L}J(t,x;\lambda)-m\ln\lambda(\pi|t,x)\Big)\lambda(\pi|t,x)d\pi=0,

where \mathcal{L} is the infinitesimal generator associated with X_{t}^{\lambda}, i.e.

(3.19) \mathcal{L}J(t,x;\lambda):=J_{t}(t,x;\lambda)+\hat{A}(t,x;\lambda)J_{x}(t,x;\lambda)+\frac{1}{2}\hat{B}^{2}(t,x;\lambda)J_{xx}(t,x;\lambda).

As in Jia and Zhou (2022a, b), the value function of a policy \lambda can be characterized by the following martingale property.

Theorem 3.6.

A function J(\cdot,\cdot;\lambda) is the value function associated with the policy \lambda if and only if it satisfies the terminal condition J(T,x;\lambda)=U(x), and

(3.20) Y_{s}:=J(s,X^{\lambda}_{s};\lambda)-m\int_{t}^{s}\int_{-\infty}^{+\infty}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})d\pi du,\quad s\in[t,T]

is a martingale on [t,T].

It follows from Itô's lemma and Theorem 3.6 that \mathbb{E}[\int_{0}^{T}h_{s}dY_{s}]=0 for all adapted square-integrable processes h satisfying \mathbb{E}[\int_{0}^{T}h^{2}_{s}d\langle Y\rangle_{s}]<\infty. Equivalently,

(3.21) \mathbb{E}\Big[\int_{0}^{T}h_{t}\Big(dJ(t,X^{\lambda}_{t};\lambda)-m\int_{-\infty}^{+\infty}\lambda(\pi|t,X^{\lambda}_{t})\ln\lambda(\pi|t,X^{\lambda}_{t})d\pi\Big)dt\Big]=0.

Such a process h is called a test function. Let J^{\theta}(t,X^{\lambda}_{t};\lambda) be a parameterized family used to approximate J, where \theta\in\Theta\subset\mathbb{R}^{n}, n\geq 1. Here J^{\theta} satisfies Assumption 3.1 below, and our goal is to find the best parameter \theta^{*} such that the martingale property in Theorem 3.6 holds. In this sense, the process

Y_{t}^{\theta^{*}}=J^{\theta^{*}}(t,X^{\lambda}_{t};\lambda)-m\int_{0}^{t}\int_{-\infty}^{+\infty}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})d\pi du

should be a martingale on [0,T] with terminal value

Y_{T}^{\theta^{*}}=U(X^{\lambda}_{T})-m\int_{0}^{T}\int_{-\infty}^{+\infty}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})d\pi du.

As discussed in (Jia and Zhou, 2022a, page 18), the martingale condition on Y^{\theta^{*}} is further equivalent to requiring that the process at any given time t<T is the expectation of the terminal value conditional on all the information available at that time. A fundamental property of conditional expectation then yields that Y^{\theta^{*}}_{t} minimizes the L^{2}-error between Y^{\theta}_{T} and any \mathcal{F}_{t}-measurable random variable. Therefore, our objective is to minimize the martingale cost function defined by \mathbb{E}[\int_{0}^{T}(Y_{T}^{\theta}-Y_{t}^{\theta})^{2}dt]. In other words, for the policy evaluation (PE) procedure, we consider the following (offline) minimization problem

(3.22) \min_{\theta\in\Theta}\mathbb{E}\Big[\int_{0}^{T}\Big(U(X^{\lambda}_{T})-J^{\theta}(t,X^{\lambda}_{t};\lambda)-m\int_{t}^{T}\int_{-\infty}^{+\infty}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})d\pi du\Big)^{2}dt\Big]

and check the martingale orthogonality property

(3.23) \mathbb{E}\Big[\int_{0}^{T}h_{t}\Big(dJ^{\theta}(t,X^{\lambda}_{t};\lambda)-m\int_{-\infty}^{+\infty}\lambda(\pi|t,X^{\lambda}_{t})\ln\lambda(\pi|t,X^{\lambda}_{t})d\pi\Big)dt\Big]=0.
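To make the PE step concrete, the following Python sketch implements a discretized version of the martingale loss (3.22) for the log-utility problem, using the simple parametric family J^{\theta}(t,x)=\ln x+\theta(T-t) and a fixed Gaussian learning policy \mathcal{N}(a,b^{2}). It is a minimal illustration under these assumptions: all parameter values are illustrative, and simulated trajectories stand in for observed data.

import numpy as np

# Illustrative market parameters, exploration weight, current policy N(a, b^2).
mu, r, sigma, m = 0.08, 0.02, 0.2, 0.1
a, b = 1.0, 0.5
T, n_steps, n_paths, x0 = 1.0, 50, 2_000, 1.0
dt = T / n_steps

# Under N(a, b^2) the exploratory wealth (2.5)-(2.7) is a geometric Brownian
# motion with drift r + (mu - r) a and squared volatility sigma^2 (a^2 + b^2).
drift = r + (mu - r) * a
vol = sigma * np.sqrt(a**2 + b**2)
rng = np.random.default_rng(0)
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
logX = np.log(x0) + np.cumsum((drift - 0.5 * vol**2) * dt + vol * dW, axis=1)
logX = np.concatenate([np.full((n_paths, 1), np.log(x0)), logX], axis=1)

# Differential entropy of N(a, b^2); the running entropy integral in (3.22)
# then equals m * entropy * (T - t) along every path.
entropy = 0.5 * np.log(2.0 * np.pi * np.e * b**2)
t_grid = np.linspace(0.0, T, n_steps + 1)
y = (logX[:, -1:] + m * entropy * (T - t_grid)) - logX   # target minus ln X_t

# The discretized martingale loss is quadratic in theta, so least squares
# over all (path, time) pairs gives the minimizer directly.
w = T - t_grid
theta_star = np.sum(w * y) / (n_paths * np.sum(w**2))
# theta_star should be close to r + (mu - r)*a - 0.5*sigma^2*(a^2 + b^2) + m*entropy.
print(theta_star)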

The orthogonality condition (3.23) can be assured by solving the following minimization problem, which is a quadratic form of (3.23). Our procedure is inspired by (Jia and Zhou, 2022a, page 22): we optimize with respect to the parameter \theta, which is a vector in general, and the parameterization has to satisfy the martingale orthogonality condition (3.23) for suitably chosen test functions h. For numerical approximation methods involving a parametric family \{J^{\theta},\theta\in\mathbb{R}^{n}\}, in principle we need at least n equations as martingale orthogonality conditions in order to fully determine \theta; consequently, h may be vector-valued, making (3.23) a vector-valued equation or, equivalently, a system of equations. The minimization problem reads

\min_{\theta\in\Theta}\mathbb{E}\Big[\int_{0}^{T}h_{t}\Big(dJ^{\theta}(t,X^{\lambda}_{t};\lambda)-m\int_{-\infty}^{+\infty}\lambda(\pi|t,X^{\lambda}_{t})\ln\lambda(\pi|t,X^{\lambda}_{t})d\pi\Big)dt\Big]^{\intercal}\times\Sigma\times
(3.24) \mathbb{E}\Big[\int_{0}^{T}h_{t}\Big(dJ^{\theta}(t,X^{\lambda}_{t};\lambda)-m\int_{-\infty}^{+\infty}\lambda(\pi|t,X^{\lambda}_{t})\ln\lambda(\pi|t,X^{\lambda}_{t})d\pi\Big)dt\Big],

where \Sigma is a positive definite matrix of suitable size (e.g. \Sigma=I or \Sigma=\mathbb{E}[\int_{0}^{T}h_{t}^{\intercal}h_{t}dt]) for any test process h. A common choice of test process is h_{t}=\frac{\partial}{\partial\theta}J^{\theta}(t,X^{\lambda}_{t};\lambda). The following assumption is needed in such an approximation procedure, see e.g. Jia and Zhou (2022b). (There is a rich theory on the existence and regularity of solutions of general parabolic PDEs, see e.g. Friedman (2008). However, the average PDE (3.18) appears to be a new type of PDE for which further studies on well-posedness (existence and uniqueness) and regularity of solutions are needed; see Tang et al. (2022) for related discussions on similar exploratory elliptic PDEs.)

Assumption 3.1.

For all \theta\in\Theta\subset\mathbb{R}^{n}, J^{\theta} belongs to C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and satisfies a polynomial growth condition in x. Moreover, J^{\theta} is smooth in \theta, and its first two derivatives in \theta belong to C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and satisfy a polynomial growth condition in x.

Suppose that a PE step can be performed to obtain an estimate of the value function for a given admissible policy \lambda. If the policy \lambda^{\phi} is parameterized by \phi\in\Phi\subset\mathbb{R}^{d}, d\geq 1, it is possible to learn the corresponding value function by following a policy gradient (PG) step. Recall first that the value function J(t,x;\lambda^{\phi}) satisfies the average PDE

(3.25) \int_{-\infty}^{+\infty}\Big(\mathcal{L}J(t,x;\lambda^{\phi})-m\ln\lambda^{\phi}(\pi|t,x)\Big)\lambda^{\phi}(\pi|t,x)d\pi=0.

By differentiating (3.25) w.r.t. \phi we obtain the following average PDE for the policy gradient g(t,x;\phi):=\frac{\partial}{\partial\phi}J(t,x;\lambda^{\phi})

(3.26) \int_{-\infty}^{+\infty}\Big(\psi(t,x,\pi;\phi)+\mathcal{L}g(t,x;\phi)\Big)\lambda^{\phi}(\pi|t,x)d\pi=0,

with terminal condition g(T,x;\phi)=\frac{\partial}{\partial\phi}J(T,x;\lambda^{\phi})=0, where

(3.27) \psi(t,x,\pi;\phi):=\Big(\mathcal{L}J(t,x;\lambda^{\phi})-m\ln\lambda^{\phi}(\pi|t,x)\Big)\frac{\frac{\partial}{\partial\phi}\lambda^{\phi}(\pi|t,x)}{\lambda^{\phi}(\pi|t,x)}-m\frac{\partial}{\partial\phi}\ln\lambda^{\phi}(\pi|t,x).

By the Feynman-Kac formula, g(t,x;\phi) admits the following representation

(3.28) g(t,x;\phi)=\mathbb{E}\Big[\int_{t}^{T}\int_{-\infty}^{+\infty}\psi(s,X^{\lambda^{\phi}}_{s},\pi;\phi)\lambda^{\phi}(\pi|s,X^{\lambda^{\phi}}_{s})d\pi ds\,\Big|\,X^{\lambda^{\phi}}_{t}=x\Big].

Since \psi can be neither observed nor computed without knowing the environment, it is important to replace \psi(s,X^{\lambda^{\phi}}_{s},\pi;\phi)ds by a suitable approximation. To this end, using Itô's formula, the term \mathcal{L}J(s,X^{\lambda^{\phi}}_{s};\lambda^{\phi})ds can be replaced by dJ(s,X^{\lambda^{\phi}}_{s};\lambda^{\phi})-\frac{\partial}{\partial x}J(s,X^{\lambda^{\phi}}_{s};\lambda^{\phi})\hat{B}(s,X^{\lambda^{\phi}}_{s};\lambda^{\phi})dW_{s}. Taking into account the fact that the dW_{s}-integral of \frac{\partial}{\partial x}J(s,X^{\lambda^{\phi}}_{s};\lambda^{\phi})\hat{B}(s,X^{\lambda^{\phi}}_{s};\lambda^{\phi}) is a martingale, we obtain the following.

Theorem 3.7.

Let \lambda^{\phi} be an admissible policy. The policy gradient g(t,x;\phi)=\frac{\partial}{\partial\phi}J(t,x;\lambda^{\phi}) admits the following representation

g(t,x;\phi)=\mathbb{E}\Big[\int_{t}^{T}\int_{-\infty}^{+\infty}\Big(\frac{\partial}{\partial\phi}\ln\lambda^{\phi}(\pi|s,X^{\lambda^{\phi}}_{s})\big(dJ(s,X^{\lambda^{\phi}}_{s};\lambda^{\phi})-m\ln\lambda^{\phi}(\pi|s,X^{\lambda^{\phi}}_{s})\big)
(3.29) -m\frac{\partial}{\partial\phi}\ln\lambda^{\phi}(\pi|s,X^{\lambda^{\phi}}_{s})\Big)\lambda^{\phi}(\pi|s,X^{\lambda^{\phi}}_{s})d\pi ds\,\Big|\,X^{\lambda^{\phi}}_{t}=x\Big].
Proof.

The proof is similar to that of Theorem 5 in Jia and Zhou (2022b), and hence is omitted. ∎

Note that for an arbitrary policy, the gradient of the corresponding value function given by the expectation (3.29) is not zero in general. As mentioned in Jia and Zhou (2022a, b), the terms inside the expectation above are only computable when the action trajectories and the corresponding state trajectories on [t,T] are available. These samples are also needed to estimate the value function J obtained in the previous PE step.
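The representation (3.29) lends itself to a sample-based estimator once a time discretization is chosen. The Python sketch below is one possible Monte Carlo discretization under simplifying assumptions: the policy is parameterized as \mathcal{N}(\phi_{1},e^{2\phi_{2}}), the critic J is taken from the parametric family J^{\theta}(t,x)=\ln x+\theta(T-t) of the PE step, and a simulated Black-Scholes environment with illustrative parameters stands in for observed trajectories.

import numpy as np

# Illustrative market parameters, exploration weight and current parameters.
mu, r, sigma, m = 0.08, 0.02, 0.2, 0.1
T, n_steps, n_paths, x0 = 1.0, 50, 2_000, 1.0
dt = T / n_steps
theta = 0.05                           # current critic parameter (from the PE step)
phi = np.array([0.8, np.log(0.4)])     # current policy parameters (mean, log-std)

rng = np.random.default_rng(0)
grad = np.zeros(2)
for _ in range(n_paths):
    x = x0
    for k in range(n_steps):
        t = k * dt
        a = rng.normal(phi[0], np.exp(phi[1]))   # sample an action from lambda^phi
        z = (a - phi[0]) / np.exp(phi[1])
        log_lam = -0.5 * np.log(2.0 * np.pi) - phi[1] - 0.5 * z**2
        dlog = np.array([z / np.exp(phi[1]), z**2 - 1.0])   # gradient of log-density in phi
        # one step of the wealth process under the sampled action
        # (exact GBM step, which keeps the wealth positive)
        dW = rng.normal(0.0, np.sqrt(dt))
        x_new = x * np.exp((r + a * (mu - r) - 0.5 * (sigma * a)**2) * dt + sigma * a * dW)
        dJ = (np.log(x_new) + theta * (T - t - dt)) - (np.log(x) + theta * (T - t))
        # discretized integrand of (3.29)
        grad += dlog * (dJ - m * log_lam * dt) - m * dlog * dt
        x = x_new
grad /= n_paths
print(grad)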

4. Constrained optimal investment problem with exploration

In this section, we extend the results obtained in Section 3 to settings with a convex portfolio constraint, see e.g. Cuoco (1997). In particular, we assume that the agent does not consume over her investment period and, in addition, due to regulatory reasons, her risky investment ratio \pi is restricted to a given interval [a,b], where a<b are real numbers. Observe that the well-known short-selling constraint is included by taking a=0, b>0. If, in addition, b=1, then both short-selling and money borrowing are prohibited. The case of a buying constraint can also be covered by setting b=0. Clearly, we are back to the unconstrained setting when sending a\to-\infty and b\to+\infty. In such a constrained framework, the set of admissible investment strategies is now defined by

\mathcal{A}_{[a,b]}(x_{0}):=\Big\{\pi\;\Big|\;a\leq\pi\leq b,\;\pi\;\text{is progressively measurable},\;X_{t}^{\pi}\geq 0\;\text{for all}\;t\geq 0,\;\int_{0}^{T}\pi_{t}^{2}dt<\infty\Big\},

where X_{t}^{\pi} is the non-exploration wealth process starting from x_{0} under the strategy \pi. The objective is to maximize the terminal expected utility

(4.1) \max_{\pi\in\mathcal{A}_{[a,b]}(x_{0})}\mathbb{E}\Big[U(X_{T}^{\pi})\Big],

where U is a utility function.

As in the unconstrained case, we look at the exploratory version of the wealth dynamics given by (2.5), with exploration drift \hat{A}(t,x;\lambda) and exploration volatility \hat{B}(t,x;\lambda) defined by (2.6). Intuitively, the corresponding exploration setting is slightly adjusted to take the constraint \pi_{t}\in[a,b] into account. In particular, let \mathcal{H}_{[a,b]} be the set of admissible exploration distributions \lambda on \mathcal{C}=[a,b] that satisfy the following properties:

  1. (1)

    For each (t,x)\in[0,T]\times\mathbb{R}, \lambda(\cdot|t,x) is a density function on [a,b].

  2. (2)

    The mapping [0,T]\times\mathbb{R}\times[a,b]\ni(t,x,\pi)\mapsto\lambda(\pi|t,x) is measurable.

  3. (3)

    For each \lambda\in\mathcal{H}_{[a,b]}, the exploratory SDE (2.5) admits a unique strong solution, denoted by X^{\lambda,[a,b]}, which is positive and \mathbb{E}\Big[U(X_{T}^{\lambda,[a,b]})-m\int_{0}^{T}\int_{a}^{b}\lambda(\pi|t,X^{\lambda,[a,b]}_{t})\ln\lambda(\pi|t,X^{\lambda,[a,b]}_{t})d\pi dt\Big]<\infty.

The exploratory optimization problem is now stated as

(4.2) \max_{\lambda\in\mathcal{H}_{[a,b]}}\mathbb{E}\Big[U(X_{T}^{\lambda,[a,b]})-m\int_{0}^{T}\int_{a}^{b}\lambda(\pi|t,X^{\lambda,[a,b]}_{t})\ln\lambda(\pi|t,X^{\lambda,[a,b]}_{t})d\pi dt\Big].

As before, the optimal value function satisfies the following HJB equation

v^{[a,b]}_{t}(t,x;m)+\sup_{\lambda\in\mathcal{H}_{[a,b]}}\Big\{\hat{A}(t,x;\lambda)v^{[a,b]}_{x}(t,x;m)+\frac{1}{2}\hat{B}^{2}(t,x;\lambda)v^{[a,b]}_{xx}(t,x;m)
(4.3) -m\int_{a}^{b}\lambda(\pi|t,x)\ln\lambda(\pi|t,x)d\pi\Big\}=0,

with terminal condition v^{[a,b]}(T,x;m)=U(x). Using again the standard dynamic programming argument, we observe that under the portfolio constraint \pi\in[a,b], the optimal feedback policy now follows a truncated Gaussian distribution.

Lemma 4.1.

In the exploratory constrained EU setting, if v^{[a,b]}_{xx}<0 then the optimal distribution \lambda^{*,[a,b]} is a Gaussian distribution with mean \alpha and variance \beta^{2}, truncated on the interval [a,b], where

(4.4) \alpha=-\frac{(\mu-r)xv_{x}}{\sigma^{2}x^{2}v^{[a,b]}_{xx}};\quad\beta^{2}=-\frac{m}{\sigma^{2}x^{2}v^{[a,b]}_{xx}}.

The density of the optimal policy \lambda^{*,[a,b]} is given by

(4.5) \lambda^{*,[a,b]}(\pi|t,x;m):=\frac{1}{\beta}\frac{\varphi\big(\frac{\pi-\alpha}{\beta}\big)}{\Phi\big(\frac{b-\alpha}{\beta}\big)-\Phi\big(\frac{a-\alpha}{\beta}\big)},

where \varphi and \Phi are the PDF and CDF of the standard normal distribution, respectively.
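Sampling from the truncated Gaussian policy (4.5) is standard; a minimal Python sketch, with purely illustrative values for the mean \alpha and standard deviation \beta of (4.4), is given below. Note that scipy's truncnorm takes the truncation bounds in standardized units.

import numpy as np
from scipy.stats import truncnorm

# Sampling from the truncated Gaussian policy (4.5) on [a, b], given the mean
# alpha and standard deviation beta of (4.4).  All numbers are illustrative.
a, b = 0.0, 1.0           # e.g. no short selling and no money borrowing
alpha, beta = 1.5, 0.8    # illustrative values of the untruncated mean / std
policy = truncnorm((a - alpha) / beta, (b - alpha) / beta, loc=alpha, scale=beta)

samples = policy.rvs(size=100_000, random_state=np.random.default_rng(0))
print(policy.mean(), policy.var(), policy.entropy())  # moments and entropy, cf. Kotz et al. (2004)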

Note that the first two moments and the entropy of a truncated Gaussian distribution can be computed explicitly, see e.g. Kotz et al. (2004). Substituting (4.5) back into the HJB equation (4.3), we obtain the following highly non-linear PDE

v^{[a,b]}_{t}(t,x;m)+rxv^{[a,b]}_{x}(t,x;m)-\frac{1}{2}\frac{(\mu-r)^{2}(v^{[a,b]}_{x})^{2}(t,x;m)}{\sigma^{2}v^{[a,b]}_{xx}(t,x;m)}
(4.6) +\frac{m}{2}\ln\Big(-\frac{2\pi_{e}m}{\sigma^{2}x^{2}v^{[a,b]}_{xx}(t,x;m)}\Big)+m\ln Z(m)=0,

with terminal condition v^{[a,b]}(T,x;m)=U(x), where, abusing notation,

(4.7) Z(m):=\Phi(B(m))-\Phi(A(m)),

with

(4.8) A(m):=\Big(a+\frac{(\mu-r)}{\sigma^{2}}\frac{v^{[a,b]}_{x}}{xv^{[a,b]}_{xx}}\Big)\sqrt{\frac{-\sigma^{2}x^{2}v^{[a,b]}_{xx}}{m}};\quad B(m):=\Big(b+\frac{(\mu-r)}{\sigma^{2}}\frac{v^{[a,b]}_{x}}{xv^{[a,b]}_{xx}}\Big)\sqrt{\frac{-\sigma^{2}x^{2}v^{[a,b]}_{xx}}{m}}.

4.1. Optimal policy with portfolio constraints

Below we show that the optimal solution to (4.6) can be given in explicit form for the case of logarithmic utility. We summarize the results in the following theorem.

Theorem 4.1 (Exploratory optimal investment under portfolio constraint).

For logarithmic utility U(x)=\ln x, the optimal value function of the entropy-regularized exploratory constrained optimal investment problem is given by

(4.9) v^{[a,b]}(t,x;m)=\ln x+\Big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\Big)(T-t)+\frac{m}{2}\ln\big(\sigma^{-2}2\pi_{e}m\big)(T-t)+m\ln Z_{a,b}(m)(T-t)

for (t,x)\in[0,T]\times\mathbb{R}_{+}, where

(4.10) Z_{a,b}(m):=\Phi\big((b-\pi^{Merton})\sigma m^{-1/2}\big)-\Phi\big((a-\pi^{Merton})\sigma m^{-1/2}\big).

Moreover, the optimal feedback distribution control \lambda^{*,[a,b]} is a Gaussian distribution with mean \pi^{Merton}=\frac{(\mu-r)}{\sigma^{2}} and variance \frac{m}{\sigma^{2}}, truncated on the interval [a,b], i.e.

(4.11) λ,[a,b](π)=𝒩(π|(μr)σ2,mσ2)|[a,b]\lambda^{*,[a,b]}(\pi)=\mathcal{N}\left(\pi\bigg{|}\frac{(\mu-r)}{\sigma^{2}},\frac{m}{\sigma^{2}}\right)\bigg{|}_{[a,b]}

and the associated optimal wealth under λ,[a,b]\lambda^{*,[a,b]} is given by the following SDE

(4.12) dXtλ,[a,b]=Xtλ,[a,b](r+πt,[a,b](μr))dt+σqt,[a,b]Xtλ,[a,b]dWt,X0=x0,\displaystyle dX_{t}^{\lambda^{*,[a,b]}}=X_{t}^{\lambda^{*,[a,b]}}\bigg{(}r+\pi_{t}^{*,[a,b]}(\mu-r)\bigg{)}dt+\sigma{q_{t}^{*,[a,b]}}X_{t}^{\lambda^{*,[a,b]}}dW_{t}\,,\quad X_{0}=x_{0}\,,

where

(4.13) (qt,[a,b])2=(πt,[a,b])2+mσ2(1+A(m)φ(A(m))B(m)φ(B(m))Za,b(m)(φ(A(m))φ(B(m))Za,b(m))2)\displaystyle(q_{t}^{*,[a,b]})^{2}=(\pi_{t}^{*,[a,b]})^{2}+\frac{m}{\sigma^{2}}\bigg{(}1+\frac{A(m)\varphi(A(m))-B(m)\varphi(B(m))}{Z_{a,b}(m)}-\bigg{(}\frac{\varphi(A(m))-\varphi(B(m))}{Z_{a,b}(m)}\bigg{)}^{2}\bigg{)}

with, by abusing notations, A(m):=(aπMerton)σm1/2A(m):=(a-\pi^{Merton})\sigma m^{-1/2}, B(m):=(bπMerton)σm1/2B(m):=(b-\pi^{Merton})\sigma m^{-1/2} and

(4.14) πt,[a,b]=πMerton+φ((aπMerton)σm1/2)φ((bπMerton)σm1/2)σm1/2Za,b(m).\pi^{*,[a,b]}_{t}=\pi^{Merton}+\frac{\varphi\bigg{(}(a-\pi^{Merton})\sigma m^{-1/2}\bigg{)}-\varphi\bigg{(}(b-\pi^{Merton})\sigma m^{-1/2}\bigg{)}}{\sigma m^{-1/2}Z_{a,b}(m)}.
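To make the closed-form expressions of Theorem 4.1 concrete, the following Python sketch evaluates the truncated-Gaussian policy moments (4.13)–(4.14) and cross-checks them against scipy.stats.truncnorm. The market parameters and bounds used here are illustrative assumptions, not the calibration of the paper's experiments.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Illustrative (assumed) market parameters and bounds
mu, r, sigma, m = 0.08, 0.03, 0.30, 0.1
a, b = 0.0, 1.0                                  # no short-selling, no borrowing

pi_merton = (mu - r) / sigma**2                  # unconstrained Merton ratio
scale = np.sqrt(m) / sigma                       # exploratory standard deviation sqrt(m)/sigma

# Standardised truncation bounds, i.e. A(m) and B(m) of Theorem 4.1
A = (a - pi_merton) * sigma / np.sqrt(m)
B = (b - pi_merton) * sigma / np.sqrt(m)
Z = norm.cdf(B) - norm.cdf(A)

# Closed-form mean (4.14) and second moment (4.13) of the truncated policy
pi_star = pi_merton + (norm.pdf(A) - norm.pdf(B)) / (sigma / np.sqrt(m) * Z)
q2_star = pi_star**2 + m / sigma**2 * (
    1.0 + (A * norm.pdf(A) - B * norm.pdf(B)) / Z
    - ((norm.pdf(A) - norm.pdf(B)) / Z) ** 2
)

# Cross-check against scipy's truncated normal
policy = truncnorm(A, B, loc=pi_merton, scale=scale)
print(pi_star, policy.mean())                      # should agree
print(q2_star, policy.var() + policy.mean() ** 2)  # should agree
```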
Remark 4.1.

At any t\in[0,T], the exploration variance decreases as \sigma increases, which means that exploration is less necessary in a more volatile market environment. However, in contrast to the unconstrained case, both the mean and the variance of the optimal distribution \lambda^{*,[a,b]} are controlled by the degree of exploration m.

From Theorem 4.1 we can see that in such a setting with portfolio constraints, the best control distribution to balance exploration and exploitation is a truncated Gaussian, demonstrating once again the prevalence of the Gaussian distribution in RL studies, even in constrained optimization settings.

Lemma 4.2.

The optimal value function of the entropy-regularized exploratory constrained problem can be expressed as

(4.15) v[a,b](t,x;m)=\displaystyle v^{[a,b]}(t,x;m)= v[a,b](t,x;0)+ψ[a,b](t,x;m),\displaystyle v^{[a,b]}(t,x;0)+\psi^{[a,b]}(t,x;m),

where

(4.16) v[a,b](t,x;0):=\displaystyle v^{[a,b]}(t,x;0):= lnx+(r+(μr)π0,[a,b]12(π0,[a,b])2σ2)(Tt),\displaystyle\ln x+\bigg{(}r+(\mu-r)\pi^{0,[a,b]}-\frac{1}{2}(\pi^{0,[a,b]})^{2}\sigma^{2}\bigg{)}(T-t),

which is the constrained optimal value function without exploration using the corresponding constrained optimal investment strategy

(4.17) π0,[a,b]:=πMerton𝟏πMerton[a,b]+a𝟏πMerton<a+b𝟏πMerton>b,\pi^{0,[a,b]}:=\pi^{Merton}{\bf 1}_{\pi^{Merton}\in[a,b]}+a{\bf 1}_{\pi^{Merton}<a}+b{\bf 1}_{\pi^{Merton}>b},

where

ψ[a,b](t,x;m)=m2ln(σ22πem)(Tt)+mlnZa,b(m)(Tt)+(f(a)𝟏πMerton<a+f(b)𝟏πMerton>b)(Tt)\psi^{[a,b]}(t,x;m)=\frac{m}{2}\ln(\sigma^{-2}{2\pi_{e}m})(T-t){+m\ln Z_{a,b}(m)}(T-t)+(f(a){\bf 1}_{\pi^{Merton}<a}+f(b){\bf 1}_{\pi^{Merton}>b})(T-t)

and

f(y):=12y2σ2y(μr)+12(μr)2σ2=12σ2(yπMerton)2.f(y):=\frac{1}{2}y^{2}\sigma^{2}-y(\mu-r)+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}=\frac{1}{2}\sigma^{2}(y-\pi^{Merton})^{2}.

Note that f(πMerton)=0f(\pi^{Merton})=0. The following lemma confirms the intuition that ψ[a,b](t,x;m)\psi^{[a,b]}(t,x;m), which measures the exploration effect, converges to 0 as m0m\to 0.

Lemma 4.3.

Consider Za,b(m)Z_{a,b}(m) defined by (4.10) for each m>0m>0. We have

(4.18) limm0mlnZa,b(m)={0πMerton[a,b]f(a)πMerton<af(b)πMerton>b.\lim_{m\to 0}{m\ln Z_{a,b}(m)}=\begin{cases}0&{\pi^{Merton}\in[a,b]}\\ -f(a)&{\pi^{Merton}<a}\\ -f(b)&{\pi^{Merton}>b}.\end{cases}
Proof.

This follows directly from, e.g., the classical L’Hôpital’s rule. ∎
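The limit (4.18) is also easy to check numerically. The short sketch below (with assumed values of \sigma, a, b) evaluates m\ln Z_{a,b}(m) along a sequence m\to 0 in the three regimes of \pi^{Merton}, using a tail-aware evaluation of Z_{a,b}(m) to avoid cancellation.

```python
import numpy as np
from scipy.stats import norm

sigma, a, b = 0.3, 0.0, 1.0                       # assumed values, for illustration only

def m_ln_Z(m, pi_m):
    """m * ln Z_{a,b}(m), with Z_{a,b} as in (4.10)."""
    A = (a - pi_m) * sigma / np.sqrt(m)
    B = (b - pi_m) * sigma / np.sqrt(m)
    # evaluate in the tail where [A, B] lies, to avoid catastrophic cancellation
    Z = norm.sf(A) - norm.sf(B) if A >= 0 else norm.cdf(B) - norm.cdf(A)
    return m * np.log(Z)

def f(y, pi_m):                                   # the function f of Lemma 4.2
    return 0.5 * sigma**2 * (y - pi_m) ** 2

for pi_m in (0.5, -0.4, 1.8):                     # pi_Merton inside [a,b], below a, above b
    limit = 0.0 if a <= pi_m <= b else (-f(a, pi_m) if pi_m < a else -f(b, pi_m))
    vals = [round(m_ln_Z(m, pi_m), 6) for m in (1e-1, 1e-2, 1e-3, 1e-4)]
    print(f"pi_Merton={pi_m:5.2f}  m ln Z -> {vals}  predicted limit {limit:.6f}")
```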

Similar to Theorem 3.2 for the unconstrained case, it is straightforward to obtain the solvability equivalence between the classical and the exploratory constrained EU problems.

4.2. Exploration cost and exploration effect under portfolio constraints

We now turn our attention to the exploration cost. As before, the exploration cost at time t=0t=0 is defined by

(4.19) L[a,b](T,x;m):=v[a,b](0,x;0)v[a,b](0,x;m)m𝔼[0Tabλs,[a,b](π)lnλs,[a,b](π)𝑑π𝑑s|X0λ,[a,b]=x],\begin{split}L^{[a,b]}(T,x;m)&:=v^{[a,b]}(0,x;0)-v^{[a,b]}(0,x;m)\\ &-m\mathbb{E}\bigg{[}\int_{0}^{T}\int_{a}^{b}\lambda_{s}^{*,[a,b]}(\pi)\ln\lambda_{s}^{*,[a,b]}(\pi)d\pi ds|X^{\lambda^{*,[a,b]}}_{0}=x\bigg{]},\end{split}

where v^{[a,b]}(0,x;0) and v^{[a,b]}(0,x;m) are the optimal value functions at time t=0 of the constrained EU problem without and with exploration, respectively (defined by (4.16) and (4.15)). The integral term is the expected utility (before being regularized) of the exploration problem at time t=0.

Proposition 4.1.

In the constrained problem with exploration, the exploration cost is given by

L[a,b](T,x;m)=mT2+mTA(m)φ(A(m))B(m)φ(B(m))2Za,b(m),L^{[a,b]}(T,x;m)=\frac{mT}{2}+mT\frac{A(m)\varphi(A(m))-B(m)\varphi(B(m))}{2Z_{a,b}(m)},

where, by abusing notations,

A(m):=(aπMerton)σm1/2;B(m):=(bπMerton)σm1/2.A(m):=(a-\pi^{Merton})\sigma m^{-1/2};\quad B(m):=(b-\pi^{Merton})\sigma m^{-1/2}.

Moreover, limm0L[a,b](T,x;m)=0\lim_{m\to 0}L^{[a,b]}(T,x;m)=0.

Proof.

Recall first that the optimal distribution λt,[a,b]\lambda_{t}^{*,[a,b]} is a Gaussian distribution with mean α=πMerton\alpha=\pi^{Merton} and variance β2=mσ2\beta^{2}=\frac{m}{\sigma^{2}} truncated on the interval [a,b][a,b]. It is known e.g. Kotz et al. (2004) that for any Gaussian distribution 𝒩(α,β2)\mathcal{N}(\alpha,\beta^{2}) truncated on [a,b][a,b], its entropy is given by

(4.20) \ln\sqrt{2\pi_{e}\beta^{2}}+\frac{1}{2}+\ln\bigg{(}\Phi(l(b))-\Phi(l(a))\bigg{)}+\frac{l(a)\varphi(l(a))-l(b)\varphi(l(b))}{2\big(\Phi(l(b))-\Phi(l(a))\big)},

where l(y):=\frac{y-\alpha}{\beta}. The explicit form of the exploration cost now follows directly from Lemma 4.2. Finally, it is straightforward to verify that \lim_{m\to 0}L^{[a,b]}(T,x;m)=0. ∎
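The closed-form cost of Proposition 4.1 is easy to evaluate. The short sketch below (with assumed market parameters) tabulates L^{[a,b]}(T,x;m) for a=0, b=1 and decreasing exploration weights, alongside the term mT/2 that dominates it when a<\pi^{Merton}<b.

```python
import numpy as np
from scipy.stats import norm

def exploration_cost(T, m, a, b, mu, r, sigma):
    """Constrained exploration cost L^{[a,b]}(T, x; m) of Proposition 4.1."""
    pi_merton = (mu - r) / sigma**2
    A = (a - pi_merton) * sigma / np.sqrt(m)
    B = (b - pi_merton) * sigma / np.sqrt(m)
    Z = norm.cdf(B) - norm.cdf(A)
    return m * T / 2 + m * T * (A * norm.pdf(A) - B * norm.pdf(B)) / (2 * Z)

T, mu, r, sigma = 1.0, 0.08, 0.03, 0.30           # assumed parameters for illustration
for m in (2.0, 0.5, 0.1, 0.01, 0.001):
    L_con = exploration_cost(T, m, a=0.0, b=1.0, mu=mu, r=r, sigma=sigma)
    print(f"m={m:6.3f}   L^[0,1] = {L_con:.6f}   mT/2 = {m * T / 2:.6f}")
# The constrained cost stays below mT/2 and vanishes as m -> 0,
# in line with Proposition 4.1 and Lemma 4.4.
```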

As shown above, the exploration cost converges to zero as m\to 0. Since the exploration weight has been taken as an exogenous parameter m which reflects the level of exploration desired by the learning agent, it is intuitive to expect that the smaller m is, the more emphasis is placed on exploitation; and when m is sufficiently close to zero, the exploratory formulation approaches the problem without exploration. The following lemma confirms the intuition that the exploration cost under a portfolio constraint, i.e. when the investment strategy is restricted to an interval, is smaller than the unconstrained exploration cost.

Lemma 4.4.

If the investment strategy is restricted to [a,b], where a<\pi^{Merton}<b, then L^{[a,b]}(T,x;m)\leq L(T,x;m) for all x>0. The same conclusion also holds for the case where \pi^{Merton}\leq a<b\leq\pi^{Merton}+\sqrt{m}\sigma^{-1} or \pi^{Merton}-\sqrt{m}\sigma^{-1}\leq a<b\leq\pi^{Merton}.

The following theorem confirms a desirable result that the entropy-regularized constrained EU problem converges to its non-exploratory constrained EU counterpart when the exploration weight mm goes to zero.

Theorem 4.2.

For each x\in\mathbb{R}_{+},

(4.21) λ,[a,b](|t,x,m)δπ0,[a,b]()whenm0,\lambda^{*,[a,b]}(\cdot|t,x,m)\to\delta_{\pi^{0,[a,b]}}{(\cdot)}\quad\mbox{when}\quad m\to 0,

where the (non-exploratory) constrained Merton strategy \pi^{0,[a,b]} is defined by (4.17). Furthermore, for each (t,x)\in[0,T]\times\mathbb{R}_{+},

limm0|v,[a,b](t,x;m)v,[a,b](t,x;0)|=0.\lim_{m\to 0}|v^{*,[a,b]}(t,x;m)-v^{*,[a,b]}(t,x;0)|=0.
Proof.

The convergence of the optimal distribution control \lambda^{*,[a,b]} to the Dirac measure \delta_{\pi^{0,[a,b]}} is straightforward. The convergence of the value function follows directly from the fact that \psi^{[a,b]}(t,x;m), given in Lemma 4.2, converges to 0 as m\to 0 by Lemma 4.3. ∎

4.3. Policy improvement

Below, for some given admissible (feedback) policy λ[a,b]\lambda^{[a,b]} (not necessarily truncated Gaussian) on [a,b][a,b], we denote the corresponding value function by

(4.22) \displaystyle v^{\lambda^{[a,b]}}(t,x;m):=\mathbb{E}\Bigg{[}U(X_{T}^{\lambda^{[a,b]}})-m\int_{t}^{T}{\int_{a}^{b}}\lambda_{s}^{[a,b]}(\pi)\ln\lambda_{s}^{[a,b]}(\pi)d\pi ds|X^{\lambda^{[a,b]}}_{t}=x\Bigg{]}.
Theorem 4.3.

For some given policy \lambda^{[a,b]} (not necessarily Gaussian) on the interval [a,b], we assume that the corresponding value function v^{\lambda^{[a,b]}}(\cdot,\cdot;m)\in C^{1,2}([0,T)\times\mathbb{R}_{+})\cap C([0,T]\times\mathbb{R}_{+}) and satisfies v^{{\lambda^{[a,b]}}}_{xx}(t,x;m)<0 for any (t,x)\in[0,T)\times\mathbb{R}_{+}. Suppose furthermore that the feedback policy \widetilde{\lambda}^{[a,b]} defined by

(4.23) λ~[a,b](π|t,x;m)=𝒩(π|(μr)xvxλ[a,b]σ2x2vxxλ[a,b],mσ2x2vxxλ[a,b])|[a,b]\widetilde{\lambda}^{[a,b]}(\pi|t,x;m)=\mathcal{N}\bigg{(}\pi\bigg{|}-\frac{(\mu-r)xv^{{\lambda^{[a,b]}}}_{x}}{\sigma^{2}x^{2}v^{{\lambda^{[a,b]}}}_{xx}},-\frac{m}{\sigma^{2}x^{2}v^{{\lambda^{[a,b]}}}_{xx}}\bigg{)}\bigg{|}_{[a,b]}

(Gaussian truncated on [a,b][a,b]) is admissible. Let vλ~[a,b](t,x;m)v^{\widetilde{\lambda}^{[a,b]}}(t,x;m) be the value function corresponding to this new truncated (Gaussian) policy λ~[a,b]\widetilde{\lambda}^{[a,b]}. Then,

(4.24) vλ~[a,b](t,x;m)vλ[a,b](t,x;m),(t,x)[0,T)×+.v^{\widetilde{\lambda}^{[a,b]}}(t,x;m)\geq v^{\lambda^{[a,b]}}(t,x;m),\quad(t,x)\in[0,T)\times\mathbb{R}_{+}.
Proof.

The proof is similar to that of the unconstrained case hence, omitted. ∎

Theorem 4.3 also suggests that a candidate for the initial feedback policy may take the form {\lambda}^{0,[a,b]}(\pi|t,x;m)=\mathcal{N}\big{(}\pi\big{|}\alpha,\beta^{2}\big{)}|_{[a,b]} for some parameters \alpha,\beta^{2}. Similar to the unconstrained case, such a choice of policy leads to the convergence of both the value functions and the policies in a finite number of iterations.

Theorem 4.4.

Let λ0,[a,b](π;t,x,m)=𝒩(π|α,β2)|[a,b]{\lambda}^{0,[a,b]}(\pi;t,x,m)=\mathcal{N}\big{(}\pi\big{|}\alpha,\beta^{2}\big{)}|_{[a,b]} with α,β>0\alpha,\beta>0 and assume that U(x)=lnxU(x)=\ln x. Define the sequence of feedback policies (λn,[a,b](π;t,x,m))(\lambda^{n,{[a,b]}}(\pi;t,x,m)) updated by the policy improvement scheme (4.23), i.e.,

(4.25) λn,[a,b](π|t,x;m)=𝒩(π|(μr)xvxλn1,[a,b](t,x;m)σ2x2vxxλn1,[a,b](t,x;m),mσ2x2vxxλn1,[a,b](t,x;m))|[a,b],n=1,2,,\lambda^{n,{[a,b]}}(\pi|t,x;m)=\mathcal{N}\bigg{(}\pi\bigg{|}-\frac{(\mu-r)xv_{x}^{\lambda^{n-1,{[a,b]}}}(t,x;m)}{\sigma^{2}x^{2}v^{\lambda^{n-1,{[a,b]}}}_{xx}(t,x;m)},-\frac{m}{\sigma^{2}x^{2}v^{\lambda^{n-1,{[a,b]}}}_{xx}(t,x;m)}\bigg{)}\bigg{|}_{[a,b]},\quad n=1,2,\cdots,

where vλn1,[a,b]v^{\lambda^{n-1,{[a,b]}}} is the value function corresponding to the policy λn1,[a,b]\lambda^{n-1,{[a,b]}} defined by

(4.26) \displaystyle v^{\lambda^{n-1,{[a,b]}}}(t,x;m):=\mathbb{E}\Bigg{[}U(X_{T}^{\lambda^{n-1,{[a,b]}}})-m\int_{t}^{T}\int_{a}^{b}\lambda_{s}^{n-1,{[a,b]}}(\pi)\ln\lambda_{s}^{n-1,{[a,b]}}(\pi)d\pi ds|X^{\lambda^{n-1,{[a,b]}}}_{t}=x\Bigg{]}.

Then,

(4.27) limnλn,[a,b](|t,x;m)=λ,[a,b](|t,x;m),weakly\lim_{n\to\infty}\lambda^{n,{[a,b]}}(\cdot|t,x;m)=\lambda^{*,{[a,b]}}(\cdot|t,x;m),\quad\mbox{weakly}

and

(4.28) limnvλn,[a,b](t,x;m)=vλ,[a,b](t,x;m),\lim_{n\to\infty}v^{\lambda^{n,{[a,b]}}}(t,x;m)=v^{\lambda^{*,{[a,b]}}}(t,x;m),

which is the optimal value function given in Lemma 4.2.

Proof.

The proof can be done using the Feynman–Kac representation and the update result of Theorem 4.3, similarly to the unconstrained problem. ∎

The above improvement theorem allows us to establish a policy evaluation scheme and a learning algorithm that take the martingale property into account. Since the steps are similar to those for the unconstrained problem in Section 3.4, we skip the details.

5. Learning implementation and numerical example

We are now in a position to discuss an RL algorithm for this problem. A common practice in the RL literature is to represent the value function J^{\theta} and the policy \lambda^{\phi} by (deep) neural networks. In this section, we do not follow this path. Inspired by the (offline) actor-critic approach in Jia and Zhou (2022b), we instead take advantage of the explicit parametric forms of the optimal value function v and the improved policy given in Theorem 3.2. This in turn facilitates the learning process and leads to faster learning and convergence.

Below, to implement a workable algorithm, one can approximate a generic expectation in the following manner. Let 0=t_{0}<t_{1}<\ldots<t_{l}<t_{l+1}=T be a partition of the finite interval [0,T]. We then collect a sample \mathcal{D}=\{(t_{i},x_{i}):i=0,1,\ldots,l+1\} as follows: for i=0, the initial sample point is (0,x_{0}); next, at each t_{i}, i=0,1,\ldots,l, \lambda_{t_{i}}(\pi) is sampled to obtain an allocation \pi for the risky asset, and the wealth x_{t_{i+1}} at the next time instant t_{i+1} is computed by equation (2.5). As a result, a generic expectation can be approximated by the following finite sum

(5.1) E(θ)(ti,xi)𝒟hti(Jθ(ti+1,Xti+1λ;λ)Jθ(ti,Xtiλ;λ)m+λ(π|ti,Xtiλ)lnλ(π|ti,Xtiλ)𝑑π)Δt.E(\theta)\approx\sum_{(t_{i},x_{i})\in\mathcal{D}}h_{t_{i}}\bigg{(}J^{\theta}(t_{i+1},X^{\lambda}_{t_{i+1}};{\lambda})-J^{\theta}(t_{i},X^{\lambda}_{t_{i}};{\lambda})-m\int_{-\infty}^{+\infty}\lambda(\pi|t_{i},X^{\lambda}_{t_{i}})\ln\lambda(\pi|t_{i},X^{\lambda}_{t_{i}})d\pi\bigg{)}\Delta t.
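For concreteness, a minimal Python sketch of this finite-sum approximation is given below; J^{\theta}, the entropy integral and the test function h_{t} are passed in as callables, and all names are placeholders rather than part of the paper's implementation.

```python
import numpy as np

def approximate_expectation(ts, xs, J, neg_entropy, h, m, dt):
    """Finite-sum approximation of a generic expectation, as in (5.1).

    ts, xs      : collected sample D = {(t_i, x_i)} along one trajectory,
    J           : callable (t, x) -> J^theta(t, x; lambda),
    neg_entropy : callable (t, x) -> integral of lambda * ln(lambda) over pi,
    h           : callable t -> test function h_t,
    m           : exploration weight, dt : time step Delta t.
    """
    total = 0.0
    for i in range(len(ts) - 1):
        incr = (J(ts[i + 1], xs[i + 1]) - J(ts[i], xs[i])
                - m * neg_entropy(ts[i], xs[i]))
        total += h(ts[i]) * incr * dt
    return total
```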

5.1. Implementation for the unconstrained problem

Recall first that the exploratory optimal value function is given by

(5.2) v(t,x;m)=\displaystyle v(t,x;m)= lnx+(r+12(μr)2σ2)(Tt)+m2ln(2πem)(Tt)mln(σ)(Tt).\displaystyle\ln x+\bigg{(}r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\bigg{)}(T-t)+\frac{m}{2}\ln({2\pi_{e}m})(T-t)-m\ln(\sigma)(T-t).

We learn it from the following parametric form

(5.3) Jθ(t,x;m)=lnx+θ(Tt)+m2ln(2πem)(Tt).\displaystyle J^{\theta}(t,x;m)=\ln x+\theta(T-t)+\frac{m}{2}\ln({2\pi_{e}m})(T-t).

This parametric form clearly satisfies Assumption 3.1. Matching (5.3) with (5.2) yields

(5.4) θ=r+12(μr)2σ2mlnσ.\displaystyle\theta=r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}-m\ln\sigma.

Recall that the entropy of a Gaussian distribution \mathcal{N}(\cdot|\alpha,\beta^{2}) is given by \frac{1}{2}+\ln\left(\beta\sqrt{2\pi_{e}}\right). Since we do not know the model’s parameters \mu and \sigma, we will learn them using Theorem 3.2 with the parametric policy \lambda^{\phi}(\pi)=\mathcal{N}(\pi|\phi_{1},e^{\phi_{2}}m). It follows that

(5.5) p^(ϕ1,ϕ2):=+λϕ(π)lnλϕ(π)𝑑π=12+12(ϕ2+ln(2πem)).\displaystyle\hat{p}(\phi_{1},\phi_{2}):=-\int_{-\infty}^{+\infty}\lambda^{\phi}(\pi)\ln\lambda^{\phi}(\pi)d\pi=\frac{1}{2}+\frac{1}{2}(\phi_{2}+\ln(2\pi_{e}m)).

Also,

lnλϕ(π)=12ϕ212m(πϕ1)2eϕ212ln(2πem).\ln\lambda^{\phi}(\pi)=-\frac{1}{2}\phi_{2}-\frac{1}{2m}(\pi-\phi_{1})^{2}e^{-\phi_{2}}-\frac{1}{2}\ln(2\pi_{e}m).

Therefore

\frac{\partial}{\partial\phi_{1}}\ln\lambda^{\phi}(\pi)=(\pi-\phi_{1})\frac{e^{-\phi_{2}}}{m},\quad\frac{\partial}{\partial\phi_{2}}\ln\lambda^{\phi}(\pi)=-\frac{1}{2}+\frac{1}{2}(\pi-\phi_{1})^{2}\frac{e^{-\phi_{2}}}{m}.

Moreover, it can be seen from (5.5) that p^ϕ1=0\frac{\partial\hat{p}}{\partial\phi_{1}}=0 and p^ϕ2=12.\frac{\partial\hat{p}}{\partial\phi_{2}}=\frac{1}{2}. Finally, in such a learning framework, we choose the test function ht=Jθθ=(Tt)h_{t}=\frac{\partial J^{\theta}}{\partial\theta}=(T-t).

For the learning step, we adopt the (offline) actor-critic approach of Jia and Zhou (2022b), which is summarized in Algorithm 1 below (a Python sketch of the data-collection step follows the algorithm).

0:  Market simulator Market
0:  Learning rates: ηθ,ηϕ\eta_{\theta},\eta_{\phi}, exploration rate mm, number of iterations MM
1:  for k=1,2,,Mk=1,2,\ldots,M do
2:     for i=1,2,,TΔti=1,2,\ldots,\frac{T}{\Delta t} do
3:        Sample (tik,xik)(t_{i}^{k},x_{i}^{k}) from Market under λϕ\lambda^{\phi}
4:        Obtain collected samples 𝒟={(tik,xik):1iTΔt}\mathcal{D}=\{(t_{i}^{k},x_{i}^{k}):1\leq i\leq\frac{T}{\Delta t}\}
5:        Compute ΔθJ\Delta_{\theta}J using (3.4)
6:        Compute ΔϕJ\Delta_{\phi}J using (3.29)
7:        Update θθ+l(i)ηθΔθJ\theta\leftarrow\theta+l(i)\eta_{\theta}\Delta_{\theta}J
8:        Update ϕϕl(i)ηϕΔϕJ\phi\leftarrow\phi-l(i)\eta_{\phi}\Delta_{\phi}J
9:     end for
10:     Update the policy λϕ𝒩(ϕ1,eϕ2m)\lambda^{\phi}\leftarrow\mathcal{N}\left(\phi_{1},e^{\phi_{2}}m\right)
11:  end for
Algorithm 1 PE/PG: Actor-Critic Algorithm: unconstrained problem
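The "Market" simulator and the sampling steps 3–4 of Algorithm 1 can be sketched as follows. The discretization (a log-Euler step of the wealth dynamics, chosen so that the simulated wealth stays positive) and all parameter values are assumptions made for illustration, not part of the algorithm's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def market_rollout(phi1, phi2, m, x0=1.0, T=1.0, dt=1.0 / 250,
                   r=0.03, mu=0.08, sigma=0.30):
    """Sample one trajectory {(t_i, x_i)} under the Gaussian policy
    N(phi1, e^{phi2} m) (steps 3-4 of Algorithm 1).  The simulator parameters
    r, mu, sigma are hidden from the learner."""
    n = int(round(T / dt))
    ts = np.linspace(0.0, T, n + 1)
    xs = np.empty(n + 1)
    xs[0] = x0
    for i in range(n):
        pi = rng.normal(phi1, np.sqrt(np.exp(phi2) * m))      # sampled allocation
        dW = rng.normal(0.0, np.sqrt(dt))
        drift = (r + pi * (mu - r) - 0.5 * sigma**2 * pi**2) * dt
        xs[i + 1] = xs[i] * np.exp(drift + sigma * pi * dW)   # positivity-preserving step
    return ts, xs

ts, xs = market_rollout(phi1=0.5, phi2=0.0, m=0.1)
print(xs[-1])
```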

5.2. Implementation for the constrained problem

First, recall from (4.9) that the exploratory optimal value function is given by

v[a,b](t,x;m)=\displaystyle v^{[a,b]}(t,x;m)= lnx+(r+12(μr)2σ2)(Tt)+m2ln(σ22πem)(Tt)+mlnZa,b(m)(Tt)\displaystyle\ln x+(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}})(T-t)+\frac{m}{2}\ln(\sigma^{-2}{2\pi_{e}m})(T-t){+}m\ln Z_{a,b}(m)(T-t)

for (t,x)[0,T]×+(t,x)\in[0,T]\times\mathbb{R}_{+}, where Za,b(m)Z_{a,b}(m) is given by (4.10). We learn it from the following parametric form

(5.6) Jθ(t,x;m)=lnx+θ1(Tt)+θ2(Tt)+m2ln(2πem)(Tt),\displaystyle J^{\theta}(t,x;m)=\ln x+\theta_{1}(T-t)+\theta_{2}(T-t)+\frac{m}{2}\ln({2\pi_{e}m})(T-t),

which clearly satisfies Assumption 3.1. Matching (5.6) with (4.9) yields

(5.7) θ1=r+12(μr)2σ2mln(σ),θ2=mlnZa,b(m).\displaystyle\theta_{1}=r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}-m\ln(\sigma),\quad\theta_{2}=m\ln Z_{a,b}(m).

Again, guided by Theorem 4.1, we parametrize the policy as follows

λϕ(π)=𝒩(π|ϕ1,eϕ2m)|[a,b]=1eϕ2mφ(πϕ1eϕ2m)(Φ(bϕ1eϕ2m)Φ(aϕ1eϕ2m)).\lambda^{\phi}(\pi)=\mathcal{N}(\pi|\phi_{1},e^{\phi_{2}}m)|_{[a,b]}=\frac{1}{\sqrt{e^{\phi_{2}}m}}\frac{\varphi(\frac{\pi-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})}{\left(\Phi(\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})-\Phi(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\right)}.

Also recall that for λϕ\lambda^{\phi}, a Gaussian distribution truncated on [a,b][a,b], its entropy is given by

p^(ϕ1,ϕ2)=\displaystyle\hat{p}(\phi_{1},\phi_{2})= ln2πm+12ϕ2+12+ln(Φ(bϕ1eϕ2m)Φ(aϕ1eϕ2m))+aϕ1eϕ2mφ(aϕ1eϕ2m)bϕ1eϕ2mφ(bϕ1eϕ2m)2(Φ(bϕ1eϕ2m)Φ(aϕ1eϕ2m)).\displaystyle\ln\sqrt{{2\pi m}}+\frac{1}{2}\phi_{2}+\frac{1}{2}+\ln\left(\Phi(\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})-\Phi(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\right)+\frac{\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}}\varphi(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})-\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}}\varphi(\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})}{2\left(\Phi(\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})-\Phi(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\right)}.

Also,

lnλϕ(π)=12ϕ212m(πϕ1)2eϕ212ln(2πem)ln(Φ(bϕ1eϕ2m)Φ(aϕ1eϕ2m)).\ln\lambda^{\phi}(\pi)=-\frac{1}{2}\phi_{2}-\frac{1}{2m}(\pi-\phi_{1})^{2}e^{-\phi_{2}}-\frac{1}{2}\ln(2\pi_{e}m)-\ln\left(\Phi(\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})-\Phi(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\right).

Therefore

ϕ1lnλϕ(π)=(πϕ1)eϕ2m+(φ(bϕ1e2ϕm)φ(aϕ1eϕ2m))(Φ(bϕ1eϕ2m)Φ(aϕ1eϕ2m))1eϕ2m,\frac{\partial}{\phi_{1}}\ln\lambda^{\phi}(\pi)=(\pi-\phi_{1})\frac{e^{-\phi_{2}}}{m}+\frac{\left({\varphi}(\frac{b-\phi_{1}}{\sqrt{e^{\phi}_{2}m}})-{\varphi}(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\right)}{\left(\Phi(\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})-\Phi(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\right)}\frac{1}{\sqrt{e^{\phi_{2}}m}},
ϕ2lnλϕ(π)=12+12(πϕ1)2eϕ2m+(φ(bϕ1e2ϕm)bϕ1mφ(aϕ1eϕ2m)aϕ1m)(Φ(bϕ1eϕ2m)Φ(aϕ1eϕ2m))12eϕ2/2.\frac{\partial}{\phi_{2}}\ln\lambda^{\phi}(\pi)=-\frac{1}{2}+\frac{1}{2}(\pi-\phi_{1})^{2}\frac{e^{-\phi_{2}}}{m}+\frac{\left({\varphi}(\frac{b-\phi_{1}}{\sqrt{e^{\phi}_{2}m}})\frac{b-\phi_{1}}{\sqrt{m}}-{\varphi}(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\frac{a-\phi_{1}}{\sqrt{m}}\right)}{\left(\Phi(\frac{b-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})-\Phi(\frac{a-\phi_{1}}{\sqrt{e^{\phi_{2}}m}})\right)}\frac{1}{2}e^{-\phi_{2}/2}.

The partial derivatives of p^(ϕ1,ϕ2)\hat{p}(\phi_{1},\phi_{2}) can be obtained explicitly.
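A direct transcription of these score functions into Python, together with a finite-difference check against scipy.stats.truncnorm, is sketched below; the parameter values in the check are arbitrary and only serve as an illustration.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncated_policy_scores(pi, phi1, phi2, m, a, b):
    """Partial derivatives of ln lambda^phi(pi) with respect to phi1 and phi2
    for the truncated-Gaussian policy on [a, b], transcribed from the displays above."""
    s = np.sqrt(np.exp(phi2) * m)                 # standard deviation e^{phi2/2} sqrt(m)
    za, zb = (a - phi1) / s, (b - phi1) / s
    Z = norm.cdf(zb) - norm.cdf(za)
    d1 = (pi - phi1) * np.exp(-phi2) / m + (norm.pdf(zb) - norm.pdf(za)) / (Z * s)
    d2 = (-0.5 + 0.5 * (pi - phi1) ** 2 * np.exp(-phi2) / m
          + (norm.pdf(zb) * (b - phi1) - norm.pdf(za) * (a - phi1)) / np.sqrt(m)
          / Z * 0.5 * np.exp(-phi2 / 2))
    return d1, d2

# finite-difference check against scipy's truncated-normal log-density
phi1, phi2, m, a, b, pi = 0.4, 0.2, 0.1, 0.0, 1.0, 0.7

def logpdf(p1, p2):
    s = np.sqrt(np.exp(p2) * m)
    return truncnorm.logpdf(pi, (a - p1) / s, (b - p1) / s, loc=p1, scale=s)

eps = 1e-6
print(truncated_policy_scores(pi, phi1, phi2, m, a, b))
print((logpdf(phi1 + eps, phi2) - logpdf(phi1 - eps, phi2)) / (2 * eps),
      (logpdf(phi1, phi2 + eps) - logpdf(phi1, phi2 - eps)) / (2 * eps))
```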

Similarly to the unconstrained case, the learning step can be done using the actor-critic algorithm adapted to the constrained problem, as summarized in Algorithm 2 below.

0:  Market simulator Market
0:  Learning rates: ηθ,ηϕ\eta_{\theta},\eta_{\phi}, exploration rate mm, number of iterations MM, portfolio bounds a,ba,b
1:  for k=1,2,,Mk=1,2,\ldots,M do
2:     for i=1,2,,TΔti=1,2,\ldots,\frac{T}{\Delta t} do
3:        Sample (tik,xik)(t_{i}^{k},x_{i}^{k}) from Market under λϕ\lambda^{\phi}
4:        Obtain collected samples 𝒟={(tik,xik):1iTΔt}\mathcal{D}=\{(t_{i}^{k},x_{i}^{k}):1\leq i\leq\frac{T}{\Delta t}\}
5:        Compute ΔθJ\Delta_{\theta}J using (3.4)
6:        Compute ΔϕJ\Delta_{\phi}J using (3.29)
7:        Update θθ+l(i)ηθΔθJ\theta\leftarrow\theta+l(i)\eta_{\theta}\Delta_{\theta}J
8:        Update ϕϕl(i)ηϕΔϕJ\phi\leftarrow\phi-l(i)\eta_{\phi}\Delta_{\phi}J
9:     end for
10:     Update the policy λϕ𝒩(ϕ1,eϕ2m)|[a,b]\lambda^{\phi}\leftarrow\mathcal{N}\left(\phi_{1},e^{\phi_{2}}m\right)|_{[a,b]}
11:  end for
Algorithm 2 PE/PG Actor-Critic Algorithm: constrained problem

5.3. Numerical demonstration

We provide an example to demonstrate our results under a portfolio constraint in a setting with T=1 and \Delta t=1/250. The annualized interest rate is r=3\%. We choose \mu=0.08,\sigma=0.3 to simulate sample paths of the diffusion wealth process (6.1) based on the updated policy after each episode. First, for m\in[0.001,2] we plot the corresponding exploration cost function for both the constrained and the unconstrained cases. Figure 1 clearly confirms the theoretical results obtained in Proposition 3.1. Interestingly, the exploration cost of the unconstrained case (L(T,x;m)) is much larger than that of the constrained case (L^{[a,b]}(T,x;m)). Consistent with Lemma 4.4, this can be explained by the fact that in the constrained case one only needs to search for the optimal strategy over a finite domain, as opposed to an infinite domain in the unconstrained case.

Figure 1. Exploration cost versus exploration rate: (a) impact of the lower bound a; (b) impact of the upper bound b.

The left (resp. right) panel of Figure 1 demonstrates the impact of the lower (resp. upper) portfolio bound on the exploration cost. Clearly, as the lower (resp. upper) bound of the portfolio decreases (resp. increases), the exploration cost increases. This finding aligns with the intuitive understanding that a broader domain of investment opportunities necessitates higher exploration expenses. Notably, when subjected to both short-selling (i.e., a=0a=0) and money borrowing constraints (i.e., b=1b=1), the exploration cost becomes negligible compared to the unconstrained case.

Recall from Theorem 3.1 that the difference between the constrained value function v^{*,[a,b]}(t,x;0) and the exploratory constrained value function v^{*,[a,b]}(t,x;m) is determined by the exploration rate m. In Figure 2, we plot the difference |v^{*,[a,b]}(t,x;m)-v^{*,[a,b]}(t,x;0)| for m=2 (left panel) and m=0.001 (right panel). It is clear from Figure 2 that the difference is substantial for a large exploration rate m but decreases greatly for a small exploration rate.

Figure 2. The effect of the exploration rate m on |v^{*,[a,b]}(t,x;m)-v^{*,[a,b]}(t,x;0)|: (a) m=2; (b) m=0.001.

Next, we choose the learning rates used to update \theta and \phi to be \eta_{\theta}=0.01 and \eta_{\phi}=0.001, respectively, with the decay rate l(i)=1/i^{0.51}. From the closed-form formula, the true value of (\theta_{1},\theta_{2}) is \theta^{*}=(0.055929,-0.001497). For each learning episode, we start with an initial guess

θ=(θ1,θ2)=θ+(ϵ1,ϵ2)\theta=(\theta_{1},\theta_{2})=\theta^{*}+(\epsilon_{1},\epsilon_{2})

where \epsilon_{i},i=1,2 are independent Gaussian noises. For each iteration, we randomly sample 10000 one-year trajectories of the wealth process \mathcal{D}=\{(t_{i}^{k},x_{i}^{k}):1\leq i\leq\frac{T}{\Delta t}\}. The gradients of the value function \Delta_{\theta}J are calculated by the approximate discrete sum (5.1). The model is trained for N=10000 iterations. The results reported in this section are averages over 20 training episodes. The portfolio strategy is limited by the lower bound a=0 (i.e. short-selling is not possible) and the upper bound b=1 (i.e. money borrowing is also prohibited). Finally, we choose the test function h_{t}=\frac{\partial J^{\theta}}{\partial\theta}=[(T-t);(T-t)]^{T}.

Figure 3. Learned policy for different exploration rates: (a) m=1; (b) m=0.1; (c) m=0.01.

Figure 3 reports the learned policy in comparison with the true optimal policy, which is a Gaussian distribution truncated on [a,b]. It can be observed that the smaller the exploration rate is, the more closely the learned optimal strategy matches the ground-truth optimal policy. As shown in Figure 4, when m is small, the (truncated Gaussian) optimal policy gets closer to the Dirac distribution at the “constrained” Merton strategy \pi^{0,[a,b]} defined in (4.17), which is consistent with the result obtained in Theorem 4.2.

Figure 4. Convergence of the constrained policy as the exploration rate m\to 0: (a) m=0.0001; (b) m=0.00001.

To have a more concrete comparison between the true optimal policy and the learned optimal policy, in Table 1 we compare the true mean, the learned mean, and the standard deviation of the learned mean. In addition, we report the (empirical) Kullback–Leibler divergence (see e.g. Csiszár (1975)) from the learned policy to the true policy for different m; see the D_{KL} column.

mm True mean Learned mean std DKLD_{KL}
0.01 0.555556 0.555927 0.000005 0.000000
0.1 0.555556 0.554679 0.000007 0.000000
1 0.555556 0.555750 0.000189 0.000052
Table 1. True mean versus Learned mean under portfolio constraint for different mm

Next we consider the dependence of |v^{[a,b]}(t,x,m)-J^{\theta}(t,x,m)| on the number of iterations used. To ease the comparison, we consider the difference |v^{[a,b]}(0.5,0.5,0.01)-J^{\theta}(0.5,0.5,0.01)| as a function of the number of iterations. More specifically, we run Algorithm 1 to learn the coefficients \theta_{1},\theta_{2} from data and then calculate the difference |v^{[a,b]}(0.5,0.5,0.01)-J^{\theta}(0.5,0.5,0.01)|. In Figure 5, we report this difference averaged over 20 runs.

Figure 5. Error versus iterations

It is clear from Figure 5 that the error between the optimal value function and the learned value function decreases as the number of iterations increases.

To have a more concrete comparison, we report the true parameters (\theta_{1},\theta_{2}), (\phi_{1},\phi_{2}) versus the corresponding learned values in Table 2. As shown in Table 2, the quality of learning improves as the number of iterations increases.

Iteration True (θ1,θ2)(\theta_{1},\theta_{2}) Learned (θ1,θ2)(\theta_{1},\theta_{2}) True (ϕ1,ϕ2)(\phi_{1},\phi_{2}) Learned (ϕ1,ϕ2)(\phi_{1},\phi_{2})
500 (0.055929,-0.001497) (0.052141,-0.001261) (0.555556,2.407946) (0.552537,2.407276)
1000 (0.055929,-0.001497) (0.057109 ,-0.001836) (0.555556,2.407946) (0.554388,2.413809 )
2000 (0.055929,-0.001497) (0.054673,-0.003369 ) (0.555556,2.407946) (0.555207,2.410082)
3000 (0.055929,-0.001497) (0.057058,-0.004642 ) (0.555556,2.407946) (0.555872,2.408604)
4000 (0.055929,-0.001497) (0.054119 ,0.001334) (0.555556,2.407946) (0.554001,2.409800)
5000 (0.055929,-0.001497) (0.055721,-0.000253) (0.555556,2.407946) (0.557341,2.404103)
10000 (0.055929,-0.001497) (0.056456,-0.002821) (0.555556,2.407946) (0.554809,2.405365)
Table 2. True parameters versus learned parameters

Lastly in this section, we compare the effect of the exploration parameter m on the distribution of the optimal wealth process. In Figure 6 we plot the distribution of the optimal wealth process at time t=T/2=0.5 using the dynamics in (4.12) and a sample size of 10000. From Figure 6, it can be seen that the exploration parameter m plays a central role in the distribution of the constrained optimal wealth process. We can clearly distinguish between the case with exploration (i.e. m\neq 0) and the case without exploration (i.e. m=0). In particular, exploration leads to a more dispersed distribution with a heavier tail. Again, as confirmed by the theory in the previous sections, this effect becomes less significant as the exploration parameter m becomes smaller.
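A minimal simulation sketch of this experiment is given below: it applies an Euler–Maruyama scheme to the optimal exploratory wealth dynamics (4.12), using the constants \pi^{*,[a,b]} and q^{*,[a,b]} of Theorem 4.1. The initial wealth and market parameters are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def wealth_samples(m, t=0.5, n_paths=10_000, dt=1.0 / 250, x0=1.0,
                   r=0.03, mu=0.08, sigma=0.30, a=0.0, b=1.0):
    """Euler-Maruyama samples of the optimal exploratory wealth (4.12) at time t
    (logarithmic utility; m = 0 recovers the non-exploratory constrained case)."""
    pi_m = (mu - r) / sigma**2
    if m > 0:
        A, B = (a - pi_m) * sigma / np.sqrt(m), (b - pi_m) * sigma / np.sqrt(m)
        Z = norm.cdf(B) - norm.cdf(A)
        pi_star = pi_m + (norm.pdf(A) - norm.pdf(B)) / (sigma / np.sqrt(m) * Z)
        q2 = pi_star**2 + m / sigma**2 * (1 + (A * norm.pdf(A) - B * norm.pdf(B)) / Z
                                          - ((norm.pdf(A) - norm.pdf(B)) / Z) ** 2)
    else:
        pi_star = min(max(pi_m, a), b)            # constrained Merton strategy (4.17)
        q2 = pi_star**2
    x = np.full(n_paths, x0)
    for _ in range(int(round(t / dt))):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        x = x * (1.0 + (r + pi_star * (mu - r)) * dt + sigma * np.sqrt(q2) * dW)
    return x

for m in (0.5, 0.1, 0.0):
    x = wealth_samples(m)
    print(f"m={m:4.2f}: mean={x.mean():.4f}, std={x.std():.4f}")
```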

Figure 6. Density plot of the wealth process, exploration versus no exploration: (a) m=0.5; (b) m=0.1; (c) m=0.001.

6. Optimal portfolio with portfolio constraints and exploration for quadratic utility

We consider in this section the case of a quadratic utility function. It is worth noting that in portfolio theory the use of quadratic utility functions is very close to the classic Markowitz mean-variance portfolio, and the reader is referred to e.g. Duffie and Richardson (1991); Bodnar et al. (2013) for discussions on the connections between three quadratic optimization problems: the Markowitz mean–variance problem, the mean–variance utility function and the quadratic utility. Note that unlike the mean-variance (MV) solution, the optimal solution under a quadratic utility function is not only time-consistent but, as shown below, also lies on the MV efficient frontier under mild conditions. Moreover, while the MV problem becomes extremely complex, even without exploration, when a portfolio constraint is added, see e.g. Li et al. (2002); Bielecki et al. (2005); Li and Xu (2016), we are still able to obtain closed-form solutions when portfolio constraints are included in our exploratory learning setting with a quadratic utility. Note that for a quadratic utility function the wealth process can be negative. Therefore, below we consider the amount of wealth \theta_{t} invested in the risky asset at time t. To include portfolio constraints we assume that, given the current wealth X^{\theta}_{t}, the risky investment amount \theta_{t} at time t is bounded by a(t,X_{t}^{\theta})\leq\theta_{t}\leq b(t,X_{t}^{\theta}), where a (resp. b) is a deterministic continuous function defining the lower (resp. upper) portfolio bound. The wealth dynamics follow the SDE

(6.1) dXtθ=\displaystyle dX_{t}^{\theta}= (rXtθ+θt(μr))dt+σθtdWt,X0θ=x0.\displaystyle\left(rX_{t}^{\theta}+\theta_{t}(\mu-r)\right)dt+\sigma\theta_{t}dW_{t}\,,\quad X_{0}^{\theta}=x_{0}\in\mathbb{R}\,.

The set of admissible investment strategies is now defined by

𝒟[a,b](x0):={(θ)t[0,T]\displaystyle\mathcal{D}_{[a,b]}(x_{0}):=\Bigg{\{}(\theta)_{t\in[0,T]}\, is progressively measurable|a(t,Xtθ)θtb(t,Xtθ),\displaystyle\text{is progressively measurable}\,\bigg{|}\,\;a(t,X_{t}^{\theta})\leq\theta_{t}\leq b(t,X_{t}^{\theta}),\;
(6.1) has a unique strong solution and𝔼[0Tθt2dt]<}.\displaystyle\text{\eqref{eq_Xtnoc} has a unique strong solution and}\;\mathbb{E}\bigg{[}\int_{0}^{T}\theta_{t}^{2}dt\bigg{]}<\infty\Bigg{\}}.

Our objective is to maximize the expected utility of terminal wealth

(6.2) maxθ𝒟[a,b](x0)𝔼[U(XTθ)],\displaystyle\max_{\theta\in{\mathcal{D}}_{[a,b]}(x_{0})}\mathbb{E}\Big{[}U(X_{T}^{\theta})\Big{]},

where U(x)=Kx-\frac{1}{2}\varepsilon x^{2}, with parameters K,\varepsilon>0. The constant 1/\varepsilon reflects the agent’s risk aversion and can be regarded as the risk aversion parameter in the mean–variance analysis. We remark that the quadratic utility function U(x)=Kx-\frac{1}{2}\varepsilon x^{2} attains its global maximum at the so-called bliss point K/\varepsilon. Obviously, the quadratic utility function is symmetric with respect to the bliss point K/\varepsilon: it is increasing for x<K/\varepsilon, decreasing for x>K/\varepsilon, and the utility values at K/\varepsilon-y and K/\varepsilon+y coincide for any y. We consider the exploratory version of the wealth dynamics given by

(6.3) dXtλ=A~(t,Xtλ;λ)dt+B~(t,Xtλ;λ)dWt,X0λ=x0,\displaystyle dX_{t}^{\lambda}=\tilde{A}(t,X_{t}^{\lambda};\lambda)dt+\tilde{B}(t,X_{t}^{\lambda};\lambda)dW_{t}\,,\quad X_{0}^{\lambda}=x_{0},

where

(6.4) A~(t,x;λ)=a(t,x)b(t,x)(rx+θ(μr))λ(θ|t,x)𝑑θ;B~(t,x;λ)=a(t,x)b(t,x)σ2θ2λ(θ|t,x)𝑑θ.\displaystyle\tilde{A}(t,x;\lambda)=\int_{a(t,x)}^{b(t,x)}\big{(}rx+\theta(\mu-r)\big{)}\lambda(\theta|t,x)d\theta;\quad\tilde{B}(t,x;\lambda)=\sqrt{\int_{a(t,x)}^{b(t,x)}\sigma^{2}\theta^{2}\lambda(\theta|t,x)d\theta}.

and λ(|t,x)\lambda(\cdot|t,x) is a probability density on the interval [a(t,x),b(t,x)][a(t,x),b(t,x)]. The exploratory optimization is now stated by

(6.5) v(s,x;m):=maxλ~[a,b]𝔼[U(XTλ)msTa(t,Xtλ)b(t,Xtλ)λ(θ|t,Xtλ)lnλ(θ|t,Xtλ)𝑑θ𝑑t|Xsλ=x].\displaystyle v(s,x;m):=\max_{\lambda\in\widetilde{\mathcal{H}}_{[a,b]}}\mathbb{E}\bigg{[}U(X_{T}^{\lambda})-m\int_{s}^{T}\int_{a(t,X_{t}^{\lambda})}^{b(t,X_{t}^{\lambda})}\lambda(\theta|t,X_{t}^{\lambda})\ln\lambda(\theta|t,X_{t}^{\lambda})d\theta dt\big{|}X_{s}^{\lambda}=x\bigg{]}.

where ~[a,b]\widetilde{\mathcal{H}}_{[a,b]} is the set of admissible feedback policies λ\lambda that satisfy the following properties:

(1) For each (t,x)\in[0,T]\times\mathbb{R}, \lambda(\cdot|t,x) is a density function on [a(t,x),b(t,x)].

(2) The mapping [0,T]\times\mathbb{R}\times[a(t,x),b(t,x)]\ni(t,x,\theta)\mapsto\lambda(\theta|t,x) is measurable.

(3) The exploration SDE (6.3) admits a unique strong solution denoted by X^{\lambda} and

\mathbb{E}\bigg{[}U(X_{T}^{\lambda})-m\int_{0}^{T}\int_{a(t,X_{t}^{\lambda})}^{b(t,X_{t}^{\lambda})}\lambda(\theta|t,X_{t}^{\lambda})\ln\lambda(\theta|t,X_{t}^{\lambda})d\theta dt\bigg{]}<\infty.

As before, the optimal value function satisfies the following HJB equation

vt(t,x;m)+supλ~[a,b]{A~(t,x;λ)vx(t,x;m)\displaystyle v_{t}(t,x;m)+\sup_{\lambda\in\widetilde{\mathcal{H}}_{[a,b]}}\bigg{\{}\tilde{A}(t,x;\lambda)v_{x}(t,x;m) +12B~2(t,x;λ)vxx(t,x;m)\displaystyle+\frac{1}{2}\tilde{B}^{2}(t,x;\lambda)v_{xx}(t,x;m)
(6.6) ma(t,x)b(t,x)λ(θ|t,x)lnλ(θ|t,x)dθ}=0,\displaystyle-m\int_{a(t,x)}^{b(t,x)}\lambda(\theta|t,x)\ln\lambda(\theta|t,x)d\theta\bigg{\}}=0,

with terminal condition v(T,x;m)=U(x). Using the standard DPP argument again, we observe that under the portfolio constraint \theta\in\mathcal{D}_{[a,b]}, the optimal feedback policy follows a truncated Gaussian distribution.

Lemma 6.1.

In the exploratory constrained EU setting with quadratic utility function and vxx<0v_{xx}<0, the optimal feedback policy λ~\tilde{\lambda}^{*} is a Gaussian distribution with mean α~(t,x)\tilde{\alpha}(t,x) and variance β~2(t,x)\tilde{\beta}^{2}(t,x) truncated on interval [a(t,x),b(t,x)][a(t,x),b(t,x)], where

(6.7) α~(t,x)=(μr)vx(t,x)σ2vxx(t,x);β~2(t,x)=mσ2vxx(t,x).\tilde{\alpha}(t,x)=-\frac{(\mu-r)v_{x}(t,x)}{\sigma^{2}v_{xx}(t,x)};\quad\tilde{\beta}^{2}(t,x)=-\frac{m}{\sigma^{2}v_{xx}(t,x)}.

The density of the optimal policy λ~\widetilde{\lambda}^{*} is given by

(6.8) \displaystyle\widetilde{\lambda}^{*}(\theta|t,x;m):=\frac{1}{\widetilde{\beta}(t,x)}\frac{\varphi\bigg{(}\frac{\theta-\widetilde{\alpha}(t,x)}{\widetilde{\beta}(t,x)}\bigg{)}}{\Phi\bigg{(}\frac{b(t,x)-\widetilde{\alpha}(t,x)}{\widetilde{\beta}(t,x)}\bigg{)}-\Phi\bigg{(}\frac{a(t,x)-\widetilde{\alpha}(t,x)}{\widetilde{\beta}(t,x)}\bigg{)}},

where φ\varphi and Φ\Phi are the PDF and CDF functions of the standard normal distribution, respectively.

Substituting (6.8) back into the HJB equation (6.6), we obtain the following non-linear PDE

(6.9) vt(t,x;m)+rxvx(t,x;m)12(μr)2vx2(t,x;m)σ2vxx(t,x;m)+m2ln(2πemσ2vxx(t,x;m))+mlnZ(t,x;m)=0,\displaystyle v_{t}(t,x;m)+rxv_{x}(t,x;m)-\frac{1}{2}\frac{(\mu-r)^{2}v_{x}^{2}(t,x;m)}{\sigma^{2}v_{xx}(t,x;m)}+\frac{m}{2}\ln\bigg{(}-\frac{2\pi_{e}m}{\sigma^{2}v_{xx}(t,x;m)}\bigg{)}+{m\ln Z(t,x;m)}=0,

with terminal condition v(T,x;m)=U(x)v(T,x;m)=U(x), where by abusing notations,

(6.10) Z(t,x;m):=Φ(Q~b(t,x;m))Φ(Q~a(t,x;m)),Z(t,x;m):={\Phi(\widetilde{Q}_{b}(t,x;m))-\Phi(\widetilde{Q}_{a}(t,x;m))},

with

(6.11) Q~a(t,x;m):=(a(t,x)+(μr)σ2vx(t,x)vxx(t,x))σ2vxx(t,x)m;\displaystyle\widetilde{Q}_{a}(t,x;m):=\bigg{(}a(t,x)+\frac{(\mu-r)}{\sigma^{2}}\frac{v_{x}(t,x)}{v_{xx}(t,x)}\bigg{)}\sqrt{\frac{-\sigma^{2}v_{xx}(t,x)}{m}};
(6.12) Q~b(t,x;m):=(b(t,x)+(μr)σ2vx(t,x)vxx(t,x))σ2vxx(t,x)m.\displaystyle\widetilde{Q}_{b}(t,x;m):=\bigg{(}b(t,x)+\frac{(\mu-r)}{\sigma^{2}}\frac{v_{x}(t,x)}{v_{xx}(t,x)}\bigg{)}\sqrt{\frac{-\sigma^{2}v_{xx}(t,x)}{m}}.

In the rest of this section we assume that the portfolio bounds are given by

a(t,x):=πMertonx+a0(t),andb(t,x):=πMertonx+b0(t),a(t,x):={-}\pi^{Merton}x+a_{0}(t),\quad\mbox{and}\quad b(t,x):={-}\pi^{Merton}x+b_{0}(t),

where, as before, \pi^{Merton}=\frac{(\mu-r)}{\sigma^{2}}, and a_{0} and b_{0} are two time-varying continuous bounded functions with a_{0}(t)<b_{0}(t) for all t\in[0,T].

Theorem 6.1 (Quadratic utility-exploratory optimal investment under portfolio constraint).

The optimal value function of the entropy-regularized exploratory constrained optimal investment problem with quadratic utility U(x)=Kx12εx2U(x)=Kx-\frac{1}{2}\varepsilon x^{2} is given by

(6.13) v~[a,b](t,x;m)=\displaystyle\widetilde{v}^{[a,b]}(t,x;m)= 12εx2e(ρ22r)(Tt)+Kxe(ρ2r)(Tt)K22ε(1eρ2(Tt))\displaystyle-\frac{1}{2}\varepsilon x^{2}e^{-(\rho^{2}-2r)(T-t)}+Kxe^{-(\rho^{2}-r)(T-t)}-\frac{K^{2}}{2\varepsilon}(1-e^{-\rho^{2}(T-t)})
(6.14) +m4(ρ22r)(Tt)2+m2ln(ε1σ22πem)(Tt)+mF(t,m),\displaystyle+\frac{m}{4}(\rho^{2}-2r)(T-t)^{2}+\frac{m}{2}\ln(\varepsilon^{-1}\sigma^{-2}{2\pi_{e}m})(T-t)+mF(t,m),

for (t,x)[0,T]×(t,x)\in[0,T]\times\mathbb{R}, where ρ:=(μr)σ\rho:=\frac{(\mu-r)}{\sigma} is the Sharpe ratio, F(t,m)=tTln(f(s,m))𝑑sF(t,m)=\int_{t}^{T}\ln(f(s,m))ds,

(6.15) f(t,m):=Φ(Q~b(t;m))Φ(Q~a(t;m)),f(t,m):=\Phi(\widetilde{Q}_{b}(t;m))-\Phi(\widetilde{Q}_{a}(t;m)),

where

\widetilde{Q}_{a}(t;m):=\bigg{(}a_{0}(t)-\frac{K}{\varepsilon}e^{-r(T-t)}\pi^{Merton}\bigg{)}\sqrt{\frac{\varepsilon\sigma^{2}}{m}e^{-(\rho^{2}-2r)(T-t)}};
\widetilde{Q}_{b}(t;m):=\bigg{(}b_{0}(t)-\frac{K}{\varepsilon}e^{-r(T-t)}\pi^{Merton}\bigg{)}\sqrt{\frac{\varepsilon\sigma^{2}}{m}e^{-(\rho^{2}-2r)(T-t)}}.

Moreover, the optimal feedback distribution control \widetilde{\lambda}^{*}(\cdot|t,x;m) is a Gaussian distribution with mean (\frac{K}{\varepsilon}e^{-r(T-t)}-x)\pi^{Merton} and variance \frac{m}{\varepsilon\sigma^{2}}e^{(\rho^{2}-2r)(T-t)}, truncated on the interval [a(t,x),b(t,x)], i.e.

(6.16) λ~(θ|t,x;m)=𝒩(θ|(Kεer(Tt)x)πMerton,mεσ2e(2rρ2)(Tt))|[a(t,x),b(t,x)].\widetilde{\lambda}^{*}(\theta|t,x;m)=\mathcal{N}\left(\theta\bigg{|}(\frac{K}{\varepsilon}e^{-r(T-t)}-x)\pi^{Merton},\frac{m}{\varepsilon\sigma^{2}}e^{-(2r-\rho^{2})(T-t)}\right)\bigg{|}_{[a(t,x),b(t,x)]}.
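The optimal policy (6.16) is straightforward to sample from. A minimal sketch is given below; the values of K, \varepsilon, a_{0}, b_{0} and the market parameters are illustrative assumptions, not the paper's calibration.

```python
import numpy as np
from scipy.stats import truncnorm

# Sampling from the constrained optimal policy (6.16) at a given (t, x).
r, mu, sigma, T, m = 0.02, 0.05, 0.30, 1.0, 0.1   # assumed market data
K, eps = 1.0, 1.0                                 # assumed utility parameters
pi_merton = (mu - r) / sigma**2
rho = (mu - r) / sigma

def sample_policy(t, x, a0, b0, size=5):
    mean = (K / eps * np.exp(-r * (T - t)) - x) * pi_merton
    std = np.sqrt(m / (eps * sigma**2) * np.exp((rho**2 - 2 * r) * (T - t)))
    a_tx = -pi_merton * x + a0                    # state-dependent bounds a(t,x), b(t,x)
    b_tx = -pi_merton * x + b0
    return truncnorm.rvs((a_tx - mean) / std, (b_tx - mean) / std,
                         loc=mean, scale=std, size=size)

print(sample_policy(t=0.5, x=0.5, a0=-0.5, b0=1.5))
```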
Corollary 6.1 (Unconstrained exploratory quadratic portfolio).

The optimal value function of the entropy-regularized exploratory unconstrained optimal investment problem for the quadratic utility U(x)=Kx12εx2U(x)=Kx-\frac{1}{2}\varepsilon x^{2} is given by

(6.17) v~(t,x;m)=\displaystyle\widetilde{v}(t,x;m)= 12εx2e(ρ22r)(Tt)+Kxe(ρ2r)(Tt)K22ε(1eρ2(Tt))\displaystyle-\frac{1}{2}\varepsilon x^{2}e^{-(\rho^{2}-2r)(T-t)}+Kxe^{-(\rho^{2}-r)(T-t)}-\frac{K^{2}}{2\varepsilon}(1-e^{-\rho^{2}(T-t)})
(6.18) +m4(ρ22r)(Tt)2+m2ln(ε1σ22πem)(Tt),\displaystyle+\frac{m}{4}(\rho^{2}-2r)(T-t)^{2}+\frac{m}{2}\ln(\varepsilon^{-1}\sigma^{-2}{2\pi_{e}m})(T-t),

for (t,x)\in[0,T]\times\mathbb{R}. Moreover, the unconstrained optimal feedback distribution control is a Gaussian distribution with mean (\frac{K}{\varepsilon}e^{-r(T-t)}-x)\pi^{Merton} and variance \frac{m}{\varepsilon\sigma^{2}}e^{-(2r-\rho^{2})(T-t)}.

Proof.

This is a direct consequence of Theorem 6.1 by letting a0(t)a_{0}(t)\to-\infty and b0(t)+b_{0}(t)\to+\infty. ∎

Remark 6.1.

Compared to the mean-variance results obtained in Wang and Zhou (2020), the risk-reward parameter K\epsilon^{-1} plays the role of the Lagrange multiplier \omega (in their notation) in the mean-variance setting. It should be mentioned that instead of considering the true portfolio, the authors in Wang and Zhou (2020) study the mean-variance problem for the discounted portfolio, which can be recovered by setting r=0 in our setting. Note that the variance of our optimal policy differs from the one obtained in Wang and Zhou (2020) by a multiplicative factor 2\epsilon^{-1}, since our quadratic utility is nothing but the mean-variance utility of Wang and Zhou (2020) scaled by -\epsilon/2.

Letting m0m\to 0 in Corollary 6.1 we can obtain the unconstrained optimal portfolio without exploration for quadratic utility function.

Corollary 6.2 (Unconstrained quadratic portfolio).

For the quadratic utility U(x)=Kx-\frac{1}{2}\varepsilon x^{2}, the optimal value function of the unconstrained optimal investment problem without exploration is given by

(6.19) v~(t,x;0)=\displaystyle\widetilde{v}(t,x;0)= 12εx2e(ρ22r)(Tt)+Kxe(ρ2r)(Tt)K22ε(1eρ2(Tt))\displaystyle-\frac{1}{2}\varepsilon x^{2}e^{-(\rho^{2}-2r)(T-t)}+Kxe^{-(\rho^{2}-r)(T-t)}-\frac{K^{2}}{2\varepsilon}(1-e^{-\rho^{2}(T-t)})

and the unconstrained optimal control strategy is given by (Kεer(Tt)x)πMerton(\frac{K}{\varepsilon}e^{-r(T-t)}-x)\pi^{Merton}.

Following similar arguments used in Proposition 4.1 and Lemma 4.4 we can confirm that the exploration cost for the constrained problem is smaller than that of the unconstrained problem for a quadratic utility function.

Proposition 6.1.

In the constrained problem with exploration and quadratic utility function, the exploration cost is given by

\widetilde{L}^{[a,b]}(T,x;m)=\frac{mT}{2}+m\int_{0}^{T}\frac{\widetilde{Q}_{a}(t;m)\varphi(\widetilde{Q}_{a}(t;m))-\widetilde{Q}_{b}(t;m)\varphi(\widetilde{Q}_{b}(t;m))}{f(t,m)}dt\leq\frac{mT}{2}.

Moreover, \lim_{m\to 0}\widetilde{L}^{[a,b]}(T,x;m)=0.

Similarly to Theorems 4.3-4.4 for logarithmic utility, a policy improvement theorem can be shown for quadratic utility functions. Below we show that the optimal solution of the exploratory quadratic utility problem is mean-variance efficient (see e.g. Luenberger (1998); Kato et al. (2020); Duffie and Richardson (1991) for similar discussions in the case without exploration).

Proposition 6.2.

Assume that the agent’s initial wealth is smaller than the discounted reward level Kϵ\frac{K}{\epsilon}, i.e. x0KϵerTx_{0}\leq\frac{K}{\epsilon}e^{-rT}. Then, the exploratory unconstrained optimal portfolio belongs to the mean-variance frontier.

Remark 6.2.

Note that since the exploration parameter only affects the diffusion term, we can see that Proposition 6.2 also holds for the unconstrained case without exploration. Moreover, it can be shown that Proposition 6.2 still holds true for the constrained case with exploration. Indeed, recall the optimal constrained exploratory wealth process, whose dynamics are given by

dXtλ~=A~(t,Xtλ~;λ~)dt+B~(t,Xtλ~;λ~)dWt,dX^{\widetilde{\lambda}^{*}}_{t}=\tilde{A}(t,X^{\widetilde{\lambda}^{*}}_{t};\widetilde{\lambda}^{*})dt+\widetilde{B}(t,X^{\widetilde{\lambda}^{*}}_{t};\widetilde{\lambda}^{*})dW_{t},

where A~\tilde{A}, B~\widetilde{B} are given by (C.5). Note that the second term of θ~t,[a,b]\widetilde{\theta}_{t}^{*,[a,b]} in (C.5) is negative because

φ(Q~a(t;m))φ(Q~b(t;m))f(t,m)<0.\frac{\varphi\big{(}\widetilde{Q}_{a}(t;m)\big{)}-\varphi\big{(}\widetilde{Q}_{b}(t;m)\big{)}}{f(t,m)}<0.

Using the comparison principle of ordinary differential equations we can conclude that 𝔼[XTλ~]𝔼[XT0,λ~]Kϵ\mathbb{E}[X^{\widetilde{\lambda}^{*}}_{T}]\leq\mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{T}]\leq\frac{K}{\epsilon}, which implies that the optimal solution of the expected quadratic utility problem with both portfolio constraints and exploration lies on the (constrained) mean-variance frontier.

6.1. Implementation

We consider the case without constraint. Recall the value function

(6.20) v~(t,x;m)=\displaystyle\widetilde{v}(t,x;m)= 12εx2e(ρ22r)(Tt)+Kxe(ρ2r)(Tt)K22ε(1eρ2(Tt))\displaystyle-\frac{1}{2}\varepsilon x^{2}e^{-(\rho^{2}-2r)(T-t)}+Kxe^{-(\rho^{2}-r)(T-t)}-\frac{K^{2}}{2\varepsilon}(1-e^{-\rho^{2}(T-t)})
(6.21) +m4(ρ22r)(Tt)2+m2ln(ε1σ22πem)(Tt),\displaystyle+\frac{m}{4}(\rho^{2}-2r)(T-t)^{2}+\frac{m}{2}\ln(\varepsilon^{-1}\sigma^{-2}{2\pi_{e}m})(T-t),

for (t,x)[0,T]×+(t,x)\in[0,T]\times\mathbb{R}_{+}. We choose to parametrize the value function as

Jθ(t,x;m)=12ϵ(xer(Tt)K/ϵ)2eθ3(Tt)+K2/ϵeθ3(Tt)12K2/ϵ+θ2(Tt)2+θ1(Tt).\displaystyle J^{\theta}(t,x;m)=-\frac{1}{2}\epsilon(xe^{r(T-t)}-K/\epsilon)^{2}e^{-\theta_{3}(T-t)}+K^{2}/\epsilon e^{-\theta_{3}(T-t)}-\frac{1}{2}K^{2}/\epsilon+\theta_{2}(T-t)^{2}+\theta_{1}(T-t).

That is,

{θ1=m/2ln(2πemϵ1σ2)θ2=m/4(ρ22r)θ3=ρ2.\left\{\begin{array}[]{ll}&\theta_{1}=m/2\ln(2\pi_{e}m\epsilon^{-1}\sigma^{-2})\\ &\theta_{2}=m/4(\rho^{2}-2r)\\ &\theta_{3}=\rho^{2}.\end{array}\right.

From here the test functions are chosen as

(6.22) {Jθθ1=(Tt),Jθθ2=(Tt)2,Jθθ3=12ϵ(xer(Tt)K/ϵ)2eθ3(Tt)(Tt)K2/ϵeθ3(Tt)(Tt).\left\{\begin{array}[]{ll}\frac{\partial J^{\theta}}{\partial\theta_{1}}=(T-t),\\ \frac{\partial J^{\theta}}{\partial\theta_{2}}=(T-t)^{2},\\ \frac{\partial J^{\theta}}{\partial\theta_{3}}=\frac{1}{2}\epsilon(xe^{r(T-t)}-K/\epsilon)^{2}e^{-\theta_{3}(T-t)}(T-t)-K^{2}/\epsilon e^{-\theta_{3}(T-t)}(T-t).\end{array}\right.

Also, recall that the unconstrained optimal feedback distribution control is a Gaussian distribution with mean (\frac{K}{\varepsilon}e^{-r(T-t)}-x)\pi^{Merton} and variance \frac{m}{\varepsilon\sigma^{2}}e^{-(2r-\rho^{2})(T-t)}. For the policy distribution, we parametrize it as follows

L(ϕ1,ϕ2,ϕ3)=𝒩((Kεer(Tt)x)ϕ1,eϕ2+ϕ3(Tt)).\displaystyle L(\phi_{1},\phi_{2},\phi_{3})=\mathcal{N}\bigg{(}(\frac{K}{\varepsilon}e^{-r(T-t)}-x)\phi_{1},e^{\phi_{2}+\phi_{3}(T-t)}\bigg{)}.

From here, it can be seen that the entropy is given by

H(ϕ1,ϕ2,ϕ3)=0.5(ln(2πe)+1+ϕ2+ϕ3(Tt));\displaystyle H(\phi_{1},\phi_{2},\phi_{3})=0.5(\ln(2\pi_{e})+1+\phi_{2}+\phi_{3}(T-t));

The log-likelihood, l(ϕ1,ϕ2,ϕ3):=lnL(ϕ1,ϕ2,ϕ3)l(\phi_{1},\phi_{2},\phi_{3}):=\ln L(\phi_{1},\phi_{2},\phi_{3}) is given by

l(ϕ1,ϕ2,ϕ3)(y)=12ln(2π)12(ϕ2+ϕ3(Tt))12(yϕ1(er(Tt)K/ϵx))2eϕ2ϕ3(Tt).l(\phi_{1},\phi_{2},\phi_{3})(y)=-\frac{1}{2}\ln(2\pi)-\frac{1}{2}(\phi_{2}+\phi_{3}(T-t))-\frac{1}{2}\bigg{(}y-\phi_{1}(e^{-r(T-t)}K/\epsilon-x)\bigg{)}^{2}e^{-\phi_{2}-\phi_{3}(T-t)}.

Also, the derivatives of the log-likelihood are given by

lϕ1=(yϕ1(er(Tt)K/ϵx))eϕ2ϕ3(Tt)(er(Tt)K/ϵx)\displaystyle\frac{\partial l}{\partial\phi_{1}}=\bigg{(}y-\phi_{1}(e^{-r(T-t)}K/\epsilon-x)\bigg{)}e^{-\phi_{2}-\phi_{3}(T-t)}(e^{-r(T-t)}K/\epsilon-x)
lϕ2=12+12(yϕ1(er(Tt)K/ϵx))2eϕ2ϕ3(Tt)\displaystyle\frac{\partial l}{\partial\phi_{2}}=-\frac{1}{2}+\frac{1}{2}\bigg{(}y-\phi_{1}(e^{-r(T-t)}K/\epsilon-x)\bigg{)}^{2}e^{-\phi_{2}-\phi_{3}(T-t)}
lϕ3=12(Tt)+12(yϕ1(er(Tt)K/ϵx))2eϕ2ϕ3(Tt)(Tt).\displaystyle\frac{\partial l}{\partial\phi_{3}}=-\frac{1}{2}(T-t)+\frac{1}{2}\bigg{(}y-\phi_{1}(e^{-r(T-t)}K/\epsilon-x)\bigg{)}^{2}e^{-\phi_{2}-\phi_{3}(T-t)}(T-t).
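These expressions translate directly into code. The sketch below (with assumed constants K, \epsilon, r, T) implements the log-likelihood and its gradient and verifies them against finite differences; it is an illustration only, not the paper's implementation.

```python
import numpy as np

# Log-likelihood of the parametrized policy L(phi1, phi2, phi3) and its gradient,
# transcribed from the displays above; K, eps, r, T are assumed example constants.
K, eps, r, T = 1.0, 1.0, 0.02, 1.0

def loglik(y, t, x, phi1, phi2, phi3):
    c = np.exp(-r * (T - t)) * K / eps - x
    return (-0.5 * np.log(2 * np.pi) - 0.5 * (phi2 + phi3 * (T - t))
            - 0.5 * (y - phi1 * c) ** 2 * np.exp(-phi2 - phi3 * (T - t)))

def loglik_grad(y, t, x, phi1, phi2, phi3):
    c = np.exp(-r * (T - t)) * K / eps - x
    e = np.exp(-phi2 - phi3 * (T - t))
    d2 = -0.5 + 0.5 * (y - phi1 * c) ** 2 * e
    return (y - phi1 * c) * e * c, d2, (T - t) * d2

# finite-difference sanity check at an arbitrary point
y, t, x, p = 0.2, 0.5, 0.5, (0.3, -1.0, 0.1)
num = [(loglik(y, t, x, *(p[:i] + (p[i] + 1e-6,) + p[i + 1:]))
        - loglik(y, t, x, *(p[:i] + (p[i] - 1e-6,) + p[i + 1:]))) / 2e-6 for i in range(3)]
print(loglik_grad(y, t, x, *p))
print(num)
```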

6.2. Numerical example

We illustrate the results for the quadratic utility function. Specifically, we consider the scenario without constraints with r=0.02,\mu=0.05,\sigma=0.30,T=1,x_{0}=0.5 and \epsilon=1. We remark that the condition x_{0}\leq\frac{K}{\epsilon}e^{-rT} is fulfilled. First, we run the algorithm for 1000 iterations; the learning rates used to update \theta and \phi are chosen to be \eta_{\theta}=0.01 and \eta_{\phi}=0.01, respectively, with the decay rate l(i)=1/i^{0.51}. Moreover, we choose the exploration rate m=0.01. Figure 7 plots the evolution of the coefficients of J^{\theta}. In this case, the true coefficients are (\theta_{1},\theta_{2},\theta_{3})=(-0.001797,-0.000075,0.010000).

Figure 7. Convergence of \theta_{1} (left), \theta_{2} (center) and \theta_{3} (right).

In this case, it is evident from Figure 7 that (\theta_{1},\theta_{2},\theta_{3}) converges to its true counterpart. Next, recall from Corollary 6.1 that the optimal feedback distribution control is a Gaussian distribution with mean (\frac{K}{\varepsilon}e^{-r(T-t)}-x)\pi^{Merton} and variance \frac{m}{\varepsilon\sigma^{2}}e^{-(2r-\rho^{2})(T-t)}. To have a concrete comparison, we choose x=0.5,t=0.5; we then run the algorithm to obtain estimates for \mu,\sigma. Using these estimates, in Figure 8 we plot the density of the true optimal control versus the density of the learned optimal control for various exploration rates m=1, m=0.1, m=0.01.

Figure 8. Quadratic utility: true policy versus learned policy for (a) m=1; (b) m=0.1; (c) m=0.01.

It is clear from Figure 8 that the learned policies match the true policies extremely well for different exploration rates. This once again confirms the theoretical findings presented in this section. We remark that, compared to the existing literature on the exploratory MV problem studied in e.g. Wang and Zhou (2020); Dai et al. (2020); Jia and Zhou (2022a, b), which has to deal with an additional Lagrange multiplier, our numerical implementation for the quadratic utility is much simpler thanks to our closed-form solutions. As mentioned above, since the classical MV problem is time-inconsistent, the optimal value and strategy might only be meaningful when computed at the initial time t=0. In our case, the optimal investment strategy and value function can be learned at any time t\in[0,T]. Finally, we remark that the above learning procedure can be extended to the constrained problem by slightly adapting the above learning implementation, as was done for the logarithmic utility. Since the paper is rather lengthy, we decided not to present this implementation.

7. An extension to random coefficient models

This section shows that our model can be extended beyond the classical Black-Scholes model. In particular, we assume that the drift and the volatility of the risky asset SS are driven by a state process yy as follows:

(7.1) dSt=St(μ(yt)dt+σ(yt)dWtS),dS_{t}=S_{t}(\mu(y_{t})dt+\sigma(y_{t})dW^{S}_{t}),

and

(7.2) dyt=μY(yt)dt+σY(yt)dWtY,dy_{t}=\mu_{Y}(y_{t})dt+\sigma_{Y}(y_{t})dW_{t}^{Y},

where W^{S},W^{Y} are two independent Brownian motions. We assume that \mu_{Y}, \sigma_{Y} are globally Lipschitz and linearly bounded functions so that there exists a unique strong solution y to the factor SDE (7.2). It is worth mentioning that, due to the additional randomness y, the market is incomplete. Such a factor model has been well studied in the literature. Now, for each m\in\mathbb{R}, the exploratory optimal value function is defined by

\displaystyle v(t,x,y;m):=\max_{\lambda\in\mathcal{H}}\mathbb{E}\Bigg{[}U(X_{T}^{\lambda})-m\int_{t}^{T}\int_{\mathbb{R}}\lambda(\pi|s,X_{s}^{\lambda})\ln\lambda(\pi|s,X_{s}^{\lambda})d\pi ds\bigg{|}X^{\lambda}_{t}=x,y_{t}=y\Bigg{]},

where as before, XλX^{\lambda} is the exploratory wealth process and \mathcal{H} is the set of admissible distributions. Hence, the optimal value function vv satisfies the following HJB equation

vt(t,x,y;m)+μYvy(t,x,y;m)+12σY2(y)vyy(t,x,y;m)\displaystyle v_{t}(t,x,y;m)+\mu_{Y}v_{y}(t,x,y;m)+\frac{1}{2}\sigma_{Y}^{2}(y)v_{yy}(t,x,y;m)
(7.3) \displaystyle+\sup_{\lambda}\bigg{\{}\int_{\mathbb{R}}\bigg{(}(r+\pi(\mu(y)-r))xv_{x}(t,x,y;m)+\frac{1}{2}\sigma^{2}(y)x^{2}\pi^{2}v_{xx}(t,x,y;m)-m\ln\lambda(\pi|t,x,y)\bigg{)}\lambda(\pi|t,x,y)d\pi\bigg{\}}=0,

with terminal condition v(T,x,y;m)=ln(x)v(T,x,y;m)=\ln(x). Following the same steps in Section 3, the optimal distribution λ\lambda^{*} is given by

(7.4) \displaystyle\lambda^{*}(\pi|t,x,y;m):=\frac{\exp\big{\{}\frac{1}{m}\big{(}(r+\pi(\mu(y)-r))xv_{x}(t,x,y;m)+\frac{1}{2}\sigma^{2}(y)x^{2}\pi^{2}v_{xx}(t,x,y;m)\big{)}\big{\}}}{\int_{\mathbb{R}}\exp\big{\{}\frac{1}{m}\big{(}(r+\pi(\mu(y)-r))xv_{x}(t,x,y;m)+\frac{1}{2}\sigma^{2}(y)x^{2}\pi^{2}v_{xx}(t,x,y;m)\big{)}\big{\}}d\pi},

which is Gaussian with mean α\alpha and variance β2\beta^{2} (assuming that vxx<0v_{xx}<0) defined by

(7.5) α=(μ(y)r)xvxσ2(y)x2vxx;β2=mσ2(y)x2vxx.\alpha=-\frac{(\mu(y)-r)xv_{x}}{\sigma^{2}(y)x^{2}v_{xx}};\quad\beta^{2}=-\frac{m}{\sigma^{2}(y)x^{2}v_{xx}}.

As before, using the basic properties of Gaussian laws we obtain the following PDE

vt(t,x,y;m)\displaystyle v_{t}(t,x,y;m) +μY(y)vy(t,x,y;m)+12σY2(y)vyy(t,x,y;m)+rxvx(t,x,y;m)\displaystyle+\mu_{Y}(y)v_{y}(t,x,y;m)+\frac{1}{2}\sigma_{Y}^{2}(y)v_{yy}(t,x,y;m)+rxv_{x}(t,x,y;m)
(7.6) \displaystyle-\frac{1}{2}\frac{(\mu(y)-r)^{2}v_{x}^{2}(t,x,y;m)}{\sigma^{2}(y)v_{xx}(t,x,y;m)}+\frac{m}{2}\ln\bigg{(}-\frac{2\pi_{e}m}{\sigma^{2}(y)x^{2}v_{xx}(t,x,y;m)}\bigg{)}=0,

with terminal condition v(T,x,y;m)=\ln(x). Compared to the previous sections, we now have to deal with a two-dimensional PDE. To solve (7.6), we seek an ansatz of the form v(t,x,y;m)=\ln x+f(t,y;m), where f is a smooth function that depends only on t and y. For this ansatz we obtain the following PDE for f

(7.7) ft(t,y;m)\displaystyle f_{t}(t,y;m) +μY(y)fy(t,y;m)+12σY2(y)fyy(t,y;m)+h(y)=0,f(T,y;m)=0\displaystyle+\mu_{Y}(y)f_{y}(t,y;m)+\frac{1}{2}\sigma_{Y}^{2}(y)f_{yy}(t,y;m)+h(y)=0,\quad f(T,y;m)=0

where h(y):=(r+12(μ(y)r)2σ2(y))+m2ln(2πemσ2(y))h(y):=\bigg{(}r+\frac{1}{2}\frac{(\mu(y)-r)^{2}}{\sigma^{2}(y)}\bigg{)}+\frac{m}{2}\ln\bigg{(}\frac{2\pi_{e}m}{\sigma^{2}(y)}\bigg{)}.

Theorem 7.1.

Assume that μY\mu_{Y} and σY\sigma_{Y} are bounded differentiable functions with bounded derivatives and

(7.8) supymax{σY(y),σY1(y),σ1(y)}<andsupy|(μ(y)r)2σ2(y)|<.\sup_{y\in\mathbb{R}}\max\{\sigma_{Y}(y),\sigma^{-1}_{Y}(y),\sigma^{-1}(y)\}<\infty\quad\mbox{and}\quad\sup_{y\in\mathbb{R}}\bigg{|}\frac{(\mu(y)-r)^{2}}{\sigma^{2}(y)}\bigg{|}<\infty.

Then v(t,x,y;m)=\ln x+f(t,y;m), where f solves the PDE (7.7), is the optimal value function of the exploratory problem. Moreover, the optimal distribution control \lambda^{*} is Gaussian with mean \frac{(\mu(y)-r)}{\sigma^{2}(y)} and variance \frac{m}{\sigma^{2}(y)}, and the optimal exploratory wealth process is given by

(7.9) dXtλ=Xtλ(r+(μ(yt)r)2σ2(yt))dt+(m+(μ(yt)r)2σ2(yt))XtλdWtS,X0λ=x0.\displaystyle dX_{t}^{\lambda^{*}}=X_{t}^{\lambda^{*}}\bigg{(}r+\frac{(\mu(y_{t})-r)^{2}}{\sigma^{2}(y_{t})}\bigg{)}dt+\sqrt{\bigg{(}m+\frac{(\mu(y_{t})-r)^{2}}{\sigma^{2}(y_{t})}\bigg{)}}X_{t}^{\lambda^{*}}dW_{t}^{S}\,,\quad X_{0}^{\lambda^{*}}=x_{0}.
Proof.

Changing the variable g(t,y;m):=f(T-t,y;m), we can rewrite the PDE (7.7) as

(7.10) -g_{t}(t,y;m)+\mu_{Y}(y)g_{y}(t,y;m)+\frac{1}{2}\sigma_{Y}^{2}(y)g_{yy}(t,y;m)+h(y)=0,\quad g(0,y;m)=0.

By assumption, h is continuous and bounded, and the Cauchy problem (7.10) satisfies all conditions of Theorems 12 and 17 (Chapter 1) in Friedman (2008). Therefore, there exists a unique solution g(t,y;m) on [0,T]\times\mathbb{R}. It is now straightforward to see that \ln(x)+f(t,y;m) with f(t,y;m)=g(T-t,y;m) satisfies (7.6). Note that by the Feynman–Kac theorem, the function f(t,y;m) can be represented as

(7.11) f(t,y;m)=\mathbb{E}^{Q}\bigg[\int_{t}^{T}h(\widetilde{y}_{s})\,ds\,\bigg|\,\widetilde{y}_{t}=y\bigg],

where

(7.12) d\widetilde{y}_{t}=\mu_{Y}(\widetilde{y}_{t})dt+\sigma_{Y}(\widetilde{y}_{t})dW^{Q}_{t},

and Q is a probability measure and W^{Q} is a Brownian motion under Q. By following the same steps as in the proof of Theorem 3.1, we can show that (7.4) is admissible and hence is an optimal policy. The corresponding optimal exploratory wealth process is given by

(7.13) dX_{t}^{\lambda^{*}}=X_{t}^{\lambda^{*}}\bigg(r+\frac{(\mu(y_{t})-r)^{2}}{\sigma^{2}(y_{t})}\bigg)dt+\sqrt{m+\frac{(\mu(y_{t})-r)^{2}}{\sigma^{2}(y_{t})}}\,X_{t}^{\lambda^{*}}dW_{t}^{S},\quad X_{0}^{\lambda^{*}}=x_{0}. ∎
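
As an aside (this is not part of the proof), the representation (7.11) lends itself to a straightforward Monte Carlo approximation of f(t,y;m). The following Python sketch is ours; the factor dynamics and the coefficient functions mu, sigma, mu_Y, sigma_Y are illustrative choices consistent with the boundedness assumptions of Theorem 7.1, not specifications from the paper.

import numpy as np

# Illustrative coefficients (all of these are assumptions made for this sketch only)
r, m = 0.02, 0.5
mu = lambda y: 0.05 + 0.03 * np.tanh(y)        # risky-asset drift as a function of the factor
sigma = lambda y: 0.2 + 0.05 * np.tanh(y)**2   # risky-asset volatility, bounded away from zero
mu_Y = lambda y: -0.5 * np.tanh(y)             # bounded, mean-reverting factor drift
sigma_Y = lambda y: 0.3 * np.ones_like(y)      # factor volatility

def h(y):
    # Integrand of the Feynman-Kac representation (7.11)
    return r + 0.5 * (mu(y) - r)**2 / sigma(y)**2 + 0.5 * m * np.log(2 * np.pi * m / sigma(y)**2)

def f_mc(t, y, T=1.0, n_paths=50_000, n_steps=200, seed=0):
    """Monte Carlo estimate of f(t,y;m) = E^Q[ int_t^T h(y_s) ds | y_t = y ] via Euler steps."""
    rng = np.random.default_rng(seed)
    dt = (T - t) / n_steps
    ys = np.full(n_paths, float(y))
    integral = np.zeros(n_paths)
    for _ in range(n_steps):
        integral += h(ys) * dt
        ys = ys + mu_Y(ys) * dt + sigma_Y(ys) * np.sqrt(dt) * rng.standard_normal(n_paths)
    return integral.mean()

print(f_mc(0.0, 0.1))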

Remark 7.1.
  • It is obvious that f(t,y;m)=\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)(T-t)+\frac{m}{2}\ln\big(2\pi_{e}m\sigma^{-2}\big)(T-t) when \sigma_{Y}=\mu_{Y}=0. In this case, we are back to the classical Black–Scholes model of Section 3.

  • Theorem 7.1 can be extended to the case where the factor process y is correlated with the risky asset S. However, the exploratory setting should be adjusted to appropriately capture the correlation; see e.g. Dai et al. (2023).

  • An extension of Theorem 7.1 to the case where the strategy is constrained to a given interval [a,b] can be obtained by adapting the discussion in Section 4.
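
To illustrate Theorem 7.1 numerically, the following Euler–Maruyama sketch (ours; the factor model and all parameter values are illustrative assumptions, and the factor is driven by an independent Brownian motion for simplicity) simulates the exploratory wealth dynamics (7.9).

import numpy as np

# Illustrative parameters and coefficient functions (assumptions for this sketch only)
r, m, T, x0, y0 = 0.02, 0.5, 1.0, 1.0, 0.1
mu = lambda y: 0.05 + 0.03 * np.tanh(y)
sigma = lambda y: 0.2 + 0.05 * np.tanh(y)**2
mu_Y = lambda y: -0.5 * np.tanh(y)
sigma_Y = 0.3

def simulate_exploratory_wealth(n_paths=10_000, n_steps=250, seed=1):
    """Euler-Maruyama discretization of the exploratory wealth SDE (7.9),
    simulated jointly with the factor y."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X = np.full(n_paths, x0)
    y = np.full(n_paths, y0)
    for _ in range(n_steps):
        rho2 = (mu(y) - r)**2 / sigma(y)**2               # squared market price of risk
        dWS = np.sqrt(dt) * rng.standard_normal(n_paths)  # noise driving the wealth
        dWY = np.sqrt(dt) * rng.standard_normal(n_paths)  # noise driving the factor
        X = X * (1 + (r + rho2) * dt + np.sqrt(m + rho2) * dWS)
        y = y + mu_Y(y) * dt + sigma_Y * dWY
    return X

XT = simulate_exploratory_wealth()
print(XT.mean(), XT.std())  # dispersion of terminal wealth grows with the exploration weight m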

8. Conclusion

We study an exploratory version of the continuous-time expected utility maximization problem within a reinforcement learning framework. We show that the optimal feedback policy of the exploratory problem is Gaussian. However, when the risky investment ratio is restricted to a given interval, the constrained optimal exploratory policy follows a truncated Gaussian distribution. For logarithmic and quadratic utility functions, the solution to the exploratory problem can be obtained in closed form and converges to the classical expected utility counterpart when the exploration weight goes to zero. Towards interpretable RL algorithms, a policy improvement theorem is provided. Finally, we devise an implementable reinforcement learning algorithm by casting the optimization problem in a martingale framework. Our work can be extended in various directions. For example, it would be interesting to consider both consumption and investment with general utility functions in more general market settings. We foresee that the q-learning framework developed in Jia and Zhou (2023) might be useful in such a setting. It would also be interesting to extend the work to higher-dimensional settings. We leave these exciting research problems for future studies.

Acknowledgements

Thai Nguyen acknowledges the support of the Natural Sciences and Engineering Research Council of Canada [RGPIN-2021-02594].

References

  • Baird (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
  • Barnard (1993) Barnard, E. (1993). Temporal-difference methods and markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357–365.
  • Bertsekas (2019) Bertsekas, D. (2019). Reinforcement learning and optimal control. Athena Scientific.
  • Bertsimas and Thiele (2006) Bertsimas, D. and Thiele, A. (2006). A robust optimization approach to inventory theory. Operations research, 54(1):150–168.
  • Bielecki et al. (2005) Bielecki, T. R., Jin, H., Pliska, S. R., and Zhou, X. Y. (2005). Continuous-time mean-variance portfolio selection with bankruptcy prohibition. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 15(2):213–244.
  • Bodnar et al. (2013) Bodnar, T., Parolya, N., and Schmid, W. (2013). On the equivalence of quadratic optimization problems commonly used in portfolio theory. European Journal of Operational Research, 229(3):637–644.
  • Charpentier et al. (2021) Charpentier, A., Elie, R., and Remlinger, C. (2021). Reinforcement learning in economics and finance. Computational Economics, pages 1–38.
  • Chen and Vellekoop (2017) Chen, A. and Vellekoop, M. (2017). Optimal investment and consumption when allowing terminal debt. European Journal of Operational Research, 258(1):385–397.
  • Csiszár (1975) Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The annals of probability, pages 146–158.
  • Cuoco (1997) Cuoco, D. (1997). Optimal consumption and equilibrium prices with portfolio constraints and stochastic income. Journal of Economic Theory, 72(1):33–73.
  • Dai et al. (2020) Dai, M., Dong, Y., and Jia, Y. (2020). Learning equilibrium mean-variance strategy. Available at SSRN 3770818.
  • Dai et al. (2023) Dai, M., Dong, Y., and Jia, Y. (2023). Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4):1166–1212.
  • Dai et al. (2021) Dai, M., Jin, H., Kou, S., and Xu, Y. (2021). A dynamic mean-variance analysis for log returns. Management Science, 67(2):1093–1108.
  • Donsker and Varadhan (2006) Donsker, M. and Varadhan, S. (2006). Large deviations for markov processes and the asymptotic evaluation of certain markov process expectations for large times. In Probabilistic Methods in Differential Equations: Proceedings of the Conference Held at the University of Victoria, August 19–20, 1974, pages 82–88. Springer.
  • Doya (2000) Doya, K. (2000). Reinforcement learning in continuous time and space. Neural computation, 12(1):219–245.
  • Duffie and Richardson (1991) Duffie, D. and Richardson, H. R. (1991). Mean-variance hedging in continuous time. The Annals of Applied Probability, pages 1–15.
  • El Karoui and Jeanblanc-Picqué (1998) El Karoui, N. and Jeanblanc-Picqué, M. (1998). Optimization of consumption with labor income. Finance and Stochastics, 2:409–440.
  • Fleming (1976) Fleming, W. H. (1976). Generalized solutions in optimal stochastic control. Brown Univ.
  • Fleming and Nisio (1984) Fleming, W. H. and Nisio, M. (1984). On stochastic relaxed control for partially observed diffusions. Nagoya Mathematical Journal, 93:71–108.
  • Fleming and Soner (2006) Fleming, W. H. and Soner, H. M. (2006). Controlled Markov processes and viscosity solutions, volume 25. Springer Science & Business Media.
  • Frémaux et al. (2013) Frémaux, N., Sprekeler, H., and Gerstner, W. (2013). Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS computational biology, 9(4):e1003024.
  • Friedman (2008) Friedman, A. (2008). Partial differential equations of parabolic type. Courier Dover Publications.
  • Gerrard et al. (2023) Gerrard, R., Kyriakou, I., Nielsen, J. P., and Vodička, P. (2023). On optimal constrained investment strategies for long-term savers in stochastic environments and probability hedging. European Journal of Operational Research, 307(2):948–962.
  • Gosavi (2009) Gosavi, A. (2009). Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192.
  • Guo et al. (2022) Guo, X., Hu, A., Xu, R., and Zhang, J. (2022). A general framework for learning mean-field games. Mathematics of Operations Research.
  • Hambly et al. (2021) Hambly, B., Xu, R., and Yang, H. (2021). Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553.
  • Hendricks and Wilcox (2014) Hendricks, D. and Wilcox, D. (2014). A reinforcement learning extension to the almgren-chriss framework for optimal trade execution. In 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), pages 457–464. IEEE.
  • Huang et al. (2022) Huang, Y.-J., Wang, Z., and Zhou, Z. (2022). Convergence of policy improvement for entropy-regularized stochastic control problems. arXiv preprint arXiv:2209.07059.
  • Jaimungal (2022) Jaimungal, S. (2022). Reinforcement learning and stochastic optimisation. Finance and Stochastics, 26(1):103–129.
  • Jia and Zhou (2022a) Jia, Y. and Zhou, X. Y. (2022a). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154):1–55.
  • Jia and Zhou (2022b) Jia, Y. and Zhou, X. Y. (2022b). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(154):1–55.
  • Jia and Zhou (2023) Jia, Y. and Zhou, X. Y. (2023). Q-learning in continuous time. Journal of Machine Learning Research.
  • Kaelbling et al. (1996) Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285.
  • Kamma and Pelsser (2022) Kamma, T. and Pelsser, A. (2022). Near-optimal asset allocation in financial markets with trading constraints. European Journal of Operational Research, 297(2):766–781.
  • Karatzas and Shreve (1998) Karatzas, I. and Shreve, S. E. (1998). Methods of mathematical finance, volume 39. Springer.
  • Kato et al. (2020) Kato, M., Nakagawa, K., Abe, K., and Morimura, T. (2020). Mean-variance efficient reinforcement learning by expected quadratic utility maximization. arXiv preprint arXiv:2010.01404.
  • Kotz et al. (2004) Kotz, S., Balakrishnan, N., and Johnson, N. L. (2004). Continuous multivariate distributions, Volume 1: Models and applications, volume 1. John Wiley & Sons.
  • Lee and Lee (2021) Lee, H.-R. and Lee, T. (2021). Multi-agent reinforcement learning algorithm to solve a partially-observable multi-agent problem in disaster response. European Journal of Operational Research, 291(1):296–308.
  • Lee and Sutton (2021) Lee, J. and Sutton, R. S. (2021). Policy iterations for reinforcement learning problems in continuous time and space—fundamental theory and methods. Automatica, 126:109421.
  • Li and Xu (2016) Li, X. and Xu, Z. Q. (2016). Continuous-time markowitz’s model with constraints on wealth and portfolio. Operations research letters, 44(6):729–736.
  • Li et al. (2002) Li, X., Zhou, X. Y., and Lim, A. E. (2002). Dynamic mean-variance portfolio selection with no-shorting constraints. SIAM Journal on Control and Optimization, 40(5):1540–1555.
  • Liu et al. (2020) Liu, Y., Chen, Y., and Jiang, T. (2020). Dynamic selective maintenance optimization for multi-state systems over a finite horizon: A deep reinforcement learning approach. European Journal of Operational Research, 283(1):166–181.
  • Luenberger (1998) Luenberger, D. G. (1998). Investment science. Oxford university press.
  • Merton (1975) Merton, R. C. (1975). Optimum consumption and portfolio rules in a continuous-time model. In Stochastic optimization models in finance, pages 621–661. Elsevier.
  • Moody et al. (1998) Moody, J., Wu, L., Liao, Y., and Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470.
  • Mou et al. (2021) Mou, C., Zhang, W., and Zhou, C. (2021). Robust exploratory mean-variance problem with drift uncertainty. arXiv preprint arXiv:2108.04100.
  • Nevmyvaka et al. (2006) Nevmyvaka, Y., Feng, Y., and Kearns, M. (2006). Reinforcement learning for optimized trade execution. In Proceedings of the 23rd international conference on Machine learning, pages 673–680.
  • El Karoui et al. (1987) El Karoui, N., Nguyen, D. H., and Jeanblanc-Picqué, M. (1987). Compactification methods in the control of degenerate diffusions: existence of an optimal control. Stochastics: An International Journal of Probability and Stochastic Processes, 20(3):169–219.
  • Pham (2009) Pham, H. (2009). Continuous-time stochastic control and optimization with financial applications, volume 61. Springer Science & Business Media.
  • Pulley (1983) Pulley, L. B. (1983). Mean-variance approximations to expected logarithmic utility. Operations Research, 31(4):685–696.
  • Rubinstein (1977) Rubinstein, M. (1977). The strong case for the generalized logarithmic utility model as the premier model of financial markets. In Financial Dec Making Under Uncertainty, pages 11–62. Elsevier.
  • Schnaubelt (2022) Schnaubelt, M. (2022). Deep reinforcement learning for the optimal placement of cryptocurrency limit orders. European Journal of Operational Research, 296(3):993–1006.
  • Schneckenreither and Haeussler (2019) Schneckenreither, M. and Haeussler, S. (2019). Reinforcement learning methods for operations research applications: The order release problem. In Machine Learning, Optimization, and Data Science: 4th International Conference, LOD 2018, Volterra, Italy, September 13-16, 2018, Revised Selected Papers 4, pages 545–559. Springer.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489.
  • Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. nature, 550(7676):354–359.
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Tang et al. (2022) Tang, W., Zhang, Y. P., and Zhou, X. Y. (2022). Exploratory hjb equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191–3216.
  • Wang (2019) Wang, H. (2019). Large scale continuous-time mean-variance portfolio allocation via reinforcement learning. arXiv preprint arXiv:1907.11718.
  • Wang et al. (2020) Wang, H., Zariphopoulou, T., and Zhou, X. Y. (2020). Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34.
  • Wang and Zhou (2020) Wang, H. and Zhou, X. Y. (2020). Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308.
  • Williams et al. (2017) Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J. M., Boots, B., and Theodorou, E. A. (2017). Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721. IEEE.

Appendix A Technical proofs of Section 3

A.1. Proof of Theorem 3.1

Proof.

Recall that U(x)=\ln(x). We look for a solution of the HJB equation (2.12) of the form v(t,x;m)=k(t)\ln x+l(t), where k,l:[0,T]\to\mathbb{R} are smooth functions satisfying the terminal conditions k(T)=1, l(T)=0. Direct calculation shows that (2.12) boils down to

(A.1) k^{\prime}(t)\ln x+\bigg(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\bigg)k(t)-\frac{m}{2}\ln k(t)+l^{\prime}(t)+c_{m}=0,\quad\forall(t,x)\in[0,T)\times\mathbb{R}_{+},

where c_{m}:=\frac{m}{2}\ln\big(2\pi_{e}m\sigma^{-2}\big) and \pi_{e}=3.14\ldots is Archimedes' constant. It follows that k^{\prime}(t)=0 and

\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)k(t)-\frac{m}{2}\ln k(t)+l^{\prime}(t)+c_{m}=0.

Hence, k(t)=1 and

l(t)=-\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)t-c_{m}t+C,

where the constant C is chosen such that the terminal condition l(T)=0 is fulfilled. Direct calculation leads to

l(t)=\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)(T-t)+c_{m}(T-t)

and the solution of the PDE (2.12) is given by (3.1). Let us now show that v(t,x;m) in (3.1) is the optimal value function. Indeed, it follows that the optimal distribution \lambda^{*} is a Gaussian distribution with mean \alpha=\frac{\mu-r}{\sigma^{2}} (independent of the exploration parameter m) and variance \beta^{2}=\frac{m}{\sigma^{2}}. This optimal feedback Gaussian control allows us to determine the exploratory wealth drift and volatility from (2.6) as

\hat{A}(t,x;\lambda^{*})=x\bigg(r+\frac{(\mu-r)^{2}}{\sigma^{2}}\bigg);\quad\hat{B}^{2}(t,x;\lambda^{*})=x^{2}\bigg(m+\frac{(\mu-r)^{2}}{\sigma^{2}}\bigg).

Hence, the exploratory wealth dynamics is given by (3.3). Now, it is clear that the SDE (3.3) admits the solution

(A.2) X_{t}^{\lambda^{*}}=x_{0}\exp\bigg\{\bigg(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}-\frac{1}{2}m\bigg)t+\sqrt{m+\frac{(\mu-r)^{2}}{\sigma^{2}}}\,W_{t}\bigg\}.

Moreover, it can be seen directly from (A.2) that

\mathbb{E}[\ln X_{T}^{\lambda^{*}}]=\ln x_{0}+\bigg(r+\frac{(\mu-r)^{2}}{2\sigma^{2}}\bigg)T-\frac{mT}{2}<\infty,

which means that \lambda^{*} is admissible. ∎
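
As a quick sanity check of (A.2) (this is our own illustration, not part of the paper's numerical study; the parameter values are arbitrary), the following Python sketch simulates terminal wealth under the exploratory optimal policy and compares the sample average of \ln X_T with the closed-form expression above.

import numpy as np

# Illustrative parameters (assumptions for this sketch only)
r, mu, sigma, m = 0.02, 0.08, 0.2, 0.5
T, x0, n_paths = 1.0, 1.0, 200_000

rng = np.random.default_rng(0)
W_T = rng.normal(0.0, np.sqrt(T), n_paths)    # terminal value of the Brownian motion

rho2 = (mu - r)**2 / sigma**2                 # squared market price of risk
drift = r + 0.5 * rho2 - 0.5 * m              # log-drift in (A.2)
vol = np.sqrt(m + rho2)                       # exploratory volatility

log_XT = np.log(x0) + drift * T + vol * W_T   # logarithm of (A.2) at t = T

closed_form = np.log(x0) + (r + 0.5 * rho2) * T - 0.5 * m * T
print(log_XT.mean(), closed_form)             # the two values should be close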

A.2. Proof of Theorem 3.4

Proof.

Recall first that for any admissible \lambda (not necessarily Gaussian), the corresponding value function v^{\lambda}(s,x;m) defined by (3.10) solves the following average PDE

v_{t}^{\lambda}(t,x;m)+\int_{\mathcal{C}}\bigg((r+\pi(\mu-r))xv_{x}^{\lambda}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}^{\lambda}(t,x;m)-m\ln\lambda(\pi|t,x)\bigg)\lambda(\pi|t,x)\,d\pi=0

with terminal condition v^{\lambda}(T,x;m)=U(x). It follows that

v_{t}^{\lambda}(t,x;m)+\sup_{\hat{\lambda}}\int_{\mathcal{C}}\bigg((r+\pi(\mu-r))xv_{x}^{\lambda}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}^{\lambda}(t,x;m)-m\ln\hat{\lambda}(\pi|t,x)\bigg)\hat{\lambda}(\pi|t,x)\,d\pi\geq 0.

Note that the above supremum is attained at the updated policy \widetilde{\lambda} defined by (3.11). In other words,

v_{t}^{\lambda}(t,x;m)+\int_{\mathcal{C}}\bigg((r+\pi(\mu-r))xv_{x}^{\lambda}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}^{\lambda}(t,x;m)-m\ln\widetilde{\lambda}(\pi|t,x)\bigg)\widetilde{\lambda}(\pi|t,x)\,d\pi\geq 0,

which implies that

(A.3) v_{t}^{\lambda}(t,x;m)+\int_{\mathcal{C}}\bigg((r+\pi(\mu-r))xv_{x}^{\lambda}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}^{\lambda}(t,x;m)\bigg)\widetilde{\lambda}(\pi|t,x)\,d\pi\geq m\int_{\mathcal{C}}\ln\widetilde{\lambda}(\pi|t,x)\widetilde{\lambda}(\pi|t,x)\,d\pi\geq 0.

Now, for the policy \widetilde{\lambda}, the corresponding value function is given by

(A.4) v^{\widetilde{\lambda}}(t,x;m):=\mathbb{E}\bigg[U(X_{T}^{\widetilde{\lambda}})-m\int_{t}^{T}\int_{\mathcal{C}}\widetilde{\lambda}(\pi|s,X^{\widetilde{\lambda}}_{s})\ln\widetilde{\lambda}(\pi|s,X^{\widetilde{\lambda}}_{s})\,d\pi\,ds\,\bigg|\,X^{\widetilde{\lambda}}_{t}=x\bigg].

Let Y_{t}:=v^{\lambda}(t,X_{t}^{\widetilde{\lambda}};m). By Itô's formula we obtain

v^{\lambda}(s,X_{s}^{\widetilde{\lambda}};m)=v^{\lambda}(t,x;m)+\int_{t}^{s}\hat{B}(u,X_{u}^{\widetilde{\lambda}};\widetilde{\lambda})v^{\lambda}_{x}(u,X_{u}^{\widetilde{\lambda}};m)\,dW_{u}+\int_{t}^{s}\bigg(v^{\lambda}_{t}(u,X_{u}^{\widetilde{\lambda}};m)+\hat{A}(u,X_{u}^{\widetilde{\lambda}};\widetilde{\lambda})v^{\lambda}_{x}(u,X_{u}^{\widetilde{\lambda}};m)+\frac{1}{2}\hat{B}^{2}(u,X_{u}^{\widetilde{\lambda}};\widetilde{\lambda})v^{\lambda}_{xx}(u,X_{u}^{\widetilde{\lambda}};m)\bigg)du.

Let \{\tau_{n}\}_{n\geq 1} be a localization sequence for the above stochastic integral, i.e. \tau_{n}:=\min\{s\geq t:\mathbb{E}[\int_{t}^{s\wedge\tau_{n}}\hat{B}^{2}(u,X_{u}^{\widetilde{\lambda}};\widetilde{\lambda})(v^{\lambda}_{x}(u,X_{u}^{\widetilde{\lambda}};m))^{2}du]\geq n\}\wedge T. Clearly, \lim_{n\to\infty}\tau_{n}=T. By (A.3) we obtain, for t\leq s\leq T,

v^{\lambda}(t,x;m)=\mathbb{E}\bigg[v^{\lambda}(s\wedge\tau_{n},X_{s\wedge\tau_{n}}^{\widetilde{\lambda}};m)-\int_{t}^{s\wedge\tau_{n}}\bigg(v^{\lambda}_{t}(u,X_{u}^{\widetilde{\lambda}};m)+\hat{A}(u,X_{u}^{\widetilde{\lambda}};\widetilde{\lambda})v^{\lambda}_{x}(u,X_{u}^{\widetilde{\lambda}};m)+\frac{1}{2}\hat{B}^{2}(u,X_{u}^{\widetilde{\lambda}};\widetilde{\lambda})v^{\lambda}_{xx}(u,X_{u}^{\widetilde{\lambda}};m)\bigg)du\,\bigg|\,X_{t}^{\widetilde{\lambda}}=x\bigg]\leq\mathbb{E}\bigg[v^{\lambda}(s\wedge\tau_{n},X_{s\wedge\tau_{n}}^{\widetilde{\lambda}};m)-m\int_{t}^{s\wedge\tau_{n}}\int_{\mathcal{C}}\ln\widetilde{\lambda}(\pi|u,X^{\widetilde{\lambda}}_{u})\widetilde{\lambda}(\pi|u,X^{\widetilde{\lambda}}_{u})\,d\pi\,du\,\bigg|\,X_{t}^{\widetilde{\lambda}}=x\bigg].

By taking s=T, sending n\to\infty and using (A.4), we obtain that for (t,x)\in[0,T)\times\mathbb{R}_{+},

v^{\lambda}(t,x;m)\leq\mathbb{E}\bigg[U(X_{T}^{\widetilde{\lambda}})-m\int_{t}^{T}\int_{\mathcal{C}}\ln\widetilde{\lambda}(\pi|u,X^{\widetilde{\lambda}}_{u})\widetilde{\lambda}(\pi|u,X^{\widetilde{\lambda}}_{u})\,d\pi\,du\,\bigg|\,X_{t}^{\widetilde{\lambda}}=x\bigg]=v^{\widetilde{\lambda}}(t,x;m). ∎

A.3. Proof of Theorem 3.5

Proof.

Observe first that v^{\lambda^{0}}(s,x;m) solves the following average PDE

v_{t}^{\lambda^{0}}(t,x;m)+\int_{\mathcal{C}}\bigg((r+\pi(\mu-r))xv_{x}^{\lambda^{0}}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}^{\lambda^{0}}(t,x;m)-m\ln\lambda_{t}^{0}(\pi)\bigg)\lambda_{t}^{0}(\pi)\,d\pi=0.

Solving the above PDE with terminal condition v^{\lambda^{0}}(T,x;m)=\ln x and \lambda^{0}(\pi|t,x;m)=\mathcal{N}(\pi|a,b^{2}) with a,b>0, we obtain v^{\lambda^{0}}(t,x;m)=\ln x+h(t), where h is a continuous function depending only on t with h(T)=0. Therefore, the value function v^{\lambda^{0}} satisfies the hypotheses of Theorem 3.4 and its conclusions apply. In particular, the following Gaussian policy

(A.5) \lambda^{1}(\pi|t,x;m)=\mathcal{N}\bigg(\pi\,\bigg|-\frac{(\mu-r)xv_{x}^{\lambda^{0}}}{\sigma^{2}x^{2}v_{xx}^{\lambda^{0}}},\,-\frac{m}{\sigma^{2}x^{2}v^{\lambda^{0}}_{xx}}\bigg)=\mathcal{N}\bigg(\pi\,\bigg|\,\frac{\mu-r}{\sigma^{2}},\,\frac{m}{\sigma^{2}}\bigg)

is admissible and

(A.6) v^{\lambda^{1}}(t,x;m)\geq v^{\lambda^{0}}(t,x;m),\quad(t,x)\in[0,T)\times\mathbb{R}_{+},

where v^{\lambda^{1}}(t,x;m) is the value function corresponding to this new policy \lambda^{1}(\pi|t,x;m). Again, v^{\lambda^{1}}(s,x;m) solves the following average PDE

v_{t}^{\lambda^{1}}(t,x;m)+\int_{\mathcal{C}}\bigg((r+\pi(\mu-r))xv_{x}^{\lambda^{1}}(t,x;m)+\frac{1}{2}\sigma^{2}x^{2}\pi^{2}v_{xx}^{\lambda^{1}}(t,x;m)-m\ln\lambda^{1}(\pi|t,x;m)\bigg)\lambda^{1}(\pi|t,x;m)\,d\pi=0.

Direct computations using the distribution \lambda^{1} lead to the following PDE

v_{t}^{\lambda^{1}}(t,x;m)+\bigg(r+\frac{(\mu-r)^{2}}{\sigma^{2}}\bigg)xv_{x}^{\lambda^{1}}(t,x;m)+\frac{1}{2}\bigg(\frac{(\mu-r)^{2}}{\sigma^{2}}+m\bigg)x^{2}v_{xx}^{\lambda^{1}}(t,x;m)+\frac{m}{2}\bigg(1+\ln\bigg(\frac{2\pi_{e}m}{\sigma^{2}}\bigg)\bigg)=0.

It is hence straightforward to see that v^{\lambda^{1}}(t,x;m)=v(t,x;m) given by (3.1) solves the above PDE, and the proof is complete. ∎
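
To make the one-step convergence concrete, here is a minimal Python sketch (ours, with illustrative parameter values): for any value function of the form v(t,x)=\ln x+h(t) we have xv_{x}=1 and x^{2}v_{xx}=-1, so the improvement map of Theorem 3.4 returns the Gaussian policy with mean (\mu-r)/\sigma^{2} and variance m/\sigma^{2}, regardless of the initial Gaussian policy \lambda^{0}.

import numpy as np

def improved_policy(mu, r, sigma, m, x_vx, x2_vxx):
    """Policy-improvement map of Theorem 3.4: returns (mean, variance) of the
    updated Gaussian policy given x*v_x and x^2*v_xx of the current value function."""
    mean = -(mu - r) * x_vx / (sigma ** 2 * x2_vxx)
    var = -m / (sigma ** 2 * x2_vxx)
    return mean, var

# Illustrative parameters (assumptions for this sketch)
mu, r, sigma, m = 0.08, 0.02, 0.2, 0.5

# For v(t,x) = ln x + h(t): x*v_x = 1 and x^2*v_xx = -1, whatever h is.
mean1, var1 = improved_policy(mu, r, sigma, m, x_vx=1.0, x2_vxx=-1.0)
print(mean1, var1)                        # -> (mu-r)/sigma^2, m/sigma^2
print((mu - r) / sigma**2, m / sigma**2)  # optimal exploratory policy of Theorem 3.1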

A.4. Proof of Theorem 3.6

Proof.

The proof is similar to that in Jia and Zhou (2022a,b) and can be done as follows. First, by the Markov property of the process X^{\lambda}_{t}, it can be observed directly that

Y_{s}=J(s,X^{\lambda}_{s};\lambda)-m\int_{t}^{s}\int_{\mathcal{C}}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})\,d\pi\,du=\mathbb{E}\bigg[U(X_{T}^{\lambda})-m\int_{s}^{T}\int_{\mathcal{C}}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})\,d\pi\,du\,\bigg|\,X_{s}^{\lambda}\bigg]-m\int_{t}^{s}\int_{\mathcal{C}}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})\,d\pi\,du=\mathbb{E}\bigg[U(X_{T}^{\lambda})-m\int_{t}^{T}\int_{\mathcal{C}}\lambda(\pi|u,X^{\lambda}_{u})\ln\lambda(\pi|u,X^{\lambda}_{u})\,d\pi\,du\,\bigg|\,X_{s}^{\lambda}\bigg]=\mathbb{E}\big[Y_{T}\,|\,X_{s}^{\lambda}\big],

which is the claimed martingale property. ∎
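
For readers curious how a martingale characterization of this type is typically turned into a training objective, the following generic Python sketch (ours, in the spirit of the martingale-loss approach of Jia and Zhou (2022a); it is not the paper's Algorithm) penalizes deviations of Y_t from Y_T along simulated trajectories for a parametrized value function. The parametrization v_theta, the entropy-cost function entropy_fn and the sampled paths are hypothetical placeholders.

import numpy as np

def martingale_loss(theta, paths, dt, m, v_theta, entropy_fn, U):
    """Squared martingale-type loss for policy evaluation: with
    Y_t = v_theta(t, X_t) - m * (running entropy cost), the martingale property
    suggests minimizing the average squared gap between Y_T and Y_t."""
    total = 0.0
    for t_grid, X in paths:                                  # X: wealth path sampled on t_grid
        running_entropy = np.cumsum(entropy_fn(t_grid, X)) * dt
        Y = v_theta(theta, t_grid, X) - m * running_entropy
        Y_T = U(X[-1]) - m * running_entropy[-1]             # terminal target uses U(X_T)
        total += np.mean((Y_T - Y) ** 2)
    return total / len(paths)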

Appendix B Technical Proofs of Section 4

B.1. Proof of Theorem 4.1

Proof.

Similar to the unconstrained problem, we look for a solution of the HJB equation (4.6) of the form v^{[a,b]}(t,x;m)=k(t)\ln x+l(t), where k,l:[0,T]\to\mathbb{R} are smooth functions satisfying the terminal conditions k(T)=1, l(T)=0. Direct calculation shows that

A(m)=(a-\pi^{Merton})\sigma m^{-1/2};\quad B(m)=(b-\pi^{Merton})\sigma m^{-1/2},

which are independent of (t,x). Hence, Z(m)=Z_{a,b}(m) is also independent of (t,x). Now, (4.6) boils down to

(B.1) k^{\prime}(t)\ln x+\bigg(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\bigg)k(t)-\frac{m}{2}\ln k(t)+l^{\prime}(t)+c_{m}+m\ln Z_{a,b}(m)=0,

for (t,x)\in[0,T)\times\mathbb{R}_{+}, where c_{m}:=\frac{m}{2}\ln\big(2\pi_{e}m\sigma^{-2}\big). It follows that k(t)=1 and

l(t)=-\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)t-c_{m}t+C-mt\ln Z_{a,b}(m),

where the constant C is chosen such that the terminal condition l(T)=0 is fulfilled. Direct calculation leads to

l(t)=\big(r+\frac{1}{2}\frac{(\mu-r)^{2}}{\sigma^{2}}\big)(T-t)+c_{m}(T-t)+m\ln Z_{a,b}(m)(T-t)

and the solution of the PDE (4.6) is given by (4.9). It is straightforward to verify that v^{[a,b]} defined by (4.9) solves (4.6). Indeed, it follows that the optimal distribution \lambda^{*,[a,b]} is now a Gaussian distribution with mean \alpha=\frac{\mu-r}{\sigma^{2}} (independent of exploration) and variance \beta^{2}=\frac{m}{\sigma^{2}}, truncated on the interval [a,b]. The density of \lambda^{*,[a,b]} is given by (4.14), which is independent of the wealth level. This truncated Gaussian control allows us to determine the exploratory wealth drift and volatility as

(B.2) \hat{A}(t,x;\lambda^{*})=x\big(r+\pi_{t}^{*,[a,b]}(\mu-r)\big);\quad\hat{B}^{2}(t,x;\lambda^{*})=x^{2}\sigma^{2}(q_{t}^{*,[a,b]})^{2},

where \pi_{t}^{*,[a,b]} and (q_{t}^{*,[a,b]})^{2} are given by (4.14). This leads to the exploratory wealth dynamics (4.12). Now, it is clear that the SDE (4.12) admits a unique solution X^{\lambda^{*,[a,b]}} and \mathbb{E}[\ln X_{T}^{\lambda^{*,[a,b]}}]<\infty, which means that \lambda^{*,[a,b]} is admissible. ∎
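
As an illustration (ours; the constraint interval and the market parameters below are arbitrary choices, not values from the paper), the constrained optimal policy obtained above, namely a Gaussian with mean (\mu-r)/\sigma^{2} and variance m/\sigma^{2} truncated to [a,b], can be sampled directly with scipy:

import numpy as np
from scipy.stats import truncnorm

# Illustrative parameters (assumed for this sketch)
mu, r, sigma, m = 0.08, 0.02, 0.2, 0.5
a, b = 0.0, 1.0                     # e.g. no short-selling and no borrowing

loc = (mu - r) / sigma**2           # untruncated mean (Merton ratio)
scale = np.sqrt(m) / sigma          # untruncated standard deviation

# scipy's truncnorm expects bounds standardized by (loc, scale)
dist = truncnorm((a - loc) / scale, (b - loc) / scale, loc=loc, scale=scale)

samples = dist.rvs(size=100_000, random_state=0)
print(dist.mean(), dist.var())      # first two moments of the constrained policy
print(samples.mean(), samples.var())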

B.2. Proof of Lemma 4.4

Proof.

Observe first that the function x\varphi(x) is increasing on (-1,1) and takes negative (resp. positive) values when x<0 (resp. x>0). Therefore, A(m)\varphi(A(m))-B(m)\varphi(B(m))\leq 0 in each of the following cases:

  • A(m)<0<B(m), which corresponds to a<\pi^{Merton}<b;

  • 0\leq A(m)\leq B(m)\leq 1, which corresponds to \pi^{Merton}\leq a<b\leq\pi^{Merton}+\sqrt{m}\sigma^{-1};

  • -1\leq A(m)\leq B(m)\leq 0, which corresponds to \pi^{Merton}-\sqrt{m}\sigma^{-1}\leq a<b<\pi^{Merton}.

In each of these cases, by comparing Propositions 4.1 and 3.1 we obtain

L^{[a,b]}(T,x;m)=\frac{mT}{2}+mT\frac{A(m)\varphi(A(m))-B(m)\varphi(B(m))}{2Z_{a,b}(m)}\leq\frac{mT}{2}=L(T,x;m)

and Lemma 4.4 is proved. ∎
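
The monotonicity argument above can also be checked numerically. The following Python sketch (ours; \varphi denotes the standard normal density) samples each of the three cases and confirms that A\varphi(A)-B\varphi(B)\leq 0:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def gap(A, B):
    # A*phi(A) - B*phi(B) with phi the standard normal density
    return A * norm.pdf(A) - B * norm.pdf(B)

# Case 1: A < 0 < B
A = -rng.uniform(0, 5, 100_000); B = rng.uniform(0, 5, 100_000)
print(gap(A, B).max())            # <= 0

# Case 2: 0 <= A <= B <= 1
A = rng.uniform(0, 1, 100_000); B = rng.uniform(0, 1, 100_000)
A, B = np.minimum(A, B), np.maximum(A, B)
print(gap(A, B).max())            # <= 0

# Case 3: -1 <= A <= B <= 0
A = -rng.uniform(0, 1, 100_000); B = -rng.uniform(0, 1, 100_000)
A, B = np.minimum(A, B), np.maximum(A, B)
print(gap(A, B).max())            # <= 0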

Appendix C Technical proofs for Section 6

C.1. Proof of Theorem 6.1

Proof.

We look for a solution of the HJB equation (6.8) of the form \widetilde{v}^{[a,b]}(t,x;m)=k(t)x^{2}+l(t)x+q(t), where k,l,q:[0,T]\to\mathbb{R} are smooth functions satisfying the terminal conditions k(T)=-\frac{\varepsilon}{2}, l(T)=K and q(T)=0. Differentiating this ansatz, plugging it into the PDE (6.8) and matching the coefficients of x^{2} and x, we obtain

k(t)=-\frac{1}{2}\varepsilon e^{-(\rho^{2}-2r)(T-t)},\quad l(t)=Ke^{-(\rho^{2}-r)(T-t)}.

Similarly, considering the coefficient of x^{0} we obtain the following ODE:

(C.1) q^{\prime}(t)-\frac{1}{2}\rho^{2}\frac{l^{2}(t)}{2k(t)}-\frac{m}{2}\ln(-2k(t))+\frac{m}{2}\ln\big(2\pi_{e}m\sigma^{-2}\big)+m\ln(Z(t,x;m))=0,\quad q(T)=0.

Direct calculation shows that

(C.2) \widetilde{Q}_{a}(t,x;m)=\bigg(a_{0}(t)-\frac{K}{\varepsilon}e^{-r(T-t)}\pi^{Merton}\bigg)\sqrt{\frac{\varepsilon\sigma^{2}}{m}e^{-(\rho^{2}-2r)(T-t)}}=\widetilde{Q}_{a}(t;m);
(C.3) \widetilde{Q}_{b}(t,x;m)=\bigg(b_{0}(t)-\frac{K}{\varepsilon}e^{-r(T-t)}\pi^{Merton}\bigg)\sqrt{\frac{\varepsilon\sigma^{2}}{m}e^{-(\rho^{2}-2r)(T-t)}}=\widetilde{Q}_{b}(t;m),

which are independent of x. Hence, Z(t,x;m)=f(t,m) as given in (6.15), which is also independent of x. Therefore,

q(t)=-\frac{K^{2}}{2\varepsilon}\big(1-e^{-\rho^{2}(T-t)}\big)+\frac{m}{4}(\rho^{2}-2r)(T-t)^{2}+\frac{m}{2}\ln\big(2\pi_{e}m\varepsilon^{-1}\sigma^{-2}\big)(T-t)-m\int_{t}^{T}\ln(f(s,m))\,ds.

Now, it is straightforward to verify that \widetilde{v}^{[a,b]} defined by (6.14) solves (6.8). Next, it follows that the optimal feedback distributional control \widetilde{\lambda}^{*} is a Gaussian distribution with mean (\frac{K}{\varepsilon}e^{-r(T-t)}-x)\pi^{Merton} and variance \frac{m}{\varepsilon\sigma^{2}}e^{-(2r-\rho^{2})(T-t)}, truncated on the interval [a(t,x),b(t,x)], i.e.

(C.4) \widetilde{\lambda}^{*}(\theta|t,x;m)=\mathcal{N}\bigg(\theta\,\bigg|\,\Big(\frac{K}{\varepsilon}e^{-r(T-t)}-x\Big)\pi^{Merton},\,\frac{m}{\varepsilon\sigma^{2}}e^{-(2r-\rho^{2})(T-t)}\bigg)\bigg|_{[a(t,x),b(t,x)]}.

This explicit truncated Gaussian policy allows us to determine the exploratory wealth drift and volatility as

\widetilde{A}(t,x;\widetilde{\lambda}^{*})=rx+(\mu-r)\widetilde{\theta}_{t}^{*,[a,b]},\quad\widetilde{B}(t,x;\widetilde{\lambda}^{*}):=\sigma\widetilde{q}_{t}^{*,[a,b]},

where

(C.5) \widetilde{\theta}_{t}^{*,[a,b]}=\Big(\frac{K}{\varepsilon}e^{-r(T-t)}-x\Big)\pi^{Merton}+\sqrt{\frac{m}{\varepsilon\sigma^{2}e^{-(\rho^{2}-2r)(T-t)}}}\,\frac{\varphi\big(\widetilde{Q}_{a}(t;m)\big)-\varphi\big(\widetilde{Q}_{b}(t;m)\big)}{f(t,m)}

and

(C.6) (\widetilde{q}_{t}^{*,[a,b]})^{2}=(\widetilde{\theta}_{t}^{*,[a,b]})^{2}+\frac{m}{\varepsilon\sigma^{2}e^{-(\rho^{2}-2r)(T-t)}}\Bigg(1+\frac{\widetilde{Q}_{a}(t;m)\varphi(\widetilde{Q}_{a}(t;m))-\widetilde{Q}_{b}(t;m)\varphi(\widetilde{Q}_{b}(t;m))}{f(t,m)}-\bigg(\frac{\varphi(\widetilde{Q}_{a}(t;m))-\varphi(\widetilde{Q}_{b}(t;m))}{f(t,m)}\bigg)^{2}\Bigg),

which guarantees that the SDE (6.3) admits a strong solution. Finally, it is straightforward to verify the integrability conditions to conclude that \widetilde{\lambda}^{*} is admissible. ∎
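
The expressions (C.5)-(C.6) are instances of the standard truncated-normal moment identities. As a sanity check (ours; the numerical values below are arbitrary), the following Python sketch compares these identities with scipy's truncnorm moments:

import numpy as np
from scipy.stats import norm, truncnorm

# Truncated-normal moment identities: for N(loc, scale^2) truncated to [lo, hi],
# with alpha = (lo-loc)/scale, beta = (hi-loc)/scale and Z = Phi(beta) - Phi(alpha):
#   mean = loc + scale*(phi(alpha) - phi(beta))/Z
#   var  = scale^2*(1 + (alpha*phi(alpha) - beta*phi(beta))/Z - ((phi(alpha) - phi(beta))/Z)**2)
loc, scale, lo, hi = 0.7, 1.3, -0.5, 2.0          # illustrative values
alpha, beta = (lo - loc) / scale, (hi - loc) / scale
Z = norm.cdf(beta) - norm.cdf(alpha)

mean_formula = loc + scale * (norm.pdf(alpha) - norm.pdf(beta)) / Z
var_formula = scale**2 * (1 + (alpha * norm.pdf(alpha) - beta * norm.pdf(beta)) / Z
                          - ((norm.pdf(alpha) - norm.pdf(beta)) / Z)**2)

dist = truncnorm(alpha, beta, loc=loc, scale=scale)
print(mean_formula, dist.mean())   # should agree
print(var_formula, dist.var())     # should agree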

C.2. Proof of Proposition 6.2

Proof.

First, from the quadratic utility form it is easy to see that for any admissible terminal wealth X_{T},

(C.7) \mathbb{E}\Big[KX_{T}-\frac{1}{2}\epsilon X_{T}^{2}\Big]=-\frac{1}{2}\epsilon\bigg(\mathbb{E}[X_{T}]-\frac{K}{\epsilon}\bigg)^{2}+\frac{K^{2}}{2\epsilon}-\frac{1}{2}\epsilon\,\mathrm{Var}[X_{T}].

Observe that when \mathrm{Var}[X_{T}]=\Gamma is fixed, the right-hand side of (C.7) is increasing in \mathbb{E}[X_{T}] as long as \mathbb{E}[X_{T}]\leq\frac{K}{\epsilon}. Therefore, among policies \theta with a fixed variance \mathrm{Var}[X_{T}]=\Gamma and mean \mathbb{E}[X_{T}]\leq\frac{K}{\epsilon}, the policy that maximizes the quadratic expected utility also attains the highest mean \mathbb{E}[X_{T}]. In other words, to show that the unconstrained exploratory optimal terminal wealth X^{0,\widetilde{\lambda}^{*}}_{T} of the quadratic expected utility maximization problem lies on the mean-variance efficient frontier, it suffices to show that \mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{T}]\leq\frac{K}{\epsilon}. Indeed, from (C.5), taking a_{0}(t)=-\infty and b_{0}(t)=+\infty, the unconstrained optimal wealth dynamics with exploration is given by

dX^{0,\widetilde{\lambda}^{*}}_{t}=\bigg(rX^{0,\widetilde{\lambda}^{*}}_{t}+(\mu-r)\Big(\frac{K}{\varepsilon}e^{-r(T-t)}-X^{0,\widetilde{\lambda}^{*}}_{t}\Big)\pi^{Merton}\bigg)dt+\sigma\sqrt{\Big(\frac{K}{\varepsilon}e^{-r(T-t)}-X^{0,\widetilde{\lambda}^{*}}_{t}\Big)^{2}(\pi^{Merton})^{2}+\frac{m}{\varepsilon\sigma^{2}e^{-(\rho^{2}-2r)(T-t)}}}\,dW_{t},

which implies that \mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{t}]=x_{0}+\int_{0}^{t}\Big(r\mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{s}]+\big(\frac{K}{\varepsilon}e^{-r(T-s)}-\mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{s}]\big)\rho^{2}\Big)ds. Putting \psi(t):=\mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{t}], we obtain the following ODE

\psi^{\prime}(t)=\rho^{2}\frac{K}{\epsilon}e^{-r(T-t)}+(r-\rho^{2})\psi(t).

Solving this ODE with initial condition \psi(0)=x_{0} we obtain

(C.8) \psi(t)=\mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{t}]=\frac{K}{\epsilon}e^{-r(T-t)}+x_{0}e^{(r-\rho^{2})t}-\frac{K}{\epsilon}e^{-r(T-t)}e^{-\rho^{2}t}

and \mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{T}]=\frac{K}{\epsilon}+\big(x_{0}e^{rT}-\frac{K}{\epsilon}\big)e^{-\rho^{2}T}. It follows directly from the condition x_{0}\leq\frac{K}{\epsilon}e^{-rT} that \mathbb{E}[X^{0,\widetilde{\lambda}^{*}}_{T}]\leq\frac{K}{\epsilon}. ∎
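
As a small symbolic check (ours, not in the paper), one can verify with sympy that (C.8) solves the ODE above with \psi(0)=x_{0}:

import sympy as sp

t, T, r, rho, K, eps, x0 = sp.symbols('t T r rho K eps x0', positive=True)

# Candidate solution (C.8)
psi = (K/eps)*sp.exp(-r*(T - t)) + x0*sp.exp((r - rho**2)*t) \
      - (K/eps)*sp.exp(-r*(T - t))*sp.exp(-rho**2*t)

# ODE: psi'(t) = rho^2*(K/eps)*exp(-r*(T-t)) + (r - rho^2)*psi(t)
residual = sp.diff(psi, t) - (rho**2*(K/eps)*sp.exp(-r*(T - t)) + (r - rho**2)*psi)

print(sp.simplify(residual))          # 0
print(sp.simplify(psi.subs(t, 0)))    # x0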