

Parameter learning: stochastic optimal control approach with reinforcement learning

Shuzhen Yang Shandong University-Zhong Tai Securities Institute for Financial Studies, Shandong University, PR China, ([email protected]). This work was supported by the National Key R&D program of China (Grant No.2018YFA0703900, ZR2019ZD41), National Natural Science Foundation of China (Grant No.11701330), and Taishan Scholar Talent Project Youth Project.
Abstract

In this study, we develop a stochastic optimal control approach with a reinforcement learning structure to learn the unknown parameters appearing in the drift and diffusion terms of a stochastic differential equation. By choosing an appropriate cost functional and starting from a classical optimal feedback control, we translate the original optimal control problem into a new control problem in which the unknown parameter plays the role of the control, and the related optimal control can be used to estimate the unknown parameter. We establish the mathematical framework of the dynamic equation for the exploratory state, which is consistent with existing results. Furthermore, we consider the linear stochastic differential equation case in which the drift or the diffusion term contains an unknown parameter. Then, we investigate the general case in which both the drift and diffusion terms contain unknown parameters. For these cases, we show that the optimal density function is Gaussian and can be used to estimate the unknown parameters. Based on the obtained parameter estimates, we can carry out empirical analysis for a given model.

KEYWORDS: Parameter estimation; Stochastic optimal control; Reinforcement learning; Linear SDE; Exploration

1 Introduction

Zhou (2023) commented that "We strive to seek optimality, but often find ourselves trapped in bad 'optimal' solutions that are either local optimizers, or are too rigid to leave any room for errors". Indeed, exploration through randomization offers a way to break this curse of optimality. In the present study, we try to balance model-based and model-free methods; that is, we aim to use the reinforcement learning structure to explore the unknown parameters appearing in the model-based method.

Recently, Wang et al. (2020) first considered reinforcement learning in continuous time and space, introducing an exploratory formulation for the state dynamics and an entropy-regularized reward functional. Wang and Zhou (2020) then solved the continuous-time mean-variance portfolio selection problem within this reinforcement learning stochastic optimal control framework. Tang et al. (2022) considered the exploratory Hamilton-Jacobi-Bellman (HJB) equation formulated by Wang et al. (2020) in the context of reinforcement learning in continuous time, and established the well-posedness and regularity of the viscosity solution of the HJB equation. Gao et al. (2022) studied the temperature control problem for Langevin diffusions in the context of nonconvex optimization under the regularized exploratory formulation developed by Wang et al. (2020). Differing from the continuous-time entropy-regularized reinforcement learning problem of Wang et al. (2020), Han et al. (2023) proposed a Choquet regularizer to measure and manage the level of exploration in reinforcement learning.

For the classical stochastic optimal control theory and related applications, we refer the reader to the monographs Fleming and Rishel (1975); Yong and Zhou (1999); Fleming and Soner (2006). For the recursive stochastic optimal control problem and the related dynamic programming principle, see Pardoux and Peng (1990); Peng (1990, 1992). It is well known that classical optimal control theory can be applied to solve the mean-variance model; see Zhou and Li (2000); Basak and Chabakauri (2010); Björk et al. (2014); Dai et al. (2021). When we apply the classical optimal control theory, for example to the mean-variance portfolio problem, we first need to estimate the parameters appearing in the model. Classical statistical methods such as moment estimation and maximum likelihood estimation can be employed, but they rely heavily on stringent assumptions on the observed samples. Furthermore, based on the observations (historical data), it is difficult to estimate time-varying parameters in model-based problems. Reinforcement learning algorithms have been widely used in optimization, engineering, finance, and other fields. In particular, Wang et al. (2020) first introduced reinforcement learning into the continuous-time stochastic optimal control problem.

However, it is important to find a better estimate of the parameters appearing in the model. Based on the estimated parameters, we can carry out empirical and time-trend analysis. Therefore, in this study, we first show how to use a stochastic optimal control approach with a reinforcement learning structure to learn the unknown parameters appearing in the dynamic equation. We consider the following stochastic optimal control problem with an unknown deterministic parameter \beta(\cdot),

\mathrm{d}X^{u}_{s}=b(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}s+\sigma(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}W_{s},\ X^{u}_{t}=x, (1.1)

where b,\sigma are given deterministic functions, W(\cdot) is a Brownian motion, and u(\cdot) is an input control. For a given input control u(\cdot), we can observe the output state X^{u}(\cdot). In the classical mean-variance portfolio framework, the parameter \beta(\cdot) in (1.1) can be used to describe the mean and volatility of the risky asset. Thus, it is useful to obtain an estimate of the unknown parameter \beta(\cdot).

Now, we describe the details of the stochastic optimal control approach with reinforcement learning structure for learning the unknown parameters. Step 1: Based on equation (1.1), by choosing an appropriate cost functional, we obtain a feedback optimal control u^{*}(\cdot) which contains the unknown parameter \beta(\cdot), denoted by u^{*}_{s}=u^{*}(\beta_{s},s,X^{*}_{s}),\ t\leq s\leq T. Step 2: We replace the unknown parameter \beta(\cdot) in the optimal control u^{*}(\cdot) with a new deterministic control \rho(\cdot), and denote the input control by u^{\rho}_{s}=u^{\rho}(\rho_{s},s,X^{\rho}_{s}),\ t\leq s\leq T. Step 3: We put the new control u^{\rho}(\cdot) into equation (1.1) and the related cost functional, and establish a new optimal control problem. Indeed, the optimal control of the new exploratory control problem can be used to estimate the unknown parameters. Therefore, we consider the linear stochastic differential equation case where the drift or the diffusion term contains an unknown parameter, and the general case where both the drift and diffusion terms contain unknown parameters. In both cases, we find that the optimal density function can be used to learn the unknown parameters.

In this paper, we focus on developing the theory of a stochastic optimal control approach with reinforcement learning structure to learn the unknown parameters appearing in equation (1.1). The application of this new theory is left for future work. However, we refer the reader to the following references for the study of algorithms for the optimal control. A unified framework to study policy evaluation and the associated temporal difference methods in continuous time was investigated in Jia and Zhou (2022a). Jia and Zhou (2022b) studied the policy gradient for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). Dong (2022) studied the optimal stopping problem under the exploratory framework, and established the related HJB equation and algorithm. Furthermore, Jia and Zhou (2022c) introduced q-learning in continuous time and space. For policy evaluation, we follow Doya (2000) for learning the value function. An introduction to reinforcement learning can be found in Sutton and Barto (2018); for deep learning and related topics, see Goodfellow et al. (2016).

The main contributions of this study are threefold:

(i). To learn the unknown parameters appearing in the dynamic equation of the state, we develop a stochastic optimal control approach with a reinforcement learning structure.

(ii). By choosing an appropriate cost functional, based on a classical optimal feedback control, we translate the original optimal control problem into a new control problem in which the related optimal control is used to estimate the unknown parameters. Based on the obtained parameter estimates, we can carry out empirical analysis for a given model.

(iii). We consider the linear stochastic differential equation case where the drift or the diffusion term contains an unknown parameter, and the general case where both the drift and diffusion terms contain unknown parameters. We show that the optimal density function is Gaussian and can be used to learn the unknown parameters.

The remainder of this study is organized as follows. In Section 2, we formulate the stochastic optimal control approach with reinforcement learning structure and show how to learn the unknown parameter. In Section 3, we investigate the exploratory HJB equation for this approach. In Section 4, we consider the linear SDE case, where the drift or the diffusion term is allowed to contain the unknown parameter. In Section 5, we generalize the model of Section 4 to the case where both the drift and diffusion terms contain unknown parameters, and give an example to verify the main results. Finally, we conclude this study in Section 6.

2 Formulation of the model

Let (\Omega,\mathcal{F},P) be a probability space, \{W_{t}\}_{t\geq 0} a Brownian motion, and \{\mathcal{F}_{t}\}_{t\geq 0} the filtration generated by \{W_{t}\}_{t\geq 0}, where \mathcal{F}=\mathcal{F}_{T} for a given terminal time T. We introduce the following stochastic differential equation with deterministic unknown parameter \beta_{s}\in\mathbb{R},\ t\leq s\leq T, and control u(\cdot)\in\mathcal{U}[t,T],

\mathrm{d}X^{u}_{s}=b(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}s+\sigma(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}W_{s},\ X^{u}_{t}=x, (2.1)

where \mathcal{U}[t,T] is the set of all progressively measurable, square integrable processes on [t,T] taking values in a Euclidean space. Based on the state \{X^{u}_{s}\}_{t\leq s\leq T}, we consider the running and terminal cost functional,

J(t,x;u(\cdot))=\mathbb{E}\left[\int_{t}^{T}f(X_{s}^{u},u_{s})\mathrm{d}s+\Phi(X_{T}^{u})\right], (2.2)

and the target is to minimize J(t,x;u(\cdot)) over u(\cdot)\in\mathcal{U}[t,T], that is,

\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).

The conditions on the functions b,\sigma,f and \Phi in equations (2.1) and (2.2) will be given later. Under mild conditions, we can show that there exists a feedback optimal control u^{*}_{s}=u^{*}(\beta_{s},s,X^{*}_{s}),\ t\leq s\leq T (see Yong and Zhou (1999) for further details), where X^{*} is the optimal state under the optimal control u^{*}.

Remark 2.1.

Indeed, we can choose a cost functional J(t,x;u(\cdot)) such that there exists a feedback optimal control which contains the unknown parameter \beta(\cdot). See Sections 4 and 5 for further details.

Now, we state the assumptions of this study. The functions b,\sigma,f and \Phi are continuous in all their variables.

Assumption 2.1.

The functions b and \sigma are of linear growth and are Lipschitz continuous in the second and third variables with Lipschitz constant L.

Assumption 2.2.

The functions f(x,u) and \Phi(x) are of polynomial growth in x and u.

Assumption 2.3.

\beta(\cdot) is a deterministic piecewise continuous function on [t,T]. The feedback optimal control u^{*}(\beta_{s},s,x),\ t\leq s\leq T is uniformly Lipschitz continuous in x, with Lipschitz constant L.

In practical analysis, we usually assume the form of the functions b,\sigma,f and \Phi and leave the unknown parameter \beta(\cdot) to be estimated. Some classical statistical estimators can be applied, for example moment estimation and maximum likelihood estimation; however, these estimation methods depend heavily on the observations. In this study, we aim to develop a new estimation method based on the explicit feedback control u^{*}(\cdot).

It should be noted that we can observe the value of the state X^{u}(\cdot) under a given control u(\cdot), but not the unknown parameter \beta(\cdot). In the following, we develop the model used to estimate the parameter \beta(\cdot). Since we do not know the value of the parameter \beta(\cdot), we replace \beta(\cdot) by a new deterministic control process \rho(\cdot) in the feedback optimal control u^{*}(\beta_{s},s,X^{*}_{s}), and denote u^{\rho}_{s}=u^{*}(\rho_{s},s,X^{\rho}_{s}), where X^{\rho}_{s} satisfies the following stochastic differential equation,

\mathrm{d}X^{\rho}_{s}=\hat{b}(\beta_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}s+\hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}W_{s},\ X^{\rho}_{t}=x, (2.3)

where \hat{b}(\beta_{s},X_{s}^{\rho},\rho_{s})=b(\beta_{s},X_{s}^{\rho},u^{\rho}_{s}) and \hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})=\sigma(\beta_{s},X_{s}^{\rho},u^{\rho}_{s}). Under the above assumptions, we have the following existence and uniqueness result for the solution of equation (2.3).

Theorem 2.1.

Let Assumptions 2.1 and 2.3 hold. Then, equation (2.3) admits a unique solution.

Furthermore, we rewrite the cost functional (2.2) as follows,

\hat{J}(t,x;\rho(\cdot))=\mathbb{E}\left[\int_{t}^{T}\hat{f}(X_{s}^{\rho},\rho_{s})\mathrm{d}s+\hat{\Phi}(X_{T}^{\rho})\right], (2.4)

where \rho_{s}\in\mathbb{R}, \hat{f}(X_{s}^{\rho},\rho_{s})=f(X_{s}^{\rho},u^{\rho}_{s}), and \hat{\Phi}(X_{T}^{\rho})=\Phi(X_{T}^{\rho}), t\leq s\leq T. We denote by \mathcal{D}[t,T] the set of all deterministic piecewise continuous functions on [t,T]. Obviously, we have the following equivalence between the cost functionals J(t,x;u(\cdot)) and \hat{J}(t,x;\rho(\cdot)).

Theorem 2.2.

Let Assumption 2.3 hold, J(t,x;u(\cdot)) be the cost functional given in (2.2), and \hat{J}(t,x;\rho(\cdot)) be given in (2.4). Then, we have

\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}\hat{J}(t,x;\rho(\cdot))=\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).
Proof.

Note that \rho(\cdot)\in\mathcal{D}[t,T] is a deterministic piecewise continuous process. From Assumption 2.3, it is easy to verify that u^{\rho}(\cdot)=u^{*}(\rho(\cdot),\cdot,X^{\rho}(\cdot))\in\mathcal{U}[t,T]. Thus,

\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}\hat{J}(t,x;\rho(\cdot))\geq\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).

On the other hand, taking \rho(\cdot)=\beta(\cdot)\in\mathcal{D}[t,T] gives u^{\rho}(\cdot)=u^{*}(\beta(\cdot),\cdot,X^{\rho}(\cdot))=u^{*}(\cdot), and since

J(t,x;u^{*}(\cdot))=\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)),

we have

\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}\hat{J}(t,x;\rho(\cdot))\leq\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).

This completes the proof. ∎

Remark 2.2.

Based on Theorem 2.2, if the cost functional \hat{J}(t,x;\rho(\cdot)) admits a unique optimal control \rho^{*}(\cdot), then \beta(\cdot)=\rho^{*}(\cdot), which means that we can use the new optimal control problem to estimate the parameter in the original optimal control problem. Indeed, we can choose a cost functional which satisfies this requirement. Further details can be found in Sections 4 and 5.

From equation (2.3), we can observe the value of the state X^{\rho}(\cdot) for a given input control \rho(\cdot). Following the idea investigated in Wang et al. (2020), we use a reinforcement learning (RL) structure to learn the optimal control \rho^{*}(\cdot). Motivated by the law of large numbers applied to samples of the state under the density function \pi(\cdot), Wang et al. (2020) introduced the following exploratory dynamic equation for the state X^{\pi}(\cdot),

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (2.5)

where \pi_{s} is the density function of the input control \rho_{s},

\tilde{b}(\beta_{s},x,\pi_{s})=\int_{\mathbb{R}}\hat{b}(\beta_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho},

for t\leq s\leq T.

It should be noted that the observations of the exploratory state X^{\pi}(\cdot) satisfy equation (2.3). Based on this fact, we can prove that X^{\pi}(\cdot) satisfies equation (2.5).

Theorem 2.3.

Let Assumptions 2.1, 2.2, and 2.3 hold, and assume that the state X^{\pi}(\cdot) follows a diffusion process. Then, X^{\pi}(\cdot) satisfies equation (2.5).

Proof.

We assume that the state X^{\pi}(\cdot) follows the diffusion process,

\mathrm{d}X^{\pi}_{s}=h(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+g(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (2.6)

where the functions h and g are to be determined.

Let us consider a sequence of samples \{x^{i}(\cdot)\}_{i=1}^{N} which are the observations of X^{\pi}(\cdot) with \{W^{i}(\cdot)\}_{i=1}^{N} and \{\rho^{i}(\cdot)\}_{i=1}^{N}, where the W^{i}(\cdot) are independent sample paths of the Brownian motion W(\cdot), and \rho^{i}(\cdot) is a control drawn from the density function \pi(\cdot), 1\leq i\leq N. For given t\leq s\leq T and sufficiently small \Delta s>0, x^{i}(\cdot) satisfies

x^{i}_{s+\Delta s}-x^{i}_{s}=\int_{s}^{s+\Delta s}\hat{b}(\beta_{r},x_{r}^{i},\rho^{i}_{r})\mathrm{d}r+\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r},\rho^{i}_{r})\mathrm{d}W^{i}_{r},\ x^{i}_{t}=x, (2.7)

which leads to

x^{i}_{s+\Delta s}-x^{i}_{s}=\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\Delta s+\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r},\rho^{i}_{r})\mathrm{d}W^{i}_{r}+\int_{s}^{s+\Delta s}\left[\hat{b}(\beta_{r},x_{r}^{i},\rho^{i}_{r})-\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\right]\mathrm{d}r.

Applying the classical law of large numbers, we have the following convergence results in probability:

\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\left(x^{i}_{s+\Delta s}-x^{i}_{s}\right)=\mathbb{E}[X^{\pi}_{s+\Delta s}-X^{\pi}_{s}];
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\Delta s=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\,\Delta s\right];
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r},\rho^{i}_{r})\mathrm{d}W^{i}_{r}=0;
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\int_{s}^{s+\Delta s}\left[\hat{b}(\beta_{r},x_{r}^{i},\rho^{i}_{r})-\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\right]\mathrm{d}r=o(\Delta s),

where the last equality follows from the continuous dependence of the solution of the stochastic differential equation, and o(\Delta s) denotes a higher order infinitesimal of \Delta s; thus,

\mathbb{E}[X^{\pi}_{s+\Delta s}-X^{\pi}_{s}]=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\,\Delta s\right]+o(\Delta s). (2.8)

Dividing both sides of equation (2.8) by \Delta s and letting \Delta s\to 0, it follows that

\lim_{\Delta s\to 0}\frac{\mathbb{E}[X^{\pi}_{s+\Delta s}-X^{\pi}_{s}]}{\Delta s}=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\right]. (2.9)

Combining equations (2.6) and (2.9), we have that

\mathbb{E}\left[h(\beta_{s},X^{\pi}_{s},\pi_{s})\right]=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\right]. (2.10)

It is worth noting that, for any given A\in\mathcal{F}_{s}, from equations (2.6) and (2.7), we have

\mathrm{d}X^{\pi}_{s}1_{A}=h(\beta_{s},X_{s}^{\pi}1_{A},\pi_{s})1_{A}\mathrm{d}s+g(\beta_{s},X_{s}^{\pi}1_{A},\pi_{s})1_{A}\mathrm{d}W_{s},

and

x^{i}_{s+\Delta s}1_{A^{i}}-x^{i}_{s}1_{A^{i}}=\int_{s}^{s+\Delta s}\hat{b}(\beta_{r},x_{r}^{i}1_{A^{i}},\rho^{i}_{r})1_{A^{i}}\mathrm{d}r+\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r}1_{A^{i}},\rho^{i}_{r})1_{A^{i}}\mathrm{d}W^{i}_{r},

where 1_{A} is the indicator function of the set A, and A^{i} is measurable with respect to the filtration generated by W^{i}_{r},\ t\leq r\leq s. Then, similarly to the above proof and equation (2.10), we have

\mathbb{E}\left[h(\beta_{s},X^{\pi}_{s},\pi_{s})1_{A}\right]=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\, 1_{A}\right], (2.11)

and thus

h(\beta_{s},X^{\pi}_{s},\pi_{s})=\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho,\quad P-a.s. (2.12)

Applying Itô's formula to (X^{\pi}_{s})^{2}, it follows that

\mathrm{d}(X^{\pi}_{s})^{2}=\left[2X^{\pi}_{s}h(\beta_{s},X_{s}^{\pi},\pi_{s})+g^{2}(\beta_{s},X_{s}^{\pi},\pi_{s})\right]\mathrm{d}s+2X^{\pi}_{s}g(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s}. (2.13)

Now, similarly to the proof of equation (2.12), we consider the samples \{(x^{i}(\cdot))^{2}\}_{i=1}^{N}, and obtain

2X^{\pi}_{s}h(\beta_{s},X_{s}^{\pi},\pi_{s})+g^{2}(\beta_{s},X_{s}^{\pi},\pi_{s})=\int_{\mathbb{R}}\left[2X^{\pi}_{s}\hat{b}(\beta_{s},X_{s}^{\pi},\rho)+\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\right]\pi_{s}(\rho)\mathrm{d}\rho. (2.14)

Combining equations (2.12) and (2.14), we have

g^{2}(\beta_{s},X_{s}^{\pi},\pi_{s})=\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho,\quad P-a.s. (2.15)

Thus, we can choose

g(\beta_{s},X_{s}^{\pi},\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho},

which completes the proof. ∎

Remark 2.3.

In Theorem 2.3, we show that the state X^{\pi}(\cdot) satisfies equation (2.5). Furthermore, from equation (2.15), for t\leq s\leq T, we have either

g(\beta_{s},X_{s}^{\pi},\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho},

or

g(\beta_{s},X_{s}^{\pi},\pi_{s})=-\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho}.

Indeed, the Brownian motion W_{s} is normally distributed with mean 0 and variance s, so the two choices generate the same law of the state, and we simply take the first one.
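
As a quick numerical illustration of Theorem 2.3 (not part of the original development), one can check that the empirical mean and variance of the sampled increments of equation (2.3) match the exploratory coefficients of equation (2.5). The coefficients \hat{b}(\beta,x,\rho)=\beta x-\rho, \hat{\sigma}(\beta,x,\rho)=\rho and the Gaussian exploration density below are illustrative assumptions, not the paper's model.

```python
import numpy as np

# Monte Carlo sketch of Theorem 2.3 under assumed coefficients:
# b_hat(beta, x, rho) = beta*x - rho, sigma_hat(beta, x, rho) = rho,
# with the control rho drawn from the Gaussian density pi = N(m, s^2).
rng = np.random.default_rng(0)
beta, x0, dt = 0.5, 1.0, 1e-3
m, s = 0.3, 0.4                        # mean and standard deviation of pi
N = 200_000                            # number of sample paths

rho = rng.normal(m, s, size=N)         # controls sampled from pi
dW = rng.normal(0.0, np.sqrt(dt), size=N)
dX = (beta * x0 - rho) * dt + rho * dW # one Euler step of equation (2.3)

# Exploratory coefficients of equation (2.5) for this pi:
b_tilde = beta * x0 - m                # integral of b_hat against pi
sigma_tilde_sq = m ** 2 + s ** 2       # integral of sigma_hat^2 against pi

print("empirical drift    :", dX.mean() / dt, " vs b_tilde       :", b_tilde)
print("empirical variance :", dX.var() / dt, " vs sigma_tilde^2 :", sigma_tilde_sq)
```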

Furthermore, in the exploratory framework, we rewrite the cost functional (2.4) as follows,

\tilde{J}(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}\left[\hat{f}(X_{s}^{\pi},\rho)\pi_{s}(\rho)+\lambda\pi_{s}(\rho)\ln\pi_{s}(\rho)\right]\mathrm{d}\rho\mathrm{d}s+\hat{\Phi}(X_{T}^{\pi})\right], (2.16)

where the term

\lambda\int_{\mathbb{R}}\pi_{s}(\rho)\ln\pi_{s}(\rho)\mathrm{d}\rho

is (up to the factor \lambda) the negative of Shannon's differential entropy, which measures the level of exploration, and \lambda>0 is the temperature constant balancing exploitation and exploration. We denote by \mathcal{A}[t,T] the set of all admissible density function controls on \mathcal{D}[t,T]. Thus, our target is to minimize the cost functional (2.16) over \mathcal{A}[t,T].
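
For concreteness, the entropy-regularized running term in (2.16) can be evaluated in closed form when \pi_{s} is Gaussian; the quadratic running cost \hat{f}(x,\rho)=(\rho-x)^{2} used below is only an illustrative assumption.

```python
import numpy as np

# Illustrative evaluation of the integrand of (2.16) for a Gaussian density pi = N(m, s^2)
# and an assumed running cost f_hat(x, rho) = (rho - x)^2.
lam, x, m, s = 0.1, 1.0, 0.4, 0.5

rho = np.linspace(-8.0, 8.0, 200_001)
d_rho = rho[1] - rho[0]
pi = np.exp(-(rho - m) ** 2 / (2.0 * s ** 2)) / np.sqrt(2.0 * np.pi * s ** 2)

numeric = np.sum(((rho - x) ** 2) * pi + lam * pi * np.log(pi)) * d_rho

# Closed form: E[(rho - x)^2] = (m - x)^2 + s^2, and the differential entropy of N(m, s^2)
# equals 0.5 * ln(2*pi*e*s^2), so the entropy term contributes -lam * 0.5 * ln(2*pi*e*s^2).
closed = (m - x) ** 2 + s ** 2 - lam * 0.5 * np.log(2.0 * np.pi * np.e * s ** 2)

print(numeric, closed)   # the two values should agree closely
```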

3 Hamilton-Jacobi-Bellman Approach

Now, we consider equation (2.5) with the cost functional (2.16). Similarly to Wang et al. (2020), employing the classical dynamic programming principle, one obtains

v(t,x)=\inf_{\pi(\cdot)\in\mathcal{A}[t,s]}\mathbb{E}\left[\int_{t}^{s}\int_{\mathbb{R}}\left[\hat{f}(X_{r}^{\pi},\rho)\pi_{r}(\rho)+\lambda\pi_{r}(\rho)\ln\pi_{r}(\rho)\right]\mathrm{d}\rho\mathrm{d}r+v(s,X^{\pi}_{s})\right],

and the related Hamilton-Jacobi-Bellman (HJB) equation is

0=\min_{\pi_{t}\in\mathcal{A}[t,t]}\bigg[\partial_{t}v(t,x)+\int_{\mathbb{R}}\bigg(\hat{f}(x,\rho)+\lambda\ln\pi_{t}(\rho)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x)\bigg)\pi_{t}(\rho)\mathrm{d}\rho\bigg], (3.1)

with terminal condition v(T,x)=\hat{\Phi}(x).

It is worth noting that equation (3.1) admits an optimal control \pi^{*}_{t}(\cdot) which satisfies

\pi^{*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left(\hat{f}(x,\rho)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left(\hat{f}(x,u)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,u)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,u)\partial_{x}v(t,x)\right)\right)\mathrm{d}u}. (3.2)

Letting \lambda\to 0, \pi^{*}_{t}(\cdot) reduces to a Dirac measure, where

\pi^{*}_{t}(\rho)=\delta_{\rho^{*}_{t}}(\rho),\quad\rho\in\mathbb{R},

and \rho_{t}^{*} solves the following equation in \rho,

0=\partial_{\rho}\hat{f}(x,\rho)+\hat{\sigma}(\beta_{t},x,\rho)\partial_{\rho}\hat{\sigma}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\partial_{\rho}\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x). (3.3)

By Theorem 2.2, equation (3.3) has at least one solution \rho_{t}^{*}=\beta_{t}, and thus

0=\partial_{\rho}\hat{f}(x,\beta_{t})+\hat{\sigma}(\beta_{t},x,\beta_{t})\partial_{\rho}\hat{\sigma}(\beta_{t},x,\beta_{t})\partial_{xx}v(t,x)+\partial_{\rho}\hat{b}(\beta_{t},x,\beta_{t})\partial_{x}v(t,x).

Furthermore, as a function of \rho, \mathcal{L}(t,x,\beta_{t},\rho) attains its minimum at \beta_{t}, where

\mathcal{L}(t,x,\beta_{t},\rho)=\hat{f}(x,\rho)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x).

Since \lambda>0, \pi^{*}_{t}(\rho) attains its maximum at \beta_{t}. These observations motivate us to recover the optimal estimate of the parameter \beta(\cdot) from the optimal density function \pi^{*}(\cdot).
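
The Gibbs form (3.2) can be tabulated numerically on a grid once v and its derivatives are known; its mode then serves as the estimate of \beta_{t}. The sketch below uses illustrative coefficients mirroring the linear example of Section 4.1 (\hat{f}\equiv 0, \hat{b}(\beta,x,\rho)=-\rho, \hat{\sigma}(\beta,x,\rho)=(\beta-1)x-\rho, and v(t,x)=x^{2} at a fixed point); these choices are assumptions made only for the demonstration.

```python
import numpy as np

# Numerical sketch of the Gibbs-type optimal density (3.2) on a grid (illustrative coefficients).
lam, beta, x = 0.2, 0.7, 1.0
v_x, v_xx = 2.0 * x, 2.0                     # derivatives of the assumed v(t, x) = x^2

rho = np.linspace(-3.0, 3.0, 4001)
d_rho = rho[1] - rho[0]

# Pointwise Hamiltonian L(t, x, beta, rho) with f_hat = 0:
L = 0.5 * ((beta - 1.0) * x - rho) ** 2 * v_xx + (-rho) * v_x

w = np.exp(-L / lam)
pi_star = w / (w.sum() * d_rho)              # normalized density of (3.2)

rho_hat = rho[np.argmax(pi_star)]
print("mode of pi*:", rho_hat, "   expected beta*x:", beta * x)
```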

Based on the above analysis, we obtain the following result.

Theorem 3.1.

Suppose that \mathcal{L}(t,x,\beta_{t},\rho) satisfies, for any \rho\neq\beta_{t},

\mathcal{L}(t,x,\beta_{t},\rho)>\mathcal{L}(t,x,\beta_{t},\beta_{t}). (3.4)

Then we have

{\arg\max}_{\rho\in\mathbb{R}}\pi^{*}_{t}(\rho)=\beta_{t},\ \ 0\leq t\leq T.
Proof.

From condition (3.4), \mathcal{L}(t,x,\beta_{t},\rho) attains its unique minimum at \beta_{t}, and hence \pi^{*}_{t}(\rho) attains its unique maximum at \beta_{t}. Obviously, {\arg\max}_{\rho\in\mathbb{R}}\pi^{*}_{t}(\rho) equals the parameter \beta_{t}. ∎

In the following section, we consider a linear stochastic differential equation (SDE) model to verify the results given in Theorem 3.1, where the drift or the diffusion term contains an unknown parameter.

4 Linear SDE problem

We now consider the linear SDE case, where the drift or the diffusion term is allowed to contain the unknown parameter. We first study the case where the diffusion term contains the unknown parameter, and then investigate the case where the drift term contains the unknown parameter.

4.1 The diffusion term with unknown parameter

We consider the following linear SDE,

\mathrm{d}X^{u}_{s}=\left[X^{u}_{s}+u_{s}\right]\mathrm{d}s+\left[\beta_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}W_{s}, (4.1)

with initial condition X^{u}_{t}=x, where the parameter \beta(\cdot) needs to be estimated. The cost functional is given as follows,

J(t,x;u(\cdot))=\mathbb{E}[\left(X^{u}_{T}\right)^{2}], (4.2)

where u(\cdot)\in\mathcal{U}[t,T]. Based on the classical optimal control theory, the related HJB equation is given by

0=\partial_{t}v(t,x)+\inf_{u\in U}\left[\frac{1}{2}(\beta_{t}x+u)^{2}\partial_{xx}v(t,x)+(x+u)\partial_{x}v(t,x)\right].

The optimal control is given by

u^{*}(s,X^{*}_{s})=-(1+\beta_{s})X^{*}_{s},\quad t\leq s\leq T.

We regard \beta_{s}X^{*}_{s} as \rho_{s}, and obtain

u^{\rho}(s,X^{\rho}_{s})=-X^{\rho}_{s}-\rho_{s},\quad t\leq s\leq T.
Remark 4.1.

In this paper, we aim to establish the theory for estimating the parameter appearing in the state's dynamic equation. To derive the explicit solution of the optimal control \pi^{*}(\cdot), here we replace \beta(\cdot)X^{*}(\cdot) by \rho(\cdot), which is essentially the same as the method developed in Section 3. Note that, once the optimal control \rho^{*}(\cdot) is obtained, we can divide \rho^{*}(\cdot) by the state x to recover the optimal estimate of the parameter \beta(\cdot). Indeed, when dealing with specific problems, this kind of structural device should be chosen according to the properties of the problem.

Then equation (2.3) becomes

\mathrm{d}X^{\rho}_{s}=-\rho_{s}\mathrm{d}s+(\beta_{s}-1-\rho_{s}{/}X^{\rho}_{s})X^{\rho}_{s}\mathrm{d}W_{s},\quad X^{\rho}_{t}=x. (4.3)

Now, applying Theorem 2.3, we can formulate the RL stochastic optimal control problem. The exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (4.4)

where

\tilde{b}(\beta_{s},x,\pi_{s})=-\int_{\mathbb{R}}\rho\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}(\beta_{s}-1-\rho{/}x)^{2}x^{2}\pi_{s}(\rho)\mathrm{d}\rho}.

The exploratory cost functional becomes

J(t,x;\pi(\cdot))=\mathbb{E}[\left(X^{\pi}_{T}\right)^{2}]. (4.5)

By the formula of the optimal control in (3.2), we have that

\pi^{*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}(\beta_{t}-1-\rho{/}x)^{2}x^{2}\partial_{xx}v(t,x)-\rho\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}(\beta_{t}-1-u{/}x)^{2}x^{2}\partial_{xx}v(t,x)-u\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (4.6)

and thus

\pi^{*}_{t}(\rho)=\frac{1}{\sqrt{2\pi\sigma^{2}(t,x)}}\exp\left(-\frac{\left(\rho-\mu(t,x)\right)^{2}}{2\sigma^{2}(t,x)}\right),

where

\mu(t,x)=-\frac{a_{2}(t,x)}{2a_{1}(t,x)};
\sigma^{2}(t,x)=\frac{\lambda}{2a_{1}(t,x)};
a_{1}(t,x)=\frac{\partial_{xx}v(t,x)}{2};
a_{2}(t,x)=(1-\beta_{t})x\partial_{xx}v(t,x)-\partial_{x}v(t,x).

Substituting \pi^{*}_{t}(\rho) into equation (3.1), a simple calculation yields

0=\partial_{t}v(t,x)+(\beta_{t}-1)^{2}x^{2}a_{1}(t,x)-\mu^{2}(t,x)a_{1}(t,x)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{a_{1}(t,x)}. (4.7)

We assume that the classical solution of equation (4.7) takes the form

v(t,x)=\alpha_{1}(t)x^{2}+\alpha_{2}(t).

Then, it follows that

\mu(t,x)=\beta_{t}x;
\sigma^{2}(t,x)=\frac{\lambda}{2\alpha_{1}(t)};
a_{1}(t,x)=\alpha_{1}(t);
a_{2}(t,x)=-2\beta_{t}\alpha_{1}(t)x.

Now, substituting the formula of v(t,x) into equation (4.7), we obtain

0=\alpha^{\prime}_{1}(t)x^{2}+\alpha^{\prime}_{2}(t)+(1-2\beta_{t})\alpha_{1}(t)x^{2}-\frac{\lambda}{2}\ln\frac{\pi\lambda}{\alpha_{1}(t)}. (4.8)

Thus, \alpha_{1}(t) and \alpha_{2}(t) satisfy the following equations,

0=\alpha^{\prime}_{1}(t)+(1-2\beta_{t})\alpha_{1}(t),\quad\alpha_{1}(T)=1;
0=\alpha^{\prime}_{2}(t)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{\alpha_{1}(t)},\quad\alpha_{2}(T)=0,

and

\alpha_{1}(t)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right);
\alpha_{2}(t)=\int_{t}^{T}\frac{\lambda}{2}\ln\frac{\pi\lambda}{\alpha_{1}(s)}\mathrm{d}s,

which solve the value function v(t,x); the optimal control \pi^{*}(\rho) is the Gaussian density function,

\pi_{t}^{*}(\rho)=\frac{1}{\sqrt{\pi\lambda{/}\alpha_{1}(t)}}\exp\left(-\frac{\left(\rho-\beta_{t}x\right)^{2}}{\lambda{/}\alpha_{1}(t)}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\alpha_{1}(t)}),

where

\alpha_{1}(t)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right).

We summarize the above results in the following theorem.

Theorem 4.1.

For the exploratory dynamic equation (4.4) with cost functional (4.5), the optimal control is given as follows,

\pi_{t}^{*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\alpha_{1}(t)}),\ 0\leq t\leq T,

where \mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\alpha_{1}(t)}) denotes the Gaussian density function with mean \beta_{t}x and variance \frac{\lambda}{2\alpha_{1}(t)}.

Remark 4.2.

In Theorem 4.1, we give the explicit formula of the optimal control \pi_{t}^{*}(\cdot),\ 0\leq t\leq T. From this control, we can use the mean \beta_{t}x to estimate the parameter \beta(\cdot). In this exploratory stochastic optimal control problem, the variance \frac{\lambda}{2\alpha_{1}(t)} of the optimal control \pi_{t}^{*}(\cdot) measures the level of exploration of the RL procedure. It is worth noting that, as \lambda\to 0, \pi_{t}^{*}(\cdot) reduces to the Dirac measure \delta_{\beta_{t}x}(\cdot).
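
A minimal numerical sketch of Remark 4.2, assuming a constant \beta for illustration: drawing controls from \pi_{t}^{*}=\mathcal{N}(\beta_{t}x,\ \lambda/(2\alpha_{1}(t))) and dividing the sample mean by the state x recovers \beta_{t}.

```python
import numpy as np

# Sketch of the estimation step in Remark 4.2 (assumed constant beta, chosen only to generate data).
rng = np.random.default_rng(1)
lam, T, t, x = 0.1, 1.0, 0.3, 2.0
beta = 0.6                                        # the "unknown" parameter
alpha_1 = np.exp((1.0 - 2.0 * beta) * (T - t))    # closed form of alpha_1(t) for constant beta

samples = rng.normal(beta * x, np.sqrt(lam / (2.0 * alpha_1)), size=50_000)
beta_hat = samples.mean() / x                     # divide by the state, as in Remark 4.1
print("beta_hat =", beta_hat, "   true beta =", beta)
```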

Now, we consider the following classical optimal control problem. The state satisfies

\mathrm{d}X^{\rho}_{s}=-\rho_{s}\mathrm{d}s+\left[(\beta_{s}-1)X^{\rho}_{s}-\rho_{s}\right]\mathrm{d}W_{s},\ X^{\rho}_{t}=x, (4.9)

and the cost functional is

J(t,x;u(\cdot))=\mathbb{E}[\left(X^{u}_{T}\right)^{2}]. (4.10)

The value function is defined by

\hat{v}(t,x)=\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}J(t,x;u(\cdot)),

and satisfies the following HJB equation,

0=\min_{\rho\in\mathbb{R}}\left[\partial_{t}\hat{v}(t,x)+\frac{1}{2}\left[(\beta_{t}-1)x-\rho\right]^{2}\partial_{xx}\hat{v}(t,x)-\rho\partial_{x}\hat{v}(t,x)\right]. (4.11)

The optimal control of the HJB equation (4.11) is

\rho^{*}_{t}=\frac{(\beta_{t}-1)x\partial_{xx}\hat{v}(t,x)+\partial_{x}\hat{v}(t,x)}{\partial_{xx}\hat{v}(t,x)}.

Then, substituting \rho_{t}^{*} into equation (4.11), we assume that the value function \hat{v}(t,x) takes the form

\hat{v}(t,x)=b_{1}(t)x^{2}+b_{2}(t).

Thus, b_{1}(t) and b_{2}(t) satisfy the following equations,

0=b^{\prime}_{1}(t)+(1-2\beta_{t})b_{1}(t),\quad b_{1}(T)=1;
0=b^{\prime}_{2}(t),\quad b_{2}(T)=0,

and

b_{1}(t)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right);
b_{2}(t)=0.

The value function \hat{v}(t,x) is given as

\hat{v}(t,x)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right)x^{2},

with the optimal control \rho^{*}_{t}=\beta_{t}x.

Thus, the optimal control \rho^{*}_{t} of the classical optimal control problem coincides with the mean of the optimal density function \pi^{*}_{t}. Therefore, based on the reinforcement learning (RL) stochastic optimal control structure presented in this study, we can use the RL method to learn the optimal density function \pi^{*}_{t} and then obtain the estimate of the parameter \beta(\cdot).
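
As a sanity check on the closed-form value function (an illustration only, with an assumed time-varying \beta), the ODE 0=b_{1}^{\prime}(t)+(1-2\beta_{t})b_{1}(t), b_{1}(T)=1 can be integrated backward numerically and compared with the exponential formula above.

```python
import numpy as np

# Backward integration of 0 = b_1'(t) + (1 - 2*beta_t) b_1(t), b_1(T) = 1, versus the closed form
# b_1(t) = exp(int_t^T (1 - 2*beta_s) ds), for an assumed path beta_s = 0.5 + 0.3*sin(2*pi*s).
T, n = 1.0, 20_000
s = np.linspace(0.0, T, n + 1)
beta = 0.5 + 0.3 * np.sin(2.0 * np.pi * s)
ds = s[1] - s[0]
g = 1.0 - 2.0 * beta

b1 = np.empty(n + 1)
b1[-1] = 1.0
for k in range(n - 1, -1, -1):          # explicit Euler step, marching backward from t = T
    b1[k] = b1[k + 1] * (1.0 + g[k + 1] * ds)

cum = np.concatenate(([0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * ds)))   # trapezoidal int_0^{s_k} g dr
closed = np.exp(cum[-1] - cum)          # exp(int_{s_k}^T g dr)
print("max abs deviation:", np.abs(b1 - closed).max())
```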

4.2 The drift term with unknown parameter

We could also consider the following model where the state satisfies

\mathrm{d}X^{u}_{s}=\left[\beta_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}s+\left[X^{u}_{s}+u_{s}\right]\mathrm{d}W_{s}, (4.12)

with initial condition X^{u}_{t}=x, where the parameter \beta(\cdot) needs to be estimated. Note that the parameter \beta(\cdot) appears in the drift term of the state equation (4.12). We construct the following cost functional, which differs from (4.10),

J(t,x;u(\cdot))=\mathbb{E}\int_{t}^{T}[(u_{s}+X^{u}_{s}+\beta_{s}X^{u}_{s})^{2}]\mathrm{d}s. (4.13)
Remark 4.3.

The cost functional (4.13) contains the unknown term \beta_{s}X^{u}_{s}. In fact, from equation (4.12), we can observe the value of \beta_{s}X^{u}_{s},\ t\leq s\leq T. Precisely, for a given input u(\cdot), we can obtain the value of J(t,x;u(\cdot)).

Indeed, if we consider the cost functional (4.10) for the state equation (4.12), the problem admits the unique optimal control u^{*}(\cdot)=-2X^{*}(\cdot), which cannot be used to estimate the parameter \beta(\cdot). For the cost functional (4.13), the related optimal control is given as follows:

u^{*}_{s}=-X_{s}^{*}-\beta_{s}X_{s}^{*},\ t\leq s\leq T.

We regard \beta_{s}X_{s}^{*} as \rho_{s}, and thus

u^{\rho}_{s}=-X_{s}^{\rho}-\rho_{s},\ t\leq s\leq T.

Then, equation (4.12) becomes

\mathrm{d}X^{\rho}_{s}=[(\beta_{s}-1)X_{s}^{\rho}-\rho_{s}]\mathrm{d}s-\rho_{s}\mathrm{d}W_{s},\ X^{\rho}_{t}=x. (4.14)

Based on Theorem 2.3, the exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (4.15)

where

\tilde{b}(\beta_{s},x,\pi_{s})=\int_{\mathbb{R}}[(\beta_{s}-1)x-\rho]\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\rho^{2}\pi_{s}(\rho)\mathrm{d}\rho}.

The exploratory cost functional is,

J(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}(\rho-\beta_{s}X^{\pi}_{s})^{2}\pi_{s}(\rho)\mathrm{d}\rho\mathrm{d}s\right]. (4.16)

By the formula of the optimal control in (3.2), we have that

\pi^{*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left((\rho-\beta_{t}x)^{2}+\frac{1}{2}\rho^{2}\partial_{xx}v(t,x)+[(\beta_{t}-1)x-\rho]\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left((u-\beta_{t}x)^{2}+\frac{1}{2}u^{2}\partial_{xx}v(t,x)+[(\beta_{t}-1)x-u]\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (4.17)

which is a Gaussian density function with mean

\mu(t,x)=\frac{2\beta_{t}x+\partial_{x}v(t,x)}{2+\partial_{xx}v(t,x)},

and variance

\sigma^{2}(t,x)=\frac{\lambda}{2+\partial_{xx}v(t,x)}.

We assume

v(t,x)=\alpha_{1}(t)x^{2}+\alpha_{2}(t).

The related HJB equation becomes

0=\alpha^{\prime}_{1}(t)x^{2}+\alpha^{\prime}_{2}(t)+\beta_{t}^{2}x^{2}+2(\beta_{t}-1)\alpha_{1}(t)x^{2}-\mu^{2}(t,x)(1+\alpha_{1}(t))-\frac{\lambda}{2}\ln\frac{\pi\lambda}{1+\alpha_{1}(t)}, (4.18)

which yields

0=\alpha^{\prime}_{1}(t)(1+\alpha_{1}(t))+\beta_{t}^{2}\alpha_{1}(t)+2\beta_{t}\alpha^{2}_{1}(t)-2\alpha_{1}(t)-3\alpha_{1}^{2}(t);
0=\alpha^{\prime}_{2}(t)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{1+\alpha_{1}(t)},

with \alpha_{1}(T)=\alpha_{2}(T)=0, and thus

\alpha_{1}(t)=0;
\alpha_{2}(t)=\frac{\lambda(T-t)}{2}\ln(\pi\lambda),

which solves the value function v(t,x); the optimal control \pi^{*}(\rho) is

\pi_{t}^{*}(\rho)=\frac{1}{\sqrt{\pi\lambda}}\exp\left(-\frac{\left(\rho-\beta_{t}x\right)^{2}}{\lambda}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2}).

Based on RL, we can learn the mean \beta_{t}x from the observed data. Furthermore, we can use the variance \frac{\lambda}{2} to adjust the exploration rate.
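
The estimation principle in this subsection can also be checked by brute force: among Gaussian feedback policies \pi_{s}=\mathcal{N}(mX^{\pi}_{s},\ \lambda/2), the exploratory cost (4.16) is smallest when the slope m equals \beta (the entropy term of (2.16) is constant over this family, so it does not move the minimizer). The Monte Carlo sketch below assumes a constant \beta and a crude Euler discretization.

```python
import numpy as np

# Monte Carlo sketch for Subsection 4.2 (assumed constant beta): evaluate the exploratory
# cost (4.16) under Gaussian feedback policies pi_s = N(m * X_s, lambda/2) and locate its minimizer.
rng = np.random.default_rng(2)
lam, beta, T, x0 = 0.2, 0.7, 1.0, 1.0
n_steps, n_paths = 200, 4000
dt = T / n_steps

def exploratory_cost(m):
    X = np.full(n_paths, x0)
    cost = np.zeros(n_paths)
    for _ in range(n_steps):
        drift = (beta - 1.0 - m) * X                         # b_tilde of (4.15) for this policy
        vol = np.sqrt(m ** 2 * X ** 2 + lam / 2.0)           # sigma_tilde of (4.15) for this policy
        cost += ((m - beta) ** 2 * X ** 2 + lam / 2.0) * dt  # running cost of (4.16) under pi
        X = X + drift * dt + vol * rng.normal(0.0, np.sqrt(dt), n_paths)
    return cost.mean()

grid = np.linspace(0.3, 1.1, 9)
costs = [exploratory_cost(m) for m in grid]
print("cost-minimizing slope m:", grid[int(np.argmin(costs))], "   true beta:", beta)
```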

In this section, we have considered a parameter appearing in the drift or the diffusion term. In the following Section 5, we consider the more general case where both the drift and diffusion terms contain parameters.

5 Extension

Now, we generalize the model of Section 4 to the case where both the drift and diffusion terms contain unknown parameters. We show in detail how to estimate the parameters in the drift and diffusion terms. Then, we give an example to verify the main results.

5.1 General model

We consider the following stochastic differential equation with deterministic parameters \alpha_{s},\beta_{s}\in\mathbb{R},\ t\leq s\leq T, and control u(\cdot)\in\mathcal{U}[t,T],

\mathrm{d}X^{u}_{s}=b(\alpha_{s},X_{s}^{u},u_{s})\mathrm{d}s+\sigma(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}W_{s},\ X^{u}_{t}=x. (5.1)

We construct two cost functionals,

J_{1}(t,x;u(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{1}(X_{s}^{u},u_{s})\mathrm{d}s+\Phi_{1}(X_{T}^{u})\right], (5.2)

and

J_{2}(t,x;u(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{2}(X_{s}^{u},u_{s})\mathrm{d}s+\Phi_{2}(X_{T}^{u})\right], (5.3)

to estimate the parameters \alpha(\cdot) and \beta(\cdot), respectively.

Firstly, we show how to estimate \alpha(\cdot) from the cost functional J_{1}(t,x;u(\cdot)). By choosing appropriate functions f_{1}(\cdot) and \Phi_{1}(\cdot), we obtain a feedback optimal control u^{1,*}_{s}=u^{1,*}(\alpha_{s},s,X^{1,*}_{s}),\ t\leq s\leq T for the cost functional J_{1}(t,x;u(\cdot)), where X^{1,*}(\cdot) is the optimal state under the optimal control u^{1,*}(\cdot). However, we require that u^{1,*}(\cdot) contains the parameter \alpha(\cdot) only: if u^{1,*}(\cdot) contained the parameter \beta(\cdot), we would not know the value of the new input control (see u^{\rho}(\cdot)). We replace \alpha(\cdot) by a deterministic process \rho(\cdot) in the feedback optimal control u^{1,*}(\alpha_{s},s,X^{1,*}_{s}), and denote u^{\rho}_{s}=u^{1,*}(\rho_{s},s,X^{\rho}_{s}), where X^{\rho}_{s} satisfies the following stochastic differential equation,

\mathrm{d}X^{\rho}_{s}=\hat{b}(\alpha_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}s+\hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}W_{s},\ X^{\rho}_{t}=x, (5.4)

where \hat{b}(\alpha_{s},X_{s}^{\rho},\rho_{s})=b(\alpha_{s},X_{s}^{\rho},u^{\rho}_{s}) and \hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})=\sigma(\beta_{s},X_{s}^{\rho},u^{\rho}_{s}). Furthermore, we rewrite the cost functional (5.2) as follows,

J_{1}(t,x;\rho(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{1}(X_{s}^{\rho},\rho_{s})\mathrm{d}s+\Phi_{1}(X_{T}^{\rho})\right], (5.5)

where \rho_{s}\in\mathbb{R},\ t\leq s\leq T. Based on Theorem 2.3, we consider the following exploratory dynamic equation,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\alpha_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (5.6)

where \pi_{s} is the density function of the input control \rho_{s},

\tilde{b}(\alpha_{s},x,\pi_{s})=\int_{\mathbb{R}}\hat{b}(\alpha_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho}

for t\leq s\leq T. The cost functional (5.5) is rewritten as follows,

\tilde{J}_{1}(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}\left[f_{1}(X_{s}^{\pi},\rho)\pi_{s}(\rho)+\lambda\pi_{s}(\rho)\ln\pi_{s}(\rho)\right]\mathrm{d}\rho\mathrm{d}s+\Phi_{1}(X_{T}^{\pi})\right]. (5.7)

Secondly, we estimate \beta(\cdot) from the cost functional J_{2}(t,x;u(\cdot)). We choose appropriate functions f_{2}(\cdot) and \Phi_{2}(\cdot) such that the feedback optimal control u^{2,*}_{s}=u^{2,*}(\beta_{s},s,X^{2,*}_{s}),\ t\leq s\leq T of the cost functional J_{2}(t,x;u(\cdot)) contains the parameter \beta(\cdot), where X^{2,*}(\cdot) is the optimal state under the optimal control u^{2,*}(\cdot). We replace \beta(\cdot) by a deterministic process \gamma(\cdot) in the feedback optimal control u^{2,*}(\beta_{s},s,X^{2,*}_{s}), and denote u^{\gamma}_{s}=u^{2,*}(\gamma_{s},s,X^{\gamma}_{s}), where X^{\gamma}_{s} satisfies the following stochastic differential equation,

\mathrm{d}X^{\gamma}_{s}=\hat{b}(\alpha_{s},X_{s}^{\gamma},\gamma_{s})\mathrm{d}s+\hat{\sigma}(\beta_{s},X_{s}^{\gamma},\gamma_{s})\mathrm{d}W_{s},\ X^{\gamma}_{t}=x, (5.8)

where \hat{b}(\alpha_{s},X_{s}^{\gamma},\gamma_{s})=b(\alpha_{s},X_{s}^{\gamma},u^{\gamma}_{s}) and \hat{\sigma}(\beta_{s},X_{s}^{\gamma},\gamma_{s})=\sigma(\beta_{s},X_{s}^{\gamma},u^{\gamma}_{s}). Furthermore, we rewrite the cost functional (5.3) as follows,

J_{2}(t,x;\gamma(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{2}(X_{s}^{\gamma},\gamma_{s})\mathrm{d}s+\Phi_{2}(X_{T}^{\gamma})\right], (5.9)

where \gamma_{s}\in\mathbb{R},\ t\leq s\leq T. Then, based on Theorem 2.3, we can introduce exploratory dynamic equations analogous to (5.6) and (5.7). For notational simplicity, we omit them.

5.2 Linear SDE with unknown parameters

Now, we consider the following example.

\mathrm{d}X^{u}_{s}=\left[\alpha_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}s+\left[\beta_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}W_{s}, (5.10)

where \alpha(\cdot) and \beta(\cdot) need to be estimated. Note that equation (5.10) contains both parameters \alpha(\cdot) and \beta(\cdot). Thus, we cannot directly use the cost functional developed in Subsection 4.2 to estimate \alpha(\cdot). Therefore, we first estimate the parameter \beta(\cdot). Based on the estimate of \beta(\cdot), we can then use the cost functional constructed in Subsection 4.2 to estimate the parameter \alpha(\cdot).

Step 1: We first construct the cost functional J_{1}(t,x;u(\cdot)) used to estimate the parameter \beta(\cdot), where

J_{1}(t,x;u(\cdot))=\mathbb{E}[(X^{u}_{T})^{2}]. (5.11)

Based on the classical optimal control theory, the related HJB equation is given by

0=\partial_{t}v(t,x)+\inf_{u\in U}\left[\frac{1}{2}(\beta_{t}x+u)^{2}\partial_{xx}v(t,x)+(\alpha_{t}x+u)\partial_{x}v(t,x)\right].

The related optimal control is

u^{*}(s,X^{*}_{s})=-(1+\beta_{s})X^{*}_{s},\quad t\leq s\leq T.

We regard \beta_{s}X_{s}^{*} as \gamma_{s}, and have

u^{\gamma}_{s}=-\gamma_{s}-X_{s}^{\gamma},\ t\leq s\leq T.

Equation (5.10) becomes

\mathrm{d}X^{\gamma}_{s}=[(\alpha_{s}-1)X^{\gamma}_{s}-\gamma_{s}]\mathrm{d}s+[(\beta_{s}-1)X^{\gamma}_{s}-\gamma_{s}]\mathrm{d}W_{s},\ X^{\gamma}_{t}=x. (5.12)

From Theorem 2.3, the exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\alpha_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (5.13)

where

\tilde{b}(\alpha_{s},x,\pi_{s})=\int_{\mathbb{R}}[(\alpha_{s}-1)x-\gamma]\pi_{s}(\gamma)\mathrm{d}\gamma,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}[(\beta_{s}-1)x-\gamma]^{2}\pi_{s}(\gamma)\mathrm{d}\gamma}.

The exploratory cost functional is,

J_{1}(t,x;\pi(\cdot))=\mathbb{E}[(X^{\pi}_{T})^{2}]. (5.14)

By the formula of the optimal control in (3.2), we have that

\pi^{1,*}_{t}(\gamma)=\frac{\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}[(\beta_{t}-1)x-\gamma]^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-1)x-\gamma]\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}[(\beta_{t}-1)x-u]^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-1)x-u]\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (5.15)

which is a Gaussian density function with mean

\mu(t,x)=\frac{(\beta_{t}-1)x\partial_{xx}v(t,x)+\partial_{x}v(t,x)}{\partial_{xx}v(t,x)},

and variance

\sigma^{2}(t,x)=\frac{\lambda}{\partial_{xx}v(t,x)}.

We assume

v(t,x)=\theta_{1}(t)x^{2}+\theta_{2}(t).

The related HJB equation becomes

0=\theta^{\prime}_{1}(t)x^{2}+\theta^{\prime}_{2}(t)+(\beta_{t}-1)^{2}\theta_{1}(t)x^{2}+2(\alpha_{t}-1)\theta_{1}(t)x^{2}-\mu^{2}(t,x)\theta_{1}(t)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{\theta_{1}(t)}, (5.16)

which yields

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\mathrm{d}s\right);
\theta_{2}(t)=\frac{\lambda}{2}\int_{t}^{T}\ln(\frac{\pi\lambda}{\theta_{1}(s)})\mathrm{d}s,

which solve the value function v(t,x); the optimal control \pi^{1,*}(\gamma) is the Gaussian density function,

\pi_{t}^{1,*}(\gamma)=\sqrt{\frac{\theta_{1}(t)}{\pi\lambda}}\exp\left(-\frac{\left(\gamma-\beta_{t}x\right)^{2}}{\lambda{/}\theta_{1}(t)}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{1,*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\theta_{1}(t)}),

where

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\mathrm{d}s\right).

Step 2: We then construct the cost functional J_{2}(t,x;u(\cdot)) used to estimate the parameter \alpha(\cdot), where

J_{2}(t,x;u(\cdot))=\mathbb{E}\int_{t}^{T}[(u_{s}+\alpha_{s}X^{u}_{s}+\beta_{s}X^{u}_{s})^{2}]\mathrm{d}s. (5.17)

Minimizing the cost functional (5.17), the related optimal control is given by

u^{*}_{s}=-\alpha_{s}X_{s}^{*}-\beta_{s}X_{s}^{*},\ t\leq s\leq T.

We regard \alpha_{s}X_{s}^{*} as \rho_{s}, and thus

u^{\rho}_{s}=-\rho_{s}-\beta_{s}X_{s}^{\rho},\ t\leq s\leq T.

Then, equation (5.10) becomes

\mathrm{d}X^{\rho}_{s}=[(\alpha_{s}-\beta_{s})X_{s}^{\rho}-\rho_{s}]\mathrm{d}s-\rho_{s}\mathrm{d}W_{s},\ X^{\rho}_{t}=x. (5.18)

From Theorem 2.3, the exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\alpha_{s}-\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (5.19)

where

\tilde{b}(\alpha_{s}-\beta_{s},x,\pi_{s})=\int_{\mathbb{R}}[(\alpha_{s}-\beta_{s})x-\rho]\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\rho^{2}\pi_{s}(\rho)\mathrm{d}\rho}.

The exploratory cost functional is,

J_{2}(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}(\rho-\alpha_{s}X^{\pi}_{s})^{2}\pi_{s}(\rho)\mathrm{d}\rho\mathrm{d}s\right]. (5.20)

By the formula of the optimal control in (3.2), we have that

\pi^{2,*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left((\rho-\alpha_{t}x)^{2}+\frac{1}{2}\rho^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-\beta_{t})x-\rho]\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left((u-\alpha_{t}x)^{2}+\frac{1}{2}u^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-\beta_{t})x-u]\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (5.21)

which is a Gaussian density function with mean

\mu(t,x)=\frac{2\alpha_{t}x+\partial_{x}v(t,x)}{2+\partial_{xx}v(t,x)},

and variance

\sigma^{2}(t,x)=\frac{\lambda}{2+\partial_{xx}v(t,x)}.

We assume

v(t,x)=\theta_{1}(t)x^{2}+\theta_{2}(t).

The related HJB equation becomes

0=\theta^{\prime}_{1}(t)x^{2}+\theta^{\prime}_{2}(t)+\alpha_{t}^{2}x^{2}+2(\alpha_{t}-\beta_{t})\theta_{1}(t)x^{2}-\mu^{2}(t,x)(1+\theta_{1}(t))-\frac{\lambda}{2}\ln\frac{\pi\lambda}{1+\theta_{1}(t)}, (5.22)

which yields

\theta_{1}(t)=0;
\theta_{2}(t)=\frac{\lambda(T-t)}{2}\ln(\pi\lambda),

which solve the value function v(t,x); the optimal control \pi^{2,*}(\rho) is the Gaussian density function,

\pi_{t}^{2,*}(\rho)=\frac{1}{\sqrt{\pi\lambda}}\exp\left(-\frac{\left(\rho-\alpha_{t}x\right)^{2}}{\lambda}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{2,*}(\cdot)\overset{d}{=}\mathcal{N}(\alpha_{t}x,\ \frac{\lambda}{2}).

We summarize the above main results in the following theorem.

Theorem 5.1.

When both the drift and diffusion terms of the linear SDE (5.1) contain the unknown parameters $\alpha(\cdot)$ and $\beta(\cdot)$, based on the exploratory equations (5.13) and (5.19), and related cost functionals (5.14) and (5.20), we have the optimal density functions for the parameters $\beta(\cdot)$ and $\alpha(\cdot)$, respectively,

\pi_{t}^{1,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\beta_{t}x,\ \frac{\lambda}{2\theta_{1}(t)}\right),\quad\pi_{t}^{2,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\alpha_{t}x,\ \frac{\lambda}{2}\right),

where

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\,\mathrm{d}s\right).

Based on $\pi_{t}^{1,*}(\cdot)$ and $\pi_{t}^{2,*}(\cdot)$, we can learn the unknown parameters $\beta(\cdot)$ and $\alpha(\cdot)$.

Remark 5.1.

Wang and Zhou (2020) investigated a continuous-time mean-variance portfolio selection problem with a reinforcement learning model and developed an implementable reinforcement learning algorithm. Based on the results in Wang and Zhou (2020), we can obtain the parameters appearing in the density functions $\pi_{t}^{1,*}(\cdot)$ and $\pi_{t}^{2,*}(\cdot)$, which can be used to estimate the parameters $\beta(\cdot)$ and $\alpha(\cdot)$. Based on these estimates of $\beta(\cdot)$ and $\alpha(\cdot)$, we can carry out empirical analysis for the classical optimal control problem and related applications.
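To make the last point concrete, the following minimal sketch (placeholder names, synthetic draws) shows how point estimates of $\beta_{t}$ and $\alpha_{t}$ can be read off from samples of the learned densities once their means have been fitted, e.g., by the policy evaluation and improvement procedures of Wang and Zhou (2020):

import numpy as np

def estimate_parameters(samples_pi1, samples_pi2, x):
    """Point estimates of beta_t and alpha_t at a fixed time t and state x != 0.

    samples_pi1: draws from pi^{1,*}_t, whose mean is beta_t * x
    samples_pi2: draws from pi^{2,*}_t, whose mean is alpha_t * x
    """
    beta_hat = np.mean(samples_pi1) / x
    alpha_hat = np.mean(samples_pi2) / x
    return beta_hat, alpha_hat

# Synthetic illustration; lam, alpha_t, beta_t, theta1_t, x are placeholder values.
lam, alpha_t, beta_t, theta1_t, x = 0.1, 0.5, 0.3, 1.2, 2.0
rng = np.random.default_rng(1)
pi1_draws = rng.normal(beta_t * x, np.sqrt(lam / (2 * theta1_t)), size=10_000)
pi2_draws = rng.normal(alpha_t * x, np.sqrt(lam / 2), size=10_000)
print(estimate_parameters(pi1_draws, pi2_draws, x))   # approximately (0.3, 0.5)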

6 Conclusion

Combining the stochastic optimal control model with a reinforcement learning structure, we develop a novel approach to learn the unknown parameters appearing in the drift and diffusion terms of a stochastic differential equation. By choosing an appropriate cost functional, we obtain a feedback optimal control $u^{*}(\cdot)$ that contains the unknown parameter $\beta(\cdot)$, denoted by $u^{*}_{s}=u^{*}(\beta_{s},s,X^{*}_{s}),\ t\leq s\leq T$. We then replace the unknown parameter $\beta(\cdot)$ in the optimal control $u^{*}(\cdot)$ with a new deterministic control $\rho(\cdot)$ and obtain a new input control $u^{\rho}_{s}=u^{\rho}(\rho_{s},s,X^{\rho}_{s}),\ t\leq s\leq T$. Substituting the new control $u^{\rho}(\cdot)$ into the related stochastic differential equation and cost functional, we establish the mathematical framework of the exploratory dynamic equation and cost functional, which is consistent with the structure introduced in Wang et al. (2020).

Indeed, the optimal control of the new exploratory control problem can be used to estimate the unknown parameters. We first consider the linear stochastic differential equation case in which the drift or the diffusion term contains an unknown parameter, and then investigate the general case in which both the drift and diffusion terms contain unknown parameters. In all of these cases, we show that the optimal density function is Gaussian and can be used to estimate the unknown parameters. When both the drift and diffusion terms of the linear SDE contain the unknown parameters $\alpha(\cdot)$ and $\beta(\cdot)$, based on the exploratory equations and related cost functionals, we obtain the optimal density functions for the parameters $\alpha(\cdot)$ and $\beta(\cdot)$, respectively. The optimal density functions are given by

\pi_{t}^{1,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\beta_{t}x,\ \frac{\lambda}{2\theta_{1}(t)}\right),\quad\pi_{t}^{2,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\alpha_{t}x,\ \frac{\lambda}{2}\right),

where

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\,\mathrm{d}s\right).

Based on $\pi_{t}^{1,*}(\cdot)$ and $\pi_{t}^{2,*}(\cdot)$, we can learn the unknown parameters $\beta(\cdot)$ and $\alpha(\cdot)$.

In this paper, we focus on developing the theory of a stochastic optimal control approach with a reinforcement learning structure to estimate the unknown parameters appearing in the model. Indeed, based on existing optimal control learning methods, for example the policy evaluation and policy improvement procedures developed in Wang and Zhou (2020), we can learn the optimal density functions and then obtain estimates of the unknown parameters. These estimates are useful in related investment portfolio problems, for example the mean-variance portfolio problem. We will further consider these problems in future work.

References

  • Basak and Chabakauri (2010) S. Basak and G. Chabakauri. Dynamic mean-variance asset allocation. The Review of Financial Studies, 23(8):2970–3016, 2010.
  • Björk et al. (2014) T. Björk, A. Murgoci, and X. Y. Zhou. Mean-variance portfolio optimization with state-dependent risk aversion. Mathematical Finance, 24:1–24, 2014.
  • Dai et al. (2021) M. Dai, H. Jin, S. Kou, and Y. Xu. A dynamic mean-variance analysis for log returns. Management Science, 67(2):1093–1108, 2021.
  • Dong (2022) Y. C. Dong. Randomized optimal stopping problem in continuous time and reinforcement learning algorithm. arxiv.org/abs/2208.02409, pages 1–19, 2022.
  • Doya (2000) K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
  • Fleming and Rishel (1975) W. H. Fleming and R. W. Rishel. Deterministic and stochastic optimal control. Springer-Verlag, New York, 1975.
  • Fleming and Soner (2006) W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Springer, New York, 2006.
  • Gao et al. (2022) X. Gao, Z. Q. Xu, and X. Y. Zhou. State-dependent temperature control for Langevin diffusions. SIAM Journal on Control and Optimization, 60(3):1250–1268, 2022.
  • Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, http://www.deeplearningbook.org, 2016.
  • Han et al. (2023) X. Han, R. Wang, and X. Y. Zhou. Choquet regularization for continuous-time reinforcement learning. pages 1–35, 2023.
  • Jia and Zhou (2022a) Y. Jia and X. Y. Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. The Journal of Machine Learning Research, 23(154):1–55, 2022a.
  • Jia and Zhou (2022b) Y. Jia and X. Y. Zhou. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. The Journal of Machine Learning Research, 23(275):1–50, 2022b.
  • Jia and Zhou (2022c) Y. Jia and X. Y. Zhou. q-learning in continuous time. arXiv: 2207.00713, 2022c.
  • Pardoux and Peng (1990) E. Pardoux and S. Peng. Adapted solution of a backward stochastic differential equation. Systems & Control Letters, 14:55–61, 1990.
  • Peng (1990) S. Peng. A general stochastic maximum principle for optimal control problems. SIAM J. Control and Optim., 28(4):966–979, 1990.
  • Peng (1992) S. Peng. A generalized dynamic programming principle and Hamilton-Jacobi-Bellman equation. Stochastics Stochastics Rep., 38:119–134, 1992.
  • Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tang et al. (2022) W. Tang, Y. P. Zhang, and X. Y. Zhou. Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191–3216, 2022.
  • Wang and Zhou (2020) H. Wang and X. Y. Zhou. Continuous-time mean-variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308, 2020.
  • Wang et al. (2020) H. Wang, T. Zariphopoulou, and X. Y. Zhou. Exploration versus exploitation in reinforcement learning: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34, 2020.
  • Yong and Zhou (1999) J. Yong and X. Y. Zhou. Stochastic controls: Hamiltonian systems and HJB equations. Springer, New York, 1999.
  • Zhou (2023) X. Y. Zhou. The curse of optimality, and how to break it? Cambridge University Press, Part V-New Frontiers for Stochastic Control in Finance, 2023.
  • Zhou and Li (2000) X. Y. Zhou and D. Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Appl. Math. Optim., 42:19–33, 2000.