

Parameter learning: stochastic optimal control approach with reinforcement learning

Shuzhen Yang Shandong University-Zhong Tai Securities Institute for Financial Studies, Shandong University, PR China, ([email protected]). This work was supported by the National Key R&D program of China (Grant No.2018YFA0703900, ZR2019ZD41), National Natural Science Foundation of China (Grant No.11701330), and Taishan Scholar Talent Project Youth Project.
Abstract

In this study, we develop a stochastic optimal control approach with a reinforcement learning structure to learn the unknown parameters appearing in the drift and diffusion terms of a stochastic differential equation. By choosing an appropriate cost functional and starting from a classical optimal feedback control, we translate the original optimal control problem into a new control problem in which the unknown parameter plays the role of the control, and the related optimal control can be used to estimate the unknown parameter. We establish the mathematical framework of the dynamic equation for the exploratory state, which is consistent with existing results. Furthermore, we consider the linear stochastic differential equation case in which the drift or the diffusion term contains an unknown parameter. Then, we investigate the general case in which both the drift and diffusion terms contain unknown parameters. For these cases, we show that the optimal density function is Gaussian and can be used to estimate the unknown parameters. Based on the obtained parameter estimates, we can carry out empirical analysis for a given model.

KEYWORDS: Parameter estimation; Stochastic optimal control; Reinforcement learning; Linear SDE; Exploration

1 Introduction

Zhou (2023) commented that "We strive to seek optimality, but often find ourselves trapped in bad 'optimal' solutions that are either local optimizers, or are too rigid to leave any room for errors". Indeed, exploration through randomization offers a way to break this curse of optimality. In the present study, we try to balance model-based and model-free methods; that is, we aim to use the reinforcement learning structure to explore the unknown parameters appearing in the model-based method.

Recently, Wang et al. (2020) first considered reinforcement learning in continuous time and space, introducing an exploratory formulation for the state dynamics and an entropy-regularized reward functional. Wang and Zhou (2020) then solved the continuous-time mean-variance portfolio selection problem within this reinforcement learning stochastic optimal control framework. Tang et al. (2022) considered the exploratory Hamilton-Jacobi-Bellman (HJB) equation formulated by Wang et al. (2020) in the context of reinforcement learning in continuous time, and established the well-posedness and regularity of the viscosity solution of the HJB equation. Gao et al. (2022) studied the temperature control problem for Langevin diffusions in the context of nonconvex optimization under the regularized exploratory formulation developed by Wang et al. (2020). Differing from the continuous-time entropy-regularized reinforcement learning problem of Wang et al. (2020), Han et al. (2023) proposed a Choquet regularizer to measure and manage the level of exploration in reinforcement learning.

For the classical stochastic optimal control theory and related applications, we refer the reader to the monographs Fleming and Rishel (1975); Yong and Zhou (1999); Fleming and Soner (2006). For the recursive stochastic optimal control problem and the related dynamic programming principle, see Pardoux and Peng (1990); Peng (1990, 1992). It is well known that classical optimal control theory can be applied to solve the mean-variance model; see Zhou and Li (2000); Basak and Chabakauri (2010); Björk et al. (2014); Dai et al. (2021). When we apply the classical optimal control theory, for example to the mean-variance portfolio problem, we first need to estimate the parameters appearing in the model. Classical statistical methods such as moment estimation and maximum likelihood estimation can be employed, but they rely heavily on stringent assumptions on the observed samples. Furthermore, based on the observations (historical data), it is difficult to estimate time-varying parameters in model-based problems. Reinforcement learning algorithms have been widely used in optimization, engineering, finance, and other fields. In particular, Wang et al. (2020) first introduced reinforcement learning into the continuous-time stochastic optimal control problem.

However, it is important to find a better estimate of the parameters appearing in the model. Based on the estimated parameters, we can carry out empirical and time-trend analysis. Therefore, in this study, we first show how to use a stochastic optimal control approach with a reinforcement learning structure to learn the unknown parameters appearing in the dynamic equation. We consider the following stochastic optimal control problem with an unknown deterministic parameter \beta(\cdot),

\mathrm{d}X^{u}_{s}=b(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}s+\sigma(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}W_{s},\ X^{u}_{t}=x, (1.1)

where b,\sigma are given deterministic functions, W(\cdot) is a Brownian motion, and u(\cdot) is an input control. For a given input control u(\cdot), we can observe the output state X^{u}(\cdot). In the classical mean-variance portfolio framework, the parameter \beta(\cdot) in (1.1) can be used to describe the mean and volatility of the risky asset. Thus, it is useful to obtain an estimate of the unknown parameter \beta(\cdot).

Now, we describe the details of the stochastic optimal control approach with reinforcement learning structure for learning the unknown parameters. Step 1: Based on equation (1.1), by choosing an appropriate cost functional, we obtain a feedback optimal control u^{*}(\cdot) which contains the unknown parameter \beta(\cdot), denoted by u^{*}_{s}=u^{*}(\beta_{s},s,X^{*}_{s}),\ t\leq s\leq T. Step 2: We replace the unknown parameter \beta(\cdot) in the optimal control u^{*}(\cdot) with a new deterministic control \rho(\cdot), and denote the input control by u^{\rho}_{s}=u^{\rho}(\rho_{s},s,X^{\rho}_{s}),\ t\leq s\leq T. Step 3: We put the new control u^{\rho}(\cdot) into equation (1.1) and the related cost functional, and establish a new optimal control problem. Indeed, the optimal control of the new exploratory control problem can be used to estimate the unknown parameters. Therefore, we consider the linear stochastic differential equation case where the drift or the diffusion term contains an unknown parameter, and the general case where both the drift and diffusion terms contain unknown parameters. In both cases, we find that the optimal density function can be used to learn the unknown parameters.

In this paper, we focus on developing the theory of a stochastic optimal control approach with reinforcement learning structure to learn the unknown parameters appearing in equation (1.1). The application of this new theory is left for future work. However, we refer the reader to the following references for the study of algorithms for the optimal control. A unified framework to study policy evaluation and the associated temporal difference methods in continuous time was investigated in Jia and Zhou (2022a). Jia and Zhou (2022b) studied the policy gradient for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). Dong (2022) studied the optimal stopping problem under the exploratory framework, and established the related HJB equation and algorithm. Furthermore, Jia and Zhou (2022c) introduced q-learning in continuous time and space. For policy evaluation, we follow Doya (2000) for learning the value function. An introduction to reinforcement learning can be found in Sutton and Barto (2018); for deep learning and related topics, see Goodfellow et al. (2016).

The main contributions of this study are threefold:

(i). To learn the unknown parameters appearing in the dynamic equation of the state, we develop a stochastic optimal control approach with a reinforcement learning structure.

(ii). By choosing an appropriate cost functional, based on a classical optimal feedback control, we translate the original optimal control problem into a new control problem in which the related optimal control is used to estimate the unknown parameters. Based on the obtained parameter estimates, we can carry out empirical analysis for a given model.

(iii). We consider the linear stochastic differential equation case where the drift or the diffusion term contains an unknown parameter, and the general case where both the drift and diffusion terms contain unknown parameters. We show that the optimal density function is Gaussian and can be used to learn the unknown parameters.

The remainder of this study is organized as follows. In Section 2, we formulate the stochastic optimal control approach with reinforcement learning structure and show how to learn the unknown parameter. In Section 3, we investigate the exploratory HJB equation for this approach. In Section 4, we consider the linear SDE case, where the drift or the diffusion term is allowed to contain the unknown parameter. In Section 5, we generalize the model of Section 4 to the case where both the drift and diffusion terms contain unknown parameters, and give an example to verify the main results. Finally, we conclude this study in Section 6.

2 Formulation of the model

Let (\Omega,\mathcal{F},P) be a probability space, \{W_{t}\}_{t\geq 0} a Brownian motion, and \{\mathcal{F}_{t}\}_{t\geq 0} the filtration generated by \{W_{t}\}_{t\geq 0}, where \mathcal{F}=\mathcal{F}_{T} for a given terminal time T. We introduce the following stochastic differential equation with deterministic unknown parameter \beta_{s}\in\mathbb{R},\ t\leq s\leq T, and control u(\cdot)\in\mathcal{U}[t,T],

\mathrm{d}X^{u}_{s}=b(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}s+\sigma(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}W_{s},\ X^{u}_{t}=x, (2.1)

where \mathcal{U}[t,T] is the set of all progressively measurable, square integrable processes on [t,T] taking values in a Euclidean space. Based on the state \{X^{u}_{s}\}_{t\leq s\leq T}, we consider the running and terminal cost functional,

J(t,x;u(\cdot))=\mathbb{E}\left[\int_{t}^{T}f(X_{s}^{u},u_{s})\mathrm{d}s+\Phi(X_{T}^{u})\right], (2.2)

and the target is to minimize J(t,x;u(\cdot)) over u(\cdot)\in\mathcal{U}[t,T], that is,

\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).

The conditions on the functions b,\sigma,f and \Phi in equations (2.1) and (2.2) will be given later. Under mild conditions, we can show that there exists a feedback optimal control u^{*}_{s}=u^{*}(\beta_{s},s,X^{*}_{s}),\ t\leq s\leq T (see Yong and Zhou (1999) for further details), where X^{*} is the optimal state under the optimal control u^{*}.

Remark 2.1.

Indeed, we can choose a cost functional J(t,x;u(\cdot)) such that there exists a feedback optimal control which contains the unknown parameter \beta(\cdot). See Sections 4 and 5 for further details.

Now, we state the assumptions of this study. The functions b,\sigma,f and \Phi are continuous in all their variables.

Assumption 2.1.

The functions b and \sigma are of linear growth and are Lipschitz continuous in the second and third variables with Lipschitz constant L.

Assumption 2.2.

The functions f(x,u) and \Phi(x) are of polynomial growth in x and u.

Assumption 2.3.

\beta(\cdot) is a deterministic piecewise continuous function on [t,T]. The feedback optimal control u^{*}(\beta_{s},s,x),\ t\leq s\leq T is uniformly Lipschitz continuous in x, with Lipschitz constant L.

In practical analysis, we usually assume the form of the functions b,\sigma,f and \Phi and leave the unknown parameter \beta(\cdot) to be estimated. Some classical statistical estimators can be applied, for example moment estimation and maximum likelihood estimation; however, these estimation methods depend heavily on the observations. In this study, we aim to develop a new estimation method based on the explicit feedback control u^{*}(\cdot).

It should be noted that we can observe the value of the state X^{u}(\cdot) under a given control u(\cdot), but not the unknown parameter \beta(\cdot). In the following, we develop the model used to estimate the parameter \beta(\cdot). Since we do not know the value of the parameter \beta(\cdot), we replace \beta(\cdot) by a new deterministic control process \rho(\cdot) in the feedback optimal control u^{*}(\beta_{s},s,X^{*}_{s}), and denote u^{\rho}_{s}=u^{*}(\rho_{s},s,X^{\rho}_{s}), where X^{\rho}_{s} satisfies the following stochastic differential equation,

\mathrm{d}X^{\rho}_{s}=\hat{b}(\beta_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}s+\hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}W_{s},\ X^{\rho}_{t}=x, (2.3)

where \hat{b}(\beta_{s},X_{s}^{\rho},\rho_{s})=b(\beta_{s},X_{s}^{\rho},u^{\rho}_{s}) and \hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})=\sigma(\beta_{s},X_{s}^{\rho},u^{\rho}_{s}). Under the above assumptions, we have the following existence and uniqueness result for the solution of equation (2.3).

Theorem 2.1.

Let Assumptions 2.1 and 2.3 hold. Then, equation (2.3) admits a unique solution.

Furthermore, we rewrite the cost functional (2.2) as follows,

\hat{J}(t,x;\rho(\cdot))=\mathbb{E}\left[\int_{t}^{T}\hat{f}(X_{s}^{\rho},\rho_{s})\mathrm{d}s+\hat{\Phi}(X_{T}^{\rho})\right], (2.4)

where \rho_{s}\in\mathbb{R}, \hat{f}(X_{s}^{\rho},\rho_{s})=f(X_{s}^{\rho},u^{\rho}_{s}), and \hat{\Phi}(X_{T}^{\rho})=\Phi(X_{T}^{\rho}), t\leq s\leq T. We denote by \mathcal{D}[t,T] the set of all deterministic piecewise continuous functions on [t,T]. Obviously, we have the following equivalence between the cost functionals J(t,x;u(\cdot)) and \hat{J}(t,x;\rho(\cdot)).

Theorem 2.2.

Let Assumption 2.3 hold, J(t,x;u(\cdot)) be the cost functional given in (2.2), and \hat{J}(t,x;\rho(\cdot)) be given in (2.4). Then, we have

\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}\hat{J}(t,x;\rho(\cdot))=\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).
Proof.

Note that \rho(\cdot)\in\mathcal{D}[t,T] is a deterministic piecewise continuous process. From Assumption 2.3, it is easy to verify that u^{\rho}(\cdot)=u^{*}(\rho(\cdot),\cdot,X^{\rho}(\cdot))\in\mathcal{U}[t,T]. Thus,

\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}\hat{J}(t,x;\rho(\cdot))\geq\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).

On the other hand, taking \rho(\cdot)=\beta(\cdot)\in\mathcal{D}[t,T] gives u^{\rho}(\cdot)=u^{*}(\beta(\cdot),\cdot,X^{\rho}(\cdot))=u^{*}(\cdot), and since

J(t,x;u^{*}(\cdot))=\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)),

we have

\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}\hat{J}(t,x;\rho(\cdot))\leq\inf_{u(\cdot)\in\mathcal{U}[t,T]}J(t,x;u(\cdot)).

This completes the proof. ∎

Remark 2.2.

Based on Theorem 2.2, if the cost functional \hat{J}(t,x;\rho(\cdot)) admits a unique optimal control \rho^{*}(\cdot), then \beta(\cdot)=\rho^{*}(\cdot), which means that we can use the new optimal control problem to estimate the parameter in the original optimal control problem. Indeed, we can choose a cost functional which satisfies this requirement. Further details can be found in Sections 4 and 5.

From equation (2.3), we can observe the value of the state X^{\rho}(\cdot) for a given input control \rho(\cdot). Following the idea investigated in Wang et al. (2020), we use a reinforcement learning (RL) structure to learn the optimal control \rho^{*}(\cdot). Motivated by the law of large numbers applied to samples of the state under the density function \pi(\cdot), Wang et al. (2020) introduced the following exploratory dynamic equation for the state X^{\pi}(\cdot),

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (2.5)

where \pi_{s} is the density function of the input control \rho_{s},

\tilde{b}(\beta_{s},x,\pi_{s})=\int_{\mathbb{R}}\hat{b}(\beta_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho},

for t\leq s\leq T.

It should be noted that the observations of the exploratory state X^{\pi}(\cdot) satisfy equation (2.3). Based on this fact, we can prove that X^{\pi}(\cdot) satisfies equation (2.5).

Theorem 2.3.

Let Assumptions 2.1, 2.2, and 2.3 hold, and assume that the state X^{\pi}(\cdot) follows a diffusion process. Then, X^{\pi}(\cdot) satisfies equation (2.5).

Proof.

We assume that the state X^{\pi}(\cdot) follows the diffusion process,

\mathrm{d}X^{\pi}_{s}=h(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+g(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (2.6)

where the functions h and g are to be determined.

Let us consider a sequence of samples \{x^{i}(\cdot)\}_{i=1}^{N} which are the observations of X^{\pi}(\cdot) with \{W^{i}(\cdot)\}_{i=1}^{N} and \{\rho^{i}(\cdot)\}_{i=1}^{N}, where the W^{i}(\cdot) are independent sample paths of the Brownian motion W(\cdot), and \rho^{i}(\cdot) is a control drawn from the density function \pi(\cdot), 1\leq i\leq N. For given t\leq s\leq T and sufficiently small \Delta s>0, x^{i}(\cdot) satisfies

x^{i}_{s+\Delta s}-x^{i}_{s}=\int_{s}^{s+\Delta s}\hat{b}(\beta_{r},x_{r}^{i},\rho^{i}_{r})\mathrm{d}r+\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r},\rho^{i}_{r})\mathrm{d}W^{i}_{r},\ x^{i}_{t}=x, (2.7)

which leads to

x^{i}_{s+\Delta s}-x^{i}_{s}=\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\Delta s+\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r},\rho^{i}_{r})\mathrm{d}W^{i}_{r}+\int_{s}^{s+\Delta s}\left[\hat{b}(\beta_{r},x_{r}^{i},\rho^{i}_{r})-\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\right]\mathrm{d}r.

Applying the classical law of large numbers, we have the following convergence results in probability:

\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\left(x^{i}_{s+\Delta s}-x^{i}_{s}\right)=\mathbb{E}[X^{\pi}_{s+\Delta s}-X^{\pi}_{s}];
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\Delta s=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\,\Delta s\right];
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r},\rho^{i}_{r})\mathrm{d}W^{i}_{r}=0;
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\int_{s}^{s+\Delta s}\left[\hat{b}(\beta_{r},x_{r}^{i},\rho^{i}_{r})-\hat{b}(\beta_{s},x_{s}^{i},\rho^{i}_{s})\right]\mathrm{d}r=o(\Delta s),

where the last equality follows from the continuous dependence of the solution of the stochastic differential equation, and o(\Delta s) denotes a higher order infinitesimal of \Delta s; thus,

\mathbb{E}[X^{\pi}_{s+\Delta s}-X^{\pi}_{s}]=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\,\Delta s\right]+o(\Delta s). (2.8)

Dividing both sides of equation (2.8) by \Delta s and letting \Delta s\to 0, it follows that

\lim_{\Delta s\to 0}\frac{\mathbb{E}[X^{\pi}_{s+\Delta s}-X^{\pi}_{s}]}{\Delta s}=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\right]. (2.9)

Combining equations (2.6) and (2.9), we have that

\mathbb{E}\left[h(\beta_{s},X^{\pi}_{s},\pi_{s})\right]=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\right]. (2.10)

It is worth noting that, for any given A\in\mathcal{F}_{s}, from equations (2.6) and (2.7), we have

\mathrm{d}X^{\pi}_{s}1_{A}=h(\beta_{s},X_{s}^{\pi}1_{A},\pi_{s})1_{A}\mathrm{d}s+g(\beta_{s},X_{s}^{\pi}1_{A},\pi_{s})1_{A}\mathrm{d}W_{s},

and

x^{i}_{s+\Delta s}1_{A^{i}}-x^{i}_{s}1_{A^{i}}=\int_{s}^{s+\Delta s}\hat{b}(\beta_{r},x_{r}^{i}1_{A^{i}},\rho^{i}_{r})1_{A^{i}}\mathrm{d}r+\int_{s}^{s+\Delta s}\hat{\sigma}(\beta_{r},x^{i}_{r}1_{A^{i}},\rho^{i}_{r})1_{A^{i}}\mathrm{d}W^{i}_{r},

where 1_{A} is the indicator function of the set A, and A^{i} is measurable with respect to the filtration generated by W^{i}_{r},\ t\leq r\leq s. Then, similarly to the above proof and equation (2.10), we have

\mathbb{E}\left[h(\beta_{s},X^{\pi}_{s},\pi_{s})1_{A}\right]=\mathbb{E}\left[\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho\, 1_{A}\right], (2.11)

and thus

h(\beta_{s},X^{\pi}_{s},\pi_{s})=\int_{\mathbb{R}}\hat{b}(\beta_{s},X^{\pi}_{s},\rho)\pi_{s}(\rho)\mathrm{d}\rho,\quad P-a.s. (2.12)

Applying Itô's formula to (X^{\pi}_{s})^{2}, it follows that

\mathrm{d}(X^{\pi}_{s})^{2}=\left[2X^{\pi}_{s}h(\beta_{s},X_{s}^{\pi},\pi_{s})+g^{2}(\beta_{s},X_{s}^{\pi},\pi_{s})\right]\mathrm{d}s+2X^{\pi}_{s}g(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s}. (2.13)

Now, similarly to the proof of equation (2.12), we consider the samples \{(x^{i}(\cdot))^{2}\}_{i=1}^{N}, and obtain

2X^{\pi}_{s}h(\beta_{s},X_{s}^{\pi},\pi_{s})+g^{2}(\beta_{s},X_{s}^{\pi},\pi_{s})=\int_{\mathbb{R}}\left[2X^{\pi}_{s}\hat{b}(\beta_{s},X_{s}^{\pi},\rho)+\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\right]\pi_{s}(\rho)\mathrm{d}\rho. (2.14)

Combining equations (2.12) and (2.14), we have

g^{2}(\beta_{s},X_{s}^{\pi},\pi_{s})=\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho,\quad P-a.s. (2.15)

Thus, we can choose

g(\beta_{s},X_{s}^{\pi},\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho},

which completes the proof. ∎

Remark 2.3.

In Theorem 2.3, we show that the state X^{\pi}(\cdot) satisfies equation (2.5). Furthermore, from equation (2.15), for t\leq s\leq T, we have either

g(\beta_{s},X_{s}^{\pi},\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho},

or

g(\beta_{s},X_{s}^{\pi},\pi_{s})=-\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},X_{s}^{\pi},\rho)\pi_{s}(\rho)\mathrm{d}\rho}.

Indeed, the Brownian motion W_{s} is normally distributed with mean 0 and variance s, so the two choices generate the same law of the state, and we simply take the first one.
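
As a quick numerical illustration of Theorem 2.3 (not part of the original development), one can check that the empirical mean and variance of the sampled increments of equation (2.3) match the exploratory coefficients of equation (2.5). The coefficients \hat{b}(\beta,x,\rho)=\beta x-\rho, \hat{\sigma}(\beta,x,\rho)=\rho and the Gaussian exploration density below are illustrative assumptions, not the paper's model.

```python
import numpy as np

# Monte Carlo sketch of Theorem 2.3 under assumed coefficients:
# b_hat(beta, x, rho) = beta*x - rho, sigma_hat(beta, x, rho) = rho,
# with the control rho drawn from the Gaussian density pi = N(m, s^2).
rng = np.random.default_rng(0)
beta, x0, dt = 0.5, 1.0, 1e-3
m, s = 0.3, 0.4                        # mean and standard deviation of pi
N = 200_000                            # number of sample paths

rho = rng.normal(m, s, size=N)         # controls sampled from pi
dW = rng.normal(0.0, np.sqrt(dt), size=N)
dX = (beta * x0 - rho) * dt + rho * dW # one Euler step of equation (2.3)

# Exploratory coefficients of equation (2.5) for this pi:
b_tilde = beta * x0 - m                # integral of b_hat against pi
sigma_tilde_sq = m ** 2 + s ** 2       # integral of sigma_hat^2 against pi

print("empirical drift    :", dX.mean() / dt, " vs b_tilde       :", b_tilde)
print("empirical variance :", dX.var() / dt, " vs sigma_tilde^2 :", sigma_tilde_sq)
```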

Furthermore, in the exploratory framework, we rewrite the cost functional (2.4) as follows,

\tilde{J}(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}\left[\hat{f}(X_{s}^{\pi},\rho)\pi_{s}(\rho)+\lambda\pi_{s}(\rho)\ln\pi_{s}(\rho)\right]\mathrm{d}\rho\mathrm{d}s+\hat{\Phi}(X_{T}^{\pi})\right], (2.16)

where the term

\lambda\int_{\mathbb{R}}\pi_{s}(\rho)\ln\pi_{s}(\rho)\mathrm{d}\rho

is (up to the factor \lambda) the negative of Shannon's differential entropy, which measures the level of exploration, and \lambda>0 is the temperature constant balancing exploitation and exploration. We denote by \mathcal{A}[t,T] the set of all admissible density function controls on \mathcal{D}[t,T]. Thus, our target is to minimize the cost functional (2.16) over \mathcal{A}[t,T].
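
For concreteness, the entropy-regularized running term in (2.16) can be evaluated in closed form when \pi_{s} is Gaussian; the quadratic running cost \hat{f}(x,\rho)=(\rho-x)^{2} used below is only an illustrative assumption.

```python
import numpy as np

# Illustrative evaluation of the integrand of (2.16) for a Gaussian density pi = N(m, s^2)
# and an assumed running cost f_hat(x, rho) = (rho - x)^2.
lam, x, m, s = 0.1, 1.0, 0.4, 0.5

rho = np.linspace(-8.0, 8.0, 200_001)
d_rho = rho[1] - rho[0]
pi = np.exp(-(rho - m) ** 2 / (2.0 * s ** 2)) / np.sqrt(2.0 * np.pi * s ** 2)

numeric = np.sum(((rho - x) ** 2) * pi + lam * pi * np.log(pi)) * d_rho

# Closed form: E[(rho - x)^2] = (m - x)^2 + s^2, and the differential entropy of N(m, s^2)
# equals 0.5 * ln(2*pi*e*s^2), so the entropy term contributes -lam * 0.5 * ln(2*pi*e*s^2).
closed = (m - x) ** 2 + s ** 2 - lam * 0.5 * np.log(2.0 * np.pi * np.e * s ** 2)

print(numeric, closed)   # the two values should agree closely
```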

3 Hamilton-Jacobi-Bellman Approach

Now, we consider equation (2.5) with the cost functional (2.16). Similarly to Wang et al. (2020), employing the classical dynamic programming principle, one obtains

v(t,x)=\inf_{\pi(\cdot)\in\mathcal{A}[t,s]}\mathbb{E}\left[\int_{t}^{s}\int_{\mathbb{R}}\left[\hat{f}(X_{r}^{\pi},\rho)\pi_{r}(\rho)+\lambda\pi_{r}(\rho)\ln\pi_{r}(\rho)\right]\mathrm{d}\rho\mathrm{d}r+v(s,X^{\pi}_{s})\right],

and the related Hamilton-Jacobi-Bellman (HJB) equation is

0=\min_{\pi_{t}\in\mathcal{A}[t,t]}\bigg[\partial_{t}v(t,x)+\int_{\mathbb{R}}\bigg(\hat{f}(x,\rho)+\lambda\ln\pi_{t}(\rho)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x)\bigg)\pi_{t}(\rho)\mathrm{d}\rho\bigg], (3.1)

with terminal condition v(T,x)=\hat{\Phi}(x).

It is worth noting that equation (3.1) admits an optimal control \pi^{*}_{t}(\cdot) which satisfies

\pi^{*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left(\hat{f}(x,\rho)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left(\hat{f}(x,u)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,u)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,u)\partial_{x}v(t,x)\right)\right)\mathrm{d}u}. (3.2)

Letting \lambda\to 0, \pi^{*}_{t}(\cdot) reduces to a Dirac measure, where

\pi^{*}_{t}(\rho)=\delta_{\rho^{*}_{t}}(\rho),\quad\rho\in\mathbb{R},

and \rho_{t}^{*} solves the following equation in \rho,

0=\partial_{\rho}\hat{f}(x,\rho)+\hat{\sigma}(\beta_{t},x,\rho)\partial_{\rho}\hat{\sigma}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\partial_{\rho}\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x). (3.3)

By Theorem 2.2, equation (3.3) has at least one solution \rho_{t}^{*}=\beta_{t}, and thus

0=\partial_{\rho}\hat{f}(x,\beta_{t})+\hat{\sigma}(\beta_{t},x,\beta_{t})\partial_{\rho}\hat{\sigma}(\beta_{t},x,\beta_{t})\partial_{xx}v(t,x)+\partial_{\rho}\hat{b}(\beta_{t},x,\beta_{t})\partial_{x}v(t,x).

Furthermore, as a function of \rho, \mathcal{L}(t,x,\beta_{t},\rho) attains its minimum at \beta_{t}, where

\mathcal{L}(t,x,\beta_{t},\rho)=\hat{f}(x,\rho)+\frac{1}{2}\hat{\sigma}^{2}(\beta_{t},x,\rho)\partial_{xx}v(t,x)+\hat{b}(\beta_{t},x,\rho)\partial_{x}v(t,x).

Since \lambda>0, \pi^{*}_{t}(\rho) attains its maximum at \beta_{t}. These observations motivate us to recover the optimal estimate of the parameter \beta(\cdot) from the optimal density function \pi^{*}(\cdot).
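
The Gibbs form (3.2) can be tabulated numerically on a grid once v and its derivatives are known; its mode then serves as the estimate of \beta_{t}. The sketch below uses illustrative coefficients mirroring the linear example of Section 4.1 (\hat{f}\equiv 0, \hat{b}(\beta,x,\rho)=-\rho, \hat{\sigma}(\beta,x,\rho)=(\beta-1)x-\rho, and v(t,x)=x^{2} at a fixed point); these choices are assumptions made only for the demonstration.

```python
import numpy as np

# Numerical sketch of the Gibbs-type optimal density (3.2) on a grid (illustrative coefficients).
lam, beta, x = 0.2, 0.7, 1.0
v_x, v_xx = 2.0 * x, 2.0                     # derivatives of the assumed v(t, x) = x^2

rho = np.linspace(-3.0, 3.0, 4001)
d_rho = rho[1] - rho[0]

# Pointwise Hamiltonian L(t, x, beta, rho) with f_hat = 0:
L = 0.5 * ((beta - 1.0) * x - rho) ** 2 * v_xx + (-rho) * v_x

w = np.exp(-L / lam)
pi_star = w / (w.sum() * d_rho)              # normalized density of (3.2)

rho_hat = rho[np.argmax(pi_star)]
print("mode of pi*:", rho_hat, "   expected beta*x:", beta * x)
```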

Based on the above analysis, we obtain the following result.

Theorem 3.1.

Suppose that \mathcal{L}(t,x,\beta_{t},\rho) satisfies, for any \rho\neq\beta_{t},

\mathcal{L}(t,x,\beta_{t},\rho)>\mathcal{L}(t,x,\beta_{t},\beta_{t}). (3.4)

Then we have

{\arg\max}_{\rho\in\mathbb{R}}\pi^{*}_{t}(\rho)=\beta_{t},\ \ 0\leq t\leq T.
Proof.

From condition (3.4), \mathcal{L}(t,x,\beta_{t},\rho) attains its unique minimum at \beta_{t}, and hence \pi^{*}_{t}(\rho) attains its unique maximum at \beta_{t}. Obviously, {\arg\max}_{\rho\in\mathbb{R}}\pi^{*}_{t}(\rho) equals the parameter \beta_{t}. ∎

In the following section, we consider a linear stochastic differential equation (SDE) model to verify the results given in Theorem 3.1, where the drift or the diffusion term contains an unknown parameter.

4 Linear SDE problem

We now consider the linear SDE case, where the drift or the diffusion term is allowed to contain the unknown parameter. We first study the case where the diffusion term contains the unknown parameter, and then investigate the case where the drift term contains the unknown parameter.

4.1 The diffusion term with unknown parameter

We consider the following linear SDE,

\mathrm{d}X^{u}_{s}=\left[X^{u}_{s}+u_{s}\right]\mathrm{d}s+\left[\beta_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}W_{s}, (4.1)

with initial condition X^{u}_{t}=x, where the parameter \beta(\cdot) needs to be estimated. The cost functional is given as follows,

J(t,x;u(\cdot))=\mathbb{E}[\left(X^{u}_{T}\right)^{2}], (4.2)

where u(\cdot)\in\mathcal{U}[t,T]. Based on the classical optimal control theory, the related HJB equation is given by

0=\partial_{t}v(t,x)+\inf_{u\in U}\left[\frac{1}{2}(\beta_{t}x+u)^{2}\partial_{xx}v(t,x)+(x+u)\partial_{x}v(t,x)\right].

The optimal control is given by

u^{*}(s,X^{*}_{s})=-(1+\beta_{s})X^{*}_{s},\quad t\leq s\leq T.

We regard \beta_{s}X^{*}_{s} as \rho_{s}, and obtain

u^{\rho}(s,X^{\rho}_{s})=-X^{\rho}_{s}-\rho_{s},\quad t\leq s\leq T.
Remark 4.1.

In this paper, we aim to establish the theory for estimating the parameter appearing in the state's dynamic equation. To derive the explicit solution of the optimal control \pi^{*}(\cdot), here we replace \beta(\cdot)X^{*}(\cdot) by \rho(\cdot), which is essentially the same as the method developed in Section 3. Note that, once the optimal control \rho^{*}(\cdot) is obtained, we can divide \rho^{*}(\cdot) by the state x to recover the optimal estimate of the parameter \beta(\cdot). Indeed, when dealing with specific problems, this kind of structural device should be chosen according to the properties of the problem.

Then equation (2.3) becomes

\mathrm{d}X^{\rho}_{s}=-\rho_{s}\mathrm{d}s+(\beta_{s}-1-\rho_{s}{/}X^{\rho}_{s})X^{\rho}_{s}\mathrm{d}W_{s},\quad X^{\rho}_{t}=x. (4.3)

Now, applying Theorem 2.3, we can formulate the RL stochastic optimal control problem. The exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (4.4)

where

\tilde{b}(\beta_{s},x,\pi_{s})=-\int_{\mathbb{R}}\rho\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}(\beta_{s}-1-\rho{/}x)^{2}x^{2}\pi_{s}(\rho)\mathrm{d}\rho}.

The exploratory cost functional becomes

J(t,x;\pi(\cdot))=\mathbb{E}[\left(X^{\pi}_{T}\right)^{2}]. (4.5)

By the formula of the optimal control in (3.2), we have that

\pi^{*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}(\beta_{t}-1-\rho{/}x)^{2}x^{2}\partial_{xx}v(t,x)-\rho\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}(\beta_{t}-1-u{/}x)^{2}x^{2}\partial_{xx}v(t,x)-u\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (4.6)

and thus

\pi^{*}_{t}(\rho)=\frac{1}{\sqrt{2\pi\sigma^{2}(t,x)}}\exp\left(-\frac{\left(\rho-\mu(t,x)\right)^{2}}{2\sigma^{2}(t,x)}\right),

where

\mu(t,x)=-\frac{a_{2}(t,x)}{2a_{1}(t,x)};
\sigma^{2}(t,x)=\frac{\lambda}{2a_{1}(t,x)};
a_{1}(t,x)=\frac{\partial_{xx}v(t,x)}{2};
a_{2}(t,x)=(1-\beta_{t})x\partial_{xx}v(t,x)-\partial_{x}v(t,x).

Substituting \pi^{*}_{t}(\rho) into equation (3.1), a simple calculation yields

0=\partial_{t}v(t,x)+(\beta_{t}-1)^{2}x^{2}a_{1}(t,x)-\mu^{2}(t,x)a_{1}(t,x)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{a_{1}(t,x)}. (4.7)

We assume that the classical solution of equation (4.7) takes the form

v(t,x)=\alpha_{1}(t)x^{2}+\alpha_{2}(t).

Then, it follows that

\mu(t,x)=\beta_{t}x;
\sigma^{2}(t,x)=\frac{\lambda}{2\alpha_{1}(t)};
a_{1}(t,x)=\alpha_{1}(t);
a_{2}(t,x)=-2\beta_{t}\alpha_{1}(t)x.

Now, substituting the formula of v(t,x) into equation (4.7), we obtain

0=\alpha^{\prime}_{1}(t)x^{2}+\alpha^{\prime}_{2}(t)+(1-2\beta_{t})\alpha_{1}(t)x^{2}-\frac{\lambda}{2}\ln\frac{\pi\lambda}{\alpha_{1}(t)}. (4.8)

Thus, \alpha_{1}(t) and \alpha_{2}(t) satisfy the following equations,

0=\alpha^{\prime}_{1}(t)+(1-2\beta_{t})\alpha_{1}(t),\quad\alpha_{1}(T)=1;
0=\alpha^{\prime}_{2}(t)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{\alpha_{1}(t)},\quad\alpha_{2}(T)=0,

and

\alpha_{1}(t)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right);
\alpha_{2}(t)=\int_{t}^{T}\frac{\lambda}{2}\ln\frac{\pi\lambda}{\alpha_{1}(s)}\mathrm{d}s,

which solve the value function v(t,x); the optimal control \pi^{*}(\rho) is the Gaussian density function,

\pi_{t}^{*}(\rho)=\frac{1}{\sqrt{\pi\lambda{/}\alpha_{1}(t)}}\exp\left(-\frac{\left(\rho-\beta_{t}x\right)^{2}}{\lambda{/}\alpha_{1}(t)}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\alpha_{1}(t)}),

where

\alpha_{1}(t)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right).

We summarize the above results in the following theorem.

Theorem 4.1.

For the exploratory dynamic equation (4.4) with cost functional (4.5), the optimal control is given as follows,

\pi_{t}^{*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\alpha_{1}(t)}),\ 0\leq t\leq T,

where \mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\alpha_{1}(t)}) denotes the Gaussian density function with mean \beta_{t}x and variance \frac{\lambda}{2\alpha_{1}(t)}.

Remark 4.2.

In Theorem 4.1, we give the explicit formula of the optimal control \pi_{t}^{*}(\cdot),\ 0\leq t\leq T. From this control, we can use the mean \beta_{t}x to estimate the parameter \beta(\cdot). In this exploratory stochastic optimal control problem, the variance \frac{\lambda}{2\alpha_{1}(t)} of the optimal control \pi_{t}^{*}(\cdot) measures the level of exploration of the RL procedure. It is worth noting that, as \lambda\to 0, \pi_{t}^{*}(\cdot) reduces to the Dirac measure \delta_{\beta_{t}x}(\cdot).
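
A minimal numerical sketch of Remark 4.2, assuming a constant \beta for illustration: drawing controls from \pi_{t}^{*}=\mathcal{N}(\beta_{t}x,\ \lambda/(2\alpha_{1}(t))) and dividing the sample mean by the state x recovers \beta_{t}.

```python
import numpy as np

# Sketch of the estimation step in Remark 4.2 (assumed constant beta, chosen only to generate data).
rng = np.random.default_rng(1)
lam, T, t, x = 0.1, 1.0, 0.3, 2.0
beta = 0.6                                        # the "unknown" parameter
alpha_1 = np.exp((1.0 - 2.0 * beta) * (T - t))    # closed form of alpha_1(t) for constant beta

samples = rng.normal(beta * x, np.sqrt(lam / (2.0 * alpha_1)), size=50_000)
beta_hat = samples.mean() / x                     # divide by the state, as in Remark 4.1
print("beta_hat =", beta_hat, "   true beta =", beta)
```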

Now, we consider the following classical optimal control problem. The state satisfies

\mathrm{d}X^{\rho}_{s}=-\rho_{s}\mathrm{d}s+\left[(\beta_{s}-1)X^{\rho}_{s}-\rho_{s}\right]\mathrm{d}W_{s},\ X^{\rho}_{t}=x, (4.9)

and the cost functional is

J(t,x;u(\cdot))=\mathbb{E}[\left(X^{u}_{T}\right)^{2}]. (4.10)

The value function is defined by

\hat{v}(t,x)=\inf_{\rho(\cdot)\in\mathcal{D}[t,T]}J(t,x;u(\cdot)),

and satisfies the following HJB equation,

0=\min_{\rho\in\mathbb{R}}\left[\partial_{t}\hat{v}(t,x)+\frac{1}{2}\left[(\beta_{t}-1)x-\rho\right]^{2}\partial_{xx}\hat{v}(t,x)-\rho\partial_{x}\hat{v}(t,x)\right]. (4.11)

The optimal control of the HJB equation (4.11) is

\rho^{*}_{t}=\frac{(\beta_{t}-1)x\partial_{xx}\hat{v}(t,x)+\partial_{x}\hat{v}(t,x)}{\partial_{xx}\hat{v}(t,x)}.

Then, substituting \rho_{t}^{*} into equation (4.11), we assume that the value function \hat{v}(t,x) takes the form

\hat{v}(t,x)=b_{1}(t)x^{2}+b_{2}(t).

Thus, b_{1}(t) and b_{2}(t) satisfy the following equations,

0=b^{\prime}_{1}(t)+(1-2\beta_{t})b_{1}(t),\quad b_{1}(T)=1;
0=b^{\prime}_{2}(t),\quad b_{2}(T)=0,

and

b_{1}(t)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right);
b_{2}(t)=0.

The value function \hat{v}(t,x) is given as

\hat{v}(t,x)=\exp\left(\int_{t}^{T}(1-2\beta_{s})\mathrm{d}s\right)x^{2},

with the optimal control \rho^{*}_{t}=\beta_{t}x.

Thus, the optimal control \rho^{*}_{t} of the classical optimal control problem coincides with the mean of the optimal density function \pi^{*}_{t}. Therefore, based on the reinforcement learning (RL) stochastic optimal control structure presented in this study, we can use the RL method to learn the optimal density function \pi^{*}_{t} and then obtain the estimate of the parameter \beta(\cdot).
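
As a sanity check on the closed-form value function (an illustration only, with an assumed time-varying \beta), the ODE 0=b_{1}^{\prime}(t)+(1-2\beta_{t})b_{1}(t), b_{1}(T)=1 can be integrated backward numerically and compared with the exponential formula above.

```python
import numpy as np

# Backward integration of 0 = b_1'(t) + (1 - 2*beta_t) b_1(t), b_1(T) = 1, versus the closed form
# b_1(t) = exp(int_t^T (1 - 2*beta_s) ds), for an assumed path beta_s = 0.5 + 0.3*sin(2*pi*s).
T, n = 1.0, 20_000
s = np.linspace(0.0, T, n + 1)
beta = 0.5 + 0.3 * np.sin(2.0 * np.pi * s)
ds = s[1] - s[0]
g = 1.0 - 2.0 * beta

b1 = np.empty(n + 1)
b1[-1] = 1.0
for k in range(n - 1, -1, -1):          # explicit Euler step, marching backward from t = T
    b1[k] = b1[k + 1] * (1.0 + g[k + 1] * ds)

cum = np.concatenate(([0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * ds)))   # trapezoidal int_0^{s_k} g dr
closed = np.exp(cum[-1] - cum)          # exp(int_{s_k}^T g dr)
print("max abs deviation:", np.abs(b1 - closed).max())
```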

4.2 The drift term with unknown parameter

We could also consider the following model where the state satisfies

\mathrm{d}X^{u}_{s}=\left[\beta_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}s+\left[X^{u}_{s}+u_{s}\right]\mathrm{d}W_{s}, (4.12)

with initial condition X^{u}_{t}=x, where the parameter \beta(\cdot) needs to be estimated. Note that the parameter \beta(\cdot) appears in the drift term of the state equation (4.12). We construct the following cost functional, which differs from (4.10),

J(t,x;u(\cdot))=\mathbb{E}\int_{t}^{T}[(u_{s}+X^{u}_{s}+\beta_{s}X^{u}_{s})^{2}]\mathrm{d}s. (4.13)
Remark 4.3.

The cost functional (4.13) contains the unknown term \beta_{s}X^{u}_{s}. In fact, from equation (4.12), we can observe the value of \beta_{s}X^{u}_{s},\ t\leq s\leq T. Precisely, for a given input u(\cdot), we can obtain the value of J(t,x;u(\cdot)).

Indeed, if we consider the cost functional (4.10) for the state equation (4.12), the problem admits the unique optimal control u^{*}(\cdot)=-2X^{*}(\cdot), which cannot be used to estimate the parameter \beta(\cdot). For the cost functional (4.13), the related optimal control is given as follows:

u^{*}_{s}=-X_{s}^{*}-\beta_{s}X_{s}^{*},\ t\leq s\leq T.

We regard \beta_{s}X_{s}^{*} as \rho_{s}, and thus

u^{\rho}_{s}=-X_{s}^{\rho}-\rho_{s},\ t\leq s\leq T.

Then, equation (4.12) becomes

\mathrm{d}X^{\rho}_{s}=[(\beta_{s}-1)X_{s}^{\rho}-\rho_{s}]\mathrm{d}s-\rho_{s}\mathrm{d}W_{s},\ X^{\rho}_{t}=x. (4.14)

Based on Theorem 2.3, the exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (4.15)

where

\tilde{b}(\beta_{s},x,\pi_{s})=\int_{\mathbb{R}}[(\beta_{s}-1)x-\rho]\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\rho^{2}\pi_{s}(\rho)\mathrm{d}\rho}.

The exploratory cost functional is,

J(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}(\rho-\beta_{s}X^{\pi}_{s})^{2}\pi_{s}(\rho)\mathrm{d}\rho\mathrm{d}s\right]. (4.16)

By the formula of the optimal control in (3.2), we have that

\pi^{*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left((\rho-\beta_{t}x)^{2}+\frac{1}{2}\rho^{2}\partial_{xx}v(t,x)+[(\beta_{t}-1)x-\rho]\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left((u-\beta_{t}x)^{2}+\frac{1}{2}u^{2}\partial_{xx}v(t,x)+[(\beta_{t}-1)x-u]\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (4.17)

which is a Gaussian density function with mean

\mu(t,x)=\frac{2\beta_{t}x+\partial_{x}v(t,x)}{2+\partial_{xx}v(t,x)},

and variance

\sigma^{2}(t,x)=\frac{\lambda}{2+\partial_{xx}v(t,x)}.

We assume

v(t,x)=\alpha_{1}(t)x^{2}+\alpha_{2}(t).

The related HJB equation becomes

0=\alpha^{\prime}_{1}(t)x^{2}+\alpha^{\prime}_{2}(t)+\beta_{t}^{2}x^{2}+2(\beta_{t}-1)\alpha_{1}(t)x^{2}-\mu^{2}(t,x)(1+\alpha_{1}(t))-\frac{\lambda}{2}\ln\frac{\pi\lambda}{1+\alpha_{1}(t)}, (4.18)

which yields

0=\alpha^{\prime}_{1}(t)(1+\alpha_{1}(t))+\beta_{t}^{2}\alpha_{1}(t)+2\beta_{t}\alpha^{2}_{1}(t)-2\alpha_{1}(t)-3\alpha_{1}^{2}(t);
0=\alpha^{\prime}_{2}(t)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{1+\alpha_{1}(t)},

with \alpha_{1}(T)=\alpha_{2}(T)=0, and thus

\alpha_{1}(t)=0;
\alpha_{2}(t)=\frac{\lambda(T-t)}{2}\ln(\pi\lambda),

which solves the value function v(t,x); the optimal control \pi^{*}(\rho) is

\pi_{t}^{*}(\rho)=\frac{1}{\sqrt{\pi\lambda}}\exp\left(-\frac{\left(\rho-\beta_{t}x\right)^{2}}{\lambda}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2}).

Based on RL, we can learn the mean \beta_{t}x from the observed data. Furthermore, we can use the variance \frac{\lambda}{2} to adjust the exploration rate.
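
The estimation principle in this subsection can also be checked by brute force: among Gaussian feedback policies \pi_{s}=\mathcal{N}(mX^{\pi}_{s},\ \lambda/2), the exploratory cost (4.16) is smallest when the slope m equals \beta (the entropy term of (2.16) is constant over this family, so it does not move the minimizer). The Monte Carlo sketch below assumes a constant \beta and a crude Euler discretization.

```python
import numpy as np

# Monte Carlo sketch for Subsection 4.2 (assumed constant beta): evaluate the exploratory
# cost (4.16) under Gaussian feedback policies pi_s = N(m * X_s, lambda/2) and locate its minimizer.
rng = np.random.default_rng(2)
lam, beta, T, x0 = 0.2, 0.7, 1.0, 1.0
n_steps, n_paths = 200, 4000
dt = T / n_steps

def exploratory_cost(m):
    X = np.full(n_paths, x0)
    cost = np.zeros(n_paths)
    for _ in range(n_steps):
        drift = (beta - 1.0 - m) * X                         # b_tilde of (4.15) for this policy
        vol = np.sqrt(m ** 2 * X ** 2 + lam / 2.0)           # sigma_tilde of (4.15) for this policy
        cost += ((m - beta) ** 2 * X ** 2 + lam / 2.0) * dt  # running cost of (4.16) under pi
        X = X + drift * dt + vol * rng.normal(0.0, np.sqrt(dt), n_paths)
    return cost.mean()

grid = np.linspace(0.3, 1.1, 9)
costs = [exploratory_cost(m) for m in grid]
print("cost-minimizing slope m:", grid[int(np.argmin(costs))], "   true beta:", beta)
```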

In this section, we have considered a parameter appearing in the drift or the diffusion term. In the following Section 5, we consider the more general case where both the drift and diffusion terms contain parameters.

5 Extension

Now, we generalize the model of Section 4 to the case where both the drift and diffusion terms contain unknown parameters. We show in detail how to estimate the parameters in the drift and diffusion terms. Then, we give an example to verify the main results.

5.1 General model

We consider the following stochastic differential equation with deterministic parameters \alpha_{s},\beta_{s}\in\mathbb{R},\ t\leq s\leq T, and control u(\cdot)\in\mathcal{U}[t,T],

\mathrm{d}X^{u}_{s}=b(\alpha_{s},X_{s}^{u},u_{s})\mathrm{d}s+\sigma(\beta_{s},X_{s}^{u},u_{s})\mathrm{d}W_{s},\ X^{u}_{t}=x. (5.1)

We construct two cost functionals,

J_{1}(t,x;u(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{1}(X_{s}^{u},u_{s})\mathrm{d}s+\Phi_{1}(X_{T}^{u})\right], (5.2)

and

J_{2}(t,x;u(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{2}(X_{s}^{u},u_{s})\mathrm{d}s+\Phi_{2}(X_{T}^{u})\right], (5.3)

to estimate the parameters \alpha(\cdot) and \beta(\cdot), respectively.

Firstly, we show how to estimate \alpha(\cdot) from the cost functional J_{1}(t,x;u(\cdot)). By choosing appropriate functions f_{1}(\cdot) and \Phi_{1}(\cdot), we obtain a feedback optimal control u^{1,*}_{s}=u^{1,*}(\alpha_{s},s,X^{1,*}_{s}),\ t\leq s\leq T for the cost functional J_{1}(t,x;u(\cdot)), where X^{1,*}(\cdot) is the optimal state under the optimal control u^{1,*}(\cdot). However, we require that u^{1,*}(\cdot) contains the parameter \alpha(\cdot) only: if u^{1,*}(\cdot) contained the parameter \beta(\cdot), we would not know the value of the new input control (see u^{\rho}(\cdot)). We replace \alpha(\cdot) by a deterministic process \rho(\cdot) in the feedback optimal control u^{1,*}(\alpha_{s},s,X^{1,*}_{s}), and denote u^{\rho}_{s}=u^{1,*}(\rho_{s},s,X^{\rho}_{s}), where X^{\rho}_{s} satisfies the following stochastic differential equation,

\mathrm{d}X^{\rho}_{s}=\hat{b}(\alpha_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}s+\hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})\mathrm{d}W_{s},\ X^{\rho}_{t}=x, (5.4)

where \hat{b}(\alpha_{s},X_{s}^{\rho},\rho_{s})=b(\alpha_{s},X_{s}^{\rho},u^{\rho}_{s}) and \hat{\sigma}(\beta_{s},X_{s}^{\rho},\rho_{s})=\sigma(\beta_{s},X_{s}^{\rho},u^{\rho}_{s}). Furthermore, we rewrite the cost functional (5.2) as follows,

J_{1}(t,x;\rho(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{1}(X_{s}^{\rho},\rho_{s})\mathrm{d}s+\Phi_{1}(X_{T}^{\rho})\right], (5.5)

where \rho_{s}\in\mathbb{R},\ t\leq s\leq T. Based on Theorem 2.3, we consider the following exploratory dynamic equation,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\alpha_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (5.6)

where \pi_{s} is the density function of the input control \rho_{s},

\tilde{b}(\alpha_{s},x,\pi_{s})=\int_{\mathbb{R}}\hat{b}(\alpha_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\hat{\sigma}^{2}(\beta_{s},x,\rho)\pi_{s}(\rho)\mathrm{d}\rho}

for t\leq s\leq T. The cost functional (5.5) is rewritten as follows,

\tilde{J}_{1}(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}\left[f_{1}(X_{s}^{\pi},\rho)\pi_{s}(\rho)+\lambda\pi_{s}(\rho)\ln\pi_{s}(\rho)\right]\mathrm{d}\rho\mathrm{d}s+\Phi_{1}(X_{T}^{\pi})\right]. (5.7)

Secondly, we estimate \beta(\cdot) from the cost functional J_{2}(t,x;u(\cdot)). We choose appropriate functions f_{2}(\cdot) and \Phi_{2}(\cdot) such that the feedback optimal control u^{2,*}_{s}=u^{2,*}(\beta_{s},s,X^{2,*}_{s}),\ t\leq s\leq T of the cost functional J_{2}(t,x;u(\cdot)) contains the parameter \beta(\cdot), where X^{2,*}(\cdot) is the optimal state under the optimal control u^{2,*}(\cdot). We replace \beta(\cdot) by a deterministic process \gamma(\cdot) in the feedback optimal control u^{2,*}(\beta_{s},s,X^{2,*}_{s}), and denote u^{\gamma}_{s}=u^{2,*}(\gamma_{s},s,X^{\gamma}_{s}), where X^{\gamma}_{s} satisfies the following stochastic differential equation,

\mathrm{d}X^{\gamma}_{s}=\hat{b}(\alpha_{s},X_{s}^{\gamma},\gamma_{s})\mathrm{d}s+\hat{\sigma}(\beta_{s},X_{s}^{\gamma},\gamma_{s})\mathrm{d}W_{s},\ X^{\gamma}_{t}=x, (5.8)

where \hat{b}(\alpha_{s},X_{s}^{\gamma},\gamma_{s})=b(\alpha_{s},X_{s}^{\gamma},u^{\gamma}_{s}) and \hat{\sigma}(\beta_{s},X_{s}^{\gamma},\gamma_{s})=\sigma(\beta_{s},X_{s}^{\gamma},u^{\gamma}_{s}). Furthermore, we rewrite the cost functional (5.3) as follows,

J_{2}(t,x;\gamma(\cdot))=\mathbb{E}\left[\int_{t}^{T}f_{2}(X_{s}^{\gamma},\gamma_{s})\mathrm{d}s+\Phi_{2}(X_{T}^{\gamma})\right], (5.9)

where \gamma_{s}\in\mathbb{R},\ t\leq s\leq T. Then, based on Theorem 2.3, we can introduce exploratory dynamic equations analogous to (5.6) and (5.7). For notational simplicity, we omit them.

5.2 Linear SDE with unknown parameters

Now, we consider the following example.

\mathrm{d}X^{u}_{s}=\left[\alpha_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}s+\left[\beta_{s}X^{u}_{s}+u_{s}\right]\mathrm{d}W_{s}, (5.10)

where \alpha(\cdot) and \beta(\cdot) need to be estimated. Note that equation (5.10) contains both parameters \alpha(\cdot) and \beta(\cdot). Thus, we cannot directly use the cost functional developed in Subsection 4.2 to estimate \alpha(\cdot). Therefore, we first estimate the parameter \beta(\cdot). Based on the estimate of \beta(\cdot), we can then use the cost functional constructed in Subsection 4.2 to estimate the parameter \alpha(\cdot).

Step 1: We first construct the cost functional J_{1}(t,x;u(\cdot)) used to estimate the parameter \beta(\cdot), where

J_{1}(t,x;u(\cdot))=\mathbb{E}[(X^{u}_{T})^{2}]. (5.11)

Based on the classical optimal control theory, the related HJB equation is given by

0=\partial_{t}v(t,x)+\inf_{u\in U}\left[\frac{1}{2}(\beta_{t}x+u)^{2}\partial_{xx}v(t,x)+(\alpha_{t}x+u)\partial_{x}v(t,x)\right].

The related optimal control is

u^{*}(s,X^{*}_{s})=-(1+\beta_{s})X^{*}_{s},\quad t\leq s\leq T.

We regard \beta_{s}X_{s}^{*} as \gamma_{s}, and have

u^{\gamma}_{s}=-\gamma_{s}-X_{s}^{\gamma},\ t\leq s\leq T.

Equation (5.10) becomes

\mathrm{d}X^{\gamma}_{s}=[(\alpha_{s}-1)X^{\gamma}_{s}-\gamma_{s}]\mathrm{d}s+[(\beta_{s}-1)X^{\gamma}_{s}-\gamma_{s}]\mathrm{d}W_{s},\ X^{\gamma}_{t}=x. (5.12)

From Theorem 2.3, the exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\alpha_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (5.13)

where

\tilde{b}(\alpha_{s},x,\pi_{s})=\int_{\mathbb{R}}[(\alpha_{s}-1)x-\gamma]\pi_{s}(\gamma)\mathrm{d}\gamma,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}[(\beta_{s}-1)x-\gamma]^{2}\pi_{s}(\gamma)\mathrm{d}\gamma}.

The exploratory cost functional is,

J_{1}(t,x;\pi(\cdot))=\mathbb{E}[(X^{\pi}_{T})^{2}]. (5.14)

By the formula of the optimal control in (3.2), we have that

\pi^{1,*}_{t}(\gamma)=\frac{\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}[(\beta_{t}-1)x-\gamma]^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-1)x-\gamma]\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left(\frac{1}{2}[(\beta_{t}-1)x-u]^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-1)x-u]\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (5.15)

which is a Gaussian density function with mean

\mu(t,x)=\frac{(\beta_{t}-1)x\partial_{xx}v(t,x)+\partial_{x}v(t,x)}{\partial_{xx}v(t,x)},

and variance

\sigma^{2}(t,x)=\frac{\lambda}{\partial_{xx}v(t,x)}.

We assume

v(t,x)=\theta_{1}(t)x^{2}+\theta_{2}(t).

The related HJB equation becomes

0=\theta^{\prime}_{1}(t)x^{2}+\theta^{\prime}_{2}(t)+(\beta_{t}-1)^{2}\theta_{1}(t)x^{2}+2(\alpha_{t}-1)\theta_{1}(t)x^{2}-\mu^{2}(t,x)\theta_{1}(t)-\frac{\lambda}{2}\ln\frac{\pi\lambda}{\theta_{1}(t)}, (5.16)

which yields

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\mathrm{d}s\right);
\theta_{2}(t)=\frac{\lambda}{2}\int_{t}^{T}\ln(\frac{\pi\lambda}{\theta_{1}(s)})\mathrm{d}s,

which solve the value function v(t,x); the optimal control \pi^{1,*}(\gamma) is the Gaussian density function,

\pi_{t}^{1,*}(\gamma)=\sqrt{\frac{\theta_{1}(t)}{\pi\lambda}}\exp\left(-\frac{\left(\gamma-\beta_{t}x\right)^{2}}{\lambda{/}\theta_{1}(t)}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{1,*}(\cdot)\overset{d}{=}\mathcal{N}(\beta_{t}x,\ \frac{\lambda}{2\theta_{1}(t)}),

where

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\mathrm{d}s\right).

Step 2: We then construct the cost functional J_{2}(t,x;u(\cdot)) used to estimate the parameter \alpha(\cdot), where

J_{2}(t,x;u(\cdot))=\mathbb{E}\int_{t}^{T}[(u_{s}+\alpha_{s}X^{u}_{s}+\beta_{s}X^{u}_{s})^{2}]\mathrm{d}s. (5.17)

Minimizing the cost functional (5.17), the related optimal control is given by

u^{*}_{s}=-\alpha_{s}X_{s}^{*}-\beta_{s}X_{s}^{*},\ t\leq s\leq T.

We regard \alpha_{s}X_{s}^{*} as \rho_{s}, and thus

u^{\rho}_{s}=-\rho_{s}-\beta_{s}X_{s}^{\rho},\ t\leq s\leq T.

Then, equation (5.10) becomes

\mathrm{d}X^{\rho}_{s}=[(\alpha_{s}-\beta_{s})X_{s}^{\rho}-\rho_{s}]\mathrm{d}s-\rho_{s}\mathrm{d}W_{s},\ X^{\rho}_{t}=x. (5.18)

From Theorem 2.3, the exploratory dynamic equation is,

\mathrm{d}X^{\pi}_{s}=\tilde{b}(\alpha_{s}-\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}s+\tilde{\sigma}(\beta_{s},X_{s}^{\pi},\pi_{s})\mathrm{d}W_{s},\ X^{\pi}_{t}=x, (5.19)

where

\tilde{b}(\alpha_{s}-\beta_{s},x,\pi_{s})=\int_{\mathbb{R}}[(\alpha_{s}-\beta_{s})x-\rho]\pi_{s}(\rho)\mathrm{d}\rho,

and

\tilde{\sigma}(\beta_{s},x,\pi_{s})=\sqrt{\int_{\mathbb{R}}\rho^{2}\pi_{s}(\rho)\mathrm{d}\rho}.

The exploratory cost functional is,

J_{2}(t,x;\pi(\cdot))=\mathbb{E}\left[\int_{t}^{T}\int_{\mathbb{R}}(\rho-\alpha_{s}X^{\pi}_{s})^{2}\pi_{s}(\rho)\mathrm{d}\rho\mathrm{d}s\right]. (5.20)

By the formula of the optimal control in (3.2), we have that

\pi^{2,*}_{t}(\rho)=\frac{\exp\left(-\frac{1}{\lambda}\left((\rho-\alpha_{t}x)^{2}+\frac{1}{2}\rho^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-\beta_{t})x-\rho]\partial_{x}v(t,x)\right)\right)}{\int_{\mathbb{R}}\exp\left(-\frac{1}{\lambda}\left((u-\alpha_{t}x)^{2}+\frac{1}{2}u^{2}\partial_{xx}v(t,x)+[(\alpha_{t}-\beta_{t})x-u]\partial_{x}v(t,x)\right)\right)\mathrm{d}u}, (5.21)

which is a Gaussian density function with mean

\mu(t,x)=\frac{2\alpha_{t}x+\partial_{x}v(t,x)}{2+\partial_{xx}v(t,x)},

and variance

\sigma^{2}(t,x)=\frac{\lambda}{2+\partial_{xx}v(t,x)}.

We assume

v(t,x)=\theta_{1}(t)x^{2}+\theta_{2}(t).

The related HJB equation becomes

0=\theta^{\prime}_{1}(t)x^{2}+\theta^{\prime}_{2}(t)+\alpha_{t}^{2}x^{2}+2(\alpha_{t}-\beta_{t})\theta_{1}(t)x^{2}-\mu^{2}(t,x)(1+\theta_{1}(t))-\frac{\lambda}{2}\ln\frac{\pi\lambda}{1+\theta_{1}(t)}, (5.22)

which yields

\theta_{1}(t)=0;
\theta_{2}(t)=\frac{\lambda(T-t)}{2}\ln(\pi\lambda),

which solve the value function v(t,x); the optimal control \pi^{2,*}(\rho) is the Gaussian density function,

\pi_{t}^{2,*}(\rho)=\frac{1}{\sqrt{\pi\lambda}}\exp\left(-\frac{\left(\rho-\alpha_{t}x\right)^{2}}{\lambda}\right),\quad 0\leq t\leq T.

For notational simplicity, we write

\pi_{t}^{2,*}(\cdot)\overset{d}{=}\mathcal{N}(\alpha_{t}x,\ \frac{\lambda}{2}).

We summarize the above main results in the following theorem.

Theorem 5.1.

When both the drift and diffusion terms of the linear SDE (5.1) contain the unknown parameters $\alpha(\cdot)$ and $\beta(\cdot)$, based on the exploratory equations (5.13) and (5.19), and related cost functionals (5.14) and (5.20), we have the optimal density functions for the parameters $\beta(\cdot)$ and $\alpha(\cdot)$, respectively,

\pi_{t}^{1,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\beta_{t}x,\ \frac{\lambda}{2\theta_{1}(t)}\right),\quad\pi_{t}^{2,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\alpha_{t}x,\ \frac{\lambda}{2}\right),

where

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\,\mathrm{d}s\right).

Based on $\pi_{t}^{1,*}(\cdot)$ and $\pi_{t}^{2,*}(\cdot)$, we can learn the unknown parameters $\beta(\cdot)$ and $\alpha(\cdot)$.

Remark 5.1.

Wang and Zhou (2020) investigated a continuous-time mean-variance portfolio selection problem with a reinforcement learning model and developed an implementable reinforcement learning algorithm. Based on the results in Wang and Zhou (2020), we can obtain the parameters appearing in the density functions $\pi_{t}^{1,*}(\cdot)$ and $\pi_{t}^{2,*}(\cdot)$, which can be used to estimate the parameters $\beta(\cdot)$ and $\alpha(\cdot)$. Based on these estimates of $\beta(\cdot)$ and $\alpha(\cdot)$, we can carry out empirical analysis for the classical optimal control problem and related applications.
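To make the last point concrete, the following minimal sketch (placeholder names, synthetic draws) shows how point estimates of $\beta_{t}$ and $\alpha_{t}$ can be read off from samples of the learned densities once their means have been fitted, e.g., by the policy evaluation and improvement procedures of Wang and Zhou (2020):

import numpy as np

def estimate_parameters(samples_pi1, samples_pi2, x):
    """Point estimates of beta_t and alpha_t at a fixed time t and state x != 0.

    samples_pi1: draws from pi^{1,*}_t, whose mean is beta_t * x
    samples_pi2: draws from pi^{2,*}_t, whose mean is alpha_t * x
    """
    beta_hat = np.mean(samples_pi1) / x
    alpha_hat = np.mean(samples_pi2) / x
    return beta_hat, alpha_hat

# Synthetic illustration; lam, alpha_t, beta_t, theta1_t, x are placeholder values.
lam, alpha_t, beta_t, theta1_t, x = 0.1, 0.5, 0.3, 1.2, 2.0
rng = np.random.default_rng(1)
pi1_draws = rng.normal(beta_t * x, np.sqrt(lam / (2 * theta1_t)), size=10_000)
pi2_draws = rng.normal(alpha_t * x, np.sqrt(lam / 2), size=10_000)
print(estimate_parameters(pi1_draws, pi2_draws, x))   # approximately (0.3, 0.5)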

6 Conclusion

Combining the stochastic optimal control model with a reinforcement learning structure, we develop a novel approach to learn the unknown parameters appearing in the drift and diffusion terms of a stochastic differential equation. By choosing an appropriate cost functional, we obtain a feedback optimal control $u^{*}(\cdot)$ that contains the unknown parameter $\beta(\cdot)$, denoted by $u^{*}_{s}=u^{*}(\beta_{s},s,X^{*}_{s}),\ t\leq s\leq T$. We then replace the unknown parameter $\beta(\cdot)$ in the optimal control $u^{*}(\cdot)$ with a new deterministic control $\rho(\cdot)$ and obtain a new input control $u^{\rho}_{s}=u^{\rho}(\rho_{s},s,X^{\rho}_{s}),\ t\leq s\leq T$. Substituting the new control $u^{\rho}(\cdot)$ into the related stochastic differential equation and cost functional, we establish the mathematical framework of the exploratory dynamic equation and cost functional, which is consistent with the structure introduced in Wang et al. (2020).

Indeed, the optimal control of the new exploratory control problem can be used to estimate the unknown parameters. We first consider the linear stochastic differential equation case in which the drift or the diffusion term contains an unknown parameter, and then investigate the general case in which both the drift and diffusion terms contain unknown parameters. In all of these cases, we show that the optimal density function is Gaussian and can be used to estimate the unknown parameters. When both the drift and diffusion terms of the linear SDE contain the unknown parameters $\alpha(\cdot)$ and $\beta(\cdot)$, based on the exploratory equations and related cost functionals, we obtain the optimal density functions for the parameters $\alpha(\cdot)$ and $\beta(\cdot)$, respectively. The optimal density functions are given by

\pi_{t}^{1,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\beta_{t}x,\ \frac{\lambda}{2\theta_{1}(t)}\right),\quad\pi_{t}^{2,*}(\cdot)\overset{d}{=}\mathcal{N}\left(\alpha_{t}x,\ \frac{\lambda}{2}\right),

where

\theta_{1}(t)=\exp\left(\int_{t}^{T}[2(\alpha_{s}-\beta_{s})-1]\,\mathrm{d}s\right).

Based on $\pi_{t}^{1,*}(\cdot)$ and $\pi_{t}^{2,*}(\cdot)$, we can learn the unknown parameters $\beta(\cdot)$ and $\alpha(\cdot)$.

In this paper, we focus on developing the theory of a stochastic optimal control approach with a reinforcement learning structure to estimate the unknown parameters appearing in the model. Indeed, based on existing optimal control learning methods, for example the policy evaluation and policy improvement procedures developed in Wang and Zhou (2020), we can learn the optimal density functions and then obtain estimates of the unknown parameters. These estimates are useful in related investment portfolio problems, for example the mean-variance portfolio problem. We will further consider these problems in future work.

References

  • Basak and Chabakauri (2010) S. Basak and G. Chabakauri. Dynamic mean-variance asset allocation. The Review of Financial Studies, 23(8):2970–3016, 2010.
  • Björk et al. (2014) T. Björk, A. Murgoci, and X. Y. Zhou. Mean-variance portfolio optimization with state-dependent risk aversion. Mathematical Finance, 24:1–24, 2014.
  • Dai et al. (2021) M. Dai, H. Jin, S. Kou, and Y. Xu. A dynamic mean-variance analysis for log returns. Management Science, 67(2):1093–1108, 2021.
  • Dong (2022) Y. C. Dong. Randomized optimal stopping problem in continuous time and reinforcement learning algorithm. arxiv.org/abs/2208.02409, pages 1–19, 2022.
  • Doya (2000) K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
  • Fleming and Rishel (1975) W. H. Fleming and R. W. Rishel. Deterministic and stochastic optimal control. Springer-Verlag, New York, 1975.
  • Fleming and Soner (2006) W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Springer, New York, 2006.
  • Gao et al. (2022) X. Gao, Z. Q. Xu, and X. Y. Zhou. State-dependent temperature control for Langevin diffusions. SIAM Journal on Control and Optimization, 60(3):1250–1268, 2022.
  • Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, http://www.deeplearningbook.org, 2016.
  • Han et al. (2023) X. Han, R. Wang, and X. Y. Zhou. Choquet regularization for continuous-time reinforcement learning. pages 1–35, 2023.
  • Jia and Zhou (2022a) Y. Jia and X. Y. Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. The Journal of Machine Learning Research, 23(154):1–55, 2022a.
  • Jia and Zhou (2022b) Y. Jia and X. Y. Zhou. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. The Journal of Machine Learning Research, 23(275):1–50, 2022b.
  • Jia and Zhou (2022c) Y. Jia and X. Y. Zhou. q-learning in continuous time. arXiv: 2207.00713, 2022c.
  • Pardoux and Peng (1990) E. Pardoux and S. Peng. Adapted solution of a backward stochastic differential equation. Systems & Control Letters, 14:55–61, 1990.
  • Peng (1990) S. Peng. A general stochastic maximum principle for optimal control problems. SIAM J. Control and Optim., 28(4):966–979, 1990.
  • Peng (1992) S. Peng. A generalized dynamic programming principle and Hamilton-Jacobi-Bellman equation. Stochastics Stochastics Rep., 38:119–134, 1992.
  • Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tang et al. (2022) W. Tang, Y. P. Zhang, and X. Y. Zhou. Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191–3216, 2022.
  • Wang and Zhou (2020) H. Wang and X. Y. Zhou. Continuous-time mean-variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308, 2020.
  • Wang et al. (2020) H. Wang, T. Zariphopoulou, and X. Y. Zhou. Exploration versus exploitation in reinforcement learning: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34, 2020.
  • Yong and Zhou (1999) J. Yong and X. Y. Zhou. Stochastic controls: Hamiltonian systems and HJB equations. Springer, New York, 1999.
  • Zhou (2023) X. Y. Zhou. The curse of optimality, and how to break it? Cambridge University Press, Part V-New Frontiers for Stochastic Control in Finance, 2023.
  • Zhou and Li (2000) X. Y. Zhou and D. Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Appl. Math. Optim., 42:19–33, 2000.