
Pricing Bermudan options using Regression Trees / Random Forests

Zineb El Filali Ech-chafiq, Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. Quantitative analyst at Natixis, Paris. [email protected]    Pierre Henry-Labordère, Head of Quantitative Research Cross Asset, Natixis, 47 quai d'Austerlitz, 75013 Paris. CMAP, Ecole Polytechnique    Jérôme Lelong, Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. [email protected]
Abstract

The value of an American option is the maximized value of the discounted cash flows from the option. At each time step, one needs to compare the immediate exercise value with the continuation value and to exercise as soon as the exercise value is strictly greater than the continuation value. We can formulate this problem as a dynamic programming equation, where the main difficulty comes from the computation of the conditional expectations representing the continuation values at each time step. In (Longstaff and Schwartz, 2001), these conditional expectations were estimated using regressions on a finite-dimensional vector space (typically a polynomial basis). In this paper, we follow the same algorithm, except that the conditional expectations are estimated using regression trees or random forests. We discuss the convergence of the LS algorithm when the standard least squares regression is replaced by regression trees. Finally, we present some numerical results with regression trees and random forests. The random forest algorithm gives excellent results in high dimensions.

keywords: Regression trees, Random forests, Bermudan options, Optimal stopping

1 Introduction

Bermudan options are very widespread in financial markets. Compared to European options, their valuation involves the additional challenge of determining an optimal exercise time. Bermudan options offer the investor the possibility to exercise the option at any date of his choice among a certain number of dates prior to the option expiry, called exercise dates. Naturally, the option holder has to find the optimal date at which to exercise. To do so, at each exercise date, he compares the payoff of immediate exercise to the expected value of continuing to hold the option and exercises only if the immediate exercise value is the larger one. We can formulate this problem as a dynamic programming equation, where the main difficulty comes from the computation of the conditional expectation representing the expected continuation value of the option. Many papers have discussed this issue, starting with regression-based algorithms; see for example (Tsitsiklis and Van Roy, 1999) and (Carriere, 1996). In this category also falls the most commonly used method for pricing Bermudan options, the Least Squares Method (LSM) presented by Longstaff and Schwartz in (Longstaff and Schwartz, 2001), where the conditional expectation is estimated by a least squares regression of the subsequently realized payoffs from continuation on some basis functions of the state variables (usually polynomial functions). Alternatively, an approximately optimal but interpretable policy can be obtained as the solution to a sample average approximation problem, see Ciocan and Mišić (2022). Another class of algorithms focuses on quantization approaches, see for example (Bally et al., 2005); there, the conditional expectations are computed by projecting the diffusion on some optimal grid. We also have a class of duality-based methods that give an upper bound on the option value for a given exercise policy by adding a quantity that penalizes the incorrect exercise decisions made by the sub-optimal policy, see for example (Rogers, 2002), (Andersen and Broadie, 2004) and (Lelong, 2018). The last class of algorithms is based on machine learning techniques. Deep learning has been widely used to solve stochastic control problems Han and E (2016) or their PDE formulation Han et al. (2018). Neural networks can also be used to estimate the continuation values as in (Kohler et al., 2010; Han et al., 2018; Lapeyre and Lelong, 2021) or the exercise boundary (Reppen et al., 2022). The continuation value can also be approximated using Gaussian processes as in (Ludkovski, 2018). Our solution falls in this last category of algorithms: we examine Bermudan option prices when the continuation values are estimated using regression trees or random forests.

Let $X, Y$ be two random variables with values in $[0,1]^{d}$ and $\mathbb{R}$ respectively. A regression tree approximates the conditional expectation $\mathbb{E}\left[Y|X\right]$ with a piecewise constant function. The tree is built recursively, generating a sequence of partitions of $[0,1]^{d}$ that are finer and finer. The approximation value on each set in the partition can be seen as a terminal leaf of the tree. This algorithm is simple and efficient. However, it can easily overfit the data, which results in high generalization errors. To solve this issue, we use ensemble methods to aggregate multiple trees, which means that we create multiple trees and then combine them to produce improved results. We suggest using random forests (see (Breiman, 2001)). This method consists in averaging a collection of trees, where each tree depends on a random vector sampled independently and identically for each tree in the forest. This vector allows the trees in the forest to differ from one another and can be chosen in different ways. For example, one can draw for each tree a sub-sample of the global training data without replacement (this method is called bagging and is thoroughly studied in (Breiman, 1999)). A second method is random split selection, where at each node the split is selected at random among the $K$ best splits, see (Dietterich, 2000). Other methods for aggregating regression trees into random forests can be found in the literature, see for example (Breiman, 2001) or (Ho, 1998).
The structure of the paper is as follows. First, we present the regression tree algorithm and the least squares algorithm using regression trees. Then, we present some convergence results for regression trees and study the convergence of the LS algorithm when regression trees are used to estimate the continuation values. We prove two results: the convergence of the regression solution for an infinite sample size as the depth of the regression tree goes to infinity, and the convergence of the empirical price as the number of samples goes to infinity for a fixed depth. Then, we briefly discuss random forests before we finally study some numerical examples.
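
To fix ideas, here is a minimal, illustrative Python sketch of the bagging variant mentioned above: it averages regression trees fitted on sub-samples drawn without replacement. It uses scikit-learn's DecisionTreeRegressor as a stand-in for the randomized trees of Section 2; the function names and default parameters are ours and not part of the algorithm studied in this paper.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_forest(X, Y, n_trees=50, subsample=0.5, depth=5, rng=None):
    # Fit n_trees regression trees, each on a sub-sample drawn without
    # replacement from the training data, and return the list of trees.
    rng = np.random.default_rng(rng)
    M = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(M, size=int(subsample * M), replace=False)
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X[idx], Y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    # The forest estimate is the average of the individual tree predictions.
    return np.mean([t.predict(X) for t in trees], axis=0)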

2 Regression trees

Let $X$ be a random variable with values in $[0,1]^{d}$ and $Y$ a square integrable real-valued random variable. We want to approximate the conditional expectation $\mathbb{E}[Y|X]$. Throughout this paper, we will consider for computational convenience that $X$ has a density $f_{X}$ on $[0,1]^{d}$ w.r.t. the Lebesgue measure. We are given a training sample $D_{M}=\{(X_{1},Y_{1}),\ldots,(X_{M},Y_{M})\in[0,1]^{d}\times\mathbb{R}\}$, where the $(X_{i},Y_{i})$'s are i.i.d. random variables following the law of $(X,Y)$. An approximation using a regression tree consists in writing the conditional expectation as a piecewise constant function of $X$. Each domain where the function is constant can be seen as a terminal leaf of a tree. Let us start with the one-dimensional case $(d=1)$. Consider the function

m_{M}:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[0,1]\longmapsto\frac{1}{M}\sum_{i=1}^{M}\left(Y_{i}-y_{l}\mathbb{1}_{\{X_{i}\leq x\}}-y_{r}\mathbb{1}_{\{X_{i}>x\}}\right)^{2}.

With probability $q>0$, choose $x$ as the midpoint $x=\frac{1}{2}$ and solve the minimization problem $\inf_{y_{l},y_{r}}m_{M}(y_{l},y_{r},\frac{1}{2})$. With probability $1-q$, solve the minimization problem $\inf_{y_{l},y_{r},x}m_{M}(y_{l},y_{r},x)$. For a fixed $x^{*}$, the optimal values of $y_{l}$ and $y_{r}$ are given by

y_{r}^{*}=\frac{\sum_{i=1}^{M}Y_{i}\mathbb{1}_{\{X_{i}>x^{*}\}}}{\sum_{i=1}^{M}\mathbb{1}_{\{X_{i}>x^{*}\}}};\quad y_{l}^{*}=\frac{\sum_{i=1}^{M}Y_{i}\mathbb{1}_{\{X_{i}\leq x^{*}\}}}{\sum_{i=1}^{M}\mathbb{1}_{\{X_{i}\leq x^{*}\}}}. (1)

The problem $\inf_{y_{l},y_{r},x}m_{M}(y_{l},y_{r},x)$ may admit more than one minimizer. To make the tree unique, we choose the minimizer $(y_{l}^{*},y_{r}^{*},x^{*})$ with the minimal third component, i.e. any solution $(y_{l},y_{r},x)$ to $\inf_{y_{l},y_{r},x}m_{M}(y_{l},y_{r},x)$ satisfies $x\geq x^{*}$.

Once the threshold $x^{*}$ is determined, we split the samples into two groups according to the sign of $X_{i}-x^{*}$ and repeat the process for each group. We stop the process if introducing a new leaf does not improve the MSE (i.e. when $x^{*}$ is on the boundary of the interval on which we solve the minimization problem) or when enough iterations have been made. In the end, we have a tree that approximates the conditional expectation with a piecewise constant function. Regression trees are an algorithmic tool to find an adapted partition and the corresponding weights of this piecewise constant function.
In the multi-dimensional case, we choose the direction (the index along which the optimization is performed) uniformly for each new split. Then, the process is iterated as in the one-dimensional case.

We denote the resulting tree by $\hat{\mathcal{T}}_{p}^{M}(X)$, where $p$ represents the depth of the tree, i.e., the number of iterations done in the process described above. A tree of depth $p$ has $2^{p}$ leaves.
When the size of the training data is infinite, we use the same procedure as above but with expectations rather than empirical means. Consider the function

m:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[0,1]\longmapsto\mathbb{E}\left[\left(Y-y_{l}\mathbb{1}_{\{X\leq x\}}-y_{r}\mathbb{1}_{\{X>x\}}\right)^{2}\right].

Equation (1) becomes

y_{r}^{*}=\mathbb{E}\left[Y|X>x^{*}\right];\quad y_{l}^{*}=\mathbb{E}\left[Y|X\leq x^{*}\right].

Again, when we optimize over $x$, we choose the minimizer $(y_{l}^{*},y_{r}^{*},x^{*})$ with the minimal third component. We denote the tree of depth $p$ obtained with an infinite data set by $\mathcal{T}_{p}(X)$.
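
As an illustration of the construction above, the following Python sketch builds such a randomized tree in the one-dimensional case: with probability $q$ the current cell is cut at its midpoint, otherwise at the empirically MSE-optimal threshold, ties being broken by the smallest threshold. The function names and the dictionary representation are ours and only serve to make the recursion explicit.

import numpy as np

def best_split(x, y):
    # Exhaustive search of the MSE-optimal threshold among the sample points,
    # keeping the smallest threshold in case of ties.
    best = (np.inf, None)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        mse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if mse < best[0]:
            best = (mse, t)
    return best[1]

def fit_tree(x, y, depth, q=0.5, lo=0.0, hi=1.0, rng=None):
    # Recursive construction of the randomized regression tree of Section 2
    # (one-dimensional case): with probability q the cell [lo, hi] is cut at
    # its midpoint, otherwise at the empirically MSE-optimal threshold.
    rng = np.random.default_rng(rng)
    if depth == 0 or len(x) < 2:
        return {"leaf": True, "value": y.mean() if len(y) else 0.0}
    t = 0.5 * (lo + hi) if rng.random() < q else best_split(x, y)
    if t is None:
        return {"leaf": True, "value": y.mean()}
    mask = x <= t
    return {"leaf": False, "threshold": t,
            "left": fit_tree(x[mask], y[mask], depth - 1, q, lo, t, rng),
            "right": fit_tree(x[~mask], y[~mask], depth - 1, q, t, hi, rng)}

def predict(tree, x):
    # Evaluate the piecewise constant function encoded by the tree at a point x.
    while not tree["leaf"]:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]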

3 LS algorithm with regression trees

3.1 Notation

For $p\in\mathbb{N}$, let $\mathcal{A}=\left(\prod_{j=1}^{d}\left[a_{p}^{i-1}(j),a_{p}^{i}(j)\right)\right)_{1\leq i\leq 2^{p}}$ be a partition of $[0,1]^{d}$ with $2^{p}$ elements. We write

\left[a_{p}^{i-1},a_{p}^{i}\right):=\prod_{j=1}^{d}\left[a_{p}^{i-1}(j),a_{p}^{i}(j)\right).

For $(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}\in\mathbb{R}^{2^{p}}$, we define $\mathcal{P}_{p}$ as the piecewise constant function on the partition $\mathcal{A}$ with values $\alpha_{p}^{i}$. For $x\in[0,1]^{d}$,

\mathcal{P}_{p}\left(x,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}\right)=\sum_{i=1}^{2^{p}}\alpha_{p}^{i}\mathbb{1}_{\{x\in[a^{i-1}_{p},a^{i}_{p})\}}.

If $\mathcal{A}$ denotes the partition obtained in the regression tree $\mathcal{T}_{p}$, and we choose

\alpha_{p}^{i}=\mathbb{E}\left[Y|X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right],

then the regression tree $\mathcal{T}_{p}(X)$ can be written as follows

\mathcal{T}_{p}(X)=\mathcal{P}_{p}\left(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}\right) (2)

Similarly, if $\mathcal{A}^{M}=\left(\left[a_{p}^{i-1,M},a_{p}^{i,M}\right)\right)_{1\leq i\leq 2^{p}}$ denotes the partition obtained in the regression tree $\hat{\mathcal{T}}_{p}^{M}$ and

\alpha_{p}^{i,M}=\frac{\sum_{m=1}^{M}Y_{m}\mathbb{1}_{\left\{X_{m}\in\left[a_{p}^{i-1,M},a^{i,M}_{p}\right)\right\}}}{\sum_{m=1}^{M}\mathbb{1}_{\left\{X_{m}\in\left[a_{p}^{i-1,M},a^{i,M}_{p}\right)\right\}}},

then the regression tree $\hat{\mathcal{T}}^{M}_{p}(X)$ can be written as follows

\hat{\mathcal{T}}^{M}_{p}(X)=\mathcal{P}_{p}\left(X,(a_{p}^{i,M})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i,M})_{1\leq i\leq 2^{p}}\right) (3)
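
The notation (2)-(3) simply says that a fitted tree is stored as a partition together with one weight per cell. In dimension one, this evaluation can be sketched as follows (the function name and example values below are ours, purely for illustration):

import numpy as np

def piecewise_constant(x, boundaries, weights):
    # Evaluate P_p(x, (a_p^i), (alpha_p^i)) in dimension one: boundaries is the
    # increasing sequence a_p^0 <= ... <= a_p^{2^p}, and weights[i] is the value
    # taken on the cell [a_p^i, a_p^{i+1}).
    i = np.searchsorted(boundaries, x, side="right") - 1
    i = np.clip(i, 0, len(weights) - 1)
    return weights[i]

# Example: a depth-1 "tree" on [0, 1] split at 0.4 with values 2.0 and 5.0.
print(piecewise_constant(0.25, np.array([0.0, 0.4, 1.0]), np.array([2.0, 5.0])))  # 2.0
print(piecewise_constant(0.9,  np.array([0.0, 0.4, 1.0]), np.array([2.0, 5.0])))  # 5.0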

3.2 Description of the algorithm

Let $T$ be a fixed maturity, and consider the filtered probability space $(\Omega,\mathcal{F},(\mathcal{F}_{t})_{0\leq t\leq T},\mathbb{P})$, where $\mathbb{P}$ is the risk neutral measure. Consider a Bermudan option that can be exercised at dates $0=t_{0}<t_{1}<t_{2}<\cdots<t_{N}=T$. When exercised at time $t_{j}$, the option's discounted payoff is given by $Z_{t_{j}}=h_{j}(X_{t_{j}})$, with $(X_{t_{j}})_{j}$ an adapted Markov process taking values in $[0,1]^{d}$ such that, for every $j$, $X_{t_{j}}$ has a density $f^{j}_{X}$ on $[0,1]^{d}$ uniformly bounded from below on any compact set of $]0,1[^{d}$ by a strictly positive number. The discounted value $(U_{t_{j}})_{0\leq j\leq N}$ of this option is given by

U_{t_{j}}=\mathop{\mathrm{esssup}}_{\tau\in\mathcal{S}_{t_{j},T}}\mathbb{E}\left[Z_{\tau}|\mathcal{F}_{t_{j}}\right], (4)

where $\mathcal{S}_{t_{j},T}$ is the set of all stopping times taking values in $\{t_{j},\dots,T\}$. Using the Snell envelope theory, we know that $U$ solves the dynamic programming equation

\left\{\begin{array}{ll}U_{t_{N}}&=Z_{t_{N}}\\ U_{t_{j}}&=\max\left(Z_{t_{j}},\mathbb{E}\left[U_{t_{j+1}}|\mathcal{F}_{t_{j}}\right]\right)\text{ for }1\leq j\leq N-1.\end{array}\right. (5)

This equation can be rewritten in terms of the optimal policy as follows

\left\{\begin{array}{ll}\tau_{N}&=t_{N}=T\\ \tau_{j}&=t_{j}\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}+\tau_{j+1}\mathbb{1}_{\left\{Z_{t_{j}}<\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\text{ for }1\leq j\leq N-1\end{array}\right. (6)

where $\tau_{j}$ is the smallest optimal stopping time after $t_{j}$. As we are in a Markovian setting, we can write $\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]=\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\right]$. The main difficulty in solving this equation comes from the computation of the continuation value $\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\right]$. In the Least Squares approach presented by (Longstaff and Schwartz, 2001), this conditional expectation is estimated by a linear regression on a countable set of basis functions of $X_{t_{j}}$. In our approach, we propose to estimate it using a regression tree of depth $p$. Let $\mathcal{T}_{p}^{j}(X_{t_{j}})$ be the regression tree approximating $\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\right]$ with an infinite data set. The algorithm solves the following policy

\left\{\begin{array}{ll}\tau_{N}^{p}&=t_{N}=T\\ \tau_{j}^{p}&=t_{j}\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}+\tau_{j+1}^{p}\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}\text{ for }1\leq j\leq N-1.\end{array}\right. (7)

We sample $M$ paths of the model $X_{t_{0}}^{(m)},\ldots,X_{t_{N}}^{(m)}$ along with the corresponding payoff paths $Z_{t_{0}}^{(m)},\ldots,Z_{t_{N}}^{(m)}$, $m=1,\ldots,M$. For each date $j=1,\ldots,N-1$, we approximate the conditional expectation $\mathbb{E}[Z_{\tau_{j+1}}|X_{t_{j}}]$ on the path $m$ using the regression tree $\hat{\mathcal{T}}^{j,M}_{p}(X^{(m)}_{t_{j}})$ built with the samples $(X_{t_{j}}^{(m)},Z^{(m)}_{\hat{\tau}_{j+1}^{p,(m)}})_{1\leq m\leq M}$, where

\left\{\begin{array}{ll}\hat{\tau}_{N}^{p,(m)}&=t_{N}=T\\ \hat{\tau}_{j}^{p,(m)}&=t_{j}\mathbb{1}_{\left\{Z_{t_{j}}^{(m)}\geq\hat{\mathcal{T}}_{p}^{j,M}(X_{t_{j}}^{(m)})\right\}}+\hat{\tau}_{j+1}^{p,(m)}\mathbb{1}_{\left\{Z_{t_{j}}^{(m)}<\hat{\mathcal{T}}_{p}^{j,M}(X_{t_{j}}^{(m)})\right\}}\text{ for }1\leq j\leq N-1.\end{array}\right. (8)

Finally, the time-0 price of the option is approximated by

U_{0}^{p,M}=\max\left(Z_{0},\frac{1}{M}\sum\limits_{m=1}^{M}Z_{\hat{\tau}_{1}^{p,(m)}}^{(m)}\right). (9)
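
For concreteness, here is a compact Python sketch of the backward induction (8)-(9). It is only illustrative: scikit-learn's DecisionTreeRegressor is used as a stand-in for the randomized trees of Section 2, the state variables are not rescaled to $[0,1]^{d}$, and the function names, parameters and the Black-Scholes put example at the end are ours.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lsm_tree_price(X, Z, depth=4):
    # Backward induction (8)-(9): X has shape (M, N+1, d), Z has shape (M, N+1),
    # Z[:, j] being the discounted payoff when exercising at date t_j.
    M, Np1, _ = X.shape
    N = Np1 - 1
    cashflow = Z[:, N].copy()                 # Z_{tau_N} = Z_{t_N}
    for j in range(N - 1, 0, -1):
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X[:, j, :], cashflow)        # estimate of E[Z_{tau_{j+1}} | X_{t_j}]
        continuation = tree.predict(X[:, j, :])
        exercise = Z[:, j] >= continuation    # exercise rule of (8)
        cashflow = np.where(exercise, Z[:, j], cashflow)
    return max(Z[0, 0], cashflow.mean())      # price estimate (9)

# Illustrative use on a discounted Black-Scholes put (arbitrary parameters).
rng = np.random.default_rng(0)
M, N, T, r, sigma, S0, K = 100_000, 10, 1.0, 0.02, 0.2, 1.0, 1.1
dt = T / N
dW = rng.normal(0.0, np.sqrt(dt), (M, N))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * dW, axis=1))
S = np.concatenate([np.full((M, 1), S0), S], axis=1)
disc = np.exp(-r * dt * np.arange(N + 1))
Z = disc * np.maximum(K - S, 0.0)
print(lsm_tree_price(S[:, :, None], Z, depth=5))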

4 Convergence of the algorithm

4.1 Notation

For each time step $1\leq j\leq N-1$, let $\theta_{j}^{p}=((a^{p}_{i,j})_{0\leq i\leq 2^{p}},(\alpha^{p}_{i,j})_{1\leq i\leq 2^{p}})$ be the parameters of $\mathcal{T}_{p}^{j}$ and $\hat{\theta}_{j}^{p,M}=((a^{p,M}_{i,j})_{0\leq i\leq 2^{p}},(\alpha^{p,M}_{i,j})_{1\leq i\leq 2^{p}})$ be the parameters of $\hat{\mathcal{T}}_{p}^{j,M}$. Note that with this notation and Equations (2) and (3), we have

\mathcal{T}_{p}^{j}(X_{t_{j}})=\mathcal{P}_{p}(X_{t_{j}},\theta_{j}^{p}) (10)
\hat{\mathcal{T}}_{p}^{j,M}(X_{t_{j}}^{(m)})=\mathcal{P}_{p}(X_{t_{j}}^{(m)},\hat{\theta}_{j}^{p,M}) (11)

where for the sake of conciseness we have shrunk the partition and weight parameters of the tree into a single multi-dimensional parameter.

We gather the coefficients of the successive expansions into the vectors $\vartheta^{p}=(\theta_{1}^{p},\ldots,\theta_{N-1}^{p})$ and $\hat{\vartheta}^{p,M}=(\hat{\theta}_{1}^{p,M},\ldots,\hat{\theta}_{N-1}^{p,M})$.
Let $t^{p}=(t_{1}^{p},\ldots,t_{N-1}^{p})$ be a deterministic parameter with values in the same space as $\vartheta^{p}$, and let $z=(z_{1},\ldots,z_{N})\in\mathbb{R}^{N}$ and $x=(x_{1},\ldots,x_{N})\in([0,1]^{d})^{N}$ be deterministic vectors. Following the notation of (Clément et al., 2002), we define the function $F=(F_{1},\ldots,F_{N})$ taking values in $\mathbb{R}^{N}$ by

\left\{\begin{array}{ll}F_{N}(t^{p},z,x)&=z_{N}\\ F_{j}(t^{p},z,x)&=z_{j}\mathbb{1}_{\{z_{j}\geq\mathcal{P}_{p}(x_{j},t_{j}^{p})\}}+F_{j+1}(t^{p},z,x)\mathbb{1}_{\{z_{j}<\mathcal{P}_{p}(x_{j},t_{j}^{p})\}},\text{ for }1\leq j\leq N-1.\end{array}\right.

$F_{j}(t^{p},z,x)$ only depends on $t_{j}^{p},\ldots,t_{N-1}^{p}$ and not on the first $j-1$ components of $t^{p}$. Using (10), we have

F_{j}(\vartheta^{p},Z,X)=Z_{\tau_{j}^{p}},\qquad F_{j}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})=Z_{\hat{\tau}_{j}^{p,(m)}}^{(m)}.

Moreover, we clearly have that for every deterministic parameter $t^{p}$,

\left\lvert F_{j}(t^{p},Z,X)\right\rvert\leq\max\limits_{k\geq j}\left\lvert Z_{t_{k}}\right\rvert. (12)

4.2 Convergence of the conditional expectations

4.2.1 Preliminary results

We define

J_{p}\left(a_{p}^{0},\ldots,a_{p}^{p}\right):=\mathbb{E}\left[\left\lvert\sum_{i=1}^{p}\alpha^{i}_{p}\mathbb{1}_{\{X\in\left[a^{i-1}_{p},a^{i}_{p}\right)\}}-\mathbb{E}\left[Y|X\right]\right\rvert^{2}\right],

with $\alpha_{p}^{i}=\mathbb{E}[Y|X\in[a^{i-1}_{p},a^{i}_{p})]$. We will use the following standard lemma to prove the convergence of the random tree estimators.

Lemma 4.1.

Let $X$ be a random variable with a density $f_{X}$ w.r.t. the Lebesgue measure on $[0,1]^{d}$ and $Y$ be a real-valued square integrable random variable. Let $\left(\left(\left[a^{i-1}_{p},a^{i}_{p}\right)\right)_{1\leq i\leq p}\right)_{p\in\mathbb{N}}$ be a sequence of partitions of $[0,1]^{d}$ such that $\lim\limits_{p\to\infty}\max\limits_{1\leq i\leq p}\max\limits_{1\leq j\leq d}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert=0$. Then,

\lim_{p\to\infty}J_{p}\left(a^{0}_{p},\ldots,a^{p}_{p}\right)=0.
Theorem 4.2.
\lim_{p\to\infty}\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right\rvert^{2}\right]=0.
Proof.

With probability $\frac{1}{d}$, the index $j$ is chosen for optimisation. In the other $d-1$ cases, we do not even cut along that direction, in which case the interval length is at most equal to the length of the largest interval at step $p-1$. When the index $j$ is chosen: with probability $q$, the length of the interval is cut in two, and with probability $1-q$, it is cut to optimize the MSE; in that case, the interval length is at most equal to the length of the largest interval at step $p-1$.

For all $1\leq j\leq d$,

\mathbb{E}\left[\max_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]
\leq\frac{1}{d}\left[\frac{q}{2}\,\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]+(1-q)\,\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]\right]+\frac{d-1}{d}\,\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]
\leq\left(1-\frac{q}{2d}\right)\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]
\leq\left(1-\frac{q}{2d}\right)^{p},

where the last bound follows by iterating the inequality down to depth $0$, at which the cell length is $1$.

Since $\sum\limits_{p=0}^{\infty}\left(1-\frac{q}{2d}\right)^{p}<\infty$, the series $\sum\limits_{p=0}^{\infty}\mathbb{E}\left[\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]$ also converges. As $\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert$ is non-negative for all $p$, using Tonelli's theorem we conclude that $\mathbb{E}\left[\sum\limits_{p=0}^{\infty}\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]<\infty$. As a result, the series $\sum\limits_{p=0}^{\infty}\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert$ converges a.s. Then, $\lim\limits_{p\to\infty}\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert=0$ a.s. for all $j$, and $\lim\limits_{p\to\infty}\max\limits_{1\leq i\leq 2^{p}}\max\limits_{1\leq j\leq d}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert=0$.
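
For instance, with $d=2$ and $q=\frac{1}{2}$, the bound above reads

\mathbb{E}\left[\max_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]\leq\left(1-\frac{1/2}{4}\right)^{p}=\left(\frac{7}{8}\right)^{p},

so the expected mesh of the partition decays geometrically with the depth $p$.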

Let $\mathcal{G}$ be the $\sigma$-field generated by the splitting strategy (direction choice and threshold strategy). Conditioning on $\mathcal{G}$ allows us to consider the partition $\left([a^{i-1}_{p},a^{i}_{p})\right)_{1\leq i\leq 2^{p}}$ as deterministic, and we can apply Lemma 4.1 to prove

\lim_{p\to\infty}\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}\left[Y|X\right]\right\rvert^{2}|\mathcal{G}\right]=0\text{ a.s.}

Note that

\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)\right\rvert^{2}|\mathcal{G}\right]\leq\mathbb{E}\left[\sum_{i=1}^{2^{p}}\mathbb{E}\left[Y^{2}|X\in[a^{i-1}_{p},a^{i}_{p})\right]\mathbb{1}_{\{X\in[a^{i-1}_{p},a^{i}_{p})\}}|\mathcal{G}\right]
\leq\sum_{i=1}^{2^{p}}\mathbb{E}\left[\mathbb{E}\left[Y^{2}\mathbb{1}_{\{X\in[a^{i-1}_{p},a^{i}_{p})\}}|X\in[a^{i-1}_{p},a^{i}_{p})\right]|\mathcal{G}\right]
\leq\sum_{i=1}^{2^{p}}\mathbb{E}\left[Y^{2}\mathbb{1}_{\{X\in[a^{i-1}_{p},a^{i}_{p})\}}|\mathcal{G}\right]
\leq\mathbb{E}\left[Y^{2}|\mathcal{G}\right].

Then,

\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}\left[Y|X\right]\right\rvert^{2}|\mathcal{G}\right]\leq 2\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)\right\rvert^{2}|\mathcal{G}\right]+2\mathbb{E}\left[\mathbb{E}\left[Y|X\right]^{2}\right]
\leq 2\left(\mathbb{E}\left[Y^{2}|\mathcal{G}\right]+\mathbb{E}\left[\mathbb{E}\left[Y|X\right]^{2}\right]\right).

Using Lebesgue's dominated convergence theorem,

\lim_{p\to\infty}\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right\rvert^{2}\right]=\lim_{p\to\infty}\mathbb{E}\left[\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right\rvert^{2}|\mathcal{G}\right]\right]=0.\qed

Once the partition of the tree is computed, the values associated with each element of the partition are uniquely determined by a global optimization problem.

Proposition 4.3.

Let $\mathcal{A}=\left(\prod_{j=1}^{d}\left[a_{p}^{i-1}(j),a_{p}^{i}(j)\right)\right)_{1\leq i\leq 2^{p}}$ denote the partition associated to $\mathcal{T}_{p}$. Then,

\mathbb{E}\left[\left(\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right)^{2}\right]=\inf_{(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}}\mathbb{E}\left[\left(\mathcal{P}_{p}(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}})-\mathbb{E}[Y|X]\right)^{2}\right]

and moreover the right hand side admits a unique minimizer given by

\alpha_{p}^{i}=\mathbb{E}\left[Y|X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right].
Proof.

Since the function $\mathcal{P}_{p}$ is linear w.r.t. the parameter $\alpha$, the function

\alpha\in\mathbb{R}^{2^{p}}\longmapsto\mathbb{E}\left[\left(\mathcal{P}_{p}(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}})-\mathbb{E}[Y|X]\right)^{2}\right]

is strongly convex. Hence, it admits a unique minimizer defined by the first order optimality conditions stating that for all $1\leq i\leq 2^{p}$,

\mathbb{E}\left[\left(\mathcal{P}_{p}(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}})-\mathbb{E}[Y|X]\right)\mathbb{1}_{\left\{X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right\}}\right]=0
\iff\mathbb{E}\left[\left(\alpha_{p}^{i}-\mathbb{E}[Y|X]\right)\mathbb{1}_{\left\{X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right\}}\right]=0
\iff\alpha_{p}^{i}=\mathbb{E}\left[Y|X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right].\qed

A similar result holds for $\hat{\mathcal{T}}_{p}^{M}$ by replacing the conditional expectations with empirical conditional expectations.

4.2.2 Approximation of the conditional expectations with regression trees

The following result proves that the approximation of the Bermudan price produced by our algorithm converges to the true Bermudan price.

Proposition 4.4.
\lim_{p\to\infty}\mathbb{E}\left[Z_{\tau_{j}^{p}}|\mathcal{F}_{t_{j}}\right]=\mathbb{E}\left[Z_{\tau_{j}}|\mathcal{F}_{t_{j}}\right]\text{ in }\mathbb{L}^{2}(\Omega)\text{ for }1\leq j\leq N. (13)
Proof.

We proceed by induction.
For $j=N$, the proposition is true since $\tau_{N}^{p}=\tau_{N}=T$. Assume that the result holds for $j+1$; let us prove that it still holds for $j$:

\mathbb{E}\left[Z_{\tau_{j}^{p}}-Z_{\tau_{j}}|\mathcal{F}_{t_{j}}\right]=Z_{t_{j}}\left(\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right)+\mathbb{E}\left[Z_{\tau_{j+1}^{p}}\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-Z_{\tau_{j+1}}\mathbb{1}_{\left\{Z_{t_{j}}<\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}|\mathcal{F}_{t_{j}}\right]
=(Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}])\left(\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right)+\mathbb{E}[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\,\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}
=A_{j}^{p}+\mathbb{E}[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\,\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}},

where $A_{j}^{p}$ is defined by

A_{j}^{p}=(Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}])\left(\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right).

On the one hand, since the conditional expectation is an orthogonal projection, we have

\mathbb{E}\left[\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]\leq\mathbb{E}\left[\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j+1}}\right]\right\rvert^{2}\right]

and using the induction assumption, $\mathbb{E}\left[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j+1}}\right]\to 0$ in $\mathbb{L}^{2}(\Omega)$ as $p\to\infty$. On the other hand,

\left\lvert A_{j}^{p}\right\rvert=\left\lvert Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\right\rvert\,\left\lvert\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right\rvert
\leq\left\lvert Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\right\rvert\;\left\lvert\mathbb{1}_{\left\{\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]>Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{\mathcal{T}_{p}^{j}(X_{t_{j}})>Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right\rvert
\leq\left\lvert Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\right\rvert\;\mathbb{1}_{\left\{\left\lvert Z_{t_{j}}-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert\leq\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert\right\}}
\leq\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert
\leq\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert+\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert.

Using the induction assumption, the second term goes to zero in $\mathbb{L}^{2}(\Omega)$ as $p\to\infty$. Let $\left(\left[a_{i-1}(p),a_{i}(p)\right)\right)_{1\leq i\leq 2^{p}}$ be the partition generated by $\mathcal{T}_{p}^{j}$. We define

\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})=\sum_{i=1}^{2^{p}}\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\in\left[a_{i-1}(p),a_{i}(p)\right)\right]\mathbb{1}_{\{X_{t_{j}}\in\left[a_{i-1}(p),a_{i}(p)\right)\}}.

Note that $\mathcal{\bar{T}}_{p}^{j}$ uses the partition given by $\mathcal{T}_{p}^{j}(X_{t_{j}})$, but the coefficients $\alpha_{i}(p)$ are given by the conditional expectations of $Z_{\tau_{j+1}}$ w.r.t. $X_{t_{j}}$ and not those of $Z_{\tau_{j+1}^{p}}$. Using Proposition 4.3, we have the following inequality stating that $\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})$ is sub-optimal compared to $\mathcal{T}_{p}^{j}(X_{t_{j}})$:

\mathbb{E}\left[\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]\leq\mathbb{E}\left[\left\lvert\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]
\leq 2\mathbb{E}\left[\left\lvert\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]+2\mathbb{E}\left[\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right].

The second term goes to 0 using the induction assumption. As for the first term, note that the partition obtained with $\mathcal{T}_{p}^{j}$ satisfies the conditions of Lemma 4.1. Then, using the same arguments as in the proof of Theorem 4.2, we can show that the first term also goes to 0. ∎

4.3 Convergence of the Monte Carlo approximation

In this section, the depth $p$ of the trees is fixed and we study the convergence with respect to the number of samples $M$. We assume that, for all dates $j$, the trees $\mathcal{T}^{j}_{p}(X_{t_{j}})$ and $\hat{\mathcal{T}}^{j,M}_{p}(X_{t_{j}})$ are built using the same splitting strategy (i.e. the same splitting directions and threshold strategies).

4.3.1 Convergence of optimisation problems

We present three major results to study the convergence of stochastic optimization problems related to regression trees.

The first result is a uniform strong law of large numbers, see (Leake et al., 1994, Chap. 2, Lemma. A1), which can be seen as a particular case of the strong law of large numbers in Banach spaces (Ledoux and Talagrand, 1991, Corollary 7.10, page 189).

Lemma 4.5.

Let $(\xi_{i})_{i\geq 1}$ be a sequence of i.i.d. $\mathbb{R}^{n}$-valued random vectors and $h:\mathbb{R}^{d}\times\mathbb{R}^{n}\to\mathbb{R}$ be a measurable function. Assume that

  • (i)

    for all $\bar{\theta}\in\mathbb{R}^{d}$, the function $\theta\in\mathbb{R}^{d}\mapsto h(\theta,\xi_{1})$ is continuous at $\bar{\theta}$ a.s.,

  • (ii)

    $\forall C>0,\ \mathbb{E}\left[\sup_{\left\lvert\theta\right\rvert\leq C}\left\lvert h(\theta,\xi_{1})\right\rvert\right]<\infty$.

Then, a.s. $\theta\in\mathbb{R}^{d}\mapsto\frac{1}{n}\sum_{i=1}^{n}h(\theta,\xi_{i})$ converges locally uniformly to the continuous function $\theta\in\mathbb{R}^{d}\mapsto\mathbb{E}\left[h(\theta,\xi_{1})\right]$, i.e.

\forall\,C>0,\;\lim_{n\to\infty}\sup_{\left\lvert\theta\right\rvert\leq C}\left\lvert\frac{1}{n}\sum_{i=1}^{n}h(\theta,\xi_{i})-\mathbb{E}\left[h(\theta,\xi_{1})\right]\right\rvert=0\text{ a.s.}

This lemma is a slight improvement of (Leake et al., 1994, Chap. 2, Lemma A1), which was formulated under the assumption that the function $\theta\in\mathbb{R}^{d}\mapsto h(\theta,\xi_{1})$ is almost surely continuous. However, looking closely at their proof, it turns out that it is sufficient to assume (i) for the conclusion to hold. Condition (i) allows the $\mathbb{P}$-null sets on which the continuity at $\bar{\theta}$ fails to depend on $\bar{\theta}$.

Consider a sequence of real-valued functions $(f_{n})_{n}$ defined on a compact set $K\subset\mathbb{R}^{d}$ such that there exists a sequence $(x_{n})_{n}$ satisfying

f_{n}(x_{n})=\inf_{x\in K}f_{n}(x).

From (Leake et al., 1994, Chap. 2, Theorem A1), we can derive the following lemma

Lemma 4.6.

Assume that the sequence $(f_{n})_{n}$ converges uniformly on $K$ to a continuous function $f$. Let $v^{*}=\inf_{x\in K}f(x)$ and $\mathcal{S}^{*}=\{x\in K\,:\,f(x)=v^{*}\}$. Then, $f_{n}(x_{n})\to\inf_{x\in K}f(x)$ and $d(x_{n},\mathcal{S}^{*})\to 0$.

Now, we focus on the canonical minimization problem appearing at each node of the regression tree. Let $[\underline{a},\bar{a}]$ be a rectangle in $[0,1]^{d}$. For some index $1\leq\delta\leq d$ and a real number $a$ in $]\underline{a}^{\delta},\bar{a}^{\delta}[$, we define the two new rectangles

\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(a)=\{x\in[\underline{a},\bar{a}]\;:\;x^{\delta}\leq a\};\quad\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(a)=\{x\in[\underline{a},\bar{a}]\;:\;x^{\delta}>a\}. (14)

We consider the cost functions

m_{M}:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[\underline{a}^{\delta},\bar{a}^{\delta}]\longmapsto\frac{1}{M}\sum_{i=1}^{M}\left(Y_{i}-y_{l}\mathbb{1}_{\left\{X_{i}\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{X_{i}\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)\right\}}\right)^{2} (15)
m:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[\underline{a}^{\delta},\bar{a}^{\delta}]\longmapsto\mathbb{E}\left[\left(Y-y_{l}\mathbb{1}_{\left\{X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)\right\}}\right)^{2}\right]. (16)

Let $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ be the solution to

\inf_{y_{l},y_{r},x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]}m_{M}(y_{l},y_{r},x) (17)

with the smallest third component, for some arbitrarily small $\varepsilon_{0}>0$. It is clear that (17) may not have a unique solution. Choosing the solution with the minimal third component is a standard way to get a unique minimizer (see for instance Seijo and Sen (2011)).

Lemma 4.7.

Assume that the density $f_{X}$ of $X$ satisfies $f_{X}(x)\geq\underline{f}>0$ for all $x$ in any compact set of $]0,1[^{d}$ and that the minimization problem

\inf_{y_{l},y_{r},x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]}m(y_{l},y_{r},x) (18)

has a unique minimizer $(y_{l}^{*},y_{r}^{*},x^{*})$.

Then, $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $(y_{l}^{*},y_{r}^{*},x^{*})$ and $m_{M}(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $m(y_{l}^{*},y_{r}^{*},x^{*})$ when $M$ goes to infinity.

Proof.

Consider the function

h(y_{l},y_{r},x,y,\xi)=\left(y-y_{l}\mathbb{1}_{\left\{\xi\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{\xi\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)\right\}}\right)^{2}.

Since $X$ has a density on $[0,1]^{d}$, the function $(y_{l},y_{r},x)\longmapsto h(y_{l},y_{r},x,Y,X)$ satisfies Condition (i) of Lemma 4.5. Moreover, for $C>0$,

\mathbb{E}\left[\sup_{\lvert y_{l}\rvert\leq C,\lvert y_{r}\rvert\leq C,x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]}\left\lvert h(y_{l},y_{r},x,Y,X)\right\rvert\right]\leq 2\mathbb{E}[Y^{2}]+2C^{2}<\infty.

Hence, we deduce from Lemma 4.5 that $m_{M}$ converges a.s. locally uniformly to $m$. Note that for a fixed $x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]$, the optimal values of $y_{l}$ and $y_{r}$ are given by

y_{l}=\mathbb{E}[Y|X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)],\qquad y_{r}=\mathbb{E}[Y|X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)].

Observe that for $x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]$,

\lvert\mathbb{E}[Y|X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)]\rvert\leq\frac{\sqrt{\mathbb{E}[Y^{2}]}}{\sqrt{\mathbb{P}(X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x))}}\leq\frac{\sqrt{\mathbb{E}[Y^{2}]}}{\sqrt{\mathbb{P}(X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(\underline{a}^{\delta}+\varepsilon_{0}))}}.

The upper bound is finite thanks to the assumption on the density $f_{X}$ of $X$. Hence, as a function of $x$, $y_{l}$ is uniformly bounded. A similar result holds for $y_{r}$. Then, we deduce from Lemma 4.6 that $m_{M}(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $m(y_{l}^{*},y_{r}^{*},x^{*})$ and that $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $(y_{l}^{*},y_{r}^{*},x^{*})$. ∎

To the best of our knowledge, without the uniqueness assumption on $(y_{l}^{*},y_{r}^{*},x^{*})$, we can only prove that $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges in probability to $(y_{l}^{*},y_{r}^{*},x^{*})$. See for instance (Ferger, 2004, Corollary 1), which yields that for any $\eta>0$, $\mathbb{P}(\hat{x}^{M}<x^{*}+\eta)\to 1$. Combining this result with $d((\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M}),\mathcal{S}^{*})\to 0$ a.s. yields the convergence in probability of $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ to $(y_{l}^{*},y_{r}^{*},x^{*})$.
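
The convergence stated in Lemma 4.7 can be checked numerically on a toy model. In the sketch below (all names and the model are ours, and a grid search over thresholds stands in for the exact minimization of (17)), $Y=\mathbb{1}_{\{X>1/2\}}+\text{noise}$ with $X$ uniform on $[0,1]$, so that the limiting problem has the unique minimizer $(y_{l}^{*},y_{r}^{*},x^{*})=(0,1,\frac{1}{2})$:

import numpy as np

def empirical_node_split(x, y, eps=0.05, n_grid=200):
    # Approximate minimizer of m_M over (y_l, y_r) and thresholds in
    # [eps, 1 - eps], using a grid of candidate thresholds and the
    # closed-form optimal y_l, y_r of (1) for each candidate.
    best = (np.inf, None, None, None)
    for t in np.linspace(eps, 1 - eps, n_grid):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        yl, yr = left.mean(), right.mean()
        mse = ((left - yl)**2).sum() + ((right - yr)**2).sum()
        if mse < best[0]:
            best = (mse, yl, yr, t)
    return best[1], best[2], best[3]

# Toy model: Y = 1_{X > 1/2} + noise, X uniform on [0, 1].
rng = np.random.default_rng(1)
for M in (1_000, 10_000, 100_000):
    X = rng.uniform(size=M)
    Y = (X > 0.5).astype(float) + 0.1 * rng.normal(size=M)
    print(M, empirical_node_split(X, Y))   # should approach (0, 1, 0.5) as M grows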

4.3.2 Strong law of large numbers

In order to prove the almost sure convergence of $\hat{\mathcal{T}}^{M}_{j,p}(X)$, we slightly modify the design of our regression trees. At every node, when computing the optimal splitting point $x^{*}$, we do not perform the optimization on the entire cell but add a margin to make sure that $y^{*}_{l}$ and $y^{*}_{r}$ are uniformly bounded w.r.t. $x^{*}$. Consider a node at depth $p$; the optimal split is obtained by minimizing $m_{M}$ defined in (15), where $\underline{a}$ and $\bar{a}$ were computed at depth $p-1$. As in (17), we search for the optimal value of $x^{*}$ in $[\underline{a}^{\delta}+\varepsilon_{p},\bar{a}^{\delta}-\varepsilon_{p}]$, where $\delta$ is the coordinate chosen for the splitting and $\varepsilon_{p}>0$ is a technical security padding decreasing to 0. When there is no room left for this padding ($\lvert\bar{a}^{\delta}-\underline{a}^{\delta}\rvert<2\varepsilon_{p}$), we stop the splitting procedure.

From (Clément et al., 2002), we have the following result

Lemma 4.8.

For every $j=1,\ldots,N-1$,

\left\lvert F_{j}(a,Z,X)-F_{j}(b,Z,X)\right\rvert\leq\left(\sum_{i=j}^{N}\left\lvert Z_{t_{i}}\right\rvert\right)\left(\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}-\mathcal{P}_{p}^{i}(X_{t_{i}},b_{i})\right\rvert\leq\left\lvert\mathcal{P}_{p}^{i}(X_{t_{i}},a_{i})-\mathcal{P}_{p}^{i}(X_{t_{i}},b_{i})\right\rvert\}}\right).
Proposition 4.9.

Assume that for all $p\in\mathbb{N}^{*}$ and all $1\leq j\leq N-1$, $\mathbb{P}(Z_{t_{j}}=\mathcal{P}_{p}^{j}(X_{t_{j}},\theta_{j}^{p}))=0$. Assume moreover that, for all $1\leq j\leq N-1$ and all $1\leq i\leq 2^{p}$, the optimization problems

\inf_{y_{l},y_{r},x}\mathbb{E}\left[\left(Z_{\tau^{p}_{j}}-y_{l}\mathbb{1}_{\left\{X_{t_{j}}\in\mathcal{R}_{a^{i-1}_{p},a^{i}_{p}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{X_{t_{j}}\in\mathcal{R}_{a^{i-1}_{p},a^{i}_{p}}^{\delta,r}(x)\right\}}\right)^{2}\right]

with $x\in[a^{i-1,\delta}_{p}+\varepsilon_{p},a^{i,\delta}_{p}-\varepsilon_{p}]$ admit a unique solution, where $\delta$ is the coordinate selected by the splitting strategy.

Then, for all $j=1,\ldots,N-1$, $\hat{\theta}_{j}^{p,M}$ converges a.s. to $\theta_{j}^{p}$ and $\mathcal{P}_{p}^{j}(X_{t_{j}},\hat{\theta}_{j}^{p,M})$ converges a.s. to $\mathcal{P}_{p}^{j}(X_{t_{j}},\theta_{j}^{p})$ as $M\to\infty$.

Proof.

We proceed by backward induction on $j$ with a nested forward induction on $p$.

Step 1: $j=N-1$

  • For $p=1$, conditionally on splitting at the best point (and not at the midpoint), the new nodes are obtained by solving

    \mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{1,M})=\inf\limits_{\alpha,\beta,a}\frac{1}{M}\sum_{m=1}^{M}\left\lvert Z_{t_{N}}^{(m)}-\alpha\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{0,1}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in[a,1]\}}\right\rvert^{2}
    \mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{1})=\inf_{\alpha,\beta,a}\mathbb{E}\left[\left\lvert Z_{t_{N}}-\alpha\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{0,1}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}\in[a,1]\}}\right\rvert^{2}\right]

    where $a\in[\varepsilon_{1},1-\varepsilon_{1}]$. We can apply Lemma 4.7 to obtain that $\mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{1,M})$ converges to $\mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{1})$ a.s. as $M\to\infty$ and that $\hat{\theta}_{N-1}^{1,M}$ converges a.s. to $\theta_{N-1}^{1}$. If the split is at the midpoint, the conclusion is even easier to obtain as the infima are only computed w.r.t. $\alpha$ and $\beta$ but not $a$. The same situation will occur repeatedly and, for the sake of clarity, we will only treat the case of splitting at the best point, which is harder to handle.

  • Assume that $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{p,M})$ converges to $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{p})$ a.s. and that $\hat{\theta}_{N-1}^{p,M}$ converges a.s. to $\theta_{N-1}^{p}$ as $M\to\infty$ for some $p\geq 1$; we will prove it for $p+1$. For $i\in\{1,\ldots,2^{p}\}$, we consider the computation of the $i$-th node of $\hat{\mathcal{T}}_{p+1}^{N-1,M}$ at depth $p+1$:

    \hat{\nu}_{p,N-1}^{M}(\alpha,\beta,a)=\frac{1}{M}\sum_{m=1}^{M}\left\lvert Z_{t_{N}}^{(m)}-\alpha\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p,M}_{i-1,N-1},a^{p,M}_{i,N-1}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p,M}_{i-1,N-1},a^{p,M}_{i,N-1}}^{\delta,r}(a)\}}\right\rvert^{2}.

    The parameters $\hat{\theta}_{N-1}^{p+1,M}$ are obtained by computing all the nodes. We also introduce a modified version of $\hat{\nu}_{p,N-1}^{M}$ in which we use the splits computed in $\mathcal{T}_{p}^{N-1}$:

    \nu^{M}_{p,N-1}(\alpha,\beta,a)=\frac{1}{M}\sum_{m=1}^{M}\left\lvert Z_{t_{N}}^{(m)}-\alpha\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p}_{i-1,N-1},a^{p}_{i,N-1}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p}_{i-1,N-1},a^{p}_{i,N-1}}^{\delta,r}(a)\}}\right\rvert^{2}.

    The random functions $\nu^{M}_{p,N-1}$ write as standard empirical means, and we will prove that they are close to $\hat{\nu}_{p,N-1}^{M}$ for $M$ large.

    Using Lemma 4.5 along with the arguments of the proof of Lemma 4.7, it is easy to see that the random function $(\alpha,\beta,a)\mapsto\nu^{M}_{p,N-1}(\alpha,\beta,a)$ converges a.s. locally uniformly to the function $(\alpha,\beta,a)\mapsto\mathbb{E}\left[\left\lvert Z_{t_{N}}-\alpha\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a_{i-1,N-1}^{p},a^{p}_{i,N-1}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a_{i-1,N-1}^{p},a^{p}_{i,N-1}}^{\delta,r}(a)\}}\right\rvert^{2}\right]$. Moreover,

    \sup\limits_{a\in[0,1],\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}^{M}_{p,N-1}(\alpha,\beta,a)-\nu^{M}_{p,N-1}(\alpha,\beta,a)\right\rvert\leq\frac{1}{M}\sum_{m=1}^{M}\left[2\left\lvert Z_{t_{N}}^{(m)}\right\rvert+4C\right]C\left(\mathbb{1}_{\{X_{t_{N-1}}^{(m),\delta}\in(a^{p,M,\delta}_{i-1},a^{p,\delta}_{i-1})\}}+\mathbb{1}_{\{X_{t_{N-1}}^{(m),\delta}\in(a^{p,M,\delta}_{i},a^{p,\delta}_{i})\}}\right).

    Using the induction assumption on $p$, $a^{p,M}_{i}$ (resp. $a^{p,M}_{i-1}$) converges a.s. to $a_{i}^{p}$ (resp. $a_{i-1}^{p}$). Let $\epsilon>0$,

    \limsup_{M}\left\lvert\hat{\nu}^{M}_{p,N-1}(\alpha,\beta,a)-\nu^{M}_{p,N-1}(\alpha,\beta,a)\right\rvert\leq\limsup_{M}\frac{1}{M}\sum_{m=1}^{M}\left[2\left\lvert Z_{t_{N}}^{(m)}\right\rvert+4C\right]C\left(\mathbb{1}_{\{\left\lvert X_{t_{N-1}}^{(m),\delta}-a^{p,\delta}_{i-1}\right\rvert\leq\epsilon\}}+\mathbb{1}_{\{\left\lvert X_{t_{N-1}}^{(m),\delta}-a^{p,\delta}_{i}\right\rvert\leq\epsilon\}}\right)
    \leq C\left(4C+2\mathbb{E}\left[\left\lvert 2Z_{t_{N}}\right\rvert\right]\right)\left(\mathbb{P}\left(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i-1}\right\rvert\leq\epsilon\right)+\mathbb{P}\left(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i}\right\rvert\leq\epsilon\right)\right).

    Since $X_{t_{N-1}}$ has a density, $\lim_{\epsilon\to 0}\mathbb{P}(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i-1}\right\rvert\leq\epsilon)=\mathbb{P}(X_{t_{N-1}}^{\delta}=a_{i-1}^{p,\delta})=0$ and $\lim_{\epsilon\to 0}\mathbb{P}(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i}\right\rvert\leq\epsilon)=\mathbb{P}(X_{t_{N-1}}^{\delta}=a_{i}^{p,\delta})=0$. As a result, $\left\lvert\hat{\nu}^{M}_{p,N-1}-\nu_{p,N-1}^{M}\right\rvert\to 0$ a.s. locally uniformly when $M\to\infty$. Thus, the random function $\hat{\nu}_{p,N-1}^{M}$ converges a.s. locally uniformly to the function

    \alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert Z_{t_{N}}-\alpha\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a^{p}_{i-1},a^{p}_{i}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a^{p}_{i-1},a^{p}_{i}}^{\delta,r}(a)\}}\right\rvert^{2}\right].

    Now, we apply Lemma 4.6 along with the same arguments as in the proof of Lemma 4.7 to conclude that $\mathcal{P}_{p+1}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{p+1,M})$ converges to $\mathcal{P}_{p+1}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{p+1})$ a.s. as $M\to\infty$ and that $\hat{\theta}_{N-1}^{p+1,M}$ converges a.s. to $\theta_{N-1}^{p+1}$.

Step 2: $j<N-1$.
So far, we have proved that for all $p$, $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{p,M})$ converges to $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{p})$ a.s. and that $\hat{\theta}_{N-1}^{p,M}$ converges a.s. to $\theta_{N-1}^{p}$ as $M\to\infty$. Now, suppose that $\mathcal{P}_{p}^{k}(X_{t_{k}},\hat{\theta}_{k}^{p,M})$ (resp. $\hat{\theta}_{k}^{p,M}$) converges to $\mathcal{P}_{p}^{k}(X_{t_{k}},\theta_{k}^{p})$ (resp. $\theta_{k}^{p}$) a.s. as $M\to\infty$ for all $p$ and for $k=N-1,\ldots,j+1$. We shall prove that these convergence results hold for $j$, and this will be done by induction on $p$. Since the multidimensional nature of the random variables $X_{t_{j}}$ does not play any role in the proof, we write the rest of the proof as if $X$ took values in $\mathbb{R}$ in order to lighten the notation.

  • For $p=1$, consider

    \hat{\nu}_{1,j}^{M}(\alpha,\beta,a)=\frac{1}{M}\sum\limits_{m=1}^{M}\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,1]\}}\right\rvert^{2}
    \nu^{M}_{1,j}(\alpha,\beta,a)=\frac{1}{M}\sum\limits_{m=1}^{M}\left\lvert F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,1]\}}\right\rvert^{2}.

    The function $\nu_{1,j}^{M}$ writes as a sum of i.i.d. random variables. Let $C\geq 0$; using Equation (12),

    \mathbb{E}\left[\sup\limits_{a\in[0,1],\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert F_{j+1}\left(\vartheta^{1},Z,X\right)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]
    \leq 2\mathbb{E}\left[\left\lvert F_{j+1}\left(\vartheta^{1},Z,X\right)\right\rvert^{2}\right]+2\mathbb{E}\left[\sup\limits_{a\in[0,1],\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}+\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]
    \leq 2\mathbb{E}\left[\max_{l\geq j+1}(Z_{t_{l}})^{2}\right]+2C^{2}<\infty.

    Using Lemma 4.5, α,β,aν1,jM(α,β,a)\alpha,\beta,a\mapsto\nu_{1,j}^{M}(\alpha,\beta,a) converges a.s locally uniformly to the function α,β,a𝔼[|Fj+1(ϑ1,Z,X)α𝟙{Xtj[0,a)}β𝟙{Xtj[a,1]}|2]\alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{1},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]. It remains to prove that C>0supa[0,1]d,|α|C,|β|C|ν^1,jM(a,α,β)ν1,jM(a,α,β)|0\forall C>0\;\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert\to 0 a.s when MM\to\infty.
    Then, using Equation (12) and Lemma 4.8

    supa[0,1]d,|α|C,|β|C|ν^1,jM(a,α,β)ν1,jM(a,α,β)|\displaystyle\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert
    supa[0,1]d,|α|C,|β|C1Mm=1M|Fj+1(ϑ^1,M,Z(m),X(m))Fj+1(ϑ1,Z(m),X(m))|\displaystyle\leq\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\frac{1}{M}\sum_{m=1}^{M}\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)-F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)\right\rvert
    |Fj+1(ϑ^1,M,Z(m),X(m))+Fj+1(ϑ1,Z(m),X(m))2α𝟙{Xtj(m)[0,a)}2β𝟙{Xtj(m)[a,1]}|\displaystyle\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)+F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)-2\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[0,a)\}}-2\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,1]\}}\right\rvert
    m=1M2(maxlj+1|Ztl(m)|+2C)|Fj+1(ϑ^1,M,Z(m),X(m))Fj+1(ϑ1,Z(m),X(m))|\displaystyle\leq\sum_{m=1}^{M}2\left(\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+2C\right)\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)-F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)\right\rvert
    1Mm=1M2(maxlj+1|Ztl(m)|+2C)(i=j+1N|Zti(m)|i=j+1N1𝟙{|Zti(m)𝒯1i(Xti(m))||𝒯^1i,M(Xti(m))𝒯1i(Xti(m))|}).\displaystyle\leq\frac{1}{M}\sum_{m=1}^{M}2\left(\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+2C\right)\left(\sum_{i=j+1}^{N}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\left\lvert\hat{\mathcal{T}}_{1}^{i,M}(X_{t_{i}}^{(m)})-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\}}\right).

Using the induction assumption on j, \hat{\mathcal{T}}_{1}^{i,M}(X_{t_{i}}^{(m)}) converges a.s. to \mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)}) for all j+1\leq i\leq N-1. Let \epsilon>0.

    lim supMsupa[0,1]d,|α|C,|β|C|ν^1,jM(a,α,β)ν1,jM(a,α,β)|\displaystyle\limsup_{M}\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert
\displaystyle\leq\limsup_{M}\frac{1}{M}\sum_{m=1}^{M}2\left(\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+2C\right)\left(\sum_{i=j+1}^{N}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}\right).

Since \mathbb{P}(Z_{t_{j}}=\mathcal{T}_{1}^{j}(X_{t_{j}}))=0, \lim_{\epsilon\to 0}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}=0 a.s. and we conclude that a.s. \left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert converges to zero uniformly. Thus, \hat{\nu}_{1,j}^{M} converges a.s. uniformly to the function a,\alpha,\beta\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{1},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]. Then, we apply Lemma 4.5 and the arguments of the proof of Lemma 4.7 to deduce that \mathcal{P}_{1}^{j}(X_{t_{j}},\hat{\theta}_{j}^{1,M}) converges to \mathcal{P}_{1}^{j}(X_{t_{j}},\theta_{j}^{1}) a.s. and that \hat{\theta}_{j}^{1,M} converges a.s. to \theta_{j}^{1} as M\to\infty.

  • We assume that \mathcal{P}_{p}^{k}(X_{t_{k}},\hat{\theta}_{k}^{p,M}) (resp. \hat{\theta}_{k}^{p,M}) converges to \mathcal{P}_{p}^{k}(X_{t_{k}},\theta_{k}^{p}) (resp. \theta_{k}^{p}) a.s. as M\to\infty for some p and for k=N-1,\ldots,j. We prove that the same holds for p+1.

    Let i{1,,2p}i\in\{1,\ldots,2^{p}\} and consider

    ν^p,jM(α,β,a)\displaystyle\hat{\nu}_{p,j}^{M}(\alpha,\beta,a) =1Mm=1M|Fj+1(ϑ^p,M,Z(m),X(m))α𝟙{Xtj(m)[ai1,jp,M,a)}β𝟙{Xtj(m)[a,ai,jp,M)}|2.\displaystyle=\frac{1}{M}\sum_{m=1}^{M}\left\lvert F_{j+1}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p,M}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,a^{p,M}_{i,j})\}}\right\rvert^{2}.
    νp,jM(α,β,a)\displaystyle\nu_{p,j}^{M}(\alpha,\beta,a) =1Mm=1M|Fj+1(ϑp,Z(m),X(m))α𝟙{Xtj(m)[ai1,jp,a)}β𝟙{Xtj(m)[a,ai,jp)}|2\displaystyle=\frac{1}{M}\sum_{m=1}^{M}\left\lvert F_{j+1}(\vartheta^{p},Z^{(m)},X^{(m)})-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,a^{p}_{i,j})\}}\right\rvert^{2}

The function \nu_{p,j}^{M} can be written as a sum of i.i.d. random variables. Let C\geq 0,

    𝔼[supa[0,1]d,|α|C,|β|C|Fj+1(ϑp,Z(m),X(m))α𝟙{Xtj(m)[ai1,jp,a)}β𝟙{Xtj(m)[a,ai,jp)}|2]\displaystyle\mathbb{E}\left[\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert F_{j+1}\left(\vartheta^{p},Z^{(m)},X^{(m)}\right)-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,a^{p}_{i,j})\}}\right\rvert^{2}\right]
    2𝔼[|Fj+1(ϑp,Z(m),X(m))|2]+2C2\displaystyle\leq 2\mathbb{E}\left[\left\lvert F_{j+1}\left(\vartheta^{p},Z^{(m)},X^{(m)}\right)\right\rvert^{2}\right]+2C^{2}
    2𝔼[maxlj+1(Ztl)2]+2C2<.\displaystyle\leq 2\mathbb{E}\left[\max_{l\geq j+1}(Z_{t_{l}})^{2}\right]+2C^{2}<\infty.

    We conclude using Lemma 4.5 that a.s νp,jM\nu_{p,j}^{M} converges locally uniformly to the function

    α,β,a𝔼[|Fj+1(ϑp,Z,X)α𝟙{Xtj[ai1,jp,a)}β𝟙{Xtj(a,ai,jp)}|2].\alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{p},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in(a,a^{p}_{i,j})\}}\right\rvert^{2}\right].

    Let C>0C>0,

    supa[0,1]d,|α|C,|β|C|ν^p,jM(α,β,a)νp,jM(α,β,a)|\displaystyle\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{p,j}^{M}(\alpha,\beta,a)-\nu_{p,j}^{M}(\alpha,\beta,a)\right\rvert
    supa[0,1]d,|α|C,|β|C1Mm=1M(2maxlj+1|Ztl(m)|+4C)\displaystyle\leq\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\frac{1}{M}\sum_{m=1}^{M}\left(2\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+4C\right)
    [((i=j+1N|Zti(m)|)i=j+1N1𝟙{|Zti𝒯pi(Xti(m))||𝒯^pi,M(Xti(m))𝒯pi(Xti(m))|})\displaystyle\left[\left(\left(\sum_{i=j+1}^{N}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\right)\sum_{i=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\left\lvert\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}}^{(m)})-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\}}\right)\right.
    +|α𝟙{Xtj(m)[ai1,jp,ai1,jp,M)}+β𝟙{Xtj(m)[ai,jp,M,ai,jp]}|].\displaystyle\left.+\left\lvert\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p}_{i-1,j},a^{p,M}_{i-1,j})\}}+\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p,M}_{i,j},a^{p}_{i,j}]\}}\right\rvert\right].

Using the induction assumption on p, \lim_{M\to\infty}a^{p,M}_{i-1,j}=a^{p}_{i-1,j} and \lim_{M\to\infty}a^{p,M}_{i,j}=a^{p}_{i,j} a.s., and using the induction assumption on j, \lim_{M\to\infty}\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}}^{(m)})=\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)}) for all i\geq j+1. Let \epsilon>0,

    lim supMsupa[0,1]d,|α|C,|β|C|ν^p,jM(α,β,a)νp,jM(α,β,a)|\displaystyle\limsup_{M}\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{p,j}^{M}(\alpha,\beta,a)-\nu_{p,j}^{M}(\alpha,\beta,a)\right\rvert
    lim supM1Mm=1M(2maxlj+1|Ztl(m)|+4C)\displaystyle\leq\limsup_{M}\frac{1}{M}\sum_{m=1}^{M}\left(2\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+4C\right)
\displaystyle\left[C\left(\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i-1,j}\right\rvert\leq\epsilon\}}+\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i,j}\right\rvert\leq\epsilon\}}\right)+\left(\left(\sum_{k=j+1}^{N}\left\lvert Z_{t_{k}}^{(m)}\right\rvert\right)\sum_{k=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{k}}^{(m)}-\mathcal{T}_{p}^{k}(X_{t_{k}}^{(m)})\right\rvert\leq\epsilon\}}\right)\right].

Since \lim_{\epsilon\to 0}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}=\lim_{\epsilon\to 0}\left\lvert\alpha\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i-1,j}\right\rvert\leq\epsilon\}}+\beta\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i,j}\right\rvert\leq\epsilon\}}\right\rvert=0 a.s., we conclude that a.s. \left\lvert\hat{\nu}_{p,j}^{M}(\alpha,\beta,a)-\nu_{p,j}^{M}(\alpha,\beta,a)\right\rvert\to 0 locally uniformly as M\to\infty, and thus the random function \hat{\nu}_{p,j}^{M} converges a.s. locally uniformly to the function \alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{p},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,a^{p}_{i,j})\}}\right\rvert^{2}\right]. Then we apply Lemma 4.6 along with the arguments of the proof of Lemma 4.7 to conclude that \mathcal{P}_{p+1}^{k}(X_{t_{k}},\hat{\theta}_{k}^{p+1,M}) (resp. \hat{\theta}_{k}^{p+1,M}) converges to \mathcal{P}_{p+1}^{k}(X_{t_{k}},\theta_{k}^{p+1}) (resp. \theta_{k}^{p+1}) a.s. as M\to\infty for k=N-1,\ldots,j. This completes the induction.

Theorem 4.10.

Assume that for all pp\in\mathbb{N}^{*}, and all 1jN11\leq j\leq N-1, (Ztj=𝒯pj(Xtj))=0\mathbb{P}(Z_{t_{j}}=\mathcal{T}_{p}^{j}(X_{t_{j}}))=0. Then, for α=1,2\alpha=1,2 and for every j=1,,Nj=1,\ldots,N,

\lim\limits_{M\to\infty}\frac{1}{M}\sum\limits_{m=1}^{M}\left(Z_{\tau_{j}^{p,(m)}}^{(m)}\right)^{\alpha}=\mathbb{E}\left[(Z_{\tau_{j}^{p}})^{\alpha}\right]\quad\textnormal{a.s.}
Proof.

Note that \mathbb{E}\left[(Z_{\tau_{j}^{p}})^{\alpha}\right]=\mathbb{E}\left[F_{j}(\vartheta^{p},Z,X)^{\alpha}\right] and by the strong law of large numbers

\lim\limits_{M\to\infty}\frac{1}{M}\sum_{m=1}^{M}F_{j}\left(\vartheta^{p},Z^{(m)},X^{(m)}\right)^{\alpha}=\mathbb{E}\left[F_{j}(\vartheta^{p},Z,X)^{\alpha}\right]\quad\textnormal{a.s.}

It remains to prove that

ΔFM=1Mm=1MFj(ϑ^p,M,Z(m),X(m))αFj(ϑp,Z(m),X(m))αMa.s0.\Delta F_{M}=\frac{1}{M}\sum\limits_{m=1}^{M}F_{j}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})^{\alpha}-F_{j}(\vartheta^{p},Z^{(m)},X^{(m)})^{\alpha}\xrightarrow[M\to\infty]{a.s}0.

For any x,yx,y\in\mathbb{R}, and α=1,2\alpha=1,2, |xαyα||xy||xα1+yα1|\left\lvert x^{\alpha}-y^{\alpha}\right\rvert\leq\left\lvert x-y\right\rvert\left\lvert x^{\alpha-1}+y^{\alpha-1}\right\rvert. Using Lemma 4.8 and Equation (12), we have

|ΔFM|\displaystyle\left\lvert\Delta F_{M}\right\rvert 1Mm=1M|Fj(ϑ^p,M,Z(m),X(m))αFj(ϑp,Z(m),X(m))α|\displaystyle\leq\frac{1}{M}\sum\limits_{m=1}^{M}\left\lvert F_{j}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})^{\alpha}-F_{j}(\vartheta^{p},Z^{(m)},X^{(m)})^{\alpha}\right\rvert
21Mm=1Mi=jNmaxkj|Ztk(m)|α1|Zti(m)|i=jN1𝟙{|Zti(m)𝒯pi(Xti(m))||𝒯^pi,M(Xti(m))𝒯pi(Xti(m))|}.\displaystyle\leq 2\frac{1}{M}\sum\limits_{m=1}^{M}\sum_{i=j}^{N}\max\limits_{k\geq j}\left\lvert Z_{t_{k}}^{(m)}\right\rvert^{\alpha-1}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\left\lvert\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}}^{(m)})-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\}}.

Using Proposition 4.9, for all i=j,,N1i=j,\ldots,N-1, |𝒯^pi,M(Xti)𝒯pi(Xti)|0\left\lvert\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}})-\mathcal{T}_{p}^{i}(X_{t_{i}})\right\rvert\to 0 a.s when MM\to\infty. Then for any ϵ>0\epsilon>0,

lim supM|ΔFM|\displaystyle\limsup_{M}\left\lvert\Delta F_{M}\right\rvert
2lim supM1Mm=1Mi=jNmaxkj|Ztk(m)|α1|Zti(m)|i=jN1𝟙{|Zti(m)𝒯pi(Xti(m))|ϵ}\displaystyle\leq 2\limsup_{M}\frac{1}{M}\sum\limits_{m=1}^{M}\sum\limits_{i=j}^{N}\max\limits_{k\geq j}\left\lvert Z_{t_{k}}^{(m)}\right\rvert^{\alpha-1}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}
2𝔼[i=jNmaxkj|Ztk|α1|Zti|i=jN1𝟙{|Zti𝒯pi(Xti)|ϵ}].\displaystyle\leq 2\mathbb{E}\left[\sum\limits_{i=j}^{N}\max\limits_{k\geq j}\left\lvert Z_{t_{k}}\right\rvert^{\alpha-1}\left\lvert Z_{t_{i}}\right\rvert\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}-\mathcal{T}_{p}^{i}(X_{t_{i}})\right\rvert\leq\epsilon\}}\right].

We conclude that lim supM|ΔFM|=0\limsup_{M}\left\lvert\Delta F_{M}\right\rvert=0 by letting ϵ\epsilon go to 0 which ends the proof. ∎

Proving the global convergence of our algorithm amounts to studying the difference

\displaystyle\frac{1}{M}\sum_{m=1}^{M}Z_{\tau_{j}^{p,(m)}}^{(m)}-\mathbb{E}\left[Z_{\tau_{j}}\right]=\left(\frac{1}{M}\sum_{m=1}^{M}Z_{\tau_{j}^{p,(m)}}^{(m)}-\mathbb{E}\left[Z_{\tau_{j}^{p}}\right]\right)+\left(\mathbb{E}\left[Z_{\tau_{j}^{p}}\right]-\mathbb{E}\left[Z_{\tau_{j}}\right]\right).

By Proposition 4.4, the term \mathbb{E}\left[Z_{\tau_{j}^{p}}\right]-\mathbb{E}\left[Z_{\tau_{j}}\right] goes to zero as p\to\infty, independently of M. Even though Theorem 4.10 proves the a.s. convergence of the first term for any fixed p as M\to\infty, its behaviour is less clear when both M and p go to infinity. Note that the same difficulty already occurred in Clément et al. (2002) for the more standard least squares approach. The main difficulty stems from the inability to compute the variance of the limiting distribution appearing in the central limit theorem governing the convergence of \frac{1}{M}\sum_{m=1}^{M}Z_{\tau_{j}^{p,(m)}}^{(m)}, even in the least squares framework. In particular, we do not seem to be able to control the behaviour of the limiting variance with respect to p.

5 Random forests

Regression trees are rarely used as standalone estimators but are often aggregated to form a random forest. Consider our training sample \mathcal{D}_{M}=\{(X_{i},Y_{i})_{1\leq i\leq M}\} of size M. Let \Theta be a random variable, independent of \mathcal{D}_{M}, used to resample the training set without replacement; note that the size of the resampled set is usually smaller than M. Let \Theta_{1},\dots,\Theta_{B} be B i.i.d. samples of \Theta. For every k=1,\dots,B, \hat{\mathcal{T}}^{M}_{p,\Theta_{k}}(X) denotes the regression tree of depth p computed on the resampled set obtained from \mathcal{D}_{M} using \Theta_{k}. Then, the random forest estimator is defined by

B,p(X)=k=1B1B𝒯^p,ΘkM(X).\mathcal{H}_{B,p}(X)=\sum_{k=1}^{B}\frac{1}{B}\hat{\mathcal{T}}^{M}_{p,\Theta_{k}}(X).

We also introduce p(X)=𝔼Θ[𝒯^p,ΘM(X)]=limBB,p(X)\mathcal{H}_{p}(X)=\mathbb{E}_{\Theta}\left[\hat{\mathcal{T}}^{M}_{p,\Theta}(X)\right]=\lim_{B\to\infty}\mathcal{H}_{B,p}(X).
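
As an illustration, the following sketch (with our own function name and placeholder parameter values) builds the estimator \mathcal{H}_{B,p}: B regression trees of depth p are fit on subsamples of \mathcal{D}_{M} drawn without replacement, which play the role of \Theta_{1},\dots,\Theta_{B}, and their predictions are averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_estimator(X, Y, X_new, B=50, p=5, subsample=0.5, seed=0):
    """Average of B regression trees of depth p, each fit on a random
    subsample of (X, Y) drawn without replacement (a sketch of H_{B,p})."""
    rng = np.random.default_rng(seed)
    M = len(X)
    m_sub = int(subsample * M)                  # resampled set is smaller than M
    pred = np.zeros(len(X_new))
    for _ in range(B):
        idx = rng.choice(M, size=m_sub, replace=False)   # plays the role of Theta_k
        tree = DecisionTreeRegressor(max_depth=p).fit(X[idx], Y[idx])
        pred += tree.predict(X_new)
    return pred / B                             # average of the B tree predictions
```

In the numerical section, this aggregation is delegated to Scikit-Learn's RandomForestRegressor, which resamples with replacement (bootstrap) rather than without.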

Theorem 5.1.
limB𝔼[|YB,p(X)|2]=𝔼[|Yp(X)|2]\lim_{B\to\infty}\mathbb{E}\left[\left\lvert Y-\mathcal{H}_{B,p}(X)\right\rvert^{2}\right]=\mathbb{E}\left[\left\lvert Y-\mathcal{H}_{p}(X)\right\rvert^{2}\right]

See Theorem 11.1 in (Breiman, 2001).

Theorem 5.2.
𝔼[|Yp(X)|2]ρ¯𝔼Θ[𝔼[|Y𝒯^p,ΘM(X)|2]]\mathbb{E}\left[\left\lvert Y-\mathcal{H}_{p}(X)\right\rvert^{2}\right]\leq\bar{\rho}\mathbb{E}_{\Theta}\left[\mathbb{E}\left[\left\lvert Y-\hat{\mathcal{T}}^{M}_{p,\Theta}(X)\right\rvert^{2}\right]\right]

where \bar{\rho} is the weighted correlation between the residuals Y-\hat{\mathcal{T}}^{M}_{p,\Theta}(X) and Y-\hat{\mathcal{T}}^{M}_{p,\Theta^{\prime}}(X), with \Theta and \Theta^{\prime} independent. See Theorem 11.2 in (Breiman, 2001).

Theorem 5.2 shows that, for the random forest to have a good generalization error, the basis trees should have small generalization errors and should not be highly correlated.
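
A classical back-of-the-envelope computation (not taken from (Breiman, 2001), but consistent with Theorem 5.2) makes this trade-off explicit: if the B tree predictions are identically distributed with variance \sigma^{2} and pairwise correlation \rho, then

\operatorname{Var}\left(\mathcal{H}_{B,p}(X)\right)=\rho\,\sigma^{2}+\frac{1-\rho}{B}\,\sigma^{2},

so increasing B only removes the second term; the correlated part \rho\sigma^{2} remains, which is why decorrelating the basis trees matters as much as adding more of them.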

6 Numerical results

6.1 Description

This section studies the prices of some Bermudan options when regression trees or random forests are used to approximate the conditional expectations. We compare the results to some reference prices and to those given by the standard Longstaff-Schwartz method with regression on polynomial functions, which is a basic LSM algorithm; more sophisticated representations can be used to mildly temper the curse of dimensionality. We use the Scikit-Learn library in Python (Pedregosa et al., 2011). For regression trees, this library offers two splitting methods: “best” chooses the best split, meaning that the split threshold is the one that minimizes the MSE and the splitting direction is the one that gives the lowest MSE among all directions; “random” chooses the best random split, meaning that the split threshold is the one that minimizes the MSE while the splitting direction is chosen randomly. For the following tests, we use the latter method, which differs only slightly from what we presented in Section 2 in that no mid-point cuts are considered. We also use the parameter min_samples_leaf, which sets a minimum number of samples in each leaf and helps avoid over-fitting. For random forests, we use bootstrapping (bootstrap=True), meaning that each tree in the forest is trained on a sub-sample drawn randomly, with replacement, from the training data. We also use the parameter max_samples, which sets the number of data points (or the percentage of the training data) attributed to each tree. Training the trees on data sets that are as different as possible keeps the correlation between the trees low, which, by Theorem 5.2, should make the random forest more robust.
Following the work of (Longstaff and Schwartz, 2001), we only use the in-the-money paths to learn the continuation values, which significantly improves the numerical computations. All the prices we report are obtained after resimulation, meaning that the paths used to estimate the conditional expectations are not the ones used in the final Monte Carlo computation, so the reported prices are unbiased.
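
To fix ideas, here is a minimal sketch of one step of the backward induction as it can be implemented with the Scikit-Learn estimators and options described above; the function and variable names are ours and the parameter values are placeholders, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

def backward_step(X_tj, payoff_tj, cashflow_next, discount, use_forest=False):
    """Update the pathwise cash flows at date t_j.
    X_tj: (M, d) state variables, payoff_tj: (M,) immediate exercise values,
    cashflow_next: (M,) cash flows collected from t_{j+1} onwards."""
    cashflow = discount * cashflow_next          # default decision: continue
    itm = payoff_tj > 0                          # regress on in-the-money paths only
    if not np.any(itm):
        return cashflow
    if use_forest:
        reg = RandomForestRegressor(n_estimators=50, max_depth=5,
                                    min_samples_leaf=100,
                                    bootstrap=True, max_samples=0.5)
    else:
        reg = DecisionTreeRegressor(splitter="random", max_depth=5,
                                    min_samples_leaf=100)
    reg.fit(X_tj[itm], cashflow[itm])
    continuation = reg.predict(X_tj[itm])
    # exercise when the immediate payoff strictly exceeds the estimated continuation
    exercise = payoff_tj[itm] > continuation
    itm_idx = np.where(itm)[0]
    cashflow[itm_idx[exercise]] = payoff_tj[itm][exercise]
    return cashflow
```

The time-0 price is then the average of the resulting cash flows over a fresh set of simulated paths, which is the resimulation step mentioned above.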

For small dimensional problems, our algorithm takes approximately the same computation time as the LSM approach for a comparable accuracy. To measure the computational time of our algorithm, we take into account both the training and prediction times; note, however, that the training step far outweighs the prediction step. Once the trees are trained, evaluating them is straightforward, which makes the resimulation price almost free. The computational time of the random forest approach depends linearly on the number of regression trees in the forest.

6.2 Black and Scholes

Consider the Black and Scholes model

\begin{cases}dS_{t}^{i}&=rS_{t}^{i}\,dt+\sigma_{i}S_{t}^{i}\,dB_{t}^{i},\\ d\langle B^{i},B^{j}\rangle_{t}&=\rho_{ij}\,dt,\end{cases}

where \sigma_{i} is the volatility of the underlying S^{i}, assumed deterministic, r is the interest rate, assumed constant, and \rho_{ij} is the correlation between the underlyings S^{i} and S^{j}, also assumed constant.
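
As a complement, here is a minimal sketch of how such correlated Black-Scholes paths can be simulated exactly on the exercise grid; the function name, the optional dividend yields delta and the use of NumPy's default generator are our choices.

```python
import numpy as np

def simulate_black_scholes(M, N, T, S0, r, sigma, corr, delta=0.0, seed=0):
    """Simulate M paths of d correlated Black-Scholes assets on N dates.
    Returns an array of shape (M, N + 1, d); S0, sigma, delta are length-d
    arrays (or scalars) and corr is the d x d correlation matrix."""
    rng = np.random.default_rng(seed)
    S0, sigma, delta = map(np.atleast_1d, (np.asarray(S0, float),
                                           np.asarray(sigma, float),
                                           np.asarray(delta, float)))
    d = len(S0)
    dt = T / N
    L = np.linalg.cholesky(corr)                 # correlates the Brownian increments
    S = np.empty((M, N + 1, d))
    S[:, 0, :] = S0
    drift = (r - delta - 0.5 * sigma ** 2) * dt
    for n in range(N):
        G = rng.standard_normal((M, d)) @ L.T    # correlated standard normals
        S[:, n + 1, :] = S[:, n, :] * np.exp(drift + sigma * np.sqrt(dt) * G)
    return S
```

For instance, the max-call examples below use \sigma_{i}=0.2, \delta_{i}=0.1 and \rho_{ij}=0, which corresponds to sigma=0.2, delta=0.1 and corr=np.eye(d) in this sketch.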

6.2.1 One-dimensional put option

We consider the Bermudan put option with payoff (K-S_{\tau})^{+}, maturity T=1 year, K=110, S_{0}=100, \sigma=0.25, r=0.1, exercisable at N=10 different dates. We have a reference price of 11.987 for this option, computed by a convolution method in (Lord et al., 2007). The LSM algorithm converges to the correct price with only a polynomial of degree 3. Figure 1 shows the price of the option when we use regression trees with a random split strategy (continuous line) or a best split strategy (dotted line) to estimate the conditional expectations. With the random strategy, the best price we get is 11.89. The case min_samples_leaf=1 and max_depth=20 gives a price of 10.5, which is far from the reference price. This is due to over-fitting: in this case, the number of degrees of freedom is too large, so the tree fits the training data very well but cannot generalize when confronted with new data. With the best split strategy, we obtain a slightly better price of 11.94. However, depending on the tree parameters, the price fluctuates, and we see that the best split strategy is not necessarily better than the random split strategy. Thus, in the following, we keep using the random split strategy. Random forests with basis trees of maximum depth 5 and a minimum of 100 samples in each leaf converge to the correct price with only ten trees.

Figure 1: One-dimensional put option with regression trees; reference price 11.987.

6.2.2 Call option on the maximum of two assets

We consider a call option on the maximum of two assets with payoff (\max(S_{\tau}^{1},S_{\tau}^{2})-K)^{+}. We use the same set of parameters as in (Glasserman, 2004), for which we have reference prices of 13.90, 8.08 and 21.34 for S_{0}^{i}=100, 90 and 110 respectively. The LSM algorithm using a polynomial of degree 5 converges to prices of 13.90, 8.06 and 21.34 for the cases S_{0}^{i}=100, 90 and 110 respectively. This is a small dimensional problem, so the convergence of the LSM is expected. With regression trees, the results are slightly less satisfying, as shown in Figure 2. We can still see the over-fitting that occurs when the regression trees are given too many degrees of freedom. Aggregating the regression trees into random forests immediately improves the results, as shown in Figure 3. Note that the lower the percentage of data given to each basis tree, the better the results, which confirms Theorem 5.2.

Figure 2: Call on the maximum of two assets with regression trees, K=100, T=3 years, \sigma^{i}=0.2, r=0.05, \rho_{ij}=0, \delta_{i}=0.1, N=9, M=100,000.
Figure 3: Call on the maximum of two assets with random forests, K=100, T=3 years, \sigma^{i}=0.2, r=0.05, \rho_{ij}=0, \delta_{i}=0.1, N=9, M=100,000.

6.2.3 Geometric basket option

We consider a Bermudan put option on a geometric basket of d underlyings with payoff

(K(i=1dSτi)1d)+.\left(K-\left(\prod\limits_{i=1}^{d}S_{\tau}^{i}\right)^{\frac{1}{d}}\right)^{+}.
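
Although the paper does not detail it, reference prices for this payoff can be obtained from a one-dimensional CRR tree: assuming the underlyings pay no dividends, the geometric average \left(\prod_{i=1}^{d}S_{t}^{i}\right)^{1/d} is itself a geometric Brownian motion with volatility \hat{\sigma} and effective dividend yield \hat{\delta} given by

\hat{\sigma}^{2}=\frac{1}{d^{2}}\sum_{i,j=1}^{d}\rho_{ij}\sigma_{i}\sigma_{j},\qquad\hat{\delta}=\frac{1}{2d}\sum_{i=1}^{d}\sigma_{i}^{2}-\frac{\hat{\sigma}^{2}}{2},

so the d-dimensional Bermudan put reduces to a one-dimensional put on this synthetic asset.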

We test this option for d=2,10,40, for which we have reference prices computed with the CRR tree method of (Cox et al., 1979). With the LSM algorithm, we converge to the correct price 4.57 for the case d=2 using only a polynomial of degree 3. For the case d=10, we can at most use a polynomial of degree 3 due to the curse of dimensionality; with this parametrization, we obtain a price of 2.90 for a true price of 2.92. For the case d=40, we cannot go beyond a polynomial of degree 1, which yields a price of 2.48 for a reference price of 2.52. Figure 4 shows the results obtained with regression trees. For the case d=2, the best price we get is 4.47 and, as expected, the LSM algorithm performs better. This is also true for the cases d=10 and d=40, where the best prices we obtain are 2.84 and 2.46 respectively. Notice that even though these are high dimensional cases, the trees converge with a depth of only 5 or 8. For d=10, our algorithm with depth equal to 8 and min_samples_leaf equal to 200 runs in 3 seconds, compared to the 11 seconds needed by the LSM with a polynomial of degree 3; the running times of the two approaches are quite comparable. For d=40, our algorithm takes the same computational time as the LSM with degree 1 polynomials. We also notice the importance of the parameter min_samples_leaf: letting the trees grow without managing this parameter (case leaf1) leads to over-fitting. The results get better when we use random forests, as shown in Figure 5. For these random forests, we used basis trees with max_depth=8 and min_samples_leaf=100. Notice that for the case d=2, the curve where only 50% of the data is used in each tree gives much better results, as in this case the basis trees are the least correlated. For the cases d=10 and d=40, the best choice is not necessarily to use 50% of the data in each tree: in these larger dimensions, training the trees on a small percentage of the training data may not be enough, and one may consider extending the size of the training data itself. Furthermore, we notice that once the percentage of data used in each tree is chosen, the price of the option converges as the number of trees in the forest grows. However, the computational time of the random forest method depends linearly on the number of trees in the forest: although the regression tree approach runs within roughly the same computational time as the LSM for a similar accuracy, the random forest approach may take much longer as the number of trees increases.

Figure 4: Geometric put option with regression trees.
Figure 5: Geometric put option with random forests.

6.2.4 A put basket option

We consider a put option on a basket of d=40 assets with payoff \left(K-\sum_{i=1}^{d}\omega_{i}S_{\tau}^{i}\right)^{+}, for which we have a reference price between 2.15 and 2.22 from (Goudenège et al., 2019) with the following set of parameters: T=1, S_{0}^{i}=100, K=100, r=0.05, \sigma_{i}=0.2, \rho_{ij}=0.2, \omega_{i}=\frac{1}{d} and N=10. With a polynomial of degree 1, we obtain a price of 2.15 using the LSM algorithm. The results obtained with regression trees are shown in Figure 6.

Figure 6: Put option on a basket of 40 assets with regression trees.

Even though this example is high dimensional, we do not need many parameters to estimate the conditional expectations (the trees converge for very small depths). This will not be the case for the next example, which is highly nonlinear. Aggregating the trees into random forests leads to a price of 2.16 using only 50 trees.

6.2.5 A call on the maximum of 50 assets

We consider a call option on the maximum of d=50 assets with payoff \left(\max\limits_{1\leq i\leq d}(S_{\tau}^{i})-K\right)^{+} and the following characteristics: K=100, T=3 years, S_{0}^{i}=100, \sigma_{i}=0.2, \delta_{i}=0.1, \rho_{ij}=0 for all i,j, r=0.05, N=9, M=100000. (Becker et al., 2019) report [69.56,69.95] as the 95% confidence interval for the option price. With the LSM algorithm, we find a price of 67.88 with a polynomial of degree 1. This is a difficult example for which bigger trees are needed to approximate the conditional expectations: at maturity, the payoff depends on only one direction (the one corresponding to the best performing asset), and if the cuts in the tree never consider that direction, the estimation will not be correct. As a result, we consider a number of cuts large enough to ensure that each direction is taken into consideration. We let the depth grow while monitoring min_samples_leaf in order to keep a significant number of samples in each leaf. Table 1 shows the results obtained with regression trees. As the best price we obtain is given by depth=100 and min_samples_leaf=100, we use this set of parameters for the random forest part. Table 2 shows the results obtained with this method.

depth min_samples_leaf price
50 50 66.89
50 100 66.88
100 50 67.13
100 100 67.31
200 50 67.16
200 100 67.28
Table 1: A call option on the maximum of 50 assets with regression trees
nb_trees max_samples price
10 50% 68.32
10 70% 68.32
10 90% 68.29
Table 2: A call option on the maximum of 50 assets with random forests

Using regression trees alone is not enough to obtain acceptable results. However, as soon as we aggregate the regressors into random forests, we obtain very satisfying results: with just 10 trees, we converge to a good price. We also notice in this example that using less correlated trees leads to better results (compare the cases max_samples = 50% or 70% with the case max_samples = 90%).

6.3 A put in the Heston model

We consider the Heston model defined by

dSt\displaystyle dS_{t} =St(rtdt+σt(ρdWt1+1ρ2dWt2))\displaystyle=S_{t}(r_{t}dt+\sqrt{\sigma_{t}}(\rho dW_{t}^{1}+\sqrt{1-\rho^{2}}dW_{t}^{2}))
dσt\displaystyle d\sigma_{t} =κ(θσt)dt+ξσtdWt1\displaystyle=\kappa(\theta-\sigma_{t})dt+\xi\sqrt{\sigma_{t}}dW_{t}^{1}

and we consider a put option with payoff \left(K-S_{\tau}\right)^{+}. We have no reference price for this option, so we simply compare the results of regression trees and random forests to those of the LSM method. We use the following set of parameters: K=100, S_{0}=100, T=1, \sigma_{0}=0.01, \xi=0.2, \kappa=2, \rho=-0.3, r=0.1, N=10 and M=100,000. The LSM method yields a price of 1.70. Figures 7 and 8 show the results obtained with regression trees and random forests; both methods converge to the same price as the LSM. We notice in this example the over-fitting phenomenon for regression trees with max_depth=15 and min_samples_leaf=1, and the same behavior for random forests as a function of the percentage of data given to each basis tree.
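
For reproducibility, here is a possible discretization of these dynamics, an Euler scheme with full truncation of the variance; the scheme and the number of time steps are our choices, as the paper does not specify its discretization.

```python
import numpy as np

def simulate_heston(M, n_steps, T, S0, sigma0, kappa, theta, xi, rho, r, seed=0):
    """Euler scheme with full truncation for the variance process (a sketch).
    Returns an array of shape (M, n_steps + 1) of asset paths."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(M, float(S0))
    V = np.full(M, float(sigma0))                 # sigma_t denotes the variance here
    paths = np.empty((M, n_steps + 1))
    paths[:, 0] = S
    for n in range(n_steps):
        dW1 = np.sqrt(dt) * rng.standard_normal(M)
        dW2 = np.sqrt(dt) * rng.standard_normal(M)
        Vp = np.maximum(V, 0.0)                   # full truncation of negative variances
        S = S * np.exp((r - 0.5 * Vp) * dt
                       + np.sqrt(Vp) * (rho * dW1 + np.sqrt(1.0 - rho ** 2) * dW2))
        V = V + kappa * (theta - Vp) * dt + xi * np.sqrt(Vp) * dW1
        paths[:, n + 1] = S
    return paths
```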

Figure 7: A put option in the Heston model with regression trees.
Figure 8: A put option in the Heston model with random forests.

7 Conclusion

Pricing Bermudan options comes down to solving a dynamic programming equation, where the main difficulty comes from the computation of the conditional expectations representing the continuation values. We have explored the use of regression trees and random forests for the computation of these quantities. We have proved the convergence of the algorithm when regression trees are used in two steps: first, the convergence of the conditional expectations; then, the convergence of the Monte Carlo approximation. This problem was particularly hard to solve given that regression trees do not solve a global optimization problem, unlike the functional regression used in the LSM algorithm. We have shown through numerical experiments that we obtain good prices for some classical examples using regression trees, and that aggregating regression trees into random forests yields even better results. We conclude that for small dimensional problems, a simpler algorithm like the LSM is efficient enough; however, for high dimensional problems, it is worth considering random forests. Instead of using all the features of the problem, the basis trees in the forest only use a subset of the features, which helps mitigate the curse of dimensionality.

References

  • Andersen and Broadie (2004) Leif Andersen and Mark Broadie. Primal-dual simulation algorithm for pricing multidimensional American options. Management Science, 50(9), 2004. ISSN 00251909. doi: 10.1287/mnsc.1040.0258.
  • Bally et al. (2005) Vlad Bally, Gilles Pagès, and Jacques Printems. A quantization tree method for pricing and hedging multidimensional American options. Mathematical Finance, 15(1), 2005. ISSN 09601627. doi: 10.1111/j.0960-1627.2005.00213.x.
  • Becker et al. (2019) Sebastian Becker, Patrick Cheridito, and Arnulf Jentzen. Deep optimal stopping. The Journal of Machine Learning Research, 20(1), January 2019. doi: 10.5555/3322706.3362015.
  • Breiman (1999) Leo Breiman. Using Adaptive Bagging to Debias Regressions. Technical Report 547, 1(October), 1999. ISSN 1098-6596.
  • Breiman (2001) Leo Breiman. Random forests. Machine Learning, 45(1), 2001. ISSN 08856125. doi: 10.1023/A:1010933404324.
  • Carriere (1996) Jacques F. Carriere. Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics, 19(1), 1996. ISSN 01676687. doi: 10.1016/S0167-6687(96)00004-2.
  • Ciocan and Mišić (2022) Dragos Florin Ciocan and Velibor V Mišić. Interpretable optimal stopping. Management Science, 68(3):1616–1638, 2022.
  • Clément et al. (2002) Emmanuelle Clément, Damien Lamberton, and Philip Protter. An analysis of a least squares regression method for American option pricing. Finance and Stochastics, 6(4), 2002. ISSN 09492984. doi: 10.1007/s007800200071.
  • Cox et al. (1979) John C. Cox, Stephen A. Ross, and Mark Rubinstein. Option pricing: A simplified approach. Journal of Financial Economics, 7(3), 1979. ISSN 0304405X. doi: 10.1016/0304-405X(79)90015-1.
  • Dietterich (2000) Thomas G. Dietterich. Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2), 2000. ISSN 08856125. doi: 10.1023/A:1007607513941.
  • Ferger (2004) Dietmar Ferger. A continuous mapping theorem for the argmax-functional in the non-unique case. Statistica Neerlandica, 58(1):83–96, February 2004. doi: 10.1046/j.0039-0402.2003.
  • Glasserman (2004) Paul Glasserman. Monte Carlo Methods in Financial Engineering. Springer, 2004.
  • Goudenège et al. (2019) Ludovic Goudenège, Andrea Molent, and Antonino Zanette. Variance reduction applied to machine learning for pricing Bermudan/American options in high dimension, 2019.
  • Han and E (2016) Jiequn Han and Weinan E. Deep learning approximation for stochastic control problems. CoRR, abs/1611.07422, 2016. URL http://arxiv.org/abs/1611.07422.
  • Han et al. (2018) Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • Ho (1998) Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 1998. ISSN 01628828. doi: 10.1109/34.709601.
  • Kohler et al. (2010) Michael Kohler, Adam Krzyzak, and Nebojsa Todorovic. Pricing of high-dimensional American options by neural networks. Mathematical Finance, 20(3), 2010. ISSN 09601627. doi: 10.1111/j.1467-9965.2010.00404.x.
  • Lapeyre and Lelong (2021) Bernard Lapeyre and Jérôme Lelong. Neural network regression for Bermudan option pricing. Monte Carlo Methods and Applications, 27(3), 2021. ISSN 15693961. doi: 10.1515/mcma-2021-2091.
  • Leake et al. (1994) Charles Leake, Reuven Y. Rubinstein, and Alexander Shapiro. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. The Journal of the Operational Research Society, 45(8), 1994. ISSN 01605682. doi: 10.2307/2584023.
  • Ledoux and Talagrand (1991) Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer, 1991. doi: 10.1007/978-3-642-20212-4.
  • Lelong (2018) Jérôme Lelong. Dual pricing of American options by Wiener chaos expansion. SIAM Journal on Financial Mathematics, 9(2), 2018. ISSN 1945497X. doi: 10.1137/16M1102161.
  • Longstaff and Schwartz (2001) Francis A. Longstaff and Eduardo S. Schwartz. Valuing American options by simulation: A simple least-squares approach. Review of Financial Studies, 14(1), 2001. ISSN 08939454. doi: 10.1093/rfs/14.1.113.
  • Lord et al. (2007) R. Lord, F. Fang, F. Bervoets, and C. W. Oosterlee. A fast and accurate FFT-based method for pricing early-exercise options under Lévy processes. SIAM Journal on Scientific Computing, 30(4), 2007. ISSN 10648275. doi: 10.1137/070683878.
  • Ludkovski (2018) Mike Ludkovski. Kriging metamodels and experimental design for Bermudan option pricing. Journal of Computational Finance, 22(1), 2018. ISSN 17552850. doi: 10.21314/JCF.2018.347.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Reppen et al. (2022) A Max Reppen, H Mete Soner, and Valentin Tissot-Daguette. Neural optimal stopping boundary. arXiv preprint arXiv:2205.04595, 2022.
  • Rogers (2002) L. C.G. Rogers. Monte Carlo valuation of American options. Mathematical Finance, 12(3), 2002. ISSN 09601627. doi: 10.1111/1467-9965.02010.
  • Seijo and Sen (2011) Emilio Seijo and Bodhisattva Sen. A continuous mapping theorem for the smallest argmax functional. Electronic Journal of Statistics, 5(none):421 – 439, 2011.
  • Tsitsiklis and Van Roy (1999) John N. Tsitsiklis and Benjamin Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10), 1999. ISSN 00189286. doi: 10.1109/9.793723.