
Pricing Bermudan options using Regression Trees / Random Forests

Zineb El Filali Ech-chafiq, Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. Quantitative analyst at Natixis, Paris. [email protected]    Pierre Henry-Labordère, Head of Quantitative Research Cross Asset, Natixis, 47 quai d'Austerlitz, 75013 Paris. CMAP, Ecole Polytechnique    Jérôme Lelong, Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. [email protected]
Abstract

The value of an American option is the maximized value of the discounted cash flows from the option. At each time step, one needs to compare the immediate exercise value with the continuation value and to exercise as soon as the exercise value is strictly greater than the continuation value. We can formulate this problem as a dynamic programming equation, where the main difficulty comes from the computation of the conditional expectations representing the continuation values at each time step. In (Longstaff and Schwartz, 2001), these conditional expectations were estimated using regressions on a finite-dimensional vector space (typically a polynomial basis). In this paper, we follow the same algorithm, except that the conditional expectations are estimated using regression trees or random forests. We discuss the convergence of the LS algorithm when the standard least squares regression is replaced by regression trees. Finally, we present some numerical results with regression trees and random forests. The random forest algorithm gives excellent results in high dimensions.

keywords: Regression trees, Random forests, Bermudan options, Optimal stopping

1 Introduction

Bermudan options are very widespread in financial markets. Compared to European options, their valuation involves the additional challenge of determining an optimal exercise time. Bermudan options offer the investor the possibility to exercise the option at any date of his choice among a certain number of dates prior to the option expiry, called exercise dates. Naturally, the option holder has to find the optimal date at which to exercise. To do so, at each exercise date, he compares the payoff of immediate exercise to the expected value of continuing to hold the option and exercises only if the immediate exercise value is the larger one. We can formulate this problem as a dynamic programming equation, where the main difficulty comes from the computation of the conditional expectation representing the expected continuation value of the option. Many papers have discussed this issue, starting with regression-based algorithms; see for example (Tsitsiklis and Van Roy, 1999) and (Carriere, 1996). In this category also falls the most commonly used method for pricing Bermudan options, the Least Squares Method (LSM) presented by Longstaff and Schwartz in (Longstaff and Schwartz, 2001), where the conditional expectation is estimated by a least squares regression of the subsequently realized payoffs from continuation on some basis functions of the state variables (usually polynomial functions). Alternatively, an approximately optimal but interpretable policy can be obtained as the solution to a sample average approximation problem, see Ciocan and Mišić (2022). Another class of algorithms focuses on quantization approaches, see for example (Bally et al., 2005); there, the conditional expectations are computed by projecting the diffusion on some optimal grid. We also have a class of duality-based methods that give an upper bound on the option value for a given exercise policy by adding a quantity that penalizes the incorrect exercise decisions made by the sub-optimal policy, see for example (Rogers, 2002), (Andersen and Broadie, 2004) and (Lelong, 2018). The last class of algorithms is based on machine learning techniques. Deep learning has been widely used to solve stochastic control problems Han and E (2016) or their PDE formulation Han et al. (2018). Neural networks can also be used to estimate the continuation values as in (Kohler et al., 2010; Han et al., 2018; Lapeyre and Lelong, 2021) or the exercise boundary (Reppen et al., 2022). The continuation value can also be approximated using Gaussian processes as in (Ludkovski, 2018). Our solution falls in this last category of algorithms: we examine Bermudan option prices when the continuation values are estimated using regression trees or random forests.

Let $X, Y$ be two random variables with values in $[0,1]^{d}$ and $\mathbb{R}$ respectively. A regression tree approximates the conditional expectation $\mathbb{E}\left[Y|X\right]$ with a piecewise constant function. The tree is built recursively, generating a sequence of partitions of $[0,1]^{d}$ that are finer and finer. The approximation value on each set in the partition can be seen as a terminal leaf of the tree. This algorithm is simple and efficient. However, it can easily overfit the data, which results in high generalization errors. To solve this issue, we use ensemble methods to aggregate multiple trees, which means that we create multiple trees and then combine them to produce improved results. We suggest using random forests (see (Breiman, 2001)). This method consists in averaging a collection of trees, where each tree depends on a random vector sampled independently and identically for each tree in the forest. This vector allows the trees in the forest to differ from one another and can be chosen in different ways. For example, one can draw for each tree a sub-sample of the global training data without replacement (this method is called bagging and is thoroughly studied in (Breiman, 1999)). A second method is random split selection, where at each node the split is selected at random among the $K$ best splits, see (Dietterich, 2000). Other methods for aggregating regression trees into random forests can be found in the literature, see for example (Breiman, 2001) or (Ho, 1998).
The structure of the paper is as follows. First, we present the regression tree algorithm and the least squares algorithm using regression trees. Then, we present some convergence results for regression trees and study the convergence of the LS algorithm when regression trees are used to estimate the continuation values. We prove two results: the convergence of the regression solution for an infinite sample size as the depth of the regression tree goes to infinity, and the convergence of the empirical price as the number of samples goes to infinity for a fixed depth. Then, we briefly discuss random forests before we finally study some numerical examples.
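
To fix ideas, here is a minimal, illustrative Python sketch of the bagging variant mentioned above: it averages regression trees fitted on sub-samples drawn without replacement. It uses scikit-learn's DecisionTreeRegressor as a stand-in for the randomized trees of Section 2; the function names and default parameters are ours and not part of the algorithm studied in this paper.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_forest(X, Y, n_trees=50, subsample=0.5, depth=5, rng=None):
    # Fit n_trees regression trees, each on a sub-sample drawn without
    # replacement from the training data, and return the list of trees.
    rng = np.random.default_rng(rng)
    M = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(M, size=int(subsample * M), replace=False)
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X[idx], Y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    # The forest estimate is the average of the individual tree predictions.
    return np.mean([t.predict(X) for t in trees], axis=0)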

2 Regression trees

Let $X$ be a random variable with values in $[0,1]^{d}$ and $Y$ a square integrable real-valued random variable. We want to approximate the conditional expectation $\mathbb{E}[Y|X]$. Throughout this paper, we will consider for computational convenience that $X$ has a density $f_{X}$ on $[0,1]^{d}$ w.r.t. the Lebesgue measure. We are given a training sample $D_{M}=\{(X_{1},Y_{1}),\ldots,(X_{M},Y_{M})\in[0,1]^{d}\times\mathbb{R}\}$, where the $(X_{i},Y_{i})$'s are i.i.d. random variables following the law of $(X,Y)$. An approximation using a regression tree consists in writing the conditional expectation as a piecewise constant function of $X$. Each domain where the function is constant can be seen as a terminal leaf of a tree. Let us start with the one-dimensional case $(d=1)$. Consider the function

m_{M}:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[0,1]\longmapsto\frac{1}{M}\sum_{i=1}^{M}\left(Y_{i}-y_{l}\mathbb{1}_{\{X_{i}\leq x\}}-y_{r}\mathbb{1}_{\{X_{i}>x\}}\right)^{2}.

With probability $q>0$, choose $x$ as the midpoint $x=\frac{1}{2}$ and solve the minimization problem $\inf_{y_{l},y_{r}}m_{M}(y_{l},y_{r},\frac{1}{2})$. With probability $1-q$, solve the minimization problem $\inf_{y_{l},y_{r},x}m_{M}(y_{l},y_{r},x)$. For a fixed $x^{*}$, the optimal values of $y_{l}$ and $y_{r}$ are given by

y_{r}^{*}=\frac{\sum_{i=1}^{M}Y_{i}\mathbb{1}_{\{X_{i}>x^{*}\}}}{\sum_{i=1}^{M}\mathbb{1}_{\{X_{i}>x^{*}\}}};\quad y_{l}^{*}=\frac{\sum_{i=1}^{M}Y_{i}\mathbb{1}_{\{X_{i}\leq x^{*}\}}}{\sum_{i=1}^{M}\mathbb{1}_{\{X_{i}\leq x^{*}\}}}. (1)

The problem $\inf_{y_{l},y_{r},x}m_{M}(y_{l},y_{r},x)$ may admit more than one minimizer. To make the tree unique, we choose the minimizer $(y_{l}^{*},y_{r}^{*},x^{*})$ with the minimal third component, i.e. any solution $(y_{l},y_{r},x)$ to $\inf_{y_{l},y_{r},x}m_{M}(y_{l},y_{r},x)$ satisfies $x\geq x^{*}$.

Once the threshold $x^{*}$ is determined, we split the samples into two groups according to the sign of $X_{i}-x^{*}$ and repeat the process for each group. We stop the process if introducing a new leaf does not improve the MSE (i.e. when $x^{*}$ is on the boundary of the interval on which we solve the minimization problem) or when enough iterations have been made. In the end, we have a tree that approximates the conditional expectation with a piecewise constant function. Regression trees are an algorithmic tool to find an adapted partition and the corresponding weights of this piecewise constant function.
In the multi-dimensional case, we choose the direction (the index along which the optimization is performed) uniformly for each new split. Then, the process is iterated as in the one-dimensional case.

We denote the resulting tree by $\hat{\mathcal{T}}_{p}^{M}(X)$, where $p$ represents the depth of the tree, i.e., the number of iterations done in the process described above. A tree of depth $p$ has $2^{p}$ leaves.
When the size of the training data is infinite, we use the same procedure as above but with expectations rather than empirical means. Consider the function

m:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[0,1]\longmapsto\mathbb{E}\left[\left(Y-y_{l}\mathbb{1}_{\{X\leq x\}}-y_{r}\mathbb{1}_{\{X>x\}}\right)^{2}\right].

Equation (1) becomes

y_{r}^{*}=\mathbb{E}\left[Y|X>x^{*}\right];\quad y_{l}^{*}=\mathbb{E}\left[Y|X\leq x^{*}\right].

Again, when we optimize over $x$, we choose the minimizer $(y_{l}^{*},y_{r}^{*},x^{*})$ with the minimal third component. We denote the tree of depth $p$ obtained with an infinite data set by $\mathcal{T}_{p}(X)$.
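
As an illustration of the construction above, the following Python sketch builds such a randomized tree in the one-dimensional case: with probability $q$ the current cell is cut at its midpoint, otherwise at the empirically MSE-optimal threshold, ties being broken by the smallest threshold. The function names and the dictionary representation are ours and only serve to make the recursion explicit.

import numpy as np

def best_split(x, y):
    # Exhaustive search of the MSE-optimal threshold among the sample points,
    # keeping the smallest threshold in case of ties.
    best = (np.inf, None)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        mse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if mse < best[0]:
            best = (mse, t)
    return best[1]

def fit_tree(x, y, depth, q=0.5, lo=0.0, hi=1.0, rng=None):
    # Recursive construction of the randomized regression tree of Section 2
    # (one-dimensional case): with probability q the cell [lo, hi] is cut at
    # its midpoint, otherwise at the empirically MSE-optimal threshold.
    rng = np.random.default_rng(rng)
    if depth == 0 or len(x) < 2:
        return {"leaf": True, "value": y.mean() if len(y) else 0.0}
    t = 0.5 * (lo + hi) if rng.random() < q else best_split(x, y)
    if t is None:
        return {"leaf": True, "value": y.mean()}
    mask = x <= t
    return {"leaf": False, "threshold": t,
            "left": fit_tree(x[mask], y[mask], depth - 1, q, lo, t, rng),
            "right": fit_tree(x[~mask], y[~mask], depth - 1, q, t, hi, rng)}

def predict(tree, x):
    # Evaluate the piecewise constant function encoded by the tree at a point x.
    while not tree["leaf"]:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]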

3 LS algorithm with regression trees

3.1 Notation

For $p\in\mathbb{N}$, let $\mathcal{A}=\left(\prod_{j=1}^{d}\left[a_{p}^{i-1}(j),a_{p}^{i}(j)\right)\right)_{1\leq i\leq 2^{p}}$ be a partition of $[0,1]^{d}$ with $2^{p}$ elements. We write

\left[a_{p}^{i-1},a_{p}^{i}\right):=\prod_{j=1}^{d}\left[a_{p}^{i-1}(j),a_{p}^{i}(j)\right).

For $(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}\in\mathbb{R}^{2^{p}}$, we define $\mathcal{P}_{p}$ as the piecewise constant function on the partition $\mathcal{A}$ with values $\alpha_{p}^{i}$. For $x\in[0,1]^{d}$,

\mathcal{P}_{p}\left(x,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}\right)=\sum_{i=1}^{2^{p}}\alpha_{p}^{i}\mathbb{1}_{\{x\in[a^{i-1}_{p},a^{i}_{p})\}}.

If $\mathcal{A}$ denotes the partition obtained in the regression tree $\mathcal{T}_{p}$, and we choose

\alpha_{p}^{i}=\mathbb{E}\left[Y|X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right],

then the regression tree $\mathcal{T}_{p}(X)$ can be written as follows

\mathcal{T}_{p}(X)=\mathcal{P}_{p}\left(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}\right) (2)

Similarly, if $\mathcal{A}^{M}=\left(\left[a_{p}^{i-1,M},a_{p}^{i,M}\right)\right)_{1\leq i\leq 2^{p}}$ denotes the partition obtained in the regression tree $\hat{\mathcal{T}}_{p}^{M}$ and

\alpha_{p}^{i,M}=\frac{\sum_{m=1}^{M}Y_{m}\mathbb{1}_{\left\{X_{m}\in\left[a_{p}^{i-1,M},a^{i,M}_{p}\right)\right\}}}{\sum_{m=1}^{M}\mathbb{1}_{\left\{X_{m}\in\left[a_{p}^{i-1,M},a^{i,M}_{p}\right)\right\}}},

then the regression tree $\hat{\mathcal{T}}^{M}_{p}(X)$ can be written as follows

\hat{\mathcal{T}}^{M}_{p}(X)=\mathcal{P}_{p}\left(X,(a_{p}^{i,M})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i,M})_{1\leq i\leq 2^{p}}\right) (3)
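
The notation (2)-(3) simply says that a fitted tree is stored as a partition together with one weight per cell. In dimension one, this evaluation can be sketched as follows (the function name and example values below are ours, purely for illustration):

import numpy as np

def piecewise_constant(x, boundaries, weights):
    # Evaluate P_p(x, (a_p^i), (alpha_p^i)) in dimension one: boundaries is the
    # increasing sequence a_p^0 <= ... <= a_p^{2^p}, and weights[i] is the value
    # taken on the cell [a_p^i, a_p^{i+1}).
    i = np.searchsorted(boundaries, x, side="right") - 1
    i = np.clip(i, 0, len(weights) - 1)
    return weights[i]

# Example: a depth-1 "tree" on [0, 1] split at 0.4 with values 2.0 and 5.0.
print(piecewise_constant(0.25, np.array([0.0, 0.4, 1.0]), np.array([2.0, 5.0])))  # 2.0
print(piecewise_constant(0.9,  np.array([0.0, 0.4, 1.0]), np.array([2.0, 5.0])))  # 5.0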

3.2 Description of the algorithm

Let $T$ be a fixed maturity, and consider the filtered probability space $(\Omega,\mathcal{F},(\mathcal{F}_{t})_{0\leq t\leq T},\mathbb{P})$, where $\mathbb{P}$ is the risk neutral measure. Consider a Bermudan option that can be exercised at dates $0=t_{0}<t_{1}<t_{2}<\cdots<t_{N}=T$. When exercised at time $t_{j}$, the option's discounted payoff is given by $Z_{t_{j}}=h_{j}(X_{t_{j}})$, with $(X_{t_{j}})_{j}$ an adapted Markov process taking values in $[0,1]^{d}$ such that, for every $j$, $X_{t_{j}}$ has a density $f^{j}_{X}$ on $[0,1]^{d}$ uniformly bounded from below on any compact set of $]0,1[^{d}$ by a strictly positive number. The discounted value $(U_{t_{j}})_{0\leq j\leq N}$ of this option is given by

U_{t_{j}}=\mathop{\mathrm{esssup}}_{\tau\in\mathcal{S}_{t_{j},T}}\mathbb{E}\left[Z_{\tau}|\mathcal{F}_{t_{j}}\right], (4)

where $\mathcal{S}_{t_{j},T}$ is the set of all stopping times taking values in $\{t_{j},\dots,T\}$. Using the Snell envelope theory, we know that $U$ solves the dynamic programming equation

\left\{\begin{array}{ll}U_{t_{N}}&=Z_{t_{N}}\\ U_{t_{j}}&=\max\left(Z_{t_{j}},\mathbb{E}\left[U_{t_{j+1}}|\mathcal{F}_{t_{j}}\right]\right)\text{ for }1\leq j\leq N-1.\end{array}\right. (5)

This equation can be rewritten in terms of the optimal policy as follows

\left\{\begin{array}{ll}\tau_{N}&=t_{N}=T\\ \tau_{j}&=t_{j}\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}+\tau_{j+1}\mathbb{1}_{\left\{Z_{t_{j}}<\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\text{ for }1\leq j\leq N-1\end{array}\right. (6)

where $\tau_{j}$ is the smallest optimal stopping time after $t_{j}$. As we are in a Markovian setting, we can write $\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]=\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\right]$. The main difficulty in solving this equation comes from the computation of the continuation value $\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\right]$. In the Least Squares approach presented by (Longstaff and Schwartz, 2001), this conditional expectation is estimated by a linear regression on a countable set of basis functions of $X_{t_{j}}$. In our approach, we propose to estimate it using a regression tree of depth $p$. Let $\mathcal{T}_{p}^{j}(X_{t_{j}})$ be the regression tree approximating $\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\right]$ with an infinite data set. The algorithm solves the following policy

\left\{\begin{array}{ll}\tau_{N}^{p}&=t_{N}=T\\ \tau_{j}^{p}&=t_{j}\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}+\tau_{j+1}^{p}\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}\text{ for }1\leq j\leq N-1.\end{array}\right. (7)

We sample $M$ paths of the model $X_{t_{0}}^{(m)},\ldots,X_{t_{N}}^{(m)}$ along with the corresponding payoff paths $Z_{t_{0}}^{(m)},\ldots,Z_{t_{N}}^{(m)}$, $m=1,\ldots,M$. For each date $j=1,\ldots,N-1$, we approximate the conditional expectation $\mathbb{E}[Z_{\tau_{j+1}}|X_{t_{j}}]$ on the path $m$ using the regression tree $\hat{\mathcal{T}}^{j,M}_{p}(X^{(m)}_{t_{j}})$ built with the samples $(X_{t_{j}}^{(m)},Z^{(m)}_{\hat{\tau}_{j+1}^{p,(m)}})_{1\leq m\leq M}$, where

\left\{\begin{array}{ll}\hat{\tau}_{N}^{p,(m)}&=t_{N}=T\\ \hat{\tau}_{j}^{p,(m)}&=t_{j}\mathbb{1}_{\left\{Z_{t_{j}}^{(m)}\geq\hat{\mathcal{T}}_{p}^{j,M}(X_{t_{j}}^{(m)})\right\}}+\hat{\tau}_{j+1}^{p,(m)}\mathbb{1}_{\left\{Z_{t_{j}}^{(m)}<\hat{\mathcal{T}}_{p}^{j,M}(X_{t_{j}}^{(m)})\right\}}\text{ for }1\leq j\leq N-1.\end{array}\right. (8)

Finally, the time-0 price of the option is approximated by

U_{0}^{p,M}=\max\left(Z_{0},\frac{1}{M}\sum\limits_{m=1}^{M}Z_{\hat{\tau}_{1}^{p,(m)}}^{(m)}\right). (9)
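
For concreteness, here is a compact Python sketch of the backward induction (8)-(9). It is only illustrative: scikit-learn's DecisionTreeRegressor is used as a stand-in for the randomized trees of Section 2, the state variables are not rescaled to $[0,1]^{d}$, and the function names, parameters and the Black-Scholes put example at the end are ours.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lsm_tree_price(X, Z, depth=4):
    # Backward induction (8)-(9): X has shape (M, N+1, d), Z has shape (M, N+1),
    # Z[:, j] being the discounted payoff when exercising at date t_j.
    M, Np1, _ = X.shape
    N = Np1 - 1
    cashflow = Z[:, N].copy()                 # Z_{tau_N} = Z_{t_N}
    for j in range(N - 1, 0, -1):
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X[:, j, :], cashflow)        # estimate of E[Z_{tau_{j+1}} | X_{t_j}]
        continuation = tree.predict(X[:, j, :])
        exercise = Z[:, j] >= continuation    # exercise rule of (8)
        cashflow = np.where(exercise, Z[:, j], cashflow)
    return max(Z[0, 0], cashflow.mean())      # price estimate (9)

# Illustrative use on a discounted Black-Scholes put (arbitrary parameters).
rng = np.random.default_rng(0)
M, N, T, r, sigma, S0, K = 100_000, 10, 1.0, 0.02, 0.2, 1.0, 1.1
dt = T / N
dW = rng.normal(0.0, np.sqrt(dt), (M, N))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * dW, axis=1))
S = np.concatenate([np.full((M, 1), S0), S], axis=1)
disc = np.exp(-r * dt * np.arange(N + 1))
Z = disc * np.maximum(K - S, 0.0)
print(lsm_tree_price(S[:, :, None], Z, depth=5))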

4 Convergence of the algorithm

4.1 Notation

For each time step $1\leq j\leq N-1$, let $\theta_{j}^{p}=((a^{p}_{i,j})_{0\leq i\leq 2^{p}},(\alpha^{p}_{i,j})_{1\leq i\leq 2^{p}})$ be the parameters of $\mathcal{T}_{p}^{j}$ and $\hat{\theta}_{j}^{p,M}=((a^{p,M}_{i,j})_{0\leq i\leq 2^{p}},(\alpha^{p,M}_{i,j})_{1\leq i\leq 2^{p}})$ be the parameters of $\hat{\mathcal{T}}_{p}^{j,M}$. Note that with this notation and Equations (2) and (3), we have

\mathcal{T}_{p}^{j}(X_{t_{j}})=\mathcal{P}_{p}(X_{t_{j}},\theta_{j}^{p}) (10)
\hat{\mathcal{T}}_{p}^{j,M}(X_{t_{j}}^{(m)})=\mathcal{P}_{p}(X_{t_{j}}^{(m)},\hat{\theta}_{j}^{p,M}) (11)

where for the sake of conciseness we have shrunk the partition and weight parameters of the tree into a single multi-dimensional parameter.

We gather the coefficients of the successive expansions into the vectors $\vartheta^{p}=(\theta_{1}^{p},\ldots,\theta_{N-1}^{p})$ and $\hat{\vartheta}^{p,M}=(\hat{\theta}_{1}^{p,M},\ldots,\hat{\theta}_{N-1}^{p,M})$.
Let $t^{p}=(t_{1}^{p},\ldots,t_{N-1}^{p})$ be a deterministic parameter with values in the same space as $\vartheta^{p}$, and let $z=(z_{1},\ldots,z_{N})\in\mathbb{R}^{N}$ and $x=(x_{1},\ldots,x_{N})\in([0,1]^{d})^{N}$ be deterministic vectors. Following the notation of (Clément et al., 2002), we define the function $F=(F_{1},\ldots,F_{N})$ taking values in $\mathbb{R}^{N}$ by

\left\{\begin{array}{ll}F_{N}(t^{p},z,x)&=z_{N}\\ F_{j}(t^{p},z,x)&=z_{j}\mathbb{1}_{\{z_{j}\geq\mathcal{P}_{p}(x_{j},t_{j}^{p})\}}+F_{j+1}(t^{p},z,x)\mathbb{1}_{\{z_{j}<\mathcal{P}_{p}(x_{j},t_{j}^{p})\}},\text{ for }1\leq j\leq N-1.\end{array}\right.

$F_{j}(t^{p},z,x)$ only depends on $t_{j}^{p},\ldots,t_{N-1}^{p}$ and not on the first $j-1$ components of $t^{p}$. Using (10), we have

F_{j}(\vartheta^{p},Z,X)=Z_{\tau_{j}^{p}},\qquad F_{j}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})=Z_{\hat{\tau}_{j}^{p,(m)}}^{(m)}.

Moreover, we clearly have that for every deterministic parameter $t^{p}$,

\left\lvert F_{j}(t^{p},Z,X)\right\rvert\leq\max\limits_{k\geq j}\left\lvert Z_{t_{k}}\right\rvert. (12)

4.2 Convergence of the conditional expectations

4.2.1 Preliminary results

We define

J_{p}\left(a_{p}^{0},\ldots,a_{p}^{p}\right):=\mathbb{E}\left[\left\lvert\sum_{i=1}^{p}\alpha^{i}_{p}\mathbb{1}_{\{X\in\left[a^{i-1}_{p},a^{i}_{p}\right)\}}-\mathbb{E}\left[Y|X\right]\right\rvert^{2}\right],

with $\alpha_{p}^{i}=\mathbb{E}[Y|X\in[a^{i-1}_{p},a^{i}_{p})]$. We will use the following standard lemma to prove the convergence of the random tree estimators.

Lemma 4.1.

Let $X$ be a random variable with a density $f_{X}$ w.r.t. the Lebesgue measure on $[0,1]^{d}$ and $Y$ be a real-valued square integrable random variable. Let $\left(\left(\left[a^{i-1}_{p},a^{i}_{p}\right)\right)_{1\leq i\leq p}\right)_{p\in\mathbb{N}}$ be a sequence of partitions of $[0,1]^{d}$ such that $\lim\limits_{p\to\infty}\max\limits_{1\leq i\leq p}\max\limits_{1\leq j\leq d}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert=0$. Then,

\lim_{p\to\infty}J_{p}\left(a^{0}_{p},\ldots,a^{p}_{p}\right)=0.
Theorem 4.2.
\lim_{p\to\infty}\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right\rvert^{2}\right]=0.
Proof.

With probability $\frac{1}{d}$, the index $j$ is chosen for optimisation. In the other $d-1$ cases, we do not even cut along that direction, in which case the interval length is at most equal to the length of the largest interval at step $p-1$. When the index $j$ is chosen: with probability $q$, the length of the interval is cut in two, and with probability $1-q$, it is cut to optimize the MSE; in that case, the interval length is at most equal to the length of the largest interval at step $p-1$.

For all $1\leq j\leq d$,

\mathbb{E}\left[\max_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]
\leq\frac{1}{d}\left[\frac{q}{2}\,\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]+(1-q)\,\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]\right]+\frac{d-1}{d}\,\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]
\leq\left(1-\frac{q}{2d}\right)\mathbb{E}\left[\max_{1\leq i\leq 2^{p-1}}\left\lvert a^{i}_{p-1}(j)-a^{i-1}_{p-1}(j)\right\rvert\right]
\leq\left(1-\frac{q}{2d}\right)^{p},

where the last bound follows by iterating the inequality down to depth $0$, at which the cell length is $1$.

Since $\sum\limits_{p=0}^{\infty}\left(1-\frac{q}{2d}\right)^{p}<\infty$, the series $\sum\limits_{p=0}^{\infty}\mathbb{E}\left[\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]$ also converges. As $\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert$ is non-negative for all $p$, using Tonelli's theorem we conclude that $\mathbb{E}\left[\sum\limits_{p=0}^{\infty}\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]<\infty$. As a result, the series $\sum\limits_{p=0}^{\infty}\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert$ converges a.s. Then, $\lim\limits_{p\to\infty}\max\limits_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert=0$ a.s. for all $j$, and $\lim\limits_{p\to\infty}\max\limits_{1\leq i\leq 2^{p}}\max\limits_{1\leq j\leq d}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert=0$.
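
For instance, with $d=2$ and $q=\frac{1}{2}$, the bound above reads

\mathbb{E}\left[\max_{1\leq i\leq 2^{p}}\left\lvert a^{i}_{p}(j)-a^{i-1}_{p}(j)\right\rvert\right]\leq\left(1-\frac{1/2}{4}\right)^{p}=\left(\frac{7}{8}\right)^{p},

so the expected mesh of the partition decays geometrically with the depth $p$.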

Let $\mathcal{G}$ be the $\sigma$-field generated by the splitting strategy (direction choice and threshold strategy). Conditioning on $\mathcal{G}$ allows us to consider the partition $\left([a^{i-1}_{p},a^{i}_{p})\right)_{1\leq i\leq 2^{p}}$ as deterministic, and we can apply Lemma 4.1 to prove

\lim_{p\to\infty}\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}\left[Y|X\right]\right\rvert^{2}|\mathcal{G}\right]=0\text{ a.s.}

Note that

\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)\right\rvert^{2}|\mathcal{G}\right]\leq\mathbb{E}\left[\sum_{i=1}^{2^{p}}\mathbb{E}\left[Y^{2}|X\in[a^{i-1}_{p},a^{i}_{p})\right]\mathbb{1}_{\{X\in[a^{i-1}_{p},a^{i}_{p})\}}|\mathcal{G}\right]
\leq\sum_{i=1}^{2^{p}}\mathbb{E}\left[\mathbb{E}\left[Y^{2}\mathbb{1}_{\{X\in[a^{i-1}_{p},a^{i}_{p})\}}|X\in[a^{i-1}_{p},a^{i}_{p})\right]|\mathcal{G}\right]
\leq\sum_{i=1}^{2^{p}}\mathbb{E}\left[Y^{2}\mathbb{1}_{\{X\in[a^{i-1}_{p},a^{i}_{p})\}}|\mathcal{G}\right]
\leq\mathbb{E}\left[Y^{2}|\mathcal{G}\right].

Then,

\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}\left[Y|X\right]\right\rvert^{2}|\mathcal{G}\right]\leq 2\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)\right\rvert^{2}|\mathcal{G}\right]+2\mathbb{E}\left[\mathbb{E}\left[Y|X\right]^{2}\right]
\leq 2\left(\mathbb{E}\left[Y^{2}|\mathcal{G}\right]+\mathbb{E}\left[\mathbb{E}\left[Y|X\right]^{2}\right]\right).

Using Lebesgue's dominated convergence theorem,

\lim_{p\to\infty}\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right\rvert^{2}\right]=\lim_{p\to\infty}\mathbb{E}\left[\mathbb{E}\left[\left\lvert\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right\rvert^{2}|\mathcal{G}\right]\right]=0.\qed

Once the partition of the tree is computed, the values associated with each element of the partition are uniquely determined by a global optimization problem.

Proposition 4.3.

Let $\mathcal{A}=\left(\prod_{j=1}^{d}\left[a_{p}^{i-1}(j),a_{p}^{i}(j)\right)\right)_{1\leq i\leq 2^{p}}$ denote the partition associated to $\mathcal{T}_{p}$. Then,

\mathbb{E}\left[\left(\mathcal{T}_{p}(X)-\mathbb{E}[Y|X]\right)^{2}\right]=\inf_{(\alpha_{p}^{i})_{1\leq i\leq 2^{p}}}\mathbb{E}\left[\left(\mathcal{P}_{p}(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}})-\mathbb{E}[Y|X]\right)^{2}\right]

and moreover the right hand side admits a unique minimizer given by

\alpha_{p}^{i}=\mathbb{E}\left[Y|X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right].
Proof.

Since the function $\mathcal{P}_{p}$ is linear w.r.t. the parameter $\alpha$, the function

\alpha\in\mathbb{R}^{2^{p}}\longmapsto\mathbb{E}\left[\left(\mathcal{P}_{p}(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}})-\mathbb{E}[Y|X]\right)^{2}\right]

is strongly convex. Hence, it admits a unique minimizer defined by the first order optimality conditions stating that for all $1\leq i\leq 2^{p}$,

\mathbb{E}\left[\left(\mathcal{P}_{p}(X,(a_{p}^{i})_{0\leq i\leq 2^{p}},(\alpha_{p}^{i})_{1\leq i\leq 2^{p}})-\mathbb{E}[Y|X]\right)\mathbb{1}_{\left\{X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right\}}\right]=0
\iff\mathbb{E}\left[\left(\alpha_{p}^{i}-\mathbb{E}[Y|X]\right)\mathbb{1}_{\left\{X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right\}}\right]=0
\iff\alpha_{p}^{i}=\mathbb{E}\left[Y|X\in\left[a_{p}^{i-1},a^{i}_{p}\right)\right].\qed

A similar result holds for $\hat{\mathcal{T}}_{p}^{M}$ by replacing the conditional expectations with empirical conditional expectations.

4.2.2 Approximation of the conditional expectations with regression trees

The following result proves that the approximation of the Bermudan price produced by our algorithm converges to the true Bermudan price.

Proposition 4.4.
\lim_{p\to\infty}\mathbb{E}\left[Z_{\tau_{j}^{p}}|\mathcal{F}_{t_{j}}\right]=\mathbb{E}\left[Z_{\tau_{j}}|\mathcal{F}_{t_{j}}\right]\text{ in }\mathbb{L}^{2}(\Omega)\text{ for }1\leq j\leq N. (13)
Proof.

We proceed by induction.
For $j=N$, the proposition is true since $\tau_{N}^{p}=\tau_{N}=T$. Assume that the result holds for $j+1$; let us prove that it still holds for $j$:

\mathbb{E}\left[Z_{\tau_{j}^{p}}-Z_{\tau_{j}}|\mathcal{F}_{t_{j}}\right]=Z_{t_{j}}\left(\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right)+\mathbb{E}\left[Z_{\tau_{j+1}^{p}}\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-Z_{\tau_{j+1}}\mathbb{1}_{\left\{Z_{t_{j}}<\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}|\mathcal{F}_{t_{j}}\right]
=(Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}])\left(\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right)+\mathbb{E}[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\,\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}
=A_{j}^{p}+\mathbb{E}[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\,\mathbb{1}_{\left\{Z_{t_{j}}<\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}},

where $A_{j}^{p}$ is defined by

A_{j}^{p}=(Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}])\left(\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right).

On the one hand, since the conditional expectation is an orthogonal projection, we have

\mathbb{E}\left[\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]\leq\mathbb{E}\left[\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j+1}}\right]\right\rvert^{2}\right]

and using the induction assumption, $\mathbb{E}\left[Z_{\tau_{j+1}^{p}}-Z_{\tau_{j+1}}|\mathcal{F}_{t_{j+1}}\right]\to 0$ in $\mathbb{L}^{2}(\Omega)$ as $p\to\infty$. On the other hand,

\left\lvert A_{j}^{p}\right\rvert=\left\lvert Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\right\rvert\,\left\lvert\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right\rvert
\leq\left\lvert Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\right\rvert\;\left\lvert\mathbb{1}_{\left\{\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]>Z_{t_{j}}\geq\mathcal{T}_{p}^{j}(X_{t_{j}})\right\}}-\mathbb{1}_{\left\{\mathcal{T}_{p}^{j}(X_{t_{j}})>Z_{t_{j}}\geq\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\}}\right\rvert
\leq\left\lvert Z_{t_{j}}-\mathbb{E}[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}]\right\rvert\;\mathbb{1}_{\left\{\left\lvert Z_{t_{j}}-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert\leq\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert\right\}}
\leq\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert
\leq\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert+\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert.

Using the induction assumption, the second term goes to zero in $\mathbb{L}^{2}(\Omega)$ as $p\to\infty$. Let $\left(\left[a_{i-1}(p),a_{i}(p)\right)\right)_{1\leq i\leq 2^{p}}$ be the partition generated by $\mathcal{T}_{p}^{j}$. We define

\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})=\sum_{i=1}^{2^{p}}\mathbb{E}\left[Z_{\tau_{j+1}}|X_{t_{j}}\in\left[a_{i-1}(p),a_{i}(p)\right)\right]\mathbb{1}_{\{X_{t_{j}}\in\left[a_{i-1}(p),a_{i}(p)\right)\}}.

Note that $\mathcal{\bar{T}}_{p}^{j}$ uses the partition given by $\mathcal{T}_{p}^{j}(X_{t_{j}})$, but the coefficients $\alpha_{i}(p)$ are given by the conditional expectations of $Z_{\tau_{j+1}}$ w.r.t. $X_{t_{j}}$ and not those of $Z_{\tau_{j+1}^{p}}$. Using Proposition 4.3, we have the following inequality stating that $\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})$ is sub-optimal compared to $\mathcal{T}_{p}^{j}(X_{t_{j}})$:

\mathbb{E}\left[\left\lvert\mathcal{T}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]\leq\mathbb{E}\left[\left\lvert\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]
\leq 2\mathbb{E}\left[\left\lvert\mathcal{\bar{T}}_{p}^{j}(X_{t_{j}})-\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right]+2\mathbb{E}\left[\left\lvert\mathbb{E}\left[Z_{\tau_{j+1}}|\mathcal{F}_{t_{j}}\right]-\mathbb{E}\left[Z_{\tau_{j+1}^{p}}|\mathcal{F}_{t_{j}}\right]\right\rvert^{2}\right].

The second term goes to 0 using the induction assumption. As for the first term, note that the partition obtained with $\mathcal{T}_{p}^{j}$ satisfies the conditions of Lemma 4.1. Then, using the same arguments as in the proof of Theorem 4.2, we can show that the first term also goes to 0. ∎

4.3 Convergence of the Monte Carlo approximation

In this section, the depth $p$ of the trees is fixed and we study the convergence with respect to the number of samples $M$. We assume that, for all dates $j$, the trees $\mathcal{T}^{j}_{p}(X_{t_{j}})$ and $\hat{\mathcal{T}}^{j,M}_{p}(X_{t_{j}})$ are built using the same splitting strategy (i.e. the same splitting directions and threshold strategies).

4.3.1 Convergence of optimisation problems

We present three major results to study the convergence of stochastic optimization problems related to regression trees.

The first result is a uniform strong law of large numbers, see (Leake et al., 1994, Chap. 2, Lemma. A1), which can be seen as a particular case of the strong law of large numbers in Banach spaces (Ledoux and Talagrand, 1991, Corollary 7.10, page 189).

Lemma 4.5.

Let $(\xi_{i})_{i\geq 1}$ be a sequence of i.i.d. $\mathbb{R}^{n}$-valued random vectors and $h:\mathbb{R}^{d}\times\mathbb{R}^{n}\to\mathbb{R}$ be a measurable function. Assume that

  • (i)

    for all $\bar{\theta}\in\mathbb{R}^{d}$, the function $\theta\in\mathbb{R}^{d}\mapsto h(\theta,\xi_{1})$ is continuous at $\bar{\theta}$ a.s.,

  • (ii)

    $\forall C>0,\ \mathbb{E}\left[\sup_{\left\lvert\theta\right\rvert\leq C}\left\lvert h(\theta,\xi_{1})\right\rvert\right]<\infty$.

Then, a.s. $\theta\in\mathbb{R}^{d}\mapsto\frac{1}{n}\sum_{i=1}^{n}h(\theta,\xi_{i})$ converges locally uniformly to the continuous function $\theta\in\mathbb{R}^{d}\mapsto\mathbb{E}\left[h(\theta,\xi_{1})\right]$, i.e.

\forall\,C>0,\;\lim_{n\to\infty}\sup_{\left\lvert\theta\right\rvert\leq C}\left\lvert\frac{1}{n}\sum_{i=1}^{n}h(\theta,\xi_{i})-\mathbb{E}\left[h(\theta,\xi_{1})\right]\right\rvert=0\text{ a.s.}

This lemma is a slight improvement of (Leake et al., 1994, Chap. 2, Lemma A1), which was formulated under the assumption that the function $\theta\in\mathbb{R}^{d}\mapsto h(\theta,\xi_{1})$ is almost surely continuous. However, looking closely at their proof, it turns out that it is sufficient to assume (i) for the conclusion to hold. Condition (i) allows the $\mathbb{P}$-null sets on which the continuity at $\bar{\theta}$ fails to depend on $\bar{\theta}$.

Consider a sequence of real-valued functions $(f_{n})_{n}$ defined on a compact set $K\subset\mathbb{R}^{d}$ such that there exists a sequence $(x_{n})_{n}$ satisfying

f_{n}(x_{n})=\inf_{x\in K}f_{n}(x).

From (Leake et al., 1994, Chap. 2, Theorem A1), we can derive the following lemma

Lemma 4.6.

Assume that the sequence $(f_{n})_{n}$ converges uniformly on $K$ to a continuous function $f$. Let $v^{*}=\inf_{x\in K}f(x)$ and $\mathcal{S}^{*}=\{x\in K\,:\,f(x)=v^{*}\}$. Then, $f_{n}(x_{n})\to\inf_{x\in K}f(x)$ and $d(x_{n},\mathcal{S}^{*})\to 0$.

Now, we focus on the canonical minimization problem appearing at each node of the regression tree. Let $[\underline{a},\bar{a}]$ be a rectangle in $[0,1]^{d}$. For some index $1\leq\delta\leq d$ and a real number $a$ in $]\underline{a}^{\delta},\bar{a}^{\delta}[$, we define the two new rectangles

\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(a)=\{x\in[\underline{a},\bar{a}]\;:\;x^{\delta}\leq a\};\quad\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(a)=\{x\in[\underline{a},\bar{a}]\;:\;x^{\delta}>a\}. (14)

We consider the cost functions

m_{M}:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[\underline{a}^{\delta},\bar{a}^{\delta}]\longmapsto\frac{1}{M}\sum_{i=1}^{M}\left(Y_{i}-y_{l}\mathbb{1}_{\left\{X_{i}\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{X_{i}\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)\right\}}\right)^{2} (15)
m:y_{l}\in\mathbb{R},y_{r}\in\mathbb{R},x\in[\underline{a}^{\delta},\bar{a}^{\delta}]\longmapsto\mathbb{E}\left[\left(Y-y_{l}\mathbb{1}_{\left\{X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)\right\}}\right)^{2}\right]. (16)

Let $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ be the solution to

\inf_{y_{l},y_{r},x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]}m_{M}(y_{l},y_{r},x) (17)

with the smallest third component, for some arbitrarily small $\varepsilon_{0}>0$. It is clear that (17) may not have a unique solution. Choosing the solution with the minimal third component is a standard way to get a unique minimizer (see for instance Seijo and Sen (2011)).

Lemma 4.7.

Assume that the density $f_{X}$ of $X$ satisfies $f_{X}(x)\geq\underline{f}>0$ for all $x$ in any compact set of $]0,1[^{d}$ and that the minimization problem

\inf_{y_{l},y_{r},x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]}m(y_{l},y_{r},x) (18)

has a unique minimizer $(y_{l}^{*},y_{r}^{*},x^{*})$.

Then, $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $(y_{l}^{*},y_{r}^{*},x^{*})$ and $m_{M}(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $m(y_{l}^{*},y_{r}^{*},x^{*})$ when $M$ goes to infinity.

Proof.

Consider the function

h(y_{l},y_{r},x,y,\xi)=\left(y-y_{l}\mathbb{1}_{\left\{\xi\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{\xi\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)\right\}}\right)^{2}.

Since $X$ has a density on $[0,1]^{d}$, the function $(y_{l},y_{r},x)\longmapsto h(y_{l},y_{r},x,Y,X)$ satisfies Condition (i) of Lemma 4.5. Moreover, for $C>0$,

\mathbb{E}\left[\sup_{\lvert y_{l}\rvert\leq C,\lvert y_{r}\rvert\leq C,x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]}\left\lvert h(y_{l},y_{r},x,Y,X)\right\rvert\right]\leq 2\mathbb{E}[Y^{2}]+2C^{2}<\infty.

Hence, we deduce from Lemma 4.5 that $m_{M}$ converges a.s. locally uniformly to $m$. Note that for a fixed $x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]$, the optimal values of $y_{l}$ and $y_{r}$ are given by

y_{l}=\mathbb{E}[Y|X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)],\qquad y_{r}=\mathbb{E}[Y|X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,r}(x)].

Observe that for $x\in[\underline{a}^{\delta}+\varepsilon_{0},\bar{a}^{\delta}-\varepsilon_{0}]$,

\lvert\mathbb{E}[Y|X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x)]\rvert\leq\frac{\sqrt{\mathbb{E}[Y^{2}]}}{\sqrt{\mathbb{P}(X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(x))}}\leq\frac{\sqrt{\mathbb{E}[Y^{2}]}}{\sqrt{\mathbb{P}(X\in\mathcal{R}_{\underline{a},\bar{a}}^{\delta,l}(\underline{a}^{\delta}+\varepsilon_{0}))}}.

The upper bound is finite thanks to the assumption on the density $f_{X}$ of $X$. Hence, as a function of $x$, $y_{l}$ is uniformly bounded. A similar result holds for $y_{r}$. Then, we deduce from Lemma 4.6 that $m_{M}(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $m(y_{l}^{*},y_{r}^{*},x^{*})$ and that $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges almost surely to $(y_{l}^{*},y_{r}^{*},x^{*})$. ∎

To the best of our knowledge, without the uniqueness assumption on $(y_{l}^{*},y_{r}^{*},x^{*})$, we can only prove that $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ converges in probability to $(y_{l}^{*},y_{r}^{*},x^{*})$. See for instance (Ferger, 2004, Corollary 1), which yields that for any $\eta>0$, $\mathbb{P}(\hat{x}^{M}<x^{*}+\eta)\to 1$. Combining this result with $d((\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M}),\mathcal{S}^{*})\to 0$ a.s. yields the convergence in probability of $(\hat{y}_{l}^{M},\hat{y}_{r}^{M},\hat{x}^{M})$ to $(y_{l}^{*},y_{r}^{*},x^{*})$.
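
The convergence stated in Lemma 4.7 can be checked numerically on a toy model. In the sketch below (all names and the model are ours, and a grid search over thresholds stands in for the exact minimization of (17)), $Y=\mathbb{1}_{\{X>1/2\}}+\text{noise}$ with $X$ uniform on $[0,1]$, so that the limiting problem has the unique minimizer $(y_{l}^{*},y_{r}^{*},x^{*})=(0,1,\frac{1}{2})$:

import numpy as np

def empirical_node_split(x, y, eps=0.05, n_grid=200):
    # Approximate minimizer of m_M over (y_l, y_r) and thresholds in
    # [eps, 1 - eps], using a grid of candidate thresholds and the
    # closed-form optimal y_l, y_r of (1) for each candidate.
    best = (np.inf, None, None, None)
    for t in np.linspace(eps, 1 - eps, n_grid):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        yl, yr = left.mean(), right.mean()
        mse = ((left - yl)**2).sum() + ((right - yr)**2).sum()
        if mse < best[0]:
            best = (mse, yl, yr, t)
    return best[1], best[2], best[3]

# Toy model: Y = 1_{X > 1/2} + noise, X uniform on [0, 1].
rng = np.random.default_rng(1)
for M in (1_000, 10_000, 100_000):
    X = rng.uniform(size=M)
    Y = (X > 0.5).astype(float) + 0.1 * rng.normal(size=M)
    print(M, empirical_node_split(X, Y))   # should approach (0, 1, 0.5) as M grows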

4.3.2 Strong law of large numbers

In order to prove the almost sure convergence of $\hat{\mathcal{T}}^{M}_{j,p}(X)$, we slightly modify the design of our regression trees. At every node, when computing the optimal splitting point $x^{*}$, we do not perform the optimization on the entire cell but add a margin to make sure that $y^{*}_{l}$ and $y^{*}_{r}$ are uniformly bounded w.r.t. $x^{*}$. Consider a node at depth $p$; the optimal split is obtained by minimizing $m_{M}$ defined in (15), where $\underline{a}$ and $\bar{a}$ were computed at depth $p-1$. As in (17), we search for the optimal value of $x^{*}$ in $[\underline{a}^{\delta}+\varepsilon_{p},\bar{a}^{\delta}-\varepsilon_{p}]$, where $\delta$ is the coordinate chosen for the splitting and $\varepsilon_{p}>0$ is a technical security padding decreasing to 0. When there is no room left for this padding ($\lvert\bar{a}^{\delta}-\underline{a}^{\delta}\rvert<2\varepsilon_{p}$), we stop the splitting procedure.

From (Clément et al., 2002), we have the following result

Lemma 4.8.

For every $j=1,\ldots,N-1$,

\left\lvert F_{j}(a,Z,X)-F_{j}(b,Z,X)\right\rvert\leq\left(\sum_{i=j}^{N}\left\lvert Z_{t_{i}}\right\rvert\right)\left(\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}-\mathcal{P}_{p}^{i}(X_{t_{i}},b_{i})\right\rvert\leq\left\lvert\mathcal{P}_{p}^{i}(X_{t_{i}},a_{i})-\mathcal{P}_{p}^{i}(X_{t_{i}},b_{i})\right\rvert\}}\right).
Proposition 4.9.

Assume that for all $p\in\mathbb{N}^{*}$ and all $1\leq j\leq N-1$, $\mathbb{P}(Z_{t_{j}}=\mathcal{P}_{p}^{j}(X_{t_{j}},\theta_{j}^{p}))=0$. Assume moreover that, for all $1\leq j\leq N-1$ and all $1\leq i\leq 2^{p}$, the optimization problems

\inf_{y_{l},y_{r},x}\mathbb{E}\left[\left(Z_{\tau^{p}_{j}}-y_{l}\mathbb{1}_{\left\{X_{t_{j}}\in\mathcal{R}_{a^{i-1}_{p},a^{i}_{p}}^{\delta,l}(x)\right\}}-y_{r}\mathbb{1}_{\left\{X_{t_{j}}\in\mathcal{R}_{a^{i-1}_{p},a^{i}_{p}}^{\delta,r}(x)\right\}}\right)^{2}\right]

with $x\in[a^{i-1,\delta}_{p}+\varepsilon_{p},a^{i,\delta}_{p}-\varepsilon_{p}]$ admit a unique solution, where $\delta$ is the coordinate selected by the splitting strategy.

Then, for all $j=1,\ldots,N-1$, $\hat{\theta}_{j}^{p,M}$ converges a.s. to $\theta_{j}^{p}$ and $\mathcal{P}_{p}^{j}(X_{t_{j}},\hat{\theta}_{j}^{p,M})$ converges a.s. to $\mathcal{P}_{p}^{j}(X_{t_{j}},\theta_{j}^{p})$ as $M\to\infty$.

Proof.

We proceed by backward induction on $j$ with a nested forward induction on $p$.

Step 1: $j=N-1$

  • For $p=1$, conditionally on splitting at the best point (and not at the midpoint), the new nodes are obtained by solving

    \mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{1,M})=\inf\limits_{\alpha,\beta,a}\frac{1}{M}\sum_{m=1}^{M}\left\lvert Z_{t_{N}}^{(m)}-\alpha\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{0,1}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in[a,1]\}}\right\rvert^{2}
    \mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{1})=\inf_{\alpha,\beta,a}\mathbb{E}\left[\left\lvert Z_{t_{N}}-\alpha\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{0,1}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}\in[a,1]\}}\right\rvert^{2}\right]

    where $a\in[\varepsilon_{1},1-\varepsilon_{1}]$. We can apply Lemma 4.7 to obtain that $\mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{1,M})$ converges to $\mathcal{P}_{1}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{1})$ a.s. as $M\to\infty$ and that $\hat{\theta}_{N-1}^{1,M}$ converges a.s. to $\theta_{N-1}^{1}$. If the split is at the midpoint, the conclusion is even easier to obtain as the infima are only computed w.r.t. $\alpha$ and $\beta$ but not $a$. The same situation will occur repeatedly and, for the sake of clarity, we will only treat the case of splitting at the best point, which is harder to handle.

  • Assume that $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{p,M})$ converges to $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{p})$ a.s. and that $\hat{\theta}_{N-1}^{p,M}$ converges a.s. to $\theta_{N-1}^{p}$ as $M\to\infty$ for some $p\geq 1$; we will prove it for $p+1$. For $i\in\{1,\ldots,2^{p}\}$, we consider the computation of the $i$-th node of $\hat{\mathcal{T}}_{p+1}^{N-1,M}$ at depth $p+1$:

    \hat{\nu}_{p,N-1}^{M}(\alpha,\beta,a)=\frac{1}{M}\sum_{m=1}^{M}\left\lvert Z_{t_{N}}^{(m)}-\alpha\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p,M}_{i-1,N-1},a^{p,M}_{i,N-1}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p,M}_{i-1,N-1},a^{p,M}_{i,N-1}}^{\delta,r}(a)\}}\right\rvert^{2}.

    The parameters $\hat{\theta}_{N-1}^{p+1,M}$ are obtained by computing all the nodes. We also introduce a modified version of $\hat{\nu}_{p,N-1}^{M}$ in which we use the splits computed in $\mathcal{T}_{p}^{N-1}$:

    \nu^{M}_{p,N-1}(\alpha,\beta,a)=\frac{1}{M}\sum_{m=1}^{M}\left\lvert Z_{t_{N}}^{(m)}-\alpha\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p}_{i-1,N-1},a^{p}_{i,N-1}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}^{(m)}\in\mathcal{R}_{a^{p}_{i-1,N-1},a^{p}_{i,N-1}}^{\delta,r}(a)\}}\right\rvert^{2}.

    The random functions $\nu^{M}_{p,N-1}$ write as standard empirical means, and we will prove that they are close to $\hat{\nu}_{p,N-1}^{M}$ for $M$ large.

    Using Lemma 4.5 along with the arguments of the proof of Lemma 4.7, it is easy to see that the random function $(\alpha,\beta,a)\mapsto\nu^{M}_{p,N-1}(\alpha,\beta,a)$ converges a.s. locally uniformly to the function $(\alpha,\beta,a)\mapsto\mathbb{E}\left[\left\lvert Z_{t_{N}}-\alpha\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a_{i-1,N-1}^{p},a^{p}_{i,N-1}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a_{i-1,N-1}^{p},a^{p}_{i,N-1}}^{\delta,r}(a)\}}\right\rvert^{2}\right]$. Moreover,

    \sup\limits_{a\in[0,1],\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}^{M}_{p,N-1}(\alpha,\beta,a)-\nu^{M}_{p,N-1}(\alpha,\beta,a)\right\rvert\leq\frac{1}{M}\sum_{m=1}^{M}\left[2\left\lvert Z_{t_{N}}^{(m)}\right\rvert+4C\right]C\left(\mathbb{1}_{\{X_{t_{N-1}}^{(m),\delta}\in(a^{p,M,\delta}_{i-1},a^{p,\delta}_{i-1})\}}+\mathbb{1}_{\{X_{t_{N-1}}^{(m),\delta}\in(a^{p,M,\delta}_{i},a^{p,\delta}_{i})\}}\right).

    Using the induction assumption on $p$, $a^{p,M}_{i}$ (resp. $a^{p,M}_{i-1}$) converges a.s. to $a_{i}^{p}$ (resp. $a_{i-1}^{p}$). Let $\epsilon>0$,

    \limsup_{M}\left\lvert\hat{\nu}^{M}_{p,N-1}(\alpha,\beta,a)-\nu^{M}_{p,N-1}(\alpha,\beta,a)\right\rvert\leq\limsup_{M}\frac{1}{M}\sum_{m=1}^{M}\left[2\left\lvert Z_{t_{N}}^{(m)}\right\rvert+4C\right]C\left(\mathbb{1}_{\{\left\lvert X_{t_{N-1}}^{(m),\delta}-a^{p,\delta}_{i-1}\right\rvert\leq\epsilon\}}+\mathbb{1}_{\{\left\lvert X_{t_{N-1}}^{(m),\delta}-a^{p,\delta}_{i}\right\rvert\leq\epsilon\}}\right)
    \leq C\left(4C+2\mathbb{E}\left[\left\lvert 2Z_{t_{N}}\right\rvert\right]\right)\left(\mathbb{P}\left(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i-1}\right\rvert\leq\epsilon\right)+\mathbb{P}\left(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i}\right\rvert\leq\epsilon\right)\right).

    Since $X_{t_{N-1}}$ has a density, $\lim_{\epsilon\to 0}\mathbb{P}(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i-1}\right\rvert\leq\epsilon)=\mathbb{P}(X_{t_{N-1}}^{\delta}=a_{i-1}^{p,\delta})=0$ and $\lim_{\epsilon\to 0}\mathbb{P}(\left\lvert X_{t_{N-1}}^{\delta}-a^{p,\delta}_{i}\right\rvert\leq\epsilon)=\mathbb{P}(X_{t_{N-1}}^{\delta}=a_{i}^{p,\delta})=0$. As a result, $\left\lvert\hat{\nu}^{M}_{p,N-1}-\nu_{p,N-1}^{M}\right\rvert\to 0$ a.s. locally uniformly when $M\to\infty$. Thus, the random function $\hat{\nu}_{p,N-1}^{M}$ converges a.s. locally uniformly to the function

    \alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert Z_{t_{N}}-\alpha\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a^{p}_{i-1},a^{p}_{i}}^{\delta,l}(a)\}}-\beta\mathbb{1}_{\{X_{t_{N-1}}\in\mathcal{R}_{a^{p}_{i-1},a^{p}_{i}}^{\delta,r}(a)\}}\right\rvert^{2}\right].

    Now, we apply Lemma 4.6 along with the same arguments as in the proof of Lemma 4.7 to conclude that $\mathcal{P}_{p+1}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{p+1,M})$ converges to $\mathcal{P}_{p+1}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{p+1})$ a.s. as $M\to\infty$ and that $\hat{\theta}_{N-1}^{p+1,M}$ converges a.s. to $\theta_{N-1}^{p+1}$.

Step 2: $j<N-1$.
So far, we have proved that for all $p$, $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\hat{\theta}_{N-1}^{p,M})$ converges to $\mathcal{P}_{p}^{N-1}(X_{t_{N-1}},\theta_{N-1}^{p})$ a.s. and that $\hat{\theta}_{N-1}^{p,M}$ converges a.s. to $\theta_{N-1}^{p}$ as $M\to\infty$. Now, suppose that $\mathcal{P}_{p}^{k}(X_{t_{k}},\hat{\theta}_{k}^{p,M})$ (resp. $\hat{\theta}_{k}^{p,M}$) converges to $\mathcal{P}_{p}^{k}(X_{t_{k}},\theta_{k}^{p})$ (resp. $\theta_{k}^{p}$) a.s. as $M\to\infty$ for all $p$ and for $k=N-1,\ldots,j+1$. We shall prove that these convergence results hold for $j$, and this will be done by induction on $p$. Since the multidimensional nature of the random variables $X_{t_{j}}$ does not play any role in the proof, we write the rest of the proof as if $X$ took values in $\mathbb{R}$ in order to lighten the notation.

  • For $p=1$, consider

    \hat{\nu}_{1,j}^{M}(\alpha,\beta,a)=\frac{1}{M}\sum\limits_{m=1}^{M}\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,1]\}}\right\rvert^{2}
    \nu^{M}_{1,j}(\alpha,\beta,a)=\frac{1}{M}\sum\limits_{m=1}^{M}\left\lvert F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,1]\}}\right\rvert^{2}.

    The function $\nu_{1,j}^{M}$ writes as a sum of i.i.d. random variables. Let $C\geq 0$; using Equation (12),

    \mathbb{E}\left[\sup\limits_{a\in[0,1],\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert F_{j+1}\left(\vartheta^{1},Z,X\right)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]
    \leq 2\mathbb{E}\left[\left\lvert F_{j+1}\left(\vartheta^{1},Z,X\right)\right\rvert^{2}\right]+2\mathbb{E}\left[\sup\limits_{a\in[0,1],\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}+\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]
    \leq 2\mathbb{E}\left[\max_{l\geq j+1}(Z_{t_{l}})^{2}\right]+2C^{2}<\infty.

    Using Lemma 4.5, α,β,aν1,jM(α,β,a)\alpha,\beta,a\mapsto\nu_{1,j}^{M}(\alpha,\beta,a) converges a.s locally uniformly to the function α,β,a𝔼[|Fj+1(ϑ1,Z,X)α𝟙{Xtj[0,a)}β𝟙{Xtj[a,1]}|2]\alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{1},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]. It remains to prove that C>0supa[0,1]d,|α|C,|β|C|ν^1,jM(a,α,β)ν1,jM(a,α,β)|0\forall C>0\;\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert\to 0 a.s when MM\to\infty.
    Then, using Equation (12) and Lemma 4.8

    supa[0,1]d,|α|C,|β|C|ν^1,jM(a,α,β)ν1,jM(a,α,β)|\displaystyle\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert
    supa[0,1]d,|α|C,|β|C1Mm=1M|Fj+1(ϑ^1,M,Z(m),X(m))Fj+1(ϑ1,Z(m),X(m))|\displaystyle\leq\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\frac{1}{M}\sum_{m=1}^{M}\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)-F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)\right\rvert
    |Fj+1(ϑ^1,M,Z(m),X(m))+Fj+1(ϑ1,Z(m),X(m))2α𝟙{Xtj(m)[0,a)}2β𝟙{Xtj(m)[a,1]}|\displaystyle\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)+F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)-2\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[0,a)\}}-2\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,1]\}}\right\rvert
    m=1M2(maxlj+1|Ztl(m)|+2C)|Fj+1(ϑ^1,M,Z(m),X(m))Fj+1(ϑ1,Z(m),X(m))|\displaystyle\leq\sum_{m=1}^{M}2\left(\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+2C\right)\left\lvert F_{j+1}\left(\hat{\vartheta}^{1,M},Z^{(m)},X^{(m)}\right)-F_{j+1}\left(\vartheta^{1},Z^{(m)},X^{(m)}\right)\right\rvert
    1Mm=1M2(maxlj+1|Ztl(m)|+2C)(i=j+1N|Zti(m)|i=j+1N1𝟙{|Zti(m)𝒯1i(Xti(m))||𝒯^1i,M(Xti(m))𝒯1i(Xti(m))|}).\displaystyle\leq\frac{1}{M}\sum_{m=1}^{M}2\left(\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+2C\right)\left(\sum_{i=j+1}^{N}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\left\lvert\hat{\mathcal{T}}_{1}^{i,M}(X_{t_{i}}^{(m)})-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\}}\right).

Using the induction assumption on j, \hat{\mathcal{T}}_{1}^{i,M}(X_{t_{i}}^{(m)}) converges a.s. to \mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)}) for all j+1\leq i\leq N-1. Let \epsilon>0.

    lim supMsupa[0,1]d,|α|C,|β|C|ν^1,jM(a,α,β)ν1,jM(a,α,β)|\displaystyle\limsup_{M}\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert
\displaystyle\leq\limsup_{M}\frac{1}{M}\sum_{m=1}^{M}2\left(\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+2C\right)\left(\sum_{i=j+1}^{N}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}\right).

Since \mathbb{P}(Z_{t_{j}}=\mathcal{T}_{1}^{j}(X_{t_{j}}))=0, \lim_{\epsilon\to 0}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{1}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}=0 a.s. and we conclude that a.s. \left\lvert\hat{\nu}_{1,j}^{M}(a,\alpha,\beta)-\nu_{1,j}^{M}(a,\alpha,\beta)\right\rvert converges to zero uniformly. Thus, \hat{\nu}_{1,j}^{M} converges a.s. uniformly to the function a,\alpha,\beta\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{1},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[0,a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,1]\}}\right\rvert^{2}\right]. Then, we apply Lemma 4.5 and the arguments of the proof of Lemma 4.7 to deduce that \mathcal{P}_{1}^{j}(X_{t_{j}},\hat{\theta}_{j}^{1,M}) converges to \mathcal{P}_{1}^{j}(X_{t_{j}},\theta_{j}^{1}) a.s. and that \hat{\theta}_{j}^{1,M} converges a.s. to \theta_{j}^{1} as M\to\infty.

  • We assume that \mathcal{P}_{p}^{k}(X_{t_{k}},\hat{\theta}_{k}^{p,M}) (resp. \hat{\theta}_{k}^{p,M}) converges to \mathcal{P}_{p}^{k}(X_{t_{k}},\theta_{k}^{p}) (resp. \theta_{k}^{p}) a.s. as M\to\infty for some p and for k=N-1,\ldots,j. We prove that the same holds for p+1.

    Let i{1,,2p}i\in\{1,\ldots,2^{p}\} and consider

    ν^p,jM(α,β,a)\displaystyle\hat{\nu}_{p,j}^{M}(\alpha,\beta,a) =1Mm=1M|Fj+1(ϑ^p,M,Z(m),X(m))α𝟙{Xtj(m)[ai1,jp,M,a)}β𝟙{Xtj(m)[a,ai,jp,M)}|2.\displaystyle=\frac{1}{M}\sum_{m=1}^{M}\left\lvert F_{j+1}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p,M}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,a^{p,M}_{i,j})\}}\right\rvert^{2}.
    νp,jM(α,β,a)\displaystyle\nu_{p,j}^{M}(\alpha,\beta,a) =1Mm=1M|Fj+1(ϑp,Z(m),X(m))α𝟙{Xtj(m)[ai1,jp,a)}β𝟙{Xtj(m)[a,ai,jp)}|2\displaystyle=\frac{1}{M}\sum_{m=1}^{M}\left\lvert F_{j+1}(\vartheta^{p},Z^{(m)},X^{(m)})-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,a^{p}_{i,j})\}}\right\rvert^{2}

The function \nu_{p,j}^{M} can be written as a sum of i.i.d. random variables. Let C\geq 0,

    𝔼[supa[0,1]d,|α|C,|β|C|Fj+1(ϑp,Z(m),X(m))α𝟙{Xtj(m)[ai1,jp,a)}β𝟙{Xtj(m)[a,ai,jp)}|2]\displaystyle\mathbb{E}\left[\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert F_{j+1}\left(\vartheta^{p},Z^{(m)},X^{(m)}\right)-\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a,a^{p}_{i,j})\}}\right\rvert^{2}\right]
    2𝔼[|Fj+1(ϑp,Z(m),X(m))|2]+2C2\displaystyle\leq 2\mathbb{E}\left[\left\lvert F_{j+1}\left(\vartheta^{p},Z^{(m)},X^{(m)}\right)\right\rvert^{2}\right]+2C^{2}
    2𝔼[maxlj+1(Ztl)2]+2C2<.\displaystyle\leq 2\mathbb{E}\left[\max_{l\geq j+1}(Z_{t_{l}})^{2}\right]+2C^{2}<\infty.

    We conclude using Lemma 4.5 that a.s νp,jM\nu_{p,j}^{M} converges locally uniformly to the function

    α,β,a𝔼[|Fj+1(ϑp,Z,X)α𝟙{Xtj[ai1,jp,a)}β𝟙{Xtj(a,ai,jp)}|2].\alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{p},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in(a,a^{p}_{i,j})\}}\right\rvert^{2}\right].

    Let C>0C>0,

    supa[0,1]d,|α|C,|β|C|ν^p,jM(α,β,a)νp,jM(α,β,a)|\displaystyle\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{p,j}^{M}(\alpha,\beta,a)-\nu_{p,j}^{M}(\alpha,\beta,a)\right\rvert
    supa[0,1]d,|α|C,|β|C1Mm=1M(2maxlj+1|Ztl(m)|+4C)\displaystyle\leq\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\frac{1}{M}\sum_{m=1}^{M}\left(2\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+4C\right)
    [((i=j+1N|Zti(m)|)i=j+1N1𝟙{|Zti𝒯pi(Xti(m))||𝒯^pi,M(Xti(m))𝒯pi(Xti(m))|})\displaystyle\left[\left(\left(\sum_{i=j+1}^{N}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\right)\sum_{i=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\left\lvert\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}}^{(m)})-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\}}\right)\right.
    +|α𝟙{Xtj(m)[ai1,jp,ai1,jp,M)}+β𝟙{Xtj(m)[ai,jp,M,ai,jp]}|].\displaystyle\left.+\left\lvert\alpha\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p}_{i-1,j},a^{p,M}_{i-1,j})\}}+\beta\mathbb{1}_{\{X_{t_{j}}^{(m)}\in[a^{p,M}_{i,j},a^{p}_{i,j}]\}}\right\rvert\right].

Using the induction assumption on p, \lim_{M\to\infty}a^{p,M}_{i-1,j}=a^{p}_{i-1,j} and \lim_{M\to\infty}a^{p,M}_{i,j}=a^{p}_{i,j} a.s., and using the induction assumption on j, \lim_{M\to\infty}\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}}^{(m)})=\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)}) for all i\geq j+1. Let \epsilon>0,

    lim supMsupa[0,1]d,|α|C,|β|C|ν^p,jM(α,β,a)νp,jM(α,β,a)|\displaystyle\limsup_{M}\sup\limits_{a\in[0,1]^{d},\left\lvert\alpha\right\rvert\leq C,\left\lvert\beta\right\rvert\leq C}\left\lvert\hat{\nu}_{p,j}^{M}(\alpha,\beta,a)-\nu_{p,j}^{M}(\alpha,\beta,a)\right\rvert
    lim supM1Mm=1M(2maxlj+1|Ztl(m)|+4C)\displaystyle\leq\limsup_{M}\frac{1}{M}\sum_{m=1}^{M}\left(2\max_{l\geq j+1}\left\lvert Z_{t_{l}}^{(m)}\right\rvert+4C\right)
\displaystyle\left[C\left(\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i-1,j}\right\rvert\leq\epsilon\}}+\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i,j}\right\rvert\leq\epsilon\}}\right)+\left(\left(\sum_{k=j+1}^{N}\left\lvert Z_{t_{k}}^{(m)}\right\rvert\right)\sum_{k=j+1}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{k}}^{(m)}-\mathcal{T}_{p}^{k}(X_{t_{k}}^{(m)})\right\rvert\leq\epsilon\}}\right)\right].

Since \lim_{\epsilon\to 0}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}=\lim_{\epsilon\to 0}\left\lvert\alpha\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i-1,j}\right\rvert\leq\epsilon\}}+\beta\mathbb{1}_{\{\left\lvert X_{t_{j}}^{(m)}-a^{p}_{i,j}\right\rvert\leq\epsilon\}}\right\rvert=0 a.s., we conclude that a.s. \left\lvert\hat{\nu}_{p,j}^{M}(\alpha,\beta,a)-\nu_{p,j}^{M}(\alpha,\beta,a)\right\rvert\to 0 locally uniformly as M\to\infty, and thus the random function \hat{\nu}_{p,j}^{M} converges a.s. locally uniformly to the function \alpha,\beta,a\mapsto\mathbb{E}\left[\left\lvert F_{j+1}(\vartheta^{p},Z,X)-\alpha\mathbb{1}_{\{X_{t_{j}}\in[a^{p}_{i-1,j},a)\}}-\beta\mathbb{1}_{\{X_{t_{j}}\in[a,a^{p}_{i,j})\}}\right\rvert^{2}\right]. Then we apply Lemma 4.6 along with the arguments of the proof of Lemma 4.7 to conclude that \mathcal{P}_{p+1}^{k}(X_{t_{k}},\hat{\theta}_{k}^{p+1,M}) (resp. \hat{\theta}_{k}^{p+1,M}) converges to \mathcal{P}_{p+1}^{k}(X_{t_{k}},\theta_{k}^{p+1}) (resp. \theta_{k}^{p+1}) a.s. as M\to\infty for k=N-1,\ldots,j. This completes the induction.

Theorem 4.10.

Assume that for all pp\in\mathbb{N}^{*}, and all 1jN11\leq j\leq N-1, (Ztj=𝒯pj(Xtj))=0\mathbb{P}(Z_{t_{j}}=\mathcal{T}_{p}^{j}(X_{t_{j}}))=0. Then, for α=1,2\alpha=1,2 and for every j=1,,Nj=1,\ldots,N,

\lim\limits_{M\to\infty}\frac{1}{M}\sum\limits_{m=1}^{M}\left(Z_{\tau_{j}^{p,(m)}}^{(m)}\right)^{\alpha}=\mathbb{E}\left[(Z_{\tau_{j}^{p}})^{\alpha}\right]\quad\textnormal{a.s.}
Proof.

Note that \mathbb{E}\left[(Z_{\tau_{j}^{p}})^{\alpha}\right]=\mathbb{E}\left[F_{j}(\vartheta^{p},Z,X)^{\alpha}\right] and by the strong law of large numbers

\lim\limits_{M\to\infty}\frac{1}{M}\sum_{m=1}^{M}F_{j}\left(\vartheta^{p},Z^{(m)},X^{(m)}\right)^{\alpha}=\mathbb{E}\left[F_{j}(\vartheta^{p},Z,X)^{\alpha}\right]\quad\textnormal{a.s.}

It remains to prove that

ΔFM=1Mm=1MFj(ϑ^p,M,Z(m),X(m))αFj(ϑp,Z(m),X(m))αMa.s0.\Delta F_{M}=\frac{1}{M}\sum\limits_{m=1}^{M}F_{j}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})^{\alpha}-F_{j}(\vartheta^{p},Z^{(m)},X^{(m)})^{\alpha}\xrightarrow[M\to\infty]{a.s}0.

For any x,yx,y\in\mathbb{R}, and α=1,2\alpha=1,2, |xαyα||xy||xα1+yα1|\left\lvert x^{\alpha}-y^{\alpha}\right\rvert\leq\left\lvert x-y\right\rvert\left\lvert x^{\alpha-1}+y^{\alpha-1}\right\rvert. Using Lemma 4.8 and Equation (12), we have

|ΔFM|\displaystyle\left\lvert\Delta F_{M}\right\rvert 1Mm=1M|Fj(ϑ^p,M,Z(m),X(m))αFj(ϑp,Z(m),X(m))α|\displaystyle\leq\frac{1}{M}\sum\limits_{m=1}^{M}\left\lvert F_{j}(\hat{\vartheta}^{p,M},Z^{(m)},X^{(m)})^{\alpha}-F_{j}(\vartheta^{p},Z^{(m)},X^{(m)})^{\alpha}\right\rvert
21Mm=1Mi=jNmaxkj|Ztk(m)|α1|Zti(m)|i=jN1𝟙{|Zti(m)𝒯pi(Xti(m))||𝒯^pi,M(Xti(m))𝒯pi(Xti(m))|}.\displaystyle\leq 2\frac{1}{M}\sum\limits_{m=1}^{M}\sum_{i=j}^{N}\max\limits_{k\geq j}\left\lvert Z_{t_{k}}^{(m)}\right\rvert^{\alpha-1}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\left\lvert\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}}^{(m)})-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\}}.

Using Proposition 4.9, for all i=j,,N1i=j,\ldots,N-1, |𝒯^pi,M(Xti)𝒯pi(Xti)|0\left\lvert\hat{\mathcal{T}}_{p}^{i,M}(X_{t_{i}})-\mathcal{T}_{p}^{i}(X_{t_{i}})\right\rvert\to 0 a.s when MM\to\infty. Then for any ϵ>0\epsilon>0,

lim supM|ΔFM|\displaystyle\limsup_{M}\left\lvert\Delta F_{M}\right\rvert
2lim supM1Mm=1Mi=jNmaxkj|Ztk(m)|α1|Zti(m)|i=jN1𝟙{|Zti(m)𝒯pi(Xti(m))|ϵ}\displaystyle\leq 2\limsup_{M}\frac{1}{M}\sum\limits_{m=1}^{M}\sum\limits_{i=j}^{N}\max\limits_{k\geq j}\left\lvert Z_{t_{k}}^{(m)}\right\rvert^{\alpha-1}\left\lvert Z_{t_{i}}^{(m)}\right\rvert\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}^{(m)}-\mathcal{T}_{p}^{i}(X_{t_{i}}^{(m)})\right\rvert\leq\epsilon\}}
2𝔼[i=jNmaxkj|Ztk|α1|Zti|i=jN1𝟙{|Zti𝒯pi(Xti)|ϵ}].\displaystyle\leq 2\mathbb{E}\left[\sum\limits_{i=j}^{N}\max\limits_{k\geq j}\left\lvert Z_{t_{k}}\right\rvert^{\alpha-1}\left\lvert Z_{t_{i}}\right\rvert\sum_{i=j}^{N-1}\mathbb{1}_{\{\left\lvert Z_{t_{i}}-\mathcal{T}_{p}^{i}(X_{t_{i}})\right\rvert\leq\epsilon\}}\right].

We conclude that lim supM|ΔFM|=0\limsup_{M}\left\lvert\Delta F_{M}\right\rvert=0 by letting ϵ\epsilon go to 0 which ends the proof. ∎

Proving the global convergence of our algorithm amounts to studying the difference

\displaystyle\frac{1}{M}\sum_{m=1}^{M}Z_{\tau_{j}^{p,(m)}}^{(m)}-\mathbb{E}\left[Z_{\tau_{j}}\right]=\left(\frac{1}{M}\sum_{m=1}^{M}Z_{\tau_{j}^{p,(m)}}^{(m)}-\mathbb{E}\left[Z_{\tau_{j}^{p}}\right]\right)+\left(\mathbb{E}\left[Z_{\tau_{j}^{p}}\right]-\mathbb{E}\left[Z_{\tau_{j}}\right]\right).

By Proposition 4.4, the term \mathbb{E}\left[Z_{\tau_{j}^{p}}\right]-\mathbb{E}\left[Z_{\tau_{j}}\right] goes to zero as p\to\infty, independently of M. Even though Theorem 4.10 proves the a.s. convergence of the first term for any fixed p as M\to\infty, its behaviour is less clear when both M and p go to infinity. Note that the same difficulty already occurred in Clément et al. (2002) for the more standard least squares approach. The main difficulty stems from the inability to compute the variance of the limiting distribution appearing in the central limit theorem governing the convergence of \frac{1}{M}\sum_{m=1}^{M}Z_{\tau_{j}^{p,(m)}}^{(m)}, even in the least squares framework. In particular, we do not seem to be able to control the behaviour of the limiting variance with respect to p.

5 Random forests

Regression trees are rarely used as standalone estimators but are often aggregated to form a random forest. Consider our training sample \mathcal{D}_{M}=\{(X_{i},Y_{i})_{1\leq i\leq M}\} of size M. Let \Theta be a random variable, independent of \mathcal{D}_{M}, used to resample the training set without replacement; note that the size of the resampled set is usually smaller than M. Let \Theta_{1},\dots,\Theta_{B} be B i.i.d. samples of \Theta. For every k=1,\dots,B, \hat{\mathcal{T}}^{M}_{p,\Theta_{k}}(X) denotes the regression tree of depth p computed on the resampled set obtained from \mathcal{D}_{M} using \Theta_{k}. Then, the random forest estimator is defined by

B,p(X)=k=1B1B𝒯^p,ΘkM(X).\mathcal{H}_{B,p}(X)=\sum_{k=1}^{B}\frac{1}{B}\hat{\mathcal{T}}^{M}_{p,\Theta_{k}}(X).

We also introduce p(X)=𝔼Θ[𝒯^p,ΘM(X)]=limBB,p(X)\mathcal{H}_{p}(X)=\mathbb{E}_{\Theta}\left[\hat{\mathcal{T}}^{M}_{p,\Theta}(X)\right]=\lim_{B\to\infty}\mathcal{H}_{B,p}(X).
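
As an illustration, the following sketch (with our own function name and placeholder parameter values) builds the estimator \mathcal{H}_{B,p}: B regression trees of depth p are fit on subsamples of \mathcal{D}_{M} drawn without replacement, which play the role of \Theta_{1},\dots,\Theta_{B}, and their predictions are averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_estimator(X, Y, X_new, B=50, p=5, subsample=0.5, seed=0):
    """Average of B regression trees of depth p, each fit on a random
    subsample of (X, Y) drawn without replacement (a sketch of H_{B,p})."""
    rng = np.random.default_rng(seed)
    M = len(X)
    m_sub = int(subsample * M)                  # resampled set is smaller than M
    pred = np.zeros(len(X_new))
    for _ in range(B):
        idx = rng.choice(M, size=m_sub, replace=False)   # plays the role of Theta_k
        tree = DecisionTreeRegressor(max_depth=p).fit(X[idx], Y[idx])
        pred += tree.predict(X_new)
    return pred / B                             # average of the B tree predictions
```

In the numerical section, this aggregation is delegated to Scikit-Learn's RandomForestRegressor, which resamples with replacement (bootstrap) rather than without.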

Theorem 5.1.
limB𝔼[|YB,p(X)|2]=𝔼[|Yp(X)|2]\lim_{B\to\infty}\mathbb{E}\left[\left\lvert Y-\mathcal{H}_{B,p}(X)\right\rvert^{2}\right]=\mathbb{E}\left[\left\lvert Y-\mathcal{H}_{p}(X)\right\rvert^{2}\right]

See Theorem 11.1 in (Breiman, 2001).

Theorem 5.2.
𝔼[|Yp(X)|2]ρ¯𝔼Θ[𝔼[|Y𝒯^p,ΘM(X)|2]]\mathbb{E}\left[\left\lvert Y-\mathcal{H}_{p}(X)\right\rvert^{2}\right]\leq\bar{\rho}\mathbb{E}_{\Theta}\left[\mathbb{E}\left[\left\lvert Y-\hat{\mathcal{T}}^{M}_{p,\Theta}(X)\right\rvert^{2}\right]\right]

where \bar{\rho} is the weighted correlation between the residuals Y-\hat{\mathcal{T}}^{M}_{p,\Theta}(X) and Y-\hat{\mathcal{T}}^{M}_{p,\Theta^{\prime}}(X), with \Theta and \Theta^{\prime} independent. See Theorem 11.2 in (Breiman, 2001).

Theorem 5.2 shows that, for the random forest to have a good generalization error, the basis trees should have small generalization errors and should not be highly correlated.
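
A classical back-of-the-envelope computation (not taken from (Breiman, 2001), but consistent with Theorem 5.2) makes this trade-off explicit: if the B tree predictions are identically distributed with variance \sigma^{2} and pairwise correlation \rho, then

\operatorname{Var}\left(\mathcal{H}_{B,p}(X)\right)=\rho\,\sigma^{2}+\frac{1-\rho}{B}\,\sigma^{2},

so increasing B only removes the second term; the correlated part \rho\sigma^{2} remains, which is why decorrelating the basis trees matters as much as adding more of them.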

6 Numerical results

6.1 Description

This section studies the prices of some Bermudan options when regression trees or random forests are used to approximate the conditional expectations. We compare the results to some reference prices and to those given by the standard Longstaff-Schwartz method with regression on polynomial functions, which is a basic LSM algorithm; more sophisticated representations can be used to mildly temper the curse of dimensionality. We use the Scikit-Learn library in Python (Pedregosa et al., 2011). For regression trees, this library offers two splitting methods: “best” chooses the best split, meaning that the split threshold is the one that minimizes the MSE and the splitting direction is the one that gives the lowest MSE among all directions; “random” chooses the best random split, meaning that the split threshold is the one that minimizes the MSE while the splitting direction is chosen randomly. For the following tests, we use the latter method, which differs only slightly from what we presented in Section 2 in that no mid-point cuts are considered. We also use the parameter min_samples_leaf, which sets a minimum number of samples in each leaf and helps avoid over-fitting. For random forests, we use bootstrapping (bootstrap=True), meaning that each tree in the forest is trained on a sub-sample drawn randomly, with replacement, from the training data. We also use the parameter max_samples, which sets the number of data points (or the percentage of the training data) attributed to each tree. Training the trees on data sets that are as different as possible keeps the correlation between the trees low, which, by Theorem 5.2, should make the random forest more robust.
Following the work of (Longstaff and Schwartz, 2001), we only use the in-the-money paths to learn the continuation values, which significantly improves the numerical computations. All the prices we report are obtained after resimulation, meaning that the paths used to estimate the conditional expectations are not the ones used in the final Monte Carlo computation, so the reported prices are unbiased.
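
To fix ideas, here is a minimal sketch of one step of the backward induction as it can be implemented with the Scikit-Learn estimators and options described above; the function and variable names are ours and the parameter values are placeholders, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

def backward_step(X_tj, payoff_tj, cashflow_next, discount, use_forest=False):
    """Update the pathwise cash flows at date t_j.
    X_tj: (M, d) state variables, payoff_tj: (M,) immediate exercise values,
    cashflow_next: (M,) cash flows collected from t_{j+1} onwards."""
    cashflow = discount * cashflow_next          # default decision: continue
    itm = payoff_tj > 0                          # regress on in-the-money paths only
    if not np.any(itm):
        return cashflow
    if use_forest:
        reg = RandomForestRegressor(n_estimators=50, max_depth=5,
                                    min_samples_leaf=100,
                                    bootstrap=True, max_samples=0.5)
    else:
        reg = DecisionTreeRegressor(splitter="random", max_depth=5,
                                    min_samples_leaf=100)
    reg.fit(X_tj[itm], cashflow[itm])
    continuation = reg.predict(X_tj[itm])
    # exercise when the immediate payoff strictly exceeds the estimated continuation
    exercise = payoff_tj[itm] > continuation
    itm_idx = np.where(itm)[0]
    cashflow[itm_idx[exercise]] = payoff_tj[itm][exercise]
    return cashflow
```

The time-0 price is then the average of the resulting cash flows over a fresh set of simulated paths, which is the resimulation step mentioned above.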

For small dimensional problems, our algorithm takes approximately the same computation time as the LSM approach for a comparable accuracy. To measure the computational time of our algorithm, we take into account both the training and prediction times; note, however, that the training step far outweighs the prediction step. Once the trees are trained, evaluating them is straightforward, which makes the resimulation price almost free. The computational time of the random forest approach depends linearly on the number of regression trees in the forest.

6.2 Black and Scholes

Consider the Black and Scholes model

\begin{cases}dS_{t}^{i}&=rS_{t}^{i}\,dt+\sigma_{i}S_{t}^{i}\,dB_{t}^{i},\\ d\langle B^{i},B^{j}\rangle_{t}&=\rho_{ij}\,dt,\end{cases}

where \sigma_{i} is the volatility of the underlying S^{i}, assumed deterministic, r is the interest rate, assumed constant, and \rho_{ij} is the correlation between the underlyings S^{i} and S^{j}, also assumed constant.
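
As a complement, here is a minimal sketch of how such correlated Black-Scholes paths can be simulated exactly on the exercise grid; the function name, the optional dividend yields delta and the use of NumPy's default generator are our choices.

```python
import numpy as np

def simulate_black_scholes(M, N, T, S0, r, sigma, corr, delta=0.0, seed=0):
    """Simulate M paths of d correlated Black-Scholes assets on N dates.
    Returns an array of shape (M, N + 1, d); S0, sigma, delta are length-d
    arrays (or scalars) and corr is the d x d correlation matrix."""
    rng = np.random.default_rng(seed)
    S0, sigma, delta = map(np.atleast_1d, (np.asarray(S0, float),
                                           np.asarray(sigma, float),
                                           np.asarray(delta, float)))
    d = len(S0)
    dt = T / N
    L = np.linalg.cholesky(corr)                 # correlates the Brownian increments
    S = np.empty((M, N + 1, d))
    S[:, 0, :] = S0
    drift = (r - delta - 0.5 * sigma ** 2) * dt
    for n in range(N):
        G = rng.standard_normal((M, d)) @ L.T    # correlated standard normals
        S[:, n + 1, :] = S[:, n, :] * np.exp(drift + sigma * np.sqrt(dt) * G)
    return S
```

For instance, the max-call examples below use \sigma_{i}=0.2, \delta_{i}=0.1 and \rho_{ij}=0, which corresponds to sigma=0.2, delta=0.1 and corr=np.eye(d) in this sketch.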

6.2.1 One-dimensional put option

We consider the Bermudan put option with payoff (K-S_{\tau})^{+}, maturity T=1 year, K=110, S_{0}=100, \sigma=0.25, r=0.1, exercisable at N=10 different dates. We have a reference price of 11.987 for this option, computed by a convolution method in (Lord et al., 2007). The LSM algorithm converges to the correct price with only a polynomial of degree 3. Figure 1 shows the price of the option when we use regression trees with a random split strategy (continuous line) or a best split strategy (dotted line) to estimate the conditional expectations. With the random strategy, the best price we get is 11.89. The case min_samples_leaf=1 and max_depth=20 gives a price of 10.5, which is far from the reference price. This is due to over-fitting: in this case, the number of degrees of freedom is too large, so the tree fits the training data very well but cannot generalize when confronted with new data. With the best split strategy, we obtain a slightly better price of 11.94. However, depending on the tree parameters, the price fluctuates, and we see that the best split strategy is not necessarily better than the random split strategy. Thus, in the following, we keep using the random split strategy. Random forests with basis trees of maximum depth 5 and a minimum of 100 samples in each leaf converge to the correct price with only ten trees.

Figure 1: One-dimensional put option with regression trees; reference price 11.987.

6.2.2 Call option on the maximum of two assets

We consider a call option on the maximum of two assets with payoff (\max(S_{\tau}^{1},S_{\tau}^{2})-K)^{+}. We use the same set of parameters as in (Glasserman, 2004), for which we have reference prices of 13.90, 8.08 and 21.34 for S_{0}^{i}=100, 90 and 110 respectively. The LSM algorithm using a polynomial of degree 5 converges to prices of 13.90, 8.06 and 21.34 for the cases S_{0}^{i}=100, 90 and 110 respectively. This is a small dimensional problem, so the convergence of the LSM is expected. With regression trees, the results are slightly less satisfying, as shown in Figure 2. We can still see the over-fitting that occurs when the regression trees are given too many degrees of freedom. Aggregating the regression trees into random forests immediately improves the results, as shown in Figure 3. Note that the lower the percentage of data given to each basis tree, the better the results, which confirms Theorem 5.2.

Figure 2: Call on the maximum of two assets with regression trees, K=100, T=3 years, \sigma^{i}=0.2, r=0.05, \rho_{ij}=0, \delta_{i}=0.1, N=9, M=100,000.
Figure 3: Call on the maximum of two assets with random forests, K=100, T=3 years, \sigma^{i}=0.2, r=0.05, \rho_{ij}=0, \delta_{i}=0.1, N=9, M=100,000.

6.2.3 Geometric basket option

We consider a Bermudan put option on a geometric basket of d underlyings with payoff

(K(i=1dSτi)1d)+.\left(K-\left(\prod\limits_{i=1}^{d}S_{\tau}^{i}\right)^{\frac{1}{d}}\right)^{+}.
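
Although the paper does not detail it, reference prices for this payoff can be obtained from a one-dimensional CRR tree: assuming the underlyings pay no dividends, the geometric average \left(\prod_{i=1}^{d}S_{t}^{i}\right)^{1/d} is itself a geometric Brownian motion with volatility \hat{\sigma} and effective dividend yield \hat{\delta} given by

\hat{\sigma}^{2}=\frac{1}{d^{2}}\sum_{i,j=1}^{d}\rho_{ij}\sigma_{i}\sigma_{j},\qquad\hat{\delta}=\frac{1}{2d}\sum_{i=1}^{d}\sigma_{i}^{2}-\frac{\hat{\sigma}^{2}}{2},

so the d-dimensional Bermudan put reduces to a one-dimensional put on this synthetic asset.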

We test this option for d=2,10,40, for which we have reference prices computed with the CRR tree method of (Cox et al., 1979). With the LSM algorithm, we converge to the correct price 4.57 for the case d=2 using only a polynomial of degree 3. For the case d=10, we can at most use a polynomial of degree 3 due to the curse of dimensionality; with this parametrization, we obtain a price of 2.90 for a true price of 2.92. For the case d=40, we cannot go beyond a polynomial of degree 1, which yields a price of 2.48 for a reference price of 2.52. Figure 4 shows the results obtained with regression trees. For the case d=2, the best price we get is 4.47 and, as expected, the LSM algorithm performs better. This is also true for the cases d=10 and d=40, where the best prices we obtain are 2.84 and 2.46 respectively. Notice that even though these are high dimensional cases, the trees converge with a depth of only 5 or 8. For d=10, our algorithm with depth equal to 8 and min_samples_leaf equal to 200 runs in 3 seconds, compared to the 11 seconds needed by the LSM with a polynomial of degree 3; the running times of the two approaches are quite comparable. For d=40, our algorithm takes the same computational time as the LSM with degree 1 polynomials. We also notice the importance of the parameter min_samples_leaf: letting the trees grow without managing this parameter (case leaf1) leads to over-fitting. The results get better when we use random forests, as shown in Figure 5. For these random forests, we used basis trees with max_depth=8 and min_samples_leaf=100. Notice that for the case d=2, the curve where only 50% of the data is used in each tree gives much better results, as in this case the basis trees are the least correlated. For the cases d=10 and d=40, the best choice is not necessarily to use 50% of the data in each tree: in these larger dimensions, training the trees on a small percentage of the training data may not be enough, and one may consider extending the size of the training data itself. Furthermore, we notice that once the percentage of data used in each tree is chosen, the price of the option converges as the number of trees in the forest grows. However, the computational time of the random forest method depends linearly on the number of trees in the forest: although the regression tree approach runs within roughly the same computational time as the LSM for a similar accuracy, the random forest approach may take much longer as the number of trees increases.

Figure 4: Geometric put option with regression trees.
Figure 5: Geometric put option with random forests.

6.2.4 A put basket option

We consider a put option on a basket of d=40 assets with payoff \left(K-\sum_{i=1}^{d}\omega_{i}S_{\tau}^{i}\right)^{+}, for which we have a reference price between 2.15 and 2.22 from (Goudenège et al., 2019) with the following set of parameters: T=1, S_{0}^{i}=100, K=100, r=0.05, \sigma_{i}=0.2, \rho_{ij}=0.2, \omega_{i}=\frac{1}{d} and N=10. With a polynomial of degree 1, we obtain a price of 2.15 using the LSM algorithm. The results obtained with regression trees are shown in Figure 6.

Figure 6: Put option on a basket of 40 assets with regression trees.

Even though this example is high dimensional, we do not need many parameters to estimate the conditional expectations (the trees converge for very small depths). This will not be the case for the next example, which is highly nonlinear. Aggregating the trees into random forests leads to a price of 2.16 using only 50 trees.

6.2.5 A call on the maximum of 50 assets

We consider a call option on the maximum of d=50 assets with payoff \left(\max\limits_{1\leq i\leq d}(S_{\tau}^{i})-K\right)^{+} and the following characteristics: K=100, T=3 years, S_{0}^{i}=100, \sigma_{i}=0.2, \delta_{i}=0.1, \rho_{ij}=0 for all i,j, r=0.05, N=9, M=100000. (Becker et al., 2019) report [69.56,69.95] as the 95% confidence interval for the option price. With the LSM algorithm, we find a price of 67.88 with a polynomial of degree 1. This is a difficult example for which bigger trees are needed to approximate the conditional expectations: at maturity, the payoff depends on only one direction (the one corresponding to the best performing asset), and if the cuts in the tree never consider that direction, the estimation will not be correct. As a result, we consider a number of cuts large enough to ensure that each direction is taken into consideration. We let the depth grow while monitoring min_samples_leaf in order to keep a significant number of samples in each leaf. Table 1 shows the results obtained with regression trees. As the best price we obtain is given by depth=100 and min_samples_leaf=100, we use this set of parameters for the random forest part. Table 2 shows the results obtained with this method.

depth min_samples_leaf price
50 50 66.89
50 100 66.88
100 50 67.13
100 100 67.31
200 50 67.16
200 100 67.28
Table 1: A call option on the maximum of 50 assets with regression trees
nb_trees max_samples price
10 50% 68.32
10 70% 68.32
10 90% 68.29
Table 2: A call option on the maximum of 50 assets with random forests

Using regression trees alone is not enough to obtain acceptable results. However, as soon as we aggregate the regressors into random forests, we obtain very satisfying results: with just 10 trees, we converge to a good price. We also notice in this example that using less correlated trees leads to better results (compare the cases max_samples = 50% or 70% with the case max_samples = 90%).

6.3 A put in the Heston model

We consider the Heston model defined by

dSt\displaystyle dS_{t} =St(rtdt+σt(ρdWt1+1ρ2dWt2))\displaystyle=S_{t}(r_{t}dt+\sqrt{\sigma_{t}}(\rho dW_{t}^{1}+\sqrt{1-\rho^{2}}dW_{t}^{2}))
dσt\displaystyle d\sigma_{t} =κ(θσt)dt+ξσtdWt1\displaystyle=\kappa(\theta-\sigma_{t})dt+\xi\sqrt{\sigma_{t}}dW_{t}^{1}

and we consider a put option with payoff \left(K-S_{\tau}\right)^{+}. We have no reference price for this option, so we simply compare the results of regression trees and random forests to those of the LSM method. We use the following set of parameters: K=100, S_{0}=100, T=1, \sigma_{0}=0.01, \xi=0.2, \kappa=2, \rho=-0.3, r=0.1, N=10 and M=100,000. The LSM method yields a price of 1.70. Figures 7 and 8 show the results obtained with regression trees and random forests; both methods converge to the same price as the LSM. We notice in this example the over-fitting phenomenon for regression trees with max_depth=15 and min_samples_leaf=1, and the same behavior for random forests as a function of the percentage of data given to each basis tree.
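
For reproducibility, here is a possible discretization of these dynamics, an Euler scheme with full truncation of the variance; the scheme and the number of time steps are our choices, as the paper does not specify its discretization.

```python
import numpy as np

def simulate_heston(M, n_steps, T, S0, sigma0, kappa, theta, xi, rho, r, seed=0):
    """Euler scheme with full truncation for the variance process (a sketch).
    Returns an array of shape (M, n_steps + 1) of asset paths."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(M, float(S0))
    V = np.full(M, float(sigma0))                 # sigma_t denotes the variance here
    paths = np.empty((M, n_steps + 1))
    paths[:, 0] = S
    for n in range(n_steps):
        dW1 = np.sqrt(dt) * rng.standard_normal(M)
        dW2 = np.sqrt(dt) * rng.standard_normal(M)
        Vp = np.maximum(V, 0.0)                   # full truncation of negative variances
        S = S * np.exp((r - 0.5 * Vp) * dt
                       + np.sqrt(Vp) * (rho * dW1 + np.sqrt(1.0 - rho ** 2) * dW2))
        V = V + kappa * (theta - Vp) * dt + xi * np.sqrt(Vp) * dW1
        paths[:, n + 1] = S
    return paths
```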

Figure 7: A put option in the Heston model with regression trees.
Figure 8: A put option in the Heston model with random forests.

7 Conclusion

Pricing Bermudan options comes down to solving a dynamic programming equation, where the main difficulty comes from the computation of the conditional expectations representing the continuation values. We have explored the use of regression trees and random forests for the computation of these quantities. We have proved the convergence of the algorithm when regression trees are used in two steps: first, the convergence of the conditional expectations; then, the convergence of the Monte Carlo approximation. This problem was particularly hard to solve given that regression trees do not solve a global optimization problem, unlike the functional regression used in the LSM algorithm. We have shown through numerical experiments that we obtain good prices for some classical examples using regression trees, and that aggregating regression trees into random forests yields even better results. We conclude that for small dimensional problems, a simpler algorithm like the LSM is efficient enough; however, for high dimensional problems, it is worth considering random forests. Instead of using all the features of the problem, the basis trees in the forest only use a subset of the features, which helps mitigate the curse of dimensionality.

References

  • Andersen and Broadie (2004) Leif Andersen and Mark Broadie. Primal-dual simulation algorithm for pricing multidimensional American options. Management Science, 50(9), 2004. ISSN 00251909. doi: 10.1287/mnsc.1040.0258.
  • Bally et al. (2005) Vlad Bally, Gilles Pagès, and Jacques Printems. A quantization tree method for pricing and hedging multidimensional American options. Mathematical Finance, 15(1), 2005. ISSN 09601627. doi: 10.1111/j.0960-1627.2005.00213.x.
  • Becker et al. (2019) Sebastian Becker, Patrick Cheridito, and Arnulf Jentzen. Deep optimal stopping. The Journal of Machine Learning Research, 20(1), January 2019. doi: 10.5555/3322706.3362015.
  • Breiman (1999) Leo Breiman. Using Adaptive Bagging to Debias Regressions. Technical Report 547, 1(October), 1999. ISSN 1098-6596.
  • Breiman (2001) Leo Breiman. Random forests. Machine Learning, 45(1), 2001. ISSN 08856125. doi: 10.1023/A:1010933404324.
  • Carriere (1996) Jacques F. Carriere. Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics, 19(1), 1996. ISSN 01676687. doi: 10.1016/S0167-6687(96)00004-2.
  • Ciocan and Mišić (2022) Dragos Florin Ciocan and Velibor V Mišić. Interpretable optimal stopping. Management Science, 68(3):1616–1638, 2022.
  • Clément et al. (2002) Emmanuelle Clément, Damien Lamberton, and Philip Protter. An analysis of a least squares regression method for American option pricing. Finance and Stochastics, 6(4), 2002. ISSN 09492984. doi: 10.1007/s007800200071.
  • Cox et al. (1979) John C. Cox, Stephen A. Ross, and Mark Rubinstein. Option pricing: A simplified approach. Journal of Financial Economics, 7(3), 1979. ISSN 0304405X. doi: 10.1016/0304-405X(79)90015-1.
  • Dietterich (2000) Thomas G. Dietterich. Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2), 2000. ISSN 08856125. doi: 10.1023/A:1007607513941.
  • Ferger (2004) Dietmar Ferger. A continuous mapping theorem for the argmax-functional in the non-unique case. Statistica Neerlandica, 58(1):83–96, February 2004. doi: 10.1046/j.0039-0402.2003.
  • Glasserman (2004) Paul Glasserman. Monte Carlo Methods in Financial Engineering. Springer, 2004.
  • Goudenège et al. (2019) Ludovic Goudenège, Andrea Molent, and Antonino Zanette. Variance reduction applied to machine learning for pricing Bermudan/American options in high dimension, 2019.
  • Han and E (2016) Jiequn Han and Weinan E. Deep learning approximation for stochastic control problems. CoRR, abs/1611.07422, 2016. URL http://arxiv.org/abs/1611.07422.
  • Han et al. (2018) Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • Ho (1998) Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 1998. ISSN 01628828. doi: 10.1109/34.709601.
  • Kohler et al. (2010) Michael Kohler, Adam Krzyzak, and Nebojsa Todorovic. Pricing of high-dimensional American options by neural networks. Mathematical Finance, 20(3), 2010. ISSN 09601627. doi: 10.1111/j.1467-9965.2010.00404.x.
  • Lapeyre and Lelong (2021) Bernard Lapeyre and Jérôme Lelong. Neural network regression for Bermudan option pricing. Monte Carlo Methods and Applications, 27(3), 2021. ISSN 15693961. doi: 10.1515/mcma-2021-2091.
  • Leake et al. (1994) Charles Leake, Reuven Y. Rubinstein, and Alexander Shapiro. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. The Journal of the Operational Research Society, 45(8), 1994. ISSN 01605682. doi: 10.2307/2584023.
  • Ledoux and Talagrand (1991) Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer, 1991. doi: 10.1007/978-3-642-20212-4.
  • Lelong (2018) Jérôme Lelong. Dual pricing of American options by Wiener chaos expansion. SIAM Journal on Financial Mathematics, 9(2), 2018. ISSN 1945497X. doi: 10.1137/16M1102161.
  • Longstaff and Schwartz (2001) Francis A. Longstaff and Eduardo S. Schwartz. Valuing American options by simulation: A simple least-squares approach. Review of Financial Studies, 14(1), 2001. ISSN 08939454. doi: 10.1093/rfs/14.1.113.
  • Lord et al. (2007) R. Lord, F. Fang, F. Bervoets, and C. W. Oosterlee. A fast and accurate FFT-based method for pricing early-exercise options under Lévy processes. SIAM Journal on Scientific Computing, 30(4), 2007. ISSN 10648275. doi: 10.1137/070683878.
  • Ludkovski (2018) Mike Ludkovski. Kriging metamodels and experimental design for Bermudan option pricing. Journal of Computational Finance, 22(1), 2018. ISSN 17552850. doi: 10.21314/JCF.2018.347.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Reppen et al. (2022) A Max Reppen, H Mete Soner, and Valentin Tissot-Daguette. Neural optimal stopping boundary. arXiv preprint arXiv:2205.04595, 2022.
  • Rogers (2002) L. C.G. Rogers. Monte Carlo valuation of American options. Mathematical Finance, 12(3), 2002. ISSN 09601627. doi: 10.1111/1467-9965.02010.
  • Seijo and Sen (2011) Emilio Seijo and Bodhisattva Sen. A continuous mapping theorem for the smallest argmax functional. Electronic Journal of Statistics, 5(none):421 – 439, 2011.
  • Tsitsiklis and Van Roy (1999) John N. Tsitsiklis and Benjamin Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10), 1999. ISSN 00189286. doi: 10.1109/9.793723.