
Non-asymptotic Properties of Generalized Mondrian Forests in Statistical Learning

Zhan Haoran   
Department of DSDS, National University of Singapore
and
Wang Jingli
School of Statistics and Data Science, Nankai University
and
Xia Yingcun
Department of DSDS, National University of Singapore
Abstract

Since the publication of Breiman (2001), Random Forests (RF) have been widely used in both regression and classification. Other forests have since been proposed and studied in the literature, and Mondrian Forests, built on the Mondrian process, are a notable example; see Lakshminarayanan et al. (2014). In this paper, we propose an ensemble estimator for general statistical learning based on Mondrian Forests, which can be regarded as an extension of RF. This general framework includes many common learning problems, such as least squares regression, $\ell_1$ regression, quantile regression and classification. Under mild conditions on the loss function, we give an upper bound on the regret function of our estimator and show that the estimator is statistically consistent.


Keywords: Statistical learning, Machine learning, Random forests, Ensemble learning, Regret function.

1 Introduction

Random Forest (RF), introduced in Breiman (2001), is a very popular ensemble learning technique in machine learning for classification and regression tasks. It operates by constructing multiple decision trees during training and averaging their predictions for improved accuracy and robustness. Many empirical studies have documented its strong performance across different domains and data characteristics; see, for example, Liaw et al. (2002). However, only a few of its theoretical properties are known so far, owing to the complicated tree structure and the data-dependent splitting rule (CART). Scornet et al. (2015) were the first to show its statistical consistency when the true regression function follows an additive model. Later, Klusowski (2021) established the consistency of RF for a general true regression function. While many other papers study the theory of RF, only these two focus on the original RF of Breiman (2001) by analyzing its key splitting rule, CART.

To gain deeper insight into random forests, additional research has delved into modified and stylized versions of the RF in Breiman (2001). One such method is Purely Random Forests (PRF) (see, for example, Arlot and Genuer (2014), Biau (2012) and Biau et al. (2008)), where individual trees are grown independently of the sample, making them well suited for theoretical analysis. In this paper, our interest is the Mondrian Forest, an ensemble learning algorithm that combines elements of decision trees and random forests with partitioning strategies driven by the Mondrian process. This kind of forest was introduced in Lakshminarayanan et al. (2014), which showed that Mondrian Forests have competitive online performance in classification problems compared to other state-of-the-art methods. Inspired by this attractive online property, Mourtada et al. (2021) studied the theory of online regression and classification using Mondrian Forests. Researchers have also studied the theory of Mondrian Forests for batch data, i.e., when the data set has already been collected by experimenters. Mourtada et al. (2020) gave consistency and convergence rates for Mondrian trees and forests that are minimax optimal on the set of Hölder continuous functions. Later, Cattaneo et al. (2023) followed the line of Mourtada et al. (2020) and derived the asymptotic normal distribution of Mondrian Forests for the offline regression problem.

Instead of considering only classical regression and classification problems, we show in this paper that Mondrian forests can also be applied to a broader range of statistical/machine learning problems, such as generalized regression, density estimation and quantile regression. Our main contributions are two-fold:

  • First, we propose a general framework (estimator) based on Mondrian forest that can be used in different learning problems.

  • Second, we study the upper bound of the regret (risk) function of the above forest estimator. Our theoretical results apply to many learning problems, and corresponding examples are given in Section 6.

1.1 Related work

Our way of generalizing RF takes a global perspective, while the generalization of RF in Athey et al. (2019) starts from a local perspective. In other words, by solving one optimization problem with all data points, we estimate the objective function $m(x)$ for all $x\in[0,1]^d$, whereas Athey et al. (2019) can only estimate a specific point $m(x_0)$. Therefore, our generalized method can save substantial computation time, especially when the dimension $d$ is large. Secondly, the globalized method can also be applied easily to statistical problems with a penalty $Pen(m)$, where $Pen(m)$ is a functional of $m(x),\ x\in[0,1]^d$. In Section 6.4, we show the application of our method to one such penalized optimization, namely nonparametric density estimation. Since Athey et al. (2019) only perform estimation pointwise, it is difficult for them to ensure that the obtained estimator satisfies shape constraints, and the corresponding case is not included in their scope.

2 Background and Preliminaries

2.1 Goal in Statistical Learning

Let $(X,Y)\in[0,1]^d\times\mathbb{R}$ be a random vector, where the range of $X$ has been normalized. In statistical learning, we would like a policy $h$ for learning $Y$ from $X$, defined as a function $h:[0,1]^d\to\mathbb{R}$. Usually, a loss function $\ell(h(x),y):\mathbb{R}\times\mathbb{R}\to[0,\infty)$ is used to measure the discrepancy, or loss, between the decision $h(x)$ and the target $y$. Taking expectations w.r.t. $X,Y$, the risk function

R(h):=\mathbf{E}(\ell(h(X),Y))    (1)

denotes the average loss incurred by the policy $h$. Naturally, one selects the best policy $h^*$ by minimizing the average loss over some function class $\mathcal{H}_1$, namely

h^{*}=\arg\min_{h\in\mathcal{H}_{1}}R(h).

Therefore, $R(h^*)$ is the smallest risk one can hope to achieve in theory. In practice, the distribution of $(X,Y)$ is unknown and (1) cannot be used to compute $R(h)$, so the best $h^*$ cannot be obtained directly. Instead, we use i.i.d. data $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$ and approximate $R(h)$ by the empirical risk function

\hat{R}(h):=\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i}).

Traditionally, one finds an estimator/policy $\hat{h}_n:[0,1]^d\to\mathbb{R}$ by minimizing $\hat{R}(h)$ over a known function class; see spline and wavelet regression in Györfi et al. (2002) and regression by deep neural networks in Schmidt-Hieber (2020). Recently, instead of minimizing $\hat{R}(h)$ globally, tree-based algorithms have been applied to construct the empirical estimator $\hat{h}_n$. According to many practitioners' experience, the strong predictive ability of RFs demonstrates the superiority of tree-based estimators over many traditional methods in statistical learning. In this paper, we are interested in bounding the regret function

\varepsilon(\hat{h}_{n}):=R(\hat{h}_{n})-R(h^{*}),

where $\hat{h}_n$ will be an ensemble estimator obtained from Mondrian Forests.

2.2 Mondrian partitions

Mondrian partitions are a special case of random tree partitions, in which the partition of $[0,1]^d$ is independent of the data points. The scheme depends entirely on a stochastic process, called the Mondrian process and denoted by $MP([0,1]^d)$. The Mondrian process $MP([0,1]^d)$ is a distribution on infinite tree partitions of $[0,1]^d$ introduced by Roy and Teh (2008) and Roy (2011). To reduce notation, the details of its definition are omitted here and can be found in Definition 3 of Mourtada et al. (2020).

Let us come back to Mondrian partitions. The partition used in this paper is the Mondrian partition with stopping time $\lambda$, denoted by $MP(\lambda,[0,1]^d)$ (see Section 1.3 in Mourtada et al. (2020)). Its construction consists of two steps. First, we construct partitions according to $MP([0,1]^d)$ by iteratively splitting cells at random times that depend on the linear dimension of each cell. The probability of splitting along each side is proportional to the side length of the cell, and the splitting position is chosen uniformly. Second, we remove the nodes whose birth time is later than the tuning parameter $\lambda>0$. Note that each tree node generated in the first step is assigned a specific birth time, and as the tree grows deeper the birth times increase. Therefore, the second step can be regarded as a pruning process, which helps us choose the best tree model in learning problems.

Let $C=\Pi_{j=1}^d C^j\subseteq[0,1]^d$ with closed intervals $C^j=[a_j,b_j]$. Denote $|C^j|=b_j-a_j$ and $|C|=\sum_{j=1}^d|C^j|$. Let $Exp(|C|)$ be an exponential distribution with rate $|C|>0$. Then, our Mondrian partition with stopping time $\lambda>0$ can be generated according to Algorithm 1. This algorithm is a recursive process, initialized with the root node $[0,1]^d$. An example illustrating how Algorithm 1 works is given in Figure 1.

Input: A cell $C=\Pi_{j=1}^d C^j\subseteq[0,1]^d$, starting time $\tau$ and stopping time $\lambda$.
1  Sample a random variable $E_C\sim Exp(|C|)$;
2  if $\tau+E_C\leq\lambda$ then
3      Randomly choose a split dimension $J\in\{1,\ldots,d\}$ with $\mathbf{P}(J=j)=(b_j-a_j)/|C|$;
4      Randomly choose a split threshold $S_J$ in $[a_J,b_J]$;
5      Split $C$ along the split $(J,S_J)$: let $C_0=\{x\in C:x_J\leq S_J\}$ and $C_1=C\setminus C_0$;
6      return $\text{MondrianPartition}(C_0,\tau+E_C,\lambda)\cup\text{MondrianPartition}(C_1,\tau+E_C,\lambda)$;
7  else
8      return $\{C\}$ (i.e., do not split $C$)
9  end if
Algorithm 1 $\text{MondrianPartition}(C,\tau,\lambda)$: Generate a Mondrian partition of cell $C\subseteq[0,1]^d$, starting from time $\tau$ and until time $\lambda$.
Figure 1: An example of a Mondrian partition (left) with the corresponding tree structure (right), indicating how the tree grows with time. There are four splitting times in this demo, namely $1,2,3,4$, denoted by bullets ($\bullet$), and the stopping time is $\lambda=4$.
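For readers who prefer code, the following is a minimal Python sketch of Algorithm 1 (our illustration, not the authors' implementation). The helper name mondrian_partition is hypothetical, NumPy is assumed, and the split time is drawn with rate $|C|$, the convention of Mourtada et al. (2020).

import numpy as np

def mondrian_partition(cell, tau, lam, rng):
    """Recursively split `cell` (a list of (a_j, b_j) intervals) until time `lam`.

    `tau` is the birth time of `cell`; the function returns a list of leaf cells.
    """
    side_lengths = np.array([b - a for a, b in cell])
    linear_dim = side_lengths.sum()                    # |C| = sum_j |C^j|
    E_C = rng.exponential(scale=1.0 / linear_dim)      # split time ~ Exp with rate |C|
    if tau + E_C <= lam:
        # choose the split dimension J with P(J = j) proportional to |C^j|
        J = rng.choice(len(cell), p=side_lengths / linear_dim)
        a_J, b_J = cell[J]
        S_J = rng.uniform(a_J, b_J)                    # uniform split threshold
        left = list(cell); left[J] = (a_J, S_J)
        right = list(cell); right[J] = (S_J, b_J)
        return (mondrian_partition(left, tau + E_C, lam, rng)
                + mondrian_partition(right, tau + E_C, lam, rng))
    return [cell]                                      # birth time exceeds lam: keep as a leaf

rng = np.random.default_rng(0)
leaves = mondrian_partition([(0.0, 1.0)] * 2, tau=0.0, lam=4.0, rng=rng)
print(len(leaves), "cells in a Mondrian partition of [0,1]^2 with stopping time 4")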

3 Methodology

Let $MP_b([0,1]^d),\ b=1,\ldots,B$, be independent Mondrian processes. When we prune each tree at time $\lambda>0$, independent partitions $MP_b(\lambda,[0,1]^d),\ b=1,\ldots,B$, are obtained, in which all cuts after time $\lambda$ are ignored. In this case, we can write $MP_b(\lambda,[0,1]^d)=\{\mathcal{C}_{b,\lambda,j}\}_{j=1}^{K_b(\lambda)}$ satisfying

[0,1]^{d}=\bigcup_{j=1}^{K_{b}(\lambda)}\mathcal{C}_{b,\lambda,j}\ \ \text{and}\ \ \mathcal{C}_{b,\lambda,j_{1}}\cap\mathcal{C}_{b,\lambda,j_{2}}=\varnothing,\ \forall j_{1}\neq j_{2},

where $\mathcal{C}_{b,\lambda,j}\subseteq[0,1]^d$ denotes a cell of the partition $MP_b(\lambda,[0,1]^d)$. For each cell $\mathcal{C}_{b,\lambda,j}$, a constant $\hat{c}_{b,\lambda,j}\in\mathbb{R}$ is used as the predictor of $h(x)$ in this small region, where

\hat{c}_{b,\lambda,j}=\arg\min_{z\in[-\beta_{n},\beta_{n}]}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}\ell(z,Y_{i})    (2)

and $\beta_n>0$ denotes a truncation threshold. For any fixed $y\in\mathbb{R}$, $\ell(\cdot,y)$ is usually a convex function of its first argument in machine learning, so the optimization (2) over $[-\beta_n,\beta_n]$ guarantees the existence of $\hat{c}_{b,\lambda,j}$ in general; we let $\beta_n\to\infty$ as $n$ goes to infinity. Then, for each $1\leq b\leq B$, we obtain an estimator of $h(x)$:

\hat{h}_{b,n}(x):=\sum_{j=1}^{K_{b}(\lambda)}\hat{c}_{b,\lambda,j}\cdot\mathbb{I}(x\in\mathcal{C}_{b,\lambda,j}),\ x\in[0,1]^{d},

where $\mathbb{I}(\cdot)$ denotes the indicator function. By applying the ensemble technique, the final estimator is given by

\hat{h}_{n}(x):=\frac{1}{B}\sum_{b=1}^{B}\hat{h}_{b,n}(x),\ x\in[0,1]^{d}.    (3)

If a cell $\mathcal{C}_{b,\lambda,j}$ does not contain any data point, we simply use $0$ as the predictor on the corresponding region.
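To make the construction concrete, here is a minimal Python sketch of the per-cell fit (2) and the ensemble (3). It is our illustration, not the authors' implementation: the helper names (fit_cells, predict_tree, predict_forest) and the user-supplied loss(z, y) function are hypothetical, mondrian_partition is the helper sketched after Algorithm 1, and NumPy/SciPy are assumed.

import numpy as np
from scipy.optimize import minimize_scalar

def fit_cells(leaves, X, Y, loss, beta_n):
    """Return one fitted constant per leaf, as in (2); empty leaves get 0."""
    constants = []
    for cell in leaves:
        in_cell = np.all([(X[:, j] >= a) & (X[:, j] <= b)
                          for j, (a, b) in enumerate(cell)], axis=0)
        if not in_cell.any():
            constants.append(0.0)                       # empty cell: predictor 0
            continue
        obj = lambda z: sum(loss(z, y) for y in Y[in_cell])
        res = minimize_scalar(obj, bounds=(-beta_n, beta_n), method="bounded")
        constants.append(res.x)
    return np.array(constants)

def predict_tree(x, leaves, constants):
    # Boundary points may match the first containing cell; the half-open
    # convention of Algorithm 1 is glossed over in this sketch.
    for cell, c in zip(leaves, constants):
        if all(a <= x[j] <= b for j, (a, b) in enumerate(cell)):
            return c
    return 0.0

def predict_forest(x, trees):
    """Average the B tree predictions, cf. (3); `trees` is a list of (leaves, constants)."""
    return np.mean([predict_tree(x, leaves, consts) for leaves, consts in trees])

For the squared loss, the bounded minimization in fit_cells reduces to the within-cell average, recovering the regression forest of Mourtada et al. (2020) discussed next.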

Let us clarify the relationship between (3) and the traditional RF. If we take the $\ell^2$ loss function $\ell(x,y)=(x-y)^2$ and $|Y|\leq\beta_n$, it can be checked that

\hat{c}_{b,\lambda,j}=\frac{1}{Card(\{i:X_{i}\in\mathcal{C}_{b,\lambda,j}\})}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}Y_{i},

where $Card(\cdot)$ denotes the cardinality of a set. In this case, the problem is least squares regression, and the estimator in (3) coincides exactly with that in Mourtada et al. (2020). In conclusion, our estimator $\hat{h}_{b,n}(x)$ can be regarded as an extension of the regression forests in Mourtada et al. (2020), since $\ell(x,y)$ can be chosen arbitrarily by the practitioner.

Examining the above algorithm carefully, we see that there are two tuning parameters in the construction of $\hat{h}_n$: $\lambda$ and $B$. The stopping time $\lambda$ controls the model complexity of Mondrian forests. Generally speaking, the cardinality of a tree partition increases as $\lambda$ goes to infinity. Thus, a large $\lambda$ helps reduce the bias of the forest estimator, while a small $\lambda$ helps control the generalization error of (3). Therefore, the choice of $\lambda$ involves a trade-off. To ensure the consistency of $\hat{h}_n$, we let $\lambda$ depend on the sample size $n$ and write $\lambda_n$ in the following analysis. The second parameter, $B$, denotes the number of Mondrian trees. There are studies on its selection for RF; see, for example, Zhang and Wang (2009). In practice, many practitioners take $B=500$ in their computations.

4 Regret function of Mondrian forests

In this section, we study the upper bound of the regret of the forest estimator (3) constructed from Mondrian processes. First, we need some mild restrictions on the loss function $\ell(x,y)$.

Assumption 1.

The loss function satisfies $\ell(x,y)\in\mathbb{R}$ for all $x,y\in\mathbb{R}$, and $\ell(\cdot,y)$ is convex for any fixed $y\in S\subseteq\mathbb{R}$ with $\mathbf{P}(Y\in S)=1$.

Assumption 2.

There exists a non-negative function $M_1(x,y)>0$ with $x>0,\ y\in\mathbb{R}$ such that for any $y\in\mathbb{R}$, $\ell(\cdot,y)$ is Lipschitz continuous: for any $x_1,x_2\in[-x,x]$ and $y\in\mathbb{R}$, we have

|\ell(x_{1},y)-\ell(x_{2},y)|\leq M_{1}(x,y)|x_{1}-x_{2}|.

Assumption 3.

There exists an envelope function $M_2(x,y)>0$ such that for any $x,y\in\mathbb{R}$,

\left|\sup_{x^{\prime}\in[-x,x]}\ell(x^{\prime},y)\right|\leq M_{2}(x,y)\ \ \text{and}\ \ \mathbf{E}(M_{2}^{2}(x,Y))<\infty.

We will see in the next section that many commonly used loss functions satisfy Assumptions 1-3, including the $\ell_2$ loss and the $\ell_1$ loss. In the following analysis, we suppose $Y$ is a sub-Gaussian random variable and $X$ takes values in $[0,1]^d$. In detail, we make the following assumption about the distribution of $Y$.

Assumption 4.

For some $\sigma>0$, we have $\mathbf{P}(|Y-\mathbf{E}(Y)|>t)\leq 2\exp(-t^2/(2\sigma^2))$ for each $t>0$. To simplify notation, we always assume later that $\sigma=1$.

Our theoretical results involve the $(p,C)$-smooth class below. This class is considered in the analysis of the regret function since it is large enough to be dense in the $L^2$ space generated by any probability measure.

Definition 1 ($(p,C)$-smooth class).

Let $p=s+\beta>0$, where $s$ is a non-negative integer and $\beta\in(0,1]$, and let $C>0$. The $(p,C)$-smooth ball with radius $C$, denoted by $\mathcal{H}^{p,\beta}([0,1]^d,C)$, is the set of $s$ times differentiable functions $h:[0,1]^d\to\mathbb{R}$ such that

|\nabla^{s}h(x_{1})-\nabla^{s}h(x_{2})|\leq C\|x_{1}-x_{2}\|_{2}^{\beta},\ \ \forall x_{1},x_{2}\in[0,1]^{d},

and

\sup_{x\in[0,1]^{d}}|h(x)|<C,

where $\|\cdot\|_2$ denotes the $\ell^2$ norm on $\mathbb{R}^d$ and $\nabla$ is the gradient operator.

The main result in this section is presented in Theorem 1.

Theorem 1 (Excess risk of Mondrian forests).

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$, we have

\begin{aligned}
\mathbf{E}R(\hat{h}_{n})-R(h) &\leq c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}(1+\lambda_{n})^{d}+2d^{\frac{3}{2}p}C\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\cdot\frac{1}{\lambda_{n}^{p}}\\
&\quad+\frac{c_{1}}{\sqrt{n}}\cdot\sqrt{\mathbf{E}(M^{2}_{2}(C,Y))}\\
&\quad+c_{1}\left(\sup_{x\in[-\beta_{n},\beta_{n}]}|\ell(x,\ln n)|+\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}+C\sqrt{\mathbf{E}(M_{1}^{2}(C,Y))}\right)\cdot e^{-c_{2}\cdot\ln^{2}n},
\end{aligned}    (4)

where $c_1,c_2>0$ are universal constants.

Remark 1.

The first term on the RHS of (4) relates to the generalization error of the forest, and the second is the approximation error of the Mondrian forest to $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$. The second line of (4) measures the error incurred when the empirical loss $\hat{R}(h)$ is used to approximate its theoretical version $R(h)$. Finally, the last line is due to the sub-Gaussian property of $Y$; the terms in this line disappear if we further assume $Y$ is bounded.

Remark 2.

In many applications, we will see later that the coefficients above, such as $\mathbf{E}(M_2^2(\beta_n,Y))$ and $\sup_{y\in[-\ln n,\ln n]}M_1(C,y)$, are only of polynomial order in $\ln n$. Since the term $e^{-c_2\cdot\ln^2 n}$ decays to zero faster than any polynomial rate, the last line of (4) usually has no influence on the convergence speed of the regret function. In conclusion, we can expect that only the first two lines dominate the convergence rate.

In statistics, the quantity of interest is the true function $m$ that satisfies

m:=\arg\min_{g}R(g).    (5)

Theorem 1 can then also be used to analyze the consistency of $\hat{h}_n$. To meet this requirement, the following assumption is necessary.

Assumption 5.

For any $h:[0,1]^d\to\mathbb{R}$, there are $c>0$ and $\kappa\geq 1$ such that

c^{-1}\cdot\mathbf{E}(h(X)-m(X))^{\kappa}\leq R(h)-R(m)\leq c\cdot\mathbf{E}(h(X)-m(X))^{\kappa}.

To simplify notation, we denote the last line of (4) by $Res(n)$. Namely,

Res(n):=c_{1}\left(\sup_{x\in[-\beta_{n},\beta_{n}]}|\ell(x,\ln n)|+\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}+C\sqrt{\mathbf{E}(M_{1}^{2}(C,Y))}\right)\cdot e^{-c_{2}\cdot\ln^{2}n}.    (6)

Then, the consistency results for Mondrian forests are summarized in Corollaries 1 and 2.

Corollary 1 (Consistency rate of Mondrian forests).

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. Suppose the true function $m\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ and Assumption 5 is satisfied. Then,

\begin{aligned}
\mathbf{E}\left(\hat{h}_{n}(X)-m(X)\right)^{\kappa} &\leq c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}(1+\lambda_{n})^{d}+2d^{\frac{3}{2}p}C\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\cdot\frac{1}{\lambda_{n}^{p}}\\
&\quad+\frac{c_{1}}{\sqrt{n}}\cdot\sqrt{\mathbf{E}(M^{2}_{2}(C,Y))}+Res(n),
\end{aligned}    (7)

where $c_1,c_2>0$ are universal constants.

Corollary 2 (Consistency of Mondrian forests).

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. Suppose $m(X)$ is only assumed to be $L^2$ integrable on $[0,1]^d$ with $\mathbf{E}(\ell^2(m(X),Y))<\infty$, and that Assumption 5 is satisfied. If $\lambda_{n}=o\left(\left(\frac{\sqrt{n}}{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}\right)^{\frac{1}{d}}\right)$, $\lambda_{n}^{-1}\cdot\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\to 0$ and $Res(n)\to 0$, we have

\lim_{n\to\infty}\mathbf{E}\left(\hat{h}_{n}(X)-m(X)\right)^{\kappa}=0.

5 The choice of $\lambda_n$

In practice, the best $\lambda_n$ is unknown and we need a criterion to stop the growth of Mondrian forests. Here, we use a penalty method. For each $1\leq b\leq B$, define

Pen(\lambda_{n,b}):=\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{b,n}(X_{i}),Y_{i})+\alpha_{n}\cdot\lambda_{n,b},    (8)

where $\alpha_n>0$ controls the strength of the penalty and $\hat{h}_{b,n}$ is constructed from the Mondrian partition $MP_b(\lambda_{n,b},[0,1]^d)$. Then, the best $\lambda_{n,b}^*$ is chosen by

\lambda_{n,b}^{*}:=\arg\min_{\lambda\geq 0}Pen(\lambda).

Denote by $\hat{h}_{b,n}^*$ the tree estimator constructed using the Mondrian partition $MP_b(\lambda_{n,b}^*,[0,1]^d)$. Then, the data-driven estimator is given by

\hat{h}_{n}^{*}(x):=\frac{1}{B}\sum_{b=1}^{B}\hat{h}_{b,n}^{*}(x),\ x\in[0,1]^{d}.    (9)
Theorem 2.

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ and $0<\alpha_n\leq 1$, we have

\begin{aligned}
\mathbf{E}R(\hat{h}_{n}^{*})-R(h) &\leq c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\left(1+\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}\right)^{d}\\
&\quad+\left(2d^{\frac{3}{2}p}C\cdot\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\right)\cdot\alpha_{n}^{\frac{p}{2}}+\frac{c_{1}}{\sqrt{n}}\cdot\sqrt{\mathbf{E}(M^{2}_{2}(C,Y))}\\
&\quad+Res(n),
\end{aligned}    (10)

where $c_1,c_2>0$ are universal constants and $Res(n)$ is defined in (6).

Therefore, by choosing $\alpha_n$ properly, we can obtain a good rate for the regret function of Mondrian forests from (10). Theorem 2 also tells us that the estimator (9) is adaptive, once we ignore unimportant coefficients such as $M_1(\beta_n,\ln n)$. Applications of Theorem 2 are given in the next section, where several examples are discussed in detail; in those cases the coefficients in (10), such as $M_1(\beta_n,\ln n)$, can indeed be bounded by a polynomial of $\ln n$.
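As a rough illustration of how the criterion (8) could be used in practice, the Python sketch below scans a finite grid of candidate stopping times (the paper minimizes over all $\lambda\geq 0$, and the partitions for different $\lambda$ should come from pruning one Mondrian process; drawing a fresh partition for each grid point is our simplification). The helper names and the inputs loss, beta_n, alpha_n and lambda_grid are hypothetical, reusing the sketches from Section 3.

import numpy as np

def select_lambda_penalized(X, Y, loss, beta_n, alpha_n, lambda_grid, rng):
    # Keep the candidate stopping time minimizing Pen(lambda) in (8).
    best_lam, best_pen, best_tree = None, np.inf, None
    for lam in lambda_grid:
        leaves = mondrian_partition([(0.0, 1.0)] * X.shape[1], 0.0, lam, rng)
        consts = fit_cells(leaves, X, Y, loss, beta_n)
        emp_risk = np.mean([loss(predict_tree(x, leaves, consts), y)
                            for x, y in zip(X, Y)])
        pen = emp_risk + alpha_n * lam          # empirical risk + penalty, cf. (8)
        if pen < best_pen:
            best_lam, best_pen, best_tree = lam, pen, (leaves, consts)
    return best_lam, best_tree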

6 Examples

In this section, we show how to use Mondrian forests in different statistical learning problems. Meanwhile, theoretical properties of the forest estimators, namely $\hat{h}_n(x)$ in (3) and $\hat{h}_n^*(x)$ in (9), are derived from Theorems 1-2 for each learning problem. Sometimes, the result in Lemma 1 is useful for verifying Assumption 5. Its proof follows by considering the Taylor expansion of the real-valued function $R(h^*+\alpha h)$ w.r.t. $\alpha\in[0,1]$.

Lemma 1.

Suppose that for any $h:[0,1]^d\to\mathbb{R}$ and $\alpha\in[0,1]$ we have

C_{1}\mathbf{E}(h(X)^{2})\leq\frac{d^{2}}{d\alpha^{2}}R(h^{*}+\alpha h)\leq C_{2}\mathbf{E}(h(X)^{2}),

where the constants $C_1>0,C_2>0$ are universal. Then, Assumption 5 holds with $\kappa=2$.

6.1 Least squares regression

Nonparametric least squares regression refers to methods that do not assume a specific parametric form for the conditional expectation $\mathbf{E}(Y|X)$; instead, these methods are flexible and can adapt to the underlying structure of the data. The loss function for least squares regression is $\ell_1(x,y)=(x-y)^2$. First, define the event $A_n:=\{\max_{1\leq i\leq n}|Y_i|\leq\ln n\}$. Under Assumption 4, by (34) we can find constants $c,c'>0$ such that $\mathbf{P}(A_n)\geq 1-c'\cdot ne^{-c\ln^2 n}$, which means $\mathbf{P}(A_n)$ is very close to $1$ as $n\to\infty$. On the event $A_n$, from (2) we further know that

\begin{aligned}
\hat{c}_{b,\lambda,j} &=\arg\min_{z\in[-\beta_{n},\beta_{n}]}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}\ell(z,Y_{i})\\
&=\frac{1}{Card(\{i:X_{i}\in\mathcal{C}_{b,\lambda,j}\})}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}Y_{i},
\end{aligned}

where $Card(\cdot)$ denotes the cardinality of a set. Therefore, $\hat{c}_{b,\lambda,j}$ is just the average of the $Y_i$'s in the leaf $\mathcal{C}_{b,\lambda,j}$.

Let us discuss the properties of $\ell_1(x,y)=(x-y)^2$. First, it is obvious that Assumption 1 holds for this $\ell^2$ loss. By some simple calculations (sketched below), we also know that Assumption 2 is satisfied with $M_1(x,y)=2(|x|+|y|)$ and Assumption 3 is satisfied with $M_2(x,y)=2(x^2+y^2)$. Choosing $\lambda_n=n^{\frac{1}{2(p+d)}}$ and $\beta_n\asymp\ln n$, Theorem 1 implies the following property of $\hat{h}_n$.
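The calculations behind $M_1$ and $M_2$ referred to above can be sketched as follows (our derivation, using only the definitions in Assumptions 2-3): for any $x_1,x_2\in[-x,x]$ and $y\in\mathbb{R}$,

|\ell_{1}(x_{1},y)-\ell_{1}(x_{2},y)|=|(x_{1}-y)^{2}-(x_{2}-y)^{2}|=|x_{1}+x_{2}-2y|\cdot|x_{1}-x_{2}|\leq 2(|x|+|y|)\cdot|x_{1}-x_{2}|,

and

\sup_{x^{\prime}\in[-x,x]}\ell_{1}(x^{\prime},y)=\sup_{x^{\prime}\in[-x,x]}(x^{\prime}-y)^{2}\leq 2(x^{2}+y^{2}),

which give $M_1(x,y)=2(|x|+|y|)$ and $M_2(x,y)=2(x^2+y^2)$; the moment condition $\mathbf{E}(M_2^2(x,Y))<\infty$ then follows from the sub-Gaussianity of $Y$ in Assumption 4.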

Proposition 1.

For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$, there exists an integer $n_1(C)\geq 1$ such that for any $n>n_1(C)$,

\mathbf{E}R(\hat{h}_{n})-R(h)\leq\left(2\sqrt{2}\ln^{2}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

Next, we verify Assumption 5. By some simple calculations, we have $m(x)=\mathbf{E}(Y|X=x)$ and

\begin{aligned}
R(h)-R(m) &=\mathbf{E}(Y-h(X))^{2}-\mathbf{E}(Y-m(X))^{2}\\
&=\mathbf{E}(h(X)-m(X))^{2},
\end{aligned}

since the cross term $2\mathbf{E}[(Y-m(X))(m(X)-h(X))]$ vanishes after conditioning on $X$. This identity shows that Assumption 5 holds with $c=1$ and $\kappa=2$. When $\lambda_n=n^{\frac{1}{2(p+d)}}$ is selected, Corollary 1 tells us the following.

Proposition 2.

Suppose the true function $m\in\mathcal{H}^{p,\beta}([0,1]^d,C)$. Then there exists an integer $n_2(C)\geq 1$ such that for any $n>n_2(C)$,

\mathbf{E}\left(\hat{h}_{n}(X)-m(X)\right)^{2}\leq\left(2\sqrt{2}\ln^{2}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

We can also show that $\hat{h}_n$ is consistent for any general function $m$ defined in (5) if $\lambda_n$ is chosen as stated in Corollary 2. Finally, choose $\alpha_n=n^{-\frac{p}{2p+4d}}$ and $\beta_n\asymp\ln n$ in Theorem 2. The regret function of the estimator $\hat{h}_n^*$, based on the model selection in (8), then has the following upper bound.

Proposition 3.

For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$, there exists an integer $n_3(C)\geq 1$ such that for any $n>n_3(C)$,

\mathbf{E}R(\hat{h}_{n}^{*})-R(h)\leq\left(c_{1}\cdot 2^{2d+2}\ln^{2d+1}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+2d}}.

6.2 Generalized regression

Generalized regression refers to a broad class of regression models extending beyond ordinary least squares (OLS) regression, accommodating various types of response variables and relationships between predictors and response. Usually, in such a model the conditional distribution of $Y$ given $X$ follows an exponential family of distributions,

\mathbf{P}(Y\in dy|X=x)=\exp\{B(m(x))y-D(m(x))\}\Psi(dy),    (11)

where $\Psi(dy)$ is a positive measure on $\mathbb{R}$ with $\Psi(\mathbb{R})>\Psi(\{y\})$ for any $y\in\mathbb{R}$. The function $D(\cdot)$ serves as a normalization; it is defined on an open subinterval $\mathcal{I}$ of $\mathbb{R}$ and satisfies $D(m)=\ln\int_{\mathbb{R}}\exp\{B(m)y\}\Psi(dy)$. We suppose the function $A(m):=D^{\prime}(m)/B^{\prime}(m)$ exists, and some calculations then give $\mathbf{E}(Y|X=x)=A(m(x))$. Thus, the conditional expectation $\mathbf{E}(Y|X=x)$ is known once we can estimate the unknown function $m(x)$. More information about (11) can be found in Stone (1986), Stone (1994) and Huang (1998).

The problem of generalized regression is to estimate the unknown function $m(x)$ using the i.i.d. data $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$. Note that both $B(\cdot)$ and $D(\cdot)$ in (11) are known. In this case, we use the maximum likelihood method for estimation, and the corresponding loss function is given by

\ell_{2}(x,y)=-B(x)y+D(x),

and some calculations show that the true function $m$ satisfies (5), namely

m\in\arg\min_{h}\mathbf{E}\left(-B(h(X))Y+D(h(X))\right).

Therefore, we have reason to believe that Mondrian forests are statistically consistent for this problem, as stated in Corollary 2. We now impose some mild restrictions on $B(\cdot)$ and $D(\cdot)$ to ensure that our general results can be applied to generalized regression.

Conditions

  (i) $B(\cdot)$ is twice continuously differentiable and its first derivative is strictly positive on $\mathcal{I}$.

  (ii) We can find a subinterval $S$ of $\mathbb{R}$ such that the measure $\Psi$ is concentrated on $S$ and

    -B^{\prime\prime}(\xi)y+D^{\prime\prime}(\xi)>0,\quad y\in\breve{S},\ \xi\in\mathcal{I},    (12)

    where $\breve{S}$ denotes the interior of $S$. If $S$ is bounded, (12) holds at least at one of its endpoints.

  (iii) $\mathbf{P}(Y\in S)=1$ and $\mathbf{E}(Y|X=x)=A(m(x))$ for each $x\in[0,1]^d$.

  (iv) There is a compact subinterval $\mathcal{K}_0$ of $\mathcal{I}$ such that the range of $m$ is contained in $\mathcal{K}_0$.

The above restrictions on $B(\cdot)$ and $D(\cdot)$ were used in Huang (1998), from which we also know that many commonly used distributions satisfy these conditions, including the normal, Poisson and Bernoulli distributions. Now, let us verify Assumptions 1-5 under this setting.

In particular, Assumption 1 is verified by using Conditions (i)-(iii). On the other hand, we can choose the Lipschitz constant in Assumption 2 as

M_{1}(x,y):=\left|\sup_{\tilde{x}\in[-x,x]}B^{\prime}(\tilde{x})\right|\cdot|y|+\left|\sup_{\tilde{x}\in[-x,x]}D^{\prime}(\tilde{x})\right|.

Thirdly, the envelope function of $\ell_2(x,y)$ can be set to

M_{2}(x,y):=\left|\sup_{\tilde{x}\in[-x,x]}B(\tilde{x})\right|\cdot|y|+\left|\sup_{\tilde{x}\in[-x,x]}D(\tilde{x})\right|.

Since $Y$ is assumed to be sub-Gaussian in Assumption 4, we have $\mathbf{E}(M_2^2(x,Y))<\infty$ in this case, which shows that Assumption 3 is satisfied. Finally, under Conditions (i)-(iv), Lemma 4.1 in Huang (1998) shows that Assumption 5 holds with $\kappa=2$ and

c=\max\left\{\sup_{\xi\in[-\beta_{n},\beta_{n}]\cap\mathcal{I},\ m\in\mathcal{K}_{0}}\left(-B^{\prime\prime}(\xi)A(m)+D^{\prime\prime}(\xi)\right),\ \left[\inf_{\xi\in[-\beta_{n},\beta_{n}]\cap\mathcal{I},\ m\in\mathcal{K}_{0}}\left(-B^{\prime\prime}(\xi)A(m)+D^{\prime\prime}(\xi)\right)\right]^{-1}\right\}.    (13)

From (12), the constant $c$ in (13) must be larger than zero. On the other hand, we will see later that this $c$ is finite in many cases.

Therefore, the general theoretical results in Sections 4 and 5 can be applied to generalized regression. Finally, we stress that the coefficients appearing in these general results, such as $M_1(\beta_n,\ln n)$ in Theorem 1 and $c$ in (13), are only of polynomial order in $\ln n$ if we choose $\beta_n$ properly. Let us give some specific examples.

Examples

  1. The first example is Gaussian regression, where the conditional distribution $Y|X=x$ follows $N(m(x),1)$. Therefore, $B(x)=x$, $D(x)=\frac{1}{2}x^2$, $\mathcal{I}=\mathbb{R}$ and $S=\mathbb{R}$. Our goal is to estimate the unknown conditional mean $m(x)$. Conditions (i)-(iii) are satisfied; to satisfy the fourth one, we assume the range of $m$ is contained in a compact set of $\mathbb{R}$, denoted by $\mathcal{K}_0$. Choose $\beta_n\asymp\ln n$. Meanwhile, $Y$ is a sub-Gaussian random variable. The constant $c$ in (13) equals $1$, and the coefficients in the general theoretical results are all of polynomial order in $\ln n$; for instance, $M_1(\beta_n,\ln n)\asymp 2\ln n$.

  2. The second example is Poisson regression, where the conditional distribution $Y|X=x$ follows $Poisson(\lambda(x))$ with $\lambda(x)>0$. Therefore, $B(x)=x$, $D(x)=\exp(x)$, $\mathcal{I}=\mathbb{R}$ and $S=[0,\infty)$. Our goal is to estimate the log-transformed conditional mean, namely $m(x)=\ln\lambda(x)$ in (11), using the Mondrian forest (3). It is not difficult to show that Conditions (i)-(iii) are satisfied; to satisfy the fourth one, we assume the range of $\lambda(x)$ is contained in a compact set of $(0,\infty)$, so that $m(x)$ satisfies Condition (iv). Choose $\beta_n\asymp\ln\ln n$. Meanwhile, we can ensure $Y$ satisfies Assumption 4. The constant $c$ in (13) equals $\ln n$, and the coefficients in the general theoretical results are again of polynomial order in $\ln n$; for instance, $M_1(\beta_n,\ln n)\asymp 2\ln n$.

  3. The third example is $0$-$1$ classification, where the conditional distribution $Y|X=x$ is Bernoulli (taking values in $\{0,1\}$) with $\mathbf{P}(Y=1|X=x)=p(x)\in(0,1)$. It is well known that the best classifier is the Bayes rule,

    C^{Bayes}(x)=\begin{cases}1,&p(x)-\frac{1}{2}\geq 0,\\ 0,&p(x)-\frac{1}{2}<0.\end{cases}

    Therefore, the main focus of this problem is to estimate the conditional probability $p(x)$ well, and we use a Mondrian forest for this estimation. First, we shift $p(x)$, i.e., $m(x):=p(x)-\frac{1}{2}\in(-\frac{1}{2},\frac{1}{2})$ is used in (11) instead. By some calculations, $B(x)=\ln(0.5+x)-\ln(0.5-x)$, $D(x)=-\ln(0.5-x)$, $\mathcal{I}=(-0.5,0.5)$ and $S=[0,1]$ in this case. The final goal is to estimate $m(x)$ by the forest estimator (3). It is not difficult to show that Conditions (i)-(iii) are satisfied; to satisfy the fourth one, we assume the range of $m(x)$ is contained in a compact set of $(-\frac{1}{2},\frac{1}{2})$. Now, choose $\beta_n\asymp\frac{1}{2}-(\frac{1}{\ln n})^{\gamma}$ for some $\gamma>0$. Meanwhile, Assumption 4 is satisfied since $Y$ is bounded. The constant $c$ in (13) equals $(\ln n)^{2\gamma}$, and the coefficients in the general theoretical results are of polynomial order in $\ln n$; for instance, $M_1(\beta_n,\ln n)\asymp 2(\ln n)^{\gamma+1}$.

In each of the three examples above, the convergence rate of $\varepsilon(\hat{h}_n)$ is $O_p(n^{-\frac{1}{2}\cdot\frac{p}{p+d}})$ up to a polynomial of $\ln n$; a sketch of the per-cell fit for the Poisson example is given below.
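The following minimal Python sketch (our code, with a hypothetical helper name) shows how the general per-cell fit (2) specializes to the Poisson example, i.e., to the loss $\ell_2(z,y)=-zy+e^z$ under the sign convention $B(x)=x$, $D(x)=e^x$ used above.

import numpy as np

def poisson_cell_constant(y_in_cell, beta_n):
    # Minimize sum_i (-z*Y_i + exp(z)) over z in [-beta_n, beta_n]: the
    # unconstrained minimizer is z = ln(ybar); since the objective is convex in z,
    # the constrained minimizer is its projection onto [-beta_n, beta_n].
    y_in_cell = np.asarray(y_in_cell)
    if y_in_cell.size == 0:
        return 0.0                      # empty cell: predictor 0 (Section 3)
    ybar = y_in_cell.mean()
    if ybar <= 0:
        return -beta_n                  # objective reduces to exp(z), increasing in z
    return float(np.clip(np.log(ybar), -beta_n, beta_n))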

6.3 Quantile regression

Quantile regression is a type of regression analysis used in statistics and econometrics that focuses on estimating the conditional quantiles (such as the median or other percentiles) of the response variable distribution given a set of predictor variables. Unlike ordinary least squares (OLS) regression, which estimates the mean of the response variable conditional on the predictor variables, quantile regression provides a more comprehensive analysis by estimating the conditional median or other quantiles.

Specifically, suppose $m(x)$ is the $\tau$-th quantile ($0<\tau<1$) of the conditional distribution of $Y|X=x$. Our interest is to estimate $m(x)$ using the i.i.d. data $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$. The loss function in this case is given by

\ell_{3}(x,y):=\rho_{\tau}(y-x),

where $\rho_\tau(u)=(\tau-\mathbb{I}(u<0))u$ denotes the check function for quantile level $\tau$. Meanwhile, by some calculations we know the quantile function $m(x)$ minimizes the population risk w.r.t. $\ell_3(x,y)$. Namely, we have

m(x)\in\arg\min_{h}\mathbf{E}(\rho_{\tau}(Y-h(X))).

Therefore, we have reason to believe that the forest estimator in (3) works well for this problem.
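Before verifying the assumptions, a minimal Python sketch (our code, hypothetical helper name) shows what the per-cell fit (2) looks like for the check loss: an exact minimizer of $\sum_i\rho_\tau(Y_i-z)$ over $z$ is the $\lceil N\tau\rceil$-th order statistic of the responses in the cell, and the truncated fit projects it onto $[-\beta_n,\beta_n]$.

import numpy as np

def quantile_cell_constant(y_in_cell, tau, beta_n):
    # An exact minimizer of sum_i rho_tau(Y_i - z) is the ceil(N*tau)-th order
    # statistic of the Y_i in the cell; since the objective is convex in z,
    # clipping this minimizer to [-beta_n, beta_n] gives the constrained fit.
    y_in_cell = np.asarray(y_in_cell)
    if y_in_cell.size == 0:
        return 0.0                                   # empty cell: predictor 0
    k = int(np.ceil(tau * y_in_cell.size)) - 1       # 0-based index of the order statistic
    q = np.sort(y_in_cell)[max(k, 0)]
    return float(np.clip(q, -beta_n, beta_n))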

Let us verify Assumptions 1-5. First, we choose $S=\mathbb{R}$ in Assumption 1; it is then easy to check that $\ell_3(x,y)$ is convex in its first argument for any $y\in S$. Second, the loss function $\ell_3(x,y)$ is Lipschitz continuous w.r.t. its first argument for any given $y\in S$, with Lipschitz constant $M_1(x,y):=\max\{\tau,1-\tau\}$ for all $x,y\in\mathbb{R}$. Third, we choose the envelope function $M_2(x,y):=\max\{\tau,1-\tau\}\cdot(|x|+|y|)$ in Assumption 3. Fourth, we suppose $Y$ is a sub-Gaussian random variable to meet the requirement of Assumption 4. Finally, we consider a sufficient condition for Assumption 5.

The Knight identity in Knight (1998) tells us that

\rho_{\tau}(u-v)-\rho_{\tau}(u)=v(\mathbb{I}(u\leq 0)-\tau)+\int_{0}^{v}(\mathbb{I}(u\leq s)-\mathbb{I}(u\leq 0))ds,

from which we get

\begin{aligned}
R(h)-R(m) &=\mathbf{E}\left[\rho_{\tau}(Y-h(X))-\rho_{\tau}(Y-m(X))\right]\\
&=\mathbf{E}\left[(h(X)-m(X))\left(\mathbb{I}(Y-m(X)\leq 0)-\tau\right)\right]\\
&\quad+\mathbf{E}\left[\int_{0}^{h(X)-m(X)}\left(\mathbb{I}(Y-m(X)\leq s)-\mathbb{I}(Y-m(X)\leq 0)\right)ds\right].
\end{aligned}

Conditioning on $X$, the first term above equals zero by the definition of $m(x)$. Thus, we have

\begin{aligned}
R(h)-R(m) &=\mathbf{E}\left[\int_{0}^{h(X)-m(X)}\left(\mathbb{I}(Y-m(X)\leq s)-\mathbb{I}(Y-m(X)\leq 0)\right)ds\right]\\
&=\mathbf{E}\left[\int_{0}^{h(X)-m(X)}sgn(s)\,\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]\,ds\right].
\end{aligned}    (14)

To illustrate the basic idea clearly, we consider a Gaussian case, where $m_1(X):=\mathbf{E}(Y|X)$ is independent of the residual $\varepsilon=Y-\mathbf{E}(Y|X)\sim N(0,1)$ and $\sup_{x\in[0,1]^d}|m_1(x)|<\infty$; the general case can be handled in the same spirit. In this Gaussian case, $m(X)$ equals the sum of $m_1(X)$ and the $\tau$-th quantile of $\varepsilon$. Denote by $q_\tau(\varepsilon)\in\mathbb{R}$ the $\tau$-th quantile of $\varepsilon$. Thus, the conditional distribution of $(Y-m(X))|X$ is the same as the distribution of $\varepsilon-q_\tau(\varepsilon)$. For any $s_0>0$,

\int_{0}^{s_{0}}\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds=\int_{0}^{s_{0}}\mathbf{P}[\varepsilon\in(q_{\tau}(\varepsilon),q_{\tau}(\varepsilon)+s)]ds.

Since $q_\tau(\varepsilon)$ is a fixed number, we allow $s_0$ to be large below. By the Lagrange mean value theorem, the following probability bound holds for $0\leq s\leq s_0$:

\frac{1}{\sqrt{2\pi}}\exp\left(-(|q_{\tau}(\varepsilon)|+s_{0})^{2}/2\right)\cdot s\leq\mathbf{P}[\varepsilon\in(q_{\tau}(\varepsilon),q_{\tau}(\varepsilon)+s)]\leq\frac{1}{\sqrt{2\pi}}\cdot s.

Then,

\frac{1}{\sqrt{2\pi}}\exp\left(-(|q_{\tau}(\varepsilon)|+s_{0})^{2}/2\right)\cdot\frac{1}{2}s_{0}^{2}\leq\int_{0}^{s_{0}}\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}s_{0}^{2}.

With the same argument, we also have

\frac{1}{\sqrt{2\pi}}\exp\left(-(|q_{\tau}(\varepsilon)|+|s_{0}|)^{2}/2\right)\cdot\frac{1}{2}s_{0}^{2}\leq\int_{0}^{s_{0}}\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}s_{0}^{2}

once $s_0<0$. Therefore, (14) implies

R(h)-R(m)\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}\mathbf{E}(h(X)-m(X))^{2}    (15)

and

R(h)-R(m)\geq\frac{1}{\sqrt{2\pi}}\exp\left(-\left(|q_{\tau}(\varepsilon)|+\sup_{x\in[0,1]^{d}}|h(x)-m(x)|\right)^{2}/2\right)\cdot\frac{1}{2}\mathbf{E}(h(X)-m(X))^{2}.    (16)

Now, choose $\beta_n\asymp\sqrt{\ln\ln n}$. The combination of (15) and (16) implies that Assumption 5 holds with $\kappa=2$, with a lower-bound constant of order $(\ln n)^{-1}$. Meanwhile, the coefficients in the general theoretical results are of polynomial order in $\ln\ln n$; for instance, $\sqrt{\mathbf{E}(M_2^2(\beta_n,Y))}\asymp\ln\ln n$. With this setting, the convergence rate of $\varepsilon(\hat{h}_n)$ is $O_p(n^{-\frac{1}{2}\cdot\frac{p}{p+d}})$ up to a polynomial of $\ln n$.

6.4 Nonparametric density estimation

Assume $X$ is a continuous random vector defined on $[0,1]^d$ with density function $f_0(x),\ x\in[0,1]^d$. Our interest lies in estimating the unknown function $f_0(x)$ from an i.i.d. sample of $X$, namely the data $\mathcal{D}_n=\{X_i\}_{i=1}^n$. However, any density estimator has to satisfy two shape requirements: $f_0$ is non-negative, namely $f_0(x)\geq 0$ for $x\in[0,1]^d$, and $\int f_0(x)dx=1$. These two restrictions can be relaxed by a transformation. In fact, we have the decomposition

f_{0}(x)=\frac{\exp(h_{0}(x))}{\int\exp(h_{0}(x))dx},\ x\in[0,1]^{d},

where $h_0(x)$ can be any integrable function on $[0,1]^d$. This link allows us to focus on the estimation of $h_0(x)$ only, which is a statistical learning problem without constraints. However, the transformation introduces a model identifiability problem, since $h_0+c$ and $h_0$ give the same density function for any $c\in\mathbb{R}$. To solve this problem, we impose the additional requirement $\int_{[0,1]^d}h_0(x)dx=0$, which guarantees a one-to-one map between $f_0$ and $h_0$.
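For concreteness, the one-to-one map can be written explicitly (our calculation): given a density $f_0$ bounded away from zero, the centered log-density

h_{0}(x)=\ln f_{0}(x)-\int_{[0,1]^{d}}\ln f_{0}(u)\,du

satisfies $\int_{[0,1]^d}h_0(x)dx=0$ and $\exp(h_0(x))/\int\exp(h_0(u))du=f_0(x)$, since the additive constant cancels in the ratio.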

In the case of density estimation, the scaled negative log-likelihood of any function $h(x)$ based on the sampled data is

\hat{R}(h):=-\frac{1}{n}\sum_{i=1}^{n}h(X_{i})+\ln\int_{[0,1]^{d}}\exp(h(x))dx

and its population version is

R(h)=-\mathbf{E}(h(X))+\ln\int_{[0,1]^{d}}\exp(h(x))dx.

With a slight modification, Mondrian forests can also be applied to this problem. Recall that the partition $\{\mathcal{C}_{b,\lambda,j}\}_{j=1}^{K_b(\lambda)}$ of the $b$-th Mondrian tree with stopping time $\lambda$ satisfies

[0,1]^{d}=\bigcup_{j=1}^{K_{b}(\lambda)}\mathcal{C}_{b,\lambda,j}\ \ \text{and}\ \ \mathcal{C}_{b,\lambda,j_{1}}\cap\mathcal{C}_{b,\lambda,j_{2}}=\varnothing,\ \forall j_{1}\neq j_{2}.

For each cell $\mathcal{C}_{b,\lambda,j}$, a constant $\hat{c}_{b,\lambda,j}\in\mathbb{R}$ is used as the predictor of $h_0(x)$ in this small region. Thus, the estimator of $h_0(x)$ based on a single tree has the form

\hat{h}_{b,n}^{pre}(x)=\sum_{j=1}^{K_{b}(\lambda)}\hat{c}_{b,\lambda,j}\cdot\mathbb{I}(x\in\mathcal{C}_{b,\lambda,j}),

where the coefficients are obtained by minimizing the empirical risk function,

\begin{aligned}
(\hat{c}_{b,\lambda,1},\ldots,\hat{c}_{b,\lambda,K_{b}(\lambda)}):=\arg\min_{\begin{subarray}{c}c_{b,\lambda,j}\in[-\beta_{n},\beta_{n}]\\ j=1,\ldots,K_{b}(\lambda)\end{subarray}}\ &-\frac{1}{n}\sum_{j=1}^{K_{b}(\lambda)}\sum_{i=1}^{n}c_{b,\lambda,j}\cdot\mathbb{I}(X_{i}\in\mathcal{C}_{b,\lambda,j})\\
&+\ln\sum_{j=1}^{K_{b}(\lambda)}\int_{\mathcal{C}_{b,\lambda,j}}\exp(c_{b,\lambda,j})dx.
\end{aligned}

Since the objective above is continuous in the variables $c_{b,\lambda,j}$ and the feasible set $[-\beta_n,\beta_n]^{K_b(\lambda)}$ is compact, the minimum is indeed attained. To meet the identifiability requirement, the estimator based on a single tree is given by

\hat{h}_{b,n}(x)=\hat{h}_{b,n}^{pre}(x)-\int_{[0,1]^{d}}\hat{h}_{b,n}^{pre}(x)dx.

Finally, by applying the ensemble technique again, the estimator of $h_0$ based on the Mondrian forest is

\hat{h}_{n}(x):=\frac{1}{B}\sum_{b=1}^{B}\hat{h}_{b,n}(x),\ x\in[0,1]^{d}.    (17)
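To make the within-tree optimization concrete, the Python sketch below (our code, hypothetical helper name, assuming NumPy) computes the unconstrained first-order solution, which is a histogram on the Mondrian partition written on the log scale and centred for identifiability; the truncation to $[-\beta_n,\beta_n]$ and empty cells would need an extra clipping step, as noted in the comments.

import numpy as np

def density_tree_constants(leaves, X, eps=1e-12):
    # Setting the gradient of the objective above to zero gives, for each cell,
    # exp(c_j) proportional to N_j / (n * vol(C_j)), i.e. the histogram estimate
    # on the Mondrian partition. We take logs (eps guards empty cells, which
    # strictly speaking also require clipping to [-beta_n, beta_n]) and centre so
    # that sum_j vol(C_j) * c_j = 0, the identifiability constraint.
    n = X.shape[0]
    volumes = np.array([np.prod([b - a for a, b in cell]) for cell in leaves])
    counts = np.array([np.sum(np.all([(X[:, j] >= a) & (X[:, j] <= b)
                                      for j, (a, b) in enumerate(cell)], axis=0))
                       for cell in leaves])
    log_density = np.log(counts / (n * volumes) + eps)
    return log_density - np.sum(volumes * log_density)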

Next, we analyze the theoretical properties of $\hat{h}_n$. The forest estimator (17) is similar to the previous one defined in (3), which is obtained by optimization without constraints. Although here we add the extra penalty function

Pen(h):=\ln\int_{[0,1]^{d}}\exp(h(x))dx,

where $h:[0,1]^d\to\mathbb{R}$, the arguments of Theorem 1 can still be applied. Let the pseudo loss function be $\ell^{pse}(x,y):=-x$ for $x,y\in\mathbb{R}$. It is obvious that Assumption 1 holds for $\ell^{pse}(x,y)$, Assumption 2 is satisfied with $M_1(x,y)=1$, and Assumption 3 is satisfied with $M_2(x,y)=x$. Choose $\beta_n\asymp\ln\ln n$. Following arguments similar to those in Theorem 1, we have

\begin{aligned}
\mathbf{E}R(\hat{h}_{n})-R(h) &\leq c_{1}\frac{\ln n}{\sqrt{n}}(1+\lambda_{n})^{d}+2d^{\frac{3}{2}p}C\cdot\frac{1}{\lambda_{n}^{p}}\\
&\quad+\frac{C}{\sqrt{n}}+\mathbf{E}_{\lambda_{n}}|Pen(h)-Pen(h_{n}^{*})|,
\end{aligned}    (18)

where $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ ($0<p\leq 1$), $h_n^*(x):=\sum_{j=1}^{K_1(\lambda_n)}\mathbb{I}(x\in\mathcal{C}_{1,\lambda_n,j})h(x_{1,\lambda_n,j})$ with $x_{1,\lambda_n,j}$ the center of the cell $\mathcal{C}_{1,\lambda_n,j}$, and $c_1$ is the coefficient in Theorem 1. Therefore, it remains to bound $\mathbf{E}_{\lambda_n}|Pen(h)-Pen(h_n^*)|$.

Lemma 2.

For any $h(x)\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ and $h_n^*(x)$ defined above,

\mathbf{E}_{\lambda_{n}}|Pen(h)-Pen(h_{n}^{*})|\leq\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\cdot\left(\frac{1}{\lambda_{n}}\right)^{p}.

Choosing $\lambda_n=n^{\frac{1}{2(p+d)}}$, the combination of (18) and Lemma 2 implies

\mathbf{E}R(\hat{h}_{n})-R(h)\leq\left(c_{1}\ln\ln n+3d^{\frac{3}{2}p}C+\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}

when $n$ is large enough.

To obtain the consistency rate of the Mondrian forest estimator, we need a result similar to Assumption 5.

Lemma 3.

Suppose the true density $f_0(x)$ is bounded away from zero and infinity, namely $c_0<f_0(x)<c_0^{-1}$ for all $x\in[0,1]^d$, and let $\beta_n\asymp\ln\ln n$. For any function $h:[0,1]^d\to\mathbb{R}$ with $\|h\|_\infty\leq\beta_n$ and $\int_{[0,1]^d}h(x)dx=0$, we have

c_{0}\cdot\frac{1}{\ln n}\cdot\mathbf{E}(h(X)-h_{0}(X))^{2}\leq R(h)-R(h_{0})\leq c_{0}^{-1}\cdot\ln n\cdot\mathbf{E}(h(X)-h_{0}(X))^{2}.

Thus, Lemma 3 immediately implies the following result.

Theorem 3.

Suppose the true density $f_0(x)$ is bounded away from zero and infinity, namely $c_0<f_0(x)<c_0^{-1}$ for all $x\in[0,1]^d$. If $\lambda_n=n^{\frac{1}{2(p+d)}}$ and the true function $h_0\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ satisfies $\int_{[0,1]^d}h_0(x)dx=0$, then there is $n_{den}\in\mathbb{Z}^+$ such that for any $n>n_{den}$,

\mathbf{E}(\hat{h}_{n}(X)-h_{0}(X))^{2}\leq c_{0}^{-1}\cdot\ln n\cdot\left(c_{1}\ln\ln n+3d^{\frac{3}{2}p}C+\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

7 Conclusion

In this paper, we proposed a general framework for Mondrian forests that can be used in many statistical and machine learning problems, including but not limited to least squares regression, generalized regression, density estimation and quantile regression. Meanwhile, we studied the upper bound of its regret function and its statistical consistency, and showed how to apply these results in the specific applications listed above. Future work could study the asymptotic distribution of this kind of general Mondrian forest, as suggested by Cattaneo et al. (2023).

8 Proofs

This section contains proofs of the theoretical results in the paper. Several useful preliminaries and notations are introduced first. Meanwhile, the constant $c$ in this section may change from line to line in order to simplify notation.

Definition 2 (Blumer et al. (1989)).

Let $\mathcal{F}$ be a Boolean function class in which each $f:\mathcal{Z}\to\{0,1\}$ is binary-valued. The growth function of $\mathcal{F}$ is defined by

\Pi_{\mathcal{F}}(m)=\max_{z_{1},\ldots,z_{m}\in\mathcal{Z}}|\{(f(z_{1}),\ldots,f(z_{m})):f\in\mathcal{F}\}|

for each positive integer $m\in\mathbb{Z}_+$.

Definition 3 (Györfi et al. (2002)).

Let $z_1,\ldots,z_n\in\mathbb{R}^p$ and $z_1^n=\{z_1,\ldots,z_n\}$. Let $\mathcal{H}$ be a class of functions $h:\mathbb{R}^p\to\mathbb{R}$. An $L_q$ $\varepsilon$-cover of $\mathcal{H}$ on $z_1^n$ is a finite set of functions $h_1,\ldots,h_N:\mathbb{R}^p\to\mathbb{R}$ satisfying

\min_{1\leq j\leq N}\left(\frac{1}{n}\sum_{i=1}^{n}|h(z_{i})-h_{j}(z_{i})|^{q}\right)^{\frac{1}{q}}<\varepsilon,\ \ \forall h\in\mathcal{H}.

Then, the $L_q$ $\varepsilon$-cover number of $\mathcal{H}$ on $z_1^n$, denoted by $\mathcal{N}_q(\varepsilon,\mathcal{H},z_1^n)$, is the minimal size of an $L_q$ $\varepsilon$-cover of $\mathcal{H}$ on $z_1^n$. If there exists no finite $L_q$ $\varepsilon$-cover of $\mathcal{H}$, the cover number is defined as $\mathcal{N}_q(\varepsilon,\mathcal{H},z_1^n)=\infty$.

For a VC class, there is a useful result giving an upper bound on its covering number. To make this section self-contained, we first introduce some basic concepts and facts about the VC dimension; see Shalev-Shwartz and Ben-David (2014) for more details.

Definition 4 (Kosorok (2008)).

The subgraph of a real function $f:\mathcal{X}\to\mathbb{R}$ is the subset of $\mathcal{X}\times\mathbb{R}$ defined by

C_{f}=\{(x,y)\in\mathcal{X}\times\mathbb{R}:f(x)>y\},

where $\mathcal{X}$ is an abstract set.

Definition 5 (Kosorok (2008)).

Let $\mathcal{C}$ be a collection of subsets of the set $\mathcal{X}$ and let $\{x_1,\ldots,x_m\}\subset\mathcal{X}$ be an arbitrary set of $m$ points. We say that $\mathcal{C}$ picks out a certain subset $A$ of $\{x_1,\ldots,x_m\}$ if $A$ can be expressed as $C\cap\{x_1,\ldots,x_m\}$ for some $C\in\mathcal{C}$. The collection $\mathcal{C}$ is said to shatter $\{x_1,\ldots,x_m\}$ if each of its $2^m$ subsets can be picked out.

Definition 6 (Kosorok (2008)).

The VC dimension of a real function class $\mathcal{F}$, where each $f\in\mathcal{F}$ is defined on $\mathcal{X}$, is the largest integer $VC(\mathcal{F})$ such that some set of points in $\mathcal{X}\times\mathbb{R}$ of size $VC(\mathcal{F})$ is shattered by $\{C_f:f\in\mathcal{F}\}$. In this paper, we use $VC(\mathcal{F})$ to denote the VC dimension of $\mathcal{F}$.

Proof of Theorem 1. Since $\ell(\cdot,y)$ is convex by Assumption 1, we have

\ell(\hat{h}_{n}(x),y)\leq\frac{1}{B}\sum_{b=1}^{B}\ell(\hat{h}_{b,n}(x),y)

for any $x\in[0,1]^d$ and $y\in\mathbb{R}$. Therefore, we only need to consider the excess risk of a single tree in the following analysis.

In fact, our proof is based on the following decomposition:

\begin{aligned}
\mathbf{E}R(\hat{h}_{1,n})-R(h) &=\mathbf{E}(R(\hat{h}_{1,n})-\hat{R}(\hat{h}_{1,n}))+\mathbf{E}(\hat{R}(\hat{h}_{1,n})-\hat{R}(h))\\
&\quad+\mathbf{E}(\hat{R}(h)-R(h))\\
&:=I+II+III,
\end{aligned}    (19)

where I relates to the variance term of the Mondrian tree, II is the approximation error of the Mondrian tree to $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$, and III measures the error when the empirical loss $\hat{R}(h)$ is used to approximate the theoretical one.

Analysis of Part I. We first define two function classes:

\mathcal{T}(t):=\{\text{Mondrian trees with $t$ leaves partitioning $[0,1]^{d}$}\},
\mathcal{G}(t):=\Big\{\sum_{j=1}^{t}\mathbb{I}(x\in\mathcal{C}_{j})\cdot c_{j}:c_{j}\in\mathbb{R},\ \mathcal{C}_{j}\text{'s are leaves of a tree in }\mathcal{T}(t)\Big\}.

The truncated version of the class $\mathcal{G}(t)$ is given by

\mathcal{G}(t,z):=\{\tilde{g}(x)=T_{z}g(x):g\in\mathcal{G}(t)\},

where $T_z$ denotes truncation at the threshold $z>0$. Then, Part I can be bounded as follows.

|I|\displaystyle|I| 𝐄πλn(𝐄𝒟n|1ni=1n(h^1,n(Xi),Yi)𝐄((h^1,n(X),Y)|𝒟n)||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-{\mathbf{E}}(\ell(\hat{h}_{1,n}(X),Y)|{\mathcal{D}}_{n})}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟n|1ni=1n(h^1,n(Xi),Yi)𝐄((h^1,n(X),Y)|𝒟n)||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-{\mathbf{E}}(\ell(\hat{h}_{1,n}(X),Y)|{\mathcal{D}}_{n})}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
=𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))|𝕀(An)|πλn)\displaystyle={\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n})\Big{|}\pi_{\lambda_{n}}\right)
+𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))|𝕀(Anc)|πλn)\displaystyle+{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)
:=I1+I2,\displaystyle:=I_{1}+I_{2}, (20)

where A_{n}:=\{\max_{1\leq i\leq n}|Y_{i}|\leq\ln{n}\}. Next, we find upper bounds for I_{1} and I_{2} respectively.

Let us consider I1I_{1} first. Make the decomposition of I1I_{1} as below.

I1\displaystyle I_{1} 𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),Y))|𝕀(An)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n})\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),Y))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),TlnnY))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),T_{\ln{n}}Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
+𝐄πλn(supg𝒢(K(λn),βn)|𝐄((g(X),TlnnY))𝐄((g(X),Y))||πλn)\displaystyle+{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|{\mathbf{E}}(\ell(g(X),T_{\ln{n}}Y))-{\mathbf{E}}(\ell(g(X),Y))\right|\Big{|}\pi_{\lambda_{n}}\right)
:=I1,1+I1,2.\displaystyle:=I_{1,1}+I_{1,2}.

The part I1,1I_{1,1} can be bounded by considering the covering number of the function class

n:={(g(),Tlnn()):g𝒢(K(λn),βn)},\mathcal{L}_{n}:=\{\ell(g(\cdot),T_{\ln n}(\cdot)):g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})\},

where K(\lambda_{n}) denotes the number of cells in the partition constructed by the truncated Mondrian process \pi_{\lambda_{n}} with stopping time \lambda_{n}; hence K(\lambda_{n}) is a deterministic number once \pi_{\lambda_{n}} is given. For any \varepsilon>0, recall the definition of the covering number of \mathcal{G}(K(\lambda_{n}),\beta_{n}), namely \mathcal{N}_{1}(\varepsilon,\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}), given in Definition 3. Now, we suppose

{η1(x),η2(x),,ηJ(x)}\{\eta_{1}(x),\eta_{2}(x),\ldots,\eta_{J}(x)\}

is an \varepsilon/M_{1}(\beta_{n},\ln n)-cover of the class \mathcal{G}(K(\lambda_{n}),\beta_{n}) in the L^{1}(z_{1}^{n}) space, where L^{1}(z_{1}^{n}):=\{f(x):\|f\|_{z_{1}^{n}}:=\frac{1}{n}\sum_{i=1}^{n}|f(z_{i})|<\infty\} is equipped with the norm \|\cdot\|_{z_{1}^{n}} and J\geq 1. Without loss of generality, we can further assume |\eta_{j}(x)|\leq\beta_{n}, since every function in \mathcal{G}(K(\lambda_{n}),\beta_{n}) is bounded in absolute value by \beta_{n}; otherwise, we replace \eta_{j}(x) by its truncation T_{\beta_{n}}\eta_{j}(x). According to Assumption 2, we know that for any g\in\mathcal{G}(K(\lambda_{n}),\beta_{n}) and any \eta_{j}(x),

|\ell(g(x),T_{\ln n}y)-\ell(\eta_{j}(x),T_{\ln n}y)|\leq M_{1}(\beta_{n},\ln n)\cdot|g(x)-\eta_{j}(x)|,

where x[0,1]d,y.x\in[0,1]^{d},y\in{\mathbb{R}}. The above inequality implies that

\frac{1}{n}\sum_{i=1}^{n}|\ell(g(z_{i}),T_{\ln n}w_{i})-\ell(\eta_{j}(z_{i}),T_{\ln n}w_{i})|\leq M_{1}(\beta_{n},\ln n)\cdot\frac{1}{n}\sum_{i=1}^{n}|g(z_{i})-\eta_{j}(z_{i})|

for any z_{1}^{n}:=(z_{1},z_{2},\ldots,z_{n})\in{\mathbb{R}}^{d}\times\cdots\times{\mathbb{R}}^{d} and (w_{1},\ldots,w_{n})\in{\mathbb{R}}\times\cdots\times{\mathbb{R}}. Therefore, \ell(\eta_{1}(x),T_{\ln n}y),\ldots,\ell(\eta_{J}(x),T_{\ln n}y) is an \varepsilon-cover of the class \mathcal{L}_{n} in the L^{1}(v_{1}^{n}) space, where v_{1}^{n}:=((z_{1}^{T},w_{1})^{T},\ldots,(z_{n}^{T},w_{n})^{T}). In other words, we have

𝒩1(ε,n,v1n)𝒩1(εM1(βn,lnn),𝒢(K(λn),βn),z1n).\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})\leq\mathcal{N}_{1}\left(\frac{\varepsilon}{M_{1}(\beta_{n},\ln n)},\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}\right). (21)

Note that \mathcal{G}(K(\lambda_{n}),\beta_{n}) is a VC class, since \mathcal{G}(K(\lambda_{n})) is shown to be a VC class in Lemma 4 below (see (29)). Furthermore, every function in \mathcal{G}(K(\lambda_{n}),\beta_{n}) is bounded in absolute value by \beta_{n}. Therefore, we can bound the RHS of (21) by using Theorem 7.12 in Sen (2018):

𝒩1(εM1(βn,lnn),𝒢(K(λn),βn),z1n)\displaystyle\mathcal{N}_{1}\left(\frac{\varepsilon}{M_{1}(\beta_{n},\ln n)},\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}\right)
cVC(𝒢(K(λn),βn))(4e)VC(𝒢(K(λn),βn))(βnε)VC(𝒢(K(λn),βn))\displaystyle\leq c\cdot VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))(4e)^{VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))}\left(\frac{\beta_{n}}{\varepsilon}\right)^{VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))} (22)

for some universal constant c>0c>0. On the other hand, it is not difficult to show

VC(𝒢(K(λn),βn))VC(𝒢(K(λn))).VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))\leq VC(\mathcal{G}(K(\lambda_{n}))). (23)

Thus, the combination of (23), (22) and (21) implies

𝒩1(ε,n,v1n)cVC(𝒢(K(λn)))(4e)VC(𝒢(K(λn)))(βnε)VC(𝒢(K(λn)))\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})\leq c\cdot VC(\mathcal{G}(K(\lambda_{n})))(4e)^{VC(\mathcal{G}(K(\lambda_{n})))}\left(\frac{\beta_{n}}{\varepsilon}\right)^{VC(\mathcal{G}(K(\lambda_{n})))} (24)

for each v_{1}^{n}. Note that the class \mathcal{L}_{n} has an envelope function M_{2}(\beta_{n},y) satisfying \nu(n):=\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}<\infty by Assumption 3. Let \{b_{i}\}_{i=1}^{n} be i.i.d. Rademacher variables with {\mathbf{P}}(b_{i}=\pm 1)=0.5 for i=1,\ldots,n. Then, the symmetrization inequality (see Lemma 3.12 in Sen (2018)), Dudley's entropy integral (see (41) in Sen (2018)) and (24) imply

I1,1\displaystyle I_{1,1} 𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)bi||πλn)(symmetrization technique)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})}b_{i}\right|\Big{|}\pi_{\lambda_{n}}\right)\ \ \text{(symmetrization technique)}
𝐄πλn(24n0ν(n)ln𝒩1(ε,n,v1n)𝑑ε|πλn)(Dudley’s entropy integral)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\frac{24}{\sqrt{n}}\int_{0}^{\nu(n)}\sqrt{\ln\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)\ \ \qquad\qquad\text{(Dudley's entropy integral)}
cn𝐄πλn(0ν(n)ln(1+c(βn/ε)2VC(𝒢(K(λn)))𝑑ε|πλn)(E.q. (24))\displaystyle\leq\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{\nu(n)}\sqrt{\ln(1+c\cdot(\beta_{n}/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)(\text{E.q. \eqref{789nsvbfc}})
βncn𝐄πλn(0ν(n)/βnln(1+c(1/ε)2VC(𝒢(K(λn)))𝑑ε|πλn),\displaystyle\leq\beta_{n}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{\nu(n)/\beta_{n}}\sqrt{\ln(1+c(1/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right), (25)

where we use the fact that \pi_{\lambda_{n}} is independent of the data set {\mathcal{D}}_{n}, and c>0 is a universal constant. Without loss of generality, we can assume \nu(n)/\beta_{n}<1; otherwise we simply replace \beta_{n} by \beta_{n}^{\prime}=\max\{\nu(n),\beta_{n}\} as the new upper bound of the function class \mathcal{G}(K(\lambda_{n}),\beta_{n}). Therefore, (25) also implies

I1,1\displaystyle I_{1,1} max{βn,ν(n)}cn𝐄πλn(01ln(1+c(1/ε)2VC(𝒢(K(λn)))𝑑ε|πλn)\displaystyle\leq\max\{\beta_{n},\nu(n)\}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{1}\sqrt{\ln(1+c(1/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)
max{βn,ν(n)}cn𝐄πλn(VC(𝒢(K(λn))n)01ln(1/ε)𝑑ε\displaystyle\leq\max\{\beta_{n},\nu(n)\}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{VC(\mathcal{G}(K(\lambda_{n}))}{n}}\right)\cdot\int_{0}^{1}\sqrt{\ln(1/\varepsilon)}d\varepsilon
cmax{βn,ν(n)}𝐄πλn(VC(𝒢(K(λn))n).\displaystyle\leq c\cdot\max\{\beta_{n},\nu(n)\}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{VC(\mathcal{G}(K(\lambda_{n}))}{n}}\right). (26)

It remains to find the VC dimension of the class \mathcal{G}(t) for each t\in\mathbb{Z}_{+}. This result is summarized below.

Lemma 4.

For each integer t+t\in\mathbb{Z}_{+}, VC(𝒢(t))c(d)tln(t)VC(\mathcal{G}(t))\leq c(d)\cdot t\ln(t).

Proof.

Recall two defined classes:

𝒯(t):={A Mondrian tree with t leaves by partitioning [0,1]d}\mathcal{T}(t):=\{\text{A Mondrian tree with $t$ leaves by partitioning $[0,1]^{d}$}\}
𝒢(t):={j=1t𝕀(x𝒞j)cj:cj,𝒞jsare leaves of a tree in 𝒯(t)}.\mathcal{G}(t):=\Big{\{}\sum_{j=1}^{t}{{\mathbb{I}}(x\in{\small\mathcal{C}}_{j})}\cdot c_{j}:c_{j}\in{\mathbb{R}},\ {\small\mathcal{C}}_{j}\ ^{\prime}s\ \text{are leaves of a tree in $\mathcal{T}(t)$}\Big{\}}.

We bound the VC dimension of the class \mathcal{G}(t) through an associated Boolean class. Define

\mathcal{F}_{t}=\{\mathrm{sgn}(f(x,y)):f(x,y)=h(x)-y,\ h\in\mathcal{G}(t)\},

where \mathrm{sgn}(v)=1 if v\geq 0 and \mathrm{sgn}(v)=-1 otherwise. Recall that the VC dimension of \mathcal{F}_{t}, denoted by VC(\mathcal{F}_{t}), is the largest integer m\in\mathbb{Z}_{+} satisfying 2^{m}\leq\Pi_{\mathcal{F}_{t}}(m) (see, for example, Kosorok (2008)). Therefore, we focus on bounding \Pi_{\mathcal{F}_{t}}(m) for each positive integer m\in\mathbb{Z}_{+}. Let z_{1},\ldots,z_{m}\in{\mathbb{R}}^{d} be points attaining the maximum in the definition of \Pi_{\mathcal{F}_{t}}(m). Under the above notation, we make the following two observations.

  • For any h_{t}\in\mathcal{G}(t) that is constant on each cell \mathcal{C}_{j}, j=1,\ldots,t, there exist h_{t-1}\in\mathcal{G}(t-1) and a leaf \mathcal{C} of a tree in \mathcal{T}(t-1) such that \mathcal{C}=\mathcal{C}_{j}\cup\mathcal{C}_{j^{\prime}} for some j,j^{\prime}; moreover, h_{t-1} is constant on \mathcal{C} and on each cell in \{\mathcal{C}_{k}\}_{k=1}^{t}\setminus\{\mathcal{C}_{j},\mathcal{C}_{j^{\prime}}\}.

  • All half-spaces in \mathbb{R}^{d} pick out at most \left({me}/{(d+1)}\right)^{d+1} subsets from \{z_{1},\ldots,z_{m}\} when m\geq d+1 (see, e.g., Kosorok (2008)); namely,

    Card({{z1,,zm}{xd:θTxs}:θΘd,s})(me/(d+1))d+1.Card(\{\{z_{1},\ldots,z_{m}\}\cap\{x\in{\mathbb{R}}^{d}:\theta^{T}x\leq s\}:\theta\in\Theta^{d},s\in\mathbb{R}\})\leq\left({me}/{(d+1)}\right)^{d+1}.

Based on the above two facts, we can conclude

Πt(m)Πt1(m)(med+1)d+1.\Pi_{\mathcal{F}_{t}}(m)\leq\Pi_{\mathcal{F}_{t-1}}(m)\cdot\left(\frac{me}{d+1}\right)^{d+1}. (27)

Then, combination of (27) and Π1(m)(med+1)d+1\Pi_{\mathcal{F}_{1}}(m)\leq\left(\frac{me}{d+1}\right)^{d+1} implies that

Πt(m)(med+1)td+t.\Pi_{\mathcal{F}_{t}}(m)\leq\left(\frac{me}{d+1}\right)^{t\cdot d+t}. (28)

Solving the inequality

2m(med+1)td+t2^{m}\leq\left(\frac{me}{d+1}\right)^{t\cdot d+t}

by using the basic inequality lnxγxlnγ1\ln x\leq\gamma\cdot x-\ln\gamma-1 with x,γ>0x,\gamma>0 yields

VC(𝒢(t))4ln2d(t+1)ln(2d(t+1))c(d)tln(t),VC(\mathcal{G}({t}))\leq\frac{4}{\ln 2}\cdot d(t+1)\ln{\left(2d(t+1)\right)}\leq c(d)\cdot t\ln(t), (29)

where the constant c(d)c(d) depends on dd only. ∎
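For concreteness, one way to carry out the last step above (with a slightly different but equally usable constant) is the following: taking logarithms in the displayed inequality gives

m\ln 2\leq t(d+1)\left(1+\ln m-\ln(d+1)\right)\leq t(d+1)(1+\ln m),

and applying \ln m\leq\gamma m-\ln\gamma-1 with \gamma=\frac{\ln 2}{2t(d+1)} yields

m\ln 2\leq\frac{\ln 2}{2}\,m+t(d+1)\ln\frac{2t(d+1)}{\ln 2},\qquad\text{so that}\qquad m\leq\frac{2}{\ln 2}\,t(d+1)\ln\frac{2t(d+1)}{\ln 2},

which is again of order c(d)\cdot t\ln(t) for t\geq 2.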

Therefore, we know from Lemma 4 and (29) that

I1,1cmax{βn,ν(n)}𝐄πλn(K(λn)lnK(λn)n).I_{1,1}\leq c\cdot\max\{\beta_{n},\nu(n)\}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{K(\lambda_{n})\ln K(\lambda_{n})}{n}}\right).

By the basic inequality \ln x\leq x^{\beta}/\beta for all x\geq 1 and \beta>0 (applied here with \beta=1), the above inequality gives

I1,1cmax{βn,ν(n)}n𝐄(K(λn)).I_{1,1}\leq c\cdot\frac{\max\{\beta_{n},\nu(n)\}}{\sqrt{n}}\cdot{\mathbf{E}}(K(\lambda_{n})). (30)

Next, from Proposition 2 in Mourtada et al. (2020), we know 𝐄(K(λn))=(1+λn)d{\mathbf{E}}(K(\lambda_{n}))=(1+\lambda_{n})^{d}. Finally, we have the following upper bound for I1,1I_{1,1}

I1,1cmax{βn,ν(n)}n(1+λn)d.I_{1,1}\leq c\cdot\frac{\max\{\beta_{n},\nu(n)\}}{\sqrt{n}}\cdot(1+\lambda_{n})^{d}. (31)
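As an illustrative sanity check (not used in the proof), one can simulate the Mondrian process and compare the Monte Carlo average of K(\lambda) with (1+\lambda)^{d}. The sketch below assumes the standard recursive construction of the Mondrian process (Roy and Teh, 2008): a cell is split after an exponential waiting time whose rate is the sum of its side lengths, the split dimension is chosen with probability proportional to the side lengths, and the split location is uniform on the chosen side.

import random

def mondrian_leaf_count(lifetime, lower, upper, t=0.0):
    """Count the leaves of a Mondrian partition of the box prod_j [lower[j], upper[j]]
    generated with the given lifetime (stopping time)."""
    sides = [u - l for l, u in zip(lower, upper)]
    rate = sum(sides)
    wait = random.expovariate(rate) if rate > 0 else float("inf")
    if t + wait > lifetime:
        return 1  # the cell is never split before the lifetime expires: it is a leaf
    j = random.choices(range(len(sides)), weights=sides)[0]  # split dimension
    s = random.uniform(lower[j], upper[j])                   # split location
    left_upper, right_lower = list(upper), list(lower)
    left_upper[j], right_lower[j] = s, s
    return (mondrian_leaf_count(lifetime, lower, left_upper, t + wait)
            + mondrian_leaf_count(lifetime, right_lower, upper, t + wait))

d, lam, reps = 2, 3.0, 2000
avg = sum(mondrian_leaf_count(lam, [0.0] * d, [1.0] * d) for _ in range(reps)) / reps
print(avg, (1 + lam) ** d)  # the two numbers should be close (here (1+3)^2 = 16)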

Then, we bound the second part I1,2I_{1,2} of I1I_{1} by following the arguments below.

I1,2\displaystyle I_{1,2} 𝐄πλn(supg𝒢(K(λn),βn)𝐄(|(g(X),TlnnY)(g(X),Y)|)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),T_{\ln{n}}Y)-\ell(g(X),Y)|)\Big{|}\pi_{\lambda_{n}}\right)
=𝐄πλn(supg𝒢(K(λn),βn)𝐄(|(g(X),lnn)(g(X),Y)|𝕀({|Y|>lnn}))|πλn)\displaystyle={\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),\ln n)-\ell(g(X),Y)|\cdot\mathbb{I}(\{|Y|>\ln n\}))\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn((supg𝒢(K(λn),βn)𝐄(|(g(X),lnn)(g(X),Y)|2))12|πλn)𝐏12(|Y|>lnn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),\ln n)-\ell(g(X),Y)|^{2})\right)^{\frac{1}{2}}\Big{|}\pi_{\lambda_{n}}\right){\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n)
(supx[βn,βn]2(x,lnn)+𝐄(M22(βn,Y)))122exp(ln2n/4),\displaystyle\leq\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{\ell^{2}(x,\ln n)}+{\mathbf{E}}(M^{2}_{2}(\beta_{n},Y))\right)^{\frac{1}{2}}\cdot\sqrt{2}\exp(-\ln^{2}n/4), (32)

where the third line uses the Cauchy–Schwarz inequality, and the last line uses \ell(x,y)\leq M_{2}(x,y) from Assumption 3 together with the sub-Gaussian property of Y.

Finally, we end the Analysis of Part I by bounding I_{2}. This can be done as follows.

\displaystyle I_{2}={\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)(1ni=1n(g(Xi),Yi)+𝐄((g(X),Y)))𝕀(Anc)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left(\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})+{\mathbf{E}}(\ell(g(X),Y))}\right)\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟n(1ni=1nM2(βn,Yi)+𝐄(M2(βn,Y)))𝕀(Anc)|πλn)(Assumption3)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\frac{1}{n}\sum_{i=1}^{n}{M_{2}(\beta_{n},Y_{i})+{\mathbf{E}}(M_{2}(\beta_{n},Y))}\right)\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)\ \ (Assumption\ \ref{assump3})
2𝐄(M22(βn,Y))𝐏(Anc).\displaystyle\leq 2\cdot\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\cdot{\mathbf{P}}(A_{n}^{c}). (33)

Thus, we only need to find the upper bound of 𝐏(Anc){\mathbf{P}}(A_{n}^{c}). By some calculations, we know

𝐏(Anc)\displaystyle{\mathbf{P}}(A_{n}^{c}) =1𝐏(max1in|Yi|lnn)=1[𝐏(|Yi|lnn)]n1(1cecln2n)n\displaystyle=1-{\mathbf{P}}\left(\max_{1\leq i\leq n}|Y_{i}|\leq\ln n\right)=1-\left[{\mathbf{P}}(|Y_{i}|\leq\ln n)\right]^{n}\leq 1-(1-c\cdot e^{-c\cdot\ln^{2}n})^{n}
1enln(1cecln2n)nln(1cecln2n)cnecln2n\displaystyle\leq 1-e^{n\cdot\ln(1-c\cdot e^{-c\cdot\ln^{2}n})}\leq-n\cdot\ln(1-c\cdot e^{-c\cdot\ln^{2}{n}})\leq c^{\prime}\cdot n\cdot e^{-c\cdot\ln^{2}{n}} (34)

for some c>0c>0 and c>0c^{\prime}>0. Therefore, (33) and (34) imply

I2c𝐄(M22(βn,Y))necln2n.I_{2}\leq c\cdot\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\cdot n\cdot e^{-c\cdot\ln^{2}{n}}. (35)

Analysis of Part II. Recall

II:=𝐄(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h(Xi),Yi)),II:={\mathbf{E}}\left(\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right),

which relates to the empirical approximation error of the Mondrian tree. First, suppose the truncated Mondrian process of the first tree, with stopping time \lambda_{n}, is given and denoted by \pi_{1,\lambda_{n}}. Conditionally on \pi_{1,\lambda_{n}}, the partition of [0,1]^{d} is determined and is denoted by \{\mathcal{C}_{1,\lambda_{n},j}\}_{j=1}^{K_{1}(\lambda_{n})}. Let

Δn\displaystyle\Delta_{n} :=𝐄𝒟n(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h(Xi),Yi))\displaystyle:={\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right)
=𝐄𝒟n(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h1,n(Xi),Yi)\displaystyle={\mathbf{E}}_{{\mathcal{D}}_{n}}\bigg{(}\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})
+1ni=1n(h1,n(Xi),Yi)1ni=1n(h(Xi),Yi)),\displaystyle+\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\bigg{)},

where h_{1,n}^{*}(x):=\sum_{j=1}^{K_{1}(\lambda_{n})}{\mathbb{I}}(x\in\mathcal{C}_{1,\lambda_{n},j})h(x_{1,\lambda_{n},j}) and x_{1,\lambda_{n},j} denotes the center of the cell \mathcal{C}_{1,\lambda_{n},j}. Note that \Delta_{n} depends on \pi_{1,\lambda_{n}}. Since \hat{h}_{1,n} is obtained by

c^b,λ,j=argminz[βn,βn]i:Xi𝒞b,λ,j(z,Yi),\hat{c}_{b,\lambda,j}=\arg\min_{z\in[-\beta_{n},\beta_{n}]}{\sum_{i:X_{i}\in{\small\mathcal{C}}_{b,\lambda,j}}}{\ell(z,Y_{i})},

we know

1ni=1n(h^1,n(Xi),Yi)1ni=1n(h1,n(Xi),Yi)0\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})\leq 0 (36)

once \beta_{n}>C, since h_{1,n}^{*} then takes values in [-C,C]\subseteq[-\beta_{n},\beta_{n}] and \hat{h}_{1,n} minimizes the empirical loss over [-\beta_{n},\beta_{n}] within each cell. At this point, we consider two cases for \Delta_{n}:

Case I: \Delta_{n}\leq 0. This case is trivial, since the upper bound derived in (38) below is nonnegative.

Case II: Δn>0\Delta_{n}>0. In this case, (36) implies

Δn\displaystyle\Delta_{n} 𝐄𝒟n|1ni=1n(h1,n(Xi),Yi)1ni=1n(h(Xi),Yi)|\displaystyle\leq{\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right|
1n𝐄𝒟n(i=1n|(h1,n(Xi),Yi)(h(Xi),Yi)|)\displaystyle\leq\frac{1}{n}{\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\sum_{i=1}^{n}\left|\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\ell(h(X_{i}),Y_{i})\right|\right)
𝐄X,Y(|(h1,n(X),Y)(h(X),Y)|).\displaystyle\leq{\mathbf{E}}_{X,Y}\left(|\ell(h_{1,n}^{*}(X),Y)-\ell(h(X),Y)|\right).

Let Dλ(X)D_{\lambda}(X) be the diameter of the cell that XX lies in. By Assumption 2, the above inequality further implies

Δn\displaystyle\Delta_{n} 𝐄X,Y(M1(C,Y)|h1,n(X)h(X)|)\displaystyle\leq{\mathbf{E}}_{X,Y}(M_{1}(C,Y)|h_{1,n}^{*}(X)-h(X)|)
𝐄X,Y(M1(C,Y)CDλn(X)β)\displaystyle\leq{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot C\cdot D_{\lambda_{n}}(X)^{\beta})
C𝐄X,Y(M1(C,Y)Dλn(X)β)\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta})
C𝐄X,Y(M1(C,Y)Dλn(X)β𝕀(|Y|lnn)+M1(C,Y)Dλn(X)β𝕀(|Y|>lnn))\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta}\mathbb{I}(|Y|\leq\ln n)+M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta}\mathbb{I}(|Y|>\ln n))
C𝐄X,Y(supy[lnn,lnn]M1(C,y)Dλn(X)β+M1(C,Y)dβ2𝕀(|Y|>lnn))\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot D_{\lambda_{n}}(X)^{\beta}+M_{1}(C,Y)\cdot d^{\frac{\beta}{2}}\cdot\mathbb{I}(|Y|>\ln n))
C(supy[lnn,lnn]M1(C,y)𝐄X(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn))),\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right), (37)

where the second line holds because h\in\mathcal{H}^{p,\beta}([0,1]^{d},C) satisfies |h(x)-h(x^{\prime})|\leq C\|x-x^{\prime}\|^{\beta}, the fifth line uses D_{\lambda_{n}}(X)\leq\sqrt{d}\ a.s., and the sixth line uses the Cauchy–Schwarz inequality.

Therefore, Case I and (37) in Case II imply that

ΔnC(supy[lnn,lnn]M1(C,y)𝐄X(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))a.s..\Delta_{n}\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)\ a.s.. (38)

Taking expectations on both sides of (38) with respect to the Mondrian process \pi_{1,\lambda_{n}} yields

II\displaystyle II C(supy[lnn,lnn]M1(C,y)𝐄X,λn(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X,\lambda_{n}}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)
C(supy[lnn,lnn]M1(C,y)𝐄X𝐄λn(Dλn(x)β|X=x)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)^{\beta}|X=x)+d^{\frac{\beta}{2}}\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)
C(supy[lnn,lnn]M1(C,y)𝐄X[(𝐄λn(Dλn(x)|X=x))β]+dβ2𝐄M12(C,Y)2exp(ln2n/4)),\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}\left[({\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x))^{\beta}\right]+d^{\frac{\beta}{2}}\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot\sqrt{2}\exp(-\ln^{2}n/4)\right), (39)

where \beta\in(0,1], and in the last line we use Jensen's inequality together with the concavity of the function v\mapsto v^{\beta} on (0,\infty). For any fixed x\in[0,1]^{d}, we can bound {\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x) by using Corollary 1 in Mourtada et al. (2020). In detail, we have

𝐄λn(Dλn(x)|X=x)\displaystyle{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x) 0d(1+λnδd)exp(λnδd)𝑑δ\displaystyle\leq\int_{0}^{\infty}{d\left(1+\frac{\lambda_{n}\delta}{\sqrt{d}}\right)\exp\left(-\frac{\lambda_{n}\delta}{\sqrt{d}}\right)d\delta}
2d321λn.\displaystyle\leq 2d^{\frac{3}{2}}\cdot\frac{1}{\lambda_{n}}. (40)
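For completeness, the substitution u=\lambda_{n}\delta/\sqrt{d} evaluates this integral exactly:

\int_{0}^{\infty}d\left(1+\frac{\lambda_{n}\delta}{\sqrt{d}}\right)\exp\left(-\frac{\lambda_{n}\delta}{\sqrt{d}}\right)d\delta=\frac{d^{3/2}}{\lambda_{n}}\int_{0}^{\infty}(1+u)e^{-u}du=\frac{2d^{3/2}}{\lambda_{n}}.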

Thus, the combination of (39) and (40) implies that

IIC(2d32βsupy[lnn,lnn]M1(C,y)λnβ+dβ2𝐄M12(C,Y)2exp(ln2n/4)).II\leq C\cdot\left(2d^{\frac{3}{2}\beta}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda_{n}^{-\beta}+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot\sqrt{2}\exp(-\ln^{2}n/4)\right). (41)

Analysis of Part III. This part can be bounded by a standard empirical process argument. Since \|h\|_{\infty}\leq C, we know by Assumption 3 that

(h(x),y)supv[C,C](v,y)M2(C,y),x[0,1]d,y\ell(h(x),y)\leq\sup_{v\in[-C,C]}\ell(v,y)\leq M_{2}(C,y),\ \forall x\in[0,1]^{d},y\in{\mathbb{R}}

with {\mathbf{E}}(M^{2}_{2}(C,Y))<\infty. Thus, M_{2}(C,y) is an envelope function of the singleton class \{h\}. Note that a single function forms a Glivenko–Cantelli class with VC dimension 1. Thus, an application of equation (80) in Sen (2018) implies

III:=𝐄(R^(h)R(h))cn𝐄(M22(C,Y))III:={\mathbf{E}}(\hat{R}(h)-R(h))\leq\frac{c}{\sqrt{n}}\cdot\sqrt{{\mathbf{E}}(M^{2}_{2}(C,Y))} (42)

for some universal c>0c>0.

Finally, the combination of (31), (32), (35), (41) and (42) completes the proof. \Box
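For readers who prefer an algorithmic view of the objects appearing in the proof, the following Python sketch fits the per-cell constants \hat{c}_{b,\lambda,j} by empirical risk minimization over [-\beta_{n},\beta_{n}] on a fixed toy partition. It is illustrative only: the two-cell partition stands in for a Mondrian partition, and the squared loss stands in for a general convex loss \ell, for which (whenever an unconstrained minimizer exists) the minimizer over an interval is its projection onto that interval.

import numpy as np

def fit_cell_constants(X, Y, cells, beta_n):
    """For each hyperrectangular cell, minimize sum_i l(z, Y_i) over z in [-beta_n, beta_n].
    With the squared loss, the constrained minimizer is the cell mean clipped to [-beta_n, beta_n]."""
    fitted = []
    for lower, upper in cells:
        in_cell = np.all((X >= lower) & (X <= upper), axis=1)
        c_hat = float(np.clip(Y[in_cell].mean(), -beta_n, beta_n)) if in_cell.any() else 0.0
        fitted.append((lower, upper, c_hat))
    return fitted

def predict(fitted, x):
    """Evaluate the fitted piecewise-constant tree function at a single point x."""
    for lower, upper, c_hat in fitted:
        if np.all((x >= lower) & (x <= upper)):
            return c_hat
    return 0.0

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)
cells = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),   # toy partition: split [0,1]^2 at x_1 = 0.5
         (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
fitted = fit_cell_constants(X, Y, cells, beta_n=np.log(200))
print(predict(fitted, np.array([0.25, 0.5])), predict(fitted, np.array([0.75, 0.5])))

Averaging the predictions of B such trees, each built on an independently drawn Mondrian partition, gives the forest estimator \hat{h}_{n} analyzed above.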

Proof of Corollary 1. This corollary follows directly from Theorem 1 and Assumption 5. \Box

Proof of Corollary 2. The proof starts from the observation that our class \mathcal{H}^{p,\beta}([0,1]^{d},C) can be used to approximate any general target function. Since m(x)\in\{f(x):{\mathbf{E}}f^{2}(X)<\infty\}, we know m(x) can be approximated by a sequence of continuous functions in the L^{2} sense. Thus, we can assume that m(x), x\in[0,1]^{d}, is continuous. Define the function \sigma_{log}(x)=e^{x}/(1+e^{x}), x\in{\mathbb{R}}. For any \varepsilon>0, by Lemma 16.1 in Györfi et al. (2002) there is h_{\varepsilon}(x)=\sum_{j=1}^{J}a_{\varepsilon,j}\sigma_{log}(\theta_{\varepsilon,j}^{\top}x+s_{\varepsilon,j}), x\in[0,1]^{d}, with a_{\varepsilon,j},s_{\varepsilon,j}\in{\mathbb{R}} and \theta_{\varepsilon,j}\in{\mathbb{R}}^{d} such that

{\mathbf{E}}\left(m(X)-h_{\varepsilon}(X)\right)^{2}\leq\sup_{x\in[0,1]^{d}}(m(x)-h_{\varepsilon}(x))^{2}\leq\frac{\varepsilon}{3}. (43)

After some simple calculation, we know hε(x)1,1([0,1]d,C(hε))h_{\varepsilon}(x)\in\mathcal{H}^{1,1}([0,1]^{d},C(h_{\varepsilon})), where C(hε)>0C(h_{\varepsilon})>0 depends on hεh_{\varepsilon} only. Now we fix such hε(x)h_{\varepsilon}(x). Make the decomposition as follows

𝐄R(h^1,n)R(m)=\displaystyle{\mathbf{E}}R(\hat{h}_{1,n})-R(m)= 𝐄(R(h^1,n)R^(h^1,n))+𝐄(R^(h^n)R^(m))\displaystyle{\mathbf{E}}(R(\hat{h}_{1,n})-\hat{R}(\hat{h}_{1,n}))+{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(m))
+𝐄(R^(m)R(m))\displaystyle+{\mathbf{E}}(\hat{R}(m)-R(m))
:=I+II+III.\displaystyle:=I+II+III. (44)

Parts I and III can be upper bounded by arguments similar to those in the proof of Theorem 1. Therefore, under the assumptions of our theorem, both of these parts converge to zero as n\to\infty. Next, we consider Part II. Note that

𝐄(R^(h^n)R^(m))\displaystyle{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(m)) =𝐄(R^(h^n)R^(hε))+𝐄(R(hε)R(m))\displaystyle={\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+{\mathbf{E}}(R(h_{\varepsilon})-R(m))
𝐄(R^(h^n)R^(hε))+𝐄(hε(X)m(X))2\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+{\mathbf{E}}(h_{\varepsilon}(X)-m(X))^{2}
𝐄(R^(h^n)R^(hε))+cε3,\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+c\cdot\frac{\varepsilon}{3},

where in the second line we use Assumption 5. Finally, we only need to consider the behavior of the term {\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon})) as n\to\infty. This can be done by using the analysis of Part II in the proof of Theorem 1. Taking C=C(h_{\varepsilon}) in (41), we have

𝐄(R^(h^n)R^(hε))C(2d32supy[lnn,lnn]M1(C,y)λn1+d12𝐄M12(C,Y)2exp(ln2n/4)),{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))\leq C\cdot\left(2d^{\frac{3}{2}}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda_{n}^{-1}+d^{\frac{1}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot\sqrt{2}\exp(-\ln^{2}n/4)\right), (45)

which goes to zero as n increases. Since \varepsilon>0 is arbitrary, we conclude that

\lim_{n\to\infty}\left({\mathbf{E}}R(\hat{h}_{n})-R(m)\right)=0.

The above display and Assumption 5 show that \hat{h}_{n} is L^{2} consistent for the general target function m(x), x\in[0,1]^{d}. \Box

Proof of Theorem 2. By the convexity of \ell in Assumption 1, we only need to consider the regret of \hat{h}^{*}_{1,n}. For any \lambda>0, by the definition of \hat{h}^{*}_{1,n} we know

𝐄(R^(h^1,n))\displaystyle{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})) 𝐄(R^(h^1,n)+αnλn,1)\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})+\alpha_{n}\cdot\lambda_{n,1}^{*})
𝐄(R^(h^1,n,λ)+αnλ),\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n,\lambda})+\alpha_{n}\cdot\lambda), (46)

where h^1,n,λ\hat{h}_{1,n,\lambda} is the estimator based on the process MP1(λ,[0,1]d)MP_{1}(\lambda,[0,1]^{d}).

On the other hand, we have the decomposition below

𝐄R(h^1,n)R(h)\displaystyle{\mathbf{E}}R(\hat{h}_{1,n}^{*})-R(h) =𝐄(R(h^1,n)R^(h^1,n))+𝐄(R^(h^1,n)R^(h))\displaystyle={\mathbf{E}}(R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))+{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})-\hat{R}(h))
+𝐄(R^(h)R(h)):=I+II+III.\displaystyle+{\mathbf{E}}(\hat{R}(h)-R(h)):=I+II+III. (47)

Firstly, we bound Part I. Recall A_{n}:=\{\max_{1\leq i\leq n}|Y_{i}|\leq\ln{n}\}, which is defined below (20). Make the decomposition of I as follows.

I\displaystyle I =𝐄((R(h^1,n)R^(h^1,n))𝕀(An))+𝐄((R(h^1,n)R^(h^1,n))𝕀(Anc))\displaystyle={\mathbf{E}}((R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))\cap\mathbb{I}(A_{n}))+{\mathbf{E}}((R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))\cap\mathbb{I}(A_{n}^{c}))
:=I1,1+I1,2\displaystyle:=I_{1,1}+I_{1,2} (48)

The key to bounding I_{1,1} is to find an upper bound for \lambda_{n,1}^{*}. By the definition of \hat{h}_{1,n}^{*} and Assumption 3, we know that on the event A_{n}

αnλn,1Pen(0)supy[lnn,lnn]M2(βn,y).\alpha_{n}\cdot\lambda_{n,1}^{*}\leq Pen(0)\leq\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y).

Therefore, when AnA_{n} happens we have

λn,1supy[lnn,lnn]M2(βn,y)αn.\lambda_{n,1}^{*}\leq\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}.

Following the arguments used to bound I_{1,1} in the proof of Theorem 1, we know

|I1,1|\displaystyle|I_{1,1}| cmax{βn,𝐄(M22(βn,Y))}n(1+λn,1)d\displaystyle\leq c\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\cdot(1+\lambda_{n,1}^{*})^{d}
cmax{βn,𝐄(M22(βn,Y))}n(1+supy[lnn,lnn]M2(βn,y)αn)d\displaystyle\leq c\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\cdot\left(1+\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}\right)^{d} (49)

Next, I_{1,2} in (48) can be bounded similarly to I_{1,2} in the proof of Theorem 1. Namely, we have

|I1,2|(supx[βn,βn]2(x,lnn)+𝐄(M22(βn,Y)))122exp(ln2n/4).|I_{1,2}|\leq\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{\ell^{2}(x,\ln n)}+{\mathbf{E}}(M^{2}_{2}(\beta_{n},Y))\right)^{\frac{1}{2}}\cdot\sqrt{2}\exp(-\ln^{2}n/4). (50)

Secondly, we use (46) to bound Part II in (47). By the definition of \hat{h}_{1,n}^{*}, for any \lambda>0 we have

II:=𝐄(R^(h^1,n)R^(h))𝐄(R^(h^1,n,λ)R^(h)+αnλ).II:={\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})-\hat{R}(h))\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n,\lambda})-\hat{R}(h)+\alpha_{n}\cdot\lambda).

Similar to the Proof of Theorem 1, the above inequality implies

II\leq C\cdot\left(2d^{\frac{3}{2}p}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda^{-p}+d^{\frac{p}{2}}\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\cdot\sqrt{2}e^{-\frac{\ln^{2}n}{4}}\right)+\alpha_{n}\cdot\lambda. (51)

Since (51) holds for all \lambda>0, taking \lambda=\left(\frac{1}{\alpha_{n}}\right)^{1/(p+1)} in (51) further implies

II\displaystyle II (2Csupy[lnn,lnn]M1(C,y)d32p+1)(αn)pp+1+rn\displaystyle\leq(2C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}d^{\frac{3}{2}p}+1)\cdot\left(\alpha_{n}\right)^{\frac{p}{p+1}}+r_{n}
(2Csupy[lnn,lnn]M1(C,y)d32p+1)(αn)p2+rn,\displaystyle\leq(2C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}d^{\frac{3}{2}p}+1)\cdot\left(\alpha_{n}\right)^{\frac{p}{2}}+r_{n}, (52)

where rn:=Cd12p𝐄(M12(C,Y))2eln2n4r_{n}:=C\cdot d^{\frac{1}{2}p}\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\cdot\sqrt{2}e^{-\frac{\ln^{2}n}{4}} is caused by the sub-Gaussian property of YY.
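The particular choice of \lambda above balances the two \lambda-dependent terms in (51): with \lambda=(1/\alpha_{n})^{1/(p+1)},

\lambda^{-p}=\alpha_{n}^{\frac{p}{p+1}}\qquad\text{and}\qquad\alpha_{n}\cdot\lambda=\alpha_{n}^{1-\frac{1}{p+1}}=\alpha_{n}^{\frac{p}{p+1}},

so both contribute at the rate \alpha_{n}^{p/(p+1)}, which gives the first line of (52).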

Thirdly, we consider Part III. The argument is the same as that used to obtain (42). Namely, we have

III:=𝐄(R^(h)R(h))cn𝐄(M22(C,Y)),III:={\mathbf{E}}(\hat{R}(h)-R(h))\leq\frac{c}{\sqrt{n}}\cdot\sqrt{{\mathbf{E}}(M^{2}_{2}(C,Y))}, (53)

where cc is universal and does not depend on CC.

Finally, the combination of (49), (50), (52) and (53) finishes the proof. \Box

Proof of Lemma 2. Note that Pen(h):=ln[0,1]dexp(h(x))𝑑xPen(h):=\ln\int_{[0,1]^{d}}{\exp{(h(x))}dx}. Define a real function as follows:

g(α):=Pen((1α)h+αhn), 0α1.g(\alpha):=Pen((1-\alpha)\cdot h+\alpha\cdot h_{n}^{*}),\ 0\leq\alpha\leq 1.

Thus, we have g(0)=Pen(h) and g(1)=Pen(h_{n}^{*}), and it will be convenient to work with g in the analysis of this penalty. Since both h and h_{n}^{*} are bounded, we know g(\alpha) is differentiable and its derivative is

\frac{d}{d\alpha}g(\alpha)=\frac{\int_{[0,1]^{d}}{(h_{n}^{*}(x)-h(x))\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}dx}}{\int_{[0,1]^{d}}{\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}dx}}. (54)

Define a continuous random vector ZαZ_{\alpha} with the density function

f_{Z_{\alpha}}(x):=\frac{\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}}{\int_{[0,1]^{d}}{\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}dx}},\ x\in[0,1]^{d}. (55)

From (54) and (55), we know

ddαg(α)=𝐄Zα(hn(Zα)h(Zα)).\frac{d}{d\alpha}g(\alpha)={\mathbf{E}}_{Z_{\alpha}}(h_{n}^{*}(Z_{\alpha})-h(Z_{\alpha})). (56)

On the other hand, the Lagrange mean value theorem implies

|Pen(h)Pen(hn(x))|\displaystyle|Pen(h)-Pen(h_{n}^{*}(x))| =|g(0)g(1)|\displaystyle=|g(0)-g(1)|
=|ddαg(α)|α=α|\displaystyle=\left|\frac{d}{d\alpha}g(\alpha)|_{\alpha=\alpha^{*}}\right|
\displaystyle\leq{\mathbf{E}}_{Z_{\alpha^{*}}}(|h_{n}^{*}(Z_{\alpha^{*}})-h(Z_{\alpha^{*}})|), (57)

where \alpha^{*}\in[0,1]. Thus, it remains to bound the last term of (57). Since f_{Z_{\alpha^{*}}}(x)\leq\exp{(2C)} for all x\in[0,1]^{d} and all \alpha^{*}\in[0,1], we know from (57) that

|Pen(h)Pen(hn(x))|exp(2C)𝐄U(|hn(U)h(U)|),|Pen(h)-Pen(h_{n}^{*}(x))|\leq\exp{(2C)}\cdot{\mathbf{E}}_{U}(|h_{n}^{*}(U)-h(U)|), (58)

where U follows the uniform distribution on [0,1]^{d} and is independent of \pi_{\lambda}. By further calculation, we have

𝐄πλ|Pen(h)Pen(hn(x))|\displaystyle{\mathbf{E}}_{\pi_{\lambda}}|Pen(h)-Pen(h_{n}^{*}(x))| exp(2C)𝐄πλ𝐄U(|hn(U)h(U)|)\displaystyle\leq\exp(2C)\cdot{\mathbf{E}}_{\pi_{\lambda}}{\mathbf{E}}_{U}(|h_{n}^{*}(U)-h(U)|)
exp(2C)𝐄U𝐄πλ(CDλn(U)β)\displaystyle\leq\exp(2C)\cdot{\mathbf{E}}_{U}{\mathbf{E}}_{\pi_{\lambda}}(C\cdot D_{\lambda_{n}}(U)^{\beta})
exp(2C)𝐄U[(𝐄λn(Dλn(u)|U=u))β].\displaystyle\leq\exp(2C)\cdot{\mathbf{E}}_{U}\left[({\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(u)|U=u))^{\beta}\right].

From (40), we already know 𝐄λn(Dλn(u)|U=u)2d321λn{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(u)|U=u)\leq 2d^{\frac{3}{2}}\cdot\frac{1}{\lambda_{n}}. Thus, above inequality implies

𝐄πλ|Pen(h)Pen(hn(x))|exp(2C)2βd32β(1λn)β.{\mathbf{E}}_{\pi_{\lambda}}|Pen(h)-Pen(h_{n}^{*}(x))|\leq\exp(2C)\cdot 2^{\beta}d^{\frac{3}{2}\beta}\cdot\left(\frac{1}{\lambda_{n}}\right)^{\beta}.

This completes the proof. \Box

Proof of Lemma 3. First, we calculate the term \frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g), where \int{g(x)dx}=0. With some calculation, it is not difficult to show that

\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g)=Var\left(g(X_{\alpha})\right), (59)

where X_{\alpha} is a continuous random vector in [0,1]^{d} with density f_{X_{\alpha}}(x)=\exp(g_{\alpha}(x))/\int{\exp(g_{\alpha}(x))dx}, where g_{\alpha}(x)=h_{0}(x)+\alpha g(x). Since \|h_{0}\|_{\infty}\leq c and \|g\|_{\infty}\leq\beta_{n}, we can assume without loss of generality that \|g_{\alpha}\|_{\infty}\leq\beta_{n}. This implies that

exp(2βn)fXα(x)exp(2βn),x[0,1]d.\exp{(-2\beta_{n})}\leq f_{X_{\alpha}}(x)\leq\exp{(2\beta_{n})},\ \forall x\in[0,1]^{d}. (60)
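Indeed, (60) follows because \|g_{\alpha}\|_{\infty}\leq\beta_{n} bounds both the numerator and the normalizing constant of f_{X_{\alpha}}:

e^{-\beta_{n}}\leq\exp(g_{\alpha}(x))\leq e^{\beta_{n}}\qquad\text{and}\qquad e^{-\beta_{n}}\leq\int_{[0,1]^{d}}\exp(g_{\alpha}(x))dx\leq e^{\beta_{n}},

so that the ratio lies in [\exp(-2\beta_{n}),\exp(2\beta_{n})].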

Let U follow the uniform distribution on [0,1]^{d}. Then (60) implies

Var(g(Xα))\displaystyle Var\left(g(X_{\alpha})\right) =infc>0𝐄(g(Xα)c)2\displaystyle=\inf_{c>0}{\mathbf{E}}(g(X_{\alpha})-c)^{2}
exp(2βn)infc>0𝐄(g(U)c)2\displaystyle\leq\exp{(2\beta_{n})}\cdot\inf_{c>0}{\mathbf{E}}(g(U)-c)^{2}
=exp(2βn)Var(g(U))=exp(2βn)𝐄(g2(U)).\displaystyle=\exp{(2\beta_{n})}\cdot Var(g(U))=\exp{(2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(U)). (61)

With the same argument, we also have

Var(g(Xα))exp(2βn)𝐄(g2(U)).Var\left(g(X_{\alpha})\right)\geq\exp{(-2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(U)). (62)

The combination of (59), (61) and (62) shows that

cexp(2βn)𝐄(g2(X))d2dα2R(h0+αg)c1exp(2βn)𝐄(g2(X))c\cdot\exp{(-2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(X))\leq\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g)\leq c^{-1}\cdot\exp{(2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(X)) (63)

for some universal c>0c>0 and any α[0,1]\alpha\in[0,1]. Finally, by Taylor expansion, we have

R(h)=R(h_{0})+\frac{d}{d\alpha}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=0}+\frac{1}{2}\cdot\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=\alpha^{*}}

for some \alpha^{*}\in[0,1]. Without loss of generality, we can assume that \|h-h_{0}\|_{\infty}\leq\beta_{n}. Thus, the second derivative \frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=\alpha^{*}} can be bounded by using (63) with g=h-h_{0}. Meanwhile, the first derivative \frac{d}{d\alpha}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=0}=0 since h_{0} minimizes R(\cdot). Based on this analysis, we have

c1lnn𝐄(h(X)h0(X))2R(h)R(h0)c1lnn𝐄(h(X)h0(X))2c\cdot\frac{1}{\ln n}\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}\leq R(h)-R(h_{0})\leq c^{-1}\cdot\ln n\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}

for some universal c>0c>0. This completes the proof. \Box

References

  • Arlot and Genuer (2014) Arlot, S. and R. Genuer (2014). Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939.
  • Athey et al. (2019) Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of Statistics 47(2), 1148–1178.
  • Biau (2012) Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research 13(1), 1063–1095.
  • Biau et al. (2008) Biau, G., L. Devroye, and G. Lugosi (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research 9(9).
  • Blumer et al. (1989) Blumer, A., A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989). Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM) 36(4), 929–965.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.
  • Cattaneo et al. (2023) Cattaneo, M. D., J. M. Klusowski, and W. G. Underwood (2023). Inference with mondrian random forests. arXiv preprint arXiv:2310.09702.
  • Györfi et al. (2002) Györfi, L., M. Kohler, A. Krzyżak, and H. Walk (2002). A distribution-free theory of nonparametric regression, Volume 1. Springer.
  • Huang (1998) Huang, J. Z. (1998). Functional anova models for generalized regression. Journal of multivariate analysis 67(1), 49–71.
  • Klusowski (2021) Klusowski, J. M. (2021). Universal consistency of decision trees in high dimensions. arXiv preprint arXiv:2104.13881.
  • Knight (1998) Knight, K. (1998). Limiting distributions for l1 regression estimators under general conditions. Annals of statistics, 755–770.
  • Kosorok (2008) Kosorok, M. R. (2008). Introduction to empirical processes and semiparametric inference. Springer.
  • Lakshminarayanan et al. (2014) Lakshminarayanan, B., D. M. Roy, and Y. W. Teh (2014). Mondrian forests: Efficient online random forests. Advances in neural information processing systems 27.
  • Liaw et al. (2002) Liaw, A., M. Wiener, et al. (2002). Classification and regression by randomforest. R news 2(3), 18–22.
  • Mourtada et al. (2020) Mourtada, J., S. Gaïffas, and E. Scornet (2020). Minimax optimal rates for Mondrian trees and forests. The Annals of Statistics 48(4), 2253 – 2276.
  • Mourtada et al. (2021) Mourtada, J., S. Gaïffas, and E. Scornet (2021). Amf: Aggregated mondrian forests for online learning. Journal of the Royal Statistical Society Series B: Statistical Methodology 83(3), 505–533.
  • Roy (2011) Roy, D. M. (2011). Computability, inference and modeling in probabilistic programming. Ph. D. thesis, Massachusetts Institute of Technology.
  • Roy and Teh (2008) Roy, D. M. and Y. W. Teh (2008). The mondrian process. In Advances in neural information processing systems, pp. 1377–1384.
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics 48(4), 1875–1897.
  • Scornet et al. (2015) Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests. The Annals of Statistics 43(4), 1716–1741.
  • Sen (2018) Sen, B. (2018). A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University 11, 28–29.
  • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and S. Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • Stone (1986) Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 590–606.
  • Stone (1994) Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. The annals of statistics 22(1), 118–171.
  • Zhang and Wang (2009) Zhang, H. and M. Wang (2009). Search for the smallest random forest. Statistics and Its Interface 2(3), 381.