
Non-asymptotic Properties of Generalized Mondrian Forests in Statistical Learning

Zhan Haoran   
Department of DSDS, National University of Singapore
and
Wang Jingli
School of Statistics and Data Science, Nankai University
and
Xia Yingcun
Department of DSDS, National University of Singapore
Abstract

Since the publication of Breiman (2001), Random Forests (RF) have been widely used in both regression and classification. Other forests have since been proposed and studied in the literature, and Mondrian Forests, built on the Mondrian process, are a notable example; see Lakshminarayanan et al. (2014). In this paper, we propose an ensemble estimator for general statistical learning based on Mondrian Forests, which can be regarded as an extension of RF. This general framework includes many common learning problems, such as least squares regression, $\ell_1$ regression, quantile regression and classification. Under mild conditions on the loss function, we give an upper bound on the regret function of our estimator and show that the estimator is statistically consistent.


Keywords: Statistical learning, Machine learning, Random forests, Ensemble learning, Regret function.

1 Introduction

Random Forest (RF), introduced in Breiman (2001), is a very popular ensemble learning technique in machine learning for classification and regression tasks. It operates by constructing multiple decision trees during training and averaging their predictions for improved accuracy and robustness. Many empirical studies have documented its strong performance across different domains and data characteristics; see, for example, Liaw et al. (2002). However, only a few of its theoretical properties are known so far, owing to the complicated tree structure and the data-dependent splitting rule (CART). Scornet et al. (2015) were the first to show its statistical consistency when the true regression function follows an additive model. Later, Klusowski (2021) established the consistency of RF for a general true regression function. While many other papers study the theory of RF, only these two focus on the original RF of Breiman (2001) by analyzing its key splitting rule, CART.

To gain deeper insight into random forests, additional research has delved into modified and stylized versions of the RF in Breiman (2001). One such method is Purely Random Forests (PRF) (see, for example, Arlot and Genuer (2014), Biau (2012) and Biau et al. (2008)), where individual trees are grown independently of the sample, making them well suited for theoretical analysis. In this paper, our interest is the Mondrian Forest, an ensemble learning algorithm that combines elements of decision trees and random forests with partitioning strategies driven by the Mondrian process. This kind of forest was introduced in Lakshminarayanan et al. (2014), which showed that Mondrian Forests have competitive online performance in classification problems compared to other state-of-the-art methods. Inspired by this attractive online property, Mourtada et al. (2021) studied the theory of online regression and classification using Mondrian Forests. Researchers have also studied the theory of Mondrian Forests for batch data, i.e., when the data set has already been collected by experimenters. Mourtada et al. (2020) gave consistency and convergence rates for Mondrian trees and forests that are minimax optimal on the set of Hölder continuous functions. Later, Cattaneo et al. (2023) followed the line of Mourtada et al. (2020) and derived the asymptotic normal distribution of Mondrian Forests for the offline regression problem.

Instead of considering only classical regression and classification problems, we show in this paper that Mondrian forests can also be applied to a broader range of statistical/machine learning problems, such as generalized regression, density estimation and quantile regression. Our main contributions are two-fold:

  • First, we propose a general framework (estimator) based on Mondrian forest that can be used in different learning problems.

  • Second, we study the upper bound of the regret (risk) function of the above forest estimator. Our theoretical results apply to many learning problems, and corresponding examples are given in Section 6.

1.1 Related work

Our way of generalizing RF takes a global perspective, while the generalization of RF in Athey et al. (2019) starts from a local perspective. In other words, by solving one optimization problem with all data points, we estimate the objective function $m(x)$ for all $x\in[0,1]^d$, whereas Athey et al. (2019) can only estimate a specific point $m(x_0)$. Therefore, our generalized method can save substantial computation time, especially when the dimension $d$ is large. Secondly, the globalized method can also be applied easily to statistical problems with a penalty $Pen(m)$, where $Pen(m)$ is a functional of $m(x),\ x\in[0,1]^d$. In Section 6.4, we show the application of our method to one such penalized optimization, namely nonparametric density estimation. Since Athey et al. (2019) only perform estimation pointwise, it is difficult for them to ensure that the obtained estimator satisfies shape constraints, and the corresponding case is not included in their scope.

2 Background and Preliminaries

2.1 Goal in Statistical Learning

Let $(X,Y)\in[0,1]^d\times\mathbb{R}$ be a random vector, where the range of $X$ has been normalized. In statistical learning, we would like a policy $h$ for learning $Y$ from $X$, defined as a function $h:[0,1]^d\to\mathbb{R}$. Usually, a loss function $\ell(h(x),y):\mathbb{R}\times\mathbb{R}\to[0,\infty)$ is used to measure the discrepancy, or loss, between the decision $h(x)$ and the target $y$. Taking expectations w.r.t. $X,Y$, the risk function

R(h):=\mathbf{E}(\ell(h(X),Y))    (1)

denotes the average loss incurred by the policy $h$. Naturally, one selects the best policy $h^*$ by minimizing the average loss over some function class $\mathcal{H}_1$, namely

h^{*}=\arg\min_{h\in\mathcal{H}_{1}}R(h).

Therefore, $R(h^*)$ is the smallest risk one can hope to achieve in theory. In practice, the distribution of $(X,Y)$ is unknown and (1) cannot be used to compute $R(h)$, so the best $h^*$ cannot be obtained directly. Instead, we use i.i.d. data $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$ and approximate $R(h)$ by the empirical risk function

\hat{R}(h):=\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i}).

Traditionally, one finds an estimator/policy $\hat{h}_n:[0,1]^d\to\mathbb{R}$ by minimizing $\hat{R}(h)$ over a known function class; see spline and wavelet regression in Györfi et al. (2002) and regression by deep neural networks in Schmidt-Hieber (2020). Recently, instead of minimizing $\hat{R}(h)$ globally, tree-based algorithms have been applied to construct the empirical estimator $\hat{h}_n$. According to many practitioners' experience, the strong predictive ability of RFs demonstrates the superiority of tree-based estimators over many traditional methods in statistical learning. In this paper, we are interested in bounding the regret function

\varepsilon(\hat{h}_{n}):=R(\hat{h}_{n})-R(h^{*}),

where $\hat{h}_n$ will be an ensemble estimator obtained from Mondrian Forests.

2.2 Mondrian partitions

Mondrian partitions are a special case of random tree partitions, in which the partition of $[0,1]^d$ is independent of the data points. The scheme depends entirely on a stochastic process, called the Mondrian process and denoted by $MP([0,1]^d)$. The Mondrian process $MP([0,1]^d)$ is a distribution on infinite tree partitions of $[0,1]^d$ introduced by Roy and Teh (2008) and Roy (2011). To reduce notation, the details of its definition are omitted here and can be found in Definition 3 of Mourtada et al. (2020).

Let us come back to Mondrian partitions. The partition used in this paper is the Mondrian partition with stopping time $\lambda$, denoted by $MP(\lambda,[0,1]^d)$ (see Section 1.3 in Mourtada et al. (2020)). Its construction consists of two steps. First, we construct partitions according to $MP([0,1]^d)$ by iteratively splitting cells at random times that depend on the linear dimension of each cell. The probability of splitting along each side is proportional to the side length of the cell, and the splitting position is chosen uniformly. Second, we remove the nodes whose birth time is later than the tuning parameter $\lambda>0$. Note that each tree node generated in the first step is assigned a specific birth time, and as the tree grows deeper the birth times increase. Therefore, the second step can be regarded as a pruning process, which helps us choose the best tree model in learning problems.

Let $C=\Pi_{j=1}^d C^j\subseteq[0,1]^d$ with closed intervals $C^j=[a_j,b_j]$. Denote $|C^j|=b_j-a_j$ and $|C|=\sum_{j=1}^d|C^j|$. Let $Exp(|C|)$ be an exponential distribution with rate $|C|>0$. Then, our Mondrian partition with stopping time $\lambda>0$ can be generated according to Algorithm 1. This algorithm is a recursive process, initialized with the root node $[0,1]^d$. An example illustrating how Algorithm 1 works is given in Figure 1.

Input: A cell $C=\Pi_{j=1}^d C^j\subseteq[0,1]^d$, starting time $\tau$ and stopping time $\lambda$.
1  Sample a random variable $E_C\sim Exp(|C|)$;
2  if $\tau+E_C\leq\lambda$ then
3      Randomly choose a split dimension $J\in\{1,\ldots,d\}$ with $\mathbf{P}(J=j)=(b_j-a_j)/|C|$;
4      Randomly choose a split threshold $S_J$ in $[a_J,b_J]$;
5      Split $C$ along the split $(J,S_J)$: let $C_0=\{x\in C:x_J\leq S_J\}$ and $C_1=C\setminus C_0$;
6      return $\text{MondrianPartition}(C_0,\tau+E_C,\lambda)\cup\text{MondrianPartition}(C_1,\tau+E_C,\lambda)$;
7  else
8      return $\{C\}$ (i.e., do not split $C$)
9  end if
Algorithm 1 $\text{MondrianPartition}(C,\tau,\lambda)$: Generate a Mondrian partition of cell $C\subseteq[0,1]^d$, starting from time $\tau$ and until time $\lambda$.
Figure 1: An example of a Mondrian partition (left) with the corresponding tree structure (right), indicating how the tree grows with time. There are four splitting times in this demo, namely $1,2,3,4$, denoted by bullets ($\bullet$), and the stopping time is $\lambda=4$.
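For readers who prefer code, the following is a minimal Python sketch of Algorithm 1 (our illustration, not the authors' implementation). The helper name mondrian_partition is hypothetical, NumPy is assumed, and the split time is drawn with rate $|C|$, the convention of Mourtada et al. (2020).

import numpy as np

def mondrian_partition(cell, tau, lam, rng):
    """Recursively split `cell` (a list of (a_j, b_j) intervals) until time `lam`.

    `tau` is the birth time of `cell`; the function returns a list of leaf cells.
    """
    side_lengths = np.array([b - a for a, b in cell])
    linear_dim = side_lengths.sum()                    # |C| = sum_j |C^j|
    E_C = rng.exponential(scale=1.0 / linear_dim)      # split time ~ Exp with rate |C|
    if tau + E_C <= lam:
        # choose the split dimension J with P(J = j) proportional to |C^j|
        J = rng.choice(len(cell), p=side_lengths / linear_dim)
        a_J, b_J = cell[J]
        S_J = rng.uniform(a_J, b_J)                    # uniform split threshold
        left = list(cell); left[J] = (a_J, S_J)
        right = list(cell); right[J] = (S_J, b_J)
        return (mondrian_partition(left, tau + E_C, lam, rng)
                + mondrian_partition(right, tau + E_C, lam, rng))
    return [cell]                                      # birth time exceeds lam: keep as a leaf

rng = np.random.default_rng(0)
leaves = mondrian_partition([(0.0, 1.0)] * 2, tau=0.0, lam=4.0, rng=rng)
print(len(leaves), "cells in a Mondrian partition of [0,1]^2 with stopping time 4")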

3 Methodology

Let $MP_b([0,1]^d),\ b=1,\ldots,B$, be independent Mondrian processes. When we prune each tree at time $\lambda>0$, independent partitions $MP_b(\lambda,[0,1]^d),\ b=1,\ldots,B$, are obtained, in which all cuts after time $\lambda$ are ignored. In this case, we can write $MP_b(\lambda,[0,1]^d)=\{\mathcal{C}_{b,\lambda,j}\}_{j=1}^{K_b(\lambda)}$ satisfying

[0,1]^{d}=\bigcup_{j=1}^{K_{b}(\lambda)}\mathcal{C}_{b,\lambda,j}\ \ \text{and}\ \ \mathcal{C}_{b,\lambda,j_{1}}\cap\mathcal{C}_{b,\lambda,j_{2}}=\varnothing,\ \forall j_{1}\neq j_{2},

where $\mathcal{C}_{b,\lambda,j}\subseteq[0,1]^d$ denotes a cell of the partition $MP_b(\lambda,[0,1]^d)$. For each cell $\mathcal{C}_{b,\lambda,j}$, a constant $\hat{c}_{b,\lambda,j}\in\mathbb{R}$ is used as the predictor of $h(x)$ in this small region, where

\hat{c}_{b,\lambda,j}=\arg\min_{z\in[-\beta_{n},\beta_{n}]}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}\ell(z,Y_{i})    (2)

and $\beta_n>0$ denotes a truncation threshold. For any fixed $y\in\mathbb{R}$, $\ell(\cdot,y)$ is usually a convex function of its first argument in machine learning, so the optimization (2) over $[-\beta_n,\beta_n]$ guarantees the existence of $\hat{c}_{b,\lambda,j}$ in general; we let $\beta_n\to\infty$ as $n$ goes to infinity. Then, for each $1\leq b\leq B$, we obtain an estimator of $h(x)$:

\hat{h}_{b,n}(x):=\sum_{j=1}^{K_{b}(\lambda)}\hat{c}_{b,\lambda,j}\cdot\mathbb{I}(x\in\mathcal{C}_{b,\lambda,j}),\ x\in[0,1]^{d},

where $\mathbb{I}(\cdot)$ denotes the indicator function. By applying the ensemble technique, the final estimator is given by

\hat{h}_{n}(x):=\frac{1}{B}\sum_{b=1}^{B}\hat{h}_{b,n}(x),\ x\in[0,1]^{d}.    (3)

If a cell $\mathcal{C}_{b,\lambda,j}$ does not contain any data point, we simply use $0$ as the predictor on the corresponding region.
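To make the construction concrete, here is a minimal Python sketch of the per-cell fit (2) and the ensemble (3). It is our illustration, not the authors' implementation: the helper names (fit_cells, predict_tree, predict_forest) and the user-supplied loss(z, y) function are hypothetical, mondrian_partition is the helper sketched after Algorithm 1, and NumPy/SciPy are assumed.

import numpy as np
from scipy.optimize import minimize_scalar

def fit_cells(leaves, X, Y, loss, beta_n):
    """Return one fitted constant per leaf, as in (2); empty leaves get 0."""
    constants = []
    for cell in leaves:
        in_cell = np.all([(X[:, j] >= a) & (X[:, j] <= b)
                          for j, (a, b) in enumerate(cell)], axis=0)
        if not in_cell.any():
            constants.append(0.0)                       # empty cell: predictor 0
            continue
        obj = lambda z: sum(loss(z, y) for y in Y[in_cell])
        res = minimize_scalar(obj, bounds=(-beta_n, beta_n), method="bounded")
        constants.append(res.x)
    return np.array(constants)

def predict_tree(x, leaves, constants):
    # Boundary points may match the first containing cell; the half-open
    # convention of Algorithm 1 is glossed over in this sketch.
    for cell, c in zip(leaves, constants):
        if all(a <= x[j] <= b for j, (a, b) in enumerate(cell)):
            return c
    return 0.0

def predict_forest(x, trees):
    """Average the B tree predictions, cf. (3); `trees` is a list of (leaves, constants)."""
    return np.mean([predict_tree(x, leaves, consts) for leaves, consts in trees])

For the squared loss, the bounded minimization in fit_cells reduces to the within-cell average, recovering the regression forest of Mourtada et al. (2020) discussed next.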

Let us clarify the relationship between (3) and the traditional RF. If we take the $\ell^2$ loss function $\ell(x,y)=(x-y)^2$ and $|Y|\leq\beta_n$, it can be checked that

\hat{c}_{b,\lambda,j}=\frac{1}{Card(\{i:X_{i}\in\mathcal{C}_{b,\lambda,j}\})}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}Y_{i},

where $Card(\cdot)$ denotes the cardinality of a set. In this case, the problem is least squares regression, and the estimator in (3) coincides exactly with that in Mourtada et al. (2020). In conclusion, our estimator $\hat{h}_{b,n}(x)$ can be regarded as an extension of the regression forests in Mourtada et al. (2020), since $\ell(x,y)$ can be chosen arbitrarily by the practitioner.

Examining the above algorithm carefully, we see that there are two tuning parameters in the construction of $\hat{h}_n$: $\lambda$ and $B$. The stopping time $\lambda$ controls the model complexity of Mondrian forests. Generally speaking, the cardinality of a tree partition increases as $\lambda$ goes to infinity. Thus, a large $\lambda$ helps reduce the bias of the forest estimator, while a small $\lambda$ helps control the generalization error of (3). Therefore, the choice of $\lambda$ involves a trade-off. To ensure the consistency of $\hat{h}_n$, we let $\lambda$ depend on the sample size $n$ and write $\lambda_n$ in the following analysis. The second parameter, $B$, denotes the number of Mondrian trees. There are studies on its selection for RF; see, for example, Zhang and Wang (2009). In practice, many practitioners take $B=500$ in their computations.

4 Regret function of Mondrian forests

In this section, we study the upper bound of the regret of the forest estimator (3) constructed from Mondrian processes. First, we need some mild restrictions on the loss function $\ell(x,y)$.

Assumption 1.

The loss function satisfies $\ell(x,y)\in\mathbb{R}$ for all $x,y\in\mathbb{R}$, and $\ell(\cdot,y)$ is convex for any fixed $y\in S\subseteq\mathbb{R}$ with $\mathbf{P}(Y\in S)=1$.

Assumption 2.

There exists a non-negative function $M_1(x,y)>0$ with $x>0,\ y\in\mathbb{R}$ such that for any $y\in\mathbb{R}$, $\ell(\cdot,y)$ is Lipschitz continuous: for any $x_1,x_2\in[-x,x]$ and $y\in\mathbb{R}$, we have

|\ell(x_{1},y)-\ell(x_{2},y)|\leq M_{1}(x,y)|x_{1}-x_{2}|.

Assumption 3.

There exists an envelope function $M_2(x,y)>0$ such that for any $x,y\in\mathbb{R}$,

\left|\sup_{x^{\prime}\in[-x,x]}\ell(x^{\prime},y)\right|\leq M_{2}(x,y)\ \ \text{and}\ \ \mathbf{E}(M_{2}^{2}(x,Y))<\infty.

We will see in the next section that many commonly used loss functions satisfy Assumptions 1-3, including the $\ell_2$ loss and the $\ell_1$ loss. In the following analysis, we suppose $Y$ is a sub-Gaussian random variable and $X$ takes values in $[0,1]^d$. In detail, we make the following assumption about the distribution of $Y$.

Assumption 4.

For some $\sigma>0$, we have $\mathbf{P}(|Y-\mathbf{E}(Y)|>t)\leq 2\exp(-t^2/(2\sigma^2))$ for each $t>0$. To simplify notation, we always assume later that $\sigma=1$.

Our theoretical results involve the $(p,C)$-smooth class below. This class is considered in the analysis of the regret function since it is large enough to be dense in the $L^2$ space generated by any probability measure.

Definition 1 ($(p,C)$-smooth class).

Let $p=s+\beta>0$, where $s$ is a non-negative integer and $\beta\in(0,1]$, and let $C>0$. The $(p,C)$-smooth ball with radius $C$, denoted by $\mathcal{H}^{p,\beta}([0,1]^d,C)$, is the set of $s$ times differentiable functions $h:[0,1]^d\to\mathbb{R}$ such that

|\nabla^{s}h(x_{1})-\nabla^{s}h(x_{2})|\leq C\|x_{1}-x_{2}\|_{2}^{\beta},\ \ \forall x_{1},x_{2}\in[0,1]^{d},

and

\sup_{x\in[0,1]^{d}}|h(x)|<C,

where $\|\cdot\|_2$ denotes the $\ell^2$ norm on $\mathbb{R}^d$ and $\nabla$ is the gradient operator.

The main result in this section is presented in Theorem 1.

Theorem 1 (Excess risk of Mondrian forests).

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$, we have

\begin{aligned}
\mathbf{E}R(\hat{h}_{n})-R(h) &\leq c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}(1+\lambda_{n})^{d}+2d^{\frac{3}{2}p}C\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\cdot\frac{1}{\lambda_{n}^{p}}\\
&\quad+\frac{c_{1}}{\sqrt{n}}\cdot\sqrt{\mathbf{E}(M^{2}_{2}(C,Y))}\\
&\quad+c_{1}\left(\sup_{x\in[-\beta_{n},\beta_{n}]}|\ell(x,\ln n)|+\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}+C\sqrt{\mathbf{E}(M_{1}^{2}(C,Y))}\right)\cdot e^{-c_{2}\cdot\ln^{2}n},
\end{aligned}    (4)

where $c_1,c_2>0$ are universal constants.

Remark 1.

The first term on the RHS of (4) relates to the generalization error of the forest, and the second is the approximation error of the Mondrian forest to $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$. The second line of (4) measures the error incurred when the empirical loss $\hat{R}(h)$ is used to approximate its theoretical version $R(h)$. Finally, the last line is due to the sub-Gaussian property of $Y$; the terms in this line disappear if we further assume $Y$ is bounded.

Remark 2.

In many applications, we will see later that the coefficients above, such as $\mathbf{E}(M_2^2(\beta_n,Y))$ and $\sup_{y\in[-\ln n,\ln n]}M_1(C,y)$, are only of polynomial order in $\ln n$. Since the term $e^{-c_2\cdot\ln^2 n}$ decays to zero faster than any polynomial rate, the last line of (4) usually has no influence on the convergence speed of the regret function. In conclusion, we can expect that only the first two lines dominate the convergence rate.

In statistics, the quantity of interest is the true function $m$ that satisfies

m:=\arg\min_{g}R(g).    (5)

Theorem 1 can then also be used to analyze the consistency of $\hat{h}_n$. To meet this requirement, the following assumption is necessary.

Assumption 5.

For any $h:[0,1]^d\to\mathbb{R}$, there are $c>0$ and $\kappa\geq 1$ such that

c^{-1}\cdot\mathbf{E}(h(X)-m(X))^{\kappa}\leq R(h)-R(m)\leq c\cdot\mathbf{E}(h(X)-m(X))^{\kappa}.

To simplify notation, we denote the last line of (4) by $Res(n)$. Namely,

Res(n):=c_{1}\left(\sup_{x\in[-\beta_{n},\beta_{n}]}|\ell(x,\ln n)|+\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}+C\sqrt{\mathbf{E}(M_{1}^{2}(C,Y))}\right)\cdot e^{-c_{2}\cdot\ln^{2}n}.    (6)

Then, the consistency results for Mondrian forests are summarized in Corollaries 1 and 2.

Corollary 1 (Consistency rate of Mondrian forests).

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. Suppose the true function $m\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ and Assumption 5 is satisfied. Then,

\begin{aligned}
\mathbf{E}\left(\hat{h}_{n}(X)-m(X)\right)^{\kappa} &\leq c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}(1+\lambda_{n})^{d}+2d^{\frac{3}{2}p}C\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\cdot\frac{1}{\lambda_{n}^{p}}\\
&\quad+\frac{c_{1}}{\sqrt{n}}\cdot\sqrt{\mathbf{E}(M^{2}_{2}(C,Y))}+Res(n),
\end{aligned}    (7)

where $c_1,c_2>0$ are universal constants.

Corollary 2 (Consistency of Mondrian forests).

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. Suppose $m(X)$ is only assumed to be $L^2$ integrable on $[0,1]^d$ with $\mathbf{E}(\ell^2(m(X),Y))<\infty$, and that Assumption 5 is satisfied. If $\lambda_{n}=o\left(\left(\frac{\sqrt{n}}{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}\right)^{\frac{1}{d}}\right)$, $\lambda_{n}^{-1}\cdot\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\to 0$ and $Res(n)\to 0$, we have

\lim_{n\to\infty}\mathbf{E}\left(\hat{h}_{n}(X)-m(X)\right)^{\kappa}=0.

5 The choice of $\lambda_n$

In practice, the best $\lambda_n$ is unknown and we need a criterion to stop the growth of Mondrian forests. Here, we use a penalty method. For each $1\leq b\leq B$, define

Pen(\lambda_{n,b}):=\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{b,n}(X_{i}),Y_{i})+\alpha_{n}\cdot\lambda_{n,b},    (8)

where $\alpha_n>0$ controls the strength of the penalty and $\hat{h}_{b,n}$ is constructed from the Mondrian partition $MP_b(\lambda_{n,b},[0,1]^d)$. Then, the best $\lambda_{n,b}^*$ is chosen by

\lambda_{n,b}^{*}:=\arg\min_{\lambda\geq 0}Pen(\lambda).

Denote by $\hat{h}_{b,n}^*$ the tree estimator constructed using the Mondrian partition $MP_b(\lambda_{n,b}^*,[0,1]^d)$. Then, the data-driven estimator is given by

\hat{h}_{n}^{*}(x):=\frac{1}{B}\sum_{b=1}^{B}\hat{h}_{b,n}^{*}(x),\ x\in[0,1]^{d}.    (9)
Theorem 2.

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ and $0<\alpha_n\leq 1$, we have

\begin{aligned}
\mathbf{E}R(\hat{h}_{n}^{*})-R(h) &\leq c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{\mathbf{E}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\left(1+\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}\right)^{d}\\
&\quad+\left(2d^{\frac{3}{2}p}C\cdot\sup_{y\in[-\ln n,\ln n]}M_{1}(C,y)\right)\cdot\alpha_{n}^{\frac{p}{2}}+\frac{c_{1}}{\sqrt{n}}\cdot\sqrt{\mathbf{E}(M^{2}_{2}(C,Y))}\\
&\quad+Res(n),
\end{aligned}    (10)

where $c_1,c_2>0$ are universal constants and $Res(n)$ is defined in (6).

Therefore, by choosing $\alpha_n$ properly, we can obtain a good rate for the regret function of Mondrian forests from (10). Theorem 2 also tells us that the estimator (9) is adaptive, once we ignore unimportant coefficients such as $M_1(\beta_n,\ln n)$. Applications of Theorem 2 are given in the next section, where several examples are discussed in detail; in those cases the coefficients in (10), such as $M_1(\beta_n,\ln n)$, can indeed be bounded by a polynomial of $\ln n$.
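As a rough illustration of how the criterion (8) could be used in practice, the Python sketch below scans a finite grid of candidate stopping times (the paper minimizes over all $\lambda\geq 0$, and the partitions for different $\lambda$ should come from pruning one Mondrian process; drawing a fresh partition for each grid point is our simplification). The helper names and the inputs loss, beta_n, alpha_n and lambda_grid are hypothetical, reusing the sketches from Section 3.

import numpy as np

def select_lambda_penalized(X, Y, loss, beta_n, alpha_n, lambda_grid, rng):
    # Keep the candidate stopping time minimizing Pen(lambda) in (8).
    best_lam, best_pen, best_tree = None, np.inf, None
    for lam in lambda_grid:
        leaves = mondrian_partition([(0.0, 1.0)] * X.shape[1], 0.0, lam, rng)
        consts = fit_cells(leaves, X, Y, loss, beta_n)
        emp_risk = np.mean([loss(predict_tree(x, leaves, consts), y)
                            for x, y in zip(X, Y)])
        pen = emp_risk + alpha_n * lam          # empirical risk + penalty, cf. (8)
        if pen < best_pen:
            best_lam, best_pen, best_tree = lam, pen, (leaves, consts)
    return best_lam, best_tree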

6 Examples

In this section, we show how to use Mondrian forests in different statistical learning problems. Meanwhile, theoretical properties of the forest estimators, namely $\hat{h}_n(x)$ in (3) and $\hat{h}_n^*(x)$ in (9), are derived from Theorems 1-2 for each learning problem. Sometimes, the result in Lemma 1 is useful for verifying Assumption 5. Its proof follows by considering the Taylor expansion of the real-valued function $R(h^*+\alpha h)$ w.r.t. $\alpha\in[0,1]$.

Lemma 1.

Suppose that for any $h:[0,1]^d\to\mathbb{R}$ and $\alpha\in[0,1]$ we have

C_{1}\mathbf{E}(h(X)^{2})\leq\frac{d^{2}}{d\alpha^{2}}R(h^{*}+\alpha h)\leq C_{2}\mathbf{E}(h(X)^{2}),

where the constants $C_1>0,C_2>0$ are universal. Then, Assumption 5 holds with $\kappa=2$.

6.1 Least squares regression

Nonparametric least squares regression refers to methods that do not assume a specific parametric form for the conditional expectation $\mathbf{E}(Y|X)$; instead, these methods are flexible and can adapt to the underlying structure of the data. The loss function for least squares regression is $\ell_1(x,y)=(x-y)^2$. First, define the event $A_n:=\{\max_{1\leq i\leq n}|Y_i|\leq\ln n\}$. Under Assumption 4, by (34) we can find constants $c,c'>0$ such that $\mathbf{P}(A_n)\geq 1-c'\cdot ne^{-c\ln^2 n}$, which means $\mathbf{P}(A_n)$ is very close to $1$ as $n\to\infty$. On the event $A_n$, from (2) we further know that

\begin{aligned}
\hat{c}_{b,\lambda,j} &=\arg\min_{z\in[-\beta_{n},\beta_{n}]}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}\ell(z,Y_{i})\\
&=\frac{1}{Card(\{i:X_{i}\in\mathcal{C}_{b,\lambda,j}\})}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}Y_{i},
\end{aligned}

where $Card(\cdot)$ denotes the cardinality of a set. Therefore, $\hat{c}_{b,\lambda,j}$ is just the average of the $Y_i$'s in the leaf $\mathcal{C}_{b,\lambda,j}$.

Let us discuss the properties of $\ell_1(x,y)=(x-y)^2$. First, it is obvious that Assumption 1 holds for this $\ell^2$ loss. By some simple calculations (sketched below), we also know that Assumption 2 is satisfied with $M_1(x,y)=2(|x|+|y|)$ and Assumption 3 is satisfied with $M_2(x,y)=2(x^2+y^2)$. Choosing $\lambda_n=n^{\frac{1}{2(p+d)}}$ and $\beta_n\asymp\ln n$, Theorem 1 implies the following property of $\hat{h}_n$.
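The calculations behind $M_1$ and $M_2$ referred to above can be sketched as follows (our derivation, using only the definitions in Assumptions 2-3): for any $x_1,x_2\in[-x,x]$ and $y\in\mathbb{R}$,

|\ell_{1}(x_{1},y)-\ell_{1}(x_{2},y)|=|(x_{1}-y)^{2}-(x_{2}-y)^{2}|=|x_{1}+x_{2}-2y|\cdot|x_{1}-x_{2}|\leq 2(|x|+|y|)\cdot|x_{1}-x_{2}|,

and

\sup_{x^{\prime}\in[-x,x]}\ell_{1}(x^{\prime},y)=\sup_{x^{\prime}\in[-x,x]}(x^{\prime}-y)^{2}\leq 2(x^{2}+y^{2}),

which give $M_1(x,y)=2(|x|+|y|)$ and $M_2(x,y)=2(x^2+y^2)$; the moment condition $\mathbf{E}(M_2^2(x,Y))<\infty$ then follows from the sub-Gaussianity of $Y$ in Assumption 4.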

Proposition 1.

For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$, there exists an integer $n_1(C)\geq 1$ such that for any $n>n_1(C)$,

\mathbf{E}R(\hat{h}_{n})-R(h)\leq\left(2\sqrt{2}\ln^{2}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

Next, we verify Assumption 5. By some simple calculations, we have $m(x)=\mathbf{E}(Y|X=x)$ and

\begin{aligned}
R(h)-R(m) &=\mathbf{E}(Y-h(X))^{2}-\mathbf{E}(Y-m(X))^{2}\\
&=\mathbf{E}(h(X)-m(X))^{2},
\end{aligned}

since the cross term $2\mathbf{E}[(Y-m(X))(m(X)-h(X))]$ vanishes after conditioning on $X$. This identity shows that Assumption 5 holds with $c=1$ and $\kappa=2$. When $\lambda_n=n^{\frac{1}{2(p+d)}}$ is selected, Corollary 1 tells us the following.

Proposition 2.

Suppose the true function $m\in\mathcal{H}^{p,\beta}([0,1]^d,C)$. Then there exists an integer $n_2(C)\geq 1$ such that for any $n>n_2(C)$,

\mathbf{E}\left(\hat{h}_{n}(X)-m(X)\right)^{2}\leq\left(2\sqrt{2}\ln^{2}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

We can also show that $\hat{h}_n$ is consistent for any general function $m$ defined in (5) if $\lambda_n$ is chosen as stated in Corollary 2. Finally, choose $\alpha_n=n^{-\frac{p}{2p+4d}}$ and $\beta_n\asymp\ln n$ in Theorem 2. The regret function of the estimator $\hat{h}_n^*$, based on the model selection in (8), then has the following upper bound.

Proposition 3.

For any $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$, there exists an integer $n_3(C)\geq 1$ such that for any $n>n_3(C)$,

\mathbf{E}R(\hat{h}_{n}^{*})-R(h)\leq\left(c_{1}\cdot 2^{2d+2}\ln^{2d+1}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+2d}}.

6.2 Generalized regression

Generalized regression refers to a broad class of regression models extending beyond ordinary least squares (OLS) regression, accommodating various types of response variables and relationships between predictors and response. Usually, in such a model the conditional distribution of $Y$ given $X$ follows an exponential family of distributions,

\mathbf{P}(Y\in dy|X=x)=\exp\{B(m(x))y-D(m(x))\}\Psi(dy),    (11)

where $\Psi(dy)$ is a positive measure on $\mathbb{R}$ with $\Psi(\mathbb{R})>\Psi(\{y\})$ for any $y\in\mathbb{R}$. The function $D(\cdot)$ serves as a normalization; it is defined on an open subinterval $\mathcal{I}$ of $\mathbb{R}$ and satisfies $D(m)=\ln\int_{\mathbb{R}}\exp\{B(m)y\}\Psi(dy)$. We suppose the function $A(m):=D^{\prime}(m)/B^{\prime}(m)$ exists, and some calculations then give $\mathbf{E}(Y|X=x)=A(m(x))$. Thus, the conditional expectation $\mathbf{E}(Y|X=x)$ is known once we can estimate the unknown function $m(x)$. More information about (11) can be found in Stone (1986), Stone (1994) and Huang (1998).

The problem of generalized regression is to estimate the unknown function $m(x)$ using the i.i.d. data $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$. Note that both $B(\cdot)$ and $D(\cdot)$ in (11) are known. In this case, we use the maximum likelihood method for estimation, and the corresponding loss function is given by

\ell_{2}(x,y)=-B(x)y+D(x),

and some calculations show that the true function $m$ satisfies (5), namely

m\in\arg\min_{h}\mathbf{E}\left(-B(h(X))Y+D(h(X))\right).

Therefore, we have reason to believe that Mondrian forests are statistically consistent for this problem, as stated in Corollary 2. We now impose some mild restrictions on $B(\cdot)$ and $D(\cdot)$ to ensure that our general results can be applied to generalized regression.

Conditions

  (i) $B(\cdot)$ is twice continuously differentiable and its first derivative is strictly positive on $\mathcal{I}$.

  (ii) We can find a subinterval $S$ of $\mathbb{R}$ such that the measure $\Psi$ is concentrated on $S$ and

    -B^{\prime\prime}(\xi)y+D^{\prime\prime}(\xi)>0,\quad y\in\breve{S},\ \xi\in\mathcal{I},    (12)

    where $\breve{S}$ denotes the interior of $S$. If $S$ is bounded, (12) holds at least at one of its endpoints.

  (iii) $\mathbf{P}(Y\in S)=1$ and $\mathbf{E}(Y|X=x)=A(m(x))$ for each $x\in[0,1]^d$.

  (iv) There is a compact subinterval $\mathcal{K}_0$ of $\mathcal{I}$ such that the range of $m$ is contained in $\mathcal{K}_0$.

The above restrictions on $B(\cdot)$ and $D(\cdot)$ were used in Huang (1998), from which we also know that many commonly used distributions satisfy these conditions, including the normal, Poisson and Bernoulli distributions. Now, let us verify Assumptions 1-5 under this setting.

In particular, Assumption 1 is verified by using Conditions (i)-(iii). On the other hand, we can choose the Lipschitz constant in Assumption 2 as

M_{1}(x,y):=\left|\sup_{\tilde{x}\in[-x,x]}B^{\prime}(\tilde{x})\right|\cdot|y|+\left|\sup_{\tilde{x}\in[-x,x]}D^{\prime}(\tilde{x})\right|.

Thirdly, the envelope function of $\ell_2(x,y)$ can be set to

M_{2}(x,y):=\left|\sup_{\tilde{x}\in[-x,x]}B(\tilde{x})\right|\cdot|y|+\left|\sup_{\tilde{x}\in[-x,x]}D(\tilde{x})\right|.

Since $Y$ is assumed to be sub-Gaussian in Assumption 4, we have $\mathbf{E}(M_2^2(x,Y))<\infty$ in this case, which shows that Assumption 3 is satisfied. Finally, under Conditions (i)-(iv), Lemma 4.1 in Huang (1998) shows that Assumption 5 holds with $\kappa=2$ and

c=\max\left\{\sup_{\xi\in[-\beta_{n},\beta_{n}]\cap\mathcal{I},\ m\in\mathcal{K}_{0}}\left(-B^{\prime\prime}(\xi)A(m)+D^{\prime\prime}(\xi)\right),\ \left[\inf_{\xi\in[-\beta_{n},\beta_{n}]\cap\mathcal{I},\ m\in\mathcal{K}_{0}}\left(-B^{\prime\prime}(\xi)A(m)+D^{\prime\prime}(\xi)\right)\right]^{-1}\right\}.    (13)

From (12), the constant $c$ in (13) must be larger than zero. On the other hand, we will see later that this $c$ is finite in many cases.

Therefore, the general theoretical results in Sections 4 and 5 can be applied to generalized regression. Finally, we stress that the coefficients appearing in these general results, such as $M_1(\beta_n,\ln n)$ in Theorem 1 and $c$ in (13), are only of polynomial order in $\ln n$ if we choose $\beta_n$ properly. Let us give some specific examples.

Examples

  1. The first example is Gaussian regression, where the conditional distribution $Y|X=x$ follows $N(m(x),1)$. Therefore, $B(x)=x$, $D(x)=\frac{1}{2}x^2$, $\mathcal{I}=\mathbb{R}$ and $S=\mathbb{R}$. Our goal is to estimate the unknown conditional mean $m(x)$. Conditions (i)-(iii) are satisfied; to satisfy the fourth one, we assume the range of $m$ is contained in a compact set of $\mathbb{R}$, denoted by $\mathcal{K}_0$. Choose $\beta_n\asymp\ln n$. Meanwhile, $Y$ is a sub-Gaussian random variable. The constant $c$ in (13) equals $1$, and the coefficients in the general theoretical results are all of polynomial order in $\ln n$; for instance, $M_1(\beta_n,\ln n)\asymp 2\ln n$.

  2. The second example is Poisson regression, where the conditional distribution $Y|X=x$ follows $Poisson(\lambda(x))$ with $\lambda(x)>0$. Therefore, $B(x)=x$, $D(x)=\exp(x)$, $\mathcal{I}=\mathbb{R}$ and $S=[0,\infty)$. Our goal is to estimate the log-transformed conditional mean, namely $m(x)=\ln\lambda(x)$ in (11), using the Mondrian forest (3). It is not difficult to show that Conditions (i)-(iii) are satisfied; to satisfy the fourth one, we assume the range of $\lambda(x)$ is contained in a compact set of $(0,\infty)$, so that $m(x)$ satisfies Condition (iv). Choose $\beta_n\asymp\ln\ln n$. Meanwhile, we can ensure $Y$ satisfies Assumption 4. The constant $c$ in (13) equals $\ln n$, and the coefficients in the general theoretical results are again of polynomial order in $\ln n$; for instance, $M_1(\beta_n,\ln n)\asymp 2\ln n$.

  3. The third example is $0$-$1$ classification, where the conditional distribution $Y|X=x$ is Bernoulli (taking values in $\{0,1\}$) with $\mathbf{P}(Y=1|X=x)=p(x)\in(0,1)$. It is well known that the best classifier is the Bayes rule,

    C^{Bayes}(x)=\begin{cases}1,&p(x)-\frac{1}{2}\geq 0,\\ 0,&p(x)-\frac{1}{2}<0.\end{cases}

    Therefore, the main focus of this problem is to estimate the conditional probability $p(x)$ well, and we use a Mondrian forest for this estimation. First, we shift $p(x)$, i.e., $m(x):=p(x)-\frac{1}{2}\in(-\frac{1}{2},\frac{1}{2})$ is used in (11) instead. By some calculations, $B(x)=\ln(0.5+x)-\ln(0.5-x)$, $D(x)=-\ln(0.5-x)$, $\mathcal{I}=(-0.5,0.5)$ and $S=[0,1]$ in this case. The final goal is to estimate $m(x)$ by the forest estimator (3). It is not difficult to show that Conditions (i)-(iii) are satisfied; to satisfy the fourth one, we assume the range of $m(x)$ is contained in a compact set of $(-\frac{1}{2},\frac{1}{2})$. Now, choose $\beta_n\asymp\frac{1}{2}-(\frac{1}{\ln n})^{\gamma}$ for some $\gamma>0$. Meanwhile, Assumption 4 is satisfied since $Y$ is bounded. The constant $c$ in (13) equals $(\ln n)^{2\gamma}$, and the coefficients in the general theoretical results are of polynomial order in $\ln n$; for instance, $M_1(\beta_n,\ln n)\asymp 2(\ln n)^{\gamma+1}$.

In each of the three examples above, the convergence rate of $\varepsilon(\hat{h}_n)$ is $O_p(n^{-\frac{1}{2}\cdot\frac{p}{p+d}})$ up to a polynomial of $\ln n$; a sketch of the per-cell fit for the Poisson example is given below.
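The following minimal Python sketch (our code, with a hypothetical helper name) shows how the general per-cell fit (2) specializes to the Poisson example, i.e., to the loss $\ell_2(z,y)=-zy+e^z$ under the sign convention $B(x)=x$, $D(x)=e^x$ used above.

import numpy as np

def poisson_cell_constant(y_in_cell, beta_n):
    # Minimize sum_i (-z*Y_i + exp(z)) over z in [-beta_n, beta_n]: the
    # unconstrained minimizer is z = ln(ybar); since the objective is convex in z,
    # the constrained minimizer is its projection onto [-beta_n, beta_n].
    y_in_cell = np.asarray(y_in_cell)
    if y_in_cell.size == 0:
        return 0.0                      # empty cell: predictor 0 (Section 3)
    ybar = y_in_cell.mean()
    if ybar <= 0:
        return -beta_n                  # objective reduces to exp(z), increasing in z
    return float(np.clip(np.log(ybar), -beta_n, beta_n))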

6.3 Quantile regression

Quantile regression is a type of regression analysis used in statistics and econometrics that focuses on estimating the conditional quantiles (such as the median or other percentiles) of the response variable distribution given a set of predictor variables. Unlike ordinary least squares (OLS) regression, which estimates the mean of the response variable conditional on the predictor variables, quantile regression provides a more comprehensive analysis by estimating the conditional median or other quantiles.

Specifically, suppose $m(x)$ is the $\tau$-th quantile ($0<\tau<1$) of the conditional distribution of $Y|X=x$. Our interest is to estimate $m(x)$ using the i.i.d. data $\mathcal{D}_n=\{(X_i,Y_i)\}_{i=1}^n$. The loss function in this case is given by

\ell_{3}(x,y):=\rho_{\tau}(y-x),

where $\rho_\tau(u)=(\tau-\mathbb{I}(u<0))u$ denotes the check function for quantile level $\tau$. Meanwhile, by some calculations we know the quantile function $m(x)$ minimizes the population risk w.r.t. $\ell_3(x,y)$. Namely, we have

m(x)\in\arg\min_{h}\mathbf{E}(\rho_{\tau}(Y-h(X))).

Therefore, we have reason to believe that the forest estimator in (3) works well for this problem.
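Before verifying the assumptions, a minimal Python sketch (our code, hypothetical helper name) shows what the per-cell fit (2) looks like for the check loss: an exact minimizer of $\sum_i\rho_\tau(Y_i-z)$ over $z$ is the $\lceil N\tau\rceil$-th order statistic of the responses in the cell, and the truncated fit projects it onto $[-\beta_n,\beta_n]$.

import numpy as np

def quantile_cell_constant(y_in_cell, tau, beta_n):
    # An exact minimizer of sum_i rho_tau(Y_i - z) is the ceil(N*tau)-th order
    # statistic of the Y_i in the cell; since the objective is convex in z,
    # clipping this minimizer to [-beta_n, beta_n] gives the constrained fit.
    y_in_cell = np.asarray(y_in_cell)
    if y_in_cell.size == 0:
        return 0.0                                   # empty cell: predictor 0
    k = int(np.ceil(tau * y_in_cell.size)) - 1       # 0-based index of the order statistic
    q = np.sort(y_in_cell)[max(k, 0)]
    return float(np.clip(q, -beta_n, beta_n))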

Let us verify Assumptions 1-5. First, we choose $S=\mathbb{R}$ in Assumption 1; it is then easy to check that $\ell_3(x,y)$ is convex in its first argument for any $y\in S$. Second, the loss function $\ell_3(x,y)$ is Lipschitz continuous w.r.t. its first argument for any given $y\in S$, with Lipschitz constant $M_1(x,y):=\max\{\tau,1-\tau\}$ for all $x,y\in\mathbb{R}$. Third, we choose the envelope function $M_2(x,y):=\max\{\tau,1-\tau\}\cdot(|x|+|y|)$ in Assumption 3. Fourth, we suppose $Y$ is a sub-Gaussian random variable to meet the requirement of Assumption 4. Finally, we consider a sufficient condition for Assumption 5.

The Knight identity in Knight (1998) tells us that

\rho_{\tau}(u-v)-\rho_{\tau}(u)=v(\mathbb{I}(u\leq 0)-\tau)+\int_{0}^{v}(\mathbb{I}(u\leq s)-\mathbb{I}(u\leq 0))ds,

from which we get

\begin{aligned}
R(h)-R(m) &=\mathbf{E}\left[\rho_{\tau}(Y-h(X))-\rho_{\tau}(Y-m(X))\right]\\
&=\mathbf{E}\left[(h(X)-m(X))\left(\mathbb{I}(Y-m(X)\leq 0)-\tau\right)\right]\\
&\quad+\mathbf{E}\left[\int_{0}^{h(X)-m(X)}\left(\mathbb{I}(Y-m(X)\leq s)-\mathbb{I}(Y-m(X)\leq 0)\right)ds\right].
\end{aligned}

Conditioning on $X$, the first term above equals zero by the definition of $m(x)$. Thus, we have

\begin{aligned}
R(h)-R(m) &=\mathbf{E}\left[\int_{0}^{h(X)-m(X)}\left(\mathbb{I}(Y-m(X)\leq s)-\mathbb{I}(Y-m(X)\leq 0)\right)ds\right]\\
&=\mathbf{E}\left[\int_{0}^{h(X)-m(X)}sgn(s)\,\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]\,ds\right].
\end{aligned}    (14)

To illustrate the basic idea clearly, we consider a Gaussian case, where $m_1(X):=\mathbf{E}(Y|X)$ is independent of the residual $\varepsilon=Y-\mathbf{E}(Y|X)\sim N(0,1)$ and $\sup_{x\in[0,1]^d}|m_1(x)|<\infty$; the general case can be handled in the same spirit. In this Gaussian case, $m(X)$ equals the sum of $m_1(X)$ and the $\tau$-th quantile of $\varepsilon$. Denote by $q_\tau(\varepsilon)\in\mathbb{R}$ the $\tau$-th quantile of $\varepsilon$. Thus, the conditional distribution of $(Y-m(X))|X$ is the same as the distribution of $\varepsilon-q_\tau(\varepsilon)$. For any $s_0>0$,

\int_{0}^{s_{0}}\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds=\int_{0}^{s_{0}}\mathbf{P}[\varepsilon\in(q_{\tau}(\varepsilon),q_{\tau}(\varepsilon)+s)]ds.

Since $q_\tau(\varepsilon)$ is a fixed number, we allow $s_0$ to be large below. By the Lagrange mean value theorem, the following probability bound holds for $0\leq s\leq s_0$:

\frac{1}{\sqrt{2\pi}}\exp\left(-(|q_{\tau}(\varepsilon)|+s_{0})^{2}/2\right)\cdot s\leq\mathbf{P}[\varepsilon\in(q_{\tau}(\varepsilon),q_{\tau}(\varepsilon)+s)]\leq\frac{1}{\sqrt{2\pi}}\cdot s.

Then,

\frac{1}{\sqrt{2\pi}}\exp\left(-(|q_{\tau}(\varepsilon)|+s_{0})^{2}/2\right)\cdot\frac{1}{2}s_{0}^{2}\leq\int_{0}^{s_{0}}\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}s_{0}^{2}.

With the same argument, we also have

\frac{1}{\sqrt{2\pi}}\exp\left(-(|q_{\tau}(\varepsilon)|+|s_{0}|)^{2}/2\right)\cdot\frac{1}{2}s_{0}^{2}\leq\int_{0}^{s_{0}}\mathbf{P}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}s_{0}^{2}

once $s_0<0$. Therefore, (14) implies

R(h)-R(m)\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}\mathbf{E}(h(X)-m(X))^{2}    (15)

and

R(h)-R(m)\geq\frac{1}{\sqrt{2\pi}}\exp\left(-\left(|q_{\tau}(\varepsilon)|+\sup_{x\in[0,1]^{d}}|h(x)-m(x)|\right)^{2}/2\right)\cdot\frac{1}{2}\mathbf{E}(h(X)-m(X))^{2}.    (16)

Now, choose $\beta_n\asymp\sqrt{\ln\ln n}$. The combination of (15) and (16) implies that Assumption 5 holds with $\kappa=2$, with a lower-bound constant of order $(\ln n)^{-1}$. Meanwhile, the coefficients in the general theoretical results are of polynomial order in $\ln\ln n$; for instance, $\sqrt{\mathbf{E}(M_2^2(\beta_n,Y))}\asymp\ln\ln n$. With this setting, the convergence rate of $\varepsilon(\hat{h}_n)$ is $O_p(n^{-\frac{1}{2}\cdot\frac{p}{p+d}})$ up to a polynomial of $\ln n$.

6.4 Nonparametric density estimation

Assume $X$ is a continuous random vector defined on $[0,1]^d$ with density function $f_0(x),\ x\in[0,1]^d$. Our interest lies in estimating the unknown function $f_0(x)$ from an i.i.d. sample of $X$, namely the data $\mathcal{D}_n=\{X_i\}_{i=1}^n$. However, any density estimator has to satisfy two shape requirements: $f_0$ is non-negative, namely $f_0(x)\geq 0$ for $x\in[0,1]^d$, and $\int f_0(x)dx=1$. These two restrictions can be relaxed by a transformation. In fact, we have the decomposition

f_{0}(x)=\frac{\exp(h_{0}(x))}{\int\exp(h_{0}(x))dx},\ x\in[0,1]^{d},

where $h_0(x)$ can be any integrable function on $[0,1]^d$. This link allows us to focus on the estimation of $h_0(x)$ only, which is a statistical learning problem without constraints. However, the transformation introduces a model identifiability problem, since $h_0+c$ and $h_0$ give the same density function for any $c\in\mathbb{R}$. To solve this problem, we impose the additional requirement $\int_{[0,1]^d}h_0(x)dx=0$, which guarantees a one-to-one map between $f_0$ and $h_0$.
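For concreteness, the one-to-one map can be written explicitly (our calculation): given a density $f_0$ bounded away from zero, the centered log-density

h_{0}(x)=\ln f_{0}(x)-\int_{[0,1]^{d}}\ln f_{0}(u)\,du

satisfies $\int_{[0,1]^d}h_0(x)dx=0$ and $\exp(h_0(x))/\int\exp(h_0(u))du=f_0(x)$, since the additive constant cancels in the ratio.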

In the case of density estimation, the scaled negative log-likelihood of any function $h(x)$ based on the sampled data is

\hat{R}(h):=-\frac{1}{n}\sum_{i=1}^{n}h(X_{i})+\ln\int_{[0,1]^{d}}\exp(h(x))dx

and its population version is

R(h)=-\mathbf{E}(h(X))+\ln\int_{[0,1]^{d}}\exp(h(x))dx.

With a slight modification, Mondrian forests can also be applied to this problem. Recall that the partition $\{\mathcal{C}_{b,\lambda,j}\}_{j=1}^{K_b(\lambda)}$ of the $b$-th Mondrian tree with stopping time $\lambda$ satisfies

[0,1]^{d}=\bigcup_{j=1}^{K_{b}(\lambda)}\mathcal{C}_{b,\lambda,j}\ \ \text{and}\ \ \mathcal{C}_{b,\lambda,j_{1}}\cap\mathcal{C}_{b,\lambda,j_{2}}=\varnothing,\ \forall j_{1}\neq j_{2}.

For each cell $\mathcal{C}_{b,\lambda,j}$, a constant $\hat{c}_{b,\lambda,j}\in\mathbb{R}$ is used as the predictor of $h_0(x)$ in this small region. Thus, the estimator of $h_0(x)$ based on a single tree has the form

\hat{h}_{b,n}^{pre}(x)=\sum_{j=1}^{K_{b}(\lambda)}\hat{c}_{b,\lambda,j}\cdot\mathbb{I}(x\in\mathcal{C}_{b,\lambda,j}),

where the coefficients are obtained by minimizing the empirical risk function,

\begin{aligned}
(\hat{c}_{b,\lambda,1},\ldots,\hat{c}_{b,\lambda,K_{b}(\lambda)}):=\arg\min_{\begin{subarray}{c}c_{b,\lambda,j}\in[-\beta_{n},\beta_{n}]\\ j=1,\ldots,K_{b}(\lambda)\end{subarray}}\ &-\frac{1}{n}\sum_{j=1}^{K_{b}(\lambda)}\sum_{i=1}^{n}c_{b,\lambda,j}\cdot\mathbb{I}(X_{i}\in\mathcal{C}_{b,\lambda,j})\\
&+\ln\sum_{j=1}^{K_{b}(\lambda)}\int_{\mathcal{C}_{b,\lambda,j}}\exp(c_{b,\lambda,j})dx.
\end{aligned}

Since the objective above is continuous in the variables $c_{b,\lambda,j}$ and the feasible set $[-\beta_n,\beta_n]^{K_b(\lambda)}$ is compact, the minimum is indeed attained. To meet the identifiability requirement, the estimator based on a single tree is given by

\hat{h}_{b,n}(x)=\hat{h}_{b,n}^{pre}(x)-\int_{[0,1]^{d}}\hat{h}_{b,n}^{pre}(x)dx.

Finally, by applying the ensemble technique again, the estimator of $h_0$ based on the Mondrian forest is

\hat{h}_{n}(x):=\frac{1}{B}\sum_{b=1}^{B}\hat{h}_{b,n}(x),\ x\in[0,1]^{d}.    (17)
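To make the within-tree optimization concrete, the Python sketch below (our code, hypothetical helper name, assuming NumPy) computes the unconstrained first-order solution, which is a histogram on the Mondrian partition written on the log scale and centred for identifiability; the truncation to $[-\beta_n,\beta_n]$ and empty cells would need an extra clipping step, as noted in the comments.

import numpy as np

def density_tree_constants(leaves, X, eps=1e-12):
    # Setting the gradient of the objective above to zero gives, for each cell,
    # exp(c_j) proportional to N_j / (n * vol(C_j)), i.e. the histogram estimate
    # on the Mondrian partition. We take logs (eps guards empty cells, which
    # strictly speaking also require clipping to [-beta_n, beta_n]) and centre so
    # that sum_j vol(C_j) * c_j = 0, the identifiability constraint.
    n = X.shape[0]
    volumes = np.array([np.prod([b - a for a, b in cell]) for cell in leaves])
    counts = np.array([np.sum(np.all([(X[:, j] >= a) & (X[:, j] <= b)
                                      for j, (a, b) in enumerate(cell)], axis=0))
                       for cell in leaves])
    log_density = np.log(counts / (n * volumes) + eps)
    return log_density - np.sum(volumes * log_density)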

Next, we analyze the theoretical properties of $\hat{h}_n$. The forest estimator (17) is similar to the previous one defined in (3), which is obtained by optimization without constraints. Although here we add the extra penalty function

Pen(h):=\ln\int_{[0,1]^{d}}\exp(h(x))dx,

where $h:[0,1]^d\to\mathbb{R}$, the arguments of Theorem 1 can still be applied. Let the pseudo loss function be $\ell^{pse}(x,y):=-x$ for $x,y\in\mathbb{R}$. It is obvious that Assumption 1 holds for $\ell^{pse}(x,y)$, Assumption 2 is satisfied with $M_1(x,y)=1$, and Assumption 3 is satisfied with $M_2(x,y)=x$. Choose $\beta_n\asymp\ln\ln n$. Following arguments similar to those in Theorem 1, we have

\begin{aligned}
\mathbf{E}R(\hat{h}_{n})-R(h) &\leq c_{1}\frac{\ln n}{\sqrt{n}}(1+\lambda_{n})^{d}+2d^{\frac{3}{2}p}C\cdot\frac{1}{\lambda_{n}^{p}}\\
&\quad+\frac{C}{\sqrt{n}}+\mathbf{E}_{\lambda_{n}}|Pen(h)-Pen(h_{n}^{*})|,
\end{aligned}    (18)

where $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ ($0<p\leq 1$), $h_n^*(x):=\sum_{j=1}^{K_1(\lambda_n)}\mathbb{I}(x\in\mathcal{C}_{1,\lambda_n,j})h(x_{1,\lambda_n,j})$ with $x_{1,\lambda_n,j}$ the center of the cell $\mathcal{C}_{1,\lambda_n,j}$, and $c_1$ is the coefficient in Theorem 1. Therefore, it remains to bound $\mathbf{E}_{\lambda_n}|Pen(h)-Pen(h_n^*)|$.

Lemma 2.

For any $h(x)\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ and $h_n^*(x)$ defined above,

\mathbf{E}_{\lambda_{n}}|Pen(h)-Pen(h_{n}^{*})|\leq\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\cdot\left(\frac{1}{\lambda_{n}}\right)^{p}.

Choosing $\lambda_n=n^{\frac{1}{2(p+d)}}$, the combination of (18) and Lemma 2 implies

\mathbf{E}R(\hat{h}_{n})-R(h)\leq\left(c_{1}\ln\ln n+3d^{\frac{3}{2}p}C+\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}

when $n$ is large enough.

To obtain the consistency rate of the Mondrian forest estimator, we need a result similar to Assumption 5.

Lemma 3.

Suppose the true density $f_0(x)$ is bounded away from zero and infinity, namely $c_0<f_0(x)<c_0^{-1}$ for all $x\in[0,1]^d$, and let $\beta_n\asymp\ln\ln n$. For any function $h:[0,1]^d\to\mathbb{R}$ with $\|h\|_\infty\leq\beta_n$ and $\int_{[0,1]^d}h(x)dx=0$, we have

c_{0}\cdot\frac{1}{\ln n}\cdot\mathbf{E}(h(X)-h_{0}(X))^{2}\leq R(h)-R(h_{0})\leq c_{0}^{-1}\cdot\ln n\cdot\mathbf{E}(h(X)-h_{0}(X))^{2}.

Thus, Lemma 3 immediately implies the following result.

Theorem 3.

Suppose the true density $f_0(x)$ is bounded away from zero and infinity, namely $c_0<f_0(x)<c_0^{-1}$ for all $x\in[0,1]^d$. If $\lambda_n=n^{\frac{1}{2(p+d)}}$ and the true function $h_0\in\mathcal{H}^{p,\beta}([0,1]^d,C)$ with $0<p\leq 1$ satisfies $\int_{[0,1]^d}h_0(x)dx=0$, then there is $n_{den}\in\mathbb{Z}^+$ such that for any $n>n_{den}$,

\mathbf{E}(\hat{h}_{n}(X)-h_{0}(X))^{2}\leq c_{0}^{-1}\cdot\ln n\cdot\left(c_{1}\ln\ln n+3d^{\frac{3}{2}p}C+\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

7 Conclusion

In this paper, we proposed a general framework for Mondrian forests that can be used in many statistical and machine learning problems, including but not limited to least squares regression, generalized regression, density estimation and quantile regression. Meanwhile, we studied the upper bound of its regret function and its statistical consistency, and showed how to apply these results in the specific applications listed above. Future work could study the asymptotic distribution of this kind of general Mondrian forest, as suggested by Cattaneo et al. (2023).

8 Proofs

This section contains proofs of the theoretical results in the paper. Several useful preliminaries and notations are introduced first. Meanwhile, the constant $c$ in this section may change from line to line in order to simplify notation.

Definition 2 (Blumer et al. (1989)).

Let $\mathcal{F}$ be a Boolean function class in which each $f:\mathcal{Z}\to\{0,1\}$ is binary-valued. The growth function of $\mathcal{F}$ is defined by

\Pi_{\mathcal{F}}(m)=\max_{z_{1},\ldots,z_{m}\in\mathcal{Z}}|\{(f(z_{1}),\ldots,f(z_{m})):f\in\mathcal{F}\}|

for each positive integer $m\in\mathbb{Z}_+$.

Definition 3 (Györfi et al. (2002)).

Let $z_1,\ldots,z_n\in\mathbb{R}^p$ and $z_1^n=\{z_1,\ldots,z_n\}$. Let $\mathcal{H}$ be a class of functions $h:\mathbb{R}^p\to\mathbb{R}$. An $L_q$ $\varepsilon$-cover of $\mathcal{H}$ on $z_1^n$ is a finite set of functions $h_1,\ldots,h_N:\mathbb{R}^p\to\mathbb{R}$ satisfying

\min_{1\leq j\leq N}\left(\frac{1}{n}\sum_{i=1}^{n}|h(z_{i})-h_{j}(z_{i})|^{q}\right)^{\frac{1}{q}}<\varepsilon,\ \ \forall h\in\mathcal{H}.

Then, the $L_q$ $\varepsilon$-cover number of $\mathcal{H}$ on $z_1^n$, denoted by $\mathcal{N}_q(\varepsilon,\mathcal{H},z_1^n)$, is the minimal size of an $L_q$ $\varepsilon$-cover of $\mathcal{H}$ on $z_1^n$. If there exists no finite $L_q$ $\varepsilon$-cover of $\mathcal{H}$, the cover number is defined as $\mathcal{N}_q(\varepsilon,\mathcal{H},z_1^n)=\infty$.

For a VC class, there is a useful result giving an upper bound on its covering number. To make this section self-contained, we first introduce some basic concepts and facts about the VC dimension; see Shalev-Shwartz and Ben-David (2014) for more details.

Definition 4 (Kosorok (2008)).

The subgraph of a real function $f:\mathcal{X}\to\mathbb{R}$ is the subset of $\mathcal{X}\times\mathbb{R}$ defined by

C_{f}=\{(x,y)\in\mathcal{X}\times\mathbb{R}:f(x)>y\},

where $\mathcal{X}$ is an abstract set.

Definition 5 (Kosorok (2008)).

Let $\mathcal{C}$ be a collection of subsets of the set $\mathcal{X}$ and let $\{x_1,\ldots,x_m\}\subset\mathcal{X}$ be an arbitrary set of $m$ points. We say that $\mathcal{C}$ picks out a certain subset $A$ of $\{x_1,\ldots,x_m\}$ if $A$ can be expressed as $C\cap\{x_1,\ldots,x_m\}$ for some $C\in\mathcal{C}$. The collection $\mathcal{C}$ is said to shatter $\{x_1,\ldots,x_m\}$ if each of its $2^m$ subsets can be picked out.

Definition 6 (Kosorok (2008)).

The VC dimension of a real function class $\mathcal{F}$, where each $f\in\mathcal{F}$ is defined on $\mathcal{X}$, is the largest integer $VC(\mathcal{F})$ such that some set of points in $\mathcal{X}\times\mathbb{R}$ of size $VC(\mathcal{F})$ is shattered by $\{C_f:f\in\mathcal{F}\}$. In this paper, we use $VC(\mathcal{F})$ to denote the VC dimension of $\mathcal{F}$.

Proof of Theorem 1. Since $\ell(\cdot,y)$ is convex by Assumption 1, we have

\ell(\hat{h}_{n}(x),y)\leq\frac{1}{B}\sum_{b=1}^{B}\ell(\hat{h}_{b,n}(x),y)

for any $x\in[0,1]^d$ and $y\in\mathbb{R}$. Therefore, we only need to consider the excess risk of a single tree in the following analysis.

In fact, our proof is based on the following decomposition:

\begin{aligned}
\mathbf{E}R(\hat{h}_{1,n})-R(h) &=\mathbf{E}(R(\hat{h}_{1,n})-\hat{R}(\hat{h}_{1,n}))+\mathbf{E}(\hat{R}(\hat{h}_{1,n})-\hat{R}(h))\\
&\quad+\mathbf{E}(\hat{R}(h)-R(h))\\
&:=I+II+III,
\end{aligned}    (19)

where I relates to the variance term of the Mondrian tree, II is the approximation error of the Mondrian tree to $h\in\mathcal{H}^{p,\beta}([0,1]^d,C)$, and III measures the error when the empirical loss $\hat{R}(h)$ is used to approximate the theoretical one.

Analysis of Part I. We first define two function classes:

\mathcal{T}(t):=\{\text{Mondrian trees with $t$ leaves partitioning $[0,1]^{d}$}\},
\mathcal{G}(t):=\Big\{\sum_{j=1}^{t}\mathbb{I}(x\in\mathcal{C}_{j})\cdot c_{j}:c_{j}\in\mathbb{R},\ \mathcal{C}_{j}\text{'s are leaves of a tree in }\mathcal{T}(t)\Big\}.

The truncated version of the class $\mathcal{G}(t)$ is given by

\mathcal{G}(t,z):=\{\tilde{g}(x)=T_{z}g(x):g\in\mathcal{G}(t)\},

where $T_z$ denotes truncation at the threshold $z>0$. Then, Part I can be bounded as follows.

|I|\displaystyle|I| 𝐄πλn(𝐄𝒟n|1ni=1n(h^1,n(Xi),Yi)𝐄((h^1,n(X),Y)|𝒟n)||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-{\mathbf{E}}(\ell(\hat{h}_{1,n}(X),Y)|{\mathcal{D}}_{n})}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟n|1ni=1n(h^1,n(Xi),Yi)𝐄((h^1,n(X),Y)|𝒟n)||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-{\mathbf{E}}(\ell(\hat{h}_{1,n}(X),Y)|{\mathcal{D}}_{n})}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
=𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))|𝕀(An)|πλn)\displaystyle={\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n})\Big{|}\pi_{\lambda_{n}}\right)
+𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))|𝕀(Anc)|πλn)\displaystyle+{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)
:=I1+I2,\displaystyle:=I_{1}+I_{2}, (20)

where A_{n}:=\{\max_{1\leq i\leq n}|Y_{i}|\leq\ln{n}\}. Next, we find upper bounds for I_{1} and I_{2} respectively.

Let us consider I1I_{1} first. Make the decomposition of I1I_{1} as below.

I1\displaystyle I_{1} 𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),Y))|𝕀(An)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n})\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),Y))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),TlnnY))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),T_{\ln{n}}Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
+𝐄πλn(supg𝒢(K(λn),βn)|𝐄((g(X),TlnnY))𝐄((g(X),Y))||πλn)\displaystyle+{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|{\mathbf{E}}(\ell(g(X),T_{\ln{n}}Y))-{\mathbf{E}}(\ell(g(X),Y))\right|\Big{|}\pi_{\lambda_{n}}\right)
:=I1,1+I1,2.\displaystyle:=I_{1,1}+I_{1,2}.

The part I1,1I_{1,1} can be bounded by considering the covering number of the function class

n:={(g(),Tlnn()):g𝒢(K(λn),βn)},\mathcal{L}_{n}:=\{\ell(g(\cdot),T_{\ln n}(\cdot)):g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})\},

where K(\lambda_{n}) denotes the number of cells in the partition constructed by the truncated Mondrian process \pi_{\lambda_{n}} with stopping time \lambda_{n}; hence K(\lambda_{n}) is a deterministic number once \pi_{\lambda_{n}} is given. For any \varepsilon>0, recall the definition of the covering number of \mathcal{G}(K(\lambda_{n}),\beta_{n}), namely \mathcal{N}_{1}(\varepsilon,\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}), given in Definition 3. Now, we suppose

{η1(x),η2(x),,ηJ(x)}\{\eta_{1}(x),\eta_{2}(x),\ldots,\eta_{J}(x)\}

is an \varepsilon/M_{1}(\beta_{n},\ln n)-cover of the class \mathcal{G}(K(\lambda_{n}),\beta_{n}) in the L^{1}(z_{1}^{n}) space, where L^{1}(z_{1}^{n}):=\{f(x):\|f\|_{z_{1}^{n}}:=\frac{1}{n}\sum_{i=1}^{n}|f(z_{i})|<\infty\} is equipped with the norm \|\cdot\|_{z_{1}^{n}} and J\geq 1. Without loss of generality, we can further assume |\eta_{j}(x)|\leq\beta_{n}, since every function in \mathcal{G}(K(\lambda_{n}),\beta_{n}) is bounded in absolute value by \beta_{n}; otherwise, we replace \eta_{j}(x) by its truncation T_{\beta_{n}}\eta_{j}(x). According to Assumption 2, we know that for any g\in\mathcal{G}(K(\lambda_{n}),\beta_{n}) and any \eta_{j}(x),

|\ell(g(x),T_{\ln n}y)-\ell(\eta_{j}(x),T_{\ln n}y)|\leq M_{1}(\beta_{n},\ln n)\cdot|g(x)-\eta_{j}(x)|,

where x[0,1]d,y.x\in[0,1]^{d},y\in{\mathbb{R}}. The above inequality implies that

\frac{1}{n}\sum_{i=1}^{n}|\ell(g(z_{i}),T_{\ln n}w_{i})-\ell(\eta_{j}(z_{i}),T_{\ln n}w_{i})|\leq M_{1}(\beta_{n},\ln n)\cdot\frac{1}{n}\sum_{i=1}^{n}|g(z_{i})-\eta_{j}(z_{i})|

for any z_{1}^{n}:=(z_{1},z_{2},\ldots,z_{n})\in{\mathbb{R}}^{d}\times\cdots\times{\mathbb{R}}^{d} and (w_{1},\ldots,w_{n})\in{\mathbb{R}}\times\cdots\times{\mathbb{R}}. Therefore, \ell(\eta_{1}(x),T_{\ln n}y),\ldots,\ell(\eta_{J}(x),T_{\ln n}y) is an \varepsilon-cover of the class \mathcal{L}_{n} in the L^{1}(v_{1}^{n}) space, where v_{1}^{n}:=((z_{1}^{T},w_{1})^{T},\ldots,(z_{n}^{T},w_{n})^{T}). In other words, we have

𝒩1(ε,n,v1n)𝒩1(εM1(βn,lnn),𝒢(K(λn),βn),z1n).\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})\leq\mathcal{N}_{1}\left(\frac{\varepsilon}{M_{1}(\beta_{n},\ln n)},\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}\right). (21)

Note that \mathcal{G}(K(\lambda_{n}),\beta_{n}) is a VC class, since \mathcal{G}(K(\lambda_{n})) is shown to be a VC class in Lemma 4 below (see (29)). Furthermore, every function in \mathcal{G}(K(\lambda_{n}),\beta_{n}) is bounded in absolute value by \beta_{n}. Therefore, we can bound the RHS of (21) by using Theorem 7.12 in Sen (2018):

𝒩1(εM1(βn,lnn),𝒢(K(λn),βn),z1n)\displaystyle\mathcal{N}_{1}\left(\frac{\varepsilon}{M_{1}(\beta_{n},\ln n)},\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}\right)
cVC(𝒢(K(λn),βn))(4e)VC(𝒢(K(λn),βn))(βnε)VC(𝒢(K(λn),βn))\displaystyle\leq c\cdot VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))(4e)^{VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))}\left(\frac{\beta_{n}}{\varepsilon}\right)^{VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))} (22)

for some universal constant c>0c>0. On the other hand, it is not difficult to show

VC(𝒢(K(λn),βn))VC(𝒢(K(λn))).VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))\leq VC(\mathcal{G}(K(\lambda_{n}))). (23)

Thus, the combination of (23), (22) and (21) implies

𝒩1(ε,n,v1n)cVC(𝒢(K(λn)))(4e)VC(𝒢(K(λn)))(βnε)VC(𝒢(K(λn)))\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})\leq c\cdot VC(\mathcal{G}(K(\lambda_{n})))(4e)^{VC(\mathcal{G}(K(\lambda_{n})))}\left(\frac{\beta_{n}}{\varepsilon}\right)^{VC(\mathcal{G}(K(\lambda_{n})))} (24)

for each v_{1}^{n}. Note that the class \mathcal{L}_{n} has an envelope function M_{2}(\beta_{n},y) satisfying \nu(n):=\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}<\infty by Assumption 3. Let \{b_{i}\}_{i=1}^{n} be i.i.d. Rademacher variables with {\mathbf{P}}(b_{i}=\pm 1)=0.5 for i=1,\ldots,n. Then, the symmetrization inequality (see Lemma 3.12 in Sen (2018)), Dudley's entropy integral (see (41) in Sen (2018)) and (24) imply

I1,1\displaystyle I_{1,1} 𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)bi||πλn)(symmetrization technique)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})}b_{i}\right|\Big{|}\pi_{\lambda_{n}}\right)\ \ \text{(symmetrization technique)}
𝐄πλn(24n0ν(n)ln𝒩1(ε,n,v1n)𝑑ε|πλn)(Dudley’s entropy integral)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\frac{24}{\sqrt{n}}\int_{0}^{\nu(n)}\sqrt{\ln\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)\ \ \qquad\qquad\text{(Dudley's entropy integral)}
cn𝐄πλn(0ν(n)ln(1+c(βn/ε)2VC(𝒢(K(λn)))𝑑ε|πλn)(E.q. (24))\displaystyle\leq\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{\nu(n)}\sqrt{\ln(1+c\cdot(\beta_{n}/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)(\text{E.q. \eqref{789nsvbfc}})
βncn𝐄πλn(0ν(n)/βnln(1+c(1/ε)2VC(𝒢(K(λn)))𝑑ε|πλn),\displaystyle\leq\beta_{n}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{\nu(n)/\beta_{n}}\sqrt{\ln(1+c(1/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right), (25)

where we use the fact that \pi_{\lambda_{n}} is independent of the data set {\mathcal{D}}_{n}, and c>0 is a universal constant. Without loss of generality, we can assume \nu(n)/\beta_{n}<1; otherwise we simply replace \beta_{n} by \beta_{n}^{\prime}=\max\{\nu(n),\beta_{n}\} as the new upper bound of the function class \mathcal{G}(K(\lambda_{n}),\beta_{n}). Therefore, (25) also implies

I1,1\displaystyle I_{1,1} max{βn,ν(n)}cn𝐄πλn(01ln(1+c(1/ε)2VC(𝒢(K(λn)))𝑑ε|πλn)\displaystyle\leq\max\{\beta_{n},\nu(n)\}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{1}\sqrt{\ln(1+c(1/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)
max{βn,ν(n)}cn𝐄πλn(VC(𝒢(K(λn))n)01ln(1/ε)𝑑ε\displaystyle\leq\max\{\beta_{n},\nu(n)\}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{VC(\mathcal{G}(K(\lambda_{n}))}{n}}\right)\cdot\int_{0}^{1}\sqrt{\ln(1/\varepsilon)}d\varepsilon
cmax{βn,ν(n)}𝐄πλn(VC(𝒢(K(λn))n).\displaystyle\leq c\cdot\max\{\beta_{n},\nu(n)\}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{VC(\mathcal{G}(K(\lambda_{n}))}{n}}\right). (26)

It remains to find the VC dimension of the class \mathcal{G}(t) for each t\in\mathbb{Z}_{+}. This result is summarized below.

Lemma 4.

For each integer t+t\in\mathbb{Z}_{+}, VC(𝒢(t))c(d)tln(t)VC(\mathcal{G}(t))\leq c(d)\cdot t\ln(t).

Proof.

Recall two defined classes:

𝒯(t):={A Mondrian tree with t leaves by partitioning [0,1]d}\mathcal{T}(t):=\{\text{A Mondrian tree with $t$ leaves by partitioning $[0,1]^{d}$}\}
𝒢(t):={j=1t𝕀(x𝒞j)cj:cj,𝒞jsare leaves of a tree in 𝒯(t)}.\mathcal{G}(t):=\Big{\{}\sum_{j=1}^{t}{{\mathbb{I}}(x\in{\small\mathcal{C}}_{j})}\cdot c_{j}:c_{j}\in{\mathbb{R}},\ {\small\mathcal{C}}_{j}\ ^{\prime}s\ \text{are leaves of a tree in $\mathcal{T}(t)$}\Big{\}}.

We bound the VC dimension of the class \mathcal{G}(t) through an associated Boolean class. Define

\mathcal{F}_{t}=\{\mathrm{sgn}(f(x,y)):f(x,y)=h(x)-y,\ h\in\mathcal{G}(t)\},

where \mathrm{sgn}(v)=1 if v\geq 0 and \mathrm{sgn}(v)=-1 otherwise. Recall that the VC dimension of \mathcal{F}_{t}, denoted by VC(\mathcal{F}_{t}), is the largest integer m\in\mathbb{Z}_{+} satisfying 2^{m}\leq\Pi_{\mathcal{F}_{t}}(m) (see, for example, Kosorok (2008)). Therefore, we focus on bounding \Pi_{\mathcal{F}_{t}}(m) for each positive integer m\in\mathbb{Z}_{+}. Let z_{1},\ldots,z_{m}\in{\mathbb{R}}^{d} be points attaining the maximum in the definition of \Pi_{\mathcal{F}_{t}}(m). Under the above notation, we make the following two observations.

  • For any h_{t}\in\mathcal{G}(t) that is constant on each cell \mathcal{C}_{j}, j=1,\ldots,t, there exist h_{t-1}\in\mathcal{G}(t-1) and a leaf \mathcal{C} of a tree in \mathcal{T}(t-1) such that \mathcal{C}=\mathcal{C}_{j}\cup\mathcal{C}_{j^{\prime}} for some j,j^{\prime}; moreover, h_{t-1} is constant on \mathcal{C} and on each cell in \{\mathcal{C}_{k}\}_{k=1}^{t}\setminus\{\mathcal{C}_{j},\mathcal{C}_{j^{\prime}}\}.

  • All half-spaces in \mathbb{R}^{d} pick out at most \left({me}/{(d+1)}\right)^{d+1} subsets from \{z_{1},\ldots,z_{m}\} when m\geq d+1 (see, e.g., Kosorok (2008)); namely,

    Card({{z1,,zm}{xd:θTxs}:θΘd,s})(me/(d+1))d+1.Card(\{\{z_{1},\ldots,z_{m}\}\cap\{x\in{\mathbb{R}}^{d}:\theta^{T}x\leq s\}:\theta\in\Theta^{d},s\in\mathbb{R}\})\leq\left({me}/{(d+1)}\right)^{d+1}.

Based on the above two facts, we can conclude

Πt(m)Πt1(m)(med+1)d+1.\Pi_{\mathcal{F}_{t}}(m)\leq\Pi_{\mathcal{F}_{t-1}}(m)\cdot\left(\frac{me}{d+1}\right)^{d+1}. (27)

Then, combination of (27) and Π1(m)(med+1)d+1\Pi_{\mathcal{F}_{1}}(m)\leq\left(\frac{me}{d+1}\right)^{d+1} implies that

Πt(m)(med+1)td+t.\Pi_{\mathcal{F}_{t}}(m)\leq\left(\frac{me}{d+1}\right)^{t\cdot d+t}. (28)

Solving the inequality

2m(med+1)td+t2^{m}\leq\left(\frac{me}{d+1}\right)^{t\cdot d+t}

by using the basic inequality lnxγxlnγ1\ln x\leq\gamma\cdot x-\ln\gamma-1 with x,γ>0x,\gamma>0 yields

VC(𝒢(t))4ln2d(t+1)ln(2d(t+1))c(d)tln(t),VC(\mathcal{G}({t}))\leq\frac{4}{\ln 2}\cdot d(t+1)\ln{\left(2d(t+1)\right)}\leq c(d)\cdot t\ln(t), (29)

where the constant c(d)c(d) depends on dd only. ∎
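For concreteness, one way to carry out the last step above (with a slightly different but equally usable constant) is the following: taking logarithms in the displayed inequality gives

m\ln 2\leq t(d+1)\left(1+\ln m-\ln(d+1)\right)\leq t(d+1)(1+\ln m),

and applying \ln m\leq\gamma m-\ln\gamma-1 with \gamma=\frac{\ln 2}{2t(d+1)} yields

m\ln 2\leq\frac{\ln 2}{2}\,m+t(d+1)\ln\frac{2t(d+1)}{\ln 2},\qquad\text{so that}\qquad m\leq\frac{2}{\ln 2}\,t(d+1)\ln\frac{2t(d+1)}{\ln 2},

which is again of order c(d)\cdot t\ln(t) for t\geq 2.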

Therefore, we know from Lemma 4 and (29) that

I1,1cmax{βn,ν(n)}𝐄πλn(K(λn)lnK(λn)n).I_{1,1}\leq c\cdot\max\{\beta_{n},\nu(n)\}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{K(\lambda_{n})\ln K(\lambda_{n})}{n}}\right).

By the basic inequality \ln x\leq x^{\beta}/\beta for all x\geq 1 and \beta>0 (applied here with \beta=1), the above inequality gives

I1,1cmax{βn,ν(n)}n𝐄(K(λn)).I_{1,1}\leq c\cdot\frac{\max\{\beta_{n},\nu(n)\}}{\sqrt{n}}\cdot{\mathbf{E}}(K(\lambda_{n})). (30)

Next, from Proposition 2 in Mourtada et al. (2020), we know 𝐄(K(λn))=(1+λn)d{\mathbf{E}}(K(\lambda_{n}))=(1+\lambda_{n})^{d}. Finally, we have the following upper bound for I1,1I_{1,1}

I1,1cmax{βn,ν(n)}n(1+λn)d.I_{1,1}\leq c\cdot\frac{\max\{\beta_{n},\nu(n)\}}{\sqrt{n}}\cdot(1+\lambda_{n})^{d}. (31)
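As an illustrative sanity check (not used in the proof), one can simulate the Mondrian process and compare the Monte Carlo average of K(\lambda) with (1+\lambda)^{d}. The sketch below assumes the standard recursive construction of the Mondrian process (Roy and Teh, 2008): a cell is split after an exponential waiting time whose rate is the sum of its side lengths, the split dimension is chosen with probability proportional to the side lengths, and the split location is uniform on the chosen side.

import random

def mondrian_leaf_count(lifetime, lower, upper, t=0.0):
    """Count the leaves of a Mondrian partition of the box prod_j [lower[j], upper[j]]
    generated with the given lifetime (stopping time)."""
    sides = [u - l for l, u in zip(lower, upper)]
    rate = sum(sides)
    wait = random.expovariate(rate) if rate > 0 else float("inf")
    if t + wait > lifetime:
        return 1  # the cell is never split before the lifetime expires: it is a leaf
    j = random.choices(range(len(sides)), weights=sides)[0]  # split dimension
    s = random.uniform(lower[j], upper[j])                   # split location
    left_upper, right_lower = list(upper), list(lower)
    left_upper[j], right_lower[j] = s, s
    return (mondrian_leaf_count(lifetime, lower, left_upper, t + wait)
            + mondrian_leaf_count(lifetime, right_lower, upper, t + wait))

d, lam, reps = 2, 3.0, 2000
avg = sum(mondrian_leaf_count(lam, [0.0] * d, [1.0] * d) for _ in range(reps)) / reps
print(avg, (1 + lam) ** d)  # the two numbers should be close (here (1+3)^2 = 16)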

Then, we bound the second part I1,2I_{1,2} of I1I_{1} by following the arguments below.

I1,2\displaystyle I_{1,2} 𝐄πλn(supg𝒢(K(λn),βn)𝐄(|(g(X),TlnnY)(g(X),Y)|)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),T_{\ln{n}}Y)-\ell(g(X),Y)|)\Big{|}\pi_{\lambda_{n}}\right)
=𝐄πλn(supg𝒢(K(λn),βn)𝐄(|(g(X),lnn)(g(X),Y)|𝕀({|Y|>lnn}))|πλn)\displaystyle={\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),\ln n)-\ell(g(X),Y)|\cdot\mathbb{I}(\{|Y|>\ln n\}))\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn((supg𝒢(K(λn),βn)𝐄(|(g(X),lnn)(g(X),Y)|2))12|πλn)𝐏12(|Y|>lnn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),\ln n)-\ell(g(X),Y)|^{2})\right)^{\frac{1}{2}}\Big{|}\pi_{\lambda_{n}}\right){\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n)
(supx[βn,βn]2(x,lnn)+𝐄(M22(βn,Y)))122exp(ln2n/4),\displaystyle\leq\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{\ell^{2}(x,\ln n)}+{\mathbf{E}}(M^{2}_{2}(\beta_{n},Y))\right)^{\frac{1}{2}}\cdot\sqrt{2}\exp(-\ln^{2}n/4), (32)

where the third line uses the Cauchy–Schwarz inequality, and the last line uses \ell(x,y)\leq M_{2}(x,y) from Assumption 3 together with the sub-Gaussian property of Y.

Finally, we end the Analysis of Part I by bounding I_{2}. This can be done as follows.

\displaystyle I_{2}={\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)(1ni=1n(g(Xi),Yi)+𝐄((g(X),Y)))𝕀(Anc)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left(\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})+{\mathbf{E}}(\ell(g(X),Y))}\right)\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟n(1ni=1nM2(βn,Yi)+𝐄(M2(βn,Y)))𝕀(Anc)|πλn)(Assumption3)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\frac{1}{n}\sum_{i=1}^{n}{M_{2}(\beta_{n},Y_{i})+{\mathbf{E}}(M_{2}(\beta_{n},Y))}\right)\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)\ \ (Assumption\ \ref{assump3})
2𝐄(M22(βn,Y))𝐏(Anc).\displaystyle\leq 2\cdot\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\cdot{\mathbf{P}}(A_{n}^{c}). (33)

Thus, we only need to find the upper bound of 𝐏(Anc){\mathbf{P}}(A_{n}^{c}). By some calculations, we know

𝐏(Anc)\displaystyle{\mathbf{P}}(A_{n}^{c}) =1𝐏(max1in|Yi|lnn)=1[𝐏(|Yi|lnn)]n1(1cecln2n)n\displaystyle=1-{\mathbf{P}}\left(\max_{1\leq i\leq n}|Y_{i}|\leq\ln n\right)=1-\left[{\mathbf{P}}(|Y_{i}|\leq\ln n)\right]^{n}\leq 1-(1-c\cdot e^{-c\cdot\ln^{2}n})^{n}
1enln(1cecln2n)nln(1cecln2n)cnecln2n\displaystyle\leq 1-e^{n\cdot\ln(1-c\cdot e^{-c\cdot\ln^{2}n})}\leq-n\cdot\ln(1-c\cdot e^{-c\cdot\ln^{2}{n}})\leq c^{\prime}\cdot n\cdot e^{-c\cdot\ln^{2}{n}} (34)

for some c>0c>0 and c>0c^{\prime}>0. Therefore, (33) and (34) imply

I2c𝐄(M22(βn,Y))necln2n.I_{2}\leq c\cdot\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\cdot n\cdot e^{-c\cdot\ln^{2}{n}}. (35)

Analysis of Part II. Recall

II:=𝐄(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h(Xi),Yi)),II:={\mathbf{E}}\left(\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right),

which relates to the empirical approximation error of the Mondrian tree. First, suppose the truncated Mondrian process of the first tree, with stopping time \lambda_{n}, is given and denoted by \pi_{1,\lambda_{n}}. Conditionally on \pi_{1,\lambda_{n}}, the partition of [0,1]^{d} is determined and is denoted by \{\mathcal{C}_{1,\lambda_{n},j}\}_{j=1}^{K_{1}(\lambda_{n})}. Let

Δn\displaystyle\Delta_{n} :=𝐄𝒟n(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h(Xi),Yi))\displaystyle:={\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right)
=𝐄𝒟n(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h1,n(Xi),Yi)\displaystyle={\mathbf{E}}_{{\mathcal{D}}_{n}}\bigg{(}\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})
+1ni=1n(h1,n(Xi),Yi)1ni=1n(h(Xi),Yi)),\displaystyle+\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\bigg{)},

where h_{1,n}^{*}(x):=\sum_{j=1}^{K_{1}(\lambda_{n})}{\mathbb{I}}(x\in\mathcal{C}_{1,\lambda_{n},j})h(x_{1,\lambda_{n},j}) and x_{1,\lambda_{n},j} denotes the center of the cell \mathcal{C}_{1,\lambda_{n},j}. Note that \Delta_{n} depends on \pi_{1,\lambda_{n}}. Since \hat{h}_{1,n} is obtained by

c^b,λ,j=argminz[βn,βn]i:Xi𝒞b,λ,j(z,Yi),\hat{c}_{b,\lambda,j}=\arg\min_{z\in[-\beta_{n},\beta_{n}]}{\sum_{i:X_{i}\in{\small\mathcal{C}}_{b,\lambda,j}}}{\ell(z,Y_{i})},

we know

1ni=1n(h^1,n(Xi),Yi)1ni=1n(h1,n(Xi),Yi)0\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})\leq 0 (36)

once \beta_{n}>C, since h_{1,n}^{*} then takes values in [-C,C]\subseteq[-\beta_{n},\beta_{n}] and \hat{h}_{1,n} minimizes the empirical loss over [-\beta_{n},\beta_{n}] within each cell. At this point, we consider two cases for \Delta_{n}:

Case I: \Delta_{n}\leq 0. This case is trivial, since the upper bound derived in (38) below is nonnegative.

Case II: Δn>0\Delta_{n}>0. In this case, (36) implies

Δn\displaystyle\Delta_{n} 𝐄𝒟n|1ni=1n(h1,n(Xi),Yi)1ni=1n(h(Xi),Yi)|\displaystyle\leq{\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right|
1n𝐄𝒟n(i=1n|(h1,n(Xi),Yi)(h(Xi),Yi)|)\displaystyle\leq\frac{1}{n}{\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\sum_{i=1}^{n}\left|\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\ell(h(X_{i}),Y_{i})\right|\right)
𝐄X,Y(|(h1,n(X),Y)(h(X),Y)|).\displaystyle\leq{\mathbf{E}}_{X,Y}\left(|\ell(h_{1,n}^{*}(X),Y)-\ell(h(X),Y)|\right).

Let Dλ(X)D_{\lambda}(X) be the diameter of the cell that XX lies in. By Assumption 2, the above inequality further implies

Δn\displaystyle\Delta_{n} 𝐄X,Y(M1(C,Y)|h1,n(X)h(X)|)\displaystyle\leq{\mathbf{E}}_{X,Y}(M_{1}(C,Y)|h_{1,n}^{*}(X)-h(X)|)
𝐄X,Y(M1(C,Y)CDλn(X)β)\displaystyle\leq{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot C\cdot D_{\lambda_{n}}(X)^{\beta})
C𝐄X,Y(M1(C,Y)Dλn(X)β)\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta})
C𝐄X,Y(M1(C,Y)Dλn(X)β𝕀(|Y|lnn)+M1(C,Y)Dλn(X)β𝕀(|Y|>lnn))\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta}\mathbb{I}(|Y|\leq\ln n)+M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta}\mathbb{I}(|Y|>\ln n))
C𝐄X,Y(supy[lnn,lnn]M1(C,y)Dλn(X)β+M1(C,Y)dβ2𝕀(|Y|>lnn))\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot D_{\lambda_{n}}(X)^{\beta}+M_{1}(C,Y)\cdot d^{\frac{\beta}{2}}\cdot\mathbb{I}(|Y|>\ln n))
C(supy[lnn,lnn]M1(C,y)𝐄X(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn))),\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right), (37)

where the second line holds because h\in\mathcal{H}^{p,\beta}([0,1]^{d},C) satisfies |h(x)-h(x^{\prime})|\leq C\|x-x^{\prime}\|^{\beta}, the fifth line uses D_{\lambda_{n}}(X)\leq\sqrt{d}\ a.s., and the sixth line uses the Cauchy–Schwarz inequality.

Therefore, Case I and (37) in Case II imply that

ΔnC(supy[lnn,lnn]M1(C,y)𝐄X(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))a.s..\Delta_{n}\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)\ a.s.. (38)

Taking expectations on both sides of (38) with respect to the Mondrian process \pi_{1,\lambda_{n}} yields

II\displaystyle II C(supy[lnn,lnn]M1(C,y)𝐄X,λn(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X,\lambda_{n}}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)
C(supy[lnn,lnn]M1(C,y)𝐄X𝐄λn(Dλn(x)β|X=x)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)^{\beta}|X=x)+d^{\frac{\beta}{2}}\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)
C(supy[lnn,lnn]M1(C,y)𝐄X[(𝐄λn(Dλn(x)|X=x))β]+dβ2𝐄M12(C,Y)2exp(ln2n/4)),\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}\left[({\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x))^{\beta}\right]+d^{\frac{\beta}{2}}\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot\sqrt{2}\exp(-\ln^{2}n/4)\right), (39)

where \beta\in(0,1], and in the last line we use Jensen's inequality together with the concavity of the function v\mapsto v^{\beta} on (0,\infty). For any fixed x\in[0,1]^{d}, we can bound {\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x) by using Corollary 1 in Mourtada et al. (2020). In detail, we have

𝐄λn(Dλn(x)|X=x)\displaystyle{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x) 0d(1+λnδd)exp(λnδd)𝑑δ\displaystyle\leq\int_{0}^{\infty}{d\left(1+\frac{\lambda_{n}\delta}{\sqrt{d}}\right)\exp\left(-\frac{\lambda_{n}\delta}{\sqrt{d}}\right)d\delta}
2d321λn.\displaystyle\leq 2d^{\frac{3}{2}}\cdot\frac{1}{\lambda_{n}}. (40)
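For completeness, the substitution u=\lambda_{n}\delta/\sqrt{d} evaluates this integral exactly:

\int_{0}^{\infty}d\left(1+\frac{\lambda_{n}\delta}{\sqrt{d}}\right)\exp\left(-\frac{\lambda_{n}\delta}{\sqrt{d}}\right)d\delta=\frac{d^{3/2}}{\lambda_{n}}\int_{0}^{\infty}(1+u)e^{-u}du=\frac{2d^{3/2}}{\lambda_{n}}.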

Thus, the combination of (39) and (40) implies that

IIC(2d32βsupy[lnn,lnn]M1(C,y)λnβ+dβ2𝐄M12(C,Y)2exp(ln2n/4)).II\leq C\cdot\left(2d^{\frac{3}{2}\beta}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda_{n}^{-\beta}+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot\sqrt{2}\exp(-\ln^{2}n/4)\right). (41)

Analysis of Part III. This part can be bounded by a standard empirical process argument. Since \|h\|_{\infty}\leq C, we know by Assumption 3 that

(h(x),y)supv[C,C](v,y)M2(C,y),x[0,1]d,y\ell(h(x),y)\leq\sup_{v\in[-C,C]}\ell(v,y)\leq M_{2}(C,y),\ \forall x\in[0,1]^{d},y\in{\mathbb{R}}

with {\mathbf{E}}(M^{2}_{2}(C,Y))<\infty. Thus, M_{2}(C,y) is an envelope function of the singleton class \{h\}. Note that a single function forms a Glivenko–Cantelli class with VC dimension 1. Thus, an application of equation (80) in Sen (2018) implies

III:=𝐄(R^(h)R(h))cn𝐄(M22(C,Y))III:={\mathbf{E}}(\hat{R}(h)-R(h))\leq\frac{c}{\sqrt{n}}\cdot\sqrt{{\mathbf{E}}(M^{2}_{2}(C,Y))} (42)

for some universal c>0c>0.

Finally, the combination of (31), (32), (35), (41) and (42) completes the proof. \Box
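For readers who prefer an algorithmic view of the objects appearing in the proof, the following Python sketch fits the per-cell constants \hat{c}_{b,\lambda,j} by empirical risk minimization over [-\beta_{n},\beta_{n}] on a fixed toy partition. It is illustrative only: the two-cell partition stands in for a Mondrian partition, and the squared loss stands in for a general convex loss \ell, for which (whenever an unconstrained minimizer exists) the minimizer over an interval is its projection onto that interval.

import numpy as np

def fit_cell_constants(X, Y, cells, beta_n):
    """For each hyperrectangular cell, minimize sum_i l(z, Y_i) over z in [-beta_n, beta_n].
    With the squared loss, the constrained minimizer is the cell mean clipped to [-beta_n, beta_n]."""
    fitted = []
    for lower, upper in cells:
        in_cell = np.all((X >= lower) & (X <= upper), axis=1)
        c_hat = float(np.clip(Y[in_cell].mean(), -beta_n, beta_n)) if in_cell.any() else 0.0
        fitted.append((lower, upper, c_hat))
    return fitted

def predict(fitted, x):
    """Evaluate the fitted piecewise-constant tree function at a single point x."""
    for lower, upper, c_hat in fitted:
        if np.all((x >= lower) & (x <= upper)):
            return c_hat
    return 0.0

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)
cells = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),   # toy partition: split [0,1]^2 at x_1 = 0.5
         (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
fitted = fit_cell_constants(X, Y, cells, beta_n=np.log(200))
print(predict(fitted, np.array([0.25, 0.5])), predict(fitted, np.array([0.75, 0.5])))

Averaging the predictions of B such trees, each built on an independently drawn Mondrian partition, gives the forest estimator \hat{h}_{n} analyzed above.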

Proof of Corollary 1. This corollary follows directly from Theorem 1 and Assumption 5. \Box

Proof of Corollary 2. The proof starts from the observation that our class \mathcal{H}^{p,\beta}([0,1]^{d},C) can be used to approximate any general target function. Since m(x)\in\{f(x):{\mathbf{E}}f^{2}(X)<\infty\}, we know m(x) can be approximated by a sequence of continuous functions in the L^{2} sense. Thus, we can assume that m(x), x\in[0,1]^{d}, is continuous. Define the function \sigma_{log}(x)=e^{x}/(1+e^{x}), x\in{\mathbb{R}}. For any \varepsilon>0, by Lemma 16.1 in Györfi et al. (2002) there is h_{\varepsilon}(x)=\sum_{j=1}^{J}a_{\varepsilon,j}\sigma_{log}(\theta_{\varepsilon,j}^{\top}x+s_{\varepsilon,j}), x\in[0,1]^{d}, with a_{\varepsilon,j},s_{\varepsilon,j}\in{\mathbb{R}} and \theta_{\varepsilon,j}\in{\mathbb{R}}^{d} such that

{\mathbf{E}}\left(m(X)-h_{\varepsilon}(X)\right)^{2}\leq\sup_{x\in[0,1]^{d}}(m(x)-h_{\varepsilon}(x))^{2}\leq\frac{\varepsilon}{3}. (43)

After some simple calculation, we know hε(x)1,1([0,1]d,C(hε))h_{\varepsilon}(x)\in\mathcal{H}^{1,1}([0,1]^{d},C(h_{\varepsilon})), where C(hε)>0C(h_{\varepsilon})>0 depends on hεh_{\varepsilon} only. Now we fix such hε(x)h_{\varepsilon}(x). Make the decomposition as follows

𝐄R(h^1,n)R(m)=\displaystyle{\mathbf{E}}R(\hat{h}_{1,n})-R(m)= 𝐄(R(h^1,n)R^(h^1,n))+𝐄(R^(h^n)R^(m))\displaystyle{\mathbf{E}}(R(\hat{h}_{1,n})-\hat{R}(\hat{h}_{1,n}))+{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(m))
+𝐄(R^(m)R(m))\displaystyle+{\mathbf{E}}(\hat{R}(m)-R(m))
:=I+II+III.\displaystyle:=I+II+III. (44)

Parts I and III can be upper bounded by arguments similar to those in the proof of Theorem 1. Therefore, under the assumptions of our theorem, both of these parts converge to zero as n\to\infty. Next, we consider Part II. Note that

𝐄(R^(h^n)R^(m))\displaystyle{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(m)) =𝐄(R^(h^n)R^(hε))+𝐄(R(hε)R(m))\displaystyle={\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+{\mathbf{E}}(R(h_{\varepsilon})-R(m))
𝐄(R^(h^n)R^(hε))+𝐄(hε(X)m(X))2\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+{\mathbf{E}}(h_{\varepsilon}(X)-m(X))^{2}
𝐄(R^(h^n)R^(hε))+cε3,\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+c\cdot\frac{\varepsilon}{3},

where in the second line we use Assumption 5. Finally, we only need to consider the behavior of the term {\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon})) as n\to\infty. This can be done by using the analysis of Part II in the proof of Theorem 1. Taking C=C(h_{\varepsilon}) in (41), we have

𝐄(R^(h^n)R^(hε))C(2d32supy[lnn,lnn]M1(C,y)λn1+d12𝐄M12(C,Y)2exp(ln2n/4)),{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))\leq C\cdot\left(2d^{\frac{3}{2}}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda_{n}^{-1}+d^{\frac{1}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot\sqrt{2}\exp(-\ln^{2}n/4)\right), (45)

which goes to zero as n increases. Since \varepsilon>0 is arbitrary, we conclude that

\lim_{n\to\infty}\left({\mathbf{E}}R(\hat{h}_{n})-R(m)\right)=0.

The above display and Assumption 5 show that \hat{h}_{n} is L^{2} consistent for the general target function m(x), x\in[0,1]^{d}. \Box

Proof of Theorem 2. By the convexity of \ell in Assumption 1, we only need to consider the regret of \hat{h}^{*}_{1,n}. For any \lambda>0, by the definition of \hat{h}^{*}_{1,n} we know

𝐄(R^(h^1,n))\displaystyle{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})) 𝐄(R^(h^1,n)+αnλn,1)\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})+\alpha_{n}\cdot\lambda_{n,1}^{*})
𝐄(R^(h^1,n,λ)+αnλ),\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n,\lambda})+\alpha_{n}\cdot\lambda), (46)

where h^1,n,λ\hat{h}_{1,n,\lambda} is the estimator based on the process MP1(λ,[0,1]d)MP_{1}(\lambda,[0,1]^{d}).

On the other hand, we have the decomposition below

𝐄R(h^1,n)R(h)\displaystyle{\mathbf{E}}R(\hat{h}_{1,n}^{*})-R(h) =𝐄(R(h^1,n)R^(h^1,n))+𝐄(R^(h^1,n)R^(h))\displaystyle={\mathbf{E}}(R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))+{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})-\hat{R}(h))
+𝐄(R^(h)R(h)):=I+II+III.\displaystyle+{\mathbf{E}}(\hat{R}(h)-R(h)):=I+II+III. (47)

Firstly, we bound Part I. Recall A_{n}:=\{\max_{1\leq i\leq n}|Y_{i}|\leq\ln{n}\}, which is defined below (20). Make the decomposition of I as follows.

I\displaystyle I =𝐄((R(h^1,n)R^(h^1,n))𝕀(An))+𝐄((R(h^1,n)R^(h^1,n))𝕀(Anc))\displaystyle={\mathbf{E}}((R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))\cap\mathbb{I}(A_{n}))+{\mathbf{E}}((R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))\cap\mathbb{I}(A_{n}^{c}))
:=I1,1+I1,2\displaystyle:=I_{1,1}+I_{1,2} (48)

The key to bounding I_{1,1} is to find an upper bound for \lambda_{n,1}^{*}. By the definition of \hat{h}_{1,n}^{*} and Assumption 3, we know that on the event A_{n}

αnλn,1Pen(0)supy[lnn,lnn]M2(βn,y).\alpha_{n}\cdot\lambda_{n,1}^{*}\leq Pen(0)\leq\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y).

Therefore, when AnA_{n} happens we have

λn,1supy[lnn,lnn]M2(βn,y)αn.\lambda_{n,1}^{*}\leq\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}.

Following the arguments used to bound I_{1,1} in the proof of Theorem 1, we know

|I1,1|\displaystyle|I_{1,1}| cmax{βn,𝐄(M22(βn,Y))}n(1+λn,1)d\displaystyle\leq c\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\cdot(1+\lambda_{n,1}^{*})^{d}
cmax{βn,𝐄(M22(βn,Y))}n(1+supy[lnn,lnn]M2(βn,y)αn)d\displaystyle\leq c\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\cdot\left(1+\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}\right)^{d} (49)

Next, I_{1,2} in (48) can be bounded similarly to I_{1,2} in the proof of Theorem 1. Namely, we have

|I1,2|(supx[βn,βn]2(x,lnn)+𝐄(M22(βn,Y)))122exp(ln2n/4).|I_{1,2}|\leq\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{\ell^{2}(x,\ln n)}+{\mathbf{E}}(M^{2}_{2}(\beta_{n},Y))\right)^{\frac{1}{2}}\cdot\sqrt{2}\exp(-\ln^{2}n/4). (50)

Secondly, we use (46) to bound Part II in (47). By the definition of \hat{h}_{1,n}^{*}, for any \lambda>0 we have

II:=𝐄(R^(h^1,n)R^(h))𝐄(R^(h^1,n,λ)R^(h)+αnλ).II:={\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})-\hat{R}(h))\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n,\lambda})-\hat{R}(h)+\alpha_{n}\cdot\lambda).

Similar to the Proof of Theorem 1, the above inequality implies

II\leq C\cdot\left(2d^{\frac{3}{2}p}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda^{-p}+d^{\frac{p}{2}}\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\cdot\sqrt{2}e^{-\frac{\ln^{2}n}{4}}\right)+\alpha_{n}\cdot\lambda. (51)

Since (51) holds for all \lambda>0, taking \lambda=\left(\frac{1}{\alpha_{n}}\right)^{1/(p+1)} in (51) further implies

II\displaystyle II (2Csupy[lnn,lnn]M1(C,y)d32p+1)(αn)pp+1+rn\displaystyle\leq(2C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}d^{\frac{3}{2}p}+1)\cdot\left(\alpha_{n}\right)^{\frac{p}{p+1}}+r_{n}
(2Csupy[lnn,lnn]M1(C,y)d32p+1)(αn)p2+rn,\displaystyle\leq(2C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}d^{\frac{3}{2}p}+1)\cdot\left(\alpha_{n}\right)^{\frac{p}{2}}+r_{n}, (52)

where rn:=Cd12p𝐄(M12(C,Y))2eln2n4r_{n}:=C\cdot d^{\frac{1}{2}p}\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\cdot\sqrt{2}e^{-\frac{\ln^{2}n}{4}} is caused by the sub-Gaussian property of YY.
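The particular choice of \lambda above balances the two \lambda-dependent terms in (51): with \lambda=(1/\alpha_{n})^{1/(p+1)},

\lambda^{-p}=\alpha_{n}^{\frac{p}{p+1}}\qquad\text{and}\qquad\alpha_{n}\cdot\lambda=\alpha_{n}^{1-\frac{1}{p+1}}=\alpha_{n}^{\frac{p}{p+1}},

so both contribute at the rate \alpha_{n}^{p/(p+1)}, which gives the first line of (52).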

Thirdly, we consider Part III. The argument is the same as that used to obtain (42). Namely, we have

III:=𝐄(R^(h)R(h))cn𝐄(M22(C,Y)),III:={\mathbf{E}}(\hat{R}(h)-R(h))\leq\frac{c}{\sqrt{n}}\cdot\sqrt{{\mathbf{E}}(M^{2}_{2}(C,Y))}, (53)

where cc is universal and does not depend on CC.

Finally, the combination of (49), (50), (52) and (53) finishes the proof. \Box

Proof of Lemma 2. Note that Pen(h):=ln[0,1]dexp(h(x))𝑑xPen(h):=\ln\int_{[0,1]^{d}}{\exp{(h(x))}dx}. Define a real function as follows:

g(α):=Pen((1α)h+αhn), 0α1.g(\alpha):=Pen((1-\alpha)\cdot h+\alpha\cdot h_{n}^{*}),\ 0\leq\alpha\leq 1.

Thus, we have g(0)=Pen(h) and g(1)=Pen(h_{n}^{*}), and it will be convenient to work with g in the analysis of this penalty. Since both h and h_{n}^{*} are bounded, we know g(\alpha) is differentiable and its derivative is

\frac{d}{d\alpha}g(\alpha)=\frac{\int_{[0,1]^{d}}{(h_{n}^{*}(x)-h(x))\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}dx}}{\int_{[0,1]^{d}}{\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}dx}}. (54)

Define a continuous random vector ZαZ_{\alpha} with the density function

f_{Z_{\alpha}}(x):=\frac{\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}}{\int_{[0,1]^{d}}{\exp{(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x)))}dx}},\ x\in[0,1]^{d}. (55)

From (54) and (55), we know

ddαg(α)=𝐄Zα(hn(Zα)h(Zα)).\frac{d}{d\alpha}g(\alpha)={\mathbf{E}}_{Z_{\alpha}}(h_{n}^{*}(Z_{\alpha})-h(Z_{\alpha})). (56)

On the other hand, the Lagrange mean value theorem implies

|Pen(h)Pen(hn(x))|\displaystyle|Pen(h)-Pen(h_{n}^{*}(x))| =|g(0)g(1)|\displaystyle=|g(0)-g(1)|
=|ddαg(α)|α=α|\displaystyle=\left|\frac{d}{d\alpha}g(\alpha)|_{\alpha=\alpha^{*}}\right|
\displaystyle\leq{\mathbf{E}}_{Z_{\alpha^{*}}}(|h_{n}^{*}(Z_{\alpha^{*}})-h(Z_{\alpha^{*}})|), (57)

where \alpha^{*}\in[0,1]. Thus, it remains to bound the last term of (57). Since f_{Z_{\alpha^{*}}}(x)\leq\exp{(2C)} for all x\in[0,1]^{d} and all \alpha^{*}\in[0,1], we know from (57) that

|Pen(h)Pen(hn(x))|exp(2C)𝐄U(|hn(U)h(U)|),|Pen(h)-Pen(h_{n}^{*}(x))|\leq\exp{(2C)}\cdot{\mathbf{E}}_{U}(|h_{n}^{*}(U)-h(U)|), (58)

where U follows the uniform distribution on [0,1]^{d} and is independent of \pi_{\lambda}. By further calculation, we have

𝐄πλ|Pen(h)Pen(hn(x))|\displaystyle{\mathbf{E}}_{\pi_{\lambda}}|Pen(h)-Pen(h_{n}^{*}(x))| exp(2C)𝐄πλ𝐄U(|hn(U)h(U)|)\displaystyle\leq\exp(2C)\cdot{\mathbf{E}}_{\pi_{\lambda}}{\mathbf{E}}_{U}(|h_{n}^{*}(U)-h(U)|)
exp(2C)𝐄U𝐄πλ(CDλn(U)β)\displaystyle\leq\exp(2C)\cdot{\mathbf{E}}_{U}{\mathbf{E}}_{\pi_{\lambda}}(C\cdot D_{\lambda_{n}}(U)^{\beta})
exp(2C)𝐄U[(𝐄λn(Dλn(u)|U=u))β].\displaystyle\leq\exp(2C)\cdot{\mathbf{E}}_{U}\left[({\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(u)|U=u))^{\beta}\right].

From (40), we already know 𝐄λn(Dλn(u)|U=u)2d321λn{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(u)|U=u)\leq 2d^{\frac{3}{2}}\cdot\frac{1}{\lambda_{n}}. Thus, above inequality implies

𝐄πλ|Pen(h)Pen(hn(x))|exp(2C)2βd32β(1λn)β.{\mathbf{E}}_{\pi_{\lambda}}|Pen(h)-Pen(h_{n}^{*}(x))|\leq\exp(2C)\cdot 2^{\beta}d^{\frac{3}{2}\beta}\cdot\left(\frac{1}{\lambda_{n}}\right)^{\beta}.

This completes the proof. \Box

Proof of Lemma 3. First, we calculate the term \frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g), where \int{g(x)dx}=0. With some calculation, it is not difficult to show that

\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g)=Var\left(g(X_{\alpha})\right), (59)

where X_{\alpha} is a continuous random vector in [0,1]^{d} with density f_{X_{\alpha}}(x)=\exp(g_{\alpha}(x))/\int{\exp(g_{\alpha}(x))dx}, where g_{\alpha}(x)=h_{0}(x)+\alpha g(x). Since \|h_{0}\|_{\infty}\leq c and \|g\|_{\infty}\leq\beta_{n}, we can assume without loss of generality that \|g_{\alpha}\|_{\infty}\leq\beta_{n}. This implies that

exp(2βn)fXα(x)exp(2βn),x[0,1]d.\exp{(-2\beta_{n})}\leq f_{X_{\alpha}}(x)\leq\exp{(2\beta_{n})},\ \forall x\in[0,1]^{d}. (60)
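Indeed, (60) follows because \|g_{\alpha}\|_{\infty}\leq\beta_{n} bounds both the numerator and the normalizing constant of f_{X_{\alpha}}:

e^{-\beta_{n}}\leq\exp(g_{\alpha}(x))\leq e^{\beta_{n}}\qquad\text{and}\qquad e^{-\beta_{n}}\leq\int_{[0,1]^{d}}\exp(g_{\alpha}(x))dx\leq e^{\beta_{n}},

so that the ratio lies in [\exp(-2\beta_{n}),\exp(2\beta_{n})].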

Let U follow the uniform distribution on [0,1]^{d}. Then (60) implies

Var(g(Xα))\displaystyle Var\left(g(X_{\alpha})\right) =infc>0𝐄(g(Xα)c)2\displaystyle=\inf_{c>0}{\mathbf{E}}(g(X_{\alpha})-c)^{2}
exp(2βn)infc>0𝐄(g(U)c)2\displaystyle\leq\exp{(2\beta_{n})}\cdot\inf_{c>0}{\mathbf{E}}(g(U)-c)^{2}
=exp(2βn)Var(g(U))=exp(2βn)𝐄(g2(U)).\displaystyle=\exp{(2\beta_{n})}\cdot Var(g(U))=\exp{(2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(U)). (61)

With the same argument, we also have

Var(g(Xα))exp(2βn)𝐄(g2(U)).Var\left(g(X_{\alpha})\right)\geq\exp{(-2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(U)). (62)

The combination of (59), (61) and (62) shows that

cexp(2βn)𝐄(g2(X))d2dα2R(h0+αg)c1exp(2βn)𝐄(g2(X))c\cdot\exp{(-2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(X))\leq\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g)\leq c^{-1}\cdot\exp{(2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(X)) (63)

for some universal c>0c>0 and any α[0,1]\alpha\in[0,1]. Finally, by Taylor expansion, we have

R(h)=R(h_{0})+\frac{d}{d\alpha}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=0}+\frac{1}{2}\cdot\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=\alpha^{*}}

for some \alpha^{*}\in[0,1]. Without loss of generality, we can assume that \|h-h_{0}\|_{\infty}\leq\beta_{n}. Thus, the second derivative \frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=\alpha^{*}} can be bounded by using (63) with g=h-h_{0}. Meanwhile, the first derivative \frac{d}{d\alpha}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=0}=0 since h_{0} minimizes R(\cdot). Based on this analysis, we have

c1lnn𝐄(h(X)h0(X))2R(h)R(h0)c1lnn𝐄(h(X)h0(X))2c\cdot\frac{1}{\ln n}\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}\leq R(h)-R(h_{0})\leq c^{-1}\cdot\ln n\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}

for some universal c>0c>0. This completes the proof. \Box

References

  • Arlot and Genuer (2014) Arlot, S. and R. Genuer (2014). Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939.
  • Athey et al. (2019) Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of Statistics 47(2), 1148–1178.
  • Biau (2012) Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research 13(1), 1063–1095.
  • Biau et al. (2008) Biau, G., L. Devroye, and G. Lugosi (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research 9(9).
  • Blumer et al. (1989) Blumer, A., A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989). Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM) 36(4), 929–965.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.
  • Cattaneo et al. (2023) Cattaneo, M. D., J. M. Klusowski, and W. G. Underwood (2023). Inference with mondrian random forests. arXiv preprint arXiv:2310.09702.
  • Györfi et al. (2002) Györfi, L., M. Kohler, A. Krzyżak, and H. Walk (2002). A distribution-free theory of nonparametric regression, Volume 1. Springer.
  • Huang (1998) Huang, J. Z. (1998). Functional anova models for generalized regression. Journal of multivariate analysis 67(1), 49–71.
  • Klusowski (2021) Klusowski, J. M. (2021). Universal consistency of decision trees in high dimensions. arXiv preprint arXiv:2104.13881.
  • Knight (1998) Knight, K. (1998). Limiting distributions for l1 regression estimators under general conditions. Annals of statistics, 755–770.
  • Kosorok (2008) Kosorok, M. R. (2008). Introduction to empirical processes and semiparametric inference. Springer.
  • Lakshminarayanan et al. (2014) Lakshminarayanan, B., D. M. Roy, and Y. W. Teh (2014). Mondrian forests: Efficient online random forests. Advances in neural information processing systems 27.
  • Liaw et al. (2002) Liaw, A., M. Wiener, et al. (2002). Classification and regression by randomforest. R news 2(3), 18–22.
  • Mourtada et al. (2020) Mourtada, J., S. Gaïffas, and E. Scornet (2020). Minimax optimal rates for Mondrian trees and forests. The Annals of Statistics 48(4), 2253 – 2276.
  • Mourtada et al. (2021) Mourtada, J., S. Gaïffas, and E. Scornet (2021). Amf: Aggregated mondrian forests for online learning. Journal of the Royal Statistical Society Series B: Statistical Methodology 83(3), 505–533.
  • Roy (2011) Roy, D. M. (2011). Computability, inference and modeling in probabilistic programming. Ph. D. thesis, Massachusetts Institute of Technology.
  • Roy and Teh (2008) Roy, D. M. and Y. W. Teh (2008). The mondrian process. In Advances in neural information processing systems, pp. 1377–1384.
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics 48(4), 1875–1897.
  • Scornet et al. (2015) Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests. The Annals of Statistics 43(4), 1716–1741.
  • Sen (2018) Sen, B. (2018). A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University 11, 28–29.
  • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and S. Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • Stone (1986) Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 590–606.
  • Stone (1994) Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. The annals of statistics 22(1), 118–171.
  • Zhang and Wang (2009) Zhang, H. and M. Wang (2009). Search for the smallest random forest. Statistics and Its Interface 2(3), 381.