
Non-asymptotic Properties of Generalized Mondrian Forests in Statistical Learning

Haoran Zhan ([email protected])
Department of Statistics and Data Science, National University of Singapore, 117546, Singapore

Jingli Wang ([email protected])
School of Statistics and Data Science, KLMDASR, LEBPS, and LPMC, Nankai University, Tianjin, 300071, China

Yingcun Xia ([email protected])
Department of Statistics and Data Science, National University of Singapore, 117546, Singapore
Abstract

Random Forests have been extensively used in regression and classification, inspiring the development of various forest-based methods. Among these, Mondrian Forests, derived from the Mondrian process, mark a significant advancement. Expanding on Mondrian Forests, this paper presents a general framework for statistical learning, encompassing a range of common learning tasks such as least squares regression, $\ell_{1}$ regression, quantile regression, and classification. Under mild assumptions on the loss functions, we provide upper bounds on the regret/risk functions for the estimators and demonstrate their statistical consistency.

Keywords: ensemble learning, machine learning, random forests, regret function, statistical learning

1 Introduction

Random Forest (RF) (Breiman, 2001) is a popular ensemble learning technique in machine learning. It operates by constructing multiple decision trees and averaging their predictions to improve accuracy and robustness. Many empirical studies have demonstrated its powerful performance across various data domains (for example, see Liaw et al. (2002)). The effectiveness of RF is attributed to its data-dependent splitting rule, called CART. Briefly, CART is a greedy algorithm that identifies the best splitting variable and value by maximizing the reduction of training error between two layers. However, this data-dependent splitting scheme complicates the theoretical analysis of RF.

To the best of our knowledge, two papers have made significant contributions to the theoretical analysis of RF. Scornet et al. (2015) was the first to establish the statistical consistency of RF when the true regression function follows an additive model. Klusowski (2021) later demonstrated the consistency of RF under weaker assumptions on the distribution of predictors. However, both studies rely on the technical condition that the conditional mean has an additive structure.

To gain a deeper understanding of random forests, researchers have considered modified and stylized versions of RF. One such method is Purely Random Forests (PRF; see, for example, Arlot and Genuer (2014), Biau (2012) and Biau et al. (2008)), where individual trees are grown independently of the sample, making them well suited for theoretical analysis. In this paper, our interest is the Mondrian Forest, a PRF whose partitions are generated by the Mondrian process. This kind of forest was first introduced by Lakshminarayanan et al. (2014) and has competitive online performance in classification problems compared with other state-of-the-art methods. Inspired by this online property, Mourtada et al. (2021) studied online regression and classification using Mondrian forests. Beyond its impressive online performance, this data-independent partitioning method is also important for offline regression because it achieves a faster consistency rate than other partitioning schemes, such as the midpoint-cut strategy; see Mourtada et al. (2020). In fact, Mourtada et al. (2020) showed that the consistency rate of Mondrian forests is minimax optimal for the class of Hölder continuous functions. Then, Cattaneo et al. (2023), following the approach in Mourtada et al. (2020), derived the asymptotic normal distribution of the Mondrian forest for the offline regression problem. Recently, Baptista et al. (2024) proposed a novel dimension reduction method using Mondrian forests. It is thus evident that Mondrian forests have attracted increasing research attention due to their advantageous theoretical properties compared with other forest-based methods.

In this paper, we argue that Mondrian forests, beyond their application in classical regression and classification, can be extended to address a broader spectrum of statistical and machine learning problems, including generalized regression, density estimation, and quantile regression. Our main contributions are as follows:

  • First, we propose a general framework based on Mondrian forests that can be applied to various learning problems.

  • Second, we establish an upper bound for the regret (or risk) function of the proposed forest estimator. The theoretical results derived from this analysis are applicable to numerous learning scenarios, with several examples provided in Section 6.

Our training approach adopts a global perspective, contrasting with the local perspective utilized in the generalization of Random Forests by Athey et al. (2019). Specifically, after performing a single optimization on the entire dataset, our method can estimate the objective function $m(x)$ for all $x\in[0,1]^{d}$, while Athey et al. (2019) focus on estimating $m(x_{0})$ at a specific point $x_{0}$. This global approach significantly reduces computational time, especially in high-dimensional settings where $d$ is large.

Moreover, our globalized framework can be easily applied to statistical problems involving a penalization function $Pen(m)$, which depends on $m(x)$ over the entire domain $[0,1]^{d}$. In Section 6.6, we illustrate the application of our method to nonparametric density estimation, a problem that incorporates such penalization. In contrast, since Athey et al. (2019) relies on pointwise estimation, their method struggles to ensure that the resulting estimator satisfies shape constraints, such as those required for a valid probability density function, thus excluding these cases from their scope.

2 Background and Preliminaries

2.1 Task in Statistical Learning

Let $(X,Y)\in[0,1]^{d}\times{\mathbb{R}}$ be the random vector, where the range of $X$ has been normalized. In statistical learning, the goal is to find a policy $h$ supervised by $Y$, which is defined as a function $h:[0,1]^{d}\to{\mathbb{R}}$. Usually, a loss function $\ell(h(x),y):{\mathbb{R}}\times{\mathbb{R}}\to[0,\infty)$ is used to measure the difference, or loss, between the decision $h(x)$ and the goal $y$. Taking expectation w.r.t. $X,Y$, the risk function

R(h):={\mathbf{E}}(\ell(h(X),Y)) \qquad (1)

denotes the averaged loss incurred by using the policy $h$. Naturally, one selects the best policy $h^{*}$ by minimizing the averaged loss over some function class $\mathcal{H}_{1}$, namely,

h^{*}=\arg\min_{h\in\mathcal{H}_{1}}R(h).

Therefore, the policy $h^{*}$ has the minimal risk and is the best in theory. In practice, the distribution of $(X,Y)$ is unknown and (1) cannot be used to compute $R(h)$. Thus, such a best $h^{*}$ cannot be obtained directly. Instead, we can use i.i.d. data ${\mathcal{D}}_{n}=\{(X_{i},Y_{i})\}_{i=1}^{n}$ to approximate $R(h)$ by the law of large numbers, which gives the empirical risk function

\hat{R}(h):=\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i}).

Traditionally, an estimator/policy $\hat{h}_{n}:[0,1]^{d}\to{\mathbb{R}}$ is found by minimizing $\hat{R}(h)$ over a known function class; see spline regression and wavelet regression in Györfi et al. (2002) and regression by deep neural networks in Schmidt-Hieber (2020) and Kohler and Langer (2021).

Instead of globally minimizing $\hat{R}(h)$, tree-based greedy algorithms adopt a “local” minimization approach, which is widely used to construct the empirical estimator $\hat{h}_{n}$. These algorithms have demonstrated the advantages of tree-based estimators over many traditional statistical learning methods. Therefore, in this paper, we focus on bounding the regret function:

\varepsilon(\hat{h}_{n}):=R(\hat{h}_{n})-R(h^{*}),

where $\hat{h}_{n}$ represents an ensemble estimator built using Mondrian Forests.

2.2 Mondrian partitions

Mondrian partitions are a specific type of random tree partition, where the partitioning of $[0,1]^{d}$ is independent of the data points. This scheme is entirely governed by a stochastic process called the Mondrian process, denoted as $MP([0,1]^{d})$. The Mondrian process $MP([0,1]^{d})$, introduced by Roy and Teh (2008) and Roy (2011), is a probability distribution over infinite tree partitions of $[0,1]^{d}$. For a detailed definition, we refer to Definition 3 in Mourtada et al. (2020).

In this paper, we consider Mondrian partitions with stopping time $\lambda$, denoted by $MP(\lambda,[0,1]^{d})$ (see Section 1.3 in Mourtada et al. (2020)). Their construction consists of two steps. First, we construct partitions according to $MP([0,1]^{d})$ by iteratively splitting cells at random times, which depend on the linear dimension of each cell. The probability of splitting along each side is proportional to the side length of the cell, and the splitting position is chosen uniformly. Second, we remove the nodes whose birth time exceeds the tuning parameter $\lambda>0$. In fact, each tree node created in the first step is given a specific birth time, and the birth times increase as the tree grows. Therefore, the second step can be seen as a pruning process, which helps us to choose the best tree model for a learning problem.

To clearly present the Mondrian partition algorithm, some notation is introduced. Let $\mathcal{C}=\prod_{j=1}^{d}{\mathcal{C}^{j}}\subseteq[0,1]^{d}$ be a cell with closed intervals $\mathcal{C}^{j}=[a_{j},b_{j}]$. Denote $|\mathcal{C}^{j}|=b_{j}-a_{j}$ and $|\mathcal{C}|=\sum_{j=1}^{d}{|\mathcal{C}^{j}|}$. Let $Exp(|\mathcal{C}|)$ be the exponential distribution with rate $|\mathcal{C}|>0$ (i.e., with expectation $1/|\mathcal{C}|$). Algorithm 1 shows how our Mondrian partition operates. This algorithm is a recursive process, where the root node $[0,1]^{d}$ and stopping time $\lambda$ are used as the initial inputs.

Input: Stopping time $\lambda$.
1: Run the recursive function $\text{MondrianPartition}([0,1]^{d},0,\lambda)$.
   # This function is defined in Algorithm 2.
Algorithm 1: Mondrian Partition of $[0,1]^{d}$: generate a Mondrian partition of $[0,1]^{d}$, starting from time 0 and running until time $\lambda$.

1: Sample a random variable $E_{\mathcal{C}}\sim Exp(|\mathcal{C}|)$
2: if $\tau+E_{\mathcal{C}}\leq\lambda$ then
3:     Randomly choose a split dimension $J\in\{1,\ldots,d\}$ with ${\mathbf{P}}(J=j)=(b_{j}-a_{j})/|\mathcal{C}|$;
4:     Randomly choose a split threshold $S_{J}$ uniformly in $[a_{J},b_{J}]$;
5:     Split $\mathcal{C}$ along the split $(J,S_{J})$: let $\mathcal{C}_{0}=\{x\in\mathcal{C}:x_{J}\leq S_{J}\}$ and $\mathcal{C}_{1}=\mathcal{C}\setminus\mathcal{C}_{0}$;
6:     return $\text{MondrianPartition}(\mathcal{C}_{0},\tau+E_{\mathcal{C}},\lambda)\cup\text{MondrianPartition}(\mathcal{C}_{1},\tau+E_{\mathcal{C}},\lambda)$.
7: else
8:     return $\{\mathcal{C}\}$ (i.e., do not split $\mathcal{C}$)
9: end if
Algorithm 2: $\text{MondrianPartition}(\mathcal{C},\tau,\lambda)$: generate a Mondrian partition of cell $\mathcal{C}$, starting from time $\tau$ and running until time $\lambda$.
Figure 1: An example of a Mondrian partition (left) with the corresponding tree structure (right), showing how the tree grows over time. There are four partitioning times in this demo, $1,2,3,4$, which are marked by bullets ($\bullet$), and the stopping time is $\lambda=4$.
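To make the recursion in Algorithms 1 and 2 concrete, here is a minimal Python sketch that generates a Mondrian partition of $[0,1]^{d}$ with stopping time $\lambda$. The function and variable names (e.g., `mondrian_partition`, `lam`) are our own illustrative choices, and the exponential waiting time is drawn with rate $|\mathcal{C}|$ (mean $1/|\mathcal{C}|$), following the parametrization of Mourtada et al. (2020).

```python
import numpy as np

def mondrian_partition(lower, upper, tau, lam, rng):
    """Recursively generate a Mondrian partition of the cell prod_j [lower[j], upper[j]],
    starting at time tau and stopping at time lam. Returns a list of (lower, upper) cells."""
    lengths = upper - lower
    linear_dim = lengths.sum()                    # |C| = sum of side lengths
    E = rng.exponential(scale=1.0 / linear_dim)   # waiting time, rate |C|
    if tau + E <= lam:
        # split dimension J with P(J = j) proportional to the side length b_j - a_j
        J = rng.choice(len(lengths), p=lengths / linear_dim)
        # split threshold chosen uniformly in [a_J, b_J]
        S = rng.uniform(lower[J], upper[J])
        # C_0 = {x in C : x_J <= S},  C_1 = C \ C_0
        upper0, lower1 = upper.copy(), lower.copy()
        upper0[J], lower1[J] = S, S
        return (mondrian_partition(lower, upper0, tau + E, lam, rng)
                + mondrian_partition(lower1, upper, tau + E, lam, rng))
    # the next birth time would exceed lam, so C stays a leaf
    return [(lower, upper)]

# Example: a Mondrian partition of [0,1]^2 with stopping time lambda = 4
rng = np.random.default_rng(0)
cells = mondrian_partition(np.zeros(2), np.ones(2), 0.0, 4.0, rng)
print(len(cells), "leaf cells")
```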

3 Methodology

Based on the Mondrian partition, we introduce our Mondrian Forests in this section. Let $MP_{b}([0,1]^{d}),b=1,\ldots,B$, be independent Mondrian processes. When we prune each tree at time $\lambda>0$, independent partitions $MP_{b}(\lambda,[0,1]^{d}),b=1,\ldots,B$ are obtained, where all cuts after time $\lambda$ are ignored. In this case, we can write $MP_{b}(\lambda,[0,1]^{d})=\{\mathcal{C}_{b,\lambda,j}\}_{j=1}^{K_{b}(\lambda)}$ satisfying

[0,1]^{d}=\bigcup_{j=1}^{K_{b}(\lambda)}\mathcal{C}_{b,\lambda,j}\ \ \text{and}\ \ \mathcal{C}_{b,\lambda,j_{1}}\cap\mathcal{C}_{b,\lambda,j_{2}}=\varnothing,\ \forall j_{1}\neq j_{2},

where $\mathcal{C}_{b,\lambda,j}\subseteq[0,1]^{d}$ denotes a cell in the partition $MP_{b}(\lambda,[0,1]^{d})$. For each cell $\mathcal{C}_{b,\lambda,j}$, a constant policy $\hat{c}_{b,\lambda,j}\in{\mathbb{R}}$ is used as the predictor of $h(x)$, where

\hat{c}_{b,\lambda,j}=\arg\min_{z\in[-\beta_{n},\beta_{n}]}{\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}}{\ell(z,Y_{i})} \qquad (2)

and $\beta_{n}>0$ is a threshold. For any fixed $y\in{\mathbb{R}}$, $\ell(\cdot,y)$ is usually a continuous function of the first variable in machine learning. Therefore, the optimization (2) over $[-\beta_{n},\beta_{n}]$ guarantees the existence of $\hat{c}_{b,\lambda,j}$ in general, and in the theory we allow $\beta_{n}\to\infty$ as $n\to\infty$. Then, for each $1\leq b\leq B$, we obtain an estimator of $h(x)$:

\hat{h}_{b,n}(x):=\sum_{j=1}^{K_{b}(\lambda)}{\hat{c}_{b,\lambda,j}\cdot{\mathbb{I}}(x\in\mathcal{C}_{b,\lambda,j})},\ x\in[0,1]^{d},

where ${\mathbb{I}}(\cdot)$ denotes the indicator function. By applying the ensemble technique, the final estimator is given by

\hat{h}_{n}(x):=\frac{1}{B}\sum_{b=1}^{B}{\hat{h}_{b,n}(x)},\ x\in[0,1]^{d}. \qquad (3)

If the cell $\mathcal{C}_{b,\lambda,j}$ does not contain any data point, we simply set the fitted constant to 0 in the corresponding region.
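As a minimal sketch of the fitting steps (2) and (3), the snippet below fits, for one Mondrian tree, the constant $\hat{c}_{b,\lambda,j}$ in each cell by bounded one-dimensional minimization of a user-supplied convex loss, and then averages $B$ such trees. The names `fit_leaf`, `fit_tree` and `forest_predict`, and the use of `scipy.optimize.minimize_scalar`, are illustrative choices rather than the paper's prescribed implementation; the cells are assumed to come from the `mondrian_partition` sketch above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_leaf(y_leaf, loss, beta_n):
    """Constant fit (2): minimize sum_i loss(z, Y_i) over z in [-beta_n, beta_n]."""
    if len(y_leaf) == 0:
        return 0.0                                # empty cell: predict 0, as in the text
    obj = lambda z: sum(loss(z, y) for y in y_leaf)
    return minimize_scalar(obj, bounds=(-beta_n, beta_n), method="bounded").x

def fit_tree(cells, X, Y, loss, beta_n):
    """Piecewise-constant tree estimator h_{b,n} on one Mondrian partition."""
    consts = []
    for lower, upper in cells:
        # boundary points may belong to two adjacent cells; this does not matter for the sketch
        in_cell = np.all((X >= lower) & (X <= upper), axis=1)
        consts.append(fit_leaf(Y[in_cell], loss, beta_n))

    def h(x):
        for (lower, upper), c in zip(cells, consts):
            if np.all((x >= lower) & (x <= upper)):
                return c
        return 0.0
    return h

def forest_predict(trees, x):
    """Ensemble estimator (3): average of the B tree predictions at x."""
    return np.mean([h(x) for h in trees])

# Example with the squared loss, which recovers leaf averaging (see below)
sq_loss = lambda z, y: (z - y) ** 2
```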

Let us clarify the relationship between (3) and the traditional RF. Recall that the traditional RF (see Mourtada et al. (2020) and Breiman (2001)) estimates the conditional mean ${\mathbf{E}}(Y|X=x)$ by the sample mean in each leaf. In this case, the $\ell_{2}$ loss function $\ell(v,y)=(v-y)^{2}$ is applied. If $|Y|\leq\beta_{n}$, it can be checked that

\hat{c}_{b,\lambda,j}=\frac{1}{Card(\{i:X_{i}\in\mathcal{C}_{b,\lambda,j}\})}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}Y_{i},

where $Card(\cdot)$ denotes the cardinality of a set. Therefore, in this case, our forest estimator (3) coincides exactly with the traditional RF. In conclusion, $\hat{h}_{n}(x)$ extends the traditional RF of Breiman (2001), since the loss function $\ell(v,y)$ can be chosen flexibly for different learning problems.

From the above learning process, we know that there are two tuning parameters in the construction of $\hat{h}_{n}$, namely $\lambda$ and $B$. We give some remarks about them. The stopping time $\lambda$ controls the complexity of the Mondrian forest. Generally speaking, the cardinality of a tree partition increases with the value of $\lambda$. Thus, a large value of $\lambda$ is beneficial for reducing the bias of the forest estimator. In contrast, a small $\lambda$ is instrumental in controlling the generalization error of the final estimator (3). Selecting $\lambda$ is therefore crucial for balancing complexity and stability. To ensure the consistency of $\hat{h}_{n}$, we suppose that $\lambda$ depends on the sample size $n$, denoting it as $\lambda_{n}$ in the following analysis. The second parameter, $B$, denotes the number of Mondrian trees, which can be chosen in the same way as for RF. There are many studies on its selection for RF; see, for example, Zhang and Wang (2009). In practice, most practitioners take $B=100$ or $500$ in their computations.

4 Main results

In this section, we study the upper bound of the regret function of (3), which is constructed by Mondrian processes. Denote by $\mathcal{S}\subseteq{\mathbb{R}}$ the support of $Y$, which satisfies ${\mathbf{P}}(Y\in\mathcal{S})=1$. First, we need some mild restrictions on the loss function $\ell(v,y)$.

Assumption 1

The risk function $R(h):={\mathbf{E}}(\ell(h(X),Y))$ is convex. In other words, for any $\epsilon\in(0,1)$ and functions $h_{1},h_{2}$, we have $R(\epsilon h_{1}+(1-\epsilon)h_{2})\leq\epsilon R(h_{1})+(1-\epsilon)R(h_{2})$.

Note that Assumption 1 is satisfied if $\ell(\cdot,y)$ is convex for any fixed $y\in\mathcal{S}$.

Assumption 2

There is a nonnegative function $M_{1}(v,y)>0$ with $v>0,y\in{\mathbb{R}}$ such that for any $y\in\mathcal{S}$, $\ell(\cdot,y)$ is Lipschitz continuous: for any $v_{1},v_{2}\in[-v,v]$ and $y\in\mathcal{S}$, we have

|\ell(v_{1},y)-\ell(v_{2},y)|\leq M_{1}(v,y)|v_{1}-v_{2}|.

Assumption 3

There is an envelope function $M_{2}(v,y)>0$ such that for any $v\in{\mathbb{R}}^{+}$ and $y\in\mathcal{S}$,

\left|\sup_{v^{\prime}\in[-v,v]}\ell(v^{\prime},y)\right|\leq M_{2}(v,y)\ \ \text{and}\ \ {\mathbf{E}}(M_{2}^{2}(v,Y))<\infty.

Without loss of generality, we can assume that $M_{2}(\cdot,y)$ is non-decreasing in the first variable for any fixed $y\in\mathcal{S}$. In the next section, we will see that many commonly used loss functions satisfy Assumptions 1-3, including the $\ell_{2}$ loss and the $\ell_{1}$ loss. In the theoretical analysis, we make the following Assumption 4 on the distribution of $Y$ and assume $X$ takes values in $[0,1]^{d}$. This paper mainly focuses on the sub-Gaussian case of $Y$, namely $w(t)=t$, but Assumption 4 extends the class of sub-Gaussian distributions to a more general one.

Assumption 4

There is an increasing function $w(t)$ satisfying $\lim_{t\to\infty}w(t)=\infty$ and a constant $c>0$ such that

\limsup_{t\to\infty}{\mathbf{P}}(|Y|>t)\exp{(tw(t)/c)}<\infty.

Our theoretical results relate to the $(p,C)$-smooth class given below, since it is large enough and dense in the $L^{2}$ space generated by any probability measure. When $0<p\leq 1$, this class is also known as the Hölder space with index $p$ in the literature; see Adams and Fournier (2003). Additionally, the $(p,C)$-smooth class is frequently used in practice, for example in spline methods, because its smoothness makes computation convenient.

Definition 1 ($(p,C)$-smooth class)

Let $p=s+\beta>0$, $\beta\in(0,1]$ and $C>0$. The $(p,C)$-smooth ball with radius $C$, denoted by $\mathcal{H}^{p,\beta}([0,1]^{d},C)$, is the set of $s$ times differentiable functions $h:[0,1]^{d}\to{\mathbb{R}}$ such that

|\nabla^{s}h(x_{1})-\nabla^{s}h(x_{2})|\leq C\|x_{1}-x_{2}\|_{2}^{\beta},\ \ \forall x_{1},x_{2}\in[0,1]^{d},

and

\sup_{x\in[0,1]^{d}}{|h(x)|}\leq C,

where $\|\cdot\|_{2}$ denotes the $\ell_{2}$ norm in ${\mathbb{R}}^{d}$ and $\nabla$ is the gradient operator.

Theorem 2 (Regret function bound of Mondrian forests)

Suppose that the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and that the distribution of $Y$ satisfies Assumption 4. For any $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$ with $0<p\leq 1$, we have

{\mathbf{E}}R(\hat{h}_{n})-R(h) \leq \underbrace{c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}(1+\lambda_{n})^{d}}_{generalization\ error}+\underbrace{2d^{\frac{3}{2}p}C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\frac{1}{\lambda_{n}^{p}}}_{approximation\ error}
+\underbrace{c_{2}\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{|\ell(x,\ln n)|}+\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}+C\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\right)\cdot\frac{1}{n}}_{residual\ caused\ by\ the\ tail\ of\ Y}, \qquad (4)

where $c_{1},c_{2}>0$ are universal constants.

Remark 3

The first term on the RHS of (4) relates to the generalization error of the forest, and the second is the approximation error of the Mondrian forest to $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$. Finally, the last line is caused by the tail behaviour of $Y$ and disappears if $Y$ is assumed to be bounded.

Remark 4

We will see that in many applications the coefficients above, such as ${\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))$ and $\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}$, are only of polynomial order in $\ln n$ if $\beta_{n}\asymp\ln n$. In this case, the last line of (4) usually has no influence on the convergence speed of the regret function. Roughly speaking, only the first two terms, namely the generalization and approximation errors, dominate the convergence rate.

Remark 5

If $\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}$, $\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}$ and $\sup_{x\in[-\beta_{n},\beta_{n}]}{|\ell(x,\ln n)|}$ diverge no faster than $O((\ln n)^{\gamma})$ for some $\gamma>0$, we know from (4) that

\varlimsup_{n\to\infty}{\mathbf{E}}(R(\hat{h}_{n}))\leq\inf_{h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)}R(h)

when $\lambda_{n}\to\infty$ and $\lambda_{n}=o(n^{\frac{1}{2d}})$. Therefore, Mondrian forests perform no worse than the $(p,C)$-smooth class in this general setting.

In statistical learning, the consistency of an estimator is a crucial property, ensuring that the estimator converges to the true value of the parameter/function being estimated as the sample size increases. Theorem 2 can also be used to analyze the statistical consistency of $\hat{h}_{n}$. For this purpose, denote the true function $m$ by

m:=\arg\min_{g}{R(g)}, \qquad (5)

and make the following assumption.

Assumption 5

For any $h:[0,1]^{d}\to[-\beta_{n},\beta_{n}]$, there are $c>0$ and $\kappa\geq 1$ such that

c^{-1}\cdot{\mathbf{E}}|h(X)-m(X)|^{\kappa}\leq R(h)-R(m)\leq c\cdot{\mathbf{E}}|h(X)-m(X)|^{\kappa}.

Usually, $\kappa=2$ holds in many specific learning problems. Before presenting the consistency results, we denote the last line of (4) by $Res(n)$, namely,

Res(n):=c_{2}\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{|\ell(x,\ln n)|}+\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}+C\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\right)\cdot\frac{1}{n}. \qquad (6)

Then, the statistical consistency of Mondrian forests can be guaranteed by the following two corollaries.

Corollary 6 (Consistency rate of Mondrian forests)

Suppose that the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and that the distribution of $Y$ satisfies Assumption 4. Suppose that the true function $m\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$ with $0<p\leq 1$ and that Assumption 5 is satisfied. Then,

{\mathbf{E}}\left|\hat{h}_{n}(X)-m(X)\right|^{\kappa} \leq c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}(1+\lambda_{n})^{d}
+2d^{\frac{3}{2}p}C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\frac{1}{\lambda_{n}^{p}}+Res(n), \qquad (7)

where $c_{1},c_{2}>0$ are universal constants.

Corollary 7 (Consistency of Mondrian forests)

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. Suppose $m(X)$ is $L^{\kappa}$ integrable on $[0,1]^{d}$ and ${\mathbf{E}}(\ell^{2}(m(X),Y))<\infty$. Furthermore, suppose Assumption 5 is satisfied. If $\lambda_{n}=o\left(\left(\frac{\sqrt{n}}{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}\right)^{\frac{1}{d}}\right)$, $\lambda_{n}^{-1}\cdot\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\to 0$ and $Res(n)\to 0$, we have

\lim_{n\to\infty}{\mathbf{E}}\left|\hat{h}_{n}(X)-m(X)\right|^{\kappa}=0.

5 Model selection: the choice of $\lambda_{n}$

In practice, the best $\lambda_{n}$ is unknown, so a criterion is necessary to stop the growth of Mondrian forests; otherwise, the learning process will overfit. Here, we adopt the following penalty methodology. For each $1\leq b\leq B$, define

Pen(\lambda_{n,b}):=\frac{1}{n}\sum_{i=1}^{n}{\ell(\hat{h}_{b,n}(X_{i}),Y_{i})}+\alpha_{n}\cdot\lambda_{n,b}, \qquad (8)

where the parameter $\alpha_{n}>0$ controls the strength of the penalty and $\hat{h}_{b,n}$ is constructed as in Section 3 using the process $MP_{b}(\lambda_{n},[0,1]^{d})$. Then, the best $\lambda_{n,b}^{*}$ is chosen by

\lambda_{n,b}^{*}:=\arg\min_{\lambda\geq 0}Pen(\lambda).

Denote by $\hat{h}_{b,n}^{*}$ the tree estimator constructed from the Mondrian process $MP_{b}(\lambda_{n,b}^{*},[0,1]^{d})$. Then, our forest estimator based on model selection is given by

\hat{h}_{n}^{*}(x):=\frac{1}{B}\sum_{b=1}^{B}{\hat{h}_{b,n}^{*}(x)},\ x\in[0,1]^{d}. \qquad (9)
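The following sketch illustrates the selection rule (8)-(9) for a single tree: scan a finite grid of candidate stopping times, refit the tree for each, and keep the value minimizing the penalized empirical risk. The grid search and the names `penalized_risk`, `select_lambda` and `alpha_n` are our own illustrative choices; for simplicity the sketch redraws a partition for each candidate $\lambda$ rather than pruning one Mondrian realization at different times, which differs slightly from the nested construction in the text. It reuses `mondrian_partition` and `fit_tree` from the earlier sketches.

```python
import numpy as np

def penalized_risk(h, X, Y, loss, lam, alpha_n):
    """Criterion (8): empirical risk of the tree estimator plus alpha_n * lambda."""
    emp_risk = np.mean([loss(h(x), y) for x, y in zip(X, Y)])
    return emp_risk + alpha_n * lam

def select_lambda(lam_grid, X, Y, loss, beta_n, alpha_n, rng):
    """Choose lambda_{n,b}^* over a finite grid for one Mondrian tree."""
    best_crit, best_lam, best_h = np.inf, None, None
    d = X.shape[1]
    for lam in lam_grid:
        cells = mondrian_partition(np.zeros(d), np.ones(d), 0.0, lam, rng)
        h = fit_tree(cells, X, Y, loss, beta_n)
        crit = penalized_risk(h, X, Y, loss, lam, alpha_n)
        if crit < best_crit:
            best_crit, best_lam, best_h = crit, lam, h
    return best_lam, best_h   # (lambda_{n,b}^*, selected tree estimator)
```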
Theorem 8

Suppose the loss function $\ell(\cdot,\cdot)$ satisfies Assumptions 1-3 and the distribution of $Y$ satisfies Assumption 4. For any $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$ with $0<p\leq 1$ and $0<\alpha_{n}\leq 1$, we have

{\mathbf{E}}R(\hat{h}_{n}^{*})-R(h) \leq \underbrace{c_{1}\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\left(1+\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}\right)^{d}}_{generalization\ error}
+\underbrace{\left(2d^{\frac{3}{2}p}C\cdot\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\right)\cdot\alpha_{n}^{\frac{p}{2}}}_{approximation\ error}+Res(n), \qquad (10)

where $c_{1},c_{2}>0$ are universal constants and $Res(n)$ is defined in (6).

By properly choosing the penalty strength $\alpha_{n}$, we can obtain a convergence rate of the regret function of Mondrian forests from (10). Theorem 8 also implies that the estimator (9) adapts to the smoothness degree of the true function $m$: if $p$ is large, this rate is fast; otherwise, the convergence rate is slower. This coincides with basic knowledge of function approximation. Applications of Theorem 8 are given in the next section, where some examples are discussed in detail. In those cases, we will see that the coefficients in (10), such as $M_{1}(\beta_{n},\ln n)$, can indeed be upper bounded by a polynomial of $\ln n$.

6 Examples

In this section, we show how to use Mondrian forests in different statistical learning problems. Meanwhile, theoretical properties of these forest estimators, namely $\hat{h}_{n}(x)$ in (3) and $\hat{h}_{n}^{*}(x)$ in (9), are derived from Theorems 2 and 8 for each learning problem. Lemma 9 below is sometimes useful for verifying Assumption 5. Its proof can be completed directly by considering the Taylor expansion of the real function $R(h^{*}+\alpha h),\alpha\in[0,1]$, around the point $\alpha=0$.

Lemma 9

For any $h:[0,1]^{d}\to{\mathbb{R}}$ and $\alpha\in[0,1]$, we have

C_{1}{\mathbf{E}}(h(X)^{2})\leq\frac{d^{2}}{d\alpha^{2}}R(h^{*}+\alpha h)\leq C_{2}{\mathbf{E}}(h(X)^{2}),

where the constants $C_{1}>0,C_{2}>0$ are universal. Then, Assumption 5 holds with $\kappa=2$.

6.1 Least squares regression

As shown in Mourtada et al. (2020), Mondrian forests are statistically consistent if the $\ell_{2}$ loss function is used. In our first example, we revisit this case using the general results established in Sections 4 and 5. Usually, nonparametric least squares regression refers to methods that do not assume a specific parametric form of the conditional expectation ${\mathbf{E}}(Y|X)$; instead, these methods are flexible and can adapt to the underlying structure of the data. The loss function of least squares regression is given by $\ell(v,y)=(v-y)^{2}$. First, we define the event $A_{n}:=\{\max_{1\leq i\leq n}{|Y_{i}|}\leq\ln n\}$. Under Assumption 4, by (A.20) we can find constants $c,c^{\prime}>0$ such that ${\mathbf{P}}(A_{n})\geq 1-c^{\prime}\cdot ne^{-c\ln n\cdot w(\ln n)}$. This means ${\mathbf{P}}(A_{n})$ is very close to $1$ as $n\to\infty$. When the event $A_{n}$ occurs, from (2) we further know

\hat{c}_{b,\lambda,j} =\arg\min_{z\in[-\beta_{n},\beta_{n}]}{\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}}{\ell(z,Y_{i})}
=\frac{1}{Card(\{i:X_{i}\in\mathcal{C}_{b,\lambda,j}\})}\sum_{i:X_{i}\in\mathcal{C}_{b,\lambda,j}}Y_{i},

where $Card(\cdot)$ denotes the cardinality of a set. Therefore, $\hat{c}_{b,\lambda,j}$ is just the average of the $Y_{i}$'s that fall in the leaf $\mathcal{C}_{b,\lambda,j}$.

Let us discuss the properties of $\ell(v,y)=(v-y)^{2}$. First, it is obvious that Assumption 1 holds for this $\ell_{2}$ loss. By some simple calculations, we also know that Assumption 2 is satisfied with $M_{1}(v,y)=2(|v|+|y|)$ and Assumption 3 is satisfied with $M_{2}(v,y)=2(v^{2}+y^{2})$. Choosing $\lambda_{n}=n^{\frac{1}{2(p+d)}}$ and $\beta_{n}\asymp\ln n$, Theorem 2 implies the following property of $\hat{h}_{n}$.

Proposition 10

For any $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$, there is an integer $n_{1}(C)\geq 1$ such that for any $n>n_{1}(C)$,

{\mathbf{E}}R(\hat{h}_{n})-R(h)\leq\left(2\sqrt{2}\ln^{2}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

Next, we check consistency as in Corollary 6. By some calculations, we have $m(x)={\mathbf{E}}(Y|X=x)$ and

R(h)-R(m) ={\mathbf{E}}(Y-h(X))^{2}-{\mathbf{E}}(Y-m(X))^{2}
={\mathbf{E}}(h(X)-m(X))^{2}.

The above identity shows that Assumption 5 holds with $c=1$ and $\kappa=2$. When $\lambda_{n}=n^{\frac{1}{2(p+d)}}$, Corollary 6 implies Proposition 11.

Proposition 11

For any $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$, there is an integer $n_{2}(C)\geq 1$ such that for any $n>n_{2}(C)$,

{\mathbf{E}}\left(\hat{h}_{n}(X)-m(X)\right)^{2}\leq\left(2\sqrt{2}\ln^{2}n+4d^{\frac{3}{2}p}C\cdot(C+\ln n)+1\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.

We can also show that $\hat{h}_{n}$ is statistically consistent for any general function $m$ defined in (5) when $\lambda_{n}$ is chosen properly, as stated in Corollary 7. Finally, by choosing $\alpha_{n}=n^{-\frac{p}{2p+4d}}$ and $\beta_{n}\asymp\ln n$ in Theorem 8, the regret function of the estimator $\hat{h}^{*}_{n}$, which is based on the model selection in (8), has an upper bound as shown in the following proposition.

Proposition 12

For any $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$, there is an integer $n_{3}(C)\geq 1$ such that for any $n>n_{3}(C)$,

{\mathbf{E}}R(\hat{h}_{n}^{*})-R(h)\leq c(d,p,C)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+2d}}\ln^{2d+1}n,

where $c(d,p,C)>0$ depends only on $d,p,C$.

6.2 Generalized regression

Generalized regression refers to a broad class of regression models that extend beyond traditional ordinary least squares (OLS) regression, accommodating various types of response variables and relationships between predictors and responses. Usually, in this model, the conditional distribution of $Y$ given $X$ follows an exponential family distribution

{\mathbf{P}}(Y\leq y|X=x)=\int_{-\infty}^{y}\exp\{B(m(x))v-D(m(x))\}d\Psi(v), \qquad (11)

where $\Psi(\cdot)$ is a positive measure defined on ${\mathbb{R}}$ with $\Psi({\mathbb{R}})>\Psi(\{y\})$ for any $y\in{\mathbb{R}}$, and the function $D(\cdot)=\ln\int_{\mathbb{R}}{\exp\{B(\cdot)y\}}\Psi(dy)$, defined on an open interval $\mathcal{I}$ of ${\mathbb{R}}$, serves as a normalization. Now, we suppose the function $A(\cdot):=D^{\prime}(\cdot)/B^{\prime}(\cdot)$ exists, and some calculations give ${\mathbf{E}}(Y|X=x)=A(m(x))$. Thus, the conditional expectation ${\mathbf{E}}(Y|X=x)$ is known once we can estimate the unknown function $m(x)$. More information about model (11) can be found in Stone (1986), Stone (1994) and Huang (1998).

The problem of generalized regression is to estimate the unknown function $m(x)$ by using the i.i.d. data ${\mathcal{D}}_{n}=\{(X_{i},Y_{i})\}_{i=1}^{n}$. Note that both $B(\cdot)$ and $D(\cdot)$ are known in (11). In this case, we use the maximum likelihood method for the estimation, and the corresponding loss function is given by

\ell(v,y)=-B(v)y+D(v),

and by some calculations we know the true function $m$ satisfies (5), namely

m\in\arg\min_{h}{{\mathbf{E}}(-B(h(X))Y+D(h(X)))}.

Therefore, we have reason to believe that Mondrian forests are statistically consistent for this problem, as stated in Corollary 7. Now, we give some mild restrictions on $B(\cdot)$ and $D(\cdot)$ to ensure that our general results in Sections 4 and 5 can be applied to this generalized regression.

(i) $B(\cdot)$ is twice continuously differentiable, and its first derivative is strictly positive on $\mathcal{I}$.

(ii) We can find an interval $S$ of ${\mathbb{R}}$ such that the measure $\Psi$ is concentrated on $S$ and

-B^{\prime\prime}(\xi)y+D^{\prime\prime}(\xi)>0,\quad y\in\breve{S},\ \xi\in\mathcal{I}, \qquad (12)

where $\breve{S}$ denotes the interior of $S$. If $S$ is bounded, (12) holds for at least one of the endpoints.

(iii) ${\mathbf{P}}(Y\in S)=1$ and ${\mathbf{E}}(Y|X=x)=A(m(x))$ for each $x\in[0,1]^{d}$.

(iv) There is a compact interval $\mathcal{K}_{0}$ of $\mathcal{I}$ such that the range of $m$ is contained in $\mathcal{K}_{0}$.

The above restrictions on $B(\cdot)$ and $D(\cdot)$ were used in Huang (1998). In fact, we know from Huang (1998) that many commonly used distributions satisfy these conditions, including the normal, the Poisson, and the Bernoulli distributions. Now, let us verify Assumptions 1-5 under this setting.

In particular, Assumption 1 is verified using restrictions (i)-(iii). Next, we choose the Lipschitz constant in Assumption 2 as

M_{1}(v,y):=\left|\sup_{\tilde{v}\in[-v,v]}{B^{\prime}(\tilde{v})}\right|\cdot|y|+\left|\sup_{\tilde{v}\in[-v,v]}{D^{\prime}(\tilde{v})}\right|.

Thirdly, the envelope function of $\ell(v,y)$ in Assumption 3 can be set as

M_{2}(v,y):=\left|\sup_{\tilde{v}\in[-v,v]}{B(\tilde{v})}\right|\cdot|y|+\left|\sup_{\tilde{v}\in[-v,v]}{D(\tilde{v})}\right|\cdot|y|.

Since $Y$ is a sub-Gaussian random variable under Assumption 4, we have ${\mathbf{E}}(M_{2}^{2}(v,Y))<\infty$ in this case, indicating that Assumption 3 is satisfied. Finally, under restrictions (i)-(iv), Lemma 4.1 in Huang (1998) shows that our Assumption 5 holds with $\kappa=2$ and

c=\max\left\{\sup_{\xi\in[-\beta_{n},\beta_{n}]\cap\mathcal{I},\ m\in\mathcal{K}_{0}}{(-B^{\prime\prime}(\xi)A(m)+D^{\prime\prime}(\xi))},\ \left[\inf_{\xi\in[-\beta_{n},\beta_{n}]\cap\mathcal{I},\ m\in\mathcal{K}_{0}}{(-B^{\prime\prime}(\xi)A(m)+D^{\prime\prime}(\xi))}\right]^{-1}\right\}. \qquad (13)

From (12), the constant $c$ in (13) must be larger than zero. On the other hand, as we will see later, this $c$ is finite in many cases.

Therefore, the general theoretical results in Sections 4 and 5 can be applied to generalized regression. Meanwhile, we stress that the coefficients in the general results, such as $M_{1}(\beta_{n},\ln n)$ in Theorem 2 and $c$ in (13), are only of polynomial order in $\ln n$ if $\beta_{n}$ is selected properly. Let us give some specific examples.

(1) The first example is Gaussian regression, where the conditional distribution $Y|X=x$ follows $N(m(x),\sigma^{2})$ and $\sigma^{2}$ is known. Therefore, $B(x)=x$, $D(x)=\frac{1}{2}x^{2}$, $\mathcal{I}={\mathbb{R}}$ and $S={\mathbb{R}}$. Our goal is to estimate the unknown conditional mean $m(x)$. Restrictions (i)-(iii) are then satisfied. To satisfy the fourth one, we assume the range of $m$ is contained in a compact set of ${\mathbb{R}}$, denoted by $\mathcal{K}_{0}$. Choose $\beta_{n}\asymp\ln n$. Meanwhile, we can ensure that $Y$ is a sub-Gaussian random variable. The constant $c$ in (13) equals $1$, and the coefficients in the general theoretical results are all of polynomial order in $\ln n$; for example, $M_{1}(\beta_{n},\ln n)\asymp 2\ln n$.

(2) The second example is Poisson regression, where the conditional distribution $Y|X=x$ follows $Poisson(\lambda(x))$ with $\lambda(x)>0$. Therefore, $B(x)=x$, $D(x)=\exp(x)$, $\mathcal{I}={\mathbb{R}}$ and $S=[0,\infty)$. Our goal is to estimate the log transformation of the conditional mean, namely $m(x)=\ln\lambda(x)$ in (11), by using the Mondrian forest (3). It is not difficult to show that restrictions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of $\lambda(x)$ is contained in a compact set of $(0,\infty)$; thus $m(x)$ satisfies restriction (iv). In this case, we choose $\beta_{n}\asymp\ln\ln n$. Note that $W$ satisfies Assumption 4 with $w(t)=\ln(1+t/\lambda)$ if $W$ follows a Poisson distribution with mean $\lambda>0$. Thus, we can ensure that $Y$ satisfies Assumption 4. The constant $c$ in (13) is of order $\ln n$, and the coefficients in the general theoretical results are also all of polynomial order in $\ln n$; for example, $M_{1}(\beta_{n},\ln n)\asymp 2\ln n$.

(3) The third example relates to $0$-$1$ classification, where the conditional distribution $Y|X=x$ follows a Bernoulli distribution (taking values in $\{0,1\}$) with ${\mathbf{P}}(Y=1|X=x)=p(x)\in(0,1)$. It is well known that the best classifier is the Bayes rule,

C^{Bayes}(x)=\begin{cases}1,&p(x)-\frac{1}{2}\geq 0,\\ 0,&p(x)-\frac{1}{2}<0.\end{cases}

What we are interested in is estimating the conditional probability $p(x)$ above, and here we use the Mondrian forest for this estimation. First, we shift $p(x)$, which means $m(x):=p(x)-\frac{1}{2}\in(-\frac{1}{2},\frac{1}{2})$ is used in (11) instead. By some calculations, $B(x)=\ln(0.5+x)-\ln(0.5-x)$, $D(x)=-\ln(0.5-x)$, $\mathcal{I}=(-0.5,0.5)$ and $S=[0,1]$ in this case. Now, the final goal is to estimate $m(x)$ by using the forest estimator (3). It is not difficult to show that restrictions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of $m(x)$ is contained in a compact set of $(-\frac{1}{2},\frac{1}{2})$. Now, choose $\beta_{n}\asymp\frac{1}{2}-(\frac{1}{\ln n})^{\gamma}$ for some $\gamma>0$. Meanwhile, Assumption 4 is satisfied since $Y$ is bounded. The constant $c$ in (13) is of order $(\ln n)^{2\gamma}$, and the coefficients in the general theoretical results are also all of polynomial order in $\ln n$; for example, $M_{1}(\beta_{n},\ln n)\asymp 2(\ln n)^{\gamma+1}$.

(4) The fourth example is geometric regression, where the model is ${\mathbf{P}}(Y=k|X=x)=p(x)(1-p(x))^{k-1},k\in\mathbb{Z}^{+},x\in[0,1]^{d}$. Here, $p(x)$ denotes the success probability, and we suppose it is bounded away from $0$ and $1$, namely $p(x)\in[c_{1},c_{2}]\subseteq(0,1)$. Thus, we have a positive probability of obtaining success or failure for any $x$ and $k\in\mathbb{Z}^{+}$. In this case, $B(x)=x$, $D(x)=-\ln(e^{-x}-1)$, and $m(x)=\ln(1-p(x))$ is the unknown function we need to estimate. Since $m(x)<0$, in this example we optimize (2) over $[-\beta_{n},-\beta_{n}^{-1}]$ only, with $\beta_{n}\to\infty$. In Section 4, we only need to replace $[-v,v]$ and $\mathcal{H}^{p,\beta}([0,1]^{d},C)$ by $[-v,-v^{-1}]$ and $\mathcal{H}^{p,\beta}([0,1]^{d},C)\cap\{h(x):h(x)<0\}$, respectively. Then, it is not difficult to check that all results in Section 4 still hold. Furthermore, restrictions (i)-(iii) are satisfied after some calculation. Finally, we check Assumption 5. In this circumstance, we can still use Lemma 4.1 in Huang (1998) after replacing $[-\beta_{n},\beta_{n}]$ in (13) with $[-\beta_{n},-\beta_{n}^{-1}]$. If $\beta_{n}\asymp\ln\ln n$ is chosen, then $c\asymp(\ln\ln n)^{2}$ in (13) after some calculation. Therefore, Assumption 5 also holds with $c\asymp(\ln\ln n)^{2}$ and $\kappa=2$.

In each of the four examples above, the convergence rate of $\varepsilon(\hat{h}_{n})$ is $O_{p}(n^{-\frac{1}{2}\cdot\frac{p}{p+d}})$ up to a polynomial of $\ln n$.
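To connect the examples above with the generic loss $\ell(v,y)=-B(v)y+D(v)$, the snippet below writes out this loss for the Gaussian, Poisson, Bernoulli and geometric cases, so that it can be plugged into the per-leaf fit of Section 3. The dictionary layout and names are illustrative only, and the domain restrictions in the comments follow the discussion above.

```python
import numpy as np

# Per-leaf losses loss(v, y) = -B(v) * y + D(v) for the examples above.
GLM_LOSSES = {
    # Gaussian: B(v) = v, D(v) = v^2 / 2
    "gaussian": lambda v, y: -v * y + 0.5 * v ** 2,
    # Poisson (v = log of the conditional mean): B(v) = v, D(v) = exp(v)
    "poisson": lambda v, y: -v * y + np.exp(v),
    # Bernoulli with the shift m = p - 1/2, so v must lie in (-1/2, 1/2):
    # B(v) = ln(0.5 + v) - ln(0.5 - v), D(v) = -ln(0.5 - v)
    "bernoulli": lambda v, y: -(np.log(0.5 + v) - np.log(0.5 - v)) * y - np.log(0.5 - v),
    # Geometric (v = ln(1 - p) < 0): B(v) = v, D(v) = -ln(exp(-v) - 1)
    "geometric": lambda v, y: -v * y - np.log(np.exp(-v) - 1.0),
}

# Example: Poisson loss at log-mean v = 0.3 for an observed count y = 2
print(GLM_LOSSES["poisson"](0.3, 2))
```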

6.3 Huber’s loss

Huber loss, also known as the smooth $L^{1}$ loss and proposed in Huber (1992), is a loss function frequently used in regression tasks, especially in machine learning applications. It combines the strengths of both Mean Absolute Error (MAE) and Mean Squared Error (MSE), making it more robust to outliers than MSE while maintaining smoothness and differentiability like MSE. Huber loss applies a quadratic penalty for small errors and a linear penalty for large ones, allowing it to balance sensitivity to small deviations and resistance to large, anomalous deviations. This makes it particularly effective when dealing with noisy data or outliers. In detail, this loss function takes the form

\ell(v,y)=\begin{cases}\frac{1}{2}(v-y)^{2}&|v-y|\leq\delta_{n},\\ \delta_{n}(|v-y|-\frac{1}{2}\delta_{n})&|v-y|\geq\delta_{n}.\end{cases}

This case is interesting and differs from commonly used loss functions because $\ell$ depends on $\delta_{n}$ and can vary with the sample size $n$. Although this loss is a piecewise function, unlike classical loss functions, the non-asymptotic result in Theorem 2 can still be applied here.

Let us verify Assumptions 1-3 for this loss. Firstly, this loss function is convex, so Assumption 1 is satisfied. Secondly, for any $y\in{\mathbb{R}}$, $\ell(\cdot,y)$ is a Lipschitz function with Lipschitz constant $M_{1}(v,y)=\delta_{n}$. Thirdly, we can define

M_{2}(v,y)=\begin{cases}\frac{1}{2}(|v|+|y|)^{2}&|v|+|y|\leq\delta_{n},\\ \delta_{n}(|v|+|y|-\frac{1}{2}\delta_{n})&|v|+|y|\geq\delta_{n}.\end{cases}

Since $Y$ is sub-Gaussian by Assumption 4, we have ${\mathbf{E}}(M_{2}^{2}(v,Y))<\infty$ and Assumption 3 is satisfied. Finally, the coefficients in Theorem 2, namely

\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\},\quad\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)},\quad\sup_{x\in[-\beta_{n},\beta_{n}]}{|\ell(x,\ln n)|},

diverge no faster than a polynomial of $\ln n$ if we take $\beta_{n}=O(\delta_{n})$ and the threshold satisfies $\delta_{n}=o(\ln^{1+\eta}n)$ for any $\eta>0$. Under the above settings, we can obtain a fast convergence rate of the forest estimator by Theorem 2.
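A small sketch of the Huber case: the per-leaf constant from (2) can be computed by bounded one-dimensional minimization of the Huber loss. The names `huber_loss` and `fit_leaf_huber`, and the use of `scipy.optimize.minimize_scalar`, are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(v, y, delta):
    """Huber loss: quadratic for |v - y| <= delta, linear beyond delta."""
    r = np.abs(v - y)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def fit_leaf_huber(y_leaf, beta_n, delta):
    """Per-leaf constant (2) under the Huber loss, restricted to [-beta_n, beta_n]."""
    obj = lambda z: np.sum(huber_loss(z, y_leaf, delta))
    return minimize_scalar(obj, bounds=(-beta_n, beta_n), method="bounded").x

# Illustration: the Huber fit is pulled less by the outlier than the plain mean.
y_leaf = np.array([0.1, 0.2, 0.05, 8.0])
print(fit_leaf_huber(y_leaf, beta_n=np.log(100.0), delta=1.0), y_leaf.mean())
```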

On the other hand, we aim to show that our forest estimator performs no worse than any $(p,C)$-smooth function even if $\delta_{n}=o(n^{\frac{1}{2}-\nu})$ for an arbitrarily small $\nu>0$. The reason is as follows. By calculation, we have

\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}\leq\beta_{n}+{\mathbf{E}}(\beta_{n}+|Y|)^{2}+\delta_{n}{\mathbf{E}}(\beta_{n}+|Y|)=O(\beta_{n}^{2}+\delta_{n}\beta_{n}).

Since $\beta_{n}\asymp\ln n$, if $\delta_{n}=o(n^{\frac{1}{2}-\nu})$ and $\lambda_{n}$ is properly selected, (4) gives

\varlimsup_{n\to\infty}{\mathbf{E}}(R(\hat{h}_{n}))\leq\inf_{h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)}R(h).

Finally, we show that Huber loss can be applied to estimate the conditional expectation, and the corresponding statistical consistency result is given below.

Proposition 13

Recall the conditional expectation $m(x):={\mathbf{E}}(Y|X=x),x\in[0,1]^{d}$. If ${\mathbf{E}}(m^{2}(X))<\infty$, $\beta_{n}\asymp\ln n$ and $\delta_{n}=C\ln n$ for a large $C>0$, we can find a sequence $\lambda_{n}\to\infty$ such that

\lim_{n\to\infty}{\mathbf{E}}(\hat{h}_{n}(X)-m(X))^{2}=0.

6.4 Quantile regression

Quantile regression is a type of regression analysis used in statistics and econometrics that focuses on estimating conditional quantiles (such as the median or other percentiles) of the distribution of the response variable given a set of predictor variables. Unlike ordinary least squares (OLS) regression, which estimates the mean of the response variable conditional on the predictor variables, quantile regression provides a more comprehensive analysis by estimating the conditional median or other quantiles.

Specifically, suppose that $m(x)$ is the $\tau$-th quantile ($0<\tau<1$) of the conditional distribution of $Y|X=x$. Our interest is to estimate $m(x)$ by using the i.i.d. data ${\mathcal{D}}_{n}=\{(X_{i},Y_{i})\}_{i=1}^{n}$. The loss function in this case is given by

\ell(v,y):=\rho_{\tau}(y-v),

where $\rho_{\tau}(u)=(\tau-\mathbb{I}(u<0))u,u\in{\mathbb{R}}$, denotes the check function for the quantile level $\tau$. Meanwhile, by some calculations, we know the quantile function $m(x)$ minimizes the population risk w.r.t. the above $\ell(v,y)$. Namely, we have

m(x)\in\arg\min_{h}{{\mathbf{E}}(\rho_{\tau}(Y-h(X)))}.

Therefore, we have reasons to believe that the forest estimator in (3) works well in this problem.
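Under the check loss, the per-leaf optimization (2) is solved, up to the truncation to $[-\beta_{n},\beta_{n}]$, by an empirical $\tau$-quantile of the responses in the leaf, so a sketch of the leaf fit reduces to `np.quantile` followed by clipping; the function names below are illustrative.

```python
import numpy as np

def check_loss(v, y, tau):
    """Pinball/check loss rho_tau(y - v) = (tau - I(y - v < 0)) * (y - v)."""
    u = y - v
    return (tau - (u < 0)) * u

def fit_leaf_quantile(y_leaf, tau, beta_n):
    """Per-leaf constant (2) for the check loss: an empirical tau-quantile of the
    leaf responses, clipped to [-beta_n, beta_n] (clipping preserves optimality
    because the objective is convex in the constant)."""
    if len(y_leaf) == 0:
        return 0.0
    return float(np.clip(np.quantile(y_leaf, tau), -beta_n, beta_n))

# Example: conditional median (tau = 0.5) fitted in one leaf
print(fit_leaf_quantile(np.array([0.3, -1.2, 0.8, 0.1]), tau=0.5, beta_n=5.0))
```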

Let us verify Assumptions 1-5. Firstly, we take $\mathcal{S}={\mathbb{R}}$, and it is easy to check that the univariate function $\ell(\cdot,y)$ is convex for any $y\in\mathcal{S}$, so Assumption 1 holds. Secondly, for any fixed $y\in\mathcal{S}$, the loss function $\ell(v,y)$ is Lipschitz continuous in the first variable $v$ with Lipschitz constant $M_{1}(v,y):=\max\{\tau,1-\tau\}$ for all $v,y\in{\mathbb{R}}$. Thirdly, we choose the envelope function $M_{2}(v,y):=\max\{\tau,1-\tau\}\cdot(|v|+|y|)$ in Assumption 3. Fourthly, we always suppose $Y$ is a sub-Gaussian random variable, so that the requirement in Assumption 4 is met. Finally, it remains to find a sufficient condition for Assumption 5.

In fact, the Knight equality in Knight (1998) tells us

\rho_{\tau}(u-v)-\rho_{\tau}(u)=v(\mathbb{I}(u\leq 0)-\tau)+\int_{0}^{v}{(\mathbb{I}(u\leq s)-\mathbb{I}(u\leq 0))ds},

from which we get

R(h)-R(m) ={\mathbf{E}}\left[\rho_{\tau}(Y-h(X))-\rho_{\tau}(Y-m(X))\right]
={\mathbf{E}}\left[(h(X)-m(X))\left(\mathbb{I}(Y-m(X)\leq 0)-\tau\right)\right]
+{\mathbf{E}}\left[\int_{0}^{h(X)-m(X)}{(\mathbb{I}(Y-m(X)\leq s)-\mathbb{I}(Y-m(X)\leq 0))ds}\right].

Conditioning on $X$, the first term of the above display equals zero by the definition of $m(x)$. Therefore,

R(h)-R(m) ={\mathbf{E}}\left[\int_{0}^{h(X)-m(X)}{(\mathbb{I}(Y-m(X)\leq s)-\mathbb{I}(Y-m(X)\leq 0))ds}\right]
={\mathbf{E}}\left[\int_{0}^{h(X)-m(X)}{sgn(s)\,{\mathbf{P}}[(Y-m(X))\in(0,s)\cup(s,0)|X]}\,ds\right]. \qquad (14)

To illustrate the basic idea clearly, we consider a simple normal case, where $m_{1}(X):={\mathbf{E}}(Y|X)$ is independent of the residual $\varepsilon=Y-{\mathbf{E}}(Y|X)\sim N(0,1)$ and $\sup_{x\in[0,1]^{d}}{|m_{1}(x)|}<\infty$. The generalization to the sub-Gaussian case can be carried out in the same spirit. In this normal case, $m(X)$ equals the sum of $m_{1}(X)$ and the $\tau$-th quantile of $\varepsilon$. Denote by $q_{\tau}(\varepsilon)\in{\mathbb{R}}$ the $\tau$-th quantile of $\varepsilon$. Thus, the conditional distribution of $(Y-m(X))|X$ is the same as the distribution of $\varepsilon-q_{\tau}(\varepsilon)$. For any $s_{0}>0$,

\int_{0}^{s_{0}}{\mathbf{P}}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds=\int_{0}^{s_{0}}{\mathbf{P}}[\varepsilon\in(q_{\tau}(\varepsilon),q_{\tau}(\varepsilon)+s)]ds.

Since $q_{\tau}(\varepsilon)$ is a fixed number, we later allow $s_{0}$ to be large. By the Lagrange mean value theorem, the following probability bound holds:

\frac{1}{\sqrt{2\pi}}\exp{(-(|q_{\tau}(\varepsilon)|+s_{0})^{2}/2)}\cdot s\leq{\mathbf{P}}[\varepsilon\in(q_{\tau}(\varepsilon),q_{\tau}(\varepsilon)+s)]\leq\frac{1}{\sqrt{2\pi}}\cdot s.

Then,

\frac{1}{\sqrt{2\pi}}\exp{(-(|q_{\tau}(\varepsilon)|+s_{0})^{2}/2)}\cdot\frac{1}{2}s_{0}^{2}\leq\int_{0}^{s_{0}}{\mathbf{P}}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}s_{0}^{2}.

With the same argument, we also have

\frac{1}{\sqrt{2\pi}}\exp{(-(|q_{\tau}(\varepsilon)|+|s_{0}|)^{2}/2)}\cdot\frac{1}{2}s_{0}^{2}\leq\int_{0}^{s_{0}}{\mathbf{P}}[(Y-m(X))\in(0,s)\cup(s,0)|X]ds\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}s_{0}^{2}

once $s_{0}<0$. Therefore, (14) implies

R(h)-R(m)\leq\frac{1}{\sqrt{2\pi}}\cdot\frac{1}{2}{\mathbf{E}}(h(X)-m(X))^{2} \qquad (15)

and

R(h)-R(m)\geq\frac{1}{\sqrt{2\pi}}\exp{(-(|q_{\tau}(\varepsilon)|+\sup_{x\in[0,1]^{d}}|h(x)-m(x)|)^{2}/2)}\cdot\frac{1}{2}{\mathbf{E}}(h(X)-m(X))^{2}. \qquad (16)

Now, we choose $\beta_{n}\asymp\sqrt{\ln\ln n}$. The combination of (15) and (16) implies that Assumption 5 holds with $\kappa=2$ and $c=(\ln n)^{-1}$. Meanwhile, the coefficients in the general theoretical results are of polynomial order in $\ln\ln n$; for example, $\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\asymp\ln\ln n$. With this setting, the convergence rate of $\varepsilon(\hat{h}_{n})$ is $O_{p}(n^{-\frac{1}{2}\cdot\frac{p}{p+d}})$ up to a polynomial of $\ln n$.

6.5 Binary classification

In previous sections, we gave several examples of regression. Now, let us discuss another topic, classification. In this section, we show that Mondrian forests can be applied to binary classification as long as the chosen loss function is convex. In detail, we assume $Y\in\{1,-1\}$ takes only two labels and $X\in[0,1]^{d}$ is the vector of explanatory variables. It is well known that the Bayes classifier has the minimal classification error and takes the form

C^{Bayes}(x)=\mathbb{I}(\eta(x)>0.5)-\mathbb{I}(\eta(x)\leq 0.5),

where $\eta(x):={\mathbf{P}}(Y=1|X=x)$. However, such a theoretically optimal classifier is not available because $\eta(x)$ is unknown. In machine learning, the most commonly used loss function for this problem takes the form

\ell(h(x),y):=\phi(-yh(x)),

where $\phi:{\mathbb{R}}\to[0,\infty)$ is called a non-negative cost function and $h:[0,1]^{d}\to{\mathbb{R}}$ is the goal function we need to learn from the data. The best $h$ is always chosen as the function that minimizes the empirical risk

\hat{R}(h):=\frac{1}{n}\sum_{i=1}^{n}{\phi(-Y_{i}h(X_{i}))}

over a function class. If this minimizer is denoted by $h^{*}$, the best classifier is regarded as

{C^{*}}(x):=\mathbb{I}(h^{*}(x)>0)-\mathbb{I}(h^{*}(x)\leq 0),\ x\in[0,1]^{d}.

Here are some examples of the cost function $\phi$; a code sketch of these costs is given after the list.

(i) Square cost: $\phi_{1}(v)=(1+v)^{2},v\in{\mathbb{R}}$, suggested in Li and Yang (2003).

(ii) Hinge cost: $\phi_{2}(v)=\max\{1-v,0\},v\in{\mathbb{R}}$, which is used in the support vector machine; see Hearst et al. (1998).

(iii) Smoothed hinge cost. A problem with the hinge cost is that direct optimization is difficult due to the discontinuity of its derivative at $v=1$. Rennie and Srebro (2005) proposed a smooth version of the hinge:

\phi_{3}(v)=\begin{cases}0.5-v&v\leq 0,\\ (1-v)^{2}/2&0<v\leq 1,\\ 0&v\geq 1.\end{cases}

(iv) Modified square cost: $\phi_{4}(v)=\max\{1-v,0\}^{2},v\in{\mathbb{R}}$, used in Zhang and Oles (2001).

(v) Logistic cost: $\phi_{5}(v)=\log_{2}(1+\exp(v)),v\in{\mathbb{R}}$, applied in Friedman et al. (2000).

(vi) Exponential cost: $\phi_{6}(v)=\exp(v),v\in{\mathbb{R}}$, which is used in the famous AdaBoost; see Freund and Schapire (1997).
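Here is a minimal sketch of the six costs $\phi_{1}$-$\phi_{6}$ and of the empirical risk $\hat{R}(h)=\frac{1}{n}\sum_{i}\phi(-Y_{i}h(X_{i}))$; the function names are illustrative, and `h_values` stands for the vector $(h(X_{1}),\ldots,h(X_{n}))$.

```python
import numpy as np

def phi1(v):  # square cost
    return (1.0 + v) ** 2

def phi2(v):  # hinge cost
    return np.maximum(1.0 - v, 0.0)

def phi3(v):  # smoothed hinge cost of Rennie and Srebro (2005)
    return np.where(v <= 0.0, 0.5 - v,
                    np.where(v <= 1.0, 0.5 * (1.0 - v) ** 2, 0.0))

def phi4(v):  # modified square cost
    return np.maximum(1.0 - v, 0.0) ** 2

def phi5(v):  # logistic cost (base-2 logarithm)
    return np.log2(1.0 + np.exp(v))

def phi6(v):  # exponential cost used in AdaBoost
    return np.exp(v)

def empirical_risk(phi, h_values, y):
    """Empirical risk (1/n) * sum_i phi(-Y_i h(X_i)) with labels y in {-1, +1}."""
    return np.mean(phi(-y * h_values))
```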

It is obvious that $Y\in\{1,-1\}$ satisfies Assumption 4. In order to verify Assumptions 1-3, we impose the following three conditions on $\phi$:

(a) $\phi:{\mathbb{R}}\to[0,\infty)$ is convex.

(b) $\phi(v)$ is piecewise differentiable, and the absolute value of each piece of $\phi^{\prime}(v)$ is upper bounded by a polynomial of $|v|$.

(c) $\phi(|v|)\leq c_{1}|v|^{\gamma}+c_{2}$ for some $\gamma,c_{1},c_{2}>0$.

These conditions are obviously satisfied by $\phi_{1}-\phi_{5}$. We will discuss the case of $\phi_{6}$ later because of its rapid growth. When Condition (a) holds, we have

R(\lambda h_{1}+(1-\lambda)h_{2}) ={\mathbf{E}}\big(\phi\big(-Y(\lambda h_{1}(X)+(1-\lambda)h_{2}(X))\big)\big)
\leq\lambda R(h_{1})+(1-\lambda)R(h_{2})

for any two functions $h_{1},h_{2}$ and $\lambda\in(0,1)$. This shows that Assumption 1 is true. With a slight abuse of notation, let

M1(v,y):=supv1[v,v]|ϕ(v1)|M_{1}(v,y):=\sup_{v_{1}\in[-v,v]}{|\phi^{\prime}(v_{1})|}

be the maximal value of piecewise function |ϕ(v)||\phi^{\prime}(v)|. Then, by Lagrange mean value theorem,

supy{1,1},v1,v2[v,v]|(v1,y)(v2,y)|M1(v,1)|v1v2|.\sup_{y\in\{1,-1\},v_{1},v_{2}\in[-v,v]}|\ell(v_{1},y)-\ell(v_{2},y)|\leq M_{1}(v,1)|v_{1}-v_{2}|.

This shows that Assumption 2 holds with the Lipschitz constant M1(v,1)M_{1}(v,1). Finally, Condition (c) ensures that Assumption 3 holds with

M2(v,y):=c1|v|γ+c2.M_{2}(v,y):=c_{1}|v|^{\gamma}+c_{2}.

If we take βnlnn\beta_{n}\asymp\ln n, all the coefficients in Theorem 2:

max{βn,𝐄(M22(βn,Y))},supy[lnn,lnn]M1(C,y),supv[βn,βn]|(v,lnn)|\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\},\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)},\sup_{v\in[-\beta_{n},\beta_{n}]}{|\ell(v,\ln n)|} (17)

diverge no faster than a polynomial in $\ln n$. These arguments complete the verification of Assumptions 1-3 for the cost functions $\phi_{1}$-$\phi_{5}$. For the exponential cost $\phi_{6}$, direct calculations show that Assumptions 1-3 also hold and that the coefficients in (17) are $O(\ln n)$ if we take $\beta_{n}\asymp\ln\ln n$.
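As a concrete instance of this verification (a worked check that uses only the definition of $\phi_{5}$ and the quantities in (17)), consider the logistic cost with $\beta_{n}\asymp\ln n$. Since
\phi_{5}^{\prime}(v)=\frac{e^{v}}{(1+e^{v})\ln 2}\leq\frac{1}{\ln 2}\quad\text{and}\quad \phi_{5}(|v|)=\log_{2}(1+e^{|v|})\leq\frac{1+|v|}{\ln 2},
Condition (b) holds with a constant bound and Condition (c) holds with $\gamma=1$ and $c_{1}=c_{2}=1/\ln 2$. Consequently $M_{2}(\beta_{n},Y)\leq(1+\beta_{n})/\ln 2\asymp\ln n$ and
\sup_{v\in[-\beta_{n},\beta_{n}]}|\ell(v,\ln n)|=\sup_{v\in[-\beta_{n},\beta_{n}]}\phi_{5}(-v\ln n)\leq\frac{1+\beta_{n}\ln n}{\ln 2}\asymp(\ln n)^{2},
so every quantity in (17) grows at most polylogarithmically in $n$.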

Generally speaking, the minimizer $m$ in (5) is not unique in the classification case; see Lemma 3 in Lugosi and Vayatis (2004). Therefore, it is not meaningful to discuss the statistical consistency of the forest in the sense of Corollary 6 or 7. However, if the cost function is chosen to be $\phi_{5}$ or $\phi_{6}$, we can establish the following weak consistency for Mondrian forests.

Proposition 14

For the cost function $\phi_{5}$ or $\phi_{6}$, the minimizer $m$ defined in (5) always exists. Meanwhile,

\lim_{n\to\infty}{\mathbf{E}}(R(\hat{h}_{n}))=R(m).

6.6 Nonparametric density estimation

Assume $X$ is a continuous random vector defined on $[0,1]^{d}$ with density function $f_{0}(x)$, $x\in[0,1]^{d}$. Our interest lies in estimating the unknown function $f_{0}(x)$ based on an i.i.d. sample of $X$, namely the data ${\mathcal{D}}_{n}=\{X_{i}\}_{i=1}^{n}$. Note that any density estimator has to satisfy two shape requirements: non-negativity, $f_{0}(x)\geq 0$ for all $x\in[0,1]^{d}$, and normalization, $\int f_{0}(x)dx=1$. These two restrictions can be removed by a transformation. We have the decomposition

f_{0}(x)=\frac{\exp(h_{0}(x))}{\int\exp(h_{0}(x))dx},\quad x\in[0,1]^{d},

where $h_{0}(x)$ is a real-valued function on $[0,1]^{d}$. This relationship allows us to focus on estimating $h_{0}(x)$ only, which is a statistical learning problem without constraints. On the other hand, the transformation introduces a model identification problem, since $h_{0}+c$ and $h_{0}$ give the same density for any $c\in{\mathbb{R}}$. To solve this problem, we impose the additional requirement $\int_{[0,1]^{d}}h_{0}(x)dx=0$, which guarantees a one-to-one map between $f_{0}$ and $h_{0}$.

In the case of density estimation, the scaled negative log-likelihood of a function $h(x)$ based on the sample ${\mathcal{D}}_{n}$ (that is, $-1/n$ times the log-likelihood of the density $\exp(h)/\int\exp(h)$) is

\hat{R}(h):=-\frac{1}{n}\sum_{i=1}^{n}h(X_{i})+\ln\int_{[0,1]^{d}}\exp(h(x))dx

and its population version is

R(h)=-{\mathbf{E}}(h(X))+\ln\int_{[0,1]^{d}}\exp(h(x))dx.

With a slight modification, Mondrian forests can also be applied to this problem. Recall that the partition $\{{\mathcal{C}}_{b,\lambda,j}\}_{j=1}^{K_{b}(\lambda)}$ of the $b$-th Mondrian tree with stopping time $\lambda$ satisfies

[0,1]^{d}=\bigcup_{j=1}^{K_{b}(\lambda)}{\mathcal{C}}_{b,\lambda,j}\quad\text{and}\quad{\mathcal{C}}_{b,\lambda,j_{1}}\cap{\mathcal{C}}_{b,\lambda,j_{2}}=\varnothing,\ \forall j_{1}\neq j_{2}.

For each cell ${\mathcal{C}}_{b,\lambda,j}$, a constant $\hat{c}_{b,\lambda,j}\in{\mathbb{R}}$ is used as the predictor of $h(x)$ in this small region. Thus, the estimator of $h_{0}(x)$ based on a single tree has the form

\hat{h}_{b,n}^{pre}(x)=\sum_{j=1}^{K_{b}(\lambda)}\hat{c}_{b,\lambda,j}\cdot{\mathbb{I}}(x\in{\mathcal{C}}_{b,\lambda,j}),

where the coefficients are obtained by minimizing the empirical risk function:

\begin{aligned}
(\hat{c}_{b,\lambda,1},\ldots,\hat{c}_{b,\lambda,K_{b}(\lambda)}):=\ &\arg\min_{\substack{c_{b,\lambda,j}\in[-\beta_{n},\beta_{n}]\\ j=1,\ldots,K_{b}(\lambda)}} -\frac{1}{n}\sum_{j=1}^{K_{b}(\lambda)}\sum_{i=1}^{n}c_{b,\lambda,j}\cdot\mathbb{I}(X_{i}\in{\mathcal{C}}_{b,\lambda,j})\\
&+\ln\sum_{j=1}^{K_{b}(\lambda)}\int_{{\mathcal{C}}_{b,\lambda,j}}\exp(c_{b,\lambda,j})dx.
\end{aligned}

Since the objective above is continuous in the parameters $c_{b,\lambda,j}$ and is minimized over the compact set $[-\beta_{n},\beta_{n}]^{K_{b}(\lambda)}$, the minimum is indeed attained. To meet the zero-integral restriction, the estimator based on a single tree is centered as

\hat{h}_{b,n}(x)=\hat{h}_{b,n}^{pre}(x)-\int_{[0,1]^{d}}\hat{h}_{b,n}^{pre}(x)dx.

Finally, by applying the ensemble technique again, the estimator of $h_{0}$ based on the Mondrian forest is

\hat{h}_{n}(x):=\frac{1}{B}\sum_{b=1}^{B}\hat{h}_{b,n}(x),\quad x\in[0,1]^{d}. \qquad (18)
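To make the construction (18) concrete, the following is a minimal numerical sketch under the simplifying assumption that the Mondrian partitions have already been drawn and are represented only by their cell counts and cell volumes; the helper names (fit_tree_log_density, forest_log_density) are ours, and the box-constrained minimization is carried out by a plain projected gradient descent rather than any particular solver.

import numpy as np

def fit_tree_log_density(counts, volumes, beta_n, n_steps=2000, lr=0.5):
    """Minimize  -(1/n) * sum_j n_j c_j + log(sum_j vol_j * exp(c_j))
    over c_j in [-beta_n, beta_n] by projected gradient descent, then
    center the piecewise-constant fit so that it integrates to zero."""
    n = counts.sum()
    c = np.zeros_like(volumes, dtype=float)
    for _ in range(n_steps):
        w = volumes * np.exp(c)
        grad = -counts / n + w / w.sum()       # gradient of the convex objective
        c = np.clip(c - lr * grad, -beta_n, beta_n)
    # centering: the integral of the piecewise-constant fit is sum_j vol_j * c_j
    return c - np.sum(volumes * c)

def forest_log_density(cell_counts_list, cell_volumes_list, beta_n):
    """Centered single-tree fits, one coefficient vector per tree; the forest
    prediction (18) at a point x averages, over trees, the coefficient of the
    cell that contains x."""
    return [fit_tree_log_density(np.asarray(c, float), np.asarray(v, float), beta_n)
            for c, v in zip(cell_counts_list, cell_volumes_list)]

# Toy illustration on [0,1] with two hypothetical trees (fixed partitions).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.beta(2.0, 5.0, size=1000)              # density concentrated near 0
    edges1 = np.linspace(0.0, 1.0, 6)              # "tree 1": 5 equal cells
    edges2 = np.array([0.0, 0.1, 0.3, 0.6, 1.0])   # "tree 2": 4 uneven cells
    counts = [np.histogram(X, bins=e)[0] for e in (edges1, edges2)]
    vols = [np.diff(e) for e in (edges1, edges2)]
    coefs = forest_log_density(counts, vols, beta_n=np.log(np.log(len(X))) + 3)
    print([np.round(c, 2) for c in coefs])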

Next, we analyze the theoretical properties of $\hat{h}_{n}$. The only difference from the previous estimators is that (18) involves an additional penalty term, namely

Pen(h):=\ln\int_{[0,1]^{d}}\exp(h(x))dx,

where $h:[0,1]^{d}\to{\mathbb{R}}$. Our theoretical analysis is revised as follows. Let the pseudo loss function be $\ell^{pse}(v,y):=-v$ with $v\geq 0$, $y\in{\mathbb{R}}$. It is obvious that Assumption 1 holds for $\ell^{pse}(v,y)$, Assumption 2 is satisfied with $M_{1}(v,y)=1$, and Assumption 3 is satisfied with $M_{2}(v,y)=v$. After choosing $\beta_{n}\asymp\ln\ln n$ and following arguments similar to those in Theorem 2, we have

\begin{aligned}
{\mathbf{E}}R(\hat{h}_{n})-R(h) \leq\ & c_{1}\frac{\ln n}{\sqrt{n}}(1+\lambda_{n})^{d}+2d^{\frac{3}{2}p}C\cdot\frac{1}{\lambda_{n}^{p}}\\
&+\frac{C}{\sqrt{n}}+{\mathbf{E}}_{\lambda_{n}}|Pen(h)-Pen(h_{n}^{*}(x))|, \qquad (19)
\end{aligned}

where $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$ with $0<p\leq 1$, $h_{n}^{*}(x):=\sum_{j=1}^{K_{1}(\lambda_{n})}{\mathbb{I}}(x\in{\mathcal{C}}_{1,\lambda_{n},j})h(x_{1,\lambda_{n},j})$ with $x_{1,\lambda_{n},j}$ the center of cell ${\mathcal{C}}_{1,\lambda_{n},j}$, and $c_{1}$ is the coefficient in Theorem 2. Therefore, it remains to bound ${\mathbf{E}}_{\lambda_{n}}|Pen(h)-Pen(h_{n}^{*}(x))|$.

Lemma 15

For any $h(x)\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$ with $0<p\leq 1$ and $h_{n}^{*}(x)$ defined above,

{\mathbf{E}}_{\lambda_{n}}|Pen(h)-Pen(h_{n}^{*}(x))|\leq\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\cdot\left(\frac{1}{\lambda_{n}}\right)^{p}.

Choosing $\lambda_{n}=n^{\frac{1}{2(p+d)}}$, the combination of (19) and Lemma 15 yields the following bound on the regret function:

{\mathbf{E}}R(\hat{h}_{n})-R(h)\leq\left(c_{1}\ln\ln n+3d^{\frac{3}{2}p}C+\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}},

where $n$ is sufficiently large.

Let $\|\cdot\|_{\infty}$ denote the supremum norm of a function. To obtain the consistency rate of our density estimator, we replace Assumption 5 with the fact below.

Lemma 16

Suppose the true density $f_{0}(x)$ is bounded away from zero and infinity, namely $c_{0}<f_{0}(x)<c_{0}^{-1}$ for all $x\in[0,1]^{d}$, and that $\beta_{n}\asymp\ln\ln n$. For any function $h:[0,1]^{d}\to{\mathbb{R}}$ with $\|h\|_{\infty}\leq\beta_{n}$ and $\int_{[0,1]^{d}}h(x)dx=0$, we have

c_{0}\cdot\frac{1}{\ln n}\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}\leq R(h)-R(h_{0})\leq c_{0}^{-1}\cdot\ln n\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}.

Then Lemma 16 immediately implies the following consistency result.

Proposition 17

Suppose the true density $f_{0}(x)$ is bounded away from zero and infinity, namely $c_{0}<f_{0}(x)<c_{0}^{-1}$ for all $x\in[0,1]^{d}$. If $\lambda_{n}=n^{\frac{1}{2(p+d)}}$ and the true function $h_{0}\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$ with $0<p\leq 1$ satisfies $\int_{[0,1]^{d}}h_{0}(x)dx=0$, then there is $n_{den}\in\mathbb{Z}^{+}$ such that for $n>n_{den}$,

{\mathbf{E}}(\hat{h}_{n}(X)-h_{0}(X))^{2}\leq c_{0}^{-1}\cdot\ln n\cdot\left(c_{1}\ln\ln n+3d^{\frac{3}{2}p}C+\exp(2C)\cdot 2^{p}d^{\frac{3}{2}p}\right)\cdot\left(\frac{1}{n}\right)^{\frac{1}{2}\cdot\frac{p}{p+d}}.
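As a purely arithmetic reading of this bound: for $p=1$ and $d=2$ the exponent is $\frac{1}{2}\cdot\frac{p}{p+d}=\frac{1}{6}$, so the squared error decays like $n^{-1/6}$ up to the $\ln n\cdot\ln\ln n$ factor; the rate slows as $d$ grows, as expected for estimation over a Hölder-type class.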

7 Conclusion

In this paper, we proposed a general framework for Mondrian forests that can be used in many statistical and machine learning problems. These applications include, but are not limited to, LSE, generalized regression, density estimation, quantile regression, and binary classification. Meanwhile, we derived upper bounds on the regret/risk function, established statistical consistency, and showed how to apply these results to the specific problems listed above. Future work could study the asymptotic distribution of such general Mondrian forests, as suggested by Cattaneo et al. (2023).

References

  • Adams and Fournier (2003) Robert A Adams and John JF Fournier. Sobolev spaces. Elsevier, 2003.
  • Arlot and Genuer (2014) Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.
  • Athey et al. (2019) Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
  • Baptista et al. (2024) Ricardo Baptista, Eliza O’Reilly, and Yangxinyu Xie. Trim: Transformed iterative mondrian forests for gradient-based dimension reduction and high-dimensional regression. arXiv preprint arXiv:2407.09964, 2024.
  • Biau (2012) Gérard Biau. Analysis of a random forests model. The Journal of Machine Learning Research, 13(1):1063–1095, 2012.
  • Biau et al. (2008) Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(9), 2008.
  • Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  • Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • Cattaneo et al. (2023) Matias D Cattaneo, Jason M Klusowski, and William G Underwood. Inference with mondrian random forests. arXiv preprint arXiv:2310.09702, 2023.
  • Freund and Schapire (1997) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Friedman et al. (2000) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, pages 337–374, 2000.
  • Györfi et al. (2002) László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
  • Hearst et al. (1998) Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
  • Huang (1998) Jianhua Z Huang. Functional anova models for generalized regression. Journal of multivariate analysis, 67(1):49–71, 1998.
  • Huber (1992) Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer, 1992.
  • Klusowski (2021) Jason M Klusowski. Universal consistency of decision trees in high dimensions. arXiv preprint arXiv:2104.13881, 2021.
  • Knight (1998) Keith Knight. Limiting distributions for l1 regression estimators under general conditions. Annals of statistics, pages 755–770, 1998.
  • Kohler and Langer (2021) Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
  • Kosorok (2008) Michael R Kosorok. Introduction to empirical processes and semiparametric inference. Springer, 2008.
  • Lakshminarayanan et al. (2014) Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. Advances in neural information processing systems, 27, 2014.
  • Li and Yang (2003) Fan Li and Yiming Yang. A loss function analysis for classification methods in text categorization. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 472–479, 2003.
  • Liaw et al. (2002) Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
  • Lugosi and Vayatis (2004) Gábor Lugosi and Nicolas Vayatis. On the bayes-risk consistency of regularized boosting methods. The Annals of statistics, 32(1):30–55, 2004.
  • Mourtada et al. (2020) Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. Minimax optimal rates for Mondrian trees and forests. The Annals of Statistics, 48(4):2253 – 2276, 2020.
  • Mourtada et al. (2021) Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. Amf: Aggregated mondrian forests for online learning. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(3):505–533, 2021.
  • Rennie and Srebro (2005) Jason DM Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling, volume 1. AAAI Press, Menlo Park, CA, 2005.
  • Roy and Teh (2008) Daniel M Roy and Yee W Teh. The mondrian process. In Advances in neural information processing systems, pages 1377–1384, 2008.
  • Roy (2011) Daniel Murphy Roy. Computability, inference and modeling in probabilistic programming. PhD thesis, Massachusetts Institute of Technology, 2011.
  • Schmidt-Hieber (2020) Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 48(4):1875–1897, 2020.
  • Scornet et al. (2015) Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.
  • Sen (2018) Bodhisattva Sen. A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University, 11:28–29, 2018.
  • Shalev-Shwartz and Ben-David (2014) S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. ISBN 9781107057135.
  • Stone (1986) Charles J Stone. The dimensionality reduction principle for generalized additive models. The Annals of Statistics, pages 590–606, 1986.
  • Stone (1994) Charles J Stone. The use of polynomial splines and their tensor products in multivariate function estimation. The annals of statistics, 22(1):118–171, 1994.
  • Zhang and Wang (2009) Heping Zhang and Minghui Wang. Search for the smallest random forest. Statistics and its interface, 2 3:381, 2009.
  • Zhang and Oles (2001) Tong Zhang and Frank J Oles. Text categorization based on regularized linear classification methods. Information retrieval, 4:5–31, 2001.

Appendix A Proofs

This section contains the proofs of the theoretical results in the paper. Several useful preliminaries and notations are introduced first. Throughout, the constant $c$ is a positive number that may change from line to line in order to simplify notation.

Definition A1 (Blumer et al. (1989))

Let \mathcal{F} be a Boolean function class in which each f:𝒵{0,1}f:\mathcal{Z}\to\{0,1\} is binary-valued. The growth function of \mathcal{F} is defined by

Π(m)=maxz1,,zm𝒵|{(f(z1),,f(zm)):f}|\Pi_{\mathcal{F}}(m)=\max_{z_{1},\ldots,z_{m}\in\mathcal{Z}}|\{(f(z_{1}),\ldots,f(z_{m})):f\in\mathcal{F}\}|

for each positive integer m+m\in\mathbb{Z}_{+}.

Definition A2 (Györfi et al. (2002))

Let z1,,znpz_{1},\ldots,z_{n}\in\mathbb{R}^{p} and z1n={z1,,zn}z_{1}^{n}=\{z_{1},\ldots,z_{n}\}. Let \mathcal{H} be a class of functions h:ph:\mathbb{R}^{p}\to\mathbb{R}. An LqL_{q} ε\varepsilon-cover of \mathcal{H} on z1nz_{1}^{n} is a finite set of functions h1,,hN:ph_{1},\ldots,h_{N}:\mathbb{R}^{p}\to\mathbb{R} satisfying

min1jN(1ni=1n|h(zi)hj(zi)|q)1q<ε,h.\min_{1\leq j\leq N}{\left(\frac{1}{n}\sum_{i=1}^{n}{|h(z_{i})-h_{j}(z_{i})|^{q}}\right)^{\frac{1}{q}}}<\varepsilon,\ \ \forall h\in\mathcal{H}.

Then, the LqL_{q} ε\varepsilon-cover number of \mathcal{H} on z1nz_{1}^{n}, denoted by 𝒩q(ε,,z1n)\mathcal{N}_{q}(\varepsilon,\mathcal{H},z_{1}^{n}), is the minimal size of an LqL_{q} ε\varepsilon-cover of \mathcal{H} on z1nz_{1}^{n}. If there exist no finite LqL_{q} ε\varepsilon-cover of \mathcal{H}, then the above cover number is defined as 𝒩q(ε,,z1n)=\mathcal{N}_{q}(\varepsilon,\mathcal{H},z_{1}^{n})=\infty.

To bound the generalization error, the notion of a VC class will be used in our proof. To make this appendix self-contained, we first introduce some basic concepts and facts about the VC dimension; see Shalev-Shwartz and Ben-David (2014) for more details.

Definition A3 (Kosorok (2008))

The subgraph of a real function f:𝒳f:\mathcal{X}\to\mathbb{R} is a subset of 𝒳×\mathcal{X}\times\mathbb{R} defined by

Cf={(x,y)𝒳×:f(x)>y},C_{f}=\{(x,y)\in\mathcal{X}\times\mathbb{R}:f(x)>y\},

where 𝒳\mathcal{X} is an abstract set.

Definition A4 (Kosorok (2008))

Let 𝒞\mathcal{C} be a collection of subsets of the set 𝒳\mathcal{X} and {x1,,xm}𝒳\{x_{1},\ldots,x_{m}\}\subset\mathcal{X} be an arbitrary set of mm points. Define that 𝒞\mathcal{C} picks out a certain subset AA of {x1,,xm}\{x_{1},\ldots,x_{m}\} if AA can be expressed as C{x1,,xm}C\cap\{x_{1},\ldots,x_{m}\} for some C𝒞C\in\mathcal{C}. The collection 𝒞\mathcal{C} is said to shatter {x1,,xm}\{x_{1},\ldots,x_{m}\} if each of 2m2^{m} subsets can be picked out.

Definition A5 (Kosorok (2008))

The VC dimension of a real-valued function class $\mathcal{F}$, where each $f\in\mathcal{F}$ is defined on $\mathcal{X}$, is the largest integer $V$ such that some set of $V$ points in $\mathcal{X}\times\mathbb{R}$ is shattered by the collection of subgraphs $\{C_{f}:f\in\mathcal{F}\}$. In this paper, we use $VC(\mathcal{F})$ to denote the VC dimension of $\mathcal{F}$.

Proof of Theorem 2. By Assumption 1, the convexity of the risk function implies

𝐄(R(h^n))1Bb=1B𝐄(R(h^b,n)).{\mathbf{E}}(R(\hat{h}_{n}))\leq\frac{1}{B}\sum_{b=1}^{B}{\mathbf{E}}(R(\hat{h}_{b,n})).

Therefore, we only need to consider the excess risk of a single tree in the following analysis.

In fact, our proof is based on the following decomposition:

\begin{aligned}
{\mathbf{E}}R(\hat{h}_{1,n})-R(h)=\ &{\mathbf{E}}(R(\hat{h}_{1,n})-\hat{R}(\hat{h}_{1,n}))+{\mathbf{E}}(\hat{R}(\hat{h}_{1,n})-\hat{R}(h))\\
&+{\mathbf{E}}(\hat{R}(h)-R(h))\\
:=\ &I+II+III,
\end{aligned}

where $I$ relates to the variance of the Mondrian tree, $II$ is the approximation error of the Mondrian tree to $h\in\mathcal{H}^{p,\beta}([0,1]^{d},C)$, and $III$ measures the error incurred when the empirical risk $\hat{R}(h)$ is used to approximate its population counterpart.

Analysis of Part I. Define two classes first.

𝒯(t):={A Mondrian tree with t leaves by partitioning [0,1]d}\mathcal{T}(t):=\{\text{A Mondrian tree with $t$ leaves by partitioning $[0,1]^{d}$}\}
\mathcal{G}(t):=\Big\{\sum_{j=1}^{t}{\mathbb{I}}(x\in{\mathcal{C}}_{j})\cdot c_{j}:c_{j}\in{\mathbb{R}},\ \text{the }{\mathcal{C}}_{j}\text{'s are leaves of a tree in }\mathcal{T}(t)\Big\}.

Thus, the truncated function class of 𝒢(t)\mathcal{G}(t) is given by

𝒢(t,z):={g~(x)=Tzg(x):g𝒢(t)},\mathcal{G}(t,z):=\{\tilde{g}(x)=T_{z}g(x):g\in\mathcal{G}(t)\},

where the threshold $z>0$. Then, Part I can be bounded as follows.

|I|\displaystyle|I| 𝐄πλn(𝐄𝒟n|1ni=1n(h^1,n(Xi),Yi)𝐄((h^1,n(X),Y)|𝒟n)||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-{\mathbf{E}}(\ell(\hat{h}_{1,n}(X),Y)|{\mathcal{D}}_{n})}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
=𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))|𝕀(An)|πλn)\displaystyle={\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n})\Big{|}\pi_{\lambda_{n}}\right)
+𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),Yi)𝐄((g(X),Y))|𝕀(Anc)|πλn)\displaystyle+{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n}^{c})\Big{|}\pi_{\lambda_{n}}\right)
:=I1+I2,\displaystyle:=I_{1}+I_{2}, (A.1)

where $A_{n}:=\{\max_{1\leq i\leq n}|Y_{i}|\leq\ln n\}$. Next, we need to find upper bounds for $I_{1}$ and $I_{2}$, respectively.

Let us consider $I_{1}$ first and decompose it as below.

I1\displaystyle I_{1} 𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),Y))|𝕀(An)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\cap\mathbb{I}(A_{n})\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),Y))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)𝐄((g(X),TlnnY))||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})-{\mathbf{E}}(\ell(g(X),T_{\ln{n}}Y))}\right|\Big{|}\pi_{\lambda_{n}}\right)
+𝐄πλn(supg𝒢(K(λn),βn)|𝐄((g(X),TlnnY))𝐄((g(X),Y))||πλn)\displaystyle+{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|{\mathbf{E}}(\ell(g(X),T_{\ln{n}}Y))-{\mathbf{E}}(\ell(g(X),Y))\right|\Big{|}\pi_{\lambda_{n}}\right)
:=I1,1+I1,2.\displaystyle:=I_{1,1}+I_{1,2}.

The part I1,1I_{1,1} can be bounded by considering the covering number of the function class

n:={(g(),Tlnn()):g𝒢(K(λn),βn)},\mathcal{L}_{n}:=\{\ell(g(\cdot),T_{\ln n}(\cdot)):g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})\},

where K(λn)K(\lambda_{n}) denotes the number of regions in the partition that is constructed by the truncated Mondrian process πλn\pi_{\lambda_{n}} with stopping time λn\lambda_{n}. Therefore, K(λn)K(\lambda_{n}) is a deterministic number once πλn\pi_{\lambda_{n}} is given. For any ε>0\varepsilon>0, recall the definition of the covering number of 𝒢(K(λn))\mathcal{G}(K(\lambda_{n})), namely 𝒩1(ε,𝒢(K(λn),βn),z1n)\mathcal{N}_{1}(\varepsilon,\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}) shown in Definition A2. Now, we suppose

{η1(x),η2(x),,ηJ(x)}\{\eta_{1}(x),\eta_{2}(x),\ldots,\eta_{J}(x)\}

is an $\varepsilon/M_{1}(\beta_{n},\ln n)$-cover of the class $\mathcal{G}(K(\lambda_{n}),\beta_{n})$ in the $L^{1}(z_{1}^{n})$ space, where $L^{1}(z_{1}^{n}):=\{f(x):\|f\|_{z_{1}^{n}}:=\frac{1}{n}\sum_{i=1}^{n}|f(z_{i})|<\infty\}$ is equipped with the norm $\|\cdot\|_{z_{1}^{n}}$ and $J\geq 1$. Without loss of generality, we can further assume $|\eta_{j}(x)|\leq\beta_{n}$, since every function in $\mathcal{G}(K(\lambda_{n}),\beta_{n})$ is bounded by $\beta_{n}$; otherwise, we replace $\eta_{j}(x)$ by its truncation $T_{\beta_{n}}\eta_{j}(x)$. According to Assumption 2, we know that for any $g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})$ and $\eta_{j}(x)$,

|\ell(g(x),T_{\ln n}y)-\ell(\eta_{j}(x),T_{\ln n}y)|\leq M_{1}(\beta_{n},\ln n)|g(x)-\eta_{j}(x)|,

where x[0,1]d,y.x\in[0,1]^{d},y\in{\mathbb{R}}. The above inequality implies that

\frac{1}{n}\sum_{i=1}^{n}|\ell(g(z_{i}),T_{\ln n}w_{i})-\ell(\eta_{j}(z_{i}),T_{\ln n}w_{i})|\leq M_{1}(\beta_{n},\ln n)\cdot\frac{1}{n}\sum_{i=1}^{n}|g(z_{i})-\eta_{j}(z_{i})|

for any $z_{1}^{n}:=(z_{1},z_{2},\ldots,z_{n})\in{\mathbb{R}}^{d}\times\cdots\times{\mathbb{R}}^{d}$ and $(w_{1},\ldots,w_{n})\in{\mathbb{R}}\times\cdots\times{\mathbb{R}}$. Therefore, $\ell(\eta_{1}(x),T_{\ln n}y),\ldots,\ell(\eta_{J}(x),T_{\ln n}y)$ is an $\varepsilon$-cover of the class $\mathcal{L}_{n}$ in the $L^{1}(v_{1}^{n})$ space, where $v_{1}^{n}:=((z_{1}^{T},w_{1})^{T},\ldots,(z_{n}^{T},w_{n})^{T})$. In other words, we have

𝒩1(ε,n,v1n)𝒩1(εM1(βn,lnn),𝒢(K(λn),βn),z1n).\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})\leq\mathcal{N}_{1}\left(\frac{\varepsilon}{M_{1}(\beta_{n},\ln n)},\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}\right). (A.2)

Note that $\mathcal{G}(K(\lambda_{n}),\beta_{n})$ is a VC class, since $\mathcal{G}(K(\lambda_{n}))$ is a VC class (see Lemma A6 below). Furthermore, every function in $\mathcal{G}(K(\lambda_{n}),\beta_{n})$ is bounded by $\beta_{n}$. Therefore, we can bound the RHS of (A.2) by using Theorem 7.12 in Sen (2018):

𝒩1(εM1(βn,lnn),𝒢(K(λn),βn),z1n)\displaystyle\mathcal{N}_{1}\left(\frac{\varepsilon}{M_{1}(\beta_{n},\ln n)},\mathcal{G}(K(\lambda_{n}),\beta_{n}),z_{1}^{n}\right)
cVC(𝒢(K(λn),βn))(4e)VC(𝒢(K(λn),βn))(βnε)VC(𝒢(K(λn),βn)),\displaystyle\leq c\cdot VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))(4e)^{VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))}\left(\frac{\beta_{n}}{\varepsilon}\right)^{VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))}, (A.3)

where the constant c>0c>0 is universal. On the other hand, it is not difficult to show

VC(𝒢(K(λn),βn))VC(𝒢(K(λn))).VC(\mathcal{G}(K(\lambda_{n}),\beta_{n}))\leq VC(\mathcal{G}(K(\lambda_{n}))). (A.4)

Thus, the combination of (A.4), (A.3) and (A.2) implies

𝒩1(ε,n,v1n)cVC(𝒢(K(λn)))(4e)VC(𝒢(K(λn)))(βnε)VC(𝒢(K(λn)))\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})\leq c\cdot VC(\mathcal{G}(K(\lambda_{n})))(4e)^{VC(\mathcal{G}(K(\lambda_{n})))}\left(\frac{\beta_{n}}{\varepsilon}\right)^{VC(\mathcal{G}(K(\lambda_{n})))} (A.5)

for each v1nv_{1}^{n}. Note that the class n\mathcal{L}_{n} has an envelop function M2(βn,y)M_{2}(\beta_{n},y) satisfying ν(n):=𝐄(M22(βn,Y))<\nu(n):=\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}<\infty by Assumption 3. Construct a series of independent Rademacher variables {bi}i=1n\{b_{i}\}_{i=1}^{n} sharing with the same distribution 𝐏(bi=±1)=0.5,i=1,,n{\mathbf{P}}(b_{i}=\pm 1)=0.5,i=1,\ldots,n. Then, the symmetrization technique (see Lemma 3.12 in Sen (2018)) and the Dudley entropy integral (see (41) in Sen (2018)) and (A.5) imply

I1,1\displaystyle I_{1,1} 𝐄πλn(𝐄𝒟nsupg𝒢(K(λn),βn)|1ni=1n(g(Xi),TlnnYi)bi||πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}{\ell(g(X_{i}),T_{\ln{n}}Y_{i})}b_{i}\right|\Big{|}\pi_{\lambda_{n}}\right)\
(symmetrization technique)
𝐄πλn(24n0ν(n)ln𝒩1(ε,n,v1n)𝑑ε|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\frac{24}{\sqrt{n}}\int_{0}^{\nu(n)}\sqrt{\ln\mathcal{N}_{1}(\varepsilon,\mathcal{L}_{n},v_{1}^{n})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)\ \ \qquad\qquad
(Dudley’s entropy integral)
cn𝐄πλn(0ν(n)ln(1+c(βn/ε)2VC(𝒢(K(λn)))𝑑ε|πλn)\displaystyle\leq\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{\nu(n)}\sqrt{\ln(1+c\cdot(\beta_{n}/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)
(E.q. (A.5))\displaystyle(\text{E.q. \eqref{789nsvbfc}})
βncn𝐄πλn(0ν(n)/βnln(1+c(1/ε)2VC(𝒢(K(λn)))𝑑ε|πλn),\displaystyle\leq\beta_{n}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{\nu(n)/\beta_{n}}\sqrt{\ln(1+c(1/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right), (A.6)

where we use the fact that πλ\pi_{\lambda} is independent to the data set 𝒟n{\mathcal{D}}_{n} and c>0c>0 is universal. Without loss of generality, we can assume ν(n)/βn<1\nu(n)/\beta_{n}<1; otherwise just set βn=max{ν(n),βn}\beta_{n}^{\prime}=\max\{\nu(n),\beta_{n}\} as the new upper bound of the function class 𝒢(K(λn),βn)\mathcal{G}(K(\lambda_{n}),\beta_{n}). Therefore, (A.6) also implies

I1,1\displaystyle I_{1,1} max{βn,ν(n)}cn𝐄πλn(01ln(1+c(1/ε)2VC(𝒢(K(λn)))𝑑ε|πλn)\displaystyle\leq\max\{\beta_{n},\nu(n)\}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\int_{0}^{1}\sqrt{\ln(1+c(1/\varepsilon)^{2VC(\mathcal{G}(K(\lambda_{n}))})}d\varepsilon\Big{|}\pi_{\lambda_{n}}\right)
max{βn,ν(n)}cn𝐄πλn(VC(𝒢(K(λn))n)01ln(1/ε)𝑑ε\displaystyle\leq\max\{\beta_{n},\nu(n)\}\cdot\frac{c}{\sqrt{n}}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{VC(\mathcal{G}(K(\lambda_{n}))}{n}}\right)\cdot\int_{0}^{1}\sqrt{\ln(1/\varepsilon)}d\varepsilon
cmax{βn,ν(n)}𝐄πλn(VC(𝒢(K(λn))n).\displaystyle\leq c\cdot\max\{\beta_{n},\nu(n)\}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{VC(\mathcal{G}(K(\lambda_{n}))}{n}}\right). (A.7)

The next task is to find the VC dimension of class 𝒢(t)\mathcal{G}(t) for each positive integer t+t\in\mathbb{Z}_{+}, which is summarized below.

Lemma A6

For each integer t+t\in\mathbb{Z}_{+}, VC(𝒢(t))c(d)tln(t)VC(\mathcal{G}(t))\leq c(d)\cdot t\ln(t).

Proof  Recall two defined classes:

𝒯(t):={A Mondrian tree with t leaves by partitioning [0,1]d}\mathcal{T}(t):=\{\text{A Mondrian tree with $t$ leaves by partitioning $[0,1]^{d}$}\}
\mathcal{G}(t):=\Big\{\sum_{j=1}^{t}{\mathbb{I}}(x\in{\mathcal{C}}_{j})\cdot c_{j}:c_{j}\in{\mathbb{R}},\ \text{the }{\mathcal{C}}_{j}\text{'s are leaves of a tree in }\mathcal{T}(t)\Big\}.

For any Mondrian tree gt𝒢(t)g_{t}\in\mathcal{G}(t) with the form

gt(x)=j=1t𝕀(x𝒞j)cj,g_{t}(x)=\sum_{j=1}^{t}{\mathbb{I}}(x\in{\small\mathcal{C}}_{j})\cdot c_{j}, (A.8)

we also use notation {𝒞1,,𝒞t;c1,,ct}\{{\small\mathcal{C}}_{1},\ldots,{\small\mathcal{C}}_{t};c_{1},\ldots,c_{t}\} to denote tree (A.8) later.

In order to bound VC(𝒢(t))VC(\mathcal{G}(t)), we need to consider the related Boolean class:

t={sgn(f(x,y)):f(x,y)=g(x)y,g𝒢(t)},\mathcal{F}_{t}=\{sgn(f(x,y)):f(x,y)=g(x)-y,g\in\mathcal{G}(t)\},

where $sgn(v)=1$ if $v\geq 0$ and $sgn(v)=-1$ otherwise. Recall that the VC dimension of $\mathcal{F}_{t}$, denoted by $VC(\mathcal{G}(t))$, is the largest integer $m\in\mathbb{Z}_{+}$ satisfying $2^{m}\leq\Pi_{\mathcal{F}_{t}}(m)$; see Definition A5. Next we focus on bounding $\Pi_{\mathcal{F}_{t}}(m)$ for each positive integer $m\in\mathbb{Z}_{+}$.

Define the partition set generated by 𝒢(t)\mathcal{G}(t):

{\mathcal{P}}_{t}:=\left\{\{{\mathcal{C}}_{1},\ldots,{\mathcal{C}}_{t}\}:\{{\mathcal{C}}_{1},\ldots,{\mathcal{C}}_{t};c_{1},\ldots,c_{t}\}\in\mathcal{G}(t)\right\}

and the maximal number of distinct partitions of $m$ points induced by ${\mathcal{P}}_{t}$:

K(t):=maxx1,,xm[0,1]pCard({𝒫𝒫t{{x1,,xm}A:A𝒫}}).K(t):=\max_{x_{1},\ldots,x_{m}\in[0,1]^{p}}Card\left(\left\{\bigcup_{{\mathcal{P}}\in{\mathcal{P}}_{t}}\{\{x_{1},\ldots,x_{m}\}\cap A:A\in{\mathcal{P}}\}\right\}\right). (A.9)

Since each Mondrian tree takes a constant value on each leaf ${\mathcal{C}}_{j}$ ($j=1,\ldots,t$), there are at most $\ell+1$ ways to pick out any fixed points $\{(x_{1},y_{1}),\ldots,(x_{\ell},y_{\ell})\}\subseteq[0,1]^{p}\times{\mathbb{R}}$ that lie in the same leaf. In other words,

max(xj,yj)[0,1]p×j=1,,Card({(sgn(f(x1,y1))\displaystyle\max_{\begin{subarray}{c}(x_{j},y_{j})\in[0,1]^{p}\times{\mathbb{R}}\\ j=1,\ldots,\ell\end{subarray}}Card\Big{(}\Big{\{}(sgn(f(x_{1},y_{1})) ,,sgn(f(x,y)))\displaystyle,\ldots,sgn(f(x_{\ell},y_{\ell})))
:f(x,y)=cy,x[0,1]p,y,c})+1.\displaystyle:f(x,y)=c-y,x\in[0,1]^{p},y\in{\mathbb{R}},c\in{\mathbb{R}}\Big{\}}\Big{)}\leq\ell+1. (A.10)

According to the tree structure in (A.8), the growth function of 𝒢(t)\mathcal{G}(t) satisfies

Πt(m)K(t)(m+1)t.\Pi_{\mathcal{F}_{t}}(m)\leq K(t)\cdot(m+1)^{t}. (A.11)

Therefore, it remains to bound $K(t)$. In fact, we can prove by induction that

K(t)[c(p)(3m)p+1]t,K(t)\leq\left[c(p)(3m)^{p+1}\right]^{t}, (A.12)

where the constant c(p)c(p) only depends on pp. The arguments are given below.

When t=1t=1, the partition generated by 𝒢(1)\mathcal{G}({1}) is {[0,1]p}\{[0,1]^{p}\}. Thus, K(1)=1K(1)=1 and (A.12) is satisfied obviously in this case.

Suppose (A.12) is satisfied with t1t-1. Now we consider the case for tt. Here, we need to use two facts. First, any gtg_{t} must be grown from a gt1g_{t-1} by splitting one of its leaves; Second, any gt1g_{t-1} can grow to be another gtg_{t} by partitioning one of its leaves. In conclusion,

𝒢(t)={gt:gt is obtained by splitting one of the leaves of gt1𝒢(t1)}.\mathcal{G}({t})=\{g_{t}:g_{t}\text{\ is\ obtained by splitting one of the leaves of }g_{t-1}\in\mathcal{G}({t-1})\}. (A.13)

Suppose $x_{1}^{*},\ldots,x_{m}^{*}\in[0,1]^{p}$ attain the maximum in the definition of $K(t)$. We divide $\mathcal{G}(t-1)$ into $K(t-1)$ groups:

\mathcal{G}^{j}(t-1):=\left\{g_{t-1}\in\mathcal{G}(t-1):\text{the }g_{t-1}\text{'s share the same partition of }\{x_{1}^{*},\ldots,x_{m}^{*}\}\right\},

where j=1,2,,K(t1)j=1,2,\ldots,K(t-1). Meanwhile, we write the partition for {x1,,xm}\{x_{1}^{*},\ldots,x_{m}^{*}\} that is cut by 𝒢j(t1)\mathcal{G}^{j}(t-1) as

{xu1j,xu2j,,xum1,jj}m1,jtimes,,{xum1,j++mtn2,j+1j,,xum}mtn,jtimes,\underbrace{\{x^{*}_{u_{1}^{j}},x^{*}_{u_{2}^{j}},\ldots,x^{*}_{u^{j}_{m_{1,j}}}\}}_{m_{1,j}\ \text{times}},\ldots,\underbrace{\{x^{*}_{u^{j}_{m_{1,j}+\cdots+m_{t_{n}-2,j}+1}},\ldots,x^{*}_{u_{m}}\}}_{m_{t_{n},j}\ \text{times}}, (A.14)

where (u1j,,umj)(u_{1}^{j},\ldots,u_{m}^{j}) is a permutation of (1,,m)(1,\ldots,m) and the cardinality of above sets are written as m1,j,m2,j,,mtn,jm_{1,j},m_{2,j},\ldots,m_{t_{n},j}. Note that both (u1j,,umj)(u_{1}^{j},\ldots,u_{m}^{j}) and (m1,j,m2,j,,mtn,j)(m_{1,j},m_{2,j},\ldots,m_{t_{n},j}) are determined once the index jj is fixed.

If we have generated a Mondrian tree with tt leaves from its parent in 𝒢j(t1)\mathcal{G}^{j}(t-1), the corresponding partition for {x1,,xm}\{x_{1}^{*},\ldots,x_{m}^{*}\} can be obtained by following two steps below

  1. Partition one of the sets in (A.14) using a hyperplane $\theta^{T}x=s$, where $\theta\in{\mathbb{R}}^{p}$, $s\in{\mathbb{R}}$.

  2. Keep the other $t-2$ sets in (A.14) unchanged.

Denote by $\tilde{K}_{j}$ the number of partitions of $\{x_{1}^{*},\ldots,x_{m}^{*}\}$ obtained by the above process. Then, the Sauer-Shelah lemma (see, for example, Lemma 6.10 in Shalev-Shwartz and Ben-David (2014)) tells us

K~j=1t1c(p)(mep+1)p+1=1t1c(p)(3m)p+1c(p)(3m)p+1.\tilde{K}_{j}\leq\sum_{\ell=1}^{t-1}c(p)\left(\frac{m_{\ell}e}{p+1}\right)^{p+1}\leq\sum_{\ell=1}^{t-1}c(p)\left(3m_{\ell}\right)^{p+1}\leq c(p)\cdot(3m)^{p+1}.

Since each 𝒢j(t1)\mathcal{G}^{j}({t{-1}}) can at most produce c(p)(3m)p+1c(p)(3m)^{p+1} new partitions of {x1,,xm}\{x_{1}^{*},\ldots,x_{m}^{*}\} based on (A.14), (A.13) and 𝒢(t1)=j=1K(t1)𝒢j(t1)\mathcal{G}({t-1})=\cup_{j=1}^{K(t-1)}\mathcal{G}^{j}({t{-1}}) imply

K(t)K(t1)c(p)(3m)p+1.K(t)\leq K(t-1)\cdot c(p)(3m)^{p+1}.

Therefore, (A.12) also holds for the case of K(t)K(t).

According to (A.11) and (A.12), we finally have

Πt(m)[c(p)(3m)p+1]t(m+1)t(3c(p)m)t(p+2).\Pi_{\mathcal{F}_{t}}(m)\leq[c(p)\cdot(3m)^{p+1}]^{t}\cdot(m+1)^{t}\leq(3c(p)m)^{t(p+2)}.

Solving the inequality

2m(3c(p)m)t(p+2)2^{m}\leq(3c(p)m)^{t(p+2)}

by using the basic inequality lnxγxlnγ1\ln x\leq\gamma\cdot x-\ln\gamma-1 with x,γ>0x,\gamma>0 yields

VC(𝒢(t))c(p)tln(t),t2,VC(\mathcal{G}({t}))\leq c(p)\cdot t\ln(t),\forall t\geq 2, (A.15)

where the constant c(p)>0c(p)>0 depends on pp only. This completes the proof.  

Therefore, we know from Lemma A6 and (A.7) that

I1,1cmax{βn,ν(n)}𝐄πλn(K(λn)lnK(λn)n).I_{1,1}\leq c\cdot\max\{\beta_{n},\nu(n)\}\cdot{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sqrt{\frac{K(\lambda_{n})\ln K(\lambda_{n})}{n}}\right).

By the basic inequality lnxxβ/β,x1,β>0\ln x\leq x^{\beta}/\beta,\ \forall x\geq 1,\forall\beta>0, from above inequality we have

I1,1cmax{βn,ν(n)}n𝐄(K(λn)).I_{1,1}\leq c\cdot\frac{\max\{\beta_{n},\nu(n)\}}{\sqrt{n}}\cdot{\mathbf{E}}(K(\lambda_{n})). (A.16)

Next, from Proposition 2 in Mourtada et al. (2020), we know 𝐄(K(λn))=(1+λn)d{\mathbf{E}}(K(\lambda_{n}))=(1+\lambda_{n})^{d}. Finally, we have the following upper bound for I1,1I_{1,1}

I1,1cmax{βn,ν(n)}n(1+λn)d.I_{1,1}\leq c\cdot\frac{\max\{\beta_{n},\nu(n)\}}{\sqrt{n}}\cdot(1+\lambda_{n})^{d}. (A.17)
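As an aside, and purely as an illustration of the identity ${\mathbf{E}}(K(\lambda_{n}))=(1+\lambda_{n})^{d}$ quoted above from Proposition 2 in Mourtada et al. (2020), here is a small Monte Carlo sketch; the recursive sampler follows the standard generative description of the Mondrian process restricted to a box, and the function name is ours.

import numpy as np

def mondrian_leaf_count(lows, highs, budget, rng):
    """Sample a Mondrian partition of the box [lows, highs] with remaining
    lifetime `budget` and return its number of leaves (cells)."""
    sides = highs - lows
    linear_dim = sides.sum()
    # time to the next split is exponential with rate = the box's linear dimension
    split_time = rng.exponential(1.0 / linear_dim)
    if split_time > budget:
        return 1                      # no further split: this box is a leaf
    # split coordinate chosen proportionally to side lengths, cut point uniform
    j = rng.choice(len(sides), p=sides / linear_dim)
    cut = rng.uniform(lows[j], highs[j])
    left_high, right_low = highs.copy(), lows.copy()
    left_high[j], right_low[j] = cut, cut
    remaining = budget - split_time
    return (mondrian_leaf_count(lows, left_high, remaining, rng)
            + mondrian_leaf_count(right_low, highs, remaining, rng))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, lam, reps = 2, 3.0, 2000
    lows, highs = np.zeros(d), np.ones(d)
    counts = [mondrian_leaf_count(lows, highs, lam, rng) for _ in range(reps)]
    print("Monte Carlo mean:", np.mean(counts), " theory (1+lam)^d:", (1.0 + lam) ** d)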

Then, we bound the second part I1,2I_{1,2} of I1I_{1} by following the arguments below.

I1,2\displaystyle I_{1,2} 𝐄πλn(supg𝒢(K(λn),βn)𝐄(|(g(X),TlnnY)(g(X),Y)|)|πλn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),T_{\ln{n}}Y)-\ell(g(X),Y)|)\Big{|}\pi_{\lambda_{n}}\right)
=𝐄πλn(supg𝒢(K(λn),βn)𝐄(|(g(X),lnn)(g(X),Y)|𝕀({|Y|>lnn}))|πλn)\displaystyle={\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),\ln n)-\ell(g(X),Y)|\cdot\mathbb{I}(\{|Y|>\ln n\}))\Big{|}\pi_{\lambda_{n}}\right)
𝐄πλn((supg𝒢(K(λn),βn)𝐄(|(g(X),lnn)(g(X),Y)|2))12|πλn)𝐏12(|Y|>lnn)\displaystyle\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left(\left(\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}{\mathbf{E}}(|\ell(g(X),\ln n)-\ell(g(X),Y)|^{2})\right)^{\frac{1}{2}}\Big{|}\pi_{\lambda_{n}}\right){\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n)
(supx[βn,βn]2(x,lnn)+𝐄(M22(βn,Y)))12cexp(lnnw(lnn)/c),\displaystyle\leq\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{\ell^{2}(x,\ln n)}+{\mathbf{E}}(M^{2}_{2}(\beta_{n},Y))\right)^{\frac{1}{2}}\cdot c\exp(-\ln n\cdot w(\ln n)/c), (A.18)

where in the third line we use Cauchy-Schwarz inequality and in last line we use Assumption 3 and the tail probability of YY given in Assumption 4.

Finally, we end the analysis of Part I by bounding $I_{2}$. In fact, this bound can be obtained as follows.

\begin{aligned}
I_{2}&={\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left|\frac{1}{n}\sum_{i=1}^{n}\big(\ell(g(X_{i}),Y_{i})-{\mathbf{E}}(\ell(g(X),Y))\big)\right|\cdot\mathbb{I}(A_{n}^{c})\,\Big|\,\pi_{\lambda_{n}}\right)\\
&\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\sup_{g\in\mathcal{G}(K(\lambda_{n}),\beta_{n})}\left(\frac{1}{n}\sum_{i=1}^{n}\ell(g(X_{i}),Y_{i})+{\mathbf{E}}(\ell(g(X),Y))\right)\cdot\mathbb{I}(A_{n}^{c})\,\Big|\,\pi_{\lambda_{n}}\right)\\
&\leq{\mathbf{E}}_{\pi_{\lambda_{n}}}\left({\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\frac{1}{n}\sum_{i=1}^{n}M_{2}(\beta_{n},Y_{i})+{\mathbf{E}}(M_{2}(\beta_{n},Y))\right)\cdot\mathbb{I}(A_{n}^{c})\,\Big|\,\pi_{\lambda_{n}}\right)\qquad(\text{Assumption 3})\\
&\leq 2\cdot\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\cdot{\mathbf{P}}(A_{n}^{c}). \qquad\text{(A.19)}
\end{aligned}

Thus, we only need to find the upper bound of 𝐏(Anc){\mathbf{P}}(A_{n}^{c}). By Assumption 4, we know

𝐏(Anc)\displaystyle{\mathbf{P}}(A_{n}^{c}) =1𝐏(max1in|Yi|lnn)=1[𝐏(|Y1|lnn)]n1(1ceclnnw(lnn))n\displaystyle=1-{\mathbf{P}}\left(\max_{1\leq i\leq n}|Y_{i}|\leq\ln n\right)=1-\left[{\mathbf{P}}(|Y_{1}|\leq\ln n)\right]^{n}\leq 1-(1-c\cdot e^{-c\cdot\ln n\cdot w(\ln n)})^{n}
1enln(1ceclnnw(lnn))nln(1ceclnnw(lnn))cneclnnw(lnn)\displaystyle\leq 1-e^{n\cdot\ln(1-c\cdot e^{-c\cdot\ln n\cdot w(\ln n)})}\leq-n\cdot\ln(1-c\cdot e^{-c\cdot\ln{n}\cdot w(\ln n)})\leq c\cdot n\cdot e^{-c\cdot\ln{n}\cdot w(\ln n)}
cn\displaystyle\leq\frac{c}{n} (A.20)

when nn is sufficiently large. Therefore, (A.19) and (A.20) imply

I2c𝐄(M22(βn,Y))1n.I_{2}\leq c\cdot\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\cdot\frac{1}{n}. (A.21)

Analysis of Part II. Recall

II:=𝐄(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h(Xi),Yi)),II:={\mathbf{E}}\left(\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right),

which relates to the empirical approximation error of Mondrian forests. First, suppose the first truncated Mondrian process with stopping time λn\lambda_{n} is given, denoted by π1,λn\pi_{1,\lambda_{n}}. Under this restriction, the partition of [0,1]d[0,1]^{d} is determined and not random, which is denoted by {𝒞1,λ,j}j=1K1(λn)\{{\small\mathcal{C}}_{1,\lambda,j}\}_{j=1}^{K_{1}(\lambda_{n})}. Let

Δn\displaystyle\Delta_{n} :=𝐄𝒟n(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h(Xi),Yi))\displaystyle:={\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right)
=𝐄𝒟n(1ni=1n(h^1,n(Xi),Yi)1ni=1n(h1,n(Xi),Yi)\displaystyle={\mathbf{E}}_{{\mathcal{D}}_{n}}\bigg{(}\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})
+1ni=1n(h1,n(Xi),Yi)1ni=1n(h(Xi),Yi)),\displaystyle+\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\bigg{)},

where hn(x):=j=1K1(λn)𝕀(x𝒞1,λn,j)h(x1,λn,j)h_{n}^{*}(x):=\sum_{j=1}^{K_{1}(\lambda_{n})}{\mathbb{I}}(x\in{\small\mathcal{C}}_{1,\lambda_{n},j})h(x_{1,\lambda_{n},j}) and x1,λn,jx_{1,\lambda_{n},j} denotes the center of cell 𝒞1,λn,j{\small\mathcal{C}}_{1,\lambda_{n},j}. We need to highlight that Δn\Delta_{n} depends on π1,λn\pi_{1,\lambda_{n}}. Since h^1,n\hat{h}_{1,n} is obtained by

c^b,λ,j=argminz[βn,βn]i:Xi𝒞b,λ,j(z,Yi),\hat{c}_{b,\lambda,j}=\arg\min_{z\in[-\beta_{n},\beta_{n}]}{\sum_{i:X_{i}\in{\small\mathcal{C}}_{b,\lambda,j}}}{\ell(z,Y_{i})},

we know

1ni=1n(h^1,n(Xi),Yi)1ni=1n(h1,n(Xi),Yi)0\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{h}_{1,n}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})\leq 0 (A.22)

once βn>C\beta_{n}>C. At this point, we consider two cases about Δn\Delta_{n}:

Case I: Δn0\Delta_{n}\leq 0. This case is trivial because we already have Δn0\Delta_{n}\leq 0.

Case II: Δn>0\Delta_{n}>0. In this case, (A.22) implies

Δn\displaystyle\Delta_{n} 𝐄𝒟n|1ni=1n(h1,n(Xi),Yi)1ni=1n(h(Xi),Yi)|\displaystyle\leq{\mathbf{E}}_{{\mathcal{D}}_{n}}\left|\frac{1}{n}\sum_{i=1}^{n}\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\frac{1}{n}\sum_{i=1}^{n}\ell(h(X_{i}),Y_{i})\right|
1n𝐄𝒟n(i=1n|(h1,n(Xi),Yi)(h(Xi),Yi)|)\displaystyle\leq\frac{1}{n}{\mathbf{E}}_{{\mathcal{D}}_{n}}\left(\sum_{i=1}^{n}\left|\ell(h_{1,n}^{*}(X_{i}),Y_{i})-\ell(h(X_{i}),Y_{i})\right|\right)
𝐄X,Y(|(h1,n(X),Y)(h(X),Y)|).\displaystyle\leq{\mathbf{E}}_{X,Y}\left(|\ell(h_{1,n}^{*}(X),Y)-\ell(h(X),Y)|\right).

Let Dλ(X)D_{\lambda}(X) be the diameter of the cell that XX lies in. By Assumption 2, the above inequality further implies

Δn\displaystyle\Delta_{n} 𝐄X,Y(M1(C,Y)|h1,n(X)h(X)|)\displaystyle\leq{\mathbf{E}}_{X,Y}(M_{1}(C,Y)|h_{1,n}^{*}(X)-h(X)|)
𝐄X,Y(M1(C,Y)CDλn(X)β)\displaystyle\leq{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot C\cdot D_{\lambda_{n}}(X)^{\beta})
C𝐄X,Y(M1(C,Y)Dλn(X)β)\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta})
C𝐄X,Y(M1(C,Y)Dλn(X)β𝕀(|Y|lnn)+M1(C,Y)Dλn(X)β𝕀(|Y|>lnn))\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta}\mathbb{I}(|Y|\leq\ln n)+M_{1}(C,Y)\cdot D_{\lambda_{n}}(X)^{\beta}\mathbb{I}(|Y|>\ln n))
C𝐄X,Y(supy[lnn,lnn]M1(C,y)Dλn(X)β+M1(C,Y)dβ2𝕀(|Y|>lnn))\displaystyle\leq C\cdot{\mathbf{E}}_{X,Y}(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot D_{\lambda_{n}}(X)^{\beta}+M_{1}(C,Y)\cdot d^{\frac{\beta}{2}}\cdot\mathbb{I}(|Y|>\ln n))
C(supy[lnn,lnn]M1(C,y)𝐄X(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn))),\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right), (A.23)

where the second line holds because $h$ is a $(p,C)$-smooth function, the fifth line uses $D_{\lambda}(X)\leq\sqrt{d}$ a.s., and the sixth line uses the Cauchy-Schwarz inequality.

Therefore, Case I and (A.23) in Case II imply that

ΔnC(supy[lnn,lnn]M1(C,y)𝐄X(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))a.s..\Delta_{n}\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)\ a.s.. (A.24)

Taking expectations on both sides of (A.24) with respect to $\lambda_{n}$ yields

II\displaystyle II C(supy[lnn,lnn]M1(C,y)𝐄X,λn(Dλn(X)β)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X,\lambda_{n}}(D_{\lambda_{n}}(X)^{\beta})+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)
C(supy[lnn,lnn]M1(C,y)𝐄X𝐄λn(Dλn(x)β|X=x)+dβ2𝐄M12(C,Y)𝐏12(|Y|>lnn)))\displaystyle\leq C\cdot\left(\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)^{\beta}|X=x)+d^{\frac{\beta}{2}}\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot{\mathbf{P}}^{\frac{1}{2}}(|Y|>\ln n))\right)
C(supy[lnn,lnn]M1(C,y)𝐄X[(𝐄λn(Dλn(x)|X=x))β]\displaystyle\leq C\cdot\Big{(}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot{\mathbf{E}}_{X}\left[({\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x))^{\beta}\right]
+dβ2𝐄M12(C,Y)cexp(lnnw(lnn)/c)),\displaystyle+d^{\frac{\beta}{2}}\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot c\exp(-\ln n\cdot w(\ln n)/c)\Big{)}, (A.25)

where $\beta\in(0,1]$ and in the second line we use the fact that the function $v^{\beta}$, $v>0$, is concave. For any fixed $x\in[0,1]^{d}$, we can bound ${\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x)$ by using Corollary 1 in Mourtada et al. (2020). In detail, we have

𝐄λn(Dλn(x)|X=x)\displaystyle{\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(x)|X=x) 0d(1+λnδd)exp(λnδd)𝑑δ\displaystyle\leq\int_{0}^{\infty}{d\left(1+\frac{\lambda_{n}\delta}{\sqrt{d}}\right)\exp\left(-\frac{\lambda_{n}\delta}{\sqrt{d}}\right)d\delta}
2d321λn.\displaystyle\leq 2d^{\frac{3}{2}}\cdot\frac{1}{\lambda_{n}}. (A.26)

Thus, the combination of (A.25) and (A.26) implies that

IIC(2d32βsupy[lnn,lnn]M1(C,y)λnβ+dβ2𝐄M12(C,Y)cn).II\leq C\cdot\left(2d^{\frac{3}{2}\beta}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda_{n}^{-\beta}+d^{\frac{\beta}{2}}\cdot\sqrt{{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot\frac{c}{n}\right). (A.27)

Analysis of Part III. This part can be bounded by a standard empirical-process argument. Since the supremum norm of $h$ is bounded by $C$, we know from Assumption 3 that

(h(x),y)supv[C,C](v,y)M2(C,y),x[0,1]d,y\ell(h(x),y)\leq\sup_{v\in[-C,C]}\ell(v,y)\leq M_{2}(C,y),\ \forall x\in[0,1]^{d},y\in{\mathbb{R}}

with ${\mathbf{E}}(M^{2}_{2}(C,Y))<\infty$. Thus, $M_{2}(C,y)$ is an envelope function of the singleton class $\{h\}$. Note that a single function $h$ constitutes a Glivenko-Cantelli class and has VC dimension 1. Thus, an application of equation (80) in Sen (2018) implies

III:=𝐄(R^(h)R(h))cn𝐄(M22(C,Y))III:={\mathbf{E}}(\hat{R}(h)-R(h))\leq\frac{c}{\sqrt{n}}\cdot\sqrt{{\mathbf{E}}(M^{2}_{2}(C,Y))} (A.28)

for some universal c>0c>0.

Finally, the combination of (A.17), (A.21), (A.27) and (A.28) completes the proof. \Box

Proof of Corollary 6. This corollary follows directly from Theorem 2 and Assumption 5. \Box

Proof of Corollary 7. The proof starts from the observation that our class $\mathcal{H}^{p,\beta}([0,1]^{d},C)$ can be used to approximate any general function. Since $m(x)\in\{f(x):{\mathbf{E}}|f(X)|^{\kappa}<\infty\}$, a density argument shows that $m(x)$ can be approximated by a sequence of continuous functions in the $L^{\kappa}$ sense. Thus, we may assume $m(x)$, $x\in[0,1]^{d}$, is continuous. Define the logistic activation $\sigma_{log}(x)=e^{x}/(1+e^{x})$, $x\in{\mathbb{R}}$. For any $\varepsilon>0$, by Lemma 16.1 in Györfi et al. (2002) there is $h_{\varepsilon}(x)=\sum_{j=1}^{J}a_{\varepsilon,j}\sigma_{log}(\theta_{\varepsilon,j}^{\top}x+s_{\varepsilon,j})$, $x\in[0,1]^{d}$, with $a_{\varepsilon,j},s_{\varepsilon,j}\in{\mathbb{R}}$ and $\theta_{\varepsilon,j}\in{\mathbb{R}}^{d}$ such that

𝐄|m(X)hε(X)|κsupx[0,1]d|m(x)hε(x)|κε3.{\mathbf{E}}\left|m(X)-h_{\varepsilon}(X)\right|^{\kappa}\leq\sup_{x\in[0,1]^{d}}|m(x)-h_{\varepsilon}(x)|^{\kappa}\leq\frac{\varepsilon}{3}. (A.29)

Since $h_{\varepsilon}(x)$ is continuously differentiable, we know $h_{\varepsilon}(x)\in\mathcal{H}^{p,\beta}([0,1]^{d},C(h_{\varepsilon}))$, where $C(h_{\varepsilon})>0$ depends on $h_{\varepsilon}$ only. Now we fix such an $h_{\varepsilon}(x)$ and make the following decomposition:

𝐄R(h^1,n)R(m)=\displaystyle{\mathbf{E}}R(\hat{h}_{1,n})-R(m)= 𝐄(R(h^1,n)R^(h^1,n))+𝐄(R^(h^n)R^(m))\displaystyle{\mathbf{E}}(R(\hat{h}_{1,n})-\hat{R}(\hat{h}_{1,n}))+{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(m))
+𝐄(R^(m)R(m))\displaystyle+{\mathbf{E}}(\hat{R}(m)-R(m))
:=I+II+III.\displaystyle:=I+II+III.

Parts I and III can be upper bounded by following an analysis similar to that in Theorem 2. Therefore, under the assumptions of our theorem, both parts converge to zero as $n\to\infty$. Next, we consider Part II. Note that

𝐄(R^(h^n)R^(m))\displaystyle{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(m)) =𝐄(R^(h^n)R^(hε))+𝐄(R(hε)R(m))\displaystyle={\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+{\mathbf{E}}(R(h_{\varepsilon})-R(m))
𝐄(R^(h^n)R^(hε))+𝐄|hε(X)m(X)|κ\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+{\mathbf{E}}|h_{\varepsilon}(X)-m(X)|^{\kappa}
𝐄(R^(h^n)R^(hε))+cε3,\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))+c\cdot\frac{\varepsilon}{3},

where in the second line we use Assumption 5. Finally, we only need to consider the behavior of the term ${\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))$ as $n\to\infty$. This can be done by using the analysis of Part II in the proof of Theorem 2. Taking $C=C(h_{\varepsilon})$ in (A.27), we have

𝐄(R^(h^n)R^(hε))\displaystyle{\mathbf{E}}(\hat{R}(\hat{h}_{n})-\hat{R}(h_{\varepsilon}))\leq C(2d32supy[lnn,lnn]M1(C,y)λn1\displaystyle C\cdot\Big{(}2d^{\frac{3}{2}}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda_{n}^{-1}
+d122𝐄M12(C,Y)cexp(lnnw(lnn)/c)),\displaystyle+d^{\frac{1}{2}}\sqrt{2{\mathbf{E}}M^{2}_{1}(C,Y)}\cdot c\exp(-\ln n\cdot w(\ln n)/c)\Big{)},

which goes to zero as nn increases. In conclusion, we have proved that

\lim_{n\to\infty}\big({\mathbf{E}}R(\hat{h}_{n})-R(m)\big)=0.

The above limit and Assumption 5 show that $\hat{h}_{n}$ is $L^{\kappa}$ consistent for the general function $m(x)$, $x\in[0,1]^{d}$. \Box

Proof of Theorem 8. Based on Assumption 1, we only need to consider the regret function for h^1,n\hat{h}^{*}_{1,n}. For any λ>0\lambda>0, by the definition of h^1,n\hat{h}^{*}_{1,n} we know

𝐄(R^(h^1,n))\displaystyle{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})) 𝐄(R^(h^1,n)+αnλn,1)\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})+\alpha_{n}\cdot\lambda_{n,1}^{*})
𝐄(R^(h^1,n,λ)+αnλ),\displaystyle\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n,\lambda})+\alpha_{n}\cdot\lambda), (A.30)

where h^1,n,λ\hat{h}_{1,n,\lambda} is the estimator based on the process MP1(λ,[0,1]d)MP_{1}(\lambda,[0,1]^{d}).

On the other hand, we have the decomposition below

𝐄R(h^1,n)R(h)\displaystyle{\mathbf{E}}R(\hat{h}_{1,n}^{*})-R(h) =𝐄(R(h^1,n)R^(h^1,n))+𝐄(R^(h^1,n)R^(h))\displaystyle={\mathbf{E}}(R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))+{\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})-\hat{R}(h))
+𝐄(R^(h)R(h)):=I+II+III.\displaystyle+{\mathbf{E}}(\hat{R}(h)-R(h)):=I+II+III. (A.31)

Firstly, we bound Part I. Recall $A_{n}:=\{\max_{1\leq i\leq n}|Y_{i}|\leq\ln n\}$, which is defined below (A.1). Make the decomposition of I as follows.

I\displaystyle I =𝐄((R(h^1,n)R^(h^1,n))𝕀(An))+𝐄((R(h^1,n)R^(h^1,n))𝕀(Anc))\displaystyle={\mathbf{E}}((R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))\cap\mathbb{I}(A_{n}))+{\mathbf{E}}((R(\hat{h}_{1,n}^{*})-\hat{R}(\hat{h}_{1,n}^{*}))\cap\mathbb{I}(A_{n}^{c}))
:=I1,1+I1,2\displaystyle:=I_{1,1}+I_{1,2} (A.32)

The key for bounding I1,1I_{1,1} is to find the upper bound of λn,1\lambda_{n,1}^{*}. By the definition of h^1,n\hat{h}_{1,n}^{*} and Assumption 3, we know if AnA_{n} occurs

αnλn,1Pen(0)supy[lnn,lnn]M2(βn,y).\alpha_{n}\cdot\lambda_{n,1}^{*}\leq Pen(0)\leq\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y).

Therefore, when AnA_{n} happens we have

λn,1supy[lnn,lnn]M2(βn,y)αn.\lambda_{n,1}^{*}\leq\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}.

Following arguments that we used to bound I1,1I_{1,1} in the Proof of Theorem 2, we know

|I1,1|\displaystyle|I_{1,1}| cmax{βn,𝐄(M22(βn,Y))}n(1+λn,1)d\displaystyle\leq c\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\cdot(1+\lambda_{n,1}^{*})^{d}
cmax{βn,𝐄(M22(βn,Y))}n(1+supy[lnn,lnn]M2(βn,y)αn)d\displaystyle\leq c\cdot\frac{\max\{\beta_{n},\sqrt{{\mathbf{E}}(M_{2}^{2}(\beta_{n},Y))}\}}{\sqrt{n}}\cdot\left(1+\frac{\sup_{y\in[-\ln n,\ln n]}M_{2}(\beta_{n},y)}{\alpha_{n}}\right)^{d} (A.33)

Next, the way for bounding I1,2I_{1,2} in (A.32) is similar to that we used to bound I1,2I_{1,2} in the proof of Theorem 2. Namely, we have

|I1,2|(supx[βn,βn]2(x,lnn)+𝐄(M22(βn,Y)))12cexp(lnnw(lnn)/c).|I_{1,2}|\leq\left(\sup_{x\in[-\beta_{n},\beta_{n}]}{\ell^{2}(x,\ln n)}+{\mathbf{E}}(M^{2}_{2}(\beta_{n},Y))\right)^{\frac{1}{2}}\cdot c\exp(-\ln n\cdot w(\ln n)/c). (A.34)

Secondly, we use (A.30) to bound Part IIII in (A.31). By the definition of h^1,n\hat{h}_{1,n}^{*}, for any λ>0\lambda>0 we have

II:=𝐄(R^(h^1,n)R^(h))𝐄(R^(h^1,n,λ)R^(h)+αnλ).II:={\mathbf{E}}(\hat{R}(\hat{h}_{1,n}^{*})-\hat{R}(h))\leq{\mathbf{E}}(\hat{R}(\hat{h}_{1,n,\lambda})-\hat{R}(h)+\alpha_{n}\cdot\lambda).

Similar to the Proof of Theorem 2, the above inequality implies

IIC(2d32psupy[lnn,lnn]M1(C,y)λnp+dp2𝐄(M12(C,Y))cn)+αnλ.II\leq C\cdot\left(2d^{\frac{3}{2}p}\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}\cdot\lambda_{n}^{-p}+d^{\frac{p}{2}}\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\cdot\frac{c}{n}\right)+\alpha_{n}\cdot\lambda. (A.35)

Since (A.35) holds for all λ>0\lambda>0, taking λ=(1αn)1/(p+1)\lambda=\left(\frac{1}{\alpha_{n}}\right)^{1/(p+1)} inequality (A.35) further implies

II\displaystyle II (2Csupy[lnn,lnn]M1(C,y)d32p+1)(αn)pp+1+rn\displaystyle\leq(2C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}d^{\frac{3}{2}p}+1)\cdot\left(\alpha_{n}\right)^{\frac{p}{p+1}}+r_{n}
(2Csupy[lnn,lnn]M1(C,y)d32p+1)(αn)p2+rn,\displaystyle\leq(2C\sup_{y\in[-\ln n,\ln n]}{M_{1}(C,y)}d^{\frac{3}{2}p}+1)\cdot\left(\alpha_{n}\right)^{\frac{p}{2}}+r_{n}, (A.36)

where rn:=Cd12p𝐄(M12(C,Y))cnr_{n}:=C\cdot d^{\frac{1}{2}p}\sqrt{{\mathbf{E}}(M_{1}^{2}(C,Y))}\cdot\frac{c}{n} is caused by the probability tail of YY.

Thirdly, we consider Part III. The argument for this part is the same as that used to obtain (A.28). Namely, we have

III:=𝐄(R^(h)R(h))cn𝐄(M22(C,Y)),III:={\mathbf{E}}(\hat{R}(h)-R(h))\leq\frac{c}{\sqrt{n}}\cdot\sqrt{{\mathbf{E}}(M^{2}_{2}(C,Y))}, (A.37)

where c>0c>0 is universal and does not depend on CC.

Finally, the combination of (A.36), (A.37), (A.33) and (A.34) finishes the proof. \Box

Proof of Proposition 13. For any function h:[0,1]d[βn,βn]h:[0,1]^{d}\to[-\beta_{n},\beta_{n}], by Assumption 4 we know the event

Fn:=i=1n{Yih(Xi)[Clnn,Clnn]}F_{n}:=\bigcap_{i=1}^{n}\{Y_{i}-h(X_{i})\in[-C\ln n,C\ln n]\}

happens with probability larger than $1-cn\exp(-C\ln n\cdot w(C\ln n)/c)$, where $C>0$ is a large constant and $c>0$. Denote by $\hat{h}_{n,ols}$ the least squares forest estimator in Section 6.1. We now make the following decomposition:

𝐄(h^n(X)m(X))2\displaystyle{\mathbf{E}}(\hat{h}_{n}(X)-m(X))^{2} =𝐄[(h^n(X)m(X))2|Fn]𝐏(Fn)+𝐄[(h^n(X)m(X))2|Fnc]𝐏(Fnc)\displaystyle={\mathbf{E}}[(\hat{h}_{n}(X)-m(X))^{2}|F_{n}]{\mathbf{P}}(F_{n})+{\mathbf{E}}[(\hat{h}_{n}(X)-m(X))^{2}|F_{n}^{c}]{\mathbf{P}}(F_{n}^{c})
𝐄(h^n,ols(X)m(X))2\displaystyle{\mathbf{E}}(\hat{h}_{n,ols}(X)-m(X))^{2} =𝐄[(h^n,ols(X)m(X))2|Fn]𝐏(Fn)+𝐄[(h^n,ols(X)m(X))2|Fnc]𝐏(Fnc)\displaystyle={\mathbf{E}}[(\hat{h}_{n,ols}(X)-m(X))^{2}|F_{n}]{\mathbf{P}}(F_{n})+{\mathbf{E}}[(\hat{h}_{n,ols}(X)-m(X))^{2}|F_{n}^{c}]{\mathbf{P}}(F_{n}^{c})

When $F_{n}$ occurs, we have $\hat{h}_{n,ols}=\hat{h}_{n}$ provided that $\delta_{n}=C\ln n$ for some large $C>0$. On the other hand, we have the following crude upper bounds on the two risks:

{\mathbf{E}}(\hat{h}_{n}(X)-m(X))^{2}\leq 2\beta_{n}^{2}+2{\mathbf{E}}m^{2}(X),\qquad {\mathbf{E}}(\hat{h}_{n,ols}(X)-m(X))^{2}\leq 2\beta_{n}^{2}+2{\mathbf{E}}m^{2}(X).
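These crude bounds follow from the elementary inequality $(a-b)^{2}\leq 2a^{2}+2b^{2}$, together with the boundedness $\|\hat{h}_{n}\|_{\infty},\|\hat{h}_{n,ols}\|_{\infty}\leq\beta_{n}$ (which we take as given here, as in Section 6.1):

\[
{\mathbf{E}}(\hat{h}_{n}(X)-m(X))^{2}\leq 2{\mathbf{E}}\hat{h}_{n}^{2}(X)+2{\mathbf{E}}m^{2}(X)\leq 2\beta_{n}^{2}+2{\mathbf{E}}m^{2}(X),
\]

and similarly for $\hat{h}_{n,ols}$.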

Combining the above inequalities leads to

|𝐄(h^n(X)m(X))2𝐄(h^n,ols(X)m(X))2|(cln2n+c)cnexp(Clnnw(Clnn)/c).|{\mathbf{E}}(\hat{h}_{n}(X)-m(X))^{2}-{\mathbf{E}}(\hat{h}_{n,ols}(X)-m(X))^{2}|\leq(c\ln^{2}n+c)\cdot cn\exp(-C\ln n\cdot w(C\ln n)/c).
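For clarity, this follows from the decomposition above: since $\hat{h}_{n,ols}=\hat{h}_{n}$ on $F_{n}$, the two conditional terms on $F_{n}$ coincide and cancel, so

\begin{align*}
&\big|{\mathbf{E}}(\hat{h}_{n}(X)-m(X))^{2}-{\mathbf{E}}(\hat{h}_{n,ols}(X)-m(X))^{2}\big|\\
&\qquad=\big|{\mathbf{E}}[(\hat{h}_{n}(X)-m(X))^{2}|F_{n}^{c}]-{\mathbf{E}}[(\hat{h}_{n,ols}(X)-m(X))^{2}|F_{n}^{c}]\big|\cdot{\mathbf{P}}(F_{n}^{c})\\
&\qquad\leq 2\,(2\beta_{n}^{2}+2{\mathbf{E}}m^{2}(X))\cdot{\mathbf{P}}(F_{n}^{c}),
\end{align*}

which yields the display once ${\mathbf{P}}(F_{n}^{c})\leq cn\exp(-C\ln n\cdot w(C\ln n)/c)$ is plugged in and $\beta_{n}$ is of at most logarithmic order in $n$ (which we take as given in this sketch).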

By Corollary 7, we already know that $\hat{h}_{n,ols}$ is $L^{2}$ consistent for general $m(X)$. Since the right-hand side above vanishes as $n\to\infty$, $\hat{h}_{n}$ is also $L^{2}$ consistent. \Box

Proof of Proposition 14. Let us show the existence of $m$ in (5). In fact, Lugosi and Vayatis (2004) tells us that one of the minimizers is

m_{*}(x):=\operatorname*{arg\,min}_{\alpha\in{\mathbb{R}}}\{\eta(x)\phi(-\alpha)+(1-\eta(x))\phi(\alpha)\}

when $\phi$ is a differentiable, strictly convex, and strictly increasing cost function satisfying $\phi(0)=1$ and $\lim_{v\to-\infty}\phi(v)=0$. Let $\varepsilon>0$ be a given small number. Next, we need to show that there is $h_{*}\in\mathcal{H}^{p,\beta}([0,1]^{d},C_{p})$ such that

R(h_{*})-R(m_{*})\leq\varepsilon, (A.38)

when the cost function is $\phi_{5}$ or $\phi_{6}$.

First, we consider ϕ5\phi_{5}. Since supv|ϕ(v)|1/ln2\sup_{v\in{\mathbb{R}}}|\phi^{\prime}(v)|\leq 1/\ln 2, we have

R(h)-R(m_{*})\leq\frac{1}{\ln 2}{\mathbf{E}}|h(X)-m_{*}(X)|.

Note that ${\mathbf{E}}|m_{*}(X)|<\infty$. We can therefore find an infinitely differentiable function $h_{*}\in\mathcal{H}^{p,\beta}([0,1]^{d},C_{p})$ such that ${\mathbf{E}}|h_{*}(X)-m_{*}(X)|<\varepsilon\ln 2$. This completes the proof of (A.38) for $\phi_{5}$.

Second, we consider $\phi_{6}$. The argument used for $\phi_{5}$ above does not work here due to the rapid growth of $e^{v}$. In this case, we define an infinitely differentiable mollifier

w(x):=c_{w}\,e^{\frac{1}{\|x\|_{2}^{2}-1}}\,{\mathbb{I}}(\|x\|_{2}<1),\qquad x\in{\mathbb{R}}^{d},

where $c_{w}>0$ is a normalizing constant chosen so that $\int_{{\mathbb{R}}^{d}}w(x)dx=1$. Based on this mollifier, we consider the mollified (weighted average) version of $m_{*}$:

mη(x):=dm(xz)1ηdw(zη)𝑑z,x[0,1]d,m_{\eta}(x):=\int_{{\mathbb{R}}^{d}}m_{*}(x-z)\frac{1}{\eta^{d}}w\left(\frac{z}{\eta}\right)dz,x\in[0,1]^{d},

where we set $m_{*}(x)=0$ for any $x\notin[0,1]^{d}$ and $\eta>0$. Then $m_{\eta}$ is an infinitely differentiable function on $[0,1]^{d}$ and, since $\int_{{\mathbb{R}}^{d}}\eta^{-d}w(z/\eta)dz=1$ and $|m_{*}|\leq 1$,

\sup_{x\in[0,1]^{d}}{|m_{\eta}(x)|}\leq 1.

Importantly, some simple analysis implies

limη0𝐄|mη(X)m(X)|=0.\lim_{\eta\to 0}{{\mathbf{E}}|m_{\eta}(X)-m_{*}(X)|}=0. (A.39)
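For completeness, we sketch why (A.39) holds, under the additional condition (assumed only for this sketch) that $X$ admits a Lebesgue density $f_{X}$ on $[0,1]^{d}$: by standard properties of mollification, $m_{\eta}(x)\to m_{*}(x)$ at every Lebesgue point of $m_{*}$, hence almost everywhere, and $|m_{\eta}-m_{*}|f_{X}\leq 2f_{X}$ is integrable, so

\[
{\mathbf{E}}|m_{\eta}(X)-m_{*}(X)|=\int_{[0,1]^{d}}|m_{\eta}(x)-m_{*}(x)|f_{X}(x)dx\longrightarrow 0\qquad(\eta\to 0)
\]

by the dominated convergence theorem.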

In fact, we next show that $m_{\eta}$, for suitably small $\eta$, can be taken as the $h_{*}$ satisfying (A.38). By the dominated convergence theorem (note that $|Y|=1$ and $|m_{*}|,|m_{\eta}|\leq 1$, so the series below are dominated by $\sum_{k\geq 0}1/k!$), we have

𝐄(eYm(X))\displaystyle{\mathbf{E}}(e^{-Ym_{*}(X)}) =𝐄(k=0(Y)kmk(X)k!)=k=0𝐄((Y)kmk(X)k!)\displaystyle={\mathbf{E}}\left(\sum_{k=0}^{\infty}\frac{(-Y)^{k}m_{*}^{k}(X)}{k!}\right)=\sum_{k=0}^{\infty}{\mathbf{E}}\left(\frac{(-Y)^{k}m_{*}^{k}(X)}{k!}\right)
𝐄(eYmη(X))\displaystyle{\mathbf{E}}(e^{-Ym_{\eta}(X)}) =𝐄(k=0(Y)kmηk(X)k!)=k=0𝐄((Y)kmηk(X)k!)\displaystyle={\mathbf{E}}\left(\sum_{k=0}^{\infty}\frac{(-Y)^{k}m_{\eta}^{k}(X)}{k!}\right)=\sum_{k=0}^{\infty}{\mathbf{E}}\left(\frac{(-Y)^{k}m_{\eta}^{k}(X)}{k!}\right)

Since $|m_{*}|$ and $|m_{\eta}|$ are bounded by $1$, we can find an $N_{\eta,\varepsilon}\in\mathbb{Z}^{+}$ such that

|R(m)k=0Nη,ε𝐄((Y)kmk(X)k!)|\displaystyle\left|R(m_{*})-\sum_{k=0}^{N_{\eta,\varepsilon}}{\mathbf{E}}\left(\frac{(-Y)^{k}m_{*}^{k}(X)}{k!}\right)\right| ε3\displaystyle\leq\frac{\varepsilon}{3}
|R(mη)k=0Nη,ε𝐄((Y)kmηk(X)k!)|\displaystyle\left|R(m_{\eta})-\sum_{k=0}^{N_{\eta,\varepsilon}}{\mathbf{E}}\left(\frac{(-Y)^{k}m_{\eta}^{k}(X)}{k!}\right)\right| ε3\displaystyle\leq\frac{\varepsilon}{3}

For any kNη,εk\leq N_{\eta,\varepsilon}, we have

|(Y)kmk(X)(Y)kmηk(X)|\displaystyle|(-Y)^{k}m_{*}^{k}(X)-(-Y)^{k}m_{\eta}^{k}(X)| |m(X)mη(X)||mk1(X)+mk2(X)mη(X)++mηk1(X)|\displaystyle\leq|m_{*}(X)-m_{\eta}(X)||m_{*}^{k-1}(X)+m_{*}^{k-2}(X)m_{\eta}(X)+\cdots+m_{\eta}^{k-1}(X)|
k|m(X)mη(X)|.\displaystyle\leq k|m_{*}(X)-m_{\eta}(X)|.

From (A.39), choose mηεm_{\eta^{\varepsilon}} such that

𝐄|m(X)mηε(X)|(k=1Nη,ε1(k1)!)1ε3.{\mathbf{E}}|m_{*}(X)-m_{\eta^{\varepsilon}}(X)|\leq\left(\sum_{k=1}^{N_{\eta,\varepsilon}}\frac{1}{(k-1)!}\right)^{-1}\frac{\varepsilon}{3}.
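For clarity, combining the term-by-term bound above with this choice of $m_{\eta^{\varepsilon}}$ gives

\[
\left|\sum_{k=0}^{N_{\eta,\varepsilon}}{\mathbf{E}}\left(\frac{(-Y)^{k}m_{*}^{k}(X)}{k!}\right)-\sum_{k=0}^{N_{\eta,\varepsilon}}{\mathbf{E}}\left(\frac{(-Y)^{k}m_{\eta^{\varepsilon}}^{k}(X)}{k!}\right)\right|\leq\sum_{k=1}^{N_{\eta,\varepsilon}}\frac{k}{k!}\,{\mathbf{E}}|m_{*}(X)-m_{\eta^{\varepsilon}}(X)|\leq\frac{\varepsilon}{3}.
\]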

Combining this with the two truncation bounds above, we obtain

R(mηε)R(m)ε.R(m_{\eta^{\varepsilon}})-R(m_{*})\leq\varepsilon.

Finally, $m_{\eta^{\varepsilon}}\in\mathcal{H}^{p,\beta}([0,1]^{d},C_{p})$ for some $C_{p}>0$ since it is infinitely differentiable. Thus, $m_{\eta^{\varepsilon}}$ can be taken as the $h_{*}$ in (A.38).

On the other hand, by Remark 5 we have

lim¯n𝐄(R(h^n))R(h).\varlimsup_{n\to\infty}{\mathbf{E}}(R(\hat{h}_{n}))\leq R(h_{*}).

Thus, there exists an $N_{2}\in\mathbb{Z}^{+}$ such that for any $n\geq N_{2}$,

𝐄(R(h^n))R(h)+ε.{\mathbf{E}}(R(\hat{h}_{n}))\leq R(h_{*})+\varepsilon.

The proof is completed by combining the above inequality with (A.38). \Box

Proof of Lemma 15. Note that Pen(h):=ln[0,1]dexp(h(x))𝑑xPen(h):=\ln\int_{[0,1]^{d}}{\exp{(h(x))}dx}. Define a real function as follows:

g(α):=Pen((1α)h+αhn), 0α1.g(\alpha):=Pen((1-\alpha)\cdot h+\alpha\cdot h_{n}^{*}),\ 0\leq\alpha\leq 1.

Thus, we have $g(0)=Pen(h)$ and $g(1)=Pen(h_{n}^{*})$. It will be convenient to work with $g$ in the analysis of this penalty function. Since both $h$ and $h_{n}^{*}$ are bounded, $g(\alpha)$ is differentiable and its derivative is

\frac{d}{d\alpha}g(\alpha)=\frac{\int_{[0,1]^{d}}(h_{n}^{*}(x)-h(x))\exp\big(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x))\big)dx}{\int_{[0,1]^{d}}\exp\big(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x))\big)dx}. (A.40)
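To see (A.40), one may write $N(\alpha):=\int_{[0,1]^{d}}\exp\big(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x))\big)dx$, so that $g(\alpha)=\ln N(\alpha)$; since $h$ and $h_{n}^{*}$ are bounded, differentiation under the integral sign is justified and

\[
\frac{d}{d\alpha}g(\alpha)=\frac{N^{\prime}(\alpha)}{N(\alpha)},\qquad N^{\prime}(\alpha)=\int_{[0,1]^{d}}(h_{n}^{*}(x)-h(x))\exp\big(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x))\big)dx,
\]

which is exactly the displayed formula.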

Define a continuous random vector ZαZ_{\alpha} with the density function

f_{Z_{\alpha}}(x):=\frac{\exp\big(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x))\big)}{\int_{[0,1]^{d}}\exp\big(h(x)+\alpha\cdot(h_{n}^{*}(x)-h(x))\big)dx},\ x\in[0,1]^{d}. (A.41)

From (A.40) and (A.41), we know

ddαg(α)=𝐄Zα(hn(Zα)h(Zα)).\frac{d}{d\alpha}g(\alpha)={\mathbf{E}}_{Z_{\alpha}}(h_{n}^{*}(Z_{\alpha})-h(Z_{\alpha})). (A.42)

On the other hand, the Lagrange mean value theorem implies

|Pen(h)-Pen(h_{n}^{*})| =|g(0)-g(1)|
=\left|\frac{d}{d\alpha}g(\alpha)\Big|_{\alpha=\alpha^{*}}\right|
\leq{\mathbf{E}}_{Z_{\alpha^{*}}}(|h_{n}^{*}(Z_{\alpha^{*}})-h(Z_{\alpha^{*}})|), (A.43)

where $\alpha^{*}\in[0,1]$. Thus, it remains to bound the last term in (A.43). Since $f_{Z_{\alpha}}(x)\leq\exp(2C)$ for all $x\in[0,1]^{d}$ and all $\alpha\in[0,1]$, we know from (A.43) that

|Pen(h)-Pen(h_{n}^{*})|\leq\exp(2C)\cdot{\mathbf{E}}_{U}(|h_{n}^{*}(U)-h(U)|), (A.44)

where $U$ follows the uniform distribution on $[0,1]^{d}$ and is independent of $\pi_{\lambda}$. A further calculation gives

{\mathbf{E}}_{\pi_{\lambda}}|Pen(h)-Pen(h_{n}^{*})| \leq\exp(2C)\cdot{\mathbf{E}}_{\pi_{\lambda}}{\mathbf{E}}_{U}(|h_{n}^{*}(U)-h(U)|)
\leq\exp(2C)\cdot{\mathbf{E}}_{U}{\mathbf{E}}_{\pi_{\lambda}}(C\cdot D_{\lambda_{n}}(U)^{\beta})
\leq\exp(2C)\cdot{\mathbf{E}}_{U}\left[({\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(u)|U=u))^{\beta}\right].
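The last inequality in the display is Jensen's inequality applied to the concave map $t\mapsto t^{\beta}$ on $[0,\infty)$ (this step requires $\beta\leq 1$, as for a Hölder exponent):

\[
{\mathbf{E}}_{\lambda_{n}}\big(D_{\lambda_{n}}(u)^{\beta}\,\big|\,U=u\big)\leq\big({\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(u)\,|\,U=u)\big)^{\beta}.
\]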

From (A.26), we already know that ${\mathbf{E}}_{\lambda_{n}}(D_{\lambda_{n}}(u)|U=u)\leq 2d^{\frac{3}{2}}\cdot\frac{1}{\lambda_{n}}$. Thus, the above inequality implies

{\mathbf{E}}_{\pi_{\lambda}}|Pen(h)-Pen(h_{n}^{*})|\leq\exp(2C)\cdot 2^{\beta}d^{\frac{3}{2}\beta}\cdot\left(\frac{1}{\lambda_{n}}\right)^{\beta}.

This completes the proof. \Box

Proof of Lemma 16. First, we calculate the term $\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g)$, where $\int{g(x)dx}=0$. A direct calculation shows that

\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g)=Var\left(g(X_{\alpha})\right), (A.45)

where $X_{\alpha}$ is a continuous random vector in $[0,1]^{d}$. Furthermore, $X_{\alpha}$ has the density $f_{X_{\alpha}}(x)=\exp(g_{\alpha}(x))/\int{\exp(g_{\alpha}(x))dx}$ with $g_{\alpha}(x)=h_{0}(x)+\alpha g(x)$. Since $\|h_{0}\|_{\infty}\leq c$ and $\|g\|_{\infty}\leq\beta_{n}$, we can assume without loss of generality that $\|g_{\alpha}\|_{\infty}\leq\beta_{n}$. This implies that

exp(2βn)fXα(x)exp(2βn),x[0,1]d.\exp{(-2\beta_{n})}\leq f_{X_{\alpha}}(x)\leq\exp{(2\beta_{n})},\ \forall x\in[0,1]^{d}. (A.46)

Let $U$ follow the uniform distribution on $[0,1]^{d}$. Inequality (A.46) implies

Var\left(g(X_{\alpha})\right) =\inf_{c\in{\mathbb{R}}}{\mathbf{E}}(g(X_{\alpha})-c)^{2}
\leq\exp{(2\beta_{n})}\cdot\inf_{c\in{\mathbb{R}}}{\mathbf{E}}(g(U)-c)^{2}
=\exp{(2\beta_{n})}\cdot Var(g(U))=\exp{(2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(U)). (A.47)
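The middle inequality above is a direct comparison of densities via (A.46): for every $c\in{\mathbb{R}}$,

\[
{\mathbf{E}}(g(X_{\alpha})-c)^{2}=\int_{[0,1]^{d}}(g(x)-c)^{2}f_{X_{\alpha}}(x)dx\leq\exp(2\beta_{n})\int_{[0,1]^{d}}(g(x)-c)^{2}dx=\exp(2\beta_{n})\,{\mathbf{E}}(g(U)-c)^{2},
\]

and taking the infimum over $c$ gives the bound. The final equality in (A.47) uses ${\mathbf{E}}g(U)=\int g(x)dx=0$.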

With the same argument, we also have

Var(g(Xα))exp(2βn)𝐄(g2(U)).Var\left(g(X_{\alpha})\right)\geq\exp{(-2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(U)). (A.48)

The combination of (A.45), (A.47) and (A.48) shows that

cexp(2βn)𝐄(g2(X))d2dα2R(h0+αg)c1exp(2βn)𝐄(g2(X))c\cdot\exp{(-2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(X))\leq\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha g)\leq c^{-1}\cdot\exp{(2\beta_{n})}\cdot{\mathbf{E}}(g^{2}(X)) (A.49)

for some universal c>0c>0 and any α[0,1]\alpha\in[0,1]. Finally, by Taylor expansion, we have

R(h)=R(h_{0})+\frac{d}{d\alpha}R(h_{0}+\alpha(h-h_{0}))\Big|_{\alpha=0}+\frac{1}{2}\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha(h-h_{0}))\Big|_{\alpha=\alpha^{*}}

for some $\alpha^{*}\in[0,1]$. Without loss of generality, we can assume that $\|h-h_{0}\|_{\infty}\leq\beta_{n}$. Thus, the second-order term $\frac{1}{2}\frac{d^{2}}{d\alpha^{2}}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=\alpha^{*}}$ can be bounded using (A.49) with $g=h-h_{0}$. Meanwhile, the first derivative satisfies $\frac{d}{d\alpha}R(h_{0}+\alpha(h-h_{0}))|_{\alpha=0}=0$ since $h_{0}$ minimizes $R(\cdot)$. Based on the above analysis, we have

c1lnn𝐄(h(X)h0(X))2R(h)R(h0)c1lnn𝐄(h(X)h0(X))2c\cdot\frac{1}{\ln n}\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}\leq R(h)-R(h_{0})\leq c^{-1}\cdot\ln n\cdot{\mathbf{E}}(h(X)-h_{0}(X))^{2}

for some universal c>0c>0. This completes the proof. \Box