Non-asymptotic Properties of Generalized Mondrian Forests in Statistical Learning
Abstract
Since the publication of Breiman (2001), Random Forests (RF) have been widely used in both regression and classification. Later on, other forests were proposed and studied in the literature, and Mondrian Forests, built on the Mondrian process, are a notable example; see Lakshminarayanan et al. (2014). In this paper, we propose an ensemble estimator for general statistical learning based on Mondrian Forests, which can be regarded as an extension of RF. This general framework covers many common learning problems, such as least squares regression, generalized regression, quantile regression and classification. Under mild conditions on the loss functions, we give an upper bound on the regret function of our estimator and show that the estimator is statistically consistent.
Keywords: Statistical learning, Machine learning, Random forests, Ensemble learning, Regret function.
1 Introduction
Random Forest (RF) in Breiman (2001) is a very popular ensemble learning technique in machine learning used for classification and regression tasks. It operates by constructing multiple decision trees during training and averaging their predictions for improved accuracy and robustness. Many empirical studies have delved into its powerful performance across different domains and data characteristics; see, for example, Liaw et al. (2002). However, until now only a few of its theoretical properties are known, due to the complicated tree structure and the data-dependent splitting rule (CART). Scornet et al. (2015) were the first to show its statistical consistency when the true regression function follows an additive model. Later, Klusowski (2021) established the consistency of RF for a general true regression function. While there are many other papers studying the theory of RF, only these two focus on the original RF of Breiman (2001) by analyzing its important splitting rule, CART.
To gain a deeper insight into random forests, additional research delves into modified and stylized versions of the RF in Breiman (2001). One such method is Purely Random Forests (PRF) (see, for example, Arlot and Genuer (2014), Biau (2012) and Biau et al. (2008)), where individual trees are grown independently of the sample, making them well suited for theoretical analysis. In this paper, our interest is the Mondrian Forest, an ensemble learning algorithm that combines elements of decision trees and random forests with partitioning strategies inspired by the Mondrian process. This kind of forest was introduced in Lakshminarayanan et al. (2014), which showed that Mondrian Forests have competitive online performance in classification problems compared to other state-of-the-art methods. Inspired by this appealing online property, Mourtada et al. (2021) studied the theory of online regression and classification using Mondrian forests. On the other hand, researchers have also studied the theory of Mondrian forests for batch data, namely when the data set has already been collected by the experimenter. Mourtada et al. (2020) gave consistency and convergence rates for Mondrian trees and forests that are minimax optimal on the set of Hölder continuous functions. Later on, Cattaneo et al. (2023) followed the line of Mourtada et al. (2020) and gave the asymptotic normal distribution of Mondrian forests for the offline regression problem.
Instead of considering only classical regression and classification problems, in this paper we find that Mondrian forests can also be applied to more statistical/machine learning problems, such as generalized regression, density estimation and quantile regression. Our main contributions are two-fold:
• First, we propose a general framework (estimator) based on Mondrian forests that can be used in different learning problems.
• Second, we study the upper bound of the regret (risk) function of the above forest estimator. Our theoretical results can be applied in many learning cases, and corresponding examples are given in Section 6.
1.1 Related work
Our method of generalizing RF is based on a global perspective, while the method of generalizing RF in Athey et al. (2019) starts from a local perspective. In other words, by performing one optimization with the full set of data points, we can estimate the objective function, whereas Athey et al. (2019) can only estimate it at a specific point. Therefore, our generalized method can save a lot of computation time, especially when the dimension is large. Secondly, the generalized method based on globalization can also be easily applied to statistical problems with a penalization term, where the penalty is a functional of the target function. In Section 6.4, we show the application of our method to one of these penalized optimization problems, namely nonparametric density estimation. Since Athey et al. (2019) only perform estimation pointwise, it is difficult for them to ensure that the obtained estimator satisfies shape constraints, and the corresponding case is not included in their scope.
2 Background and Preliminaries
2.1 Goal in Statistical Learning
Let be the random vector, where we normalize the range of . In statistical learning, we would like a policy to learn from , which is defined as a function . Usually, a loss function is used to measure the difference or loss between the decision and goal . Taking expectation w.r.t. , the risk function
(1)
denotes the averaged loss by using the policy . Naturally, people have reasons to select the best policy by minimizing the averaged loss over some function class , namely
Therefore, attains in theory the minimal risk that one can hope to achieve. In practice, the distribution of is unknown and (1) cannot be used to calculate . Thus, such a best policy cannot be obtained directly. Instead, we usually use i.i.d. data to approximate it. The risk function can then be approximated by the empirical risk function
Traditionally, people find an estimator/policy by minimizing over a known function class; see spline regression in Györfi et al. (2002), wavelet regression in Györfi et al. (2002) and regression by deep neural networks in Schmidt-Hieber (2020). Recently, instead of minimizing globally, tree-based algorithms have been applied to construct the empirical estimator . According to many practitioners' experience, the strong predictive ability of RFs has shown the superiority of tree-based estimators over many traditional methods in statistical learning. In this paper, we are interested in bounding the regret function
where will be an ensemble estimator obtained by Mondrian Forests.
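Because the displayed formulas in this subsection did not survive extraction, the LaTeX block below records one standard way to write these quantities; the symbols $(X,Y)$, $\ell$, $\mathcal{F}$ and $\hat f_n$ are our own notation and may not match the paper's.

```latex
\[
  R(f) = \mathbb{E}\,\ell\bigl(f(X),Y\bigr), \qquad
  R_n(f) = \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(f(X_i),Y_i\bigr), \qquad
  \text{regret}(\hat f_n) = R(\hat f_n) - \inf_{f\in\mathcal{F}} R(f).
\]
```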
2.2 Mondrian partitions
Mondrian partitions are a special case of random tree partitions, where the partition of is independent of the data points. This scheme depends entirely on a stochastic process, called the Mondrian process and denoted by . The Mondrian process is a distribution on infinite tree partitions of introduced by Roy and Teh (2008) and Roy (2011). To reduce notation, the details of its definition are omitted here and can be found in Definition 3 of Mourtada et al. (2020).
Let us come back to Mondrian partitions. The partition used in this paper is called the Mondrian partition with stopping time and is denoted by (see Section 1.3 in Mourtada et al. (2020)). Its construction consists of two steps. First, we construct partitions according to by iteratively splitting cells at random times, which depend on the linear dimension of each cell. The probability of splitting along each side is proportional to the side length of the cell, and the splitting position is chosen uniformly. Second, we remove the nodes whose birth time comes after the tuning parameter . Note that each tree node generated in the first step is given a specific birth time, and the birth times increase as the tree grows deeper. Therefore, the second step can be regarded as a pruning process, which helps us choose the best tree model in learning problems.
Let with closed intervals . Denote and . Let be an exponential distribution with expectation . Then, our Mondrian partition with stopping time can be generated according to Algorithm 1. In fact, this algorithm is a recursive process, and at the initial time we input the root node . We also give an example in Figure 1 to show how Algorithm 1 works.
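To make the two-step construction above concrete, here is a minimal Python sketch of a pruned Mondrian partition: each cell is split after an exponential waiting time whose rate equals its linear dimension, the split axis is chosen with probability proportional to the side length, the split position is uniform, and all splits born after the stopping time are discarded. Function and variable names (e.g. `mondrian_partition`, `lifetime`) are ours, and the paper's Algorithm 1 may differ in details.

```python
import numpy as np

def mondrian_partition(lower, upper, lifetime, rng, birth_time=0.0):
    """Return the leaf cells (lower, upper boxes) of a Mondrian partition of
    [lower, upper] pruned at the stopping time `lifetime`."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    sides = upper - lower
    linear_dim = sides.sum()                 # sum of the side lengths of the cell
    if linear_dim <= 0.0:
        return [(lower, upper)]
    # Waiting time ~ Exp(rate = linear_dim), i.e. expectation 1 / linear_dim.
    split_time = birth_time + rng.exponential(1.0 / linear_dim)
    if split_time > lifetime:
        return [(lower, upper)]              # split born after the stopping time: prune
    # Split axis proportional to side length; split position uniform on that side.
    axis = rng.choice(len(sides), p=sides / linear_dim)
    cut = rng.uniform(lower[axis], upper[axis])
    left_upper, right_lower = upper.copy(), lower.copy()
    left_upper[axis], right_lower[axis] = cut, cut
    return (mondrian_partition(lower, left_upper, lifetime, rng, split_time)
            + mondrian_partition(right_lower, upper, lifetime, rng, split_time))

rng = np.random.default_rng(0)
leaves = mondrian_partition([0.0, 0.0], [1.0, 1.0], lifetime=2.0, rng=rng)
print(len(leaves), "cells in the pruned partition of the unit square")
```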

3 Methodology
Let be independent Mondrian processes. When we prune each tree at time , independent partitions are obtained, where all cuts after time are ignored. In this case, we can write satisfying
where denotes a cell in the partition . For each cell , a constant is used as the predictor of in this small region, where
(2)
and denotes the threshold. For any fixed , is usually a convex function w.r.t. the first variable in machine learning. Therefore, the optimization (2) over guarantees the existence of in general and we let as goes to infinity. Then, for each , we can get an estimator of :
where denotes the indicator function. By applying the ensemble technique, the final estimator is given by
(3)
If a cell does not contain any data point, we simply use as the predictor in the corresponding region.
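The following sketch assembles the pieces of (2)-(3): for each of M independent Mondrian partitions, one constant per cell is fitted by minimizing the empirical loss over a bounded range (the threshold in (2)), empty cells receive a default value, and the M tree predictions are averaged. It reuses `mondrian_partition` from the sketch in Section 2.2; the helper names, the default value 0 for empty cells, and the numerical optimizer are our own choices, not necessarily the paper's.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_tree(X, Y, leaves, loss, threshold):
    """One constant per cell, minimizing the empirical loss over
    [-threshold, threshold] as in (2); empty cells get 0.0 (our convention)."""
    constants = []
    for lo, hi in leaves:
        mask = np.all((X >= lo) & (X <= hi), axis=1)
        if not mask.any():
            constants.append(0.0)
            continue
        y_cell = Y[mask]
        res = minimize_scalar(lambda c: np.mean(loss(c, y_cell)),
                              bounds=(-threshold, threshold), method="bounded")
        constants.append(res.x)
    return np.array(constants)

def predict_tree(x, leaves, constants):
    for (lo, hi), c in zip(leaves, constants):
        if np.all((x >= lo) & (x <= hi)):
            return c
    return 0.0

def forest_predict(x, trees):
    """Ensemble step (3): average the M tree predictions."""
    return np.mean([predict_tree(x, leaves, consts) for leaves, consts in trees])

# Usage sketch with squared loss (a regression forest).
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
Y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)
sq_loss = lambda c, y: (y - c) ** 2          # vectorized in y
trees = []
for _ in range(10):                          # M = 10 Mondrian trees
    leaves = mondrian_partition([0, 0], [1, 1], lifetime=3.0, rng=rng)
    trees.append((leaves, fit_tree(X, Y, leaves, sq_loss, threshold=5.0)))
print(forest_predict(np.array([0.3, 0.7]), trees))
```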
Let us clarify the relationship between (3) and the traditional RF. If we take the loss function and , it can be checked that
where denotes the cardinality of a set. In this case, the problem is least squares regression, and the estimator in (3) exactly coincides with that in Mourtada et al. (2020). In conclusion, our estimator can be regarded as an extension of the regression forests in Mourtada et al. (2020), since can be chosen arbitrarily by the practitioner.
By examining the above algorithm carefully, we see that there are two tuning parameters in the construction of : and . The stopping time controls the model complexity of Mondrian forests. Generally speaking, the cardinality of a tree partition increases as goes to infinity. Thus, a large is useful for reducing the bias of the forest estimator. However, a small helps control the generalization error of (3). Therefore, there should be a balance in the choice of . To ensure the consistency of , we suppose depends on the sample size and write in the following analysis. The second parameter, , denotes the number of Mondrian trees. There are studies about its selection for RF; see, for example, Zhang and Wang (2009). In practice, many practitioners take during their computations.
4 Regret function of Mondrian forests
In this section, we study the upper bound of the regret function of the forest estimator (3) that is constructed by Mondrian processes. First, we need some mild restrictions on the loss function .
Assumption 1.
The loss function for all and is convex for any fixed with .
Assumption 2.
There exists a non-negative function with such that for any , is Lipschitz continuous, and for any , , we have
Assumption 3.
There exists an envelope function such that for any ,
We will see in the next section that many commonly used loss functions satisfy Assumptions 1-3, including the loss and the loss. In the following analysis, we suppose is a sub-Gaussian random variable and takes values in . In detail, we make the following assumption about the distribution of .
Assumption 4.
For some , we have for each . To simplify notation, we always assume later that .
Our theoretical results relate to the -smooth class below. This class is considered in the calculation of the regret function since it is large enough and dense in the integrable space generated by any probability measure.
Definition 1 (-smooth class).
Let , and . The -smooth ball with radius , denoted by , is the set of times differentiable functions such that
and
where denotes the norm in space and is the gradient operator.
The main result in this section is presented in Theorem 1.
Theorem 1 (Excess risk of Mondrian forests).
Remark 1.
The first term of the RHS of (4) relates to the generalization error of forest, and the second one is the approximation error of Mondrian forest to . The second line of (4) measures the convergence error when the empirical loss is used to approximate its theoretical version . Finally, the last line is due to the sub-Gaussian property of and terms in this line will disappear if we further assume is bounded.
Remark 2.
In many applications, we will see later that the coefficients above, such as and , are only of polynomial order in . Since the term decays to zero faster than any polynomial rate, the last line of (4) usually has no influence on the convergence speed of the regret function. In conclusion, we can expect the first two lines to dominate the convergence rate.
In statistics, we always have the true function that satisfies
(5)
On the other hand, Theorem 1 can also be used to analyze the consistency of . To meet this requirement, the following assumption is necessary.
Assumption 5.
For any , there are and such that
To simplify notation, we denote the last line of (4) by . Namely,
(6)
Corollary 1 (Consistency rate of Mondrian forests).
5 The choice of
In practice, the best is unknown and we need a criterion to stop the growth of Mondrian forests. Here, we use a penalty method. For each , define
(8)
where controls the strength of the penalty and is constructed from the Mondrian process . Then, the best is chosen by
Denote by the tree estimator that is constructed by using Mondrian process . Then, the data driven estimator is given by
(9)
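A small sketch of the data-driven choice in (8)-(9): for each candidate stopping time on a finite grid, fit a Mondrian tree, evaluate its penalized empirical risk, and keep the minimizer. It reuses `mondrian_partition`, `fit_tree`, `predict_tree` and the objects from the earlier sketches; the grid, the penalty function and the single-tree (rather than forest) evaluation are our simplifications, since the exact penalty term in (8) did not survive extraction.

```python
import numpy as np

def select_lifetime(X, Y, loss, lifetime_grid, penalty, threshold, rng):
    """Choose the stopping time by minimizing the penalized empirical risk,
    cf. (8); `penalty(lam, n)` is user-supplied (assumed form)."""
    n = len(Y)
    best = (None, np.inf, None)
    for lam in lifetime_grid:
        leaves = mondrian_partition([0, 0], [1, 1], lifetime=lam, rng=rng)
        consts = fit_tree(X, Y, leaves, loss, threshold)
        emp_risk = np.mean([loss(predict_tree(x, leaves, consts), y)
                            for x, y in zip(X, Y)])
        score = emp_risk + penalty(lam, n)           # penalized criterion, cf. (8)
        if score < best[1]:
            best = (lam, score, (leaves, consts))
    return best[0], best[2]                           # data-driven estimator, cf. (9)

# Usage sketch with squared loss and an illustrative penalty growing in lambda.
lam_hat, tree_hat = select_lifetime(X, Y, sq_loss,
                                    lifetime_grid=[1.0, 2.0, 4.0, 8.0],
                                    penalty=lambda lam, n: lam ** 2 / n,
                                    threshold=5.0, rng=rng)
print("selected stopping time:", lam_hat)
```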
Theorem 2.
Therefore, by properly choosing , we can obtain a good rate for the regret function of Mondrian forests according to (2). Theorem 2 also tells us that the estimator (9) is adaptive if we first ignore some unimportant coefficients, such as . Applications of Theorem 2 are given in the next section, where some examples are discussed in detail and we show that in those cases the coefficients in (2), such as , can indeed be bounded by a polynomial of .
6 Examples
In this section, we show how to use Mondrian forests in different statistical learning problems. Meanwhile, theoretical properties of the forest estimators, namely in (3) and in (9), are given based on Theorems 1-2 for each learning problem. Sometimes, the result in Lemma 1 is useful for verifying Assumption 5. The proof of this result follows from a Taylor expansion of the real function w.r.t. .
Lemma 1.
6.1 Least square regression
Nonparametric least squares regression refers to methods that do not assume a specific parametric form for the conditional expectation . Instead, these methods are flexible and can adapt to the underlying structure of the data. The loss function of least squares regression is given by . First, we define the event . Under Assumption 4, by (34) we can find constants such that . This means is very close to as . On the event , from (2) we further know
where denotes the cardinality of a set. Therefore, is just the average of the responses falling in the leaf .
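A quick numerical check of the claim above (an illustration with made-up numbers): for squared loss, the per-cell optimization in (2) returns the sample mean of the responses in the leaf whenever that mean lies inside the threshold range.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y_leaf = np.array([0.8, 1.1, 0.9, 1.3])                  # responses falling in one leaf
res = minimize_scalar(lambda c: np.mean((y_leaf - c) ** 2),
                      bounds=(-10.0, 10.0), method="bounded")
print(res.x, y_leaf.mean())                              # both approximately 1.025
```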
Let us discuss the properties of . First, it is obvious that Assumption 1 holds for this loss. By some simple calculations, we also know that Assumption 2 is satisfied with and Assumption 3 is satisfied with . Choosing and , Theorem 1 implies the following property of .
Proposition 1.
For any , there exists an integer such that for any ,
Then, we check Assumption 5. By some simple calculations, we have and
The above inequality shows Assumption 5 holds with and . When is selected, Corollary 1 tells us
Proposition 2.
For any , there exists an integer such that for any ,
We can also show is consistent for any general function defined in (5) if the convergence rate of is chosen as stated in Corollary 2. Finally, choose and in Theorem 2. The regret function of the estimator that is based on the model selection in (8) has the upper bound below.
Proposition 3.
For any , there exists an integer such that for any ,
6.2 Generalized regression
Generalized regression refers to a broad class of regression models that extend beyond the traditional ordinary least squares (OLS) regression, accommodating various types of response variables and relationships between predictors and response. Usually, in this model the conditional distribution of given follows an exponential family of distribution
(11)
where is a positive measure defined on and for any . The function above serves as a normalization; it is defined on an open subinterval of and . Now, we suppose the function exists, and by some calculations we have . Thus, the conditional expectation will be known if we can estimate the unknown function . More information about (11) can be found in Stone (1986), Stone (1994) and Huang (1998).
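Since the display (11) was lost in extraction, the following LaTeX block records the canonical one-parameter exponential family commonly used in this line of work (Stone (1986, 1994); Huang (1998)); the symbols $\theta$, $\psi$ and $\rho$ are our notation and the paper's exact parametrization may differ.

```latex
\[
  P\bigl(Y \in dy \mid X = x\bigr)
    = \exp\bigl\{\, y\,\theta(x) - \psi\bigl(\theta(x)\bigr) \,\bigr\}\,\rho(dy),
  \qquad
  \psi(\theta) = \log \int e^{\,y\theta}\,\rho(dy),
\]
```

so that $\mathbb{E}[Y \mid X = x] = \psi'(\theta(x))$, matching the statement above that the conditional mean is recovered once $\theta$ is estimated.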
The problem of generalized regression is to estimate the unknown function by using the i.i.d. data . Note that both and are known in (11). In this case, we use the maximum likelihood method for estimation, and the corresponding loss function is given by
and by some calculations we know the true function satisfies (5), namely
Therefore, we have reason to believe that Mondrian forests are statistically consistent in this problem, as stated in Corollary 2. Now we give some mild restrictions on and in order to ensure that our general results can be applied to generalized regression.
Conditions
(i) has a continuous second derivative and its first derivative is strictly positive on .
(ii) We can find a subinterval of such that the measure is concentrated on and
(12)
where denotes the interior of . If is bounded, (12) holds for at least one of the endpoints.
(iii) and for each .
(iv) There is a compact subinterval of such that the range of is contained in .
The above restrictions on and were used in Huang (1998). In fact, we know from Huang (1998) that many commonly used distributions satisfy these conditions, including the normal, Poisson and Bernoulli distributions. Now, let us verify Assumptions 1-5 under this setting.
In particular, Assumption 1 is verified by using Conditions (i)-(iii). On the other hand, we choose the Lipschitz constant in Assumption 2 by
Third, the envelope function of can be set as
Since we assume in Assumption 4 that is a sub-Gaussian random variable, in this case. This indicates that Assumption 3 is satisfied. Finally, under Conditions (i)-(iv), Lemma 4.1 in Huang (1998) shows that our Assumption 5 holds with and
(13)
From (12), the constant in (13) must be larger than zero. On the other hand, we will see later that the above is finite in many cases.
Therefore, the general theoretical results in Sections 4 and 5 can be applied to generalized regression. Finally, we stress that the coefficients in the general results, such as in Theorem 1 and in (13), are only of polynomial order in if we choose properly. Let us give some specific examples.
Examples
1. The first example is Gaussian regression, where the conditional distribution follows . Therefore, , , and . Our goal is to estimate the unknown conditional mean . Now, Conditions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of is contained in a compact subset of , denoted by . Choose . Meanwhile, we can ensure is a sub-Gaussian random variable. The constant in (13) equals , and the coefficients in the general theoretical results are all of polynomial order in , such as .
2. The second example is Poisson regression, where the conditional distribution follows with . Therefore, , , and . Our goal is to estimate the transformation of the conditional mean, namely in (11), by using the Mondrian forest (3). It is not difficult to show that Conditions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of is contained in a compact subset of . Thus, satisfies Condition (iv). Choose . Meanwhile, we can ensure satisfies Assumption 4 by using the fact that the Poisson distribution is sub-Gaussian. The constant in (13) equals , and the coefficients in the general theoretical results are also all of polynomial order in , such as .
3. The third example is classification, where the conditional distribution follows a Bernoulli distribution (taking values in ) with . It is well known that the best classifier is the Bayes rule,
Therefore, the main focus in this problem is how to estimate the conditional probability well. Here, we use a Mondrian forest for the estimation (a per-cell sketch for this case is given after this list). First, we make a shift of , which means that is used in (11) instead. By some calculations, , , and in this case. Now, the final goal is to estimate by using the forest estimator (3). It is not difficult to show that Conditions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of is contained in a compact subset of . Now, choose for some . Meanwhile, Assumption 4 is satisfied since is bounded. The constant in (13) equals , and the coefficients in the general theoretical results are also all of polynomial order in , such as .
In each of the three examples above, the convergence rate of is given up to a polynomial factor of .
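As referenced in the classification example above, here is a minimal per-cell sketch for the Bernoulli case: the loss is the negative log-likelihood of the exponential family with natural parameter theta, and the cell-wise minimizer over a bounded range recovers (approximately) the logit of the within-cell class frequency. The logistic parametrization and all names are our assumptions, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bernoulli_nll(theta, y):
    """Negative log-likelihood of a Bernoulli variable with natural parameter
    theta, i.e. P(Y=1) = sigmoid(theta): -(y*theta - log(1 + e^theta))."""
    return np.log1p(np.exp(theta)) - y * theta

def fit_cell(y_cell, threshold=5.0):
    """Per-cell estimate of theta, mimicking the bounded optimization (2)."""
    obj = lambda theta: np.mean(bernoulli_nll(theta, y_cell))
    return minimize_scalar(obj, bounds=(-threshold, threshold), method="bounded").x

y_cell = np.array([1, 0, 1, 1, 0, 1])                    # labels falling in one leaf
theta_hat = fit_cell(y_cell)
p_hat = 1.0 / (1.0 + np.exp(-theta_hat))                 # estimated P(Y=1 | X in cell)
print(theta_hat, p_hat, y_cell.mean())                   # p_hat is close to the frequency 2/3
```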
6.3 Quantile regression
Quantile regression is a type of regression analysis used in statistics and econometrics that focuses on estimating the conditional quantiles (such as the median or other percentiles) of the response variable distribution given a set of predictor variables. Unlike ordinary least squares (OLS) regression, which estimates the mean of the response variable conditional on the predictor variables, quantile regression provides a more comprehensive analysis by estimating the conditional median or other quantiles.
Specifically, suppose is the -th quantile () of the conditional distribution of . Our interest is to estimate by using i.i.d. data . The loss function in this case is given by
where denotes the check function for the quantile . Meanwhile, by some calculations we know that the quantile function minimizes the population risk w.r.t. . Namely, we have
Therefore, we have reason to believe that the forest estimator in (3) works well in this problem.
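For intuition, here is a small numerical check (with made-up numbers) that the per-cell minimizer of the check loss in (2) is an empirical tau-quantile of the responses in the leaf, which is why the Mondrian forest produces a quantile estimate cell by cell.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def check_loss(c, y, tau):
    """Pinball / check loss for the tau-th quantile."""
    u = y - c
    return np.mean(np.where(u >= 0, tau * u, (tau - 1.0) * u))

tau = 0.8
y_leaf = np.array([0.2, 1.5, 0.9, 2.3, 1.1, 0.7])        # responses in one leaf
res = minimize_scalar(lambda c: check_loss(c, y_leaf, tau),
                      bounds=(-10.0, 10.0), method="bounded")
print(res.x, np.quantile(y_leaf, tau))                    # both close to the empirical 0.8-quantile
```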
Let us verify Assumptions 1-5. First, we choose in Assumption 1. Then, it is easy to check that is convex for any . Second, the loss function is also Lipschitz continuous w.r.t. the first variable when is given, and the Lipschitz constant is . Third, we choose the envelope function in Assumption 3. Fourth, we always suppose is a sub-Gaussian random variable to meet the requirement of Assumption 4. Finally, we consider a sufficient condition for Assumption 5.
The Knight equality in Knight (1998) tells us
from which we get
Conditioning on , we know the first part of above inequality equals to zero by using the definition of . Thus, we have
(14)
To illustrate our basic idea clearly, we consider only a normal case, where is independent of the residual and . The generalization of this case can be carried out in the same spirit. Under this normal case, is equal to the sum of and the -th quantile of . Denote by the -th quantile of . Thus, the conditional distribution of is the same as the distribution of . For any ,
Since is a fixed number, we assume is a large number later. By the Lagrange mean value theorem, the following probability bound holds
Then,
With the same argument, we also have
once . Therefore, (14) implies
(15)
and
(16)
Now, we choose . The combination of (15) and (16) implies that Assumption 5 holds with and . Meanwhile, the coefficients in the general theoretical results are also of polynomial order in , such as . With this setting, the convergence rate of is given up to a polynomial factor of .
6.4 Nonparametric density estimation
Assume is a continuous random vector defined on with a density function . Our interest lies in the estimation of the unknown function based on an i.i.d. sample of , namely the data . However, any density estimator has to satisfy two shape requirements, non-negativity and integration to one, namely and . These two restrictions can be relaxed by making a transformation. In fact, we have the decomposition
where can be any integrable function on . The above link allows us to focus on the estimation of only, which becomes a statistical learning problem without constraints. However, this transformation introduces a model identifiability problem, since and give the same density function, where . To solve this problem, we impose an additional requirement , which guarantees a one-to-one map between and .
In the case of density estimation, the scaled log-likelihood of any function based on the sampled data is
and its population version is
With a slight modification, Mondrian forests can also be applied in this problem. Recall the partition of the -th Mondrian with stopping time satisfies
For each cell , a constant is used as the predictor of in this small region. Thus, the estimator of based on a single tree has the form
where coefficients are obtained by minimizing the empirical risk function,
Since the objective function above is differentiable w.r.t. the variables , the minimum can indeed be achieved. To meet the identifiability requirement, the estimator based on a single tree is given by
Finally, by applying the ensemble technique again, the estimator of based on the Mondrian forest is
(17)
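The sketch below illustrates the spirit of (17) with a deliberate simplification: instead of the penalized log-likelihood fit described above, each tree uses the closed-form maximizer of the unpenalized likelihood over densities that are constant on its Mondrian cells, i.e. a histogram on the Mondrian partition, and the tree densities are then averaged. It reuses `mondrian_partition` from Section 2.2; all names and constants are illustrative.

```python
import numpy as np

def mondrian_histogram(X, leaves, eps=1e-12):
    """Histogram density on the given Mondrian cells: count / (n * cell volume).
    This is a simplification of the penalized log-likelihood fit in the text."""
    n = len(X)
    dens = []
    for lo, hi in leaves:
        volume = float(np.prod(np.asarray(hi) - np.asarray(lo)))
        count = int(np.sum(np.all((X >= lo) & (X <= hi), axis=1)))
        dens.append(count / (n * volume + eps))
    return np.array(dens)

def forest_density(x, trees):
    """Ensemble step, cf. (17): average the per-tree density values at x."""
    vals = []
    for leaves, dens in trees:
        for (lo, hi), d in zip(leaves, dens):
            if np.all((x >= lo) & (x <= hi)):
                vals.append(d)
                break
    return float(np.mean(vals))

rng = np.random.default_rng(2)
X = rng.beta(2.0, 5.0, size=(500, 1))                    # data supported on [0, 1]
trees = []
for _ in range(10):
    leaves = mondrian_partition([0.0], [1.0], lifetime=5.0, rng=rng)
    trees.append((leaves, mondrian_histogram(X, leaves)))
print(forest_density(np.array([0.25]), trees))           # roughly the Beta(2, 5) density at 0.25
```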
Next, we analyze the theoretical properties of . In this case, the forest estimator (17) is similar to the previous one defined in (3), which is obtained by optimization without constraints. Although there is a difference in that here we add an extra penalty function:
where , the arguments in Theorem 1 can also be applied here. Let the pseudo loss function be . It is obvious that Assumption 1 holds for . Assumption 2 is satisfied with and Assumption 3 is satisfied with . Choose . Following arguments similar to those in Theorem 1, we have
(18)
where (), ( is the center of cell ) and is the coefficient in Theorem 1. Therefore, it remains to bound
Lemma 2.
For any with and ,
To obtain the consistency rate of Mondrian forest estimator, we need a result that is similar to Assumption 5.
Lemma 3.
Suppose the true density is bounded away from zero and infinity, namely and . For any function with and , we have
Thus, Lemma 3 immediately implies the following result.
Theorem 3.
Suppose the true density is bounded away from zero and infinity, namely . If and the true function with satisfying , there is such that when ,
7 Conclusion
In this paper, we proposed a general framework for Mondrian forests, which can be used in many statistical or machine learning problems. These applications include, but are not limited to, least squares regression, generalized regression, density estimation and quantile regression. Meanwhile, we studied the upper bound of its regret function and its statistical consistency, and showed how to use these results in the specific applications listed above. Future work could study the asymptotic distribution of this kind of general Mondrian forest, as suggested in Cattaneo et al. (2023).
8 Proofs
This section contains the proofs of the theoretical results in the paper. Several useful preliminaries and notations are introduced first. Meanwhile, the constant in this section may change from line to line in order to simplify notation.
Definition 2 (Blumer et al. (1989)).
Let be a Boolean function class in which each is binary-valued. The growth function of is defined by
for each positive integer .
Definition 3 (Györfi et al. (2002)).
Let and . Let be a class of functions . An -cover of on is a finite set of functions satisfying
Then, the -cover number of on , denoted by , is the minimal size of an -cover of on . If there exists no finite -cover of , then the above cover number is defined as .
For a VC class, there is a useful result giving an upper bound on its covering number. To make this section self-contained, we first introduce some basic concepts and facts about the VC dimension; see Shalev-Shwartz and Ben-David (2014) for more details.
Definition 4 (Kosorok (2008)).
The subgraph of a real function is a subset of defined by
where is an abstract set.
Definition 5 (Kosorok (2008)).
Let be a collection of subsets of the set and be an arbitrary set of points. Define that picks out a certain subset of if can be expressed as for some . The collection is said to shatter if each of subsets can be picked out.
Definition 6 (Kosorok (2008)).
The VC dimension of the real function class , where each is defined on , is the largest integer such that a set of points in with size is shattered by . In this paper, we use to denote the VC dimension of .
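Since the displays in Definitions 2-6 were lost in extraction, the following LaTeX block restates them in one standard form (our notation; the cited references may state them slightly differently).

```latex
\begin{align*}
  &\text{growth function (Def.~2):}
    && \Pi_{\mathcal{B}}(m) = \max_{x_1,\dots,x_m}
       \bigl|\{(b(x_1),\dots,b(x_m)) : b \in \mathcal{B}\}\bigr|, \\
  &\text{$\epsilon$-covering number (Def.~3):}
    && N(\epsilon,\mathcal{G},\|\cdot\|) = \min\Bigl\{N :
       \exists\, g_1,\dots,g_N,\;\; \sup_{g\in\mathcal{G}}\min_{j\le N}\|g-g_j\| \le \epsilon\Bigr\}, \\
  &\text{VC dimension (Def.~4--6):}
    && \mathrm{VC}(\mathcal{G}) = \max\bigl\{m :
       \text{some $m$-point set is shattered by the subgraphs of } \mathcal{G}\bigr\}.
\end{align*}
```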
Proof of Theorem 1. Since is convex in Assumption 1, we have
for any . Therefore, we only need to consider the excess risk of a single tree in the following analysis.
In fact, our proof is based on the following decomposition:
(19)
where I relates to the variance term of the Mondrian tree, II is the approximation error of the Mondrian tree to , and III measures the error incurred when the empirical loss is used to approximate the theoretical one.
Analysis of Part I. Define two classes first.
Thus, the truncated function class of is given by
where the threshold . Then, the part can be bounded as follows.
(20)
where . Next, we need to find the upper bound of respectively.
Let us consider first. Make the decomposition of as below.
The part can be bounded by considering the covering number of the function class
where denotes the number of regions in the partition that is constructed by the truncated Mondrian process with stopping time . Therefore, is a deterministic number once is given. For any , recall the definition of the covering number of , namely shown in Definition 3. Now, we suppose
is a -cover of class in space, where is equipped with norm and . Without loss of generality, we can further assume since is upper bounded by . Otherwise, we consider the truncation of : . According to Assumption 2, we know for any and ,
where The above inequality implies that
for any and . Therefore, we know is a -cover of class in space, where . In other words, we have
(21)
Note that is a VC class since we have shown is a VC class in (29). Furthermore, we know the function in is upper bounded by . Therefore, we can bound the RHS of (21) by using Theorem 7.12 in Sen (2018)
(22)
for some universal constant . On the other hand, it is not difficult to show
(23)
Thus, the combination of (23), (22) and (21) implies
(24)
for each . Note that the class has an envelope function satisfying by Assumption 3. Construct a series of independent Rademacher variables sharing the same distribution . Then, the symmetrization technique (see Lemma 3.12 in Sen (2018)), Dudley's entropy integral (see (41) in Sen (2018)) and (24) imply
(25)
where we use the fact that is independent of the data set and is universal. Without loss of generality, we can assume ; otherwise we simply set as the new upper bound of the function class . Therefore, (25) also implies
(26)
It remains to find the VC dimension of the class for each . This result is summarized below.
Lemma 4.
For each integer , .
Proof.
Recall two defined classes:
We first calculate the VC dimension of class . Define a Boolean class of functions:
where if and otherwise. Recall the VC dimension of , denoted by , is the largest integer satisfying (see, for example, Kosorok (2008)). Therefore, we focus on bounding for each positive integer . Let be the series of points which maximize . Under the above notations, we have two observations as follows.
• For any that takes constant on each cell , there is and a leaf of a tree in such that for some . Meanwhile, is constant on the cell in and .
• All half-planes in pick out at most subsets from when (see, e.g., Kosorok (2008)), namely
Based on the above two facts, we can conclude
(27)
Then, combination of (27) and implies that
(28)
Solving the inequality
by using the basic inequality with yields
(29)
where the constant depends on only. ∎
Therefore, we know from Lemma 4 and (29) that
By the basic inequality , from above inequality we have
(30)
Next, from Proposition 2 in Mourtada et al. (2020), we know . Finally, we have the following upper bound for
(31)
Then, we bound the second part of by following the arguments below.
(32)
where in the third line we use the Cauchy-Schwarz inequality and in the last line we use Definition 3 and the sub-Gaussian property of .
Finally, we end the Analysis of Part I by bounding . In fact, this bound can be processed as follows.
(33)
Thus, we only need to find the upper bound of . By some calculations, we know
(34)
for some and . Therefore, (33) and (34) imply
(35)
Analysis of Part II. Recall
which relates to the empirical approximation error of Mondrian forests. First, suppose the first truncated Mondrian process with stopping time is given, denoted by . Under this restriction, the partition of is already determined, which is denoted by . Let
where and denotes the center of cell . Remember that depends on . Since is obtained by
we know
(36)
once . At this point, we consider two cases about :
Case I: . This case is trivial because we already have .
Case II: . In this case, (36) implies
Let be the diameter of the cell that lies in. By Assumption 2, the above inequality further implies
(37)
where the second line holds because is a -smooth function and we use to get the fifth line and Cauchy-Schwarz inequality in the sixth line .
Therefore, Case I and (37) in Case II imply that
(38)
Taking expectation on both sides of (38) w.r.t. leads to
(39)
where , and in the second line we use the fact that the function is concave. For any fixed , we can bound by using Corollary 1 in Mourtada et al. (2020). In detail, we have
(40)
Thus, the combination of (39) and (40) implies that
(41)
Analysis of Part III. This part can be bounded by using the central limit theorem. Since , we know by Assumption 3 that
with . Thus, is an envelope function of . Note that a single function constitutes a Glivenko-Cantelli class and has VC dimension 1. Thus, an application of equation (80) in Sen (2018) implies
(42)
for some universal .
Proof of Corollary 2. The proof starts from the observation that our class can be used to approximate any general function. Since , we know can be approximated by a sequence of continuous functions in sense. Thus, we can assume is continuous. Define the function . For any , by Lemma 16.1 in Györfi et al. (2002) there is with and such that
(43)
After some simple calculation, we know , where depends on only. Now we fix such . Make the decomposition as follows
(44)
Parts I and III can be upper bounded by following an analysis similar to that in Theorem 1. Therefore, under the assumptions in our theorem, we know that both of these two parts converge to zero as . Next, we consider Part II. Note that
where in the second line we use Assumption 5. Finally, we only need to consider the behavior of term as . This can be done by using the analysis of Part II in the proof for Theorem 1. Taking in (41), we have
(45)
which goes to zero as increases. In conclusion, we have proved that
The above inequality and Assumption 5 show that is consistent for the general function .
Proof of Theorem 2. Based on Assumption 1, we only need to consider the regret function for . For any , by the definition of we know
(46)
where is the estimator based on the process .
On the other hand, we have the decomposition below
(47)
Firstly, we bound Part . Recall , which is defined below (20). Make the decomposition of as follows.
(48)
The key for bounding is to find the upper bound of . By the definition of and Assumption 3, we know if occurs
Therefore, when happens we have
Following arguments that we used to bound in the Proof of Theorem 1, we know
(49)
Next, the way for bounding in (48) is similar to that we used to bound in the Proof of Theorem 1. Namely, we have
(50)
Secondly, we use (46) to bound Part in (47). By the definition of , for any we have
Similar to the Proof of Theorem 1, the above inequality implies
(51)
Since (51) holds for all , taking inequality (51) further implies
(52)
where is caused by the sub-Gaussian property of .
Thirdly, we consider Part . The argument for this is the same as that used to obtain (42). Namely, we have
(53)
where is universal and does not depend on .
Proof of Lemma 2. Note that . Define a real function as follows:
Thus, we have and . Later, it will be convenient to use the function in the analysis of this penalty function. Since both and are upper bounded, we know is differentiable and its derivative is
(54)
Define a continuous random vector with the density function
(55)
(56)
On the other hand, the Lagrange mean value theorem implies
(57)
where . Thus, later we only need to consider the last term of (57). Since , we know from (57) that
(58)
where follows the uniform distribution in and is independent of . By further calculation, we have
From (40), we already know . Thus, above inequality implies
This completes the proof.
Proof of Lemma 3. First, we calculate the term , where . With some calculation, it is not difficult to know
(59)
where is a continuous random vector in . Furthermore, has the density with . Since and , we can assume without loss of generality that . This results in
(60)
Let follow the uniform distribution in . Then (60) implies
(61)
With the same argument, we also have
(62)
The combination of (59), (61) and (62) shows that
(63)
for some universal and any . Finally, by Taylor expansion, we have
for some . Without loss of generality, we can assume that . Thus, the second derivative can be bounded by using (63) if we take . Meanwhile, the first derivative vanishes since achieves the minimum of . Based on this analysis, we have
for some universal . This completes the proof.
References
- Arlot and Genuer (2014) Arlot, S. and R. Genuer (2014). Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939.
- Athey et al. (2019) Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of Statistics 47(2), 1148–1178.
- Biau (2012) Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research 13(1), 1063–1095.
- Biau et al. (2008) Biau, G., L. Devroye, and G. Lugosi (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research 9(9).
- Blumer et al. (1989) Blumer, A., A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM) 36(4), 929–965.
- Breiman (2001) Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.
- Cattaneo et al. (2023) Cattaneo, M. D., J. M. Klusowski, and W. G. Underwood (2023). Inference with Mondrian random forests. arXiv preprint arXiv:2310.09702.
- Györfi et al. (2002) Györfi, L., M. Kohler, A. Krzyżak, and H. Walk (2002). A distribution-free theory of nonparametric regression, Volume 1. Springer.
- Huang (1998) Huang, J. Z. (1998). Functional ANOVA models for generalized regression. Journal of Multivariate Analysis 67(1), 49–71.
- Klusowski (2021) Klusowski, J. M. (2021). Universal consistency of decision trees in high dimensions. arXiv preprint arXiv:2104.13881.
- Knight (1998) Knight, K. (1998). Limiting distributions for L1 regression estimators under general conditions. The Annals of Statistics, 755–770.
- Kosorok (2008) Kosorok, M. R. (2008). Introduction to empirical processes and semiparametric inference. Springer.
- Lakshminarayanan et al. (2014) Lakshminarayanan, B., D. M. Roy, and Y. W. Teh (2014). Mondrian forests: Efficient online random forests. Advances in neural information processing systems 27.
- Liaw et al. (2002) Liaw, A., M. Wiener, et al. (2002). Classification and regression by randomForest. R News 2(3), 18–22.
- Mourtada et al. (2020) Mourtada, J., S. Gaïffas, and E. Scornet (2020). Minimax optimal rates for Mondrian trees and forests. The Annals of Statistics 48(4), 2253 – 2276.
- Mourtada et al. (2021) Mourtada, J., S. Gaïffas, and E. Scornet (2021). AMF: Aggregated Mondrian forests for online learning. Journal of the Royal Statistical Society Series B: Statistical Methodology 83(3), 505–533.
- Roy (2011) Roy, D. M. (2011). Computability, inference and modeling in probabilistic programming. Ph. D. thesis, Massachusetts Institute of Technology.
- Roy and Teh (2008) Roy, D. M. and Y. W. Teh (2008). The Mondrian process. In Advances in Neural Information Processing Systems, pp. 1377–1384.
- Schmidt-Hieber (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics 48(4), 1875–1897.
- Scornet et al. (2015) Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests. The Annals of Statistics 43(4), 1716–1741.
- Sen (2018) Sen, B. (2018). A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University 11, 28–29.
- Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and S. Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
- Stone (1986) Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 590–606.
- Stone (1994) Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. The annals of statistics 22(1), 118–171.
- Zhang and Wang (2009) Zhang, H. and M. Wang (2009). Search for the smallest random forest. Statistics and Its Interface 2(3), 381.