Non-asymptotic Properties of Generalized Mondrian Forests in Statistical Learning
Abstract
Random Forests have been extensively used in regression and classification, inspiring the development of various forest-based methods. Among these, Mondrian Forests, derived from the Mondrian process, mark a significant advancement. Expanding on Mondrian Forests, this paper presents a general framework for statistical learning that encompasses a range of common learning tasks, such as least squares regression, generalized regression, quantile regression, and classification. Under mild assumptions on the loss functions, we provide upper bounds on the regret/risk functions of the estimators and demonstrate their statistical consistency.
Keywords: ensemble learning, machine learning, random forests, regret function, statistical learning
1 Introduction
Random Forest (RF) (Breiman, 2001) is a popular ensemble learning technique in machine learning. It operates by constructing multiple decision trees and averaging their predictions to improve accuracy and robustness. Many empirical studies have demonstrated its powerful performance across various data domains (for example, see Liaw et al. (2002)). The effectiveness of RF is attributed to its data-dependent splitting rule, called CART. Briefly, CART is a greedy algorithm that identifies the best splitting variable and value by maximizing the reduction of training error between two layers. However, this data-dependent splitting scheme complicates the theoretical analysis of RF.
To the best of our knowledge, two papers have made significant contributions to the theoretical analysis of RF. Scornet et al. (2015) was the first to establish the statistical consistency of RF when the true regression function follows an additive model. Klusowski (2021) later demonstrated the consistency of RF under weaker assumptions on the distribution of predictors. However, both studies rely on the technical condition that the conditional mean has an additive structure.
To gain a deeper understanding of random forests, researchers have considered modified and stylized versions of RF. One such method is Purely Random Forests (PRF; see, for example, Arlot and Genuer (2014), Biau (2012) and Biau et al. (2008)), where individual trees are grown independently of the sample, making them well suited for theoretical analysis. In this paper, our interest lies in Mondrian Forests, a type of PRF that applies the Mondrian process in its partitioning. This kind of forest was first introduced by Lakshminarayanan et al. (2014) and has competitive online performance in classification problems compared with other state-of-the-art methods. Inspired by this appealing online property, Mourtada et al. (2021) also studied online regression and classification using Mondrian forests. Besides its impressive online performance, this data-independent partitioning method is also important for offline regression because it achieves a faster consistency rate than other partitioning schemes, such as the midpoint-cut strategy; see Mourtada et al. (2020). In fact, Mourtada et al. (2020) showed that the consistency rate of Mondrian forests is minimax optimal for the class of Hölder continuous functions. Then, Cattaneo et al. (2023), following the approach in Mourtada et al. (2020), derived the asymptotic normal distribution of the Mondrian forest for the offline regression problem. Recently, Baptista et al. (2024) proposed a novel dimension reduction method using Mondrian forests. It is thus evident that Mondrian forests have attracted increasing research attention due to their advantageous theoretical properties compared with other forest-based methods.
In this paper, we argue that Mondrian forests, beyond their application in classical regression and classification, can be extended to address a broader spectrum of statistical and machine learning problems, including generalized regression, density estimation, and quantile regression. Our main contributions are as follows:
• First, we propose a general framework based on Mondrian forests that can be applied to various learning problems.
• Second, we establish an upper bound for the regret (or risk) function of the proposed forest estimator. The theoretical results derived from this analysis are applicable to numerous learning scenarios, with several examples provided in Section 6.
Our training approach adopts a global perspective, contrasting with the local perspective used in the generalization of Random Forests by Athey et al. (2019). Specifically, after performing a single optimization over the entire dataset, our method can estimate the objective function over the whole domain, while Athey et al. (2019) focus on estimating it at a specific point . This global approach significantly reduces computational time, especially in high-dimensional settings where is large.
Moreover, our globalized framework can be easily applied to statistical problems involving a penalization function , which depends on over the entire domain . In Section 6.6, we illustrate the application of our method to nonparametric density estimation, a problem that incorporates such a penalization. In contrast, since Athey et al. (2019) rely on pointwise estimation, their method struggles to ensure that the resulting estimator satisfies shape constraints, such as those required for a valid probability density function, thus excluding these cases from their scope.
2 Background and Preliminaries
2.1 Task in Statistical Learning
Let be the random vector, where we have normalized the range of . In statistical learning, the goal is to find a policy supervised by , which is defined as a function . Usually, a loss function is used to measure the difference or loss between the decision and the goal . Taking expectation w.r.t. , the risk function
(1) |
denotes the averaged loss by using the policy . Naturally, people have reasons to select the best policy by minimizing the averaged loss over some function class , namely,
Therefore, the policy has the minimal risk and is the best in theory. In practice, the distribution of is unknown, so (1) cannot be used to compute and such a best policy cannot be obtained directly. Usually, we use i.i.d. data to approximate it by the law of large numbers; that is, the risk function is approximated by the empirical risk function
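In standard notation (writing \(\ell\) for the loss, \(\mathcal{F}\) for the function class, and \((X_i, Y_i)_{i=1}^{n}\) for the sample), these quantities take the familiar forms:

```latex
R(f) = \mathbb{E}\,\ell\bigl(f(X), Y\bigr), \qquad
f^{*} \in \operatorname*{arg\,min}_{f \in \mathcal{F}} R(f), \qquad
R_{n}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(X_{i}), Y_{i}\bigr).
```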
Traditionally, one can find an estimator/policy by minimizing over a known function class; see spline regression and wavelet regression in Györfi et al. (2002), and regression by deep neural networks in Schmidt-Hieber (2020) and Kohler and Langer (2021).
Instead of globally minimizing , tree-based greedy algorithms adopt a “local” minimization approach, which is widely used to construct the empirical estimator . These algorithms have demonstrated the advantages of tree-based estimators over many traditional statistical learning methods. Therefore, in this paper, we focus on bounding the regret function:
where represents an ensemble estimator built using Mondrian Forests.
2.2 Mondrian partitions
Mondrian partitions are a specific type of random tree partition, where the partitioning of is independent of the data points. This scheme is entirely governed by a stochastic process called the Mondrian process, denoted as . The Mondrian process , introduced by Roy and Teh (2008) and Roy (2011), is a probability distribution over infinite tree partitions of . For a detailed definition, we refer to Definition 3 in Mourtada et al. (2020).
In this paper, we consider Mondrian partitions with stopping time , denoted by (see Section 1.3 in Mourtada et al. (2020)). The construction consists of two steps. First, we construct partitions according to by iteratively splitting cells at random times, where the splitting time depends on the linear dimension of each cell. The probability of splitting along each side is proportional to the side length of the cell, and the splitting position is chosen uniformly. Second, we cut the nodes whose birth time is after the tuning parameter . In fact, each tree node created in the first step is assigned a specific birth time, and the birth times increase as the tree grows. Therefore, the second step can be seen as a pruning process, which helps us choose the best tree model for a learning problem.
To clearly present the Mondrian partition algorithm, some notations are introduced. Let be a cell with closed intervals . Denote and . Let be an exponential distribution with expectation . Algorithm 1 shows how our Mondrian partition operates. This algorithm is a recursive process, where the root node and stopping time are used as the initial inputs.
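As an illustration of Algorithm 1, the following minimal Python sketch implements the recursion, assuming a cell is stored as a list of coordinate intervals and a lifetime budget plays the role of the stopping time; the function and variable names are ours, not the paper's.

```python
import numpy as np

def mondrian_partition(cell, budget, rng):
    """Recursively split a rectangular cell until the exponential split
    times exhaust the remaining lifetime budget (the stopping time).
    `cell` is a list of (low, high) intervals, one per coordinate."""
    lengths = np.array([hi - lo for lo, hi in cell])
    linear_dim = lengths.sum()
    # Time to the next split ~ Exp(rate = linear dimension of the cell).
    split_time = rng.exponential(1.0 / linear_dim) if linear_dim > 0 else np.inf
    if split_time > budget:
        return [cell]                       # this cell becomes a leaf
    # Split coordinate chosen with probability proportional to side length,
    # split position chosen uniformly inside that side.
    j = rng.choice(len(cell), p=lengths / linear_dim)
    lo, hi = cell[j]
    pos = rng.uniform(lo, hi)
    left = [iv if k != j else (lo, pos) for k, iv in enumerate(cell)]
    right = [iv if k != j else (pos, hi) for k, iv in enumerate(cell)]
    remaining = budget - split_time         # children inherit the leftover lifetime
    return (mondrian_partition(left, remaining, rng)
            + mondrian_partition(right, remaining, rng))

rng = np.random.default_rng(0)
leaves = mondrian_partition([(0.0, 1.0), (0.0, 1.0)], budget=2.0, rng=rng)
print(len(leaves), "leaves")
```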


3 Methodology
Based on the Mondrian Partition, we introduce our Mondrian Forests in this section. Let , be independent Mondrian processes. When we prune each tree at time , independent partitions are obtained, where all cuts after time are ignored. In this case, we can write satisfying
where denotes a cell in the partition . For each cell , a constant policy is used as the predictor of , where
(2) |
and is a threshold. For any fixed , is usually a continuous function w.r.t. the first variable in machine learning. Therefore, the optimization (2) over guarantees the existence of in general and we allow as in theory. Then, for each , we can get an estimator of :
where denotes the indicator function. By applying the ensemble technique, the final estimator is given by
(3) |
If the cell does not contain any data point, we just use as the optimizer in the corresponding region.
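The per-cell fit (2) and the ensemble step (3) can be sketched as follows; this is an illustrative implementation under simplifying assumptions (a finite search grid for the per-cell constant and the interval-list cell representation from the sketch above), not the exact procedure used in the paper.

```python
import numpy as np

def fit_cell_policy(y_cell, loss, grid):
    """Per-cell optimization (2): choose the constant on a grid minimizing
    the empirical loss over the responses falling into the cell."""
    if len(y_cell) == 0:
        return 0.0                           # empty cell: fall back to a default value
    risks = [np.mean([loss(z, y) for y in y_cell]) for z in grid]
    return grid[int(np.argmin(risks))]

def predict_forest(x, partitions, cell_values):
    """Ensemble step (3): average the per-tree predictions at the query point x."""
    preds = []
    for leaves, values in zip(partitions, cell_values):
        for cell, z in zip(leaves, values):
            if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, cell)):
                preds.append(z)
                break
    return np.mean(preds) if preds else float("nan")
```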
Let us clarify the relationship between (3) and the traditional RF. Recall that the traditional RF (see Mourtada et al. (2020) and Breiman (2001)) is used to estimate the conditional mean by using the sample mean in each leaf. In this case, the loss function is applied. If , it can be checked that
where denotes the cardinality of a set. Therefore, in this case, our forest estimator (3) coincides exactly with the traditional RF. In conclusion, is an extension of the traditional RF of Breiman (2001), since the loss function can be chosen flexibly in different learning problems.
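As a quick check in standard notation: with the squared loss, the per-cell problem (2) has the cell average as its closed-form solution,

```latex
\widehat{z}_{C}
  = \operatorname*{arg\,min}_{z} \frac{1}{n}\sum_{i:\,X_i \in C} (z - Y_i)^2
  = \frac{1}{\#\{i : X_i \in C\}} \sum_{i:\,X_i \in C} Y_i ,
```

which is exactly the leaf-average prediction used by the traditional RF.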
From the above learning process, we know that there are two tuning parameters in the construction of , namely and , about which we give some remarks. The stopping time, , controls the complexity of the Mondrian forest. Generally speaking, the cardinality of a tree partition increases with the value of . Thus, a large value of is beneficial for reducing the bias of the forest estimator, whereas a small is instrumental in controlling the generalization error of the final estimator (3). Selecting is therefore crucial for balancing complexity and stability. To ensure the consistency of , we suppose that depends on the sample size and denote it by in the following analysis. The second parameter, , denotes the number of Mondrian trees, which can be chosen in the same way as for RF. There are many studies on its selection for RF; see, for example, Zhang and Wang (2009). In practice, most practitioners take in their computations.
4 Main results
In this section, we study the upper bound of the regret function of (3), which is constructed by Mondrian processes. Denote as the support of that satisfies . First, we need some mild restrictions on the loss function .
Assumption 1
The risk function is convex. In other words, for any and functions , we have .
Note that Assumption 1 is satisfied if is convex for any fixed .
Assumption 2
There is a nonnegative function with such that for any , is Lipschitz continuous and for any , , we have
Assumption 3
There is an envelope function such that for any and ,
Without loss of generality, we can assume that is non-decreasing w.r.t. the first variable for any fixed . In the next section, we will see that many commonly used loss functions satisfy Assumptions 1–3, including the loss and the loss. In the theoretical analysis, we make the following Assumption 4 on the distribution of and assume takes values in . This paper mainly focuses on the sub-Gaussian case of , namely . Assumption 4, however, extends the class of sub-Gaussian distributions to a more general one.
Assumption 4
There is an increasing function satisfying and constant , such that
Our theoretical results relate to the -smooth class given below, since it is large enough and dense in the integrable space generated by any probability measure. When , this class is also known as the Hölder space with index in the literature; see Adams and Fournier (2003). Additionally, the -smooth class is frequently used in practice, for example with splines, because its smoothness makes computation convenient.
Definition 1 (-smooth class)
Let , and . The -smooth ball with radius , denoted by , is the set of times differentiable functions such that
and
where denotes the norm in space and is the gradient operator.
Theorem 2 (Regret function bound of Mondrian forests)
Remark 3
The first term of the RHS of (4) relates to the generalization error of the forest, and the second one is the approximation error of Mondrian forest to . Finally, the last line is caused by the tail property of and will disappear if we assume that is bounded.
Remark 4
We will see that the coefficients above in many applications, such as and , are only of polynomial order of if . In this case, the last line of (4) usually has no influence on the convergence speed of the regret function. Roughly speaking, only the first two terms dominate the convergence rate, namely the generalization and approximation errors.
Remark 5
If , , diverge no faster than for some , we know from (4) that
when and . Therefore, Mondrian forests perform no worse than the -smooth class in this general setting.
In statistical learning, the consistency of an estimator is a crucial property that ensures that the estimator converges to the true value of the parameter/function being estimated as the sample size increases. Theorem 2 can also be used to analyze the statistical consistency of . For this purpose, denote the true function by
(5) |
and make the following assumption.
Assumption 5
For any , there are and such that
Usually, holds in many specific learning problems. Before presenting the consistency results, we denote the last line of (4) by , namely,
(6) |
Then, the statistical consistency of Mondrian forests can be guaranteed by the following two corollaries.
Corollary 6 (Consistency rate of Mondrian forests)
5 Model selection: the choice of
In practice, the best is unknown; thus a criterion is necessary to stop the growth of Mondrian forests, otherwise the learning process will overfit. Here, we adopt a penalization methodology as follows. For each , define
(8) |
where the parameter controls the strength of the penalty and has already been constructed in Section 3 using the process . Then, the best is chosen by
Denote as the tree estimator that is constructed by the Mondrian process . Then, our forest estimator based on model selection is given by
(9) |
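The selection step can be sketched as follows; since the exact penalty in (8) is not reproduced here, the sketch treats it as a user-supplied function of the stopping time, and `fit_tree` stands in for the tree construction of Section 3.

```python
import numpy as np

def select_lifetime(X, y, lambdas, loss, penalty, fit_tree, rng):
    """Sketch of the penalized model selection: for each candidate stopping
    time, build one Mondrian tree, evaluate its empirical risk on the data,
    add the penalty term, and return the minimizing stopping time, cf. (8)."""
    scores = []
    for lam in lambdas:
        predict = fit_tree(X, y, lam, rng)          # tree pruned at time lam
        emp_risk = np.mean([loss(predict(xi), yi) for xi, yi in zip(X, y)])
        scores.append(emp_risk + penalty(lam))      # penalized criterion
    return lambdas[int(np.argmin(scores))]
```

Depending on the exact construction, this selection is carried out either once globally or separately for each Mondrian process before averaging the resulting tree estimators as in (9).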
Theorem 8
By properly choosing the penalty strength , we can obtain a convergence rate of the regret function of Mondrian forests according to (8). Theorem 8 also implies that the estimator (9) is adaptive to the degree of smoothness of the true function : if is large, this rate will be fast; otherwise, the convergence rate is slower. This coincides with basic knowledge of function approximation. Applications of Theorem 8 are given in the next section, where several examples are discussed in detail. In those cases, we will see that the coefficients in (8), such as , can indeed be upper bounded by a polynomial of .
6 Examples
In this section, we show how to use Mondrian forests in different statistical learning problems. Meanwhile, theoretical properties of the corresponding forest estimators, namely in (3) and in (9), are given for each learning problem based on Theorems 2 and 8. The following Lemma 9 is often useful for verifying Assumption 5; its proof follows directly from a Taylor expansion of the real function around the point .
Lemma 9
6.1 Least squares regression
As shown in Mourtada et al. (2020), Mondrian forests are statistically consistent if the loss function is used. In our first example, we revisit this case by using the general results established in Sections 4 and 5. Usually, nonparametric least squares regression refers to methods that do not assume a specific parametric form of the conditional expectation ; instead, these methods are flexible and can adapt to the underlying structure of the data. The loss function of least squares regression is given by . First, we define the event . Under Assumption 4, by (A.20) we can find constants such that . This means is very close to as . When the event occurs, from (2) we further know
where denotes the cardinality of any set. Therefore, is just the average of s that are in the leaf .
Let us discuss the properties of . First, it is obvious that Assumption 1 holds for this loss. By some simple calculations, we also know that Assumption 2 is satisfied with and Assumption 3 is satisfied with . Choosing and , Theorem 2 implies the following property of .
Proposition 10
For any , there is an integer such that for any ,
Then, we check the consistency as in Corollary 6. By some calculations, we have and
The above inequality shows that Assumption 5 holds with and . When , Corollary 6 implies Proposition 11.
Proposition 11
For any , there is an integer such that for any ,
We can also show that is statistically consistent for any general function defined in (5) when is chosen properly as stated in Corollary 7. Finally, by choosing and in Theorem 8, the regret function of the estimator , which is based on the model selection in (8), has an upper bound as shown in the following proposition.
Proposition 12
For any , there is an integer such that for any ,
where depends on only.
6.2 Generalized regression
Generalized regression refers to a broad class of regression models that extend beyond traditional ordinary least squares (OLS) regression, accommodating various types of response variables and relationships between predictors and responses. Usually, in this model, the conditional distribution of given follows an exponential family of distributions
(11) |
where is a positive measure defined on , for any , and the function is defined on an open interval of and serves as a normalization term. We suppose the function exists, and by some calculations we have . Thus, the conditional expectation will be known if we can estimate the unknown function . More information about model (11) can be found in Stone (1986), Stone (1994) and Huang (1998).
The problem of generalized regression is to estimate the unknown function by using the i.i.d. data . Note that both and are known in (11). In this case, we use the maximum likelihood method for the estimation, and the corresponding loss function is given by
and by some calculations we know the true function satisfies (5), namely
Therefore, we have reasons to believe that Mondrian forests are statistically consistent in this problem, as stated in Corollary 7. Now, we give some mild restrictions on and to ensure that our general results in Sections 4 and 5 can be applied to this generalized regression.
(i) has a continuous second derivative, and its first derivative is strictly positive on .
(ii) We can find an interval of satisfying that the measure is concentrated on and
(12)
where denotes the interior of . If is bounded, (12) holds for at least one of the endpoints.
(iii) and for each .
(iv) There is a compact interval of such that the range of is contained in .
The above restrictions on and were used in Huang (1998). In fact, we know from Huang (1998) that many commonly used distributions satisfy these conditions, including the normal, Poisson, and Bernoulli distributions. Now, let us verify our Assumptions 1–5 under this setting.
In particular, Assumption 1 is verified using restrictions (i)-(iii). On the other hand, we choose the Lipschitz constant in Assumption 2 as
Thirdly, the envelope function of can be set by
Since is a sub-Gaussian random variable by Assumption 4, we have in this case, indicating that Assumption 3 is satisfied. Finally, under restrictions (i)-(iv), Lemma 4.1 in Huang (1998) shows that our Assumption 5 holds with and
(13) |
From (12), the constant in (13) must be strictly positive. On the other hand, as we will see later, the above is finite in many cases.
Therefore, the general theoretical results in Sections 4 and 5 can be applied to generalized regression. Meanwhile, we stress that the coefficients in the general results, such as in Theorem 2 and in (13), are always of polynomial order in if is selected properly. Let us give some specific examples.
(1) The first example is Gaussian regression, where the conditional distribution follows and is known. Therefore, , , and . Our goal is to estimate the unknown conditional mean . Now, restrictions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of is contained in a compact set of , denoted by . Choose . Meanwhile, we can ensure is a sub-Gaussian random variable. The constant in (13) equals , and the coefficients in the general theoretical results are all of polynomial order in , such as .
(2) The second example is Poisson regression, where the conditional distribution follows with . Therefore, , , and . Our goal is to estimate the transformation of the conditional mean, namely in (11), by using the Mondrian forest (3). It is not difficult to show that restrictions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of is contained in a compact set of ; thus satisfies restriction (iv). In this case, we choose . Note that satisfies Assumption 4 with if follows a Poisson distribution with mean . Thus, we can ensure satisfies Assumption 4. The constant in (13) equals , and the coefficients in the general theoretical results are also all of polynomial order in , such as .
(3) The third example is related to classification, where the conditional distribution follows a Bernoulli distribution (taking values in ) with . It is well known that the best classifier is called the Bayes rule,
Our interest is to estimate the conditional probability above, and we use the Mondrian forest for the estimation. First, we make a shift of , which means is used in (11) instead. By some calculations, , , and in this case. Now, the final goal is to estimate by using the forest estimator (3). It is not difficult to show that restrictions (i)-(iii) are satisfied. To satisfy the fourth one, we assume the range of is contained in a compact set of . Now, choose for some . Meanwhile, Assumption 4 is satisfied since is bounded. The constant in (13) equals , and the coefficients in the general theoretical results are also all of polynomial order in , such as .
(4) The fourth example is geometric regression, where the model is . Here, denotes the success probability, and we suppose it is bounded from above and below, namely . Thus, we have a positive probability of obtaining success or failure for any and . In this case, , and is the unknown function we need to estimate. Since , in this example we optimize (2) over only, with . In Section 4, we only need to replace and by and , respectively; then it is not difficult to check that all results in Section 4 still hold. Furthermore, restrictions (i)-(iii) are satisfied after some calculation. Finally, we check Assumption 5. In this circumstance, we can still use Lemma 4.1 in Huang (1998) after replacing in (13) with . If is chosen, then in (13) after some calculation. Therefore, Assumption 5 also holds with and .
In each of the four examples above, the convergence rate of is up to a polynomial of .
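For the first three examples, the per-cell optimization (2) with the negative log-likelihood loss has closed-form solutions under the canonical links; the sketch below (with an illustrative clipping interval standing in for the compact set required by restriction (iv)) records them.

```python
import numpy as np

def cell_estimate(y_cell, family, K=(-5.0, 5.0)):
    """Closed-form per-cell minimizers of the negative log-likelihood loss
    under the canonical links (identity/log/logit), clipped to a compact
    interval K. Bernoulli responses are assumed to be coded in {0, 1}."""
    ybar = np.mean(y_cell)
    if family == "gaussian":                 # theta = mean
        theta = ybar
    elif family == "poisson":                # theta = log(mean)
        theta = np.log(max(ybar, 1e-8))
    elif family == "bernoulli":              # theta = logit(P(Y = 1))
        p = min(max(ybar, 1e-8), 1 - 1e-8)
        theta = np.log(p / (1 - p))
    else:
        raise ValueError("unknown family")
    return float(np.clip(theta, *K))
```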
6.3 Huber’s loss
Huber loss, also known as smooth loss and proposed in Huber (1992), is a loss function frequently used in regression tasks, especially in machine learning applications. It combines the strengths of both Mean Absolute Error (MAE) and Mean Squared Error (MSE), making it more robust to outliers than MSE while maintaining smoothness and differentiability like MSE. Huber loss applies a quadratic penalty for small errors and a linear penalty for large ones, allowing it to balance sensitivity to small deviations and resistance to large, anomalous deviations. This makes it particularly effective when dealing with noisy data or outliers. In detail, this loss function takes the form
This case is interesting and different from commonly used loss functions because such depends on and can vary with the sample size . Although this loss function is piecewise, unlike the classical loss, the non-asymptotic result in Theorem 2 can still be applied here.
Let us verify Assumptions 1–3 for this loss. Firstly, this loss function is convex, so Assumption 1 is satisfied. Secondly, for any we know that is a Lipschitz function with Lipschitz constant . Thirdly, we can define
Since is sub-Gaussian by Assumption 4, we have , and Assumption 3 is satisfied. Finally, the coefficients in Theorem 2:
diverge no faster than a polynomial of if we take and the threshold satisfies for any . Under the above settings, we can obtain a fast convergence rate of the forest estimator by Theorem 2.
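A per-cell fit under the Huber loss has no simple closed form, but it is a one-dimensional convex problem; the sketch below solves it numerically, with `delta` standing for the Huber threshold and `bound` for the truncation level (both names are ours).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(z, y, delta):
    """Quadratic penalty for small residuals, linear penalty for large ones."""
    r = np.abs(z - y)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def huber_cell_fit(y_cell, delta, bound):
    """Per-cell version of (2) with the Huber loss: minimize the average
    Huber loss over a constant z restricted to [-bound, bound]."""
    y_cell = np.asarray(y_cell, dtype=float)
    obj = lambda z: float(np.mean(huber_loss(z, y_cell, delta)))
    res = minimize_scalar(obj, bounds=(-bound, bound), method="bounded")
    return res.x
```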
On the other hand, we aim to show our forest estimator performs better than any smooth function even if for any small . The reason is given below. By calculation, we have
Since , if and is properly selected by (4) we know
Finally, we show that Huber loss can be applied to estimate the conditional expectation, and the corresponding statistical consistency result is given below.
Proposition 13
Recall the conditional expectation . If , and for a large , we can find a series of such that
6.4 Quantile regression
Quantile regression is a type of regression analysis used in statistics and econometrics that focuses on estimating conditional quantiles (such as the median or other percentiles) of the distribution of the response variable given a set of predictor variables. Unlike ordinary least squares (OLS) regression, which estimates the mean of the response variable conditional on the predictor variables, quantile regression provides a more comprehensive analysis by estimating the conditional median or other quantiles.
Specifically, suppose that is the -th quantile () of the conditional distribution of . Our interest is to estimate by using i.i.d. data . The loss function in this case is given by
where denotes the check function for the quantile . Meanwhile, by some calculations, we know the quantile function minimizes the population risk w.r.t. the above . Namely, we have
Therefore, we have reasons to believe that the forest estimator in (3) works well in this problem.
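Concretely, the per-cell problem (2) with the check loss is solved by an empirical quantile of the responses falling into the cell, as the following small sketch (with illustrative names) records.

```python
import numpy as np

def check_loss(z, y, alpha):
    """Pinball (check) loss for the alpha-th quantile, averaged over y."""
    u = np.asarray(y) - z
    return float(np.mean(np.maximum(alpha * u, (alpha - 1.0) * u)))

def quantile_cell_fit(y_cell, alpha):
    """The minimizer of the empirical check loss over a constant is an
    empirical alpha-quantile of the responses falling into the cell."""
    return float(np.quantile(y_cell, alpha))
```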
Let us verify Assumptions 1–5. Firstly, we choose in Assumption 1, and it is easy to check that the univariate function is convex for any . Secondly, we fix any ; then the loss function is also Lipschitz continuous w.r.t. the first variable with Lipschitz constant . Thirdly, we choose the envelope function in Assumption 3 to be . Fourthly, we always suppose is a sub-Gaussian random variable to meet the requirement in Assumption 4. Finally, it remains to find a sufficient condition for Assumption 5.
In fact, the Knight identity in Knight (1998) tells us
from which we get
Conditional on , we know that the first part of the above inequality equals zero by the definition of . Therefore,
(14) |
To illustrate the basic idea clearly, we simply consider a normal case, where is independent of the residual and . The generalization to the sub-Gaussian case follows the same spirit. In this normal case, is equal to the sum of and the -th quantile of . Denote by the -th quantile of . Thus, the conditional distribution of is the same as the distribution of . For any ,
Since is a fixed number, we assume is a large number later. By the Lagrange mean value theorem, the following probability bound holds
Then,
With the same argument, we also have
once . Therefore, (14) implies
(15) |
and
(16) |
Now, we choose . The combination of (15) and (16) implies that Assumption 5 holds with and . Meanwhile, the coefficients in the general theoretical results are also of polynomial order in , such as . Under this setting, the convergence rate of is up to a polynomial of .
6.5 Binary classification
In the previous sections, we gave several examples of regression. Now, let us turn to classification. In this section, we show that Mondrian forests can be applied to binary classification as long as the chosen loss function is convex. In detail, we assume takes only two labels and is the explanatory vector. It is well known that the Bayes classifier has the minimal classification error and takes the form:
where . However, such a theoretically optimal classifier is not available since is unknown. In machine learning, the most commonly used loss function for this problem takes the form
where is called a non-negative cost function and is the goal function we need to learn from the data. The best is always chosen as the function that minimizes the empirical risk
over a function class. If this minimizer is denoted by , the best classifier is regarded as
Here are some examples of the cost function ; their standard forms are collected in the sketch given after this list.
(i) Square cost: , suggested in Li and Yang (2003).
(ii) Hinge cost: , which is used in the support vector machine; see Hearst et al. (1998).
(iii) Smoothed hinge cost. A problem with the hinge cost is that direct optimization is difficult due to the discontinuity of its derivative at . Rennie and Srebro (2005) proposed a smooth version of the hinge:
(iv) Modified square cost: , used in Zhang and Oles (2001).
(v) Logistic cost: , applied in Friedman et al. (2000).
(vi) Exponential cost: , which is used in the famous AdaBoost; see Freund and Schapire (1997).
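The sketch below collects the standard forms of costs (i), (ii), (v) and (vi) as they appear in the cited references, together with a per-cell grid search mirroring (2); the plug-in classifier is then the sign of the fitted score.

```python
import numpy as np

# Convex surrogate costs phi(u), evaluated at the margin u = y * f(x), y in {-1, +1}.
costs = {
    "square":      lambda u: (1.0 - u) ** 2,
    "hinge":       lambda u: np.maximum(0.0, 1.0 - u),
    "logistic":    lambda u: np.log1p(np.exp(-u)),
    "exponential": lambda u: np.exp(-u),
}

def surrogate_cell_fit(y_cell, phi, grid=np.linspace(-3, 3, 121)):
    """Per-cell constant fit under a convex surrogate cost: pick the value z
    on a grid minimizing the average cost of the margins y*z in the cell."""
    y_cell = np.asarray(y_cell, dtype=float)
    risks = [np.mean(phi(y_cell * z)) for z in grid]
    return float(grid[int(np.argmin(risks))])

def plug_in_classifier(z):
    """The induced classifier is the sign of the fitted score."""
    return 1 if z >= 0 else -1
```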
It is obvious that satisfies Assumption 4. In order to verify Assumptions 1–3, we impose the following three conditions on :
(a) is convex.
(b) is piecewise differentiable, and the absolute value of each piece is upper bounded by a polynomial of .
(c) for some .
These conditions are obviously satisfied by . We will discuss the case of separately later, due to its rapid growth. When Condition (a) holds, we have
for any two functions and . This shows that Assumption 1 is true. With a slight abuse of notation, let
be the maximal value of the piecewise function . Then, by the Lagrange mean value theorem,
This shows that Assumption 2 holds with the Lipschitz constant . Finally, Condition (c) ensures that Assumption 3 holds with
If we take , all the coefficients in Theorem 2:
(17) |
diverge no faster than a polynomial of . These arguments complete the verification of Assumptions 1–3 for the cost functions . For the exponential cost , some direct calculations imply that Assumptions 1–3 also hold, and the coefficients in (17) are if we take .
Generally speaking, the minimizer in (5) is not unique in this classification case; see Lemma 3 in Lugosi and Vayatis (2004). Therefore, it is meaningless to discuss the statistical consistency of forests as shown in Corollary 6 or 7. However, we can establish weak consistency for Mondrian forests as follows if the cost function is chosen to be or .
Proposition 14
For cost function or , the minimizer defined in (5) always exists. Meanwhile,
6.6 Nonparametric density estimation
Assume is a continuous random vector defined on with density function . Our interest lies in the estimation of the unknown function based on an i.i.d. sample of , namely the data . Note that any density estimator has to satisfy two shape requirements: it must be non-negative and integrate to one, namely and . These two restrictions can be removed by a transformation. We have the decomposition
where is a real function on . The above relationship helps us focus on estimating only, which will be a statistical learning problem without constraint. On the other hand, this transformation introduces a model identification problem since and give the same density function, where . To solve this problem, we impose an additional requirement , which guarantees a one-to-one map between and .
In the case of density estimation, the scaled log-likelihood for any function based on the sampled data is
and its population version is
With a slight modification, Mondrian forests can also be applied to this problem. Recall the partition of the -th Mondrian with stopping time satisfies
For each cell , a constant is used as the predictor of in this small region. Thus, the estimator of based on a single tree has the form
where coefficients are obtained by minimizing the empirical risk function,
Since the objective function above is differentiable with respect to the parameters , it is not difficult to show that the corresponding minimum can indeed be achieved. To meet our identification restriction, the estimator based on a single tree is revised as
Finally, by applying the ensemble technique again, the estimator of based on the Mondrian forest is
(18) |
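A single-tree fit can be sketched as follows, assuming the penalization is the usual log-partition term (the logarithm of the integral of exp(g)); under that assumption the per-leaf optimum has a closed form, namely the log of the empirical frequency divided by the leaf volume, followed by centering so that g integrates to zero.

```python
import numpy as np

def density_tree_fit(X, leaves, eps=1e-8):
    """Piecewise-constant log-density fit on one Mondrian partition, assuming
    a log-partition penalty. `leaves` is a list of cells, each a list of
    (low, high) intervals on [0, 1]^d."""
    volumes = np.array([np.prod([hi - lo for lo, hi in cell]) for cell in leaves])
    counts = np.array([
        sum(all(lo <= xi <= hi for xi, (lo, hi) in zip(x, cell)) for x in X)
        for cell in leaves
    ], dtype=float)
    freqs = counts / max(len(X), 1)
    g = np.log(np.maximum(freqs, eps) / volumes)   # per-leaf log-density values
    g = g - np.sum(g * volumes)                    # enforce integral(g) = 0 over [0,1]^d
    return g
```

The corresponding density estimate from one tree is then exp(g) normalized to integrate to one, and the per-leaf values are combined across trees as in (18).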
Next, we analyze the theoretical properties of . The only difference between (18) and previous estimators is that (18) is obtained by using an additional penalty, namely
where . Our theoretical analysis is revised as follows. Let the pseudo loss function be with . It is obvious that Assumption 1 holds for . Assumption 2 is satisfied with and Assumption 3 is satisfied with . After choosing and following arguments similar to those in Theorem 2, we have
(19) |
where () and ( is the center of cell ), and is the coefficient in Theorem 2. Therefore, it remains to bound .
Lemma 15
For any with and ,
Choosing , the combination of (19) and Lemma 15 implies the regret function bound as follows:
where is sufficiently large.
Let be the supremum norm of a function. To obtain the consistency rate of our density estimator, we replace Assumption 5 with the fact below.
Lemma 16
Suppose the true density is bounded away from zero and infinity, namely and . For any function with and , we have
Then Lemma 16 immediately implies the following consistency result.
Proposition 17
Suppose the true density is bounded away from zero and infinity, namely . If and the true function with satisfying , there is such that for ,
7 Conclusion
In this paper, we proposed a general framework for Mondrian forests, which can be used in many statistical and machine learning problems. These applications include, but are not limited to, least squares estimation, generalized regression, density estimation, quantile regression, and binary classification. Meanwhile, we studied upper bounds on the regret/risk function and the statistical consistency of the proposed estimators, and showed how to use these results in the specific applications listed above. Future work could study the asymptotic distribution of this general class of Mondrian forests, as suggested by Cattaneo et al. (2023).
References
- Adams and Fournier (2003) Robert A Adams and John JF Fournier. Sobolev spaces. Elsevier, 2003.
- Arlot and Genuer (2014) Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.
- Athey et al. (2019) Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
- Baptista et al. (2024) Ricardo Baptista, Eliza O’Reilly, and Yangxinyu Xie. Trim: Transformed iterative mondrian forests for gradient-based dimension reduction and high-dimensional regression. arXiv preprint arXiv:2407.09964, 2024.
- Biau (2012) Gérard Biau. Analysis of a random forests model. The Journal of Machine Learning Research, 13(1):1063–1095, 2012.
- Biau et al. (2008) Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(9), 2008.
- Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
- Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
- Cattaneo et al. (2023) Matias D Cattaneo, Jason M Klusowski, and William G Underwood. Inference with mondrian random forests. arXiv preprint arXiv:2310.09702, 2023.
- Freund and Schapire (1997) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Friedman et al. (2000) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, pages 337–374, 2000.
- Györfi et al. (2002) László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
- Hearst et al. (1998) Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
- Huang (1998) Jianhua Z Huang. Functional anova models for generalized regression. Journal of multivariate analysis, 67(1):49–71, 1998.
- Huber (1992) Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer, 1992.
- Klusowski (2021) Jason M Klusowski. Universal consistency of decision trees in high dimensions. arXiv preprint arXiv:2104.13881, 2021.
- Knight (1998) Keith Knight. Limiting distributions for l1 regression estimators under general conditions. Annals of statistics, pages 755–770, 1998.
- Kohler and Langer (2021) Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
- Kosorok (2008) Michael R Kosorok. Introduction to empirical processes and semiparametric inference. Springer, 2008.
- Lakshminarayanan et al. (2014) Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. Advances in neural information processing systems, 27, 2014.
- Li and Yang (2003) Fan Li and Yiming Yang. A loss function analysis for classification methods in text categorization. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 472–479, 2003.
- Liaw et al. (2002) Andy Liaw, Matthew Wiener, et al. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
- Lugosi and Vayatis (2004) Gábor Lugosi and Nicolas Vayatis. On the bayes-risk consistency of regularized boosting methods. The Annals of statistics, 32(1):30–55, 2004.
- Mourtada et al. (2020) Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. Minimax optimal rates for Mondrian trees and forests. The Annals of Statistics, 48(4):2253 – 2276, 2020.
- Mourtada et al. (2021) Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. Amf: Aggregated mondrian forests for online learning. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(3):505–533, 2021.
- Rennie and Srebro (2005) Jason DM Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling, volume 1. AAAI Press, Menlo Park, CA, 2005.
- Roy and Teh (2008) Daniel M Roy and Yee W Teh. The mondrian process. In Advances in neural information processing systems, pages 1377–1384, 2008.
- Roy (2011) Daniel Murphy Roy. Computability, inference and modeling in probabilistic programming. PhD thesis, Massachusetts Institute of Technology, 2011.
- Schmidt-Hieber (2020) Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 48(4):1875–1897, 2020.
- Scornet et al. (2015) Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.
- Sen (2018) Bodhisattva Sen. A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University, 11:28–29, 2018.
- Shalev-Shwartz and Ben-David (2014) S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. ISBN 9781107057135.
- Stone (1986) Charles J Stone. The dimensionality reduction principle for generalized additive models. The Annals of Statistics, pages 590–606, 1986.
- Stone (1994) Charles J Stone. The use of polynomial splines and their tensor products in multivariate function estimation. The annals of statistics, 22(1):118–171, 1994.
- Zhang and Wang (2009) Heping Zhang and Minghui Wang. Search for the smallest random forest. Statistics and Its Interface, 2(3):381, 2009.
- Zhang and Oles (2001) Tong Zhang and Frank J Oles. Text categorization based on regularized linear classification methods. Information retrieval, 4:5–31, 2001.
Appendix A Proofs
This section contains proofs of theoretical results in the paper. Several useful preliminaries and notations are introduced first. Meanwhile, the constant in this section is always a positive number and will change from line to line in order to simplify notations.
Definition A1 (Blumer et al. (1989))
Let be a Boolean function class in which each is binary-valued. The growth function of is defined by
for each positive integer .
Definition A2 (Györfi et al. (2002))
Let and . Let be a class of functions . An -cover of on is a finite set of functions satisfying
Then, the -cover number of on , denoted by , is the minimal size of an -cover of on . If there exists no finite -cover of , then the above covering number is defined as .
To bound the generalization error, the notion of a VC class will be used in our proof. To make this supplementary material self-contained, we first introduce some basic concepts and facts about the VC dimension; see Shalev-Shwartz and Ben-David (2014) for more details.
Definition A3 (Kosorok (2008))
The subgraph of a real function is a subset of defined by
where is an abstract set.
Definition A4 (Kosorok (2008))
Let be a collection of subsets of the set and be an arbitrary set of points. Define that picks out a certain subset of if can be expressed as for some . The collection is said to shatter if each of subsets can be picked out.
Definition A5 (Kosorok (2008))
The VC dimension of the real function class , where each is defined on , is the largest integer such that a set of points in with size is shattered by . In this paper, we use to denote the VC dimension of .
Proof of Theorem 2. By Assumption 1, the convexity of the risk function implies
Therefore, we only need to consider the excess risk of a single tree in the following analysis.
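In standard notation (with the forest being the average of the individual tree estimators), the convexity step is the Jensen-type inequality

```latex
R\!\Bigl(\tfrac{1}{M}\textstyle\sum_{m=1}^{M} \widehat f^{(m)}\Bigr) - R(f)
  \;\le\; \frac{1}{M}\sum_{m=1}^{M} \Bigl( R\bigl(\widehat f^{(m)}\bigr) - R(f) \Bigr).
```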
In fact, our proof is based on the following decomposition:
where I relates to the variance term of the Mondrian tree, II is the approximation error of the Mondrian tree to , and III measures the error incurred when the empirical loss is used to approximate the theoretical one.
Analysis of Part I. Define two classes first.
Thus, the truncated function class of is given by
where the threshold . Then, the part can be bounded as follows.
(A.1) |
where . Next, we need to find the upper bound of respectively.
Let us consider first. Make the decomposition of as below.
The part can be bounded by considering the covering number of the function class
where denotes the number of regions in the partition that is constructed by the truncated Mondrian process with stopping time . Therefore, is a deterministic number once is given. For any , recall the definition of the covering number of , namely shown in Definition A2. Now, we suppose
is a -cover of class in space, where is equipped with norm and . Without loss of generality, we can further assume since is upper bounded by . Otherwise, we consider the truncation of : . According to Assumption 2, we know for any and ,
where . The above inequality implies that
for any and . Therefore, we know is a -cover of class in space, where . In other words, we have
(A.2) |
Note that is a VC class, as shown in Lemma A6 below. Furthermore, we know the functions in are upper bounded by . Therefore, we can bound the RHS of (A.2) by using Theorem 7.12 in Sen (2018):
(A.3) |
where the constant is universal. On the other hand, it is not difficult to show
(A.4) |
Thus, the combination of (A.4), (A.3) and (A.2) implies
(A.5) |
for each . Note that the class has an envelop function satisfying by Assumption 3. Construct a series of independent Rademacher variables sharing with the same distribution . Then, the symmetrization technique (see Lemma 3.12 in Sen (2018)) and the Dudley entropy integral (see (41) in Sen (2018)) and (A.5) imply
(symmetrization technique) | ||||
(Dudley’s entropy integral) | ||||
(A.6) |
where we use the fact that is independent of the data set and is universal. Without loss of generality, we can assume ; otherwise, just set as the new upper bound of the function class . Therefore, (A.6) also implies
(A.7) |
The next task is to find the VC dimension of class for each positive integer , which is summarized below.
Lemma A6
For each integer , .
Proof Recall two defined classes:
For any Mondrian tree with the form
(A.8) |
we also use notation to denote tree (A.8) later.
In order to bound , we need to consider the related Boolean class:
where if and otherwise. Recall the VC dimension of , denoted by , is the largest integer satisfying ; see Definition A5. Next we focus on bounding for each positive integer .
Define the partition set generated by :
and the maximal partition number of points cut by :
(A.9) |
Since each Mondrian tree takes a constant value on each leaf , there are at most ways to pick out any fixed points that lie in the same leaf. In other words,
(A.10) |
According to the tree structure in (A.8), the growth function of satisfies
(A.11) |
Therefore, the remaining task is to bound only. In fact, we can use induction to prove
(A.12) |
where the constant only depends on . The arguments are given below.
When , the partition generated by is . Thus, and (A.12) is satisfied obviously in this case.
Suppose (A.12) is satisfied with . Now we consider the case for . Here, we need to use two facts. First, any must be grown from a by splitting one of its leaves; Second, any can grow to be another by partitioning one of its leaves. In conclusion,
(A.13) |
Suppose maximizes . We divide into groups:
where . Meanwhile, we write the partition for that is cut by as
(A.14) |
where is a permutation of and the cardinalities of the above sets are written as . Note that both and are determined once the index is fixed.
If we have generated a Mondrian tree with leaves from its parent in , the corresponding partition for can be obtained by following the two steps below:
1. Partition one of the sets in (A.14) by using a hyperplane , where .
2. Keep the other sets in (A.14) unchanged.
Denote by the number of partitions of obtained by following the above process. Then, the Sauer–Shelah lemma (see, for example, Lemma 6.10 in Shalev-Shwartz and Ben-David (2014)) tells us
Since each can produce at most new partitions of based on (A.14), (A.13) and imply
Therefore, (A.12) also holds for the case of .
According to (A.11) and (A.12), we finally have
Solving the inequality
by using the basic inequality with yields
(A.15) |
where the constant depends on only. This completes the proof.
Therefore, we know from Lemma A6 and (A.7) that
By the basic inequality , from the above inequality we have
(A.16) |
Next, from Proposition 2 in Mourtada et al. (2020), we know . Finally, we have the following upper bound for
(A.17) |
Then, we bound the second part of by following the arguments below.
(A.18) |
where in the third line we use the Cauchy-Schwarz inequality and in the last line we use Assumption 3 and the tail probability of given in Assumption 4.
Finally, we end the Analysis of Part I by bounding . In fact, this bound can be processed as follows.
(A.19) |
Thus, we only need to find the upper bound of . By Assumption 4, we know
(A.20) |
when is sufficiently large. Therefore, (A.19) and (A.20) imply
(A.21) |
Analysis of Part II. Recall
which relates to the empirical approximation error of Mondrian forests. First, suppose the first truncated Mondrian process with stopping time is given, denoted by . Under this restriction, the partition of is determined and not random, which is denoted by . Let
where and denotes the center of cell . We need to highlight that depends on . Since is obtained by
we know
(A.22) |
once . At this point, we consider two cases about :
Case I: . This case is trivial because we already have .
Case II: . In this case, (A.22) implies
Let be the diameter of the cell that lies in. By Assumption 2, the above inequality further implies
(A.23) |
where the second line holds because is a -smooth function; we use to obtain the fifth line and the Cauchy-Schwarz inequality in the sixth line.
Therefore, Case I and (A.23) in Case II imply that
(A.24) |
Taking expectation on both sides of (A.24) w.r.t. leads to
(A.25) |
where and in the second line we use the fact that the function is concave. For any fixed , we can bound by using Corollary 1 in Mourtada et al. (2020). In detail, we have
(A.26) |
Thus, the combination of (A.25) and (A.26) implies that
(A.27) |
Analysis of Part III. This part can be bounded by the central limit theorem. Since the supremum norm of is bounded by , we know from Assumption 3 that
with . Thus, is an envelope function of . Note that a single function constitutes a Glivenko-Cantelli class and has VC dimension 1. Thus, applying equation (80) in Sen (2018) gives
(A.28) |
for some universal .
Proof of Corollary 7. The proof starts from the observation that our class can be used to approximate any general function. Since , by a density argument we know can be approximated by a sequence of continuous functions in the sense. Thus, we may assume is continuous. Define the logistic activation . For any , by Lemma 16.1 in Györfi et al. (2002) there is with and such that
(A.29) |
Since is continuously differentiable, we know , where depends on only. Now we fix such and make the following decomposition:
Parts I and III can be upper bounded by following an analysis similar to that of Theorem 2. Therefore, under the assumptions of our theorem, both parts converge to zero as . Next, we consider Part II. Note that
where in the second line we use Assumption 5. Finally, we only need to consider the behavior of term as . This can be done by using the analysis of Part II in the proof for Theorem 2. Taking in (A.27), we have
which goes to zero as increases. In conclusion, we have proved that
The above inequality and Assumption 5 show that is consistent for the general function .
Proof of Theorem 8. Based on Assumption 1, we only need to consider the regret function for . For any , by the definition of we know
(A.30) |
where is the estimator based on the process .
On the other hand, we have the decomposition below
(A.31) |
Firstly, we bound Part . Recall , which is defined below (A.1). Make the decomposition of as follows.
(A.32) |
The key for bounding is to find the upper bound of . By the definition of and Assumption 3, we know if occurs
Therefore, when happens we have
Following arguments that we used to bound in the Proof of Theorem 2, we know
(A.33) |
Next, the way for bounding in (A.32) is similar to that we used to bound in the proof of Theorem 2. Namely, we have
(A.34) |
Secondly, we use (A.30) to bound Part in (A.31). By the definition of , for any we have
Similar to the Proof of Theorem 2, the above inequality implies
(A.35) |
Since (A.35) holds for all , taking , inequality (A.35) further implies
(A.36) |
where is caused by the probability tail of .
Thirdly, we consider Part . The argument for this part is the same as that used to obtain (A.28). Namely, we have
(A.37) |
where is universal and does not depend on .
Proof of Proposition 13. For any function , by Assumption 4 we know the event
happens with probability larger than , where is a large number and . Denote by the least squares forest estimator of Section 6.1. Now, we make the decomposition below.
When occurs, it can be seen that if for some large . On the other hand, we have the following upper bounds for the two risk functions:
The combination of the above inequalities leads to
By Corollary 7, we already know is consistent for any general . Therefore, is also consistent.
Proof of Proposition 14. Let us show the existence of in (5). In fact, Lugosi and Vayatis (2004) tell us that one of the minimizers is
when is a differentiable strictly convex, strictly increasing cost function satisfying . Let be a given small number. Next, we need to show there is such that
(A.38) |
when the cost function is or .
First, we consider . Since , we have
Note that . We can always find an infinitely differentiable function such that . This completes the proof of (A.38).
Second, we consider . The argument above for does not work due to the rapid growth of . In this case, we need to define an infinitely differentiable function
Based on this mollifier, we consider the weighted average function of :
where we define for any and . We know is an infinitely differentiable function on and
Importantly, some simple analysis implies
(A.39) |
In fact, we next show one of can be defined as satisfying (A.38). By the dominated convergence theorem, we have
Since functions are upper bounded by , we can find a such that
For any , we have
From (A.39), choose such that
Putting the above inequalities together, we know
Finally, for some since it is infinitely differentiable. Thus, can be taken as in (A.38).
On the other hand, by Remark 5 we have
Thus, there exist such that for any ,
The proof is completed by combining the above inequality with (A.38).
Proof of Lemma 15. Note that . Define a real function as follows:
Thus, we have and . Later, it will be convenient to use the function in the analysis of this penalty function. Since both and are upper bounded, we know is differentiable and its derivative is
(A.40) |
Define a continuous random vector with the density function
(A.41) |
From (A.40) and (A.41), we know
(A.42) |
On the other hand, the Lagrange mean value theorem implies
(A.43) |
where . Thus, later we only need to consider the last term of (A.43). Since , we know from (A.43) that
(A.44) |
where follows the uniform distribution on and is independent of . By further calculation, we have
From (A.26), we already know . Thus, the above inequality implies
This completes the proof.
Proof of Lemma 16. First, we calculate the term , where . With some calculation, it is not difficult to know
(A.45) |
where is a continuous random vector in . Furthermore, has the density with . Since and , we can assume without loss of generality that . This results in
(A.46) |
Let follow the uniform distribution on . Equation (A.46) implies
(A.47) |
With the same argument, we also have
(A.48) |
The combination of (A.45), (A.47) and (A.48) shows that
(A.49) |
for some universal and any . Finally, by Taylor expansion, we have
for some . Without loss of generality, we can assume that . Thus, the second derivative can be bounded by using (A.49) if we take . Meanwhile, the first derivative satisfies since achieves the minimum of . Based on the above analysis, we have
for some universal . This completes the proof.