Stability and L2-penalty in Model Averaging
Abstract
Model averaging, which integrates available information by averaging over potential models, has received much attention in the past two decades. Although various model averaging methods have been developed, little of the literature examines the theoretical properties of model averaging from the perspective of stability, and the majority of these methods constrain the model weights to a simplex. The aim of this paper is to introduce stability from statistical learning theory into model averaging. Accordingly, we define stability, the asymptotic empirical risk minimizer, generalization, and consistency for model averaging and study the relationships among them. Our results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error. We also propose an L2-penalty model averaging method that does not restrict the model weights and prove that it is stable and consistent. To reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of the model weights based on estimation errors. A Monte Carlo simulation and an illustrative application demonstrate the usefulness of the proposed method.
keywords:
Model averaging, Stability, Mean squared prediction error, L2-penalty
Affiliation: School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
1 Introduction
In practical applications, data analysts usually determine a series of models, based on exploratory data analysis and empirical knowledge, to describe the relationship between the variables of interest and related variables, but how to use these models to produce good results is a more important problem. It is very common to select one model using some data-driven criterion, such as AIC (Akaike, 1998), BIC (Schwarz, 1978), Mallows' Cp (Mallows, 1973), or FIC (Hjort and Claeskens, 2003). However, the properties of the corresponding results depend not only on the selected model, but also on the randomness of the selection. An alternative to model selection is to construct an estimator that compromises across a set of competing models. Indeed, statisticians find that they can usually obtain better and more stable results by combining information from different models. This process of combining multiple models into estimation is known as model averaging. To date, a large literature has developed on Bayesian model averaging (BMA) and frequentist model averaging (FMA). Fragoso et al. (2018) reviewed the relevant literature on BMA. In this paper, we focus on FMA. In the past decades, model averaging methods have been applied in various fields. Wan and Zhang (2009) examined applications of model averaging in tourism research. Zhang and Zou (2011) applied model averaging to grain production forecasting in China. Moral-Benito (2015) reviewed the literature on model averaging with special emphasis on its applications to economics. The key to FMA lies in how to select the model weights. Common weight selection methods include: 1) methods based on information criteria, such as smoothed AIC and smoothed BIC in Buckland et al.
(1997); 2) Mallows model averaging (MMA), proposed by Hansen (2007) (see also Wan et al., 2010), modified by Liu and Okui (2013) to accommodate heteroscedasticity, and improved by Liao and Zou (2020) for small sample sizes; 3) the FIC criterion, proposed by Hjort and Claeskens (2003), which has been extended to the generalized additive partial linear model (Zhang and Liang, 2011), the Tobit model with a non-zero threshold (Zhang et al., 2012), and weighted composite quantile regression (Xu et al., 2014); 4) adaptive methods, such as Yang (2001) and Yuan and Yang (2005); 5) the OPT method (Liang et al., 2011); 6) cross-validation methods, such as jackknife model averaging (JMA) (Hansen and Racine, 2012), Zhang et al. (2013), Gao et al. (2016), Zhang and Zou (2020), and Li et al. (2021), among others.
In statistical learning theory, stability measures an algorithm's sensitivity to perturbations in the training set and is an important tool for analyzing the generalization and learnability of algorithms. Bousquet and Elisseeff (2002) introduced four kinds of stability (hypothesis stability, pointwise hypothesis stability, error stability, and uniform stability), and showed that stability is a sufficient condition for learnability. Kutin and Niyogi (2002) introduced several weaker variants of stability and showed how they suffice to obtain generalization bounds for algorithms that are stable in their sense. Rakhlin et al. (2005) and Mukherjee et al. (2006) discussed the necessity of stability for learnability under the assumption that uniform convergence is equivalent to learnability. Further, Shalev-Shwartz et al. (2010) showed that uniform convergence is in fact not necessary for learning in the general learning setting, and that stability plays a key role there that has nothing to do with uniform convergence. In the general learning setting with a differential privacy constraint, Wang et al. (2016) studied some intricate relationships between privacy, stability, and learnability.
Although various model averaging methods have been proposed, there is little literature on their theoretical properties from the perspective of stability, and the majority of these methods are concerned only with whether the resultant estimator leads to a good approximation of the minimum of a given target when the model weights are constrained to a simplex. Considering that it remains unknown whether stability can ensure good theoretical properties of the model averaging estimator, and that in some cases it may be better to approximate the global minimum than a local one, our attempts in this paper are to study stability in model averaging and to answer whether the resultant estimator can lead to a good approximation of the minimum of a given target when the model weights are unrestricted.
For the first attempt, we introduce the concept of stability from statistical learning theory into model averaging. Stability concerns how much an algorithm's output varies when the sample set changes a little. Shalev-Shwartz et al. (2010) discussed the relationship among the asymptotic empirical risk minimizer (AERM), stability, generalization, and consistency, but their conclusions cannot be directly applied to model averaging. Therefore, we extend the relevant definitions and conclusions of Shalev-Shwartz et al. (2010) to model averaging. The results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error (MSPE). For the second attempt, we find that, for MMA and JMA, extreme weights tend to appear under the influence of correlation among models when the model weights are unrestricted, resulting in poor performance of the model averaging estimator. Therefore, we should not simply remove the weight constraint and directly use an existing model averaging method. Instead, similar to ridge regression in Hoerl and Kennard (1970), we introduce an L2-penalty on the weight vector in MMA and JMA, and call the resulting methods Ridge-Mallows model averaging (RMMA) and Ridge-jackknife model averaging (RJMA), respectively. Like Theorem 4.3 in Hoerl and Kennard (1970), we discuss the rationale for introducing the L2-penalty. We also prove the stability and consistency of the proposed method, where consistency means that the model averaging estimator can asymptotically minimize the MSPE when the model weights are unrestricted. In the context of shrinkage estimation, Schomaker (2012) discussed the impact of tuning parameter selection, and pointed out that a weighted average of shrinkage estimators with different tuning parameters can improve the overall stability, predictive performance, and standard errors of shrinkage estimators.
Hence, like Schomaker (2012), we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of model weights based on estimation errors.
The remainder of this paper is organised as follows. In Section 2, we explain the relevant notations, give the definitions of consistency and stability, and discuss their relationship. In Section 3, we propose RMMA and RJMA methods, and prove that they are stable and consistent. Section 4 conducts the Monte Carlo simulation experiment. Section 5 applies the proposed method to a real data set. Section 6 concludes. The proofs of lemmas and theorems are provided in the Appendix.
2 Consistency and Stability for Model Averaging
2.1 Model Averaging
We assume that is a simple random sample from distribution , where is the -th observation of response variable, and is the -th observation of covariates. Let be an observation from distribution and independent of .
In model averaging, approximating models are selected first in order to describe the relationship between response variable and covariates. We assume that the hypothesis spaces of approximating models are
where consists of some elements of , and is a given function set. For example, in MMA, to estimate , we take
where represents the dimension of the vector. For the -th approximating model, a proper estimation method is selected, and , the estimator of , is obtained based on and . Then, the hypothesis space of model averaging is defined as follows:
where is a given weight space, and is a given function of weight vector and estimators of approximating models. In MMA, we take
as the model average estimator of . An important problem with model averaging is the choice of model weights. Here, the estimator of weight vector is obtained based on and a proper weight selection criterion that makes be optimal in a certain sense.
The selection of and is closely related to the definition of the loss function. Let be a real-valued loss function defined on , where is the value space of . Then, the risk function is defined as follows:
which is an MSPE under the sample set S and weight vector .
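Since the displayed formulas above were lost in extraction, the following sketch illustrates, under plain least-squares conventions, how a model averaging prediction and its empirical squared-error risk can be computed; the names `model_average_predict` and `empirical_risk` are illustrative, not from the paper.

```python
import numpy as np

def model_average_predict(w, fitted_models, X):
    """Stack the M candidate models' predictions column-wise and combine with w."""
    preds = np.column_stack([m.predict(X) for m in fitted_models])  # shape (n, M)
    return preds @ w

def empirical_risk(w, fitted_models, X, y):
    """Empirical mean squared prediction error of the weighted combination on (X, y)."""
    resid = y - model_average_predict(w, fitted_models, X)
    return float(np.mean(resid ** 2))
```

Any objects with a `predict` method (here standing in for the estimators of the approximating models) can be combined this way.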
2.2 Related Concepts
In this paper, we mainly discuss whether can approximate the smallest possible risk . If has such a property, we say that is consistent. For fixed , Shalev-Shwartz et al. (2010) defined the stability and consistency of , and discussed their relationship. Obviously, for model averaging, we need to pay more attention to the stability and consistency of the weight selection. We note that the relevant conclusions of Shalev-Shwartz et al. (2010) cannot be directly applied to model averaging because depends on . Therefore, we extend the relevant definitions and conclusions to model averaging. The following is the definition of consistency:
Definition 1 (Consistency).
If there is a sequence of constants such that and satisfies
then is said to be consistent with rate .
In statistical learning theory, stability concerns how much the algorithm's output varies when changes a little. "Leave-one-out (Loo)" and "Replace-one (Ro)" are two common tools used to evaluate stability. Loo considers the change in the algorithm's output after removing an observation from , and Ro considers such a change after replacing an observation in with an observation that is independent of S. The corresponding notions of stability are called Loo stability and Ro stability, respectively. Here we give the formal definitions of Loo stability and Ro stability. To this end, we first give the definition of algorithm symmetry:
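As a concrete (hypothetical) illustration of the replace-one idea, the helper below re-runs a generic weight selection procedure after swapping out one observation and measures the resulting change in loss at a fresh point; all names here are ours, not the paper's.

```python
import numpy as np

def ro_gap(select_weights, predict, X, y, i, x_rep, y_rep, x_new, y_new, loss):
    """Replace-one gap: how much the loss at a fresh point (x_new, y_new)
    changes when observation i of the training set is replaced by (x_rep, y_rep)."""
    w = select_weights(X, y)           # weights from the original sample S
    X2, y2 = X.copy(), y.copy()
    X2[i], y2[i] = x_rep, y_rep        # the perturbed sample S^i
    w2 = select_weights(X2, y2)
    return abs(loss(predict(w, x_new), y_new) - loss(predict(w2, x_new), y_new))
```

A stable procedure keeps this gap small (on average over i and the fresh point) as the sample size grows.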
Definition 2 (Symmetry).
If the algorithm’s output is not affected by the order of the observations in , the algorithm is symmetric.
Now let be the sample set after removing the -th observation from , be the estimator of based on and , be the estimator of weight vector based on and , and , where . We define Loo stability as follows:
Definition 3 (PLoo Stability).
If there is a sequence of constants such that and satisfies
then is Predicted-Loo(PLoo) stable with rate ; If and are symmetric, it only has to satisfy
Definition 4 (FLoo Stability).
If there is a sequence of constants such that and satisfies
then is Fitted-Loo(FLoo) stable with rate ; If and are symmetric, it only has to satisfy
Let be the sample set S with the -th observation replaced by , be the estimator of based on and , and be the estimator of weight vector based on and , where is from distribution and independent of . Let , then
where . Therefore, we define Ro stability as follows:
Definition 5 (Ro Stability).
If there is a sequence of constants such that and satisfies
then is Ro stable with rate ; If and are symmetric, it only has to satisfy
Before discussing the relationship between stability and consistency, we give the definitions of AERM and generalization. The empirical risk function is defined as follows:
Definition 6 (AERM).
If there is a sequence of constants such that and satisfies
then is an AERM with rate .
Vapnik (1998) proved some theoretical properties of the empirical risk minimization principle. However, when the sample size is small, the empirical risk minimizer tends to over-fit. Therefore, the structural risk minimization principle was proposed in Vapnik (1998), and a method satisfying this principle is usually an AERM. Shalev-Shwartz et al. (2010) also discussed the deficiency of the empirical risk minimization principle and the importance of AERMs.
Definition 7 (Generalization).
If there is a sequence of constants such that and satisfies
then generalizes with rate .
In statistical learning theory, generalization refers to the performance of the concept learned by models on unknown samples. It can be seen from Definition 7 that the generalization of describes the difference between using to fit the training set and predict the unknown sample.
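Because the displayed formulas in Definitions 1, 6, and 7 did not survive extraction, the following hedged reconstruction records the standard forms these notions take in the framework of Shalev-Shwartz et al. (2010), adapted to a weight estimator; the paper's exact quantities and rates may differ.

```latex
% \hat{w}_S: weight estimator from sample S of size n;
% R, \hat{R}_S: risk (MSPE) and empirical risk; \varepsilon_n \to 0.
\text{Consistency:}\quad
  \mathbb{E}_S\!\left[R(\hat{w}_S) - \inf_{w \in \mathcal{W}} R(w)\right] \le \varepsilon_n,
\qquad
\text{AERM:}\quad
  \mathbb{E}_S\!\left[\hat{R}_S(\hat{w}_S) - \inf_{w \in \mathcal{W}} \hat{R}_S(w)\right] \le \varepsilon_n,
\]
\[
\text{Generalization:}\quad
  \mathbb{E}_S\!\left[\bigl|R(\hat{w}_S) - \hat{R}_S(\hat{w}_S)\bigr|\right] \le \varepsilon_n,
\qquad
\text{Ro stability:}\quad
  \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\!\left[\bigl|\ell(\hat{w}_{S^{(i)}}; z_i') - \ell(\hat{w}_S; z_i')\bigr|\right] \le \varepsilon_n,
```

where \(S^{(i)}\) denotes the sample with the \(i\)-th observation replaced by an independent copy \(z_i'\).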
2.3 Relationship between Different Concepts
Note that for any ,
so we give the following theorem to illustrate the relationship between Loo stability and Ro stability:
Theorem 2.1.
If has any two of FLoo stability, PLoo stability, and Ro stability with rate , then it has all three stabilities with rate .
Shalev-Shwartz et al. (2010) emphasized that Ro stability and Loo stability are in general incomparable notions, but Theorem 2.1 shows that they are closely related.
By definitions of generalization and Ro stability, we have
and then we give the following theorem to illustrate the equivalence of Ro stability and generalization:
Theorem 2.2.
have Ro stability with rate if and only if generalizes with rate .
Theorem 2.2 shows that stability is an important property of weight selection criterion, which can ensure that the corresponding estimator has good generalization performance.
Let satisfy . Note that
so we give the following theorem to illustrate the relationship between stability and consistency:
Theorem 2.3.
If is an AERM and has Ro stability with rate , and satisfies
then is consistent with rate .
3 L2-penalty Model Averaging
In most of the existing literature on model averaging, the theoretical properties are explored under the weight set . From Definition 1, it is seen that, even if the corresponding weight selection criterion is consistent, such consistency holds only on the subspace of . Therefore, a natural question is whether it is possible to leave the weight space unrestricted, and what happens when we do so. We note that the unrestricted Granger-Ramanathan method obtains the estimator of the weight vector under by minimizing the sum of squared forecast errors from the combination forecast, but it performs poorly compared with some other methods (see Hansen, 2008). On the other hand, in a prediction task, we are more concerned about whether the resulting estimator predicts well, and the estimator that minimizes the MSPE over the full space will most likely outperform the estimator that minimizes the MSPE over a subspace. Therefore, it is necessary to develop new research ideas.
3.1 Model Framework and Estimators
We assume that the response variable and covariates satisfy the following data generating process:
and M approximating models are given by
where is the approximating error of the -th approximating model. We assume that the -th approximating model contains all the considered covariates. To simplify notation, we let throughout the rest of this article, where .
Let , , , . Then, the corresponding matrix form of the true model is , where , and is the design matrix of the -th approximating model. When there is an approximating model such that , the true model is included in the -th approximating model, i.e., the model is correctly specified. Unlike Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we do not require that the infimum of (defined in Section 3.2) tends to infinity, and therefore we allow the model to be correctly specified.
Let be the variable selection matrix satisfying and , . Then, the hypothesis spaces of M approximating models are
The least squares estimator of is , .
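The construction just described, variable selection matrices picking out each candidate model's covariates followed by per-model least squares fits, can be sketched as follows; `index_sets` and `fit_candidates` are illustrative names, not the paper's notation.

```python
import numpy as np

def fit_candidates(X, y, index_sets):
    """OLS-fit each candidate model (a subset of columns of X) and return
    full-length coefficient vectors, zero outside the model's covariates."""
    K = X.shape[1]
    betas = []
    for idx in index_sets:
        # least squares on the selected columns only
        b_sub, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        b = np.zeros(K)
        b[idx] = b_sub                 # embed back into the full coordinate system
        betas.append(b)
    return betas
```

Embedding each sub-model estimator into the full K-dimensional space plays the role of the selection matrices in the text.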
3.2 Weight Selection Criterion
Let , , , and . When , Hansen (2007) and Wan et al. (2010) used Mallows criterion to select a model weight vector from hypothesis space and proved that the estimator of weight vector asymptotically minimizes , where , , and . Hansen and Racine (2012) used Jackknife criterion to select a model weight vector from hypothesis space and proved that the estimator of weight vector asymptotically minimizes and , where with and .
Different from Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we focus on whether the model averaging estimator can asymptotically minimize MSPE when the model weights are not restricted. Let . Then, the risk function and the empirical risk function are defined as:
and
respectively. Since Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012) restrict , the corresponding estimators of weight vector do not necessarily asymptotically minimize on . An intuitive way that enables the estimator of weight vector to asymptotically minimize on is to remove the restriction directly.
Let and be the orthogonal matrices satisfying , , where and are the eigenvalues of and respectively. We assume that , and are invertible (this is reasonable under Assumption 3); then
and
From this, we can see that, in order to achieve consistency, and should be good estimators of . However, when the approximating models are highly correlated, the minimum eigenvalues of and may be so small that and are too large, which usually results in extreme weights, where , . Therefore, similar to ridge regression in Hoerl and Kennard (1970), we make the following correction to and :
where is a tuning parameter. The above corrections amount to an L2-penalty on the weight vector. Let , . Then
and
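With the displayed formulas stripped, the following sketch shows one common closed form that such an L2-penalized Mallows-type weight estimator takes over unrestricted weights; the penalty scaling and the name `rmma_weights` are our assumptions, not the paper's exact expressions.

```python
import numpy as np

def rmma_weights(F, y, dims, sigma2, lam):
    """Ridge-Mallows-type weights over unrestricted w: minimize
    ||y - F w||^2 + 2*sigma2*(dims . w) + lam*||w||^2, where column m of F
    holds the fitted values of candidate model m and dims[m] its dimension."""
    M = F.shape[1]
    A = F.T @ F + lam * np.eye(M)      # lam > 0 lifts small eigenvalues of F'F
    b = F.T @ y - sigma2 * np.asarray(dims, dtype=float)
    return np.linalg.solve(A, b)
```

Setting the gradient of the penalized criterion to zero gives the linear system solved above; when candidate models are highly correlated, F'F is near-singular, and the ridge term keeps the solution from producing extreme weights.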
In the next subsection, we discuss the theoretical properties of and .
3.3 Stability and Consistency
Let and be the maximum and minimum eigenvalues of a square matrix respectively, and be the value space of K covariates, where . In order to discuss the stability and consistency of the proposed method, we need the following assumptions:
Assumption 1.
There are constants and such that
Assumption 2.
There is a constant such that , and , a.s.; There is a constant such that , a.s., where is a ball with center and radius , and is the K-dimensional 0 vector.
Assumption 3.
There are constants and such that
and
a.s..
Assumption 4.
, .
Assumption 1 is mild and similar conditions can be found in Zou and Zhang (2009) and Zhao et al. (2020). From , we see that, under Assumption 1, for any m, we have
Shalev-Shwartz et al. (2010) assumed that the loss function is bounded, which is usually not satisfied in traditional regression analysis. We replace this assumption with Assumption 2. Tong and Wu (2017) assumed that is a compact subset of , under which Assumption 2 is obviously true. Assumption 3 requires the minimum eigenvalue of to have a lower bound away from 0 and the maximum eigenvalue to be of order a.s. A similar assumption is used in Liao and Zou (2020). Lemma 3 guarantees the rationality of the assumptions about the eigenvalues of and . Assumption 4 is a mild assumption on the tuning parameter to avoid excessive penalization. In Section 3.4, we search for tuning parameters only in .
Let , , , and . We define
In order for and to better approximate , we naturally want and to be as small as possible. In the following discussion, we call and , and , and the corresponding mean squared errors, estimation variances, and estimation biases, respectively. Obviously, when , the estimation bias equals zero. From Lemma 4 and the proof of Theorem 3.4, we see that, under Assumptions , and are , a.s. On the other hand, the existence of extreme weights may make the performance of and extremely unstable. So the purpose of using the L2-penalty is to reduce the estimation variance by introducing estimation bias, and thus make the performance of the model averaging estimator more stable. Further, we define
Like Theorem 4.3 in Hoerl and Kennard (1970), we give the following theorem to illustrate the rationale for introducing the L2-penalty:
Theorem 3.4.
Let , . Then, 1) when , and ; 2) when , and .
Theorem 3.4 shows that the use of the L2-penalty reduces the estimation variance by introducing estimation bias. However, since is unknown, and are also unknown. In Section 3.4, we use cross-validation to select the tuning parameter . The following theorem shows that and are AERMs.
Theorem 3.5.
Under Assumptions , and are AERM with rate and , respectively.
The following theorem shows that , and have FLoo stability and PLoo stability.
Theorem 3.6.
Under Assumptions , , and have FLoo stability and PLoo stability with rate , and , respectively.
It can be seen from Theorems 2.1, 2.2 and 3.6 that , and have Ro stability and generalization. The following theorem shows that and are consistent, which is a direct consequence of Theorems and .
Theorem 3.7.
Under Assumptions , and have consistency with rate and , respectively.
3.4 Optimal Weighting Based on Cross Validation
Although Theorem 3.4 shows that there are and such that and are better approximations of , and cannot be obtained in practice. Therefore, like Schomaker (2012), we propose an algorithm based on 10-fold cross-validation to obtain the estimator of the weight vector, which is a weighted average of the weight estimators for different . That is, we first select 100 equally spaced grid points on as the candidates for . Then we calculate the estimation error for each candidate by 10-fold cross-validation. Based on this, we remove the candidates with large estimation errors. Last, for the remaining candidates, the estimation errors are used to form a weighted average of the estimators of the weight vector. We summarize our algorithm for RMMA below; a similar algorithm can be given for RJMA.
Algorithm 1 Optimal weighting based on cross validation
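The algorithm body did not survive extraction; the sketch below captures the steps described in Section 3.4 (a grid of tuning parameters, cross-validation scoring, dropping high-error candidates, error-based weighted averaging) under our own naming, with inverse-error weights as one plausible instantiation of the paper's error-based weighting.

```python
import numpy as np

def cv_weighted_weights(X, y, lams, fit_weights, cv_error, keep_frac=0.5):
    """Score each tuning parameter by its (e.g. 10-fold) CV estimation error,
    drop the worse candidates, and average the surviving weight estimators
    with inverse-error weights (one plausible error-based weighting)."""
    errs = np.array([cv_error(X, y, lam) for lam in lams])
    keep = np.argsort(errs)[: max(1, int(keep_frac * len(lams)))]
    inv = 1.0 / errs[keep]
    a = inv / inv.sum()                      # error-based averaging weights
    W = np.column_stack([fit_weights(X, y, lams[j]) for j in keep])
    return W @ a
```

Here `fit_weights(X, y, lam)` stands for the RMMA (or RJMA) weight estimator at a given tuning parameter, and `cv_error` for its 10-fold cross-validated estimation error; both are supplied by the caller.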
4 Simulation Study
In this section, we conduct simulation experiments to demonstrate the finite-sample performance of the proposed method. Similar to Hansen (2007), we consider the following data generating process:
where are the model parameters, , , and . We set , , and with , and , where the population . For the homoskedastic simulation we set , while for the heteroskedastic simulation we set .
We compare the following ten model selection/averaging methods: 1) model selection with AIC (AI), model selection with BIC (BI), and model selection with Mallows' Cp (Cp); 2) model averaging with smoothed AIC (SA), and model averaging with smoothed BIC (SB); 3) Mallows model averaging (MM), jackknife model averaging (JM), and least squares model averaging based on generalized cross-validation (GM) (Li et al., 2021); 4) Ridge-Mallows model averaging (RM), and Ridge-jackknife model averaging (RJ). To evaluate these methods, we generate a test set by the above data generating process, and
is calculated as a measure of consistency. In the simulation, we set and repeat the process 200 times. For each parameterization, we normalize the MSE by dividing by the infeasible MSE (the mean of the smallest MSEs of the M approximating models across the 200 simulations).
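The normalization just described can be written compactly; `normalized_mse` is an illustrative name, and for simplicity the infeasible MSE here is computed on a single replication rather than averaged over the 200 runs as in the paper.

```python
import numpy as np

def normalized_mse(method_preds, candidate_preds, y_test):
    """Test-set MSE of each method, divided by the infeasible MSE
    (here: the best single candidate model on this replication)."""
    mse = lambda p: float(np.mean((y_test - p) ** 2))
    infeasible = min(mse(p) for p in candidate_preds)
    return {name: mse(p) / infeasible for name, p in method_preds.items()}
```

A normalized value below 1 means the averaging method beats the best single candidate model on that replication.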
We consider two simulation settings. In the first setting, like Hansen (2007), all candidate models are misspecified, and candidate models are strictly nested. In the second setting, the true model is one of candidate models, and candidate models are non-nested.
4.1 Nested Setting and Results
We set , , and (i.e. ), where is controlled by the parameter c. For , the mean of the normalized MSEs over the 200 simulations is shown in Figures . The results with are similar and are omitted to save space.
For the homoskedastic case, we can draw the following conclusions from Figures . When (Figure 1): 1) RM and RJ perform better than the other methods if and and ; 2) RM and RJ perform worse than the other methods except BI and SB if , and worse than MM, JM and GM if and ; 3) RM performs better than RJ in most cases if and . When (Figure 2): 1) RM performs better than the other methods if ; 2) RM and RJ perform worse than the other methods except BI and SB if ; 3) RM performs better than RJ in most cases if and . When (Figure 3): 1) RM and RJ always perform better than the other methods; 2) RM performs better than RJ in most cases if and .
For the heteroskedastic case, we can draw the following conclusions from Figures . When (Figure 4): 1) RM performs better than the other methods if ; 2) RM and RJ perform worse than the other methods except BI and SB if ; 3) RM performs better than RJ. When (Figure 5): 1) RM performs better than the other methods if ; 2) RM and RJ perform worse than the other methods except BI and SB if ; 3) RM performs better than RJ, and RJ performs better than the other methods in most cases. When (Figure 6): 1) RM always performs better than the other methods; 2) RM always performs better than RJ, and RJ performs better than the other methods in most cases.
From Figures , we can draw the following conclusions: 1) As increases, RM and RJ perform better and better; 2) has no obvious influence on the performance comparison of the various methods; 3) As increases, the performance of all methods improves, but RJ and RM never perform very badly and are the best in most cases.
To sum up: 1) RM and RJ are the best in most cases and are not bad when not the best; 2) when is small and is large, GM performs better, while RM and RJ are the best in the other cases; 3) RM performs better than RJ.
4.2 Non-nested Setting and Results
We set and for , and for , where is controlled by the parameter c. Each approximating model contains the first 6 covariates, and the last 6 covariates are combined to obtain the approximating models. For , the mean of the normalized MSEs over the 200 simulations is shown in Figures . As in the nested case, the results with are similar and are omitted.
For this setting, we can draw the following conclusions from Figures . When (Figures 7 and 10): 1) RM and RJ perform better than the other methods if ; 2) as increases, RM and RJ perform worse than the other methods for larger ; 3) for the homoskedastic case, RJ performs better than RM in most cases. When (Figures 8 and 11): 1) RM and RJ perform better than the other methods except SB if , but the performance of SB is very unstable; 2) as increases, RM and RJ perform worse than the other methods for larger ; 3) RJ performs better than RM in most cases. When (Figures 9 and 12): 1) RM and RJ perform better than the other methods except BI and SB, but the performance of SB and BI is very unstable; 2) RJ performs better than RM in most cases.
To sum up: 1) RM and RJ are the best in most cases and have stable performance; 2) one of SB, SA, BI, and AI may perform best when R2 is small or large, but their performance is unstable compared with RM and RJ; 3) in the heteroskedastic case with larger and , and in the homoskedastic case, RJ performs better than RM on the whole.
5 Real Data Analysis
In this section, we apply the proposed method to the "wage1" dataset in Wooldridge (2003), drawn from the US Current Population Survey for the year 1976. There are 526 observations in this dataset. The response variable is the log of average hourly earnings, and the covariates include: 1) dummy variables: nonwhite, female, married, numdep, smsa, northcen, south, west, construc, ndurman, trcommpu, trade, services, profserv, profocc, clerocc, and servocc; 2) non-dummy variables: educ, exper, and tenure; 3) interaction variables: nonwhite×educ, nonwhite×exper, nonwhite×tenure, female×educ, female×exper, female×tenure, married×educ, married×exper, and married×tenure.
We consider the following two cases: 1) we rank the covariates according to their linear correlations with the response variable, and then apply the strictly nested model averaging methods (the intercept term is included and ranked first); 2) 100 models are selected using the function "regsubsets" in the "leaps" package of R, with the parameters "nvmax" and "nbest" set to 20 and 5 respectively, and the other parameters left at their default values.
We randomly divide the data into two parts: a training sample of observations for estimating the models and a test sample of observations for validating the results. We consider , and
is calculated as a measure of consistency. We replicate the process 200 times. The box plots of the MSEs over the 200 replications are shown in Figures . From these figures, we see that the performance of RM and RJ is good and stable. We also compute the mean and median of the MSEs, as well as the best performance rate (BPR), which is the frequency of achieving the lowest risk across the replications. The results are shown in Tables . From these tables, we can draw the following conclusions: 1) RM and RJ are always superior to the other methods in terms of the mean and median of the MSEs and the BPR; 2) the performance of RM and RJ is basically the same in terms of the mean and median of the MSEs; 3) for the BPR, RM outperforms RJ on the whole.
n | AI | Cp | BI | SA | SB | MM | RM | GM | JM | RJ | |
---|---|---|---|---|---|---|---|---|---|---|---|
110 | Mean | 0.185 | 0.182 | 0.184 | 0.180 | 0.179 | 0.169 | 0.164 | 0.178 | 0.168 | 0.164 |
Median | 0.179 | 0.176 | 0.182 | 0.174 | 0.177 | 0.167 | 0.162 | 0.174 | 0.166 | 0.162 | |
BPR | 0.011 | 0.000 | 0.006 | 0.061 | 0.039 | 0.028 | 0.376 | 0.044 | 0.088 | 0.348 | |
210 | Mean | 0.160 | 0.159 | 0.169 | 0.157 | 0.166 | 0.155 | 0.152 | 0.155 | 0.155 | 0.152 |
Median | 0.160 | 0.159 | 0.169 | 0.156 | 0.166 | 0.154 | 0.152 | 0.155 | 0.155 | 0.153 | |
BPR | 0.030 | 0.000 | 0.010 | 0.075 | 0.000 | 0.020 | 0.460 | 0.080 | 0.035 | 0.290 | |
320 | Mean | 0.152 | 0.152 | 0.156 | 0.150 | 0.154 | 0.148 | 0.146 | 0.148 | 0.148 | 0.146 |
Median | 0.152 | 0.152 | 0.156 | 0.150 | 0.154 | 0.148 | 0.145 | 0.148 | 0.147 | 0.146 | |
BPR | 0.010 | 0.000 | 0.050 | 0.160 | 0.035 | 0.045 | 0.315 | 0.070 | 0.020 | 0.295 | |
420 | Mean | 0.152 | 0.152 | 0.150 | 0.150 | 0.151 | 0.148 | 0.147 | 0.149 | 0.148 | 0.147 |
Median | 0.151 | 0.150 | 0.148 | 0.147 | 0.149 | 0.147 | 0.147 | 0.148 | 0.147 | 0.147 | |
BPR | 0.025 | 0.005 | 0.105 | 0.175 | 0.060 | 0.030 | 0.215 | 0.055 | 0.030 | 0.300 |
| n |  | AI | Cp | BI | SA | SB | MM | RM | GM | JM | RJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 110 | Mean | 0.167 | 0.167 | 0.171 | 0.161 | 0.163 | 0.158 | 0.152 | 0.160 | 0.158 | 0.152 |
|  | Median | 0.164 | 0.164 | 0.170 | 0.158 | 0.163 | 0.157 | 0.151 | 0.158 | 0.157 | 0.151 |
|  | BPR | 0.011 | 0.000 | 0.005 | 0.081 | 0.000 | 0.016 | 0.400 | 0.011 | 0.011 | 0.465 |
| 210 | Mean | 0.152 | 0.151 | 0.157 | 0.149 | 0.153 | 0.148 | 0.145 | 0.148 | 0.148 | 0.145 |
|  | Median | 0.151 | 0.151 | 0.157 | 0.148 | 0.153 | 0.147 | 0.144 | 0.148 | 0.148 | 0.144 |
|  | BPR | 0.000 | 0.005 | 0.000 | 0.170 | 0.010 | 0.005 | 0.460 | 0.005 | 0.015 | 0.330 |
| 320 | Mean | 0.147 | 0.146 | 0.152 | 0.144 | 0.148 | 0.145 | 0.142 | 0.145 | 0.145 | 0.142 |
|  | Median | 0.145 | 0.145 | 0.150 | 0.143 | 0.147 | 0.144 | 0.142 | 0.143 | 0.144 | 0.142 |
|  | BPR | 0.010 | 0.000 | 0.000 | 0.280 | 0.040 | 0.000 | 0.340 | 0.000 | 0.000 | 0.330 |
| 420 | Mean | 0.148 | 0.148 | 0.152 | 0.144 | 0.148 | 0.146 | 0.143 | 0.146 | 0.146 | 0.143 |
|  | Median | 0.147 | 0.147 | 0.151 | 0.144 | 0.147 | 0.145 | 0.141 | 0.145 | 0.145 | 0.141 |
|  | BPR | 0.005 | 0.000 | 0.000 | 0.370 | 0.055 | 0.000 | 0.265 | 0.000 | 0.010 | 0.295 |
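The summary statistics in the tables above (mean, median, and BPR) are straightforward to compute from a replications-by-methods matrix of MSEs. The following is a minimal sketch, not the paper's code; the tie-breaking rule (a tied minimum is credited to the first method listed) is an assumption of this illustration.

```python
import numpy as np

def summarize_mses(mse, method_names):
    """Summarize a (replications x methods) matrix of MSEs:
    per-method mean, median, and best performance rate (BPR),
    i.e. the fraction of replications in which the method attains
    the lowest MSE (ties here go to the first method listed)."""
    mse = np.asarray(mse, dtype=float)
    means = mse.mean(axis=0)
    medians = np.median(mse, axis=0)
    # BPR: count how often each method achieves the row-wise minimum.
    wins = np.bincount(mse.argmin(axis=1), minlength=mse.shape[1])
    bpr = wins / mse.shape[0]
    return {name: {"mean": m, "median": md, "bpr": b}
            for name, m, md, b in zip(method_names, means, medians, bpr)}
```

Note that with this tie-breaking convention the BPR values of all methods sum to one across each row of the tables.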
6 Concluding Remarks
In this paper, we study the relationship among AERM, stability, generalization, and consistency in model averaging. The results indicate that stability is an important property of model averaging: it ensures good generalization performance and consistency under reasonable conditions. When the model weights are not restricted, extreme weights tend to appear in MMA and JMA under the influence of correlation between the candidate models, resulting in poor performance of the corresponding model averaging estimators. We therefore propose an $\ell_2$-penalty model averaging method, in analogy with ridge regression (Hoerl and Kennard, 1970), and prove that it has stability and consistency. To reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and take a weighted average of the resulting estimators of the model weights based on their estimation errors. The numerical simulations and real data analysis show the superiority of the proposed method.
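The tuning-parameter step described above can be sketched as follows. This is an illustrative implementation, not the paper's: `fitted` stacks the candidate models' fitted values column-wise, the weight estimator is taken to be a ridge-type closed form, and inverse-CV-error weighting stands in for the paper's estimation-error-based averaging scheme.

```python
import numpy as np

def ridge_averaged_weights(fitted, y, lambdas, n_folds=10, rng=None):
    """Ridge-type model-averaging weights, averaged over a candidate
    set of tuning parameters via 10-fold cross-validation.

    fitted : (n, M) matrix; column m holds candidate model m's fitted values
    y      : (n,) response vector
    lambdas: 1-D grid of candidate penalty levels
    """
    n, M = fitted.shape
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(n), n_folds)

    def solve(P, yy, lam):
        # Ridge-type weight estimator: (P'P + lam I)^{-1} P'yy
        return np.linalg.solve(P.T @ P + lam * np.eye(M), P.T @ yy)

    # 10-fold CV error for each candidate tuning parameter.
    cv_err = np.zeros(len(lambdas))
    for k, lam in enumerate(lambdas):
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            w = solve(fitted[train], y[train], lam)
            resid = y[test] - fitted[test] @ w
            cv_err[k] += resid @ resid
    cv_err /= n

    # Weighted average of the full-sample weight estimators, putting more
    # weight on tuning parameters with smaller CV error (inverse-error
    # weighting is an illustrative choice, not the paper's exact rule).
    a = 1.0 / cv_err
    a /= a.sum()
    return sum(ak * solve(fitted, y, lam) for ak, lam in zip(a, lambdas))
```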
Many issues deserve further investigation. We have only applied the methods of Section 2 to the generalization of MMA and JMA in linear regression. It is worth investigating whether the proposed method can be extended to more complex settings, such as generalized linear models, quantile regression, and dependent data. Further, we hope to develop a model averaging framework with stability and consistency that applies across multiple scenarios. In addition, in RMMA and RJMA the estimators of the weight vector are explicitly expressed, so studying their asymptotic behavior based on these explicit expressions is a meaningful but challenging topic.
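To illustrate the kind of explicit expression referred to above, here is a sketch under assumed notation (the symbols $P$, $y$, and $\lambda$ are not taken from the paper): writing $P$ for the $n \times M$ matrix whose columns are the candidate models' fitted values, a ridge-type penalized weight criterion admits the closed form

```latex
\[
  \hat{w}(\lambda)
  = \operatorname*{arg\,min}_{w \in \mathbb{R}^{M}}
    \left\{ \lVert y - P w \rVert^{2} + \lambda \lVert w \rVert^{2} \right\}
  = \left( P^{\top} P + \lambda I_{M} \right)^{-1} P^{\top} y,
  \qquad \lambda > 0.
\]
```

The penalty keeps $P^{\top} P + \lambda I_{M}$ well conditioned even when the candidate models are highly correlated, which is what suppresses the extreme weights discussed in the conclusion.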
7 Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Grant nos. 11971323 and 12031016).
Appendix A Lemmas and Proofs
Let and be the model averaging estimators corresponding to , which are based on and respectively, where represents the sample set after removing the -th and -th observations from .
Proof.
By Dufour (1982), we see that for any ,
It follows from Assumption 1 that
(1)
From Assumption 2, we obtain
(2)
Further, from Assumption 2 and Lemma 1, we have
(3)
Combining , it is seen that
In a similar way, it can be shown that
Proof.
Note that
and
It follows from Assumption 2, Lemma 1 and Lemma 2 that
In a similar way, we obtain
Further, it is seen that
So we have
On the other hand, it follows from Assumptions 1 and 2 that
Hence, we have
Thus, from Assumption 2, we see that
Proof.
It follows from Assumptions 2 and 3 that
and
In a similar way, we obtain
On the other hand, it follows from Assumptions and Lemma 1 that
We take to complete the proof.
Proof.
Note that
and
where
and
It follows from Lemmas 1, 2 and 4 that
and
In a similar way, we obtain
Further, from Assumption 2, we have
and
These arguments indicate that
and
Similarly, we obtain
Thus, it follows from Assumption 4 and Lemma 4 that
On the other hand, we notice that
and
where
and
So, we obtain
and
Further, we have
Similarly, from
and
where
and
we have
Proof.
It follows from the definition of that . So, we have
Further, it follows from Lemma 5 that
On the other hand, it follows from the definition of that . So, we have
Further, from Lemma 5, we have
Similarly, from the definition of , Lemma 5 and
we obtain
Let . Then, we have
and
Similarly, from
we see that, when , . So, we have and when .
Proof of Theorem 3.6: It follows from Lemmas 1 and 6 that
Further, it follows from the proof of Lemma 5 that
and
From
we have
In a similar way, we obtain
References
- Akaike (1998) Akaike, H., 1998. Information theory and an extension of the maximum likelihood principle, in: Selected papers of Hirotugu Akaike, pp. 199–213.
- Bousquet and Elisseeff (2002) Bousquet, O., Elisseeff, A., 2002. Stability and generalization. The Journal of Machine Learning Research 2, 499–526.
- Buckland et al. (1997) Buckland, S.T., Burnham, K.P., Augustin, N.H., 1997. Model selection: An integral part of inference. Biometrics 53, 603–618.
- Dufour (1982) Dufour, J.M., 1982. Recursive stability analysis of linear regression relationships: An exploratory methodology. Journal of Econometrics 19, 31–76.
- Fragoso et al. (2018) Fragoso, T.M., Bertoli, W., Louzada, F., 2018. Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review 86, 1–28.
- Gao et al. (2016) Gao, Y., Zhang, X., Wang, S., Zou, G., 2016. Model averaging based on leave-subject-out cross-validation. Journal of Econometrics 192, 139–151.
- Hansen (2007) Hansen, B.E., 2007. Least squares model averaging. Econometrica 75, 1175–1189.
- Hansen (2008) Hansen, B.E., 2008. Least-squares forecast averaging. Journal of Econometrics 146, 342–350.
- Hansen and Racine (2012) Hansen, B.E., Racine, J.S., 2012. Jackknife model averaging. Journal of Econometrics 167, 38–46.
- Hjort and Claeskens (2003) Hjort, N.L., Claeskens, G., 2003. Frequentist model average estimators. Journal of the American Statistical Association 98, 879–899.
- Hoerl and Kennard (1970) Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
- Kutin and Niyogi (2002) Kutin, S., Niyogi, P., 2002. Almost-everywhere algorithmic stability and generalization error, in: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 275–282.
- Li et al. (2021) Li, X., Zou, G., Zhang, X., Zhao, S., 2021. Least squares model averaging based on generalized cross validation. Acta Mathematicae Applicatae Sinica, English Series 37, 495–509.
- Liang et al. (2011) Liang, H., Zou, G., Wan, A.T., Zhang, X., 2011. Optimal weight choice for frequentist model average estimators. Journal of the American Statistical Association 106, 1053–1066.
- Liao and Zou (2020) Liao, J., Zou, G., 2020. Corrected Mallows criterion for model averaging. Computational Statistics and Data Analysis 144, 106902.
- Liu and Okui (2013) Liu, Q., Okui, R., 2013. Heteroskedasticity-robust model averaging. The Econometrics Journal 16, 463–472.
- Mallows (1973) Mallows, C.L., 1973. Some comments on $C_p$. Technometrics 15, 661–675.
- Moral-Benito (2015) Moral-Benito, E., 2015. Model averaging in economics: An overview. Journal of Economic Surveys 29, 46–75.
- Mukherjee et al. (2006) Mukherjee, S., Niyogi, P., Poggio, T., Rifkin, R., 2006. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25, 161–193.
- Rakhlin et al. (2005) Rakhlin, A., Mukherjee, S., Poggio, T., 2005. Stability results in learning theory. Analysis and Applications 3, 397–417.
- Schomaker (2012) Schomaker, M., 2012. Shrinkage averaging estimation. Statistical Papers 53, 1015–1034.
- Schwarz (1978) Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
- Shalev-Shwartz et al. (2010) Shalev-Shwartz, S., Shamir, O., Srebro, N., Sridharan, K., 2010. Learnability, stability and uniform convergence. The Journal of Machine Learning Research 11, 2635–2670.
- Tong and Wu (2017) Tong, H., Wu, Q., 2017. Learning performance of regularized moving least square regression. Journal of Computational and Applied Mathematics 325, 42–55.
- Vapnik (1998) Vapnik, V., 1998. Statistical Learning Theory. Wiley.
- Wan and Zhang (2009) Wan, A.T., Zhang, X., 2009. On the use of model averaging in tourism research. Annals of Tourism Research 36, 525–532.
- Wan et al. (2010) Wan, A.T., Zhang, X., Zou, G., 2010. Least squares model averaging by Mallows criterion. Journal of Econometrics 156, 277–283.
- Wang et al. (2016) Wang, Y.X., Lei, J., Fienberg, S.E., 2016. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of erm principle. The Journal of Machine Learning Research 17, 6353–6392.
- Wooldridge (2003) Wooldridge, J.M., 2003. Introductory Econometrics. Thomson South-Western.
- Xu et al. (2014) Xu, G., Wang, S., Huang, J.Z., 2014. Focused information criterion and model averaging based on weighted composite quantile regression. Scandinavian Journal of Statistics 41, 365–381.
- Yang (2001) Yang, Y., 2001. Adaptive regression by mixing. Journal of the American Statistical Association 96, 574–588.
- Yuan and Yang (2005) Yuan, Z., Yang, Y., 2005. Combining linear regression models: When and how? Journal of the American Statistical Association 100, 1202–1214.
- Zhang and Zou (2020) Zhang, H., Zou, G., 2020. Cross-validation model averaging for generalized functional linear model. Econometrics 8, 7.
- Zhang and Liang (2011) Zhang, X., Liang, H., 2011. Focused information criterion and model averaging for generalized additive partial linear models. The Annals of Statistics 39, 174–200.
- Zhang et al. (2012) Zhang, X., Wan, A.T., Zhou, S.Z., 2012. Focused information criteria, model selection, and model averaging in a Tobit model with a nonzero threshold. Journal of Business and Economic Statistics 30, 132–142.
- Zhang et al. (2013) Zhang, X., Wan, A.T., Zou, G., 2013. Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics 174, 82–94.
- Zhang and Zou (2011) Zhang, X., Zou, G., 2011. Model averaging method and its application in forecast. Statistical Research 28, 6.
- Zhao et al. (2020) Zhao, S., Liao, J., Yu, D., 2020. Model averaging estimator in ridge regression and its large sample properties. Statistical Papers 61, 1719–1739.
- Zou and Zhang (2009) Zou, H., Zhang, H.H., 2009. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 1733–1751.