
Stability and $L_2$-penalty in Model Averaging

Hengkun Zhu [email protected] Guohua Zou [email protected].
Abstract

Model averaging, which integrates available information by averaging over potential models, has received much attention in the past two decades. Although various model averaging methods have been developed, there is little literature on the theoretical properties of model averaging from the perspective of stability, and the majority of these methods constrain the model weights to a simplex. The aim of this paper is to introduce stability from statistical learning theory into model averaging. We therefore define stability, the asymptotic empirical risk minimizer, generalization, and consistency of model averaging and study the relationships among them. Our results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error. We also propose an $L_2$-penalty model averaging method without limiting the model weights and prove that it has stability and consistency. In order to reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of the model weights based on estimation errors. A Monte Carlo simulation and an illustrative application demonstrate the usefulness of the proposed method.

keywords:
Model averaging, Stability, Mean squared prediction error, $L_2$-penalty
journal: Journal of Machine Learning Research
School of Mathematical Sciences, Capital Normal University, Beijing 100048, China

1 Introduction

In practical applications, data analysts usually determine a series of models, based on exploratory analysis of the data and empirical knowledge, to describe the relationship between the variables of interest and related variables; how to use these models to produce good results is then an important problem. It is very common to select one model using a data-driven criterion, such as AIC (Akaike, 1998), BIC (Schwarz, 1978), $C_p$ (Mallows, 1973) and FIC (Hjort and Claeskens, 2003). However, the properties of the corresponding results depend not only on the selected model but also on the randomness of the selection. An alternative to model selection is to make the resultant estimator a compromise across a set of competing models. Statisticians find that they can usually obtain better and more stable results by combining information from different models. This process of combining multiple models into estimation is known as model averaging. Up to now, there has been a large literature on Bayesian model averaging (BMA) and frequentist model averaging (FMA). Fragoso et al. (2018) reviewed the relevant literature on BMA. In this paper, we focus on FMA. In the past decades, model averaging has been applied in various fields. Wan and Zhang (2009) examined applications of model averaging in tourism research. Zhang and Zou (2011) applied model averaging to grain production forecasting in China. Moral-Benito (2015) reviewed the literature on model averaging with special emphasis on its applications to economics. The key to FMA lies in how to select the model weights. Common weight selection methods include: 1) methods based on information criteria, such as smoothed AIC and smoothed BIC in Buckland et al. (1997); 2) Mallows model averaging (MMA), proposed by Hansen (2007) (see also Wan et al., 2010), modified by Liu and Okui (2013) to accommodate heteroscedasticity, and improved by Liao and Zou (2020) for small sample sizes; 3) the FIC criterion, proposed by Hjort and Claeskens (2003), which has been extended to the generalized additive partial linear model (Zhang and Liang, 2011), the Tobit model with non-zero threshold (Zhang et al., 2012), and weighted composite quantile regression (Xu et al., 2014); 4) adaptive methods, such as Yang (2001) and Yuan and Yang (2005); 5) the OPT method (Liang et al., 2011); 6) cross-validation methods, such as jackknife model averaging (JMA) (Hansen and Racine, 2012), Zhang et al. (2013), Gao et al. (2016), Zhang and Zou (2020) and Li et al. (2021), among others.

In learning theory, stability measures an algorithm's sensitivity to perturbations in the training set, and it is an important tool for analyzing the generalization and learnability of algorithms. Bousquet and Elisseeff (2002) introduced four kinds of stability (hypothesis stability, pointwise hypothesis stability, error stability, and uniform stability), and showed that stability is a sufficient condition for learnability. Kutin and Niyogi (2002) introduced several weaker variants of stability and showed how they suffice to obtain generalization bounds for algorithms that are stable in their sense. Rakhlin et al. (2005) and Mukherjee et al. (2006) discussed the necessity of stability for learnability under the assumption that uniform convergence is equivalent to learnability. Further, Shalev-Shwartz et al. (2010) showed that uniform convergence is in fact not necessary for learning in the general learning setting, where stability plays a key role that has nothing to do with uniform convergence. In the general learning setting with a differential privacy constraint, Wang et al. (2016) studied some intricate relationships between privacy, stability and learnability.

Although various model averaging methods have been proposed, there is little literature on their theoretical properties from the perspective of stability, and the majority of these methods are concerned only with whether the resultant estimator leads to a good approximation of the minimum of a given target when the model weights are constrained to a simplex. Considering that it remains unclear whether stability can ensure good theoretical properties of the model averaging estimator, and that it may be better to approximate the global minimum than a local minimum in some cases, our attempts in this paper are to study stability in model averaging and to answer whether the resultant estimator can lead to a good approximation of the minimum of a given target when the model weights are unrestricted.

For the first attempt, we introduce the concept of stability from statistical learning theory into model averaging. Stability describes how much an algorithm's output varies when the sample set changes a little. Shalev-Shwartz et al. (2010) discussed the relationships among the asymptotic empirical risk minimizer (AERM), stability, generalization and consistency, but the relevant conclusions cannot be directly applied to model averaging. Therefore, we adapt the relevant definitions and conclusions of Shalev-Shwartz et al. (2010) to model averaging. The results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error (MSPE). For the second attempt, we find that, for MMA and JMA, extreme weights tend to appear under the influence of correlation among models when the model weights are unrestricted, resulting in poor performance of the model averaging estimator. Therefore, we should not simply remove the weight constraint and directly use an existing model averaging method. Thus, similar to ridge regression in Hoerl and Kennard (1970), we introduce an $L_2$-penalty for the weight vector in MMA and JMA and call the resulting methods Ridge-Mallows model averaging (RMMA) and Ridge-jackknife model averaging (RJMA), respectively. Like Theorem 4.3 in Hoerl and Kennard (1970), we discuss the reasonability of introducing the $L_2$-penalty. We also prove the stability and consistency of the proposed methods, where consistency means that the model averaging estimator can asymptotically minimize the MSPE when the model weights are unrestricted. In the context of shrinkage estimation, Schomaker (2012) discussed the impact of tuning parameter selection and pointed out that a weighted average of shrinkage estimators with different tuning parameters can improve the overall stability, predictive performance and standard errors of shrinkage estimators. Hence, like Schomaker (2012), we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of the model weights based on estimation errors.

The remainder of this paper is organised as follows. In Section 2, we explain the relevant notations, give the definitions of consistency and stability, and discuss their relationship. In Section 3, we propose RMMA and RJMA methods, and prove that they are stable and consistent. Section 4 conducts the Monte Carlo simulation experiment. Section 5 applies the proposed method to a real data set. Section 6 concludes. The proofs of lemmas and theorems are provided in the Appendix.

2 Consistency and Stability for Model Averaging

2.1 Model Averaging

We assume that $S=\{z_i=(y_i,x_i')'\in\mathcal{Z}, i=1,\ldots,n\}$ is a simple random sample from distribution $\mathcal{D}$, where $y_i$ is the $i$-th observation of the response variable and $x_i$ is the $i$-th observation of the covariates. Let $z^*=(y^*,x^{*\prime})'$ be an observation from distribution $\mathcal{D}$ that is independent of $S$.

In model averaging, $M$ approximating models are selected first in order to describe the relationship between the response variable and the covariates. We assume that the hypothesis spaces of the $M$ approximating models are

\mathcal{H}_m=\big\{h_m(x^*_m),\,h_m\in\mathcal{F}_m\big\},\quad m=1,\ldots,M,

where $x^*_m$ consists of some elements of $x^*$, and $\mathcal{F}_m$ is a given function set. For example, in MMA, to estimate $E(y^*|x^*)$, we take

\mathcal{H}_m=\big\{x^{*\prime}_m\theta_m,\,\theta_m\in R^{\dim(x^*_m)}\big\},\quad m=1,\ldots,M,

where $\dim(\cdot)$ represents the dimension of the vector. For the $m$-th approximating model, a proper estimation method $A_m$ is selected, and $\hat{h}_m$, the estimator of $h_m$, is obtained based on $S$ and $A_m$. Then, the hypothesis space of model averaging is defined as follows:

\mathcal{H}=\big\{\hat{h}(x^*,w)=H[w,\hat{h}_1(x^*_1),\ldots,\hat{h}_M(x^*_M)],\,w\in W\big\},

where $W$ is a given weight space, and $H(\cdot)$ is a given function of the weight vector and the estimators of the $M$ approximating models. In MMA, we take

H[w,\hat{h}_1(x^*_1),\ldots,\hat{h}_M(x^*_M)]=\sum_{m=1}^{M}w_m\hat{h}_m(x^*_m)

as the model averaging estimator of $E(y^*|x^*)$. An important problem with model averaging is the choice of the model weights. Here, the estimator $\hat{w}$ of the weight vector is obtained based on $S$ and a proper weight selection criterion $A(w)$ that makes $\hat{w}$ optimal in a certain sense.

The selection of $A_m$, $m=1,\ldots,M$, and $A(w)$ is closely related to the definition of the loss function. Let $L[\hat{h}(x^*,w),y^*]$ be a real-valued loss function defined on $\mathcal{H}\times\mathcal{Y}$, where $\mathcal{Y}$ is the value space of $y^*$. Then, the risk function is defined as follows:

F(w,S)=E_{z^*}\big\{L[\hat{h}(x^*,w),y^*]\big\},

which is the MSPE under the sample set $S$ and weight vector $w$.
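To make these objects concrete, the following minimal Python sketch (entirely our own illustration; the synthetic data, nested least-squares candidate models, and squared error loss are assumptions, not part of the paper) forms the averaged prediction $\sum_m w_m\hat{h}_m(x^*)$ and approximates the risk $F(w,S)$, i.e., the MSPE, by Monte Carlo over a large independent sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on three covariates (illustrative only).
n, n_test = 200, 5000
X = rng.normal(size=(n, 3))
y = 1.0 + 0.8 * X[:, 0] + 0.4 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)

# Candidate (nested) models: columns used by each approximating model.
models = [[0], [0, 1], [0, 1, 2]]

def ols_fit(X, y, cols):
    """Least-squares estimator for one approximating model (with intercept)."""
    Z = np.column_stack([np.ones(len(y)), X[:, cols]])
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta

def predict(theta, X, cols):
    Z = np.column_stack([np.ones(len(X)), X[:, cols]])
    return Z @ theta

thetas = [ols_fit(X, y, cols) for cols in models]

def averaged_prediction(w, X_new):
    """Model averaging estimator: sum_m w_m * h_m(x)."""
    preds = np.column_stack([predict(t, X_new, c) for t, c in zip(thetas, models)])
    return preds @ w

# Monte Carlo approximation of F(w, S) = E[(y* - h(x*, w))^2] (the MSPE),
# using a large independent sample from the same data generating process.
X_star = rng.normal(size=(n_test, 3))
y_star = (1.0 + 0.8 * X_star[:, 0] + 0.4 * X_star[:, 1] + 0.2 * X_star[:, 2]
          + rng.normal(size=n_test))

w = np.array([0.2, 0.3, 0.5])          # an arbitrary weight vector
risk = np.mean((y_star - averaged_prediction(w, X_star)) ** 2)
print(f"approximate F(w, S) = {risk:.3f}")
```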

2.2 Related Concepts

In this paper, we mainly discuss whether $F(\hat{w},S)$ can approximate the smallest possible risk $\inf_{w\in W}F(w,S)$. If $A(w)$ has this property, we say that $A(w)$ is consistent. For fixed $m$, Shalev-Shwartz et al. (2010) defined the stability and consistency of $A_m$ and discussed their relationship. Obviously, for model averaging, we need to pay more attention to the stability and consistency of the weight selection. We note that the relevant conclusions of Shalev-Shwartz et al. (2010) cannot be directly applied to model averaging because $\mathcal{H}$ depends on $S$. Therefore, we extend the relevant definitions and conclusions to model averaging. The following is the definition of consistency:

Definition 1 (Consistency).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $\hat{w}$ satisfies

E_S\big\{F(\hat{w},S)-\inf_{w\in W}F(w,S)\big\}=O(\epsilon_n),

then $A(w)$ is said to be consistent with rate $\epsilon_n$.

In statistical learning theory, stability concerns how much the algorithm's output varies when $S$ changes a little. "Leave-one-out" (Loo) and "Replace-one" (Ro) are two common tools used to evaluate stability. Loo considers the change in the algorithm's output after removing an observation from $S$, and Ro considers such a change after replacing an observation in $S$ with an observation that is independent of $S$. Accordingly, the stability is called Loo stability and Ro stability, respectively. Here we give the formal definitions of Loo stability and Ro stability. To this end, we first give the definition of algorithm symmetry:

Definition 2 (Symmetry).

If the algorithm's output is not affected by the order of the observations in $S$, the algorithm is symmetric.

Now let $S^{-i}$ be the sample set after removing the $i$-th observation from $S$, $\hat{h}^{-i}_m$ be the estimator of $h_m$ based on $S^{-i}$ and $A_m$, $\hat{w}^{-i}$ be the estimator of the weight vector based on $S^{-i}$ and $A(w)$, and $F(w,S^{-i})=E_{z^*}\big\{L[\hat{h}^{-i}(x^*,w),y^*]\big\}$, where $\hat{h}^{-i}(x^*,w)=H[w,\hat{h}^{-i}_1(x^*_1),\ldots,\hat{h}^{-i}_M(x^*_M)]$. We define Loo stability as follows:

Definition 3 (PLoo Stability).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

\frac{1}{n}\sum_{i=1}^{n}E_S\big[F(\hat{w},S)-F(\hat{w}^{-i},S^{-i})\big]=O(\epsilon_n),

then $A(w)$ is Predicted-Loo (PLoo) stable with rate $\epsilon_n$; if $A_m$, $m=1,\ldots,M$, and $A(w)$ are symmetric, it only has to satisfy

E_S\big[F(\hat{w},S)-F(\hat{w}^{-n},S^{-n})\big]=O(\epsilon_n).

Definition 4 (FLoo Stability).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

\frac{1}{n}\sum_{i=1}^{n}E_S\big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]\big\}=O(\epsilon_n),

then $A(w)$ is Fitted-Loo (FLoo) stable with rate $\epsilon_n$; if $A_m$, $m=1,\ldots,M$, and $A(w)$ are symmetric, it only has to satisfy

E_S\big\{L[\hat{h}(x_n,\hat{w}),y_n]-L[\hat{h}^{-n}(x_n,\hat{w}^{-n}),y_n]\big\}=O(\epsilon_n).

Let $S^i$ be the sample set $S$ with the $i$-th observation replaced by $z_i^*=(y_i^*,x_i^{*\prime})'$, $\hat{h}^i_m$ be the estimator of $h_m$ based on $S^i$ and $A_m$, and $\hat{w}^i$ be the estimator of the weight vector based on $S^i$ and $A(w)$, where $z_i^*$ is from distribution $\mathcal{D}$ and independent of $S$. Let $F(w,S^i)=E_{z^*}\big\{L[\hat{h}^i(x^*,w),y^*]\big\}$; then

\frac{1}{n}\sum_{i=1}^{n}E_{S,z_i^*}\big[F(\hat{w},S)-F(\hat{w}^i,S^i)\big]=0,

where $\hat{h}^i(x^*,w)=H[w,\hat{h}^i_1(x^*_1),\ldots,\hat{h}^i_M(x^*_M)]$. Therefore, we define Ro stability as follows:

Definition 5 (Ro Stability).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

\frac{1}{n}\sum_{i=1}^{n}E_{S,z_i^*}\big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\big\}=O(\epsilon_n),

then $A(w)$ is Ro stable with rate $\epsilon_n$; if $A_m$, $m=1,\ldots,M$, and $A(w)$ are symmetric, it only has to satisfy

E_{S,z_n^*}\big\{L[\hat{h}(x_n,\hat{w}),y_n]-L[\hat{h}^n(x_n,\hat{w}^n),y_n]\big\}=O(\epsilon_n).

Before discussing the relationship between stability and consistency, we give the definitions of AERM and generalization. The empirical risk function is defined as follows:

\hat{F}(w,S)=\frac{1}{n}\sum_{i=1}^{n}L[\hat{h}(x_i,w),y_i].
Definition 6 (AERM).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

E_S\big[\hat{F}(\hat{w},S)-\inf_{w\in W}\hat{F}(w,S)\big]=O(\epsilon_n),

then $A(w)$ is an AERM with rate $\epsilon_n$.

Vapnik (1998) proved some theoretical properties of the empirical risk minimization principle. However, when the sample size is small, the empirical risk minimizer tends to over-fit. Therefore, the structural risk minimization principle was proposed in Vapnik (1998), and a method satisfying this principle is usually an AERM. Shalev-Shwartz et al. (2010) also discussed the deficiency of the empirical risk minimization principle and the importance of AERM.

Definition 7 (Generalization).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

E_S\big[\hat{F}(\hat{w},S)-F(\hat{w},S)\big]=O(\epsilon_n),

then $A(w)$ generalizes with rate $\epsilon_n$.

In statistical learning theory, generalization refers to the performance of the concept learned by a model on unknown samples. It can be seen from Definition 7 that the generalization of $A(w)$ describes the difference between using $\hat{w}$ to fit the training set $S$ and using it to predict unknown samples.

2.3 Relationship between Different Concepts

Note that for any $i\in\{1,\ldots,n\}$,

\begin{aligned}
&E_{S,z_i^*}\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\Big\}\\
&=E_{S,z_i^*}\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]+L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\Big\}\\
&=E_S\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]\Big\}+E_S\big[F(\hat{w}^{-i},S^{-i})-F(\hat{w},S)\big],
\end{aligned}

so we give the following theorem to illustrate the relationship between Loo stability and Ro stability:

Theorem 2.1.

If $A(w)$ has two of FLoo stability, PLoo stability and Ro stability with rate $\epsilon_n$, then it has all three stabilities with rate $\epsilon_n$.

Shalev-Shwartz et al. (2010) emphasized that Ro stability and Loo stability are in general incomparable notions, but Theorem 2.1 shows that they are closely related.

By definitions of generalization and Ro stability, we have

\begin{aligned}
&E_S[\hat{F}(\hat{w},S)-F(\hat{w},S)]\\
&=E_{S,z_1^*,\ldots,z_n^*}\Big\{\frac{1}{n}\sum_{i=1}^{n}L[\hat{h}(x_i,\hat{w}),y_i]-\frac{1}{n}\sum_{i=1}^{n}L[\hat{h}(x_i^*,\hat{w}),y_i^*]\Big\}\\
&=\frac{1}{n}\sum_{i=1}^{n}E_{S,z_i^*}\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\Big\},
\end{aligned}

and then we give the following theorem to illustrate the equivalence of Ro stability and generalization:

Theorem 2.2.

$A(w)$ has Ro stability with rate $\epsilon_n$ if and only if $A(w)$ generalizes with rate $\epsilon_n$.

Theorem 2.2 shows that stability is an important property of a weight selection criterion, which can ensure that the corresponding estimator has good generalization performance.

Let $\hat{w}^*\in W$ satisfy $F(\hat{w}^*,S)=\inf_{w\in W}F(w,S)$. Note that

\begin{aligned}
&E_S[F(\hat{w},S)-F(\hat{w}^*,S)]\\
&=E_S[F(\hat{w},S)-\hat{F}(\hat{w},S)+\hat{F}(\hat{w},S)-\hat{F}(\hat{w}^*,S)+\hat{F}(\hat{w}^*,S)-F(\hat{w}^*,S)]\\
&\leq E_S[F(\hat{w},S)-\hat{F}(\hat{w},S)+\hat{F}(\hat{w},S)-\inf_{w\in W}\hat{F}(w,S)+\hat{F}(\hat{w}^*,S)-F(\hat{w}^*,S)],
\end{aligned}

so we give the following theorem to illustrate the relationship between stability and consistency:

Theorem 2.3.

If $A(w)$ is an AERM and has Ro stability with rate $\epsilon_n$, and $\hat{w}^*$ satisfies

E_S\big[\hat{F}(\hat{w}^*,S)-F(\hat{w}^*,S)\big]=O(\epsilon_n),

then $A(w)$ is consistent with rate $\epsilon_n$.

Since $\hat{w}^*$ and $\mathcal{H}$ depend on $S$, unlike Lemma 15 in Shalev-Shwartz et al. (2010), Theorem 2.3 requires $\hat{w}^*$ to generalize with rate $\epsilon_n$. In the next section, we propose an $L_2$-penalty model averaging method and prove that it has stability and consistency under certain reasonable conditions.

3 $L_2$-penalty Model Averaging

In most of the existing literature on model averaging, the theoretical properties are explored under the weight set $W^0=\{w\in[0,1]^M:\sum_{m=1}^{M}w_m=1\}$. From Definition 1, it is seen that, even if the corresponding weight selection criterion is consistent, such consistency holds only on this subspace of $R^M$. Therefore, a natural question is whether it is possible to leave the weight space unrestricted, and what happens when we do so. We note that the unrestricted Granger-Ramanathan method obtains the estimator of the weight vector over $R^M$ by minimizing the sum of squared forecast errors of the combination forecast, but its performance is poor compared with some other methods (see Hansen, 2008). On the other hand, in a prediction task, we are more concerned with whether the resulting estimator predicts well, and the estimator that minimizes the MSPE over the full space will most likely outperform the estimator that minimizes the MSPE over a subspace. Therefore, it is necessary to further develop new research ideas.

3.1 Model Framework and Estimators

We assume that the response variable $y_i$ and the covariates $x_i=(x_{1i},x_{2i},\ldots)$ satisfy the following data generating process:

y_i=\mu_i+e_i=\sum_{k=1}^{\infty}x_{ki}\theta_k+e_i,\quad E(e_i|x_i)=0,\quad E(e_i^2|x_i)=\sigma_i^2,

and the $M$ approximating models are given by

y_i=\sum_{k=1}^{k_m}x_{m(k)i}\theta_{m,(k)}+b_{mi}+e_i,\quad m=1,\ldots,M,

where $b_{mi}=\mu_i-\sum_{k=1}^{k_m}x_{m(k)i}\theta_{m,(k)}$ is the approximation error of the $m$-th approximating model. We assume that the $M$-th approximating model contains all the considered covariates. To simplify notation, we let $x_i=(x_{(1)i},\ldots,x_{(k_M)i})$ throughout the rest of this article, where $x_{(k)i}=x_{M(k)i}$.

Let $y=(y_1,\ldots,y_n)'$, $\mu=(\mu_1,\ldots,\mu_n)'$, $e=(e_1,\ldots,e_n)'$, and $b_m=(b_{m1},\ldots,b_{mn})'$. Then, the corresponding matrix form of the true model is $y=X_m\theta_m+b_m+e$, where $\theta_m=(\theta_{m,(1)},\ldots,\theta_{m,(k_m)})'$ and $X_m$ is the design matrix of the $m$-th approximating model. When there is an approximating model such that $b_m=\mathbf{0}$, the true model is included in the $m$-th approximating model, i.e., the model is correctly specified. Unlike Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we do not require that the infimum of $R_n(w)$ (defined in Section 3.2) tend to infinity, and therefore we allow the model to be correctly specified.

Let $\pi_m\in R^{K\times k_m}$ be the variable selection matrix satisfying $X_M\pi_m=X_m$ and $\pi_m'\pi_m=I_{k_m}$, $m=1,\ldots,M$. Then, the hypothesis spaces of the $M$ approximating models are

\mathcal{H}_m=\big\{x^{*\prime}\pi_m\theta_m,\,\theta_m\in R^{k_m}\big\},\quad m=1,\ldots,M.

The least squares estimator of $\theta_m$ is $\hat{\theta}_m=(X_m'X_m)^{-1}X_m'y$, $m=1,\ldots,M$.

3.2 Weight Selection Criterion

Let $P_m=X_m(X_m'X_m)^{-1}X_m'$, $P(w)=\sum_{m=1}^{M}w_mP_m$, $L_n(w)=\|\mu-P(w)y\|_2^2$, and $R_n(w)=E_e[L_n(w)]$. When $\sigma_i^2\equiv\sigma^2$, Hansen (2007) and Wan et al. (2010) used the Mallows criterion $C_n(w)=\|y-\hat{\Omega}w\|_2^2+2\sigma^2w'\kappa$ to select a model weight vector from the hypothesis space $\mathcal{H}^0=\big\{x^{*\prime}\hat{\theta}(w),\,w\in W^0\big\}$ and proved that the estimator of the weight vector asymptotically minimizes $L_n(w)$, where $\hat{\Omega}=(P_1y,\ldots,P_My)$, $\kappa=(k_1,\ldots,k_M)'$, and $\hat{\theta}(w)=\sum_{m=1}^{M}w_m\pi_m\hat{\theta}_m$. Hansen and Racine (2012) used the jackknife criterion $J_n(w)=\|y-\bar{\Omega}w\|_2^2$ to select a model weight vector from the hypothesis space $\mathcal{H}^0$ and proved that the estimator of the weight vector asymptotically minimizes $L_n(w)$ and $R_n(w)$, where $\bar{\Omega}=\big[y-D_1(I-P_1)y,\ldots,y-D_M(I-P_M)y\big]$ with $D_m=\mathrm{diag}[(1-h_{ii}^m)^{-1}]$ and $h_{ii}^m=x_i'\pi_m(X_m'X_m)^{-1}\pi_m'x_i$, $i=1,\ldots,n$.
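As a concrete illustration, the sketch below (our own, using synthetic data and nested candidate models; $\sigma^2$ is treated as known for simplicity, which is an assumption of this sketch rather than of the paper) builds the matrices $\hat{\Omega}$ and $\bar{\Omega}$ and evaluates the Mallows and jackknife criteria at a given weight vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 100, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1])
sigma2 = 1.0
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Nested candidate models: model m uses the first k_m columns of X.
k_list = [2, 4, 6]
M = len(k_list)

Omega_hat = np.empty((n, M))   # columns P_m y (fitted values of model m)
Omega_bar = np.empty((n, M))   # columns y - D_m (I - P_m) y (jackknife fits)
for m, k in enumerate(k_list):
    Xm = X[:, :k]
    Pm = Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T)
    resid = y - Pm @ y
    h = np.diag(Pm)                       # leverages h_ii^m
    Omega_hat[:, m] = Pm @ y
    Omega_bar[:, m] = y - resid / (1.0 - h)

kappa = np.array(k_list, dtype=float)

def mallows(w):
    """Mallows criterion C_n(w) = ||y - Omega_hat w||^2 + 2 sigma^2 w'kappa."""
    return np.sum((y - Omega_hat @ w) ** 2) + 2.0 * sigma2 * w @ kappa

def jackknife(w):
    """Jackknife criterion J_n(w) = ||y - Omega_bar w||^2."""
    return np.sum((y - Omega_bar @ w) ** 2)

w = np.full(M, 1.0 / M)                    # equal weights as an example
print(mallows(w), jackknife(w))
```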

Different from Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we focus on whether the model averaging estimator can asymptotically minimize the MSPE when the model weights are not restricted. Let $\hat{\gamma}=(x^{*\prime}\pi_1\hat{\theta}_1,\ldots,x^{*\prime}\pi_M\hat{\theta}_M)$. Then, the risk function and the empirical risk function are defined as

F(w,S)=E_{z^*}[y^*-x^{*\prime}\hat{\theta}(w)]^2=E_{z^*}(y^*-\hat{\gamma}w)^2

and

\hat{F}(w,S)=\frac{1}{n}\sum_{i=1}^{n}[y_i-x_i'\hat{\theta}(w)]^2=\frac{1}{n}\|y-\hat{\Omega}w\|_2^2,

respectively. Since Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012) restrict $w\in W^0$, the corresponding estimators of the weight vector do not necessarily asymptotically minimize $F(w,S)$ over $R^M$. An intuitive way to enable the estimator of the weight vector to asymptotically minimize $F(w,S)$ over $R^M$ is to remove the restriction $w\in W^0$ directly.

Let $\hat{P}$ and $\bar{P}$ be orthogonal matrices satisfying $\hat{P}'\hat{\Omega}'\hat{\Omega}\hat{P}=\mathrm{diag}(\hat{\zeta}_1,\ldots,\hat{\zeta}_M)$ and $\bar{P}'\bar{\Omega}'\bar{\Omega}\bar{P}=\mathrm{diag}(\bar{\zeta}_1,\ldots,\bar{\zeta}_M)$, where $\hat{\zeta}_1\leq\ldots\leq\hat{\zeta}_M$ and $\bar{\zeta}_1\leq\ldots\leq\bar{\zeta}_M$ are the eigenvalues of $\hat{\Omega}'\hat{\Omega}$ and $\bar{\Omega}'\bar{\Omega}$, respectively. We assume that $E_{z^*}(\hat{\gamma}'\hat{\gamma})$, $\hat{\Omega}'\hat{\Omega}$ and $\bar{\Omega}'\bar{\Omega}$ are invertible (this is reasonable under Assumption 3); then

\hat{w}^0=\arg\min_{w\in R^M}C_n(w)=(\hat{\Omega}'\hat{\Omega})^{-1}(\hat{\Omega}'y-\sigma^2\kappa),
\bar{w}^0=\arg\min_{w\in R^M}J_n(w)=(\bar{\Omega}'\bar{\Omega})^{-1}\bar{\Omega}'y,
\tilde{w}=\arg\min_{w\in R^M}\hat{F}(w,S)=(\hat{\Omega}'\hat{\Omega})^{-1}\hat{\Omega}'y,

and

\hat{w}^*=\arg\min_{w\in R^M}F(w,S)=[E_{z^*}(\hat{\gamma}'\hat{\gamma})]^{-1}E_{z^*}(\hat{\gamma}'y^*).

From this, we can see that, in order to satisfy consistency, $\hat{w}^0$ and $\bar{w}^0$ should be good estimators of $\hat{w}^*$. However, when the approximating models are highly correlated, the minimum eigenvalues of $\hat{\Omega}'\hat{\Omega}$ and $\bar{\Omega}'\bar{\Omega}$ may be so small that $\|\hat{w}^0\|_2^2=\sum_{m=1}^{M}\frac{a_m^2}{\hat{\zeta}_m^2}\geq\frac{a_1^2}{\hat{\zeta}_1^2}$ and $\|\bar{w}^0\|_2^2=\sum_{m=1}^{M}\frac{b_m^2}{\bar{\zeta}_m^2}\geq\frac{b_1^2}{\bar{\zeta}_1^2}$ become too large, which usually results in extreme weights, where $(a_1,a_2,\ldots,a_M)'=\hat{P}'\hat{\Omega}'y$ and $(b_1,b_2,\ldots,b_M)'=\bar{P}'\bar{\Omega}'y$. Therefore, similar to ridge regression in Hoerl and Kennard (1970), we make the following correction to $C_n(w)$ and $J_n(w)$:

C(w,S)=C_n(w)+\lambda_n w'w,
J(w,S)=J_n(w)+\lambda_n w'w,

where $\lambda_n>0$ is a tuning parameter. The above corrections are actually $L_2$-penalties for the weight vector. Let $\hat{Z}=(\hat{\Omega}'\hat{\Omega}+\lambda_n I)^{-1}\hat{\Omega}'\hat{\Omega}$ and $\bar{Z}=(\bar{\Omega}'\bar{\Omega}+\lambda_n I)^{-1}\bar{\Omega}'\bar{\Omega}$. Then

\hat{w}=\arg\min_{w\in R^M}C(w,S)=(\hat{\Omega}'\hat{\Omega}+\lambda_n I)^{-1}(\hat{\Omega}'y-\sigma^2\kappa)=\hat{Z}\hat{w}^0,

and

\bar{w}=\arg\min_{w\in R^M}J(w,S)=(\bar{\Omega}'\bar{\Omega}+\lambda_n I)^{-1}\bar{\Omega}'y=\bar{Z}\bar{w}^0.
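Since the penalized weights have the closed forms above, each can be computed with a single linear solve. The following minimal sketch (our own; the nearly collinear candidate fitted values, the choice of $\kappa$, and the known $\sigma^2$ are illustrative assumptions) also shows the motivating phenomenon: with $\lambda_n=0$ the weights can be extreme when the candidate fits are nearly collinear, while a positive penalty keeps them moderate.

```python
import numpy as np

def rmma_weights(Omega_hat, y, kappa, sigma2, lam):
    """Ridge-Mallows weights: argmin_w C(w,S) = C_n(w) + lam * w'w,
    i.e. w_hat = (Omega'Omega + lam I)^{-1} (Omega'y - sigma2 * kappa)."""
    M = Omega_hat.shape[1]
    A = Omega_hat.T @ Omega_hat + lam * np.eye(M)
    return np.linalg.solve(A, Omega_hat.T @ y - sigma2 * kappa)

def rjma_weights(Omega_bar, y, lam):
    """Ridge-jackknife weights: argmin_w J(w,S) = J_n(w) + lam * w'w,
    i.e. w_bar = (Omega_bar'Omega_bar + lam I)^{-1} Omega_bar'y."""
    M = Omega_bar.shape[1]
    A = Omega_bar.T @ Omega_bar + lam * np.eye(M)
    return np.linalg.solve(A, Omega_bar.T @ y)

# With lam = 0 these reduce to the unpenalized solutions w_hat^0 and w_bar^0;
# when the candidate fits are nearly collinear (small eigenvalues of
# Omega'Omega), the unpenalized weights can blow up while the penalized
# weights remain moderate.
if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, M = 100, 3
    base = rng.normal(size=n)
    # Nearly collinear candidate fitted values (illustrative, not from real models).
    Omega = np.column_stack([base + 0.01 * rng.normal(size=n) for _ in range(M)])
    y = base + rng.normal(size=n)
    kappa = np.array([2.0, 4.0, 6.0])
    print("lam = 0 :", np.round(rmma_weights(Omega, y, kappa, 1.0, 0.0), 2))
    print("lam = 10:", np.round(rmma_weights(Omega, y, kappa, 1.0, 10.0), 2))
```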

In the next subsection, we discuss the theoretical properties of $C(w,S)$ and $J(w,S)$.

3.3 Stability and Consistency

Let $\lambda_{max}(\cdot)$ and $\lambda_{min}(\cdot)$ be the maximum and minimum eigenvalues of a square matrix, respectively, and let $\chi$ be the value space of the $K$ covariates, where $K=k_M$. In order to discuss the stability and consistency of the proposed method, we need the following assumptions:

Assumption 1.

There are constants $C_1>0$ and $C_2>0$ such that

C_1\leq\lambda_{min}(n^{-1}X_M'X_M)\leq\lambda_{max}(n^{-1}X_M'X_M)\leq C_2K,\quad a.s..

Assumption 2.

There is a constant $C_3>0$ such that $E_{y^*}[(y^*)^2]\leq C_3$ and $n^{-1}y'y\leq C_3$, a.s.; there is a constant $C_4>0$ such that $\chi\subset B(\mathbf{0}_K,C_4)$, a.s., where $B(\mathbf{0}_K,C_4)$ is the ball with center $\mathbf{0}_K$ and radius $C_4$, and $\mathbf{0}_K$ is the $K$-dimensional zero vector.

Assumption 3.

There are constants $C_5>0$ and $C_6>0$ such that

C_5\leq n^{-1}\lambda_{min}(\hat{\Omega}'\hat{\Omega})\leq n^{-1}\lambda_{max}(\hat{\Omega}'\hat{\Omega})\leq C_6M,
C_5\leq\lambda_{min}[E_{z^*}(\hat{\gamma}'\hat{\gamma})]\leq\lambda_{max}[E_{z^*}(\hat{\gamma}'\hat{\gamma})]\leq C_6M,

and

C_5\leq n^{-1}\lambda_{min}(\bar{\Omega}'\bar{\Omega})\leq n^{-1}\lambda_{max}(\bar{\Omega}'\bar{\Omega})\leq C_6M,

a.s..

Assumption 4.

\lambda_n=O(M\log n),\quad \lambda_n-\lambda_{n-1}=O(K^3M).

Assumption 1 is mild, and similar conditions can be found in Zou and Zhang (2009) and Zhao et al. (2020). From $X_m'X_m=\pi_m'X_M'X_M\pi_m$, we see that, under Assumption 1, for any $m\in\{1,\ldots,M\}$, we have

C_1\leq\lambda_{min}(n^{-1}X_m'X_m)\leq\lambda_{max}(n^{-1}X_m'X_m)\leq C_2K,\quad a.s..

Shalev-Shwartz et al. (2010) assumed that the loss function is bounded, which is usually not satisfied in traditional regression analysis. We replace this assumption with Assumption 2. Tong and Wu (2017) assumed that $\chi\times\mathcal{Y}$ is a compact subset of $R^{K+1}$, under which Assumption 2 obviously holds. Assumption 3 requires the minimum eigenvalue of $\hat{\Omega}'\hat{\Omega}$ to have a lower bound away from 0 and the maximum eigenvalue to be of order $O(nM)$, a.s.. A similar assumption is used in Liao and Zou (2020). Lemma 3 guarantees the rationality of the assumptions about the eigenvalues of $E_{z^*}(\hat{\gamma}'\hat{\gamma})$ and $\bar{\Omega}'\bar{\Omega}$. Assumption 4 is a mild assumption on the tuning parameter, imposed in order to avoid an excessive penalty. In Section 3.4, we search for tuning parameters only in $[0,M\log n]$.

Let $\hat{V}(\lambda_n)=\|\hat{Z}\hat{w}^0-\hat{Z}\hat{w}^*\|_2^2$, $\hat{B}(\lambda_n)=\|\hat{Z}\hat{w}^*-\hat{w}^*\|_2^2$, $\bar{V}(\lambda_n)=\|\bar{Z}\bar{w}^0-\bar{Z}\hat{w}^*\|_2^2$, and $\bar{B}(\lambda_n)=\|\bar{Z}\hat{w}^*-\hat{w}^*\|_2^2$. We define

\hat{M}(\lambda_n)=\|\hat{Z}\hat{w}^0-\hat{w}^*\|_2^2=\hat{V}(\lambda_n)+\hat{B}(\lambda_n)+2(\hat{Z}\hat{w}^0-\hat{Z}\hat{w}^*)'(\hat{Z}\hat{w}^*-\hat{w}^*),
\bar{M}(\lambda_n)=\|\bar{Z}\bar{w}^0-\hat{w}^*\|_2^2=\bar{V}(\lambda_n)+\bar{B}(\lambda_n)+2(\bar{Z}\bar{w}^0-\bar{Z}\hat{w}^*)'(\bar{Z}\hat{w}^*-\hat{w}^*).

In order for $F(\hat{Z}\hat{w}^0,S)$ and $F(\bar{Z}\bar{w}^0,S)$ to better approximate $F(\hat{w}^*,S)$, we naturally want $E_S[\hat{M}(\lambda_n)]$ and $E_S[\bar{M}(\lambda_n)]$ to be as small as possible. In the following discussion, we call $E_S[\hat{M}(\lambda_n)]$ and $E_S[\bar{M}(\lambda_n)]$ the mean squared errors, $E_S[\hat{V}(\lambda_n)]$ and $E_S[\bar{V}(\lambda_n)]$ the estimation variances, and $E_S[\hat{B}(\lambda_n)]$ and $E_S[\bar{B}(\lambda_n)]$ the estimation biases. Obviously, when $\lambda_n=0$, $\hat{Z}=\bar{Z}=I_M$, which means the estimation bias equals zero. From Lemma 4 and the proof of Theorem 3.4, we see that, under Assumptions 1-4, $\hat{B}(\lambda_n)$ and $\bar{B}(\lambda_n)$ are $O(n^{-2}M^4\log^2 n)$, a.s.. On the other hand, the existence of extreme weights may make the performance of $\hat{w}^0$ and $\bar{w}^0$ extremely unstable. So the purpose of the $L_2$-penalty is to reduce the estimation variance by introducing estimation bias, and thus make the performance of the model averaging estimator more stable. Further, we define

\hat{M}_1(\lambda_n)=\hat{V}(\lambda_n)+\hat{B}(\lambda_n),
\bar{M}_1(\lambda_n)=\bar{V}(\lambda_n)+\bar{B}(\lambda_n).

Like Theorem 4.3 in Hoerl and Kennard (1970), we give the following theorem to illustrate the reasonability of introducing the $L_2$-penalty:

Theorem 3.4.

Let $\hat{\lambda}_n=\min\{\lambda_n:\frac{d}{d\lambda_n}\hat{M}_1(\lambda_n)=0\}$ and $\bar{\lambda}_n=\min\{\lambda_n:\frac{d}{d\lambda_n}\bar{M}_1(\lambda_n)=0\}$. Then, 1) when $\hat{w}^0\neq\hat{w}^*$, $\hat{\lambda}_n>0$ and $\hat{M}_1(\hat{\lambda}_n)<\hat{M}_1(0)$; 2) when $\bar{w}^0\neq\hat{w}^*$, $\bar{\lambda}_n>0$ and $\bar{M}_1(\bar{\lambda}_n)<\bar{M}_1(0)$.

Theorem 3.4 shows that the use of the $L_2$-penalty reduces the estimation variance by introducing estimation bias. However, since $\hat{w}^*$ is unknown, $\hat{\lambda}_n$ and $\bar{\lambda}_n$ are also unknown. In Section 3.4, we use cross-validation to select the tuning parameter $\lambda_n$. The following theorem shows that $C(w,S)$ and $J(w,S)$ are AERMs.

Theorem 3.5.

Under Assumptions 1-4, $C(w,S)$ and $J(w,S)$ are AERMs with rates $n^{-1}\log n\,KM^2(1+n^{-2}K^2)$ and $n^{-1}\log n\,K^3M^2$, respectively.

The following theorem shows that $C(w,S)$, $J(w,S)$ and $F(w,S)$ have FLoo stability and PLoo stability.

Theorem 3.6.

Under Assumptions 1-4, $C(w,S)$, $J(w,S)$ and $F(w,S)$ have FLoo stability and PLoo stability with rates $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^2)$, $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}$ and $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}$, respectively.

It can be seen from Theorems 2.1, 2.2 and 3.6 that $C(w,S)$, $J(w,S)$ and $F(w,S)$ have Ro stability and generalize. The following theorem shows that $C(w,S)$ and $J(w,S)$ are consistent, which is a direct consequence of Theorems 2.1-2.3 and 3.5-3.6.

Theorem 3.7.

Under Assumptions 1-4, $C(w,S)$ and $J(w,S)$ are consistent with rates $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}(1+n^{-2}K^2)$ and $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}$, respectively.

3.4 Optimal Weighting Based on Cross Validation

Although Theorem 3.4 shows that there are $\hat{\lambda}_n$ and $\bar{\lambda}_n$ such that $\hat{w}$ and $\bar{w}$ are better approximations of $\hat{w}^*$, $\hat{\lambda}_n$ and $\bar{\lambda}_n$ cannot be obtained in practice. Therefore, like Schomaker (2012), we propose an algorithm based on 10-fold cross-validation to obtain the estimator of the weight vector, which is a weighted average of the weight estimators for different $\lambda_n$. That is, we first select 100 equally spaced points in $[0,M\log n]$ as the candidates for $\lambda_n$. Then we calculate the estimation error for each candidate $\lambda_n$ by 10-fold cross-validation. Based on this, we remove the candidates with large estimation errors. Last, for the remaining candidates, the estimation errors are used to form a weighted average of the estimators of the weight vector. We summarize our algorithm for RMMA below; a similar algorithm can be given for RJMA.

Algorithm 1 Optimal weighting based on cross-validation

Input: $S$.
Output: $\hat{w}$.
1: Set $\hat{E}_L=0$, $L=1,\ldots,100$;
2: Randomly divide the sample set $S$ into 10 subsets of equal size, and denote the index set of the observations belonging to the $B$-th subset by $S_B$, $B=1,\ldots,10$;
3: for each $B\in\{1,2,\ldots,10\}$ do
4:     Let $S_{train}=\{z_i, i\notin S_B\}$ be the training set and $S_{test}=\{z_i, i\in S_B\}$ be the testing set;
5:     Obtain $\hat{\theta}_m^B$ based on $S_{train}$, $m=1,\ldots,M$;
6:     for each $L\in\{1,2,\ldots,100\}$ do
7:         Obtain $\hat{w}_{BL}$ based on $\lambda_n=\frac{(L-1)M\log n}{99}$ and $C(w,S_{train})$;
8:         Compute the estimation error of $\hat{w}_{BL}$ on $S_{test}$,
           \hat{E}(\hat{w}_{BL})=\sum_{z_i\in S_{test}}[y_i-x_i'\hat{\theta}^B(\hat{w}_{BL})]^2,
           where $\hat{\theta}^B(w)=\sum_{m=1}^{M}w_m\pi_m\hat{\theta}_m^B$;
9:         Set $\hat{E}_L=\hat{E}_L+\hat{E}(\hat{w}_{BL})$;
10: Let $S_\lambda$ be the index set of the 50 smallest values in $\{\hat{E}_L, L=1,\ldots,100\}$;
11: For each $L\in S_\lambda$, obtain $\hat{w}_L$ based on $\lambda_n=\frac{(L-1)M\log n}{99}$, $S$, and $C(w,S)$;
12: Output $\hat{w}=\sum_{L\in S_\lambda}\frac{\exp(-0.5\hat{E}_L)}{\sum_{L\in S_\lambda}\exp(-0.5\hat{E}_L)}\hat{w}_L$.
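A minimal Python sketch of Algorithm 1 for RMMA is given below. It is our own illustrative implementation under simplifying assumptions: nested least-squares candidate models, a synthetic data set, and $\sigma^2$ treated as known (in practice it would be replaced by an estimate, e.g., from the largest approximating model). Shifting the estimation errors by their minimum before exponentiating does not change the normalized weights; it only prevents numerical underflow.

```python
import numpy as np

def candidate_fits(X, y, k_list):
    """Fitted-value matrix Omega_hat and coefficients for nested candidates."""
    thetas, cols = [], []
    for k in k_list:
        theta, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        thetas.append(theta)
        cols.append(np.arange(k))
    Omega = np.column_stack([X[:, c] @ t for t, c in zip(thetas, cols)])
    return Omega, thetas, cols

def rmma_weights(Omega, y, kappa, sigma2, lam):
    """Closed-form ridge-Mallows weights for a given penalty lam."""
    M = Omega.shape[1]
    return np.linalg.solve(Omega.T @ Omega + lam * np.eye(M),
                           Omega.T @ y - sigma2 * kappa)

def cv_weighted_rmma(X, y, k_list, sigma2=1.0, n_grid=100, n_folds=10, seed=0):
    """Sketch of Algorithm 1: 10-fold CV over a lambda grid on [0, M log n],
    then an exponentially weighted average of the weight estimators whose
    cross-validation errors are among the 50 smallest."""
    n, M = len(y), len(k_list)
    kappa = np.array(k_list, dtype=float)
    lam_grid = np.linspace(0.0, M * np.log(n), n_grid)
    E = np.zeros(n_grid)

    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Om_tr, thetas, cols = candidate_fits(X[train_idx], y[train_idx], k_list)
        # predictions of each candidate model on the held-out fold
        preds = np.column_stack([X[np.ix_(test_idx, c)] @ t
                                 for t, c in zip(thetas, cols)])
        for L, lam in enumerate(lam_grid):
            w = rmma_weights(Om_tr, y[train_idx], kappa, sigma2, lam)
            E[L] += np.sum((y[test_idx] - preds @ w) ** 2)

    keep = np.argsort(E)[: n_grid // 2]           # 50 smallest CV errors
    Om, _, _ = candidate_fits(X, y, k_list)
    W = np.column_stack([rmma_weights(Om, y, kappa, sigma2, lam_grid[L])
                         for L in keep])
    a = np.exp(-0.5 * (E[keep] - E[keep].min()))  # shift for numerical stability
    return W @ (a / a.sum())

# toy usage with synthetic data and nested candidate models (illustrative)
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, K = 200, 6
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    y = X @ np.array([1, .8, .6, .4, .2, .1]) + rng.normal(size=n)
    print(np.round(cv_weighted_rmma(X, y, [2, 4, 6]), 3))
```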

4 Simulation Study

In this section, we conduct simulation experiments to demonstrate the finite sample performance of the proposed method. Similar to Hansen (2007), we consider the following data generating process:

y_i=\mu_i+e_i=\sum_{k=1}^{K_t}x_{ki}\theta_k+e_i,\quad i=1,\ldots,n,

where $\{\theta_k, k=1,2,\ldots,K_t\}$ are the model parameters, $x_{1i}\equiv 1$, $(x_{2i},\ldots,x_{K_ti})\sim N(0,\Sigma)$, and $(e_1,e_2,\ldots,e_n)\sim N[0,\mathrm{diag}(\sigma_1^2,\sigma_2^2,\ldots,\sigma_n^2)]$. We set $n=100,300,500,700$, $\alpha=0.5,1.0,1.5$, $\Sigma=(\sigma_{kl})$ with $\sigma_{kl}=\rho^{|k-l|}$ and $\rho=0.3,0.6$, and $R^2=0.1,\ldots,0.9$, where the population $R^2=\frac{var(\sum_{k=1}^{K_t}x_{ki}\theta_k)}{var(\sum_{k=1}^{K_t}x_{ki}\theta_k+e_i)}$. For the homoskedastic simulation we set $\sigma_i^2\equiv 1$, while for the heteroskedastic simulation we set $\sigma_i^2=x_{2i}^2$.
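For reference, the sketch below generates one data set from this design. It is our own minimal implementation; it borrows the coefficient sequence $\theta_k=c\sqrt{2\alpha}k^{-\alpha-1/2}$ from the nested setting of Section 4.1 as an assumption, in order to solve for $c$ from the target population $R^2$.

```python
import numpy as np

def simulate(n, K_t=400, alpha=1.0, rho=0.3, R2=0.5, hetero=False, seed=0):
    """Generate one data set from the simulation design (our own sketch).
    theta_k = c * sqrt(2*alpha) * k^{-alpha-1/2}, with c chosen so that the
    population R^2 equals the target (exactly under homoskedastic unit errors)."""
    rng = np.random.default_rng(seed)
    # AR(1)-type covariance sigma_{kl} = rho^{|k-l|} for the non-constant covariates
    idx = np.arange(K_t - 1)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = np.column_stack([np.ones(n),
                         rng.multivariate_normal(np.zeros(K_t - 1), Sigma, size=n)])
    theta = np.sqrt(2 * alpha) * np.arange(1, K_t + 1) ** (-alpha - 0.5)
    # var(x'theta): the constant covariate does not contribute to the variance.
    v = theta[1:] @ Sigma @ theta[1:]
    # With unit error variance, R^2 = c^2 v / (c^2 v + 1)  =>  solve for c.
    c = np.sqrt(R2 / ((1 - R2) * v))
    theta *= c
    mu = X @ theta
    sigma = np.abs(X[:, 1]) if hetero else np.ones(n)   # sigma_i^2 = x_{2i}^2
    y = mu + sigma * rng.normal(size=n)
    return X, y, mu

X, y, mu = simulate(n=300, alpha=1.0, R2=0.5)
```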

We compare the following ten model selection/averaging methods: 1) model selection with AIC (AI), model selection with BIC (BI), and model selection with $C_p$ (Cp); 2) model averaging with smoothed AIC (SA), and model averaging with smoothed BIC (SB); 3) Mallows model averaging (MM), jackknife model averaging (JM), and least squares model averaging based on generalized cross-validation (GM) (Li et al., 2021); 4) Ridge-Mallows model averaging (RM), and Ridge-jackknife model averaging (RJ). To evaluate these methods, we generate a test set $\{(\mu_i^*,x_i^*), i=1,\ldots,n_t\}$ from the above data generating process, and

MSE=n_t^{-1}\sum_{i=1}^{n_t}[\mu_i^*-x_i^{*\prime}\hat{\theta}(\hat{w})]^2

is calculated as a measure of consistency. In the simulation, we set $n_t=1000$ and repeat the process 200 times. For each parameterization, we normalize the MSE by dividing it by the infeasible MSE (the mean, over the 200 replications, of the smallest MSE among the $M$ approximating models).

We consider two simulation settings. In the first setting, like Hansen (2007), all candidate models are misspecified, and candidate models are strictly nested. In the second setting, the true model is one of candidate models, and candidate models are non-nested.

4.1 Nested Setting and Results

We set $K_t=400$, $\theta_k=c\sqrt{2\alpha}k^{-\alpha-\frac{1}{2}}$, and $K=\log_4^2 n$ (i.e., $K=11,17,20,22$), where $R^2$ is controlled by the parameter $c$. For $\rho=0.3$, the mean of the normalized MSEs over the 200 replications is shown in Figures 1-6. The results with $\rho=0.6$ are similar and so are omitted to save space.

For the homoskedastic case, we can draw the following conclusions from Figures 1-3. When $\alpha=0.5$ (Figure 1): 1) RM and RJ perform better than the other methods if $R^2\leq 0.5$ and $n=300$, $500$ and $700$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2>0.5$, and worse than MM, JM and GM if $R^2=0.1$ and $n=100$; 3) RM performs better than RJ in most cases if $n=100$ and $300$. When $\alpha=1.0$ (Figure 2): 1) RM performs better than the other methods if $R^2\leq 0.8$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2=0.9$; 3) RM performs better than RJ in most cases if $n=100$ and $300$. When $\alpha=1.5$ (Figure 3): 1) RM and RJ always perform better than the other methods; 2) RM performs better than RJ in most cases if $n=100$ and $300$.

For the heteroskedastic case, we can draw the following conclusions from Figures 4-6. When $\alpha=0.5$ (Figure 4): 1) RM performs better than the other methods if $R^2\leq 0.5$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2>0.5$; 3) RM performs better than RJ. When $\alpha=1.0$ (Figure 5): 1) RM performs better than the other methods if $R^2\leq 0.7$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2>0.7$; 3) RM performs better than RJ, and RJ performs better than the other methods in most cases. When $\alpha=1.5$ (Figure 6): 1) RM always performs better than the other methods; 2) RM always performs better than RJ, and RJ performs better than the other methods in most cases.

From Figures 1-6, we can draw the following conclusions: 1) As $\alpha$ increases, RM and RJ perform better and better; 2) $n$ has no obvious influence on the performance comparison of the various methods; 3) As $R^2$ increases, the performance of all methods improves, but RJ and RM never perform very badly and are the best in most cases.

To sum up, the conclusions are as follows: 1) RM and RJ are the best in most cases and are not bad when not the best; 2) When $\alpha$ is small and $R^2$ is large, GM performs better, and RM and RJ are the best in the other cases; 3) RM performs better than RJ.

4.2 Non-nested Setting and Results

We set $K_t=12$ and $\theta_k=c\sqrt{2\alpha}k^{-\alpha-\frac{1}{2}}$ for $1\leq k\leq 10$, and $\theta_k=0$ for $k=11,12$, where $R^2$ is controlled by the parameter $c$. Each approximating model contains the first 6 covariates, and the last 6 covariates are combined to obtain $2^6$ approximating models. For $\rho=0.3$, the mean of the normalized MSEs over the 200 replications is shown in Figures 7-12. As in the nested case, the results with $\rho=0.6$ are similar and so are omitted.

For this setting, we can draw the following conclusions from Figures 7-12. When $\alpha=0.5$ (Figures 7 and 10): 1) RM and RJ perform better than the other methods if $R^2\leq 0.5$; 2) As $n$ increases, RM and RJ perform worse than the other methods for larger $R^2$; 3) For the homoskedastic case, RJ performs better than RM in most cases. When $\alpha=1.0$ (Figures 8 and 11): 1) RM and RJ perform better than the other methods except SB if $R^2\leq 0.7$, but the performance of SB is very unstable; 2) As $n$ increases, RM and RJ perform worse than the other methods for larger $R^2$; 3) RJ performs better than RM in most cases. When $\alpha=1.5$ (Figures 9 and 12): 1) RM and RJ perform better than the other methods except BI and SB, but the performance of SB and BI is very unstable; 2) RJ performs better than RM in most cases.

To sum up, the conclusions are as follows: 1) RM and RJ are the best in most cases and have stable performance; 2) One of SB, SA, BI, and AI may perform best when $R^2$ is small or large, but their performance is unstable compared with RM and RJ; 3) In the heteroskedastic case with larger $\alpha$ and $n$, and in the homoskedastic case, RJ performs better than RM on the whole.

5 Real Data Analysis

In this section, we apply the proposed method to the "wage1" dataset in Wooldridge (2003), which comes from the US Current Population Survey for the year 1976. There are 526 observations in this dataset. The response variable is the log of average hourly earnings, while the covariates include: 1) dummy variables: nonwhite, female, married, numdep, smsa, northcen, south, west, construc, ndurman, trcommpu, trade, services, profserv, profocc, clerocc, and servocc; 2) non-dummy variables: educ, exper, and tenure; 3) interaction variables: nonwhite $\times$ educ, nonwhite $\times$ exper, nonwhite $\times$ tenure, female $\times$ educ, female $\times$ exper, female $\times$ tenure, married $\times$ educ, married $\times$ exper, and married $\times$ tenure.

We consider the following two cases: 1) we rank the covariates according to their linear correlations with the response variable and then apply the strictly nested model averaging methods (the intercept term is included and ranked first); 2) 100 models are selected by using the function "regsubsets" in the "leaps" package of R, where the parameters "nvmax" and "nbest" are set to 20 and 5, respectively, and the other parameters take their default values.

We randomly divide the data into two parts: a training sample $S$ of $n$ observations for estimating the models and a test sample $S_t$ of $n_t=526-n$ observations for validating the results. We consider $n=110,210,320,420$, and

MSE=n_t^{-1}\sum_{z_i\in S_t}[y_i-x_i'\hat{\theta}(\hat{w})]^2

is calculated as a measure of consistency. We replicate the process 200 times. The box plots of the MSEs over the 200 replications are shown in Figures 13-14. From these figures, we see that the performance of RM and RJ is good and stable. We also compute the mean and median of the MSEs, as well as the best performance rate (BPR), which is the frequency of achieving the lowest risk across the replications. The results are shown in Tables 1-2. From these tables, we can draw the following conclusions: 1) RM and RJ are always superior to the other methods in terms of the mean and median of the MSEs and the BPR; 2) The performance of RM and RJ is basically the same in terms of the mean and median of the MSEs; 3) For the BPR, RM outperforms RJ on the whole.
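The evaluation protocol described above (repeated random splits, test MSE, and BPR) can be sketched as follows; the method callables and their names are placeholders for the fitted estimators, not part of the paper's code.

```python
import numpy as np

def evaluate_methods(X, y, methods, n_train, n_rep=200, seed=0):
    """Repeated random train/test splits. Returns an (n_rep, n_methods) MSE
    matrix and the best performance rate (BPR) of each method, i.e. the
    frequency with which it attains the lowest test MSE across replications.
    `methods` maps a name to a callable (X_tr, y_tr, X_te) -> predictions;
    these callables (for RM, RJ, MM, ...) are assumed to be defined elsewhere."""
    rng = np.random.default_rng(seed)
    names = list(methods)
    n = len(y)
    mse = np.empty((n_rep, len(names)))
    for r in range(n_rep):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        for j, name in enumerate(names):
            pred = methods[name](X[tr], y[tr], X[te])
            mse[r, j] = np.mean((y[te] - pred) ** 2)
    best = mse.argmin(axis=1)
    bpr = {name: float(np.mean(best == j)) for j, name in enumerate(names)}
    return mse, bpr
```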

Table 1: The mean, median and BPR of MSE in Case 1
n Statistic AI Cp BI SA SB MM RM GM JM RJ
110 Mean 0.185 0.182 0.184 0.180 0.179 0.169 0.164 0.178 0.168 0.164
Median 0.179 0.176 0.182 0.174 0.177 0.167 0.162 0.174 0.166 0.162
BPR 0.011 0.000 0.006 0.061 0.039 0.028 0.376 0.044 0.088 0.348
210 Mean 0.160 0.159 0.169 0.157 0.166 0.155 0.152 0.155 0.155 0.152
Median 0.160 0.159 0.169 0.156 0.166 0.154 0.152 0.155 0.155 0.153
BPR 0.030 0.000 0.010 0.075 0.000 0.020 0.460 0.080 0.035 0.290
320 Mean 0.152 0.152 0.156 0.150 0.154 0.148 0.146 0.148 0.148 0.146
Median 0.152 0.152 0.156 0.150 0.154 0.148 0.145 0.148 0.147 0.146
BPR 0.010 0.000 0.050 0.160 0.035 0.045 0.315 0.070 0.020 0.295
420 Mean 0.152 0.152 0.150 0.150 0.151 0.148 0.147 0.149 0.148 0.147
Median 0.151 0.150 0.148 0.147 0.149 0.147 0.147 0.148 0.147 0.147
BPR 0.025 0.005 0.105 0.175 0.060 0.030 0.215 0.055 0.030 0.300
Table 2: The mean, median and BPR of MSE in Case 2
n Statistic AI Cp BI SA SB MM RM GM JM RJ
110 Mean 0.167 0.167 0.171 0.161 0.163 0.158 0.152 0.160 0.158 0.152
Median 0.164 0.164 0.170 0.158 0.163 0.157 0.151 0.158 0.157 0.151
BPR 0.011 0.000 0.005 0.081 0.000 0.016 0.400 0.011 0.011 0.465
210 Mean 0.152 0.151 0.157 0.149 0.153 0.148 0.145 0.148 0.148 0.145
Median 0.151 0.151 0.157 0.148 0.153 0.147 0.144 0.148 0.148 0.144
BPR 0.000 0.005 0.000 0.170 0.010 0.005 0.460 0.005 0.015 0.330
320 Mean 0.147 0.146 0.152 0.144 0.148 0.145 0.142 0.145 0.145 0.142
Median 0.145 0.145 0.150 0.143 0.147 0.144 0.142 0.143 0.144 0.142
BPR 0.010 0.000 0.000 0.280 0.040 0.000 0.340 0.000 0.000 0.330
420 Mean 0.148 0.148 0.152 0.144 0.148 0.146 0.143 0.146 0.146 0.143
Median 0.147 0.147 0.151 0.144 0.147 0.145 0.141 0.145 0.145 0.141
BPR 0.005 0.000 0.000 0.370 0.055 0.000 0.265 0.000 0.010 0.295

6 Concluding Remarks

In this paper, we study the relationships between AERM, stability, generalization and consistency in model averaging. The results indicate that stability is an important property of model averaging, which can ensure that model averaging has good generalization performance and is consistent under reasonable conditions. When the model weights are not restricted, extreme weights tend to appear in MMA and JMA under the influence of correlation between candidate models, resulting in poor performance of the corresponding model averaging estimator. So, similar to ridge regression in Hoerl and Kennard (1970), we propose an $L_2$-penalty model averaging method and prove that it has stability and consistency. In order to reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of the model weights based on estimation errors. The numerical simulation and real data analysis show the superiority of the proposed method.

Many issues deserve further investigation. We have only applied the methods of Section 2 to the generalization of MMA and JMA in linear regression; it is worth investigating whether the proposed method can be extended to more complex scenarios, such as generalized linear models, quantile regression, and dependent data. Further, we also expect to propose a model averaging framework with stability and consistency that can be applied to multiple scenarios. In addition, in RMMA and RJMA, the estimators of the weight vector have explicit expressions, so how to study their asymptotic behavior based on these expressions is a meaningful but challenging topic.

Figure 1: The mean of normalized MSE under homoskedastic errors with $\alpha=0.5$ in nested setting
Figure 2: The mean of normalized MSE under homoskedastic errors with $\alpha=1.0$ in nested setting
Figure 3: The mean of normalized MSE under homoskedastic errors with $\alpha=1.5$ in nested setting
Figure 4: The mean of normalized MSE under heteroskedastic errors with $\alpha=0.5$ in nested setting
Figure 5: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.0$ in nested setting
Figure 6: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.5$ in nested setting
Figure 7: The mean of normalized MSE under homoskedastic errors with $\alpha=0.5$ in non-nested setting
Figure 8: The mean of normalized MSE under homoskedastic errors with $\alpha=1.0$ in non-nested setting
Figure 9: The mean of normalized MSE under homoskedastic errors with $\alpha=1.5$ in non-nested setting
Figure 10: The mean of normalized MSE under heteroskedastic errors with $\alpha=0.5$ in non-nested setting
Figure 11: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.0$ in non-nested setting
Figure 12: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.5$ in non-nested setting
Figure 13: The box plot of MSE in Case 1
Figure 14: The box plot of MSE in Case 2

7 Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant nos. 11971323 and 12031016).

Appendix A Lemmas and Proofs

Let $\hat{\theta}^{-n}(w)$ and $\hat{\theta}^{-(n-1,n)}(w)$ be the model averaging estimators corresponding to $\hat{\theta}(w)$ based on $S^{-n}$ and $S^{-(n-1,n)}$, respectively, where $S^{-(n-1,n)}$ denotes the sample set after removing the $(n-1)$-th and $n$-th observations from $S$.

Lemma 1.

Under Assumptions 1 and 2, there exists a constant $B_1>0$ such that $\max_{1\leq m\leq M}\|\hat{\theta}_m\|_2^2\leq B_1K$, a.s..

Proof.

It follows from Assumptions 1 and 2 that

\begin{aligned}
\max_{1\leq m\leq M}\|\hat{\theta}_m\|_2^2
&=\max_{1\leq m\leq M}\|(X_m'X_m)^{-1}X_m'y\|_2^2\\
&=\max_{1\leq m\leq M}y'X_m(X_m'X_m)^{-1}(X_m'X_m)^{-1}X_m'y\\
&\leq C_1^{-2}C_2Kn^{-1}y'y\\
&\leq C_1^{-2}C_2C_3K,\quad a.s..
\end{aligned}

We take $B_1=C_1^{-2}C_2C_3$ to complete the proof.

Lemma 2.

Under Assumptions 1 and 2, we have

ES(max1mMθ^mθ^mn22)=O(n2K3),E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}),

and

ES(max1mMθ^mnθ^m(n1,n)22)=O(n2K3).E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}^{-n}-\hat{\theta}^{-(n-1,n)}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}).

Proof.

By Dufour (1982), we see that for any m{1,,M}m\in\{1,...,M\},

θ^m=θ^mn+(XmXm)1πmxn(ynxnπmθ^mn).\hat{\theta}_{m}=\hat{\theta}^{-n}_{m}+(X_{m}^{{}^{\prime}}X_{m})^{-1}\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m}).

It follows from Assumption 1 that

ES[max1mMθ^mθ^mn22]\displaystyle E_{S}\Big{[}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{]}
=ES[max1mMxnπm(XmXm)1(XmXm)1πmxn(ynxnπmθ^mn)2]\displaystyle=E_{S}\Big{[}\max_{1\leq m\leq M}x_{n}^{{}^{\prime}}\pi_{m}(X_{m}^{{}^{\prime}}X_{m})^{-1}(X_{m}^{{}^{\prime}}X_{m})^{-1}\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}\Big{]}
ES[max1mMλmax2[(XmXm)1]πmxn(ynxnπmθ^mn)22]\displaystyle\leq E_{S}\Big{[}\max_{1\leq m\leq M}\lambda_{max}^{2}[(X_{m}^{{}^{\prime}}X_{m})^{-1}]\|\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})\|_{2}^{2}\Big{]}
ES[C12n2max1mMπmxn(ynxnπmθ^mn)22].\displaystyle\leq E_{S}\Big{[}C_{1}^{-2}n^{-2}\max_{1\leq m\leq M}\|\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})\|_{2}^{2}\Big{]}. (1)

From Assumption 2, we obtain

ES[max1mMπmxn(ynxnπmθ^mn)22]\displaystyle E_{S}\Big{[}\max_{1\leq m\leq M}\|\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})\|_{2}^{2}\Big{]}
ES[max1mMk=1Kx(k)n2(ynxnπmθ^mn)2]\displaystyle\leq E_{S}\Big{[}\max_{1\leq m\leq M}\sum_{k=1}^{K}x_{(k)n}^{2}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}\Big{]}
C42KES[max1mM(ynxnπmθ^mn)2]\displaystyle\leq C_{4}^{2}KE_{S}\Big{[}\max_{1\leq m\leq M}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}]
C42KES[max1mM(2yn2+2(xnπmθ^mn)2)].\displaystyle\leq C_{4}^{2}KE_{S}[\max_{1\leq m\leq M}(2y_{n}^{2}+2(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2})\Big{]}. (2)

Further, from Assumption 2 and Lemma 1, we have

ES[max1mM(xnπmθ^mn)2]\displaystyle E_{S}\Big{[}\max_{1\leq m\leq M}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}\Big{]}
ES(max1mMxn22θ^mn22)\displaystyle\leq E_{S}\Big{(}\max_{1\leq m\leq M}\|x_{n}\|_{2}^{2}\|\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}
C42KES(max1mMθ^mn22)\displaystyle\leq C_{4}^{2}KE_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}
B1C42K2.\displaystyle\leq B_{1}C_{4}^{2}K^{2}. (3)

Combining (1)(3)(\ref{equ:1})-(\ref{equ:3}), it is seen that

ES(max1mMθ^mθ^mn22)=O(n2K3).E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}).

In a similar way, it can be shown that

ES(max1mMθ^mnθ^m(n1,n)22)=O(n2K3).E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}^{-n}-\hat{\theta}^{-(n-1,n)}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}).
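As a numerical sanity check on the recursive leave-one-out identity of Dufour (1982) invoked at the beginning of this proof, the following sketch simulates a single candidate model (with \pi_{m} taken as the identity, so X_{m}=X) and verifies that updating \hat{\theta}_{m}^{-n} by (X_{m}^{\prime}X_{m})^{-1}\pi_{m}^{\prime}x_{n}(y_{n}-x_{n}^{\prime}\pi_{m}\hat{\theta}_{m}^{-n}) recovers the full-sample estimator up to numerical precision; the simulated design and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 4
X = rng.normal(size=(n, k))                      # design of one candidate model (pi_m = identity)
y = X @ rng.normal(size=k) + rng.normal(size=n)

theta_full = np.linalg.solve(X.T @ X, X.T @ y)   # hat{theta}_m based on the full sample S
Xm, ym = X[:-1], y[:-1]                          # S^{-n}: drop the n-th observation
theta_loo = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)

# Dufour (1982): hat{theta}_m = hat{theta}_m^{-n} + (X_m'X_m)^{-1} x_n (y_n - x_n' hat{theta}_m^{-n})
x_n, y_n = X[-1], y[-1]
theta_updated = theta_loo + np.linalg.solve(X.T @ X, x_n) * (y_n - x_n @ theta_loo)

print(np.max(np.abs(theta_full - theta_updated)))  # of the order of machine precision
```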

Lemma 3.

Under Assumptions 1 and 2, we have

ES,z(γ^γ^)=n1ES(Ω^Ω^)+O(n1K3),E_{S,z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})=n^{-1}E_{S}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega})+O(n^{-1}K^{3}),

and

n1ES(Ω¯Ω¯)=n1ES(Ω^Ω^)+O[n(C1nC42K)2K].n^{-1}E_{S}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega})=n^{-1}E_{S}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega})+O[n(C_{1}n-C_{4}^{2}K)^{-2}K].

Proof.

Note that

γ^γ^=(xπmθ^mθ^tπtx)M×M,\hat{\gamma}^{{}^{\prime}}\hat{\gamma}=(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})_{M\times M},
Ω¯Ω¯={[yDm(yPmy)][yDt(yPty)]}M×M,\bar{\Omega}^{{}^{\prime}}\bar{\Omega}=\big{\{}[y-D_{m}(y-P_{m}y)]^{{}^{\prime}}[y-D_{t}(y-P_{t}y)]\big{\}}_{M\times M},

and

Ω^Ω^=(yPmPty)M×M=(i=1nxiπmθ^mθ^tπtxi)M×M.\hat{\Omega}^{{}^{\prime}}\hat{\Omega}=(y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{t}y)_{M\times M}=\Big{(}\sum_{i=1}^{n}x_{i}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{i}\Big{)}_{M\times M}.

It follows from Assumption 2, Lemma 1 and Lemma 2 that

|ES(xnπmθ^mθ^tπtxnxnπmθ^mnθ^tnπtxn)|\displaystyle|E_{S}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x_{n})|
|ES[xnπm(θ^mθ^mn)θ^tπtxn]|+|ES[xnπmθ^mn(θ^tθ^tn)πtxn]|\displaystyle\leq|E_{S}[x_{n}^{{}^{\prime}}\pi_{m}(\hat{\theta}_{m}-\hat{\theta}_{m}^{-n})\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}]|+|E_{S}[x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}(\hat{\theta}_{t}-\hat{\theta}_{t}^{-n})^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}]|
ES(θ^mθ^mn22)ES(xnπtθ^tπmxn22)+ES(xnπmθ^mnπtxn22)ES(θ^tθ^tn22)\displaystyle\leq\sqrt{E_{S}(\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}^{2})}\sqrt{E_{S}(\|x_{n}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}\pi_{m}^{{}^{\prime}}x_{n}\|_{2}^{2})}+\sqrt{E_{S}(\|x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\pi_{t}^{{}^{\prime}}x_{n}\|_{2}^{2})}\sqrt{E_{S}(\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}^{2})}
ES(θ^mθ^mn22)ES(xn22θ^t22xn22)+ES(xn22θ^mn22xn22)ES(θ^tθ^tn22)\displaystyle\leq\sqrt{E_{S}(\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}^{2})}\sqrt{E_{S}(\|x_{n}\|_{2}^{2}\|\hat{\theta}_{t}\|_{2}^{2}\|x_{n}\|_{2}^{2})}+\sqrt{E_{S}(\|x_{n}\|_{2}^{2}\|\hat{\theta}_{m}^{-n}\|_{2}^{2}\|x_{n}\|_{2}^{2})}\sqrt{E_{S}(\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}^{2})}
=O(n1K3).\displaystyle=O(n^{-1}K^{3}).

In a similar way, we obtain

|ES,z(xπmθ^mnθ^tnπtxxπmθ^mθ^tπtx)|=O(n1K3).|E_{S,z^{*}}(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})|=O(n^{-1}K^{3}).

Further, it is seen that

ES,z(1ni=1nxiπmθ^mθ^tπtxixπmθ^mθ^tπtx)\displaystyle E_{S,z^{*}}(\frac{1}{n}\sum_{i=1}^{n}x_{i}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{i}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=1ni=1nES,z(xiπmθ^mθ^tπtxixπmθ^mθ^tπtx)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}E_{S,z^{*}}(x_{i}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{i}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=ES,z(xnπmθ^mθ^tπtxnxπmθ^mθ^tπtx)\displaystyle=E_{S,z^{*}}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=ES,z(xnπmθ^mθ^tπtxnxπmθ^mnθ^tnπtx)+ES,z(xπmθ^mnθ^tnπtxxπmθ^mθ^tπtx)\displaystyle=E_{S,z^{*}}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})+E_{S,z^{*}}(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=ES(xnπmθ^mθ^tπtxnxnπmθ^mnθ^tnπtxn)+ES,z(xπmθ^mnθ^tnπtxxπmθ^mθ^tπtx).\displaystyle=E_{S}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x_{n})+E_{S,z^{*}}(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}).

So we have

ES,z(γ^γ^)=n1ES[Ω^Ω^]+O(n1K3).E_{S,z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})=n^{-1}E_{S}[\hat{\Omega}^{{}^{\prime}}\hat{\Omega}]+O(n^{-1}K^{3}).

On the other hand, it follows from Assumptions 1 and 2 that

max1inmax1mMhiim=xiπm(XmXm)1πmxiC42KnC1,a.s..\max_{1\leq i\leq n}\max_{1\leq m\leq M}h_{ii}^{m}=x_{i}^{{}^{\prime}}\pi_{m}(X_{m}^{{}^{\prime}}X_{m})^{-1}\pi_{m}^{{}^{\prime}}x_{i}\leq\frac{C_{4}^{2}K}{nC_{1}},a.s..

Hence, we have

|yPmPty[yDm(yPmy)][yDt(yPty)]|\displaystyle|y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{t}y-[y-D_{m}(y-P_{m}y)]^{{}^{\prime}}[y-D_{t}(y-P_{t}y)]|
=|yPmPty[(InDm)y+DmPmy][(InDt)y+DtPty]|\displaystyle=|y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{t}y-[(I_{n}-D_{m})y+D_{m}P_{m}y]^{{}^{\prime}}[(I_{n}-D_{t})y+D_{t}P_{t}y]|
|yPm(DmDtIn)Pty|+y(InDm)(InDt)y+|y(InDm)DtPty|+|yPmDm(InDt)y|\displaystyle\leq|y^{{}^{\prime}}P_{m}^{{}^{\prime}}(D_{m}D_{t}-I_{n})P_{t}y|+y^{{}^{\prime}}(I_{n}-D_{m})(I_{n}-D_{t})y+|y^{{}^{\prime}}(I_{n}-D_{m})D_{t}P_{t}y|+|y^{{}^{\prime}}P_{m}^{{}^{\prime}}D_{m}(I_{n}-D_{t})y|
max1mMmax1in[(1hiim)21]yPmPmy+max1mMmax1in[1(1hiim)1]2yy\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq i\leq n}[(1-h_{ii}^{m})^{-2}-1]y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{m}y+\max_{1\leq m\leq M}\max_{1\leq i\leq n}[1-(1-h_{ii}^{m})^{-1}]^{2}y^{{}^{\prime}}y
+2max1mMmax1in(1hiim)2[1(1hiim)1]2yyyPmPmy\displaystyle\ \ \ +2\max_{1\leq m\leq M}\max_{1\leq i\leq n}\sqrt{(1-h_{ii}^{m})^{-2}[1-(1-h_{ii}^{m})^{-1}]^{2}y^{{}^{\prime}}yy^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{m}y}
max1mMmax1in{[(1hiim)21]+[(1hiim)11]2+2(1hiim)1[(1hiim)11]}yy\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq i\leq n}\{[(1-h_{ii}^{m})^{-2}-1]+[(1-h_{ii}^{m})^{-1}-1]^{2}+2(1-h_{ii}^{m})^{-1}[(1-h_{ii}^{m})^{-1}-1]\}y^{{}^{\prime}}y
C3n{[(1C42KnC1)21]+[(1C42KnC1)11]2+2(1C42KnC1)1[(1C42KnC1)11]}\displaystyle\leq C_{3}n\{[(1-\frac{C_{4}^{2}K}{nC_{1}})^{-2}-1]+[(1-\frac{C_{4}^{2}K}{nC_{1}})^{-1}-1]^{2}+2(1-\frac{C_{4}^{2}K}{nC_{1}})^{-1}[(1-\frac{C_{4}^{2}K}{nC_{1}})^{-1}-1]\}
=4C1C3C42n2K(C1nC42K)2,a.s..\displaystyle=\frac{4C_{1}C_{3}C_{4}^{2}n^{2}K}{(C_{1}n-C_{4}^{2}K)^{2}},a.s..

Thus, from Assumption 2, we see that

n1ES(Ω¯Ω¯)=n1ES(Ω^Ω^)+O[n(C1nC42K)2K].n^{-1}E_{S}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega})=n^{-1}E_{S}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega})+O[n(C_{1}n-C_{4}^{2}K)^{-2}K].
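The comparison between \bar{\Omega} and \hat{\Omega} above uses the representation of the leave-one-out fitted values of model m as y-D_{m}(y-P_{m}y), with D_{m} the diagonal matrix with entries (1-h_{ii}^{m})^{-1} (consistent with the bounds in the proof). The following minimal numerical check of this representation uses a simulated design; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix P_m of one candidate model
h = np.diag(P)                              # leverages h_ii^m
D = np.diag(1.0 / (1.0 - h))                # D_m = diag{(1 - h_ii^m)^{-1}}

loo_matrix = y - D @ (y - P @ y)            # candidate column of Omega-bar: y - D_m (y - P_m y)

loo_direct = np.empty(n)                    # direct leave-one-out fitted values x_i' hat{theta}_m^{-i}
for i in range(n):
    keep = np.arange(n) != i
    theta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    loo_direct[i] = X[i] @ theta_i

print(np.max(np.abs(loo_matrix - loo_direct)))  # of the order of machine precision
```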

Lemma 4.

Under Assumptions 1 - 3, there is a constant B2>0B_{2}>0 such that

w^22B2M(1+n2K2),\|\hat{w}\|_{2}^{2}\leq B_{2}M(1+n^{-2}K^{2}),
w¯22B2M,\|\bar{w}\|_{2}^{2}\leq B_{2}M,
w~22B2M(1+n2K2),\|\tilde{w}\|_{2}^{2}\leq B_{2}M(1+n^{-2}K^{2}),

and

w^22B2M2,\|\hat{w}^{*}\|_{2}^{2}\leq B_{2}M^{2},

a.s..

Proof.

It follows from Assumptions 2 and 3 that

w^22\displaystyle\|\hat{w}\|_{2}^{2}
=(Ω^Ω^+λnIn)1Ω^yσ2(Ω^Ω^+λnIn)1κ22\displaystyle=\|(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\hat{\Omega}^{{}^{\prime}}y-\sigma^{2}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\kappa\|_{2}^{2}
2yΩ^(Ω^Ω^+λnIn)1(Ω^Ω^+λnIn)1Ω^y+2σ4κ(Ω^Ω^+λnIn)1(Ω^Ω^+λnIn)1κ\displaystyle\leq 2y^{{}^{\prime}}\hat{\Omega}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\hat{\Omega}^{{}^{\prime}}y+2\sigma^{4}\kappa^{{}^{\prime}}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\kappa
2C52C6n1Myy+2σ4C52n2κκ\displaystyle\leq 2C_{5}^{-2}C_{6}n^{-1}My^{{}^{\prime}}y+2\sigma^{4}C_{5}^{-2}n^{-2}\kappa^{{}^{\prime}}\kappa
2C52C3C6M+2σ4C52n2K2M,a.s.,\displaystyle\leq 2C_{5}^{-2}C_{3}C_{6}M+2\sigma^{4}C_{5}^{-2}n^{-2}K^{2}M,a.s.,

and

w¯22\displaystyle\|\bar{w}\|_{2}^{2}
=(Ω¯Ω¯+λnIn)1Ω¯y22\displaystyle=\|(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})^{-1}\bar{\Omega}^{{}^{\prime}}y\|_{2}^{2}
yΩ¯(Ω¯Ω¯+λnIn)1(Ω¯Ω¯+λnIn)1Ω¯y\displaystyle\leq y^{{}^{\prime}}\bar{\Omega}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})^{-1}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})^{-1}\bar{\Omega}^{{}^{\prime}}y
C52C6n1Myy\displaystyle\leq C_{5}^{-2}C_{6}n^{-1}My^{{}^{\prime}}y
C52C3C6M,a.s..\displaystyle\leq C_{5}^{-2}C_{3}C_{6}M,a.s..

In a similar way, we obtain

w~222C52C3C6M+2σ4C52n2K2M,a.s..\|\tilde{w}\|_{2}^{2}\leq 2C_{5}^{-2}C_{3}C_{6}M+2\sigma^{4}C_{5}^{-2}n^{-2}K^{2}M,a.s..

On the other hand, it follows from Assumptions 23\ref{ass:2}-\ref{ass:3} and Lemma 1 that

w^22\displaystyle\|\hat{w}^{*}\|_{2}^{2}
=[Ez(γ^γ^)]1Ez(γ^y)22\displaystyle=\|[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]^{-1}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})\|_{2}^{2}
Ez(γ^y)[Ez(γ^γ^)]1[Ez(γ^γ^)]1Ez(γ^y)\displaystyle\leq E_{z^{*}}(\hat{\gamma}y^{*})[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]^{-1}[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]^{-1}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})
Ez(γ^y)Ez(γ^y)λmin2[Ez(γ^γ^)]\displaystyle\leq E_{z^{*}}(\hat{\gamma}y^{*})E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})\lambda_{min}^{-2}[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]
C52Ez(γ^y)Ez(γ^y)\displaystyle\leq C_{5}^{-2}E_{z^{*}}(\hat{\gamma}y^{*})E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})
C52Ez(γ^γ^)Ez(y2)\displaystyle\leq C_{5}^{-2}E_{z^{*}}(\hat{\gamma}\hat{\gamma}^{{}^{\prime}})E_{z^{*}}(y^{*2})
C52tr[Ez(γ^γ^)]Ez(y2)\displaystyle\leq C_{5}^{-2}tr[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]E_{z^{*}}(y^{*2})
C52C3C6M2,a.s..\displaystyle\leq C_{5}^{-2}C_{3}C_{6}M^{2},a.s..

We take B2=max{2C52C3C6,2σ4C52}B_{2}=max\{2C_{5}^{-2}C_{3}C_{6},2\sigma^{4}C_{5}^{-2}\} to complete the proof.
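The vector \hat{w}^{*}=[E_{z^{*}}(\hat{\gamma}^{\prime}\hat{\gamma})]^{-1}E_{z^{*}}(\hat{\gamma}^{\prime}y^{*}) bounded above is the weight vector minimizing the prediction risk given the fitted candidate models. A minimal sketch of this object, approximating the expectation over z^{*} by a large independent test sample and using two hypothetical nested candidate models (all settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_test, k = 100, 200000, 4
beta = np.array([1.0, 0.5, 0.0, 0.0])
X = rng.normal(size=(n, k))
y = X @ beta + rng.normal(size=n)

# two nested candidate models: the first two covariates, and all four covariates
theta_1 = np.linalg.solve(X[:, :2].T @ X[:, :2], X[:, :2].T @ y)
theta_2 = np.linalg.solve(X.T @ X, X.T @ y)

# approximate E_{z*} by averaging over a large test sample (x*, y*) from the same design
Xt = rng.normal(size=(n_test, k))
yt = Xt @ beta + rng.normal(size=n_test)
gamma = np.column_stack([Xt[:, :2] @ theta_1, Xt @ theta_2])   # hat{gamma} evaluated at x*

A = gamma.T @ gamma / n_test     # Monte Carlo approximation of E_{z*}(gamma' gamma)
b = gamma.T @ yt / n_test        # Monte Carlo approximation of E_{z*}(gamma' y*)
w_star = np.linalg.solve(A, b)   # risk-minimizing weights hat{w}^* given the fitted models
print(w_star)
```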

Lemma 5.

Under Assumptions 1 - 4, we have

ES[C(w^n,S)C(w^,S)]=O[K3M2(1+n2K2)],E_{S}[C(\hat{w}^{-n},S)-C(\hat{w},S)]=O[K^{3}M^{2}(1+n^{-2}K^{2})],
ES[J(w¯n,S)J(w¯,S)]=O(K3M2),E_{S}[J(\bar{w}^{-n},S)-J(\bar{w},S)]=O(K^{3}M^{2}),

and

ES[F(w^,n,S)F(w^,S)]=O(n1K3M3).E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)]=O(n^{-1}K^{3}M^{3}).

Proof.

Note that

C(w^n,S)C(w^,S)\displaystyle C(\hat{w}^{-n},S)-C(\hat{w},S)
=C(w^n,S)C(w^n,Sn)+C(w^n,Sn)C(w^,S)\displaystyle=C(\hat{w}^{-n},S)-C(\hat{w}^{-n},S^{-n})+C(\hat{w}^{-n},S^{-n})-C(\hat{w},S)
C(w^n,S)C(w^n,Sn)+C(w^,Sn)C(w^,S),\displaystyle\leq C(\hat{w}^{-n},S)-C(\hat{w}^{-n},S^{-n})+C(\hat{w},S^{-n})-C(\hat{w},S),

and

ES[C(w,S)C(w,Sn)(λnλn1)ww]\displaystyle E_{S}[C(w,S)-C(w,S^{-n})-(\lambda_{n}-\lambda_{n-1})w^{{}^{\prime}}w]
=i=1n1ES{[yixiθ^(w)]2[yixiθ^n(w)]2}+ES{[ynxnθ^(w)]2}\displaystyle=\sum_{i=1}^{n-1}E_{S}\Big{\{}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}(w)]^{2}-[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}}+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(w)]^{2}\Big{\}}
=(n1)ES{[yn1xn1θ^(w)]2[yn1xn1θ^n(w)]2}+ES{[ynxnθ^(w)]2},\displaystyle=(n-1)E_{S}\Big{\{}[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(w)]^{2}-[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}}+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(w)]^{2}\Big{\}},

where

w^n=argminwRMC(w,Sn),\hat{w}^{-n}=argmin_{w\in R^{M}}C(w,S^{-n}),

and

C(w,Sn)=i=1n1[yixiθ^n(w)]2+2σ2wκ+λn1ww.C(w,S^{-n})=\sum_{i=1}^{n-1}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}+2\sigma^{2}w^{{}^{\prime}}\kappa+\lambda_{n-1}w^{{}^{\prime}}w.

It follows from Lemmas 1, 2 and 4 that

θ^(w^)22\displaystyle\|\hat{\theta}(\hat{w})\|_{2}^{2}
=m=1Mt=1Mw^mw^tθ^mπmπtθ^t\displaystyle=\sum_{m=1}^{M}\sum_{t=1}^{M}\hat{w}_{m}\hat{w}_{t}\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}
m=1Mt=1M|w^mw^t||θ^mπmπtθ^t|\displaystyle\leq\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}||\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|
max1mMmax1tM|θ^mπmπtθ^t|m=1Mt=1M|w^mw^t|\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq t\leq M}|\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}|
max1mMmax1tMθ^m2θ^t2(m=1M|w^m|)2\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq t\leq M}\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}\Big{(}\sum_{m=1}^{M}|\hat{w}_{m}|\Big{)}^{2}
=Mmax1mMθ^m22m=1Mw^m2\displaystyle=M\max_{1\leq m\leq M}\|\hat{\theta}_{m}\|_{2}^{2}\sum_{m=1}^{M}\hat{w}_{m}^{2}
B1B2KM2(1+n2K2),a.s.,\displaystyle\leq B_{1}B_{2}KM^{2}(1+n^{-2}K^{2}),a.s.,

and

ES[θ^(w^)θ^n(w^)22]\displaystyle E_{S}[\|\hat{\theta}(\hat{w})-\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}]
=ES[m=1Mt=1Mw^mw^t(θ^mθ^mn)πmπt(θ^tθ^tn)]\displaystyle=E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}\hat{w}_{m}\hat{w}_{t}(\hat{\theta}_{m}-\hat{\theta}_{m}^{-n})^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}(\hat{\theta}_{t}-\hat{\theta}_{t}^{-n})\Big{]}
=ES(m=1Mt=1M|w^mw^t||(θ^mθ^mn)πmπt(θ^tθ^tn)|)\displaystyle=E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}||(\hat{\theta}_{m}-\hat{\theta}_{m}^{-n})^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}(\hat{\theta}_{t}-\hat{\theta}_{t}^{-n})|\Big{)}
ES(m=1Mt=1M|w^mw^t|θ^mθ^mn2θ^tθ^tn2)\displaystyle\leq E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}|\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}\Big{)}
MES(max1mMmax1tMθ^mθ^mn2θ^tθ^tn2m=1Mw^m2)\displaystyle\leq ME_{S}\Big{(}\max_{1\leq m\leq M}\max_{1\leq t\leq M}\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}\sum_{m=1}^{M}\hat{w}_{m}^{2}\Big{)}
B2M2(1+n2K2)ES(max1mMmax1tMθ^mθ^mn2θ^tθ^tn2)\displaystyle\leq B_{2}M^{2}(1+n^{-2}K^{2})E_{S}\Big{(}\max_{1\leq m\leq M}\max_{1\leq t\leq M}\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}\Big{)}
=B2M2(1+n2K2)ES(max1mMθ^mθ^mn22)\displaystyle=B_{2}M^{2}(1+n^{-2}K^{2})E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}^{2}\Big{)}
C8B2n2K3M2(1+n2K2).\displaystyle\leq C_{8}B_{2}n^{-2}K^{3}M^{2}(1+n^{-2}K^{2}).

In a similar way, we obtain

θ^n(w^)22B1B2KM2(1+n2K2),a.s..\|\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}\leq B_{1}B_{2}KM^{2}(1+n^{-2}K^{2}),a.s..

Further, from Assumption 2, we have

ES{xn1[2yn1xn1θ^(w^)xn1θ^n(w^)]22}\displaystyle E_{S}\Big{\{}\|x_{n-1}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]\|_{2}^{2}\Big{\}}
=k=1KES{xn1k2[2yn1xn1θ^(w^)xn1θ^n(w^)]2}\displaystyle=\sum_{k=1}^{K}E_{S}\Big{\{}x_{{n-1}k}^{2}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}
C42KES{[2yn1xn1θ^(w^)xn1θ^n(w^)]2}\displaystyle\leq C_{4}^{2}KE_{S}\Big{\{}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}
C42KES{6yn12+3[xn1θ^(w^)]2+3[xn1θ^n(w^)]2}\displaystyle\leq C_{4}^{2}KE_{S}\Big{\{}6y_{n-1}^{2}+3[x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}+3[x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}
C42KES[6yn12+3xn122θ^(w^)22+3xn122θ^n(w^)22]\displaystyle\leq C_{4}^{2}KE_{S}[6y_{n-1}^{2}+3\|x_{n-1}\|_{2}^{2}\|\hat{\theta}(\hat{w})\|_{2}^{2}+3\|x_{n-1}\|_{2}^{2}\|\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}]
6C3C42K+6C44B1B2K3M2(1+n2K2),\displaystyle\leq 6C_{3}C_{4}^{2}K+6C_{4}^{4}B_{1}B_{2}K^{3}M^{2}(1+n^{-2}K^{2}),

and

ES{[ynxnθ^(w^)]2}2C3+2C42B1B2K2M2(1+n2K2).\displaystyle E_{S}\{[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}\}\leq 2C_{3}+2C_{4}^{2}B_{1}B_{2}K^{2}M^{2}(1+n^{-2}K^{2}).

These arguments indicate that

\displaystyle\Big{|}E_{S}\Big{\{}[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}-[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}\Big{|}
\displaystyle=\Big{|}E_{S}\Big{\{}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})][x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]\Big{\}}\Big{|}
\displaystyle\leq\sqrt{E_{S}\big{[}\|x_{n-1}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]\|_{2}^{2}\big{]}}\sqrt{E_{S}\big{[}\|\hat{\theta}(\hat{w})-\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}\big{]}}
\displaystyle=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})],

and

|ES[C(w^,Sn)C(w^,S)(λnλn1)w^w^]|=O[K3M2(1+n2K2)].|E_{S}[C(\hat{w},S^{-n})-C(\hat{w},S)-(\lambda_{n}-\lambda_{n-1})\hat{w}^{{}^{\prime}}\hat{w}]|=O[K^{3}M^{2}(1+n^{-2}K^{2})].

Similarly, we obtain

|ES[C(w^n,S)C(w^n,Sn)(λnλn1)w^nw^n]|=O[K3M2(1+n2K2)].|E_{S}[C(\hat{w}^{-n},S)-C(\hat{w}^{-n},S^{-n})-(\lambda_{n}-\lambda_{n-1})\hat{w}^{-n^{\prime}}\hat{w}^{-n}]|=O[K^{3}M^{2}(1+n^{-2}K^{2})].

Thus, it follows from Assumption 4 and Lemma 4 that

ES[C(w^n,S)C(w^,S)]=O[K3M2(1+n2K2)].E_{S}[C(\hat{w}^{-n},S)-C(\hat{w},S)]=O[K^{3}M^{2}(1+n^{-2}K^{2})].

On the other hand, we notice that

J(w¯n,S)J(w¯,S)\displaystyle J(\bar{w}^{-n},S)-J(\bar{w},S)
=J(w¯n,S)J(w¯n,Sn)+J(w¯n,Sn)J(w¯,S)\displaystyle=J(\bar{w}^{-n},S)-J(\bar{w}^{-n},S^{-n})+J(\bar{w}^{-n},S^{-n})-J(\bar{w},S)
J(w¯n,S)J(w¯n,Sn)+J(w¯,Sn)J(w¯,S),\displaystyle\leq J(\bar{w}^{-n},S)-J(\bar{w}^{-n},S^{-n})+J(\bar{w},S^{-n})-J(\bar{w},S),

and

ES[J(w,S)J(w,Sn)(λnλn1)ww]\displaystyle E_{S}[J(w,S)-J(w,S^{-n})-(\lambda_{n}-\lambda_{n-1})w^{{}^{\prime}}w]
=i=1n1ES{[yixiθ^i(w)]2[yixiθ^(i,n)(w)]2}+ES{[ynxnθ^n(w)]2}\displaystyle=\sum_{i=1}^{n-1}E_{S}\Big{\{}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-i}(w)]^{2}-[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-(i,n)}(w)]^{2}\Big{\}}+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}}
=(n1)ES{[yn1xn1θ^(n1)(w)]2[yn1xn1θ^(n1,n)(w)]2}\displaystyle=(n-1)E_{S}\Big{\{}[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-(n-1)}(w)]^{2}-[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-(n-1,n)}(w)]^{2}\Big{\}}
+ES{[ynxnθ^n(w)]2},\displaystyle+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}},

where

w¯n=argminwRMJ(w,Sn),\bar{w}^{-n}=argmin_{w\in R^{M}}J(w,S^{-n}),

and

J(w,Sn)=i=1n1[yixiθ^(i,n)(w)]2+λn1ww.J(w,S^{-n})=\sum_{i=1}^{n-1}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-(i,n)}(w)]^{2}+\lambda_{n-1}w^{{}^{\prime}}w.

So, we obtain

|ES[J(w¯,Sn)J(w¯,S)(λnλn1)w¯w¯]|=O(K3M2),|E_{S}[J(\bar{w},S^{-n})-J(\bar{w},S)-(\lambda_{n}-\lambda_{n-1})\bar{w}^{{}^{\prime}}\bar{w}]|=O(K^{3}M^{2}),

and

|ES[J(w¯n,S)J(w¯n,Sn)(λnλn1)w¯nw¯n]|=O(K3M2).|E_{S}[J(\bar{w}^{-n},S)-J(\bar{w}^{-n},S^{-n})-(\lambda_{n}-\lambda_{n-1})\bar{w}^{-n^{\prime}}\bar{w}^{-n}]|=O(K^{3}M^{2}).

Further, we have

ES[J(w¯n,S)J(w¯,S)]=O(K3M2).E_{S}[J(\bar{w}^{-n},S)-J(\bar{w},S)]=O(K^{3}M^{2}).

Similarly, from

ES[F(w^,n,S)F(w^,S)]\displaystyle E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)]
=ES[F(w^,n,S)F(w^,n,Sn)+F(w^,n,Sn)F(w^,S)]\displaystyle=E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*,-n},S^{-n})+F(\hat{w}^{*,-n},S^{-n})-F(\hat{w}^{*},S)]
ES[F(w^,n,S)F(w^,n,Sn)+F(w^,Sn)F(w^,S)],\displaystyle\leq E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*,-n},S^{-n})+F(\hat{w}^{*},S^{-n})-F(\hat{w}^{*},S)],

and

ES[F(w,Sn)F(w,S)]=ES,z{[yxθ^n(w)]2[yxθ^(w)]2},\displaystyle E_{S}[F(w,S^{-n})-F(w,S)]=E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(w)]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(w)]^{2}\Big{\}},

where

w^,n=argminwRMF(w,Sn),\hat{w}^{*,-n}=argmin_{w\in R^{M}}F(w,S^{-n}),

and

F(w,Sn)=Ez{[yxθ^n(w)]2},F(w,S^{-n})=E_{z^{*}}\{[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(w)]^{2}\},

we have

ES[F(w^,n,S)F(w^,S)]=O(n1K3M3).E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)]=O(n^{-1}K^{3}M^{3}).

Lemma 6.

Under Assumptions 1 - 4, we have

ES[w^w^n22]=O[n1K3M2(1+n2K2)],E_{S}[\|\hat{w}-\hat{w}^{-n}\|_{2}^{2}]=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})],
ES[w¯w¯n22]=O(n1K3M2),E_{S}[\|\bar{w}-\bar{w}^{-n}\|_{2}^{2}]=O(n^{-1}K^{3}M^{2}),

and

ES[w^w^,n22]=O(n1K3M3).E_{S}[\|\hat{w}^{*}-\hat{w}^{*,-n}\|_{2}^{2}]=O(n^{-1}K^{3}M^{3}).

Proof.

It follows from the definition of w^\hat{w} that C(w^,S)w=𝟎M\frac{\partial C(\hat{w},S)}{\partial w}={\bf{0}}_{M}. So, we have

C(w^n,S)C(w^,S)\displaystyle C(\hat{w}^{-n},S)-C(\hat{w},S)
=(w^nw^)C(w^,S)w+(w^nw^)(Ω^Ω^+λnIn)(w^nw^)\displaystyle=(\hat{w}^{-n}-\hat{w})^{{}^{\prime}}\frac{\partial C(\hat{w},S)}{\partial w}+(\hat{w}^{-n}-\hat{w})^{{}^{\prime}}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})(\hat{w}^{-n}-\hat{w})
=(w^nw^)(Ω^Ω^+λnIn)(w^nw^)\displaystyle=(\hat{w}^{-n}-\hat{w})^{{}^{\prime}}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})(\hat{w}^{-n}-\hat{w})
C5nw^nw^22,a.s..\displaystyle\geq C_{5}n\|\hat{w}^{-n}-\hat{w}\|_{2}^{2},a.s..

Further, it follows from Lemma 5 that

ES(w^nw^22)=O[n1K3M2(1+n2K2)].E_{S}(\|\hat{w}^{-n}-\hat{w}\|_{2}^{2})=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})].

On the other hand, it follows from the definition of w¯\bar{w} that J(w¯,S)w=𝟎M\frac{\partial J(\bar{w},S)}{\partial w}={\bf{0}}_{M}. So, we have

J(w¯n,S)J(w¯,S)\displaystyle J(\bar{w}^{-n},S)-J(\bar{w},S)
=(w¯nw¯)J(w¯,S)w+(w¯nw¯)(Ω¯Ω¯+λnIn)(w¯nw¯)\displaystyle=(\bar{w}^{-n}-\bar{w})^{{}^{\prime}}\frac{\partial J(\bar{w},S)}{\partial w}+(\bar{w}^{-n}-\bar{w})^{{}^{\prime}}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})(\bar{w}^{-n}-\bar{w})
=(w¯nw¯)(Ω¯Ω¯+λnIn)(w¯nw¯)\displaystyle=(\bar{w}^{-n}-\bar{w})^{{}^{\prime}}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})(\bar{w}^{-n}-\bar{w})
C5nw¯nw¯22,a.s..\displaystyle\geq C_{5}n\|\bar{w}^{-n}-\bar{w}\|_{2}^{2},a.s..

Further, from Lemma 5, we have

ES(w¯nw¯22)=O(n1K3M2).E_{S}(\|\bar{w}^{-n}-\bar{w}\|_{2}^{2})=O(n^{-1}K^{3}M^{2}).

Similarly, from the definition of w^\hat{w}^{*}, Lemma 5 and

F(w^,n,S)F(w^,S)\displaystyle F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)
=(w^,nw^)F(w^,S)w+(w^,nw^)Ez(γ^γ^)(w^,nw^)\displaystyle=(\hat{w}^{*,-n}-\hat{w}^{*})^{{}^{\prime}}\frac{\partial F(\hat{w}^{*},S)}{\partial w}+(\hat{w}^{*,-n}-\hat{w}^{*})^{{}^{\prime}}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})(\hat{w}^{*,-n}-\hat{w}^{*})
=(w^,nw^)Ez(γ^γ^)(w^,nw^)\displaystyle=(\hat{w}^{*,-n}-\hat{w}^{*})^{{}^{\prime}}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})(\hat{w}^{*,-n}-\hat{w}^{*})
C5w^,nw^22,a.s.,\displaystyle\geq C_{5}\|\hat{w}^{*,-n}-\hat{w}^{*}\|_{2}^{2},a.s.,

we obtain

ES(w^,nw^22)=O(n1K3M3).E_{S}(\|\hat{w}^{*,-n}-\hat{w}^{*}\|_{2}^{2})=O(n^{-1}K^{3}M^{3}).

Proof of Theorem 3.4: Let (c^1,,c^M)=P^w^0,(\hat{c}_{1},...,\hat{c}_{M})^{{}^{\prime}}=\hat{P}^{{}^{\prime}}\hat{w}^{0}, and (d^1,,d^M)=P^w^(\hat{d}_{1},...,\hat{d}_{M})^{{}^{\prime}}=\hat{P}^{{}^{\prime}}\hat{w}^{*}. Then, we have

M^1(λn)\displaystyle\hat{M}_{1}(\lambda_{n})
=Z^w^0Z^w^22+Z^w^w^22\displaystyle=\|\hat{Z}\hat{w}^{0}-\hat{Z}\hat{w}^{*}\|_{2}^{2}+\|\hat{Z}\hat{w}^{*}-\hat{w}^{*}\|_{2}^{2}
=m=1M(c^md^m)2ζ^m2(λn+ζ^m)2+m=1Md^m2(ζ^mλn+ζ^m1)2\displaystyle=\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\hat{d}_{m}^{2}(\frac{\hat{\zeta}_{m}}{\lambda_{n}+\hat{\zeta}_{m}}-1)^{2}
=m=1M(c^md^m)2ζ^m2(λn+ζ^m)2+m=1Md^m2λn2(λn+ζ^m)2,\displaystyle=\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\frac{\hat{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}},

and

ddλnM^1(λn)\displaystyle\frac{d}{d\lambda_{n}}\hat{M}_{1}(\lambda_{n})
=m=1M2(c^md^m)2ζ^m2(λn+ζ^m)3+m=1M2d^m2λn(λn+ζ^m)2d^m2λn2(λn+ζ^m)3\displaystyle=\sum_{m=1}^{M}\frac{-2(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\hat{d}_{m}^{2}\lambda_{n}(\lambda_{n}+\hat{\zeta}_{m})-2\hat{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}
=m=1M2(c^md^m)2ζ^m2(λn+ζ^m)3+m=1M2d^m2λnζ^m(λn+ζ^m)3.\displaystyle=\sum_{m=1}^{M}\frac{-2(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\hat{d}_{m}^{2}\lambda_{n}\hat{\zeta}_{m}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}.

From

m=1M(c^md^m)2ζ^m2(λn+ζ^m)3\displaystyle\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}
1λn+ζ^Mm=1M(c^md^m)2ζ^m2(λn+ζ^m)2\displaystyle\geq\frac{1}{\lambda_{n}+\hat{\zeta}_{M}}\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}}
=1λn+ζ^MZ^(w^0w^)22,\displaystyle=\frac{1}{\lambda_{n}+\hat{\zeta}_{M}}\|\hat{Z}(\hat{w}^{0}-\hat{w}^{*})\|_{2}^{2},

we see that, when \hat{w}^{0}\neq\hat{w}^{*}, \sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(0+\hat{\zeta}_{m})^{3}}>0, so that \frac{d}{d\lambda_{n}}\hat{M}_{1}(\lambda_{n})<0 at \lambda_{n}=0 and \hat{M}_{1}(\lambda_{n}) is strictly decreasing in a neighborhood of zero. Hence, we have \hat{\lambda}_{n}>0 and \hat{M}_{1}(\hat{\lambda}_{n})<\hat{M}_{1}(0) when \hat{w}^{0}\neq\hat{w}^{*}.

Let (c¯1,,c¯M)=P¯w¯0,(d¯1,,d¯M)=P¯w^(\bar{c}_{1},...,\bar{c}_{M})^{{}^{\prime}}=\bar{P}^{{}^{\prime}}\bar{w}^{0},(\bar{d}_{1},...,\bar{d}_{M})^{{}^{\prime}}=\bar{P}^{{}^{\prime}}\hat{w}^{*}. Then, we have

M¯1(λn)\displaystyle\bar{M}_{1}(\lambda_{n})
=Z¯w¯0Z¯w^22+Z¯w^w^22\displaystyle=\|\bar{Z}\bar{w}^{0}-\bar{Z}\hat{w}^{*}\|_{2}^{2}+\|\bar{Z}\hat{w}^{*}-\hat{w}^{*}\|_{2}^{2}
=m=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)2+m=1Md¯m2(ζ¯mλn+ζ¯m1)2\displaystyle=\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\bar{d}_{m}^{2}(\frac{\bar{\zeta}_{m}}{\lambda_{n}+\bar{\zeta}_{m}}-1)^{2}
=m=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)2+m=1Md¯m2λn2(λn+ζ¯m)2,\displaystyle=\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\frac{\bar{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}},

and

ddλnM¯1(λn)\displaystyle\frac{d}{d\lambda_{n}}\bar{M}_{1}(\lambda_{n})
=m=1M2(c¯md¯m)2ζ¯m2(λn+ζ¯m)3+m=1M2d¯m2λn(λn+ζ¯m)2d¯m2λn2(λn+ζ¯m)3\displaystyle=\sum_{m=1}^{M}\frac{-2(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\bar{d}_{m}^{2}\lambda_{n}(\lambda_{n}+\bar{\zeta}_{m})-2\bar{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}
=m=1M2(c¯md¯m)2ζ¯m2(λn+ζ¯m)3+m=1M2d¯m2λnζ¯m(λn+ζ¯m)3.\displaystyle=\sum_{m=1}^{M}\frac{-2(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\bar{d}_{m}^{2}\lambda_{n}\bar{\zeta}_{m}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}.

Similarly, from

m=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)3\displaystyle\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}
1λn+ζ¯Mm=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)2\displaystyle\geq\frac{1}{\lambda_{n}+\bar{\zeta}_{M}}\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}}
=1λn+ζ¯MZ¯(w¯0w^)22,\displaystyle=\frac{1}{\lambda_{n}+\bar{\zeta}_{M}}\|\bar{Z}(\bar{w}^{0}-\hat{w}^{*})\|_{2}^{2},

we see that, when \bar{w}^{0}\neq\hat{w}^{*}, \sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(0+\bar{\zeta}_{m})^{3}}>0, so that \frac{d}{d\lambda_{n}}\bar{M}_{1}(\lambda_{n})<0 at \lambda_{n}=0 and \bar{M}_{1}(\lambda_{n}) is strictly decreasing in a neighborhood of zero. Hence, we have \bar{\lambda}_{n}>0 and \bar{M}_{1}(\bar{\lambda}_{n})<\bar{M}_{1}(0) when \bar{w}^{0}\neq\hat{w}^{*}.
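The spectral forms of \hat{M}_{1}(\lambda_{n}) and \bar{M}_{1}(\lambda_{n}) derived above make this conclusion easy to visualize. The following toy sketch evaluates \sum_{m}[(c_{m}-d_{m})^{2}\zeta_{m}^{2}+d_{m}^{2}\lambda^{2}]/(\lambda+\zeta_{m})^{2} on a grid for hypothetical eigenvalues \zeta_{m} and coefficients c_{m},d_{m} (chosen purely for illustration) and confirms that the minimum is attained at a strictly positive \lambda whenever c\neq d.

```python
import numpy as np

# hypothetical spectral quantities, for illustration only
zeta = np.array([5.0, 2.0, 0.5])   # eigenvalues zeta_m
c = np.array([1.0, -0.5, 0.3])     # coordinates of the unpenalized weight vector in the eigenbasis
d = np.array([0.6, -0.2, 0.1])     # coordinates of the target weight vector in the eigenbasis

def M1(lam):
    # M1(lambda) = sum_m [(c_m - d_m)^2 zeta_m^2 + d_m^2 lambda^2] / (lambda + zeta_m)^2
    return np.sum(((c - d) ** 2 * zeta ** 2 + d ** 2 * lam ** 2) / (lam + zeta) ** 2)

grid = np.linspace(0.0, 10.0, 2001)
values = np.array([M1(lam) for lam in grid])
lam_star = grid[values.argmin()]
print(lam_star > 0.0, M1(lam_star) < M1(0.0))   # True True: a positive penalty strictly helps
```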

Proof of Theorem 3.5: It follows from Lemma 4 and Assumption 4 that

ES[F^(w^,S)F^(w~,S)]\displaystyle E_{S}[\hat{F}(\hat{w},S)-\hat{F}(\tilde{w},S)]
=ES[F^(w^,S)1nC(w^,S)+1nC(w^,S)F^(w~,S)]\displaystyle=E_{S}[\hat{F}(\hat{w},S)-\frac{1}{n}C(\hat{w},S)+\frac{1}{n}C(\hat{w},S)-\hat{F}(\tilde{w},S)]
ES[F^(w^,S)1nC(w^,S)+1nC(w~,S)F^(w~,S)]\displaystyle\leq E_{S}[\hat{F}(\hat{w},S)-\frac{1}{n}C(\hat{w},S)+\frac{1}{n}C(\tilde{w},S)-\hat{F}(\tilde{w},S)]
=ES(2σ2w~κn+λnw~w~n2σ2w^κnλnw^w^n)\displaystyle=E_{S}\Big{(}\frac{2\sigma^{2}\tilde{w}^{{}^{\prime}}\kappa}{n}+\frac{\lambda_{n}\tilde{w}^{{}^{\prime}}\tilde{w}}{n}-\frac{2\sigma^{2}\hat{w}^{{}^{\prime}}\kappa}{n}-\frac{\lambda_{n}\hat{w}^{{}^{\prime}}\hat{w}}{n}\Big{)}
4σ2B212KM(1+n2K2)12n+2B2λnM(1+n2K2)n\displaystyle\leq\frac{4\sigma^{2}B_{2}^{\frac{1}{2}}KM(1+n^{-2}K^{2})^{\frac{1}{2}}}{n}+\frac{2B_{2}\lambda_{n}M(1+n^{-2}K^{2})}{n}
=O[n1lognKM2(1+n2K2)].\displaystyle=O[n^{-1}\log nKM^{2}(1+n^{-2}K^{2})].

Further, from the proof of Lemma 5,

ES[F^(w¯,S)F^(w~,S)]\displaystyle E_{S}[\hat{F}(\bar{w},S)-\hat{F}(\tilde{w},S)]
=ES[F^(w¯,S)1nJ(w¯,S)+1nJ(w¯,S)F^(w~,S)]\displaystyle=E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J(\bar{w},S)+\frac{1}{n}J(\bar{w},S)-\hat{F}(\tilde{w},S)]
ES[F^(w¯,S)1nJ(w¯,S)+1nJ(w~,S)F^(w~,S)],\displaystyle\leq E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J(\bar{w},S)+\frac{1}{n}J(\tilde{w},S)-\hat{F}(\tilde{w},S)],

and

ES[F^(w¯,S)1nJ(w¯,S)+λnw¯w¯n]\displaystyle E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J(\bar{w},S)+\frac{\lambda_{n}\bar{w}^{{}^{\prime}}\bar{w}}{n}]
=1ni=1nES{[yixiθ^(w¯)]2[yixiθ^i(w¯)]2}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}E_{S}\Big{\{}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}-[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-i}(\bar{w})]^{2}\Big{\}}
=ES{[ynxnθ^(w¯)]2[ynxnθ^n(w¯)]2},\displaystyle=E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w})]^{2}\Big{\}},

we have

ES{[ynxnθ^(w¯)]2[ynxnθ^n(w¯)]2}=O(n1K3M2),E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w})]^{2}\Big{\}}=O(n^{-1}K^{3}M^{2}),

that is

ES[F^(w¯,S)1nJS(w¯)]=O(n1lognK3M2).E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J_{S}(\bar{w})]=O(n^{-1}\log nK^{3}M^{2}).

In a similar way, we obtain

ES[1nJS(w~)F^(w~,S)]=O(n1lognK3M2).\displaystyle E_{S}[\frac{1}{n}J_{S}(\tilde{w})-\hat{F}(\tilde{w},S)]=O(n^{-1}\log nK^{3}M^{2}).

So, we have

ES[F^(w¯,S)F^(w~,S)]=O(n1lognK3M2).E_{S}[\hat{F}(\bar{w},S)-\hat{F}(\tilde{w},S)]=O(n^{-1}\log nK^{3}M^{2}).

Proof of Theorem 3.6: It follows from Lemmas 1 and 6 that

ES𝒟n(θ^(w^n)θ^(w^)22)\displaystyle E_{S\sim\mathcal{D}^{n}}(\|\hat{\theta}(\hat{w}^{-n})-\hat{\theta}(\hat{w})\|_{2}^{2})
=ES(m=1M(w^mnw^m)πmθ^m22)\displaystyle=E_{S}\Big{(}\Big{\|}\sum_{m=1}^{M}(\hat{w}^{-n}_{m}-\hat{w}_{m})\pi_{m}\hat{\theta}_{m}\Big{\|}_{2}^{2}\Big{)}
ES[m=1Mt=1M|(w^mnw^m)(w^tnw^t)θ^mπmπtθ^t|]\displaystyle\leq E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}|(\hat{w}^{-n}_{m}-\hat{w}_{m})(\hat{w}^{-n}_{t}-\hat{w}_{t})\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\Big{]}
ES(m=1Mt=1M|w^mnw^m||w^tnw^t|θ^m2θ^t2)\displaystyle\leq E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}||\hat{w}^{-n}_{t}-\hat{w}_{t}|\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}\Big{)}
B1KES(m=1Mt=1M|w^mnw^m||w^tnw^t|)\displaystyle\leq B_{1}KE_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}||\hat{w}^{-n}_{t}-\hat{w}_{t}|\Big{)}
=B1KES[(m=1M|w^mnw^m|)2]\displaystyle=B_{1}KE_{S}\Big{[}\Big{(}\sum_{m=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}|\Big{)}^{2}\Big{]}
B1MKES(m=1M|w^mnw^m|2)\displaystyle\leq B_{1}MKE_{S}\Big{(}\sum_{m=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}|^{2}\Big{)}
=B1MKES[w^nw^22]\displaystyle=B_{1}MKE_{S}[\|\hat{w}^{-n}-\hat{w}\|_{2}^{2}]
=O[n1K4M3(1+n2K2)].\displaystyle=O[n^{-1}K^{4}M^{3}(1+n^{-2}K^{2})].

Further, it follows from the proof of Lemma 5 that

|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^n(w^n)xnθ^(w^n)]}|=O[n1K3M2(1+n2K2)],\displaystyle\Big{|}E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})]\Big{\}}\Big{|}=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})],

and

|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^(w^n)xnθ^(w^)]}|=O[n12K72M52(1+n2K2)].|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]\Big{\}}|=O[n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^{2})].

From

|ES{[ynxnθ^n(w^n)]2[ynxnθ^(w^)]2}|\displaystyle|E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}\Big{\}}|
=|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^n(w^n)xnθ^(w^)]}|\displaystyle=|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]\Big{\}}|
|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^n(w^n)xnθ^(w^n)]}|\displaystyle\leq|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})]\Big{\}}|
+|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^(w^n)xnθ^(w^)]}|,\displaystyle\ \ \ +|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]\Big{\}}|,

we have

ES{[ynxnθ^n(w^n)]2[ynxnθ^(w^)]2}=O[n12K72M52(1+n2K2)].E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}\Big{\}}=O[n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^{2})].

In a similar way, we obtain

ES,z{[yxθ^n(w^n)]2[yxθ^(w^)]2}=O[n12K72M52(1+n2K2)].E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(\hat{w})]^{2}\Big{\}}=O[n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^{2})].

On the other hand, it follows from Lemmas 1 and 6 that

ES𝒟n[θ^(w¯n)θ^(w¯)22]\displaystyle E_{S\sim\mathcal{D}^{n}}[\|\hat{\theta}(\bar{w}^{-n})-\hat{\theta}(\bar{w})\|_{2}^{2}]
=ES[m=1M(w¯mnw¯m)πmθ^m22]\displaystyle=E_{S}\Big{[}\Big{\|}\sum_{m=1}^{M}(\bar{w}^{-n}_{m}-\bar{w}_{m})\pi_{m}\hat{\theta}_{m}\Big{\|}_{2}^{2}\Big{]}
ES[m=1Mt=1M|(w¯mnw¯m)(w¯tnw¯t)θ^mπmπtθ^t|]\displaystyle\leq E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}|(\bar{w}^{-n}_{m}-\bar{w}_{m})(\bar{w}^{-n}_{t}-\bar{w}_{t})\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\Big{]}
ES(m=1Mt=1M|w¯mnw¯m||w¯tnw¯t|θ^m2θ^t2)\displaystyle\leq E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}||\bar{w}^{-n}_{t}-\bar{w}_{t}|\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}\Big{)}
B1KES(m=1Mt=1M|w¯mnw¯m||w¯tnw¯t|)\displaystyle\leq B_{1}KE_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}||\bar{w}^{-n}_{t}-\bar{w}_{t}|\Big{)}
=B1KES[(m=1M|w¯mnw¯m|)2]\displaystyle=B_{1}KE_{S}\Big{[}\Big{(}\sum_{m=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}|\Big{)}^{2}\Big{]}
B1MKES(m=1M|w¯mnw¯m|2)\displaystyle\leq B_{1}MKE_{S}\Big{(}\sum_{m=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}|^{2}\Big{)}
=B1MKES(w¯nw¯22)\displaystyle=B_{1}MKE_{S}(\|\bar{w}^{-n}-\bar{w}\|_{2}^{2})
=O(n1K4M3).\displaystyle=O(n^{-1}K^{4}M^{3}).

So, from the proof of Lemma 5, we have

|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^n(w¯n)xnθ^(w¯n)]}|=O(n1K3M2),\displaystyle|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})]\Big{\}}|=O(n^{-1}K^{3}M^{2}),

and

|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^(w¯n)xnθ^(w¯)]}|=O(n12K72M52).|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]\Big{\}}|=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}).

From

|ES{[ynxnθ^n(w¯n)]2[ynxnθ^(w¯)]2}|\displaystyle|E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}\Big{\}}|
=|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^n(w¯n)xnθ^(w¯)]}|\displaystyle=|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]\Big{\}}|
|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^n(w¯n)xnθ^(w¯n)]}|\displaystyle\leq|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})]\Big{\}}|
+|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^(w¯n)xnθ^(w¯)]}|,\displaystyle+|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]\Big{\}}|,

we know that

ES{[ynxnθ^n(w¯n)]2[ynxnθ^(w¯)]2}=O(n12K72M52).E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}).

In a similar way, we obtain

ES,z{[yxθ^n(w¯n)]2[yxθ^(w¯)]2}=O(n12K72M52).E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(\bar{w})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}).

Similarly, it follows from Lemmas 1 and 6 that

ES𝒟n[θ^(w^,n)θ^(w^)22]\displaystyle E_{S\sim\mathcal{D}^{n}}[\|\hat{\theta}(\hat{w}^{*,-n})-\hat{\theta}(\hat{w}^{*})\|_{2}^{2}]
=ES[m=1M(w^m,nw^m)πmθ^m22]\displaystyle=E_{S}\Big{[}\Big{\|}\sum_{m=1}^{M}(\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m})\pi_{m}\hat{\theta}_{m}\Big{\|}_{2}^{2}\Big{]}
ES[m=1Mt=1M|(w^m,nw^m)(w^t,nw^t)θ^mπmπtθ^t|]\displaystyle\leq E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}|(\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m})(\hat{w}^{*,-n}_{t}-\hat{w}^{*}_{t})\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\Big{]}
ES[m=1Mt=1M|w^m,nw^m||w^t,nw^t|θ^m2θ^t2]\displaystyle\leq E_{S}[\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}||\hat{w}^{*,-n}_{t}-\hat{w}^{*}_{t}|\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}]
B1KES(m=1Mt=1M|w^m,nw^m||w^t,nw^t|)\displaystyle\leq B_{1}KE_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}||\hat{w}^{*,-n}_{t}-\hat{w}^{*}_{t}|\Big{)}
=B1KES[(m=1M|w^m,nw^m|)2]\displaystyle=B_{1}KE_{S}\Big{[}\Big{(}\sum_{m=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}|\Big{)}^{2}\Big{]}
B1MKES(m=1M|w^m,nw^m|2)\displaystyle\leq B_{1}MKE_{S}\Big{(}\sum_{m=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}|^{2}\Big{)}
=B1MKES(w^,nw^22)\displaystyle=B_{1}MKE_{S}(\|\hat{w}^{*,-n}-\hat{w}^{*}\|_{2}^{2})
=O(n1K4M4).\displaystyle=O(n^{-1}K^{4}M^{4}).

Thus, from the proof of Lemma 5, we have

|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^n(w^,n)xnθ^(w^,n)]}|=O(n1K3M3),\displaystyle|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})]\Big{\}}|=O(n^{-1}K^{3}M^{3}),

and

|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^(w^,n)xnθ^(w^)]}|=O(n12K72M72).|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]\Big{\}}|=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}).

From

|ES{[ynxnθ^n(w^,n)]2[ynxnθ^(w^)]2}|\displaystyle|E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]^{2}\Big{\}}|
=|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^n(w^,n)xnθ^(w^)]}|\displaystyle=|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]\Big{\}}|
|ES{(2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^n(w^,n)xnθ^(w^,n)]}|\displaystyle\leq|E_{S}\Big{\{}(2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})]\Big{\}}|
+|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^(w^,n)xnθ^(w^)]}|,\displaystyle+|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]\Big{\}}|,

we know that

ES{[ynxnθ^n(w^,n)]2[ynxnθ^(w^)]2}=O(n12K72M72).E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}).

In a similar way, we obtain

ES,z{[yxθ^n(w^,n)]2[yxθ^(w^)]2}=O(n12K72M72).E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(\hat{w}^{*})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}).

References

  • Akaike, H., 1998. Information theory and an extension of the maximum likelihood principle, in: Selected Papers of Hirotugu Akaike, pp. 199–213.
  • Bousquet, O., Elisseeff, A., 2002. Stability and generalization. The Journal of Machine Learning Research 2, 499–526.
  • Buckland, S.T., Burnham, K.P., Augustin, N.H., 1997. Model selection: An integral part of inference. Biometrics 53, 603–618.
  • Dufour, J.M., 1982. Recursive stability analysis of linear regression relationships: An exploratory methodology. Journal of Econometrics 19, 31–76.
  • Fragoso, T.M., Bertoli, W., Louzada, F., 2018. Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review 86, 1–28.
  • Gao, Y., Zhang, X., Wang, S., Zou, G., 2016. Model averaging based on leave-subject-out cross-validation. Journal of Econometrics 192, 139–151.
  • Hansen, B.E., 2007. Least squares model averaging. Econometrica 75, 1175–1189.
  • Hansen, B.E., 2008. Least-squares forecast averaging. Journal of Econometrics 146, 342–350.
  • Hansen, B.E., Racine, J.S., 2012. Jackknife model averaging. Journal of Econometrics 167, 38–46.
  • Hjort, N.L., Claeskens, G., 2003. Frequentist model average estimators. Journal of the American Statistical Association 98, 879–899.
  • Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
  • Kutin, S., Niyogi, P., 2002. Almost-everywhere algorithmic stability and generalization error, in: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 275–282.
  • Li, X., Zou, G., Zhang, X., Zhao, S., 2021. Least squares model averaging based on generalized cross validation. Acta Mathematicae Applicatae Sinica, English Series 37, 495–509.
  • Liang, H., Zou, G., Wan, A.T., Zhang, X., 2011. Optimal weight choice for frequentist model average estimators. Journal of the American Statistical Association 106, 1053–1066.
  • Liao, J., Zou, G., 2020. Corrected Mallows criterion for model averaging. Computational Statistics and Data Analysis 144, 106902.
  • Liu, Q., Okui, R., 2013. Heteroskedasticity-robust C_{p} model averaging. The Econometrics Journal 16, 463–472.
  • Mallows, C.L., 1973. Some comments on C_{p}. Technometrics 15, 661–675.
  • Moral-Benito, E., 2015. Model averaging in economics: An overview. Journal of Economic Surveys 29, 46–75.
  • Mukherjee, S., Niyogi, P., Poggio, T., Rifkin, R., 2006. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25, 161–193.
  • Rakhlin, A., Mukherjee, S., Poggio, T., 2005. Stability results in learning theory. Analysis and Applications 3, 397–417.
  • Schomaker, M., 2012. Shrinkage averaging estimation. Statistical Papers 53, 1015–1034.
  • Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
  • Shalev-Shwartz, S., Shamir, O., Srebro, N., Sridharan, K., 2010. Learnability, stability and uniform convergence. The Journal of Machine Learning Research 11, 2635–2670.
  • Tong, H., Wu, Q., 2017. Learning performance of regularized moving least square regression. Journal of Computational and Applied Mathematics 325, 42–55.
  • Vapnik, V., 1998. Statistical Learning Theory. Wiley.
  • Wan, A.T., Zhang, X., 2009. On the use of model averaging in tourism research. Annals of Tourism Research 36, 525–532.
  • Wan, A.T., Zhang, X., Zou, G., 2010. Least squares model averaging by Mallows criterion. Journal of Econometrics 156, 277–283.
  • Wang, Y.X., Lei, J., Fienberg, S.E., 2016. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. The Journal of Machine Learning Research 17, 6353–6392.
  • Wooldridge, J.M., 2003. Introductory Econometrics. Thompson South-Western.
  • Xu, G., Wang, S., Huang, J.Z., 2014. Focused information criterion and model averaging based on weighted composite quantile regression. Scandinavian Journal of Statistics 41, 365–381.
  • Yang, Y., 2001. Adaptive regression by mixing. Journal of the American Statistical Association 96, 574–588.
  • Yuan, Z., Yang, Y., 2005. Combining linear regression models: When and how? Journal of the American Statistical Association 100, 1202–1214.
  • Zhang, H., Zou, G., 2020. Cross-validation model averaging for generalized functional linear model. Econometrics 8, 7.
  • Zhang, X., Liang, H., 2011. Focused information criterion and model averaging for generalized additive partial linear models. The Annals of Statistics 39, 174–200.
  • Zhang, X., Wan, A.T., Zhou, S.Z., 2012. Focused information criteria, model selection, and model averaging in a Tobit model with a nonzero threshold. Journal of Business and Economic Statistics 30, 132–142.
  • Zhang, X., Wan, A.T., Zou, G., 2013. Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics 174, 82–94.
  • Zhang, X., Zou, G., 2011. Model averaging method and its application in forecast. Statistical Research 28, 6.
  • Zhao, S., Liao, J., Yu, D., 2020. Model averaging estimator in ridge regression and its large sample properties. Statistical Papers 61, 1719–1739.
  • Zou, H., Zhang, H.H., 2009. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 1733–1751.